Analyzing temporal dynamics of cell deformation and intracellular movement with video feature aggregation
BioMedical Engineering OnLine, volume 18, Article number: 20 (2019)
Abstract
Background
The research and analysis of cellular physiological properties have been an essential approach to studying biological and biomedical problems. The temporal dynamics of cells therein are used as a quantifiable indicator of cellular response to extracellular cues and physiological stimuli.
Methods
This work presents a novel image-based framework to profile and model cell dynamics in live-cell videos. In the framework, the cell dynamics between frames are represented as frame-level features from cell deformation and intracellular movement. On the one hand, shape context is introduced to enhance the robustness of measuring the deformation of cellular contours. On the other hand, we employ Scale-Invariant Feature Transform (SIFT) flow to simultaneously construct the complementary movement field and appearance change field for cytoplasmic streaming. Then, time series modeling is performed on these frame-level features. Specifically, temporal feature aggregation is applied to capture the video-wide temporal evolution of cell dynamics.
Results
Our results demonstrate that the proposed cell dynamic features can effectively capture the cell dynamics in videos. They also prove that the Movement Field and Appearance Change Field Feature (MFAFF) can model cytoplasmic streaming more precisely. Besides, temporal aggregation of the cell dynamic features brings a substantial absolute increase in classification performance.
Conclusion
Experimental results demonstrate that the proposed framework outperforms competing mainstream approaches on the aforementioned datasets. Thus, our method has potential for cell dynamics analysis in videos.
Background
Image-based cell profiling provides quantitative information about cell state and paves the way to studying biological and biomedical problems [1,2,3,4]. As one of the most significant aspects therein, characterizing the temporal dynamics of cells is used to model the cell cycle, analyze migratory phenotypes, and unravel cellular responses to physiological stimuli [5,6,7,8,9]. Because of its ability to capture spatiotemporal data, live-cell imaging technology facilitates the analysis of cell dynamics based on image processing and machine learning [10, 11].
To obtain features for the temporal dynamics of cells, cell profiling methods need to precisely characterize the visual appearance of cells and its change across consecutive frames. These methods fall into two categories according to the cell dynamics they adopt: the deformation of the cell contour and the active or directed intracellular movement. Some shape parameters, such as the area or volume, centroid, and circularity, are computed as global features of the cell contour, and the variance of these features is regarded as an index of cell dynamics [5, 12]. However, shape parameters cannot precisely characterize cell morphologies and their dynamics. The radial distance of cellular contours is employed to preserve more subtle structures of cellular morphology. Similarly, the tree graph (TG), a variant of the radial distance, is designed for arbitrary cell contours, especially those with cell protrusions [13]. The length variation of protrusions (or the variation in their number) between frames is then calculated as the feature of cell contour dynamics.
These methods based on cell contour dynamics rely on straightforward shape matching strategies. Since an accurate correspondence between two cell contours benefits the subsequent deformation measurement, a learning-based shape matching strategy may be a better choice. Shape context measures shape deformation by optimizing a shape matching problem and has been applied to assessing the deformation of anatomical tissue and of falling human silhouettes [14,15,16]. It is therefore well suited to be introduced into our framework to quantify the deformation of cellular contours.
Another category of cell dynamics, namely the intracellular movement, is also relevant. Image cross-correlation is employed to obtain the time-dependent speckle pattern derived from optical coherence microscopy images as a representation of cell dynamics [17]. Furthermore, cytoplasmic streaming is modeled by constructing the movement field (or displacement field) between a pair of frames based on optical flow, and the average horizontal and vertical velocities are then concatenated into the feature vector in temporal order [18]. In fact, besides intracellular movement, intracellular particles also split, merge, and disappear during cytoplasmic streaming [19]. This phenomenon corresponds to changes of intensity or texture around the moving particles, i.e., changes of local image properties. Hence, it is reasonable to construct an additional appearance change field for cytoplasmic streaming as a complement to the original movement field.
Nevertheless, optical flow relies on the brightness constancy assumption, so it cannot construct a meaningful appearance change field and is sensitive to variations in lighting, perspective, and noise. Scale-Invariant Feature Transform (SIFT) flow can obtain a robust semantic-level correspondence between two images. In this paper, we employ SIFT flow to establish the movement field and the appearance change field for cytoplasmic streaming. Herein the appearance change field is constructed by computing the discrepancy of the corresponding SIFT descriptors.
Although the aforementioned methods successfully capture cell dynamics from short-term video segments, the subsequent video-range aggregation of these features is not considered in depth. They only adopt a concatenation or accumulation strategy for the features along the temporal dimension [18, 20,21,22]. To preserve more temporal structure, hidden Markov models (HMM) are introduced to represent cell shape dynamics in time series as predefined morphological states. This process condenses the temporal dynamics into a simpler representation, which enhances discriminative power for profiling temporal dynamics [23]. HMM has therefore been applied to recognizing cellular phases during mitosis and to cellular-response-based drug classification [6, 24]. Similarly, in the previous work of this study, a temporal bag of words (TBoW) model is utilized to fuse cell dynamic features between frames [25]. The TBoW model learns a codebook containing the typical modes of short-term cell dynamics and encodes these short-term dynamics as visual words in the codebook. Finally, the word frequency of the codebook is defined as the video-range cell dynamics.
HMM and TBoW only transform the primary features into predefined states (or visual words) or sample statistics, i.e., the number of states. Compact encoding, by contrast, exploits more statistics, such as mean, variance, and even skewness and kurtosis, which gives it a great advantage over HMM and TBoW. In this paper, we introduce compact encoding to model the temporal dynamics in live-cell videos. We further compare the Fisher vector (FV) [26, 27], the vector of locally aggregated descriptors (VLAD) [28, 29], and the higher-order VLAD (H-VLAD) [30] to find the best one for our application.
This paper proposes a novel framework to evaluate the temporal dynamics of cells, as shown in Fig. 1, and its contributions are threefold. First, shape context is introduced to measure the deformation of cellular contours. Second, SIFT flow is utilized to model the complementary movement field and appearance change field for cytoplasmic streaming simultaneously. Finally, we introduce and compare three mainstream compact encoding approaches for the temporal aggregation of dynamic features, and identify the most suitable encoding strategy for the whole framework.
Materials and methods
In this section, we first describe two live-cell video datasets for evaluating the utility of the proposed framework. These two datasets contain 80 video clips in two classes and 120 video clips in four classes, respectively. Then the cell dynamic features between frames are presented, which include the contour feature using shape context and the cytoplasm feature based on SIFT flow. Finally, we present the temporal aggregation strategy for these dynamic features to generate the video-wide representation.
Data
To validate the proposed approach, two datasets of video clips of lymphocytes were established in collaboration with Beijing You’an Hospital. The lymphocytes came from blood samples collected from the tails of mice (6–8 weeks, 20–22 g) after skin transplantation, and the video clips (20–30 s, 25 frames per second, 288 × 352 video resolution, AVI format) were recorded with phase contrast microscopy (Olympus BX51) at a magnification of 1000. After the video clips were obtained, they were further enlarged 16 times by upsampling. Each time only one target lymphocyte was observed and manually positioned in the center of the field. A quality control step was then conducted beforehand to retain only the video clips containing a single lymphocyte; it also guarantees that there is no overlap or trajectory crossing between the lymphocyte and red blood cells. There are two types of skin transplantation, i.e., the self-skin transplantation group (SST group) and the allogeneic-skin transplantation group (AST group). In the SST group, a healthy Balb/C male mouse was used as both the host and the donor, while in the AST group a pair of healthy Balb/C and C57BL/6 male mice were used as the host and the donor, respectively. Several video clips of the datasets are available at http://isip.bit.edu.cn/kyxz/xzlw/77051.htm.
Dataset I comprises 80 video clips in total (40 video clips per class) obtained from a contrast experiment, in which both the SST group and the AST group have 20 hosts and 20 donors. On the fourteenth day after the surgery, the lymphocytes in Dataset I were obtained from blood samples collected from the tail vein. The lymphocytes in the AST group showed irregular dynamic behavior, such as cell elongation from different angles and obvious movement of intracellular cargoes, compared with the SST group. Consequently, the videos from the SST group and the AST group were categorized as normal and abnormal, respectively.
Dataset II is composed of 120 video clips equally divided into four classes, derived from an AST experiment with 25 pairs of hosts and donors. On the seventh day after the skin transplantation, the lymphocytes of Dataset II were obtained from blood samples collected from the tail vein. The videos were divided into four classes (normal, slight activation, moderate activation, and drastic activation) according to the cellular deformation, by three experts with a voting protocol. For these two datasets, we use 30 random splits of the data, with 20 random video clips per class for training and the rest for testing.
Preprocessing
For effective feature extraction, some preprocessing procedures need to be adopted, including cell segmentation and cell tracking in each frame, as well as cell alignment across the sequence of frames. In Fig. 2, each row corresponds to a video clip in the datasets in the “Data” section, and the target cells are the lymphocytes in the red/blue dashed boxes. We employ an active contour model designed for live cells in phase-contrast images to automatically segment and track cells [31]. To further eliminate the impact of compulsory movement, a video stabilization algorithm is introduced to perform a non-rigid alignment of the cell sequences across frames [32]. Besides, manual validation is exploited to resolve ambiguities in cell segmentation and tracking by eye if necessary. In detail, we can specify the initial contours of the lymphocytes to ensure the accuracy of cell segmentation.
Dynamic features between frames
This subsection mainly describes the features of cell dynamics between frames in the image sequence. Specifically, the dynamic features can be extracted from the deformation of cellular contours and intracellular movements. The former is captured by the shape context while the latter is modeled with SIFT flow. Then the corresponding contour feature and cytoplasm feature are combined to form a robust feature vector of cell dynamics.
Contour feature using shape context
In the field of object recognition and shape matching, shape context was first proposed by [14], and has since been widely used in digit recognition, trademark search, and image registration. Shape context has been introduced into a deformation assessment framework for anatomical tissue to preserve and discriminate tiny deformations [15]. It has also been used to match the silhouettes of a falling human body, taking the mean matching cost as a crucial index for quantifying the deformation [16]. Therefore, this paper adopts shape context to generate the cellular contour deformation feature.
Shape context
In shape context, a shape is sampled into a discrete set of points from its contour, each of which accumulates a log-polar histogram \(h_{i}\):

\[ h_{i}(k)=\#\left\{ q\ne p_{i}:\,(q-p_{i})\in bin(k)\right\} \]

where \(p_{i}\) is a point in the given n-point shape, and its shape context \(h_{i}\) records the relative coordinates of the remaining \(n-1\) points, as shown in Fig. 3. bin(k) stands for the kth bin in histogram \(h_{i}\). Suppose \(p_{i}\) and \(q_{j}\) are from two shapes P and Q, respectively; the matching cost \(C_{ij}\) for each pair of points \((p_{i},\, q_{j})\) is then computed with the \(\chi ^{2}\) statistic:

\[ C_{ij}=\frac{1}{2}\sum _{k=1}^{K}\frac{\left[ h_{i}(k)-h_{j}(k)\right] ^{2}}{h_{i}(k)+h_{j}(k)} \]

where \(h_{i}(k)\) and \(h_{j}(k)\) denote the K-bin histograms for \(p_{i}\) and \(q_{j}\), respectively.
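As an illustration, the log-polar histograms and the \(\chi^{2}\) matching cost above can be sketched in NumPy. The bin counts (5 radial × 12 angular) and the normalization of radii by the mean pairwise distance are illustrative assumptions, not the exact parameters of [14]:

```python
import numpy as np

def shape_context(points, n_r=5, n_theta=12):
    """Log-polar shape context histograms for a set of contour points.

    Returns an (n, n_r * n_theta) array: one histogram per point p_i,
    counting the relative positions of the remaining n-1 points."""
    n = len(points)
    diff = points[None, :, :] - points[:, None, :]   # pairwise offsets q - p_i
    r = np.linalg.norm(diff, axis=2)
    theta = np.arctan2(diff[..., 1], diff[..., 0])
    # log-spaced radial edges, scaled by the mean pairwise distance
    mean_r = r[r > 0].mean()
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1) * mean_r
    hists = np.zeros((n, n_r * n_theta))
    for i in range(n):
        mask = np.arange(n) != i                     # exclude p_i itself
        r_bin = np.digitize(r[i, mask], r_edges) - 1
        t_bin = ((theta[i, mask] + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta
        ok = (r_bin >= 0) & (r_bin < n_r)            # drop out-of-range radii
        np.add.at(hists[i], r_bin[ok] * n_theta + t_bin[ok], 1)
    return hists

def chi2_cost(h1, h2, eps=1e-10):
    """Chi-squared matching cost C_ij between two sets of histograms."""
    h1 = h1[:, None, :]
    h2 = h2[None, :, :]
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps), axis=2)
```

Identical shapes yield a zero cost on the diagonal of \(C\), which is a quick sanity check for the implementation.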
Contour feature based on shape distance
The Hungarian algorithm [33] finds the best matching by minimizing the total cost \(H(\pi )=\sum C(p_{i},\, q_{\pi (i)})\) over permutations \(\pi (i)\). With the permutation, a series of transformations \(T=\left\{ T_{k}\right\} _{k=1\ldots u}\) for each point can be computed using the thin plate spline (TPS) model. Several iterations of shape context matching and TPS re-estimation are then performed, and the shape context distances \(D_{sc}^{1}, \ldots , D_{sc}^{L}\) over the L iterations are concatenated as the feature vector of cellular contour deformation \(F_{DCS}=\{D_{sc}^{1}, \ldots ,\, D_{sc}^{l}, \ldots , D_{sc}^{L}\}\):

\[ D_{sc}^{l}(P,Q)=\frac{1}{n}\sum _{p\in P}\min _{q\in Q}C\left( p,\, T(q)\right) +\frac{1}{m}\sum _{q\in Q}\min _{p\in P}C\left( p,\, T(q)\right) \]

where \(T(\cdot )\) denotes the TPS shape transformation, and n and m are the numbers of sampled points on P and Q.
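The optimal permutation \(\pi\) can be obtained with an off-the-shelf Hungarian solver. The sketch below uses SciPy's `linear_sum_assignment` on a cost matrix \(C_{ij}\); the TPS re-estimation loop is omitted:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_contours(cost):
    """Minimize the total matching cost H(pi) = sum_i C(p_i, q_pi(i))
    over permutations pi, using the Hungarian algorithm.

    Returns the permutation (column index for each row) and the total cost."""
    rows, cols = linear_sum_assignment(cost)
    return cols, cost[rows, cols].sum()

# Toy 3x3 cost matrix: the optimal assignment is 0->1, 1->0, 2->2.
cost = np.array([[4., 1., 3.],
                 [2., 0., 5.],
                 [3., 2., 2.]])
perm, total = match_contours(cost)
```

In the full pipeline this matching would be alternated with TPS warping for L iterations, recording \(D_{sc}^{l}\) at each step.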
Cytoplasm feature based on SIFT flow
In SIFT flow, the SIFT descriptor, as a type of middle-level representation, is incorporated into the computational framework of optical flow. It establishes a robust semantic-level correspondence by matching these image structures [34]. Based on the semantic-level correspondence, the movement field and the appearance change field are constructed by computing the displacement of the corresponding points and the discrepancy of the corresponding SIFT descriptors, respectively. Then histograms of oriented SIFT flow are employed to characterize multi-oriented dynamic information from both the movement field and the appearance change field.
SIFT flow
Instead of matching raw pixels as in optical flow, SIFT flow searches for correspondences of SIFT descriptors on the grid coordinates \(p=(x,\, y)\) of images. The dense correspondence map, or the movement field, can be obtained by minimizing an objective function \(E({\varvec{w}})\):

\[ E({\varvec{w}})=\sum _{p}\min \left( \left\Vert s_{p}^{1}-s_{p+w_{p}}^{2}\right\Vert _{1},\, t\right) +\sum _{p}\eta \left( \left| u_{p}\right| +\left| v_{p}\right| \right) +\sum _{(p,q)\in \varepsilon }\min \left( \alpha \left| u_{p}-u_{q}\right| ,\, d\right) +\min \left( \alpha \left| v_{p}-v_{q}\right| ,\, d\right) \]

where \(s_{p}^{1}\) and \(s_{p}^{2}\) denote the SIFT descriptors at position p in the two SIFT images, and \(w_{p}=(u_{p},\, v_{p})\) is the flow vector at p. The parameters t and d are the thresholds of the data term and the smoothness term, respectively, while \(\eta\) and \(\alpha\) weight the small-displacement and smoothness terms. The set \(\varepsilon\) contains all spatial four-neighborhoods.
After obtaining the correspondence map on the sequential SIFT images, the appearance change field can be constructed by computing the difference of SIFT descriptors between corresponding points.
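A minimal sketch of how the two fields could be derived once a flow field is available. The dense-SIFT images and integer flow are assumed as given inputs; computing the flow itself (minimizing \(E({\varvec{w}})\)) is not shown:

```python
import numpy as np

def flow_fields(sift1, sift2, flow):
    """Given dense SIFT images of shape (H, W, C) and a flow field (H, W, 2)
    of integer displacements, build the movement field (magnitude, angle)
    and the appearance change field (L1 discrepancy of the corresponding
    descriptors)."""
    H, W, _ = sift1.shape
    u, v = flow[..., 0], flow[..., 1]
    ys, xs = np.mgrid[0:H, 0:W]
    # corresponding positions p + w_p, clipped to the image bounds
    x2 = np.clip(xs + u, 0, W - 1).astype(int)
    y2 = np.clip(ys + v, 0, H - 1).astype(int)
    magnitude = np.hypot(u, v)                        # |w_p|
    angle = np.arctan2(v, u)                          # flow direction
    # discrepancy of corresponding SIFT descriptors: |s1_p - s2_{p+w}|_1
    appearance = np.abs(sift1 - sift2[y2, x2]).sum(axis=2)
    return magnitude, angle, appearance
```

With zero flow and identical descriptor images, both the magnitude and the appearance change field are identically zero, as expected.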
Histograms of oriented SIFT flow
Because raw SIFT flow is susceptible to scale changes and to the directionality of movement, it cannot achieve good performance if applied directly as a feature. Inspired by the histograms of oriented optical flow, SIFT flow is binned according to its primary angle from the horizontal axis and weighted according to its magnitude or appearance difference, as shown in Fig. 4. In this way, \(F_{MDF}=\left\{ f_{MDF}^{1},\,\ldots ,\, f_{MDF}^{R}\right\}\) and \(F_{ACF}=\left\{ f_{ACF}^{1},\,\ldots ,\, f_{ACF}^{R}\right\}\) are obtained to characterize the movement and the appearance variation of the cytoplasm, respectively:

\[ f_{MDF}^{r}=\sum _{p:\,\theta _{p}\in bin(r)}\left\Vert w_{p}\right\Vert _{2},\qquad f_{ACF}^{r}=\sum _{p:\,\theta _{p}\in bin(r)}\left\Vert s_{p}^{1}-s_{p+w_{p}}^{2}\right\Vert _{1} \]

where \(\theta _{p}\) is the angle of the flow vector at p, \(f_{MDF}^{r}\) denotes the accumulation of displacement magnitude belonging to the rth (\(1\le r\le R\)) bin in the movement field, and \(f_{ACF}^{r}\) is the sum of the appearance differences in the rth (\(1\le r\le R\)) bin.
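The oriented binning can be sketched as follows. Passing the displacement magnitude as the weight yields an \(F_{MDF}\)-style histogram, and passing the descriptor discrepancy yields an \(F_{ACF}\)-style one:

```python
import numpy as np

def oriented_histogram(angle, weight, n_bins=36):
    """Bin flow vectors by their angle from the horizontal axis, and
    accumulate `weight` (displacement magnitude for F_MDF, descriptor
    discrepancy for F_ACF) into the corresponding bin."""
    # map angles in [-pi, pi) to bin indices 0..n_bins-1
    bins = ((angle + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), weight.ravel())
    return hist
```

The same routine serves both fields, which keeps \(F_{MDF}\) and \(F_{ACF}\) directly comparable bin by bin.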
Combination of features
The robustness of the feature representation can be enhanced by combining complementary features. To sum up, the aforementioned \(F_{DCS}\), \(F_{MDF}\) and \(F_{ACF}\) between frames are concatenated to form a feature vector:

\[ F_{i}=\left\{ F_{DCS},\, F_{MDF},\, F_{ACF}\right\} \]
Computing \(F_{i}\) is the key step in extracting the features of cell dynamics in the whole framework. The “Temporal aggregation of dynamic features” section then focuses on encoding the chronological structure of the cell dynamic features in a particular video.
Temporal aggregation of dynamic features
For the video-wide cell dynamics, it is essential to aggregate a series of frame-level dynamic features along the temporal extent in a rational way. That is to say, we need to consider how the dynamic features evolve over time in a video. In this section, we present three compact encoding methods, namely FV, VLAD, and H-VLAD, to capture the temporal information of cell sequences. The pipeline for the temporal aggregation strategy in this paper is depicted in Fig. 5. It can be summarized in the following two phases: (1) In the training phase, the samples in the cell video dataset are transformed into dynamic features with the aid of the algorithms in the “Dynamic features between frames” section. A compact dictionary with K visual words is then learned from these features by means of K-means or a Gaussian mixture model (GMM). (2) In the testing phase, the features of cell dynamics are obtained similarly and assigned to the K visual words. The residuals between the visual words and the dynamic features assigned to them are then encoded into the temporal-feature-aggregated vector.
Fisher vector encoding
In FV encoding [26, 27], a GMM with K components can be learned from the training dynamic features between frames, denoted as \(\varTheta =\left\{ (\varvec{\mu }_{k},\varvec{\sigma }_{k},\pi _{k}),\, k=1,2,\ldots ,K\right\}\), where \(\varvec{\mu }_{k}\), \(\varvec{\sigma }_{k}\), \(\pi _{k}\) are the mean vector, variance matrix (assumed diagonal) and mixture weight of the kth component, respectively. Given a set \({\varvec{X}}=({\varvec{x}}_{1},{\varvec{x}}_{2},\ldots \,,{\varvec{x}}_{N})\) of dynamic features extracted from a testing cell image sequence, the mean and covariance deviation vectors for the kth component are:

\[ {\varvec{u}}_{k}=\frac{1}{N\sqrt{\pi _{k}}}\sum _{i=1}^{N}q_{ik}\,\frac{{\varvec{x}}_{i}-\varvec{\mu }_{k}}{\varvec{\sigma }_{k}},\qquad {\varvec{v}}_{k}=\frac{1}{N\sqrt{2\pi _{k}}}\sum _{i=1}^{N}q_{ik}\left[ \left( \frac{{\varvec{x}}_{i}-\varvec{\mu }_{k}}{\varvec{\sigma }_{k}}\right) ^{2}-1\right] \]

where \(q_{ik}\) is the soft assignment of feature \({\varvec{x}}_{i}\) to the kth Gaussian component. By concatenating \({\varvec{u}}_{k}\) and \({\varvec{v}}_{k}\) over all K components, the FV for the testing sample is formed with size \(2D^{'}K\), where \(D^{'}\) is the dimension of the dynamic feature after principal component analysis (PCA) preprocessing [27]. Power normalization using the signed square root (SSR), \(z=sign(z)\sqrt{\left| z\right| }\), and \(\ell _{2}\) normalization are then applied to the FVs [26, 27].
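A NumPy sketch of FV encoding from given diagonal-covariance GMM parameters (learned elsewhere, e.g., on the training features); the PCA preprocessing step is omitted:

```python
import numpy as np

def fisher_vector(X, mu, sigma, pi):
    """FV encoding from GMM parameters Theta = {(mu_k, sigma_k, pi_k)}.

    X: (N, D) frame-level dynamic features; mu, sigma: (K, D); pi: (K,).
    Returns the SSR- and l2-normalized FV of size 2*D*K."""
    N, D = X.shape
    K = len(pi)
    # soft assignments q_ik under the diagonal-covariance GMM
    log_p = (-0.5 * ((X[:, None, :] - mu) / sigma) ** 2
             - np.log(sigma)).sum(2) + np.log(pi) - 0.5 * D * np.log(2 * np.pi)
    q = np.exp(log_p - log_p.max(1, keepdims=True))
    q /= q.sum(1, keepdims=True)
    fv = []
    for k in range(K):
        diff = (X - mu[k]) / sigma[k]
        u_k = (q[:, k, None] * diff).sum(0) / (N * np.sqrt(pi[k]))
        v_k = (q[:, k, None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * pi[k]))
        fv.extend([u_k, v_k])
    fv = np.concatenate(fv)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))     # SSR power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)   # l2 normalization
```

Each video thus maps to a fixed-length vector of size \(2D^{'}K\), regardless of its number of frames.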
VLAD encoding
As a non-probabilistic version of FV encoding, VLAD encoding [28, 29] simply utilizes K-means instead of a GMM to generate K coarse centers \(\left\{ {\varvec{c}}_{1},{\varvec{c}}_{2},\ldots \,,{\varvec{c}}_{K}\right\}\). We can then obtain the difference vector \({\varvec{u}}_{k}\) with respect to the kth center \({\varvec{c}}_{k}\) for the testing dynamic feature set by:

\[ {\varvec{u}}_{k}=\sum _{{\varvec{x}}_{i}:\, NN({\varvec{x}}_{i})={\varvec{c}}_{k}}\left( {\varvec{x}}_{i}-{\varvec{c}}_{k}\right) \]

where \(NN({\varvec{x}}_{i})\) indicates \({\varvec{x}}_{i}\)’s nearest neighbor among the K coarse centers.
The VLAD encoding vector concatenates \({\varvec{u}}_{k}\) over all K centers with size \(D^{'}K\), and the post-processing employs the power and \(\ell _{2}\) normalizations. Besides, intra-normalization [35] is also applied to normalize each \({\varvec{u}}_{k}\). The proposed framework adopts VLAD-k (k = 5), a variant of VLAD that extends the nearest neighbor to the k-nearest neighbors, because of its good performance compared with the original VLAD [36].
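A sketch of VLAD-k encoding with intra-, power- and \(\ell_{2}\)-normalization; `k_nn=1` recovers the original VLAD:

```python
import numpy as np

def vlad(X, centers, k_nn=1):
    """VLAD-k: accumulate residuals of each feature to its k_nn nearest
    centers, then intra-, power- (SSR) and l2-normalize.

    X: (N, D) dynamic features; centers: (K, D) K-means centers."""
    K, D = centers.shape
    d = ((X[:, None, :] - centers) ** 2).sum(2)   # squared distances (N, K)
    nn = np.argsort(d, axis=1)[:, :k_nn]          # k_nn nearest centers
    u = np.zeros((K, D))
    for i, neigh in enumerate(nn):
        for k in neigh:
            u[k] += X[i] - centers[k]             # residual x_i - c_k
    u /= np.linalg.norm(u, axis=1, keepdims=True) + 1e-12  # intra-normalization
    v = u.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))           # SSR power normalization
    return v / (np.linalg.norm(v) + 1e-12)        # l2 normalization
```

The resulting vector has size \(D^{'}K\), matching the dimensionality stated above.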
High-order VLAD encoding
In order to keep both high performance and high extraction speed, H-VLAD [30] augments the original VLAD with high-order statistics, e.g., the diagonal covariance and the skewness. The K clusters are first learned by K-means and regarded as the visual words \(\left\{ {\varvec{w}}_{1},{\varvec{w}}_{2},\ldots \,,{\varvec{w}}_{K}\right\}\), and the corresponding first-order, second-order, and third-order statistics are denoted as \(\left\{ \varvec{\mu }_{1},\varvec{\mu }_{2},\ldots \,,\varvec{\mu }_{K}\right\}\), \(\left\{ \varvec{\sigma }_{1},\varvec{\sigma }_{2},\ldots \,,\varvec{\sigma }_{K}\right\}\) and \(\left\{ \varvec{\gamma }_{1},\varvec{\gamma }_{2},\ldots \,,\varvec{\gamma }_{K}\right\}\), respectively. The technical details of H-VLAD can be summarized as:

\[ {\varvec{u}}_{k}={\varvec{m}}_{k}-\varvec{\mu }_{k},\qquad {\varvec{v}}_{k}=\frac{1}{N_{k}}\sum _{i=1}^{N_{k}}\left( {\varvec{x}}_{i}-{\varvec{m}}_{k}\right) ^{2}-\varvec{\sigma }_{k},\qquad {\varvec{s}}_{k}=\frac{1}{N_{k}}\sum _{i=1}^{N_{k}}\left( {\varvec{x}}_{i}-{\varvec{m}}_{k}\right) ^{3}-\varvec{\gamma }_{k} \]

where \({\varvec{X}}_{k}=\left\{ {\varvec{x}}_{1},{\varvec{x}}_{2},\ldots \,,{\varvec{x}}_{N_{k}}\right\}\) is the set of testing dynamic features belonging to the kth visual word \({\varvec{w}}_{k}\), and \({\varvec{m}}_{k}\) stands for the mean of these dynamic features. Therefore, \({\varvec{u}}_{k}\), \({\varvec{v}}_{k}\) and \({\varvec{s}}_{k}\) are the first-order, second-order and third-order residual vectors, respectively. Similar to the original VLAD, the final H-VLAD representation is concatenated as \(\left\{ {\varvec{u}}_{1},{\varvec{v}}_{1},{\varvec{s}}_{1},{\varvec{u}}_{2},{\varvec{v}}_{2},{\varvec{s}}_{2},\ldots \,,{\varvec{u}}_{K},{\varvec{v}}_{K},{\varvec{s}}_{K}\right\}\), and the post-processing also adopts the power, \(\ell _{2}\) and intra-normalizations [35].
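One possible reading of the H-VLAD residuals above, sketched in NumPy; the exact estimators in [30] may differ, so the second- and third-order terms here (std and standardized skewness against dictionary statistics) should be treated as illustrative:

```python
import numpy as np

def hvlad(X, words, sigma_dict, gamma_dict):
    """H-VLAD sketch: for each visual word, residuals between the statistics
    of the assigned testing features and the dictionary statistics --
    first order (mean vs word), second order (std vs sigma_k), third order
    (skewness vs gamma_k) -- concatenated with intra/power/l2 normalization."""
    K, D = words.shape
    assign = ((X[:, None, :] - words) ** 2).sum(2).argmin(1)  # hard assignment
    out = []
    for k in range(K):
        Xk = X[assign == k]
        if len(Xk) == 0:
            out.append(np.zeros(3 * D))        # empty word: zero block
            continue
        m_k = Xk.mean(0)
        u_k = m_k - words[k]                   # first-order residual
        v_k = Xk.std(0) - sigma_dict[k]        # second-order residual
        centred = Xk - m_k
        skew = (centred ** 3).mean(0) / (Xk.std(0) ** 3 + 1e-12)
        s_k = skew - gamma_dict[k]             # third-order residual
        blk = np.concatenate([u_k, v_k, s_k])
        blk /= np.linalg.norm(blk) + 1e-12     # intra-normalization
        out.append(blk)
    v = np.concatenate(out)
    v = np.sign(v) * np.sqrt(np.abs(v))        # SSR power normalization
    return v / (np.linalg.norm(v) + 1e-12)     # l2 normalization
```

The final vector is three times the size of plain VLAD (\(3D^{'}K\)), reflecting the added second- and third-order blocks.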
Analysis of temporal feature aggregation methods
Given the above three compact encoding approaches to temporal feature aggregation, we need to find out which one is the most appropriate for our application. For this purpose, we conduct an experiment on Dataset I (for details see the “Data” section) to analyze the discrimination of these encoding strategies: FV, VLAD, and H-VLAD. Specifically, we calculate the histogram distributions of classification scores from positive and negative exemplars, respectively. The positive and negative exemplars correspond to 20 training samples from the SST group and the AST group, respectively. Note that the classifier is a linear SVM (with the same parameters as in the “Experimental setup” section), the dictionary size is 64 for FV, VLAD, and H-VLAD, and the encoding vector is not followed by temporal pyramid pooling (TPP). From Fig. 6, we find that VLAD and H-VLAD encoding show similar discrimination, while FV encoding performs better. This shows that FV encoding is the most suitable for the temporal aggregation of cell dynamic features.
Temporal pyramid pooling
To preserve more temporal discrimination, we add TPP, regarded as a one-dimensional version of spatial pyramid pooling [37]. For a particular video, suppose its dynamic features between frames are denoted as \({\varvec{Z}}\) and the temporal aggregation operation is defined as \({\varPhi }(\cdot )\). TPP organizes the dynamic features \({\varvec{Z}}\) into three levels of subsets: \(Z_{1}^{1}\), \(Z_{2}^{1}\), \(Z_{2}^{2}\), \(Z_{3}^{1}\), \(Z_{3}^{2}\) and \(Z_{3}^{3}\), which have 1, 2 and 3 average-partitioned sub-windows along the temporal dimension, respectively. Therefore, the TPP of \({\varvec{Z}}\) can be written as follows:

\[ TPP({\varvec{Z}})=\left\{ \varPhi (Z_{1}^{1}),\,\varPhi (Z_{2}^{1}),\,\varPhi (Z_{2}^{2}),\,\varPhi (Z_{3}^{1}),\,\varPhi (Z_{3}^{2}),\,\varPhi (Z_{3}^{3})\right\} \]
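The pooling above can be sketched generically over any aggregation \(\varPhi\) (here passed in as `encode`):

```python
import numpy as np

def temporal_pyramid_pool(Z, encode):
    """Temporal pyramid pooling: apply the aggregation Phi (`encode`) to the
    whole sequence, its 2 halves, and its 3 thirds, and concatenate the
    six resulting vectors (Z_1^1, Z_2^1, Z_2^2, Z_3^1, Z_3^2, Z_3^3)."""
    parts = []
    for level in (1, 2, 3):
        # average-partitioned sub-windows along the temporal dimension
        for segment in np.array_split(Z, level):
            parts.append(encode(segment))
    return np.concatenate(parts)
```

With FV as `encode`, the final representation is six concatenated FVs, one per sub-window.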
Experimental results
In this section, we present a detailed experimental evaluation of our proposed framework based on the cellvideo datasets in “Data” section. Several exploration experiments were conducted to determine the crucial parameters of the proposed approach. Moreover, the proposed approach is compared with several existing methods.
Experimental setup
The parameters used in our approach can be divided into three parts: the parameters in feature extraction, feature encoding, and the classifier. Firstly, both shape context for contour deformation and SIFT flow for cytoplasmic streaming adopt the default parameters reported in the literature [14, 34], and the number of bins in the histograms of oriented SIFT flow is set to 36. As the frame interval has a direct relationship with cell dynamics, we conduct a contrast experiment on different frame intervals. As shown in Fig. 7, the best performance is achieved across different encoding methods and vocabulary sizes when the frame interval equals 30 (the default value unless otherwise assigned) (Additional file 1: Figure S1). Secondly, there is an important parameter, the vocabulary size, which relates not only to the encoding discrimination but also to classifier overfitting (see the discussion in the “Performance evaluation of temporal aggregation” section). Finally, the linear classifier in the LibSVM toolkit [38] is adopted. After the parameters are chosen, we retrain the classifier on the 30 random splits of the two datasets (refer to the “Data” section), and the penalty coefficient is determined on the training set using fivefold cross-validation.
Validation of dynamic features between frames
In this paper, the frame-level dynamic features are extracted in two aspects: contour deformation and cytoplasmic streaming. Based on Dataset II, we first extract various kinds of dynamic features between frames, including the Zernike moment (ZM) [7], TG [13], the radial-distance feature (RDF) [20], the shape-context feature (SCF), the optical flow feature (OFF) [18], the SIFT flow feature (SFF), the complementary Movement Field and Appearance change Field Feature (MFAFF), and the combined feature vector (CF) in the “Combination of features” section. For a fair comparison, we keep the dimension of all the frame-level dynamic features at 30 by choosing appropriate parameters for each feature. ZM with 30 orders is captured from the samples. TG and RDF are both sampled into 30 discrete points: TG samples the perimeter of the cell contour at equal intervals, while RDF is based on equal angle-interval sampling. The number of iterations L for SCF is specified as 30. The dimensions of OFF, SFF and MFAFF are decided by the histogram of oriented optical/SIFT flow, i.e., the number of histogram bins R. In detail, R for OFF and SFF is set to 30, while R for MFAFF is chosen as 15. CF is the combination of SCF and MFAFF, so L and R are both set to 10. Then, for all kinds of dynamic features, a video-range aggregation strategy is implemented by average pooling, i.e., averaging the frame-level features of each video along the temporal dimension. At last, we perform the classification with the aggregated features using an SVM.
As shown in Fig. 8, SCF achieves better performance than ZM, TG, and RDF, which proves the effectiveness of the proposed contour feature. TG and RDF both belong to the radial-distance-based features, but TG obtains 8.1% lower accuracy than RDF. The reason might be that TG is designed for the dynamics of cell protrusions, and the lymphocytes in our dataset do not have explicit protrusions. In the aspect of cytoplasm motion features, SFF achieves 0.625% higher accuracy than OFF. Moreover, MFAFF further improves the performance compared with SFF. Comparing the dynamic features of the two aspects, we find that the contour deformation features play a more dominant role in characterizing cell dynamics. Finally, CF reaches the highest classification accuracy (67.50%), which illustrates the significance of combining the cellular contour and cytoplasmic streaming dynamics.
Performance evaluation of temporal aggregation
The effectiveness of the temporal aggregation can be validated by the following experiment. In addition, the vocabulary size is an important parameter: intuitively, if the codebook size is too small, the histogram feature may lose discriminative power, while if it is too large, histograms from the same class may not be sufficiently similar. Fisher vector encoding, as the most suitable encoding strategy for our application, makes use of a GMM to generate the compact dictionary. In this section, FV with different vocabulary sizes is applied to Dataset II, and the classification results are shown in Table 1. The combined feature vector (CF) in the “Combination of features” section is employed as the frame-level dynamic feature. The Fisher vector brings a substantial increase in classification accuracy compared with the result of CF in the “Validation of dynamic features between frames” section. We try five vocabulary sizes (denoted as K): 16, 32, 64, 128 and 256; the performance of FV encoding increases initially and then decreases as the vocabulary size grows. When K equals 64, the performance reaches its peak. However, \(K=128,\,256\) could make the encoding vector too sparse, which is somewhat detrimental to performance.
Effectiveness of the proposed framework
Finally, we evaluate the performance of our proposed framework in comparison with several existing algorithms, which are divided into two groups. The first group corresponds to the first five rows in Tables 2 and 3 and contains shape parameters, ZM, TG, Dynamic-Time-Warping-based radial distance (DTW-Radial distance), as well as the radial distance and optical flow combined feature (RD-OF feature) [5, 7, 13, 18, 20]. These five algorithms mainly focus on modeling the cell dynamics between frames without emphasizing temporal aggregation. Specifically, a subsequence is sampled from a particular video clip with a fixed frame interval (specified as 20), except that DTW-Radial distance obtains the subsequence using dynamic time warping. Cell dynamic features are then extracted from the subsequence and concatenated into the video-range feature of cell dynamics. The last four rows in Tables 2 and 3 belong to the second group, in which not only the short-term cell dynamics but also the temporal aggregation are considered in depth. The stochastic annotation of phenotypic individual-cell responses (SAPHIRE) framework only employs shape parameters as descriptors of cell shape dynamics and models video-range cell dynamics with an HMM [24]. The local deformation pattern (LDP) framework employs the radial distance to characterize cell deformation and accumulates the continuous deformation along the radial direction [21]. The temporal bag-of-words (TBoW) framework was reported in our previous work [25], and our proposed framework is denoted as VFA.
The experiments are conducted on Dataset I and Dataset II, and the experimental results (classification precision, recall, and F-score) are summarized in Tables 2 and 3, respectively. The cell dynamics in Dataset I are categorized into two classes, normal and abnormal, while in Dataset II the abnormal cell dynamics are further annotated into three subcategories (slight, moderate and drastic activation). As shown in Tables 2 and 3, the RD-OF feature achieves better performance than the other methods in the first group. This indicates that the dynamic features from the cell contour and from cytoplasmic streaming are complementary. Compared with DTW-Radial distance, the RD-OF feature improves the F-score by \(0.87\%\) on Dataset I but by \(8.82\%\) on Dataset II. This illustrates that integrating cytoplasmic streaming dynamics brings more improvement in the complex situation, i.e., the refined categorization of abnormal cell dynamics.
By modeling video-range temporal dynamics, the frameworks in the second group bring a substantial absolute improvement over the corresponding features they build on. For example, SAPHIRE gains \(13.75\%\) and \(26.85\%\) in F-score over shape parameters in Tables 2 and 3, respectively. Similarly, LDP also performs better than DTW-Radial distance. That both methods improve on both datasets confirms the significance of temporally aggregating the cell dynamics. TBoW and VFA are based on the same primary feature (CF in the “Combination of features” section), but VFA achieves better performance (\(93.70\%\) precision, \(89.70\%\) recall, and \(91.41\%\) F-score in Table 2; \(82.34\%\) precision, \(81.65\%\) recall, and \(81.27\%\) F-score in Table 3). This demonstrates the benefit of introducing FV encoding for the temporal aggregation of cell dynamics. Finally, VFA achieves the best performance in both Tables 2 and 3, showing that our proposed framework outperforms the other existing algorithms.
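The FV encoding step used for temporal aggregation can be sketched as follows. This is a simplified illustration, not the exact implementation: it assumes a diagonal-covariance GMM and keeps only the mean-gradient component, whereas the full Fisher vector also includes weight and variance gradients [26, 27]; the SSR and L2 normalization follow standard practice:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """Simplified Fisher vector: gradients w.r.t. GMM means only."""
    Q = gmm.predict_proba(X)                  # (T, K) soft assignments
    T, K = Q.shape
    d = X.shape[1]
    fv = np.zeros((K, d))
    for k in range(K):
        # Whitened residuals of the frame-level features against component k.
        diff = (X - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        fv[k] = (Q[:, k, None] * diff).sum(axis=0) / (T * np.sqrt(gmm.weights_[k]))
    fv = fv.reshape(-1)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))    # signed square root (SSR)
    return fv / (np.linalg.norm(fv) + 1e-12)  # L2 normalization

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))             # frame-level features pooled over videos
gmm = GaussianMixture(n_components=3, covariance_type="diag",
                      random_state=0).fit(train)
video_feats = rng.normal(size=(30, 4))        # one video's frame-level dynamics
fv = fisher_vector(video_feats, gmm)
print(fv.shape)  # (12,) = K components x d dimensions
```

Each video, regardless of length, is thus mapped to a fixed-length vector suitable for SVM classification.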
Discussion
The proposed framework can readily be extended to other applications involving cell temporal dynamics or cell deformation estimation. The whole framework is, in principle, compatible with any classification task based on cell temporal dynamics. For example, cellular-response-based drug classification tasks explore how the cellular response varies with different drug stimuli, which can be captured as cell temporal dynamics in videos; the proposed framework can serve as a scheme for such tasks. Moreover, parts of the framework may also benefit other live-cell studies that require cell temporal dynamics. Modeling the cell cycle, for instance, incorporates temporal information into the annotation of cellular states in time-lapse movies. Existing methods generally exploit static cell morphology as frame-level features and an HMM as the feature aggregation strategy; our frame-level cell dynamic features could complement cell morphology features for cell cycle modeling.
In addition, our proposed framework involves some limitations and assumptions. Both the shape context and SIFT flow assume that the time interval between frames is short relative to the cell temporal dynamics. Furthermore, we use SIFT flow to approximately describe 3D cytoplasmic streaming from 2D images. Although this approach is effective for modeling intracellular movement to some extent, we plan to investigate modeling cytoplasmic streaming in 3D space in future work.
Conclusion
We have presented a novel framework to evaluate cell dynamics in video clips. It first extracts frame-level cell dynamic features based on both contour deformation and cytoplasmic streaming, and then leverages compact encoding to aggregate these short-term features into a video-range representation of cell dynamics. A series of experiments was conducted to evaluate the proposed framework. The first experiment verifies the effectiveness of the proposed cell dynamic features and shows that MFAFF models cytoplasmic streaming more precisely. The second experiment, on temporal aggregation, identifies the most suitable encoding strategy and its best parameters. Finally, the proposed framework was compared with existing mainstream approaches on two datasets, and the experimental results show that it outperforms them in the assessment and classification of cell dynamics.
Notes
 1.
The tree graph is suitable for computing the number and length of cell protrusions.
 2.
The code for shape context is available at https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/shape/sc_digits.html.
 3.
The code for SIFT flow is available at https://people.csail.mit.edu/celiu/SIFTflow/.
 4.
The RDOF feature employs a fusion strategy over multiple features, including the radial-distance feature and the optical flow feature in the “Validation of dynamic features between frames” section [18]. This fusion strategy improves the profiling of cell dynamics.
 5.
Dynamic time warping can select a more suitable subsequence for analyzing cell dynamics, which enables DTW-Radial distance to achieve better performance.
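The dynamic-time-warping alignment underlying this subsequence selection can be sketched with the classic DTW recurrence (an illustrative implementation only; the actual alignment procedure used by DTW-Radial distance may differ in its cost function and constraints):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D sequences
    (e.g. radial-distance profiles of a cell over time)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_distance([0, 1, 2, 3], [0, 1, 2, 3]))  # 0.0 for identical sequences
```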
Abbreviations
 SIFT:

Scale-Invariant Feature Transform
 MFAFF:

Movement Field and Appearance Change Field Feature
 HMM:

hidden Markov models
 TBoW:

temporal bag of words
 FV:

Fisher vector
 VLAD:

vector of locally aggregated descriptors
 HVLAD:

higher-order VLAD
 SST:

self-skin transplantation
 AST:

allergenic-skin transplantation
 TPS:

thin plate spline model
 HOSF:

histograms of oriented SIFT flow
 GMM:

Gaussian mixture model
 PCA:

principal component analysis
 SSR:

signed square root
 SVM:

support vector machine
 TPP:

temporal pyramid pooling
 ZM:

Zernike moment
 TG:

tree graph
 RDF:

radial-distance feature
 SCF:

shape-context feature
 OFF:

optical flow feature
 SFF:

SIFT flow feature
 CF:

combined feature vector
 DTW:

dynamic time warping
 RDOF:

radial distance and optical flow
 SAPHIRE:

stochastic annotation of phenotypic individualcell responses
 LDP:

local deformation pattern
 VFA:

video feature aggregation
References
 1.
Caicedo JC, Cooper S, Heigwer F, Warchal S, Qiu P, Molnar C, Vasilevich AS, Barry JD, Bansal HS, Kraus O, et al. Data-analysis strategies for image-based cell profiling. Nat Methods. 2017;14(9):849.
 2.
Peixoto HM, Munguba H, Cruz RM, Guerreiro AM, Leao RN. Automatic tracking of cells for video microscopy in patch clamp experiments. Biomed Eng Online. 2014;13(1):78.
 3.
Prinyakupt J, Pluempitiwiriyawej C. Segmentation of white blood cells and comparison of cell morphology by linear and naïve Bayes classifiers. Biomed Eng Online. 2015;14(1):63.
 4.
Koprowski R. Quantitative assessment of the impact of biomedical image acquisition on the results obtained from image analysis and processing. Biomed Eng Online. 2014;13(1):93.
 5.
Xiong Y, Iglesias PA. Tools for analyzing cell shape changes during chemotaxis. Integr Biol. 2010;2(11–12):561–7.
 6.
Zhong Q, Busetto AG, Fededa JP, Buhmann JM, Gerlich DW. Unsupervised modeling of cell morphology dynamics for time-lapse microscopy. Nat Methods. 2012;9(7):711–3.
 7.
Alizadeh E, Lyons SM, Castle JM, Prasad A. Measuring systematic changes in invasive cancer cell shape using Zernike moments. Integr Biol. 2016;8(11):1183–93.
 8.
Li H, Pang F, Shi Y, Liu Z. Cell dynamic morphology classification using deep convolutional neural networks. Cytom Part A. 2018;93A(6):628–38.
 9.
Kachouie NN, Fieguth P, Jervis E. A probabilistic cell model in background corrected image sequences for single cell analysis. Biomed Eng Online. 2010;9(1):57.
 10.
Wang K, Sun W, Richie CT, Harvey BK, Betzig E. Direct wavefront sensing for high-resolution in vivo imaging in scattering tissue. Nat Commun. 2015;6:7276.
 11.
Li D, Shao L, Chen BC, Zhang X, Zhang M, Moses B, Milkie DE, Beach JR, Hammer JA, Pasham M, et al. Extended-resolution structured illumination imaging of endocytic and cytoskeletal dynamics. Science. 2015;349(6251):3500.
 12.
Kotyk T, Dey N, Ashour AS, Drugarin CVA, Gaber T, Hassanien AE, Snasel V. Detection of dead stained microscopic cells based on color intensity and contrast. In: The 1st international conference on advanced intelligent system and informatics (AISI). Berlin: Springer; 2016. p. 57–68.
 13.
Tsygankov D, Bilancia CG, Vitriol EA, Hahn KM, Peifer M, Elston TC. CellGeo: a computational platform for the analysis of shape changes in cells with complex geometries. J Cell Biol. 2014;204(3):443–60.
 14.
Belongie S, Malik J, Puzicha J. Shape matching and object recognition using shape contexts. IEEE Trans Pattern Anal Mach Intell. 2002;24(4):509–22.
 15.
Chen W, Liang X, Maciejewski R, Ebert DS. Shape context preserving deformation of 2D anatomical illustrations. In: Computer graphics forum, vol. 28. Wiley Online Library; 2009. p. 114–26.
 16.
Rougier C, Meunier J, StArnaud A, Rousseau J. Robust video surveillance for fall detection based on human shape deformation. IEEE Trans Circuits Syst Video Technol. 2011;21(5):611–22.
 17.
Dunkers JP, Lee YJ, Chatterjee K. Single cell viability measurements in 3D scaffolds using in situ label free imaging by optical coherence microscopy. Biomaterials. 2012;33(7):2119–26.
 18.
Huang Y, Liu Z, Shi Y, Li N, An X, Gou X. Quantitative analysis of lymphocytes morphology and motion in intravital microscopic images. In: 35th annual international conference of the IEEE engineering in medicine and biology society (EMBC). New York: IEEE; 2013. p. 3686–89.
 19.
Yuan L, Zheng YF, Zhu J, Wang L, Brown A. Object tracking with particle filtering in fluorescence microscopy images: application to the motion of neurofilaments in axons. IEEE Trans Med Imaging. 2012;31(1):117–30.
 20.
An X, Liu Z, Shi Y, Li N, Wang Y, Joshi SH. Modeling dynamic cellular morphology in images. In: International conference on medical image computing and computerassisted intervention (MICCAI). Berlin: Springer; 2012. p. 340–7.
 21.
Li H, Liu Z, Pang F, Fan Z, Shi Y. Analyzing dynamic cellular morphology in time-lapsed images enabled by cellular deformation pattern recognition. In: 37th annual international conference of the IEEE engineering in medicine and biology society (EMBC). New York: IEEE; 2015. p. 7478–81.
 22.
Pang F, Li H, Shi Y, Liu Z. Computational analysis of cell dynamics in videos with hierarchical-pooled deep-convolutional features. J Comput Biol. 2018;25(8):934–53.
 23.
Held M, Schmitz MH, Fischer B, Walter T, Neumann B, Olma MH, Peter M, Ellenberg J, Gerlich DW. CellCognition: time-resolved phenotype annotation in high-throughput live cell imaging. Nat Methods. 2010;7(9):747–54.
 24.
Gordonov S, Hwang MK, Wells A, Gertler FB, Lauffenburger DA, Bathe M. Time series modeling of live-cell shape dynamics for image-based phenotypic profiling. Integr Biol. 2016;8(1):73–90.
 25.
Pang F, Liu Z, Li H, Shi Y. The measurement of cell viability based on temporal bag of words for image sequences. In: IEEE international conference on image processing (ICIP). New York: IEEE; 2015. p. 4185–9.
 26.
Perronnin F, Sánchez J, Mensink T. Improving the fisher kernel for large-scale image classification. In: European conference on computer vision (ECCV). Berlin: Springer; 2010. p. 143–56.
 27.
Sánchez J, Perronnin F, Mensink T, Verbeek J. Image classification with the fisher vector: theory and practice. Int J Comput Vis. 2013;105(3):222–45.
 28.
Jégou H, Douze M, Schmid C, Pérez P. Aggregating local descriptors into a compact image representation. In: IEEE conference on computer vision and pattern recognition (CVPR). New York: IEEE; 2010. p. 3304–11.
 29.
Jégou H, Perronnin F, Douze M, Sánchez J, Perez P, Schmid C. Aggregating local image descriptors into compact codes. IEEE Trans Pattern Anal Mach Intell. 2012;34(9):1704–16.
 30.
Peng X, Wang L, Qiao Y, Peng Q. Boosting VLAD with supervised dictionary learning and high-order statistics. In: European conference on computer vision (ECCV). Berlin: Springer; 2014. p. 660–74.
 31.
Seroussi I, Veikherman D, Ofer N, Yehudai-Resheff S, Keren K. Segmentation and tracking of live cells in phase-contrast images using directional gradient vector flow for snakes. J Microsc. 2012;247(2):137–46.
 32.
Chang HC, Lai SH, Lu KR. A robust realtime video stabilization algorithm. J Vis Commun Image Represent. 2006;17(3):659–73.
 33.
Kuhn H. The hungarian method for the assignment problem. Naval Res Logist. 2005;52(1):7–21.
 34.
Liu C, Yuen J, Torralba A. SIFT flow: dense correspondence across scenes and its applications. IEEE Trans Pattern Anal Mach Intell. 2011;33(5):978–94.
 35.
Arandjelovic R, Zisserman A. All about VLAD. In: IEEE conference on computer vision and pattern recognition (CVPR). 2013. p. 1578–85.
 36.
Kantorov V, Laptev I. Efficient feature extraction, encoding and classification for action recognition. In: IEEE conference on computer vision and pattern recognition (CVPR). 2014. p. 2593–600.
 37.
Lazebnik S, Schmid C, Ponce J. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE conference on computer vision and pattern recognition (CVPR), vol. 2. New York: IEEE; 2006. p. 2169–78.
 38.
Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011;2(3):27.
Authors' contributions
FP implemented the proposed framework, conducted the experiments and drafted the manuscript; ZL revised the manuscript critically and gave final approval of the version to be published. Both authors read and approved the final manuscript.
Acknowledgements
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
Several video clips of the datasets are available at http://isip.bit.edu.cn/kyxz/xzlw/77051.htm.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Funding
This work was supported in part by the National Natural Science Foundation of China (61271112).
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author information
Additional file
12938_2019_638_MOESM1_ESM.pdf
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Cell deformation
 Intracellular movement
 Video feature aggregation
 Shape context
 SIFT flow