In this section, we first describe two live-cell video datasets for evaluating the utility of the proposed framework. These two datasets individually contain 80 video clips in two classes and 120 video clips in four classes. Then the cell dynamic features between frames are presented, which include the contour feature using shape-context and the cytoplasm feature based on SIFT flow. Finally, we present the temporal aggregation strategy for these dynamic features to generate the video-wide representation.

### Data

To validate the proposed approach, there are two datasets of video clips about lymphocytes established by the collaboration hospital, Beijing You’an Hospital. The lymphocytes were from the blood samples that were collected from the tails of the mice (6–8 weeks, 20–22 g) after the skin transplantation, and the video clips (20–30 s, 25 frames per second, 288 × 352 video resolution, AVI format) were recorded with the phase contrast microscopy (Olympus BX51) at a magnification of 1000. After the video clips were obtained, they are further enlarged 16 times by up-sampling. Each time only one target lymphocyte was observed and manually positioned in the center of the field. Then a quality control step was conducted beforehand to filter out the video clips containing only one lymphocyte. And it also guarantees that there is no overlap and trajectory cross between the lymphocyte and red blood cells. There are two types of skin transplantation, i.e., the self-skin transplantation group (SST group) and the allergenic-skin transplantation group (AST group). In the SST group, a healthy Balb/C male mouse was used as both the host and the donor, while in the AST group a pair of healthy Balb/C male mouse and healthy C57BL/6 male mouse were used as the host and the donor, respectively. Several video clips of the datasets are available at http://isip.bit.edu.cn/kyxz/xzlw/77051.htm.

For Dataset I, there are 80 video clips in total (40 video clips for each class) obtained from a contrast experiment, in which both the SST group and AST group have 20 hosts and 20 donors. On the fourteenth day after the surgery, the lymphocytes in Dataset I were obtained from the blood samples collected from the tail vein. The lymphocytes in the second group showed irregular dynamic behavior, such as the cell elongation from different angles and the obvious movement of intracellular cargoes compared with the first group. Consequently, the videos from the first group and the second group were categorized as normal and abnormal, respectively.

Dataset II is composed of 120 video clips equally divided into four classes, which is derived from an AST group experiment with 25 pairs of hosts and donors. On the seventh day after the skin transplantation, the lymphocytes of Dataset II were obtained from the blood samples collected from the tail vein. The videos were divided into four classes (normal, slight activation, moderate activation, and drastic activation) according to the cellular deformation by three experts with a voting protocol. For these two datasets, we use 30 random splits of the data, while considering 20 random video clips per class for training and the rest for testing.

### Preprocessing

For the effectiveness of feature extraction, some pre-processing procedures need to be adopted, containing cell segmentation and cell tracking in each frame, as well as cell alignment among the sequence of frames. In Fig. 2, each row corresponds to a video clip in the datasets in “Data” section, and the target cells are lymphocytes in the red/blue dashed box. We employ an active contour model designed for live cells in phase-contrast images to automatically segment and track cells [31]. To further eliminate the impact of compulsory movement, video stabilization algorithm is introduced to perform a non-rigid alignment for the cell sequences in frames [32]. Besides, manual validation is exploited to eliminate the ambiguity of cell segmentation and tracking by human eye if necessary. In detail, we can specified the initial contours for the lymphocytes to make sure the accuracy of cell segmentation.

### Dynamic features between frames

This subsection mainly describes the features of cell dynamics between frames in the image sequence. Specifically, the dynamic features can be extracted from the deformation of cellular contours and intracellular movements. The former is captured by the shape context while the latter is modeled with SIFT flow. Then the corresponding contour feature and cytoplasm feature are combined to form a robust feature vector of cell dynamics.

#### Contour feature using shape-context

In the field of object recognition and shape matching, shape context was first proposed by [14], and then has been widely used in digit recognition, trademark search, and image registration. Shape context is introduced into the framework of deformation assessment for anatomical tissue to preserve and discriminate tiny deformation [15]. While shape context also has the potential to match the silhouettes of the falling human body and take the mean matching cost as a crucial index to quantify the deformation [16]. Therefore, this paper adopts shape context for the sake of generating the cellular contour deformation feature.^{Footnote 2}

#### Shape context

In shape context, a shape is sampled into a discrete set of points from its contour, which will finally accumulate a log-polar histogram \(h_{i}\):

$$\begin{aligned} h_{i}(k) = \#\{q\ne p_{i}:\,(q-p_{i})\in bin(k)\}, \end{aligned}$$

(1)

where \(p_{i}\) is a point in the given *n*-points shape, and its shape context \(h_{i}\) records the relative coordinates of the remaining \(n-1\) points as shown in Fig. 3. *bin*(*k*) stands for the *k*-th bin in histogram \(h_{i}\). Suppose \(p_{i}\) and \(q_{j}\) are from two shapes *P* and *Q*, respectively, therefore the matching cost \(C_{ij}\) for each pair of points \((p_{i},\, q_{j})\) is computed with the \(\chi ^{2}\) statistic:

$$\begin{aligned} C_{ij} = \frac{1}{2}\sum _{k=1}^{K}\frac{\left[ h_{i}(k)-h_{j}(k)\right] ^{2}}{h_{i}(k)+h_{j}(k)}, \end{aligned}$$

(2)

where \(h_{i}(k)\) and \(h_{j}(k)\) denote the *K*-bin histograms for \(p_{i}\) and \(q_{j}\), separately.

#### Contour feature based on shape distance

Hungarian algorithm [33] can find the best matching by minimizing the total cost \(H(\pi )=\sum C(p_{i},\, q_{\pi (i)})\) given a permutation \(\pi (i)\). With the permutation, a series of transformations \(T=\left\{ T_{k}\right\} _{k=1\ldots u}\) for each point can be computed using the thin plate spline model (TPS). Then several iterations of shape context matching and TPS re-estimation are implemented, and the shape context distances \(D_{sc}^{1}, \ldots , D_{sc}^{L}\) in *L* iterations are concatenated as the feature vector of cellular contour deformation \(F_{DCS}=\{D_{sc}^{1}, \ldots ,\, D_{sc}^{l}, \ldots , D_{sc}^{L}\}\).

$$\begin{aligned} D_{SC}^{l}=\frac{1}{n}\sum _{p\in P}arg\underset{{\scriptstyle q\in Q}}{min}C(p,T(q))+\frac{1}{m}\sum _{q\in Q}arg\underset{{\scriptstyle p\in P}}{min}C(p,T(q)), \end{aligned}$$

(3)

where \(T(\cdot )\) denotes the TPS shape transformation.

#### Cytoplasm feature based on SIFT flow

In SIFT flow, the SIFT descriptor, as a type of middle-level representation, is incorporated into the computational framework of optical flow. It establishes a robust semantic-level correspondence through matching these image structure [34].^{Footnote 3} Based on the semantic-level correspondence, the movement field and the appearance change field are constructed by computing the displacement of the corresponding points and the discrepancy of the corresponding SIFT descriptors, respectively. Then histograms of oriented SIFT flow is employed to characterize multi-oriented dynamic information from both the movement field and the appearance change field.

#### SIFT flow

Instead of matching raw pixels in optical flow, SIFT flow searches for the correspondences of SIFT descriptors on the grid coordinate \(p=(x,\, y)\) of images. The dense correspondence map, or the movement field, can be obtained by minimizing an objective function *E*(*w*):

$$\begin{aligned} E(w)&=\sum _{p}min\left( \left\| s_{p}^{1}-s_{p+w_{p}}^{2}\right\| _{1},\, t\right) +\sum \eta \left( \left| u_{p}\right| +\left| v_{p}\right| \right) \nonumber \\&\quad +\sum _{(p,q)\in \varepsilon }min\left( \alpha \left| w_{p}-w_{q}\right| ,\, d\right) , \end{aligned}$$

(4)

where \(s_{p}^{1}\) and \(s_{p}^{2}\) individually denote the SIFT descriptor at position *p* in two SIFT images, and \(w_{p}=(u_{p},\, v_{p})\) presents the flow vector at *p*. The parameters *t* and *d* are the thresholds of the data term and the smoothness term, respectively. The set \(\varepsilon\) contains all spatial four-neighborhoods.

After obtaining the correspondence map upon the sequential SIFT images, the appearance change field can be implemented by computing the difference of SIFT features between the corresponding points.

#### Histograms of oriented SIFT flow

Due to the susceptibility to scale changes and directionality of movement, the raw SIFT flow cannot obtain a good performance if applied as features directly. Inspired by the histograms of oriented optical flow, SIFT flow is binned according to its primary angle from the horizontal axis and weighted according to its magnitude or appearance difference, as shown in Fig. 4. \(F_{MDF}=\left\{ f_{MDF}^{1},\,\ldots ,\, f_{MDF}^{R}\right\}\) and \(F_{ACF}=\left\{ f_{ACF}^{1},\,\ldots ,\, f_{ACF}^{R}\right\}\) are obtained to characterize the movement and appearance variation of the cytoplasm, respectively.

$$\begin{aligned} f_{MDF}^{r}= & {} \sum _{u,v\in bin(r)}\sqrt{u_{p}^{2}+v_{q}^{2}}, \end{aligned}$$

(5)

$$\begin{aligned} f_{ACF}^{r}= & {} \sum _{u,v\in bin(r)}\left\| s_{p}^{1}-s_{p}^{2}\right\| , \end{aligned}$$

(6)

where \(f_{MDF}^{r}\) denotes the accumulation of displacement magnitude belonging to the *r*-th (\(1\le r\le R\)) bin in the movement field, and \(f_{ACF}^{r}\) means the sum of the appearance difference in the *r*-th (\(1\le r\le R\)) bin.

#### Combination of features

The robustness of feature representation can be enhanced by combining the complementary features. To sum up, the aforementioned \(F_{DCS}\), \(F_{MDF}\) and \(F_{ACF}\) between frames are concatenated to form a feature vector:

$$\begin{aligned} F_{i}=\left\{ F_{DCF},\, F_{MDF},\, F_{ACF}\right\} . \end{aligned}$$

(7)

The computing of \(F_{i}\) is the key step to extracting the features of cell dynamics in the whole framework. Then “Temporal aggregation of dynamic features” section is mainly about encoding the chronological structure of the cell dynamic features in a particular video.

### Temporal aggregation of dynamic features

For the video-wide cell dynamics, it is essential to aggregate a series of frame-level dynamic features along temporal extent in a rational way. That is to say, it needs to consider how dynamic features evolve over time in a video. In this section, we present three compact encoding methods, including FV, VLAD, and H-VLAD, to capture the temporal information of cell sequences. The pipeline for the temporal aggregation strategy in this paper is depicted in Fig. 5. It can be summarized as the following two phases: (1) In the training phase, the samples in the cell video dataset are transformed into the dynamic features by the aid of algorithms in “Dynamic features between frames” section. Then the compact dictionary with *K* visual words is learned based on these features by means of K-Means or Gaussian mixture model (GMM). (2) In the testing phase, the features of cell dynamics are obtained similarly and assigned to the *K* visual words. Then, the residuals between the visual words and the dynamic features belonging to them are encoded into the temporal-feature-aggregated vector.

#### Fisher vector encoding

In FV encoding [26, 27], a GMM with *K* components can be learned from the training dynamic features between frames, and denoted as \(\varTheta =\left\{ (\varvec{\mu }_{k},\varvec{\sigma }_{k},\pi _{k}),\, k=1,2,\ldots ,K\right\}\), where \(\varvec{\mu }_{k}\), \(\varvec{\sigma }_{k}\), \(\pi _{k}\) are the mean vector, variance matrix (assumed diagonal) and mixture weight of the *k*-th component, respectively. Given \({\varvec{X}}=({\varvec{x}}_{1},{\varvec{x}}_{2}\ldots \,,{\varvec{x}}_{N})\) of dynamic features extracted from a testing cell image sequence, we have mean and covariance deviation vectors for the *k*-th component as:

$$\begin{aligned} {\varvec{u}}_{k}&=\frac{1}{N\sqrt{\pi _{k}}}\sum _{i=1}^{N}q_{ki}\left( \frac{{\varvec{x}}_{i}-\varvec{\mu }_{k}}{\varvec{\sigma }_{k}}\right) ,\nonumber \\ {\varvec{v}}_{k}&=\frac{1}{N\sqrt{2\pi _{k}}}\sum _{i=1}^{N}q_{ki}\left[ \left( \frac{{\varvec{x}}_{i}-\varvec{\mu }_{k}}{\varvec{\sigma }_{k}}\right) ^{2}-1\right] , \end{aligned}$$

(8)

where \(q_{ik}\) is the soft assignment of feature \({\varvec{x}}_{i}\) to the *k*-th Gaussian component. By concatenation of \({\varvec{u}}_{k}\) and \({\varvec{v}}_{k}\) of all the *K* components, FV for the testing sample is formed with size \(2D^{'}K\), where \(D^{'}\) is the dimension of the dynamic feature after principal component analysis (PCA) pre-processing [27]. Power normalization using signed square root (SSR) with \(z=sign(z)\sqrt{\left| z\right| }\) and \(\ell _{2}\) normalization are then applied to the FVs [26, 27].

#### VLAD encoding

As a non-probabilistic version of FV encoding, VLAD encoding [28, 29] simply utilizes K-means instead of GMM to generate *K* coarse centers \(\left\{ {\varvec{c}}_{1},{\varvec{c}}_{2},\ldots \,,{\varvec{c}}_{K}\right\}\). Then we can obtain the difference vector \({\varvec{u}}_{k}\) with respective to the *k*-th center \({\varvec{c}}_{k}\) for the testing dynamic feature set by:

$$\begin{aligned} {\varvec{u}}_{k}=\sum _{i:NN({\varvec{x}}_{i})={\varvec{c}}_{k}}\left( {\varvec{x}}_{i}-{\varvec{c}}_{k}\right) , \end{aligned}$$

(9)

where \(NN({\varvec{x}}_{i})\) indicates \({\varvec{x}}_{i}\)’s nearest neighbors among *K* coarse centers.

The VLAD encoding vector concatenates \({\varvec{u}}_{k}\) over all the *K* centers with size \(D^{'}K\), and the post-processing employs the power and \(\ell _{2}\) normalization. Besides, the intra-normalization [35] is also applied to add normalization on each \({\varvec{u}}_{k}\). The proposed framework prefer to VLAD-*k* (*k* = 5), a variant of VLAD, which extends the nearest neighbor with the *k*-nearest neighbors, because of its good performance in contrast to the original VLAD [36].

#### High-order VLAD encoding

In order to keep both high performance and high extraction speed, the H-VLAD [30] augments the original VLAD with high-order statistics, e.g., diagonal covariance and skewness. The *K* clusters are first learned by K-means, regarded as the visual words \(\left\{ {\varvec{w}}_{1},{\varvec{w}}_{1},\ldots \,,{\varvec{w}}_{K}\right\}\), and the corresponding first-order, second-order, and third-order statistics are denoted as \(\left\{ \varvec{\mu }_{1},\varvec{\mu }_{2},\ldots \,,\varvec{\mu }_{K}\right\}\), \(\left\{ \varvec{\sigma }_{1},\varvec{\sigma }_{2},\ldots \,,\varvec{\sigma }_{K}\right\}\) and \(\left\{ \varvec{\gamma }_{1},\varvec{\gamma }_{2},\ldots \,,\varvec{\gamma }_{K}\right\}\), respectively. The technical details of H-VLAD can be summarized as:

$$\begin{aligned} {\varvec{u}}_{k}&=N_{k}\left( \frac{1}{N_{k}}\sum _{i=1}^{N_{k}}{\varvec{x}}_{i}-\varvec{\mu }_{k}\right) =N_{k}({\varvec{m}}_{k}-\varvec{\mu }_{k}),\nonumber \\ {\varvec{v}}_{k}&=\frac{1}{N_{k}}\sum _{i=1}^{N_{k}}\left( {\varvec{x}}_{i}-{\varvec{m}}_{k}\right) ^{2}-\varvec{\sigma }_{k}^{2},\nonumber \\ {\varvec{s}}_{k}&=\frac{\frac{1}{N_{k}}\sum _{i=1}^{N_{k}}\left( {\varvec{x}}_{i}-{\varvec{m}}_{k}\right) ^{3}}{\left( \frac{1}{N_{k}}\sum _{i=1}^{N_{k}}\left( {\varvec{x}}_{i}-{\varvec{m}}_{k}\right) ^{2}\right) ^{\frac{3}{2}}}-\varvec{\gamma }_{k}, \end{aligned}$$

(10)

where \({\varvec{X}}_{k}=\left\{ {\varvec{x}}_{1},{\varvec{x}}_{2},\ldots \,,{\varvec{x}}_{N_{k}}\right\}\) is the testing dynamic features belonging to the *k*-th visual word \({\varvec{w}}_{k}\), and \({\varvec{m}}_{k}\) stands for the mean of these dynamic features. Therefore, \({\varvec{u}}_{k}\), \({\varvec{v}}_{k}\) and \({\varvec{s}}_{k}\) are the residual vectors of the first-order, second-order and third-order, respectively. Similar to the original VLAD, the final representation of H-VLAD is concatenated as \(\left\{ {\varvec{u}}_{1},{\varvec{v}}_{1},{\varvec{s}}_{1},{\varvec{u}}_{2},{\varvec{v}}_{2},{\varvec{s}}_{2}\ldots \,,{\varvec{u}}_{K},{\varvec{v}}_{K},{\varvec{s}}_{K}\right\}\), and the post-processing operation also adopts the power-, \(\ell _{2}\)- and intra-normalization [35].

#### Analysis of temporal feature aggregation methods

Given the above three compact encoding approaches to temporal feature aggregation, we need to find out which one is the most appropriate for our application. For this purpose, we conduct an experiment on Dataset I (for details see “Data” section ) to analyze the discrimination of these encoding strategies: FV, VLAD, and H-VLAD. Specifically, we calculate the histogram distribution of classification scores from positive exemplars and negative exemplars, respectively. The positive and negative exemplars individually correspond to 20 training samples from SST group and the AST group. Note that the classifier is the linear SVM (the parameters are the same as “Experimental setup” section ), the dictionary size is 64 for FV, VLAD, and H-VLAD, and the encoding vector is not followed by the temporal pyramid pooling (TPP). From Fig. 6, we can find that VLAD encoding and H-VLAD encoding have the similar discrimination while the FV encoding has better performance. It shows that the FV encoding is most suitable for the temporal aggregation of cell dynamic features.

#### Temporal pyramid pooling

To preserve much more temporal discrimination, we add the TPP, regarded as a one-dimensional version of spatial pyramid pooling [37]. For a particular video, we suppose that its dynamic features between frames is denoted as \({\varvec{Z}}\) and the temporal aggregation operation is defined as \({\varPhi }(\cdot )\). TPP is to organize the dynamic features \({\varvec{Z}}\) into three level of subsets: \(Z_{1}^{1}\), \(Z_{2}^{1}\), \(Z_{2}^{2}\), \(Z_{3}^{1}\), \(Z_{3}^{2}\) and \(Z_{3}^{3}\), which have 1, 2 and 3 average-partitioned subwindows along temporal dimension, respectively. Therefore, the TPP of \({\varvec{Z}}\) can be written as follows:

$$\begin{aligned} \varPhi ({\varvec{Z}})=[\varPhi (Z_{1}^{1}),\,\varPhi (Z_{2}^{1}),\,\varPhi (Z_{2}^{2}),\,\varPhi (Z_{3}^{1}),\,\varPhi (Z_{3}^{2}),\,\varPhi (Z_{3}^{3})]. \end{aligned}$$

(11)