Trajectory Statistics

multi_locus_analysis.stats

Statistical methods for trajectory analysis.

Moment-based calculations can be found in the multi_locus_analysis.stats.moments submodule, and helper functions for dealing with distributional data can be found in multi_locus_analysis.stats.distributions.

Because the bulk of the distributional helper tools are demonstrated clearly in the Finite Window tutorial, we focus on an example of how to use the moment-based statistical methods here.

Tutorial

Tip

Like the rest of multi_locus_analysis, these functions are optimized for use with long-form pandas DataFrames. The following tutorial assumes that your data set is small enough for approximately \(n^3\) data points to fit in memory. To use only linear space, simply compose the desired functions instead of applying them one at a time (which keeps unnecessary intermediates in memory).
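For intuition, here is a minimal, self-contained sketch of what "composing" means in practice, using plain pandas and numpy on toy data (this is not the library's own code): each trajectory is reduced to its time-averaged MSD inside a single groupby pass, so the per-lag displacement arrays are discarded as soon as each group is reduced.

```python
import numpy as np
import pandas as pd

def msd_per_traj(traj, max_delta=3):
    # reduce each trajectory to its time-averaged MSD immediately,
    # instead of materializing all O(n^2) displacements for every group
    x = traj[['X', 'Y']].to_numpy()
    out = {}
    for delta in range(1, max_delta + 1):
        disp = x[delta:] - x[:-delta]  # displacements at this lag only
        out[delta] = float(np.mean(np.sum(disp**2, axis=1)))
    return pd.Series(out)

# toy long-form data: two short random-walk trajectories
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'cell': np.repeat([1, 2], 10),
    'X': rng.normal(size=20).cumsum(),
    'Y': rng.normal(size=20).cumsum(),
})

# one reduction per trajectory: intermediates are freed after each group
msds = toy.groupby('cell').apply(msd_per_traj)
```

The same pattern applies to the moment functions below: reduce inside the apply, rather than storing every intermediate velocity for the whole dataset.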

Example data

You can load in the data we will use in this tutorial as follows:

>>> import multi_locus_analysis as mla
>>> from multi_locus_analysis.examples import burgess
>>> df = mla.examples.burgess.df
>>> df.head()
                                                  X    Y    Z foci    t
locus genotype exp.rep meiosis cell frame spot
HET5  WT       2       t0      1    1     1     1.6  2.4  3.1  unp    0
                                    2     1     1.9  1.5  3.1  unp   30
                                    3     1     2.0  1.8  3.0  unp   60
                                    4     1     2.1  1.9  3.0  unp   90
                                    5     1     2.2  1.8  3.0  unp  120

As is typical for this kind of data, some columns hold the actual measurements, while others are simply record-keeping columns. In the Burgess data, a uniquely-identifying subset of the record-keeping columns has been pre-emptively made into an index.

We will be using named collections of columns throughout this tutorial to group together parts of the DataFrame as needed (for example, to get each continuous trajectory we use traj_cols). Please see multi_locus_analysis.examples.burgess for detailed explanations of the information in each column. For now, we need only point out that each unique combination of ['locus', 'genotype', 'exp.rep', 'meiosis', 'cell'] defines a movie, which contains two spots diffusing relative to each other: these are homologous loci at different times in meiosis. To save some typing, we extract the groupings we will use later:

>>> cell_cols = burgess.cell_cols
>>> traj_cols = burgess.traj_cols
>>> frame_cols = burgess.frame_cols

Some tasks are easier when the index identifies a single spot at a single time; others when each row is a single time point, with separate columns for each spot. We can use the multi_locus_analysis.pivot_loci() helper to move from one representation to the other:

>>> mla.pivot_loci(df, pivot_cols=['X', 'Y', 'Z']).head()
                                          foci    t   X1   Y1   Z1   X2   Y2   Z2
locus genotype exp.rep meiosis cell frame
HET5  WT       2       t0      1    1      unp    0  1.6  2.4  3.1  2.1  1.5  3.1
                                    2      unp   30  1.9  1.5  3.1  1.9  2.5  3.1
                                    3      unp   60  2.0  1.8  3.0  1.5  2.5  3.4
                                    4      unp   90  2.1  1.9  3.0  1.4  2.2  3.4
                                    5      unp  120  2.2  1.8  3.0  1.5  2.4  3.4
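The effect of pivot_loci() can be approximated with plain pandas, shown here as a rough sketch on toy data (the real helper's column naming and ordering may differ):

```python
import pandas as pd

# toy long-form data: one row per (frame, spot)
toy = pd.DataFrame({
    'frame': [1, 1, 2, 2],
    'spot':  [1, 2, 1, 2],
    'X': [1.6, 2.1, 1.9, 1.9],
    'Y': [2.4, 1.5, 1.5, 2.5],
})

# pivot so each frame becomes one row with per-spot coordinate columns
wide = toy.set_index(['frame', 'spot']).unstack('spot')
wide.columns = [f'{coord}{spot}' for coord, spot in wide.columns]
```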

Velocities to MS(C)Ds

With almost any set of trajectories, we first extract all the velocities that we are interested in. If we have drift-free, individual trajectories, this can be done simply as

>>> # subsampling makes computation easy for demo purposes
>>> # group by remaining columns that uniquely id a trajectory
>>> vels = df.loc['URA3', 'WT', 9].groupby(traj_cols[3:]).apply(
>>>     mla.stats.pos_to_all_vel, xcol='X', ycol='Y', zcol='Z', framecol='t')
>>> vels.head()
                             tf   vx   vy   vz
meiosis cell spot ti delta
t0      2    1    0  0        0  0.0  0.0  0.0
                     30      30  0.0 -0.1  0.3
                     60      60  0.0  0.0  0.3
                     90      90  0.1 -0.1  0.2
                     120    120  0.2 -0.2  0.3
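To make concrete what pos_to_all_vel() computes, here is a rough 1-D sketch (not the library's implementation) that produces one displacement for every ordered pair of frames, indexed by start time ti and lag delta, analogous to the output above:

```python
import pandas as pd

def all_vel_1d(traj, xcol='X', framecol='t'):
    # one displacement for every ordered pair of frames ti <= tf,
    # indexed by start time ti and lag delta = tf - ti
    t = traj[framecol].to_numpy()
    x = traj[xcol].to_numpy()
    rows = []
    for i in range(len(t)):
        for j in range(i, len(t)):
            rows.append({'ti': t[i], 'delta': t[j] - t[i],
                         'tf': t[j], 'vx': x[j] - x[i]})
    return pd.DataFrame(rows).set_index(['ti', 'delta'])

traj = pd.DataFrame({'t': [0, 30, 60], 'X': [1.6, 1.9, 2.0]})
vels = all_vel_1d(traj)
```

Note the O(n²) pairs per trajectory; this is exactly why the Tip above warns about memory.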

If there are sufficient trajectories at each time point, then it is sometimes possible to subtract out the drift first:

>>> for i in ['X', 'Y', 'Z']:
>>>     # transform('mean') broadcasts the per-frame mean back onto the
>>>     # original index, avoiding a shape mismatch on subtraction
>>>     df['dc'+i] = df[i] - df.groupby(frame_cols)[i].transform('mean')
>>> # groupby remaining cols that uniquely id a trajectory
>>> dc_vels = df.loc['HET5', 'WT', 2].groupby(traj_cols[3:]).apply(
>>>     mla.stats.pos_to_all_vel, xcol='dcX', ycol='dcY', zcol='dcZ',
>>>     framecol='t')
>>> dc_vels.head()

However, in our example data, we have two trajectories, so it makes more sense to simply work directly with the distance between them as our drift-free measurement.

>>> df = mla.pivot_loci(df, pivot_cols=['X', 'Y', 'Z'])
>>> for i in ['X', 'Y', 'Z']:
>>>     df[i+'12'] = df[i+'2'] - df[i+'1']
>>> df.head()

We can then get all the \(V^\delta_{12}(t_k)\) by simply using these new \(X_{12}(t_k)\) columns:

>>> d_vels = df.groupby(cell_cols).apply(mla.stats.pos_to_all_vel, xcol='X12', ycol='Y12', zcol='Z12', framecol='t')

By default, pos_to_all_vel() computes all velocities. If only the non-overlapping velocities are required instead, such as when computing the standard error of a time-averaged MS(C)D measurement, pass force_independence=True.

>>> import numpy as np
>>> # first subsample so that this runs quickly on a single machine
>>> df = df.loc['URA3', 'WT', 9].copy() # one experiment
>>> remaining_index = cell_cols[3:]
>>> d_vels = df.groupby(remaining_index).apply(mla.stats.pos_to_all_vel, \
>>>     xcol='X12', ycol='Y12', zcol='Z12', framecol='t', \
>>>     force_independence=True)
>>> # squared magnitude of each displacement; its average is the MS(C)D
>>> d_vels['v2'] = d_vels['vx']**2 + d_vels['vy']**2 + d_vels['vz']**2
>>> mscds_per_stage = d_vels.groupby(['meiosis', 'delta'])['v2'].agg(['mean', 'std', 'count'])
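The distinction force_independence makes can be illustrated with a small numpy sketch on toy data (again, not the library's code): overlapping displacements at lag \(\delta\) share data points, while striding by \(\delta\) keeps only the non-overlapping subset, for which a naive SEM is justified.

```python
import numpy as np

# a single toy 1-D random-walk trajectory
x = np.cumsum(np.random.default_rng(1).normal(size=12))
delta = 3

# all displacements at this lag: successive windows share data points,
# so the averaged terms are correlated
all_disp = x[delta:] - x[:-delta]

# non-overlapping displacements: stride by delta so each point enters
# at most one window, which justifies the naive SEM used for error bars
indep_disp = x[delta::delta] - x[:-delta:delta]
```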

The MSD of the distance between the two homologs is often called the MSCD in the meiosis literature (standing for Mean Squared Change in Displacement). These MSCDs can then be plotted simply via
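Concretely, writing \(\vec{r}_{12}(t_k)\) for the vector between the two loci, the quantity being averaged is

```latex
\mathrm{MSCD}(\delta)
  = \left\langle \left| \vec{r}_{12}(t_k + \delta) - \vec{r}_{12}(t_k) \right|^2 \right\rangle_{t_k}
  = \left\langle \left| V^\delta_{12}(t_k) \right|^2 \right\rangle_{t_k},
```

where the average runs over start times \(t_k\) (only non-overlapping ones when force_independence=True is used, as above).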

>>> import matplotlib.pyplot as plt
>>> plt.figure(figsize=[4, 3])
>>> for label, data in mscds_per_stage.groupby('meiosis'):
>>>     data = data.reset_index()
>>>     data = data[data['delta'] > 0] # for log-log scale prettiness
>>>     # SEM bars
>>>     plt.errorbar(data['delta']/60, data['mean'], \
>>>                  data['std']/np.sqrt(data['count']), \
>>>                  label=label)
>>> plt.yscale('log')
>>> plt.xscale('log')
>>> plt.ylim([0.01, 1])
>>> plt.xlabel(r'time lag, $\delta$ (min)')
>>> plt.ylabel(r'MSCD, $<V_{12}^\delta(t)>$ ($\mu{}m^2$)')
>>> plt.legend()

Velocity correlations

In order to calculate the velocity autocorrelation, we include the function all_vel_to_corr(), whose output can be groupby'd and averaged to obtain the velocity autocorrelation.

However, doing this naively on any reasonably-sized dataset will produce a DataFrame far too large to fit in RAM. This is because the function simply calculates \(V^\delta(t + t_k) \cdot V^\delta(t_k)\) for every single \(\delta\) and \(t_k\). While this works well for small datasets (like single movies), even clever use of Pandas will often make applying it to a large dataset infeasible. Therefore, we provide the convenience functions vels_to_cvvs_by_hand() and vvc_stats_by_hand(), which first write all these products to disk and then take the appropriate averages, respectively.
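As a toy, single-trajectory illustration of the products being averaged, here is a short numpy sketch of the velocity autocorrelation for one 1-D random walk (not the library's code); the normalization by the zero-lag value mirrors the cvv_normed column shown below:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200).cumsum()   # a single 1-D random-walk trajectory

delta = 5
v = x[delta:] - x[:-delta]          # V^delta(t_k) for every start time t_k

def cvv(v, t):
    # average of V(t_k + t) * V(t_k) over all valid start times t_k
    return np.mean(v * v) if t == 0 else np.mean(v[t:] * v[:-t])

cvv0 = cvv(v, 0)
# normalized correlation curve, analogous to cvv_normed
cvv_normed = [cvv(v, t) / cvv0 for t in range(0, 3 * delta + 1, delta)]
```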

If we wanted to use them on the above velocities, we would simply run:

Warning

The following creates a temporary file of approximately 2.1GB.

>>> mla.stats.vels_to_cvvs_by_hand(d_vels, ['meiosis'], 'vvs.csv',
>>>                                max_t_over_delta=4)
>>> cvv_stats = mla.stats.vvc_stats_by_hand('vvs.csv', ['meiosis'])
>>> cvv_stats = mla.stats.cvv_by_hand_make_usable(cvv_stats, ['meiosis'])
>>> cvv_stats.set_index(['meiosis', 'delta', 't'], inplace=True)
>>> cvv_stats.head()
                         sqs      sum     ...      cvv_normed  ste_normed
meiosis delta t                           ...
t0      30    0    8787.4250  1427.42     ...        1.000000    0.065668
              30   6635.5702  -570.96     ...       -0.421770    0.060173
              60   6503.1084   -88.80     ...       -0.067135    0.060968
              90   6000.1380   -76.52     ...       -0.058923    0.059647
              120  5884.4852   191.24     ...        0.150432    0.060341

We can then plot the velocity correlation curves (for example at meiotic stage t0) by simply running

>>> from multi_locus_analysis import plotting
>>> plotting.cvv_plot_sized(cvv_stats.loc['t0'].reset_index(), data_deltas=30*np.arange(1, 5))

where we ignore larger values of \(\delta\) because the amount of data available for averaging decreases to a single point as \(\delta\) approaches the length of the observation.

Notice how the correlation decays as \(\delta\) increases. For details on why this apparent decay in fact corresponds to the correlation increasing with \(\delta\), please see the paper in preparation (Newman, Beltran, and Calhoon et al.).

Notes

For a listing of all functions exposed by the stats module, see the API reference section.