# Finite window correction

Measurements of “waiting times” or “survival times” taken within a finite time interval require special statistical treatment to account for the bias towards short measurements introduced by the measurement itself. For mathematical details, see the manuscript in preparation (Beltran et al., *in preparation*).

Otherwise, the process for applying these corrections is explained in the module’s docstring.

## multi_locus_analysis.finite_window

For correcting interior- and exterior-censored data

This module is designed to facilitate working with Markov processes which have been observed over finite windows of time.

Suppose you wish to observe a process that switches between states A and B, and observe the process over a window of time $$[0, T]$$.

There will be two types of “observed” times, which we call “interior times” (those for which the entire time spent in the state is observed) and “exterior times” (those truncated by the start or end point of our observation window).

For example, suppose you are observing two fluorescent loci that are either in contact (A) or not (B). If you measure for 5s, and the loci begin in contact, come apart at time t=2s, and rejoin at time t=4s,

    A A A A A A B B B B B B A A A A
    |-----|-----|-----|-----|-----|
    0s    1s    2s    3s    4s    5s


we would say that T = 5s, and you measured one “interior” time of length 2s, and two exterior times, one of length 2s and one of length 1s (at the start and end of the window, respectively).
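This bookkeeping can be sketched in a few lines (a hypothetical helper for illustration only; it is not part of this module, which provides its own machinery for this):

```python
def classify_waits(switch_times, T):
    """Split a trajectory's state-switch times into interior/exterior waits.

    switch_times: sorted times (strictly inside [0, T]) at which the
        observed process changes state.
    T: length of the observation window.
    Returns a list of (wait_length, kind) tuples.
    """
    boundaries = [0.0] + list(switch_times) + [T]
    waits = []
    for i in range(len(boundaries) - 1):
        length = boundaries[i + 1] - boundaries[i]
        # waits truncated by the window edges (t=0 or t=T) are "exterior"
        kind = 'exterior' if i in (0, len(boundaries) - 2) else 'interior'
        waits.append((length, kind))
    return waits

# the example from the text: T = 5s, switches at t = 2s and t = 4s
print(classify_waits([2.0, 4.0], 5.0))
# [(2.0, 'exterior'), (2.0, 'interior'), (1.0, 'exterior')]
```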

When histogramming these times, we must apply a statistical correction to the final histogram weights due to the effects of the finite observation window.

The distribution of the exterior times $$f_E(t)$$ is exactly equal to the survival function of the actual distribution $$S(t) = 1 - \int_0^t f(s) ds$$ normalized to equal one over the observation interval. No functions are currently included that leverage this, since extracting information from the survival function is likely only worth it if a large fraction of your observations are exterior times.

On the other hand, the interior times distribution is given by $$f_I(t) = f(t)(T - t)$$ for $$t\in[0,T]$$. In order to plot the actual shape of $$f(t)$$ with this biasing removed, we provide the cdf_exact_given_windows function below.
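The biasing of the interior times towards short values is easy to check with a quick Monte Carlo sketch (an illustration assuming exponential waits; the variable names here are ours, not the module’s):

```python
import numpy as np

rng = np.random.default_rng(0)
T, true_mean = 10.0, 3.0

# by time-homogeneity, wait start times are uniform in the window,
# while wait lengths are drawn from the true distribution f(t)
starts = rng.uniform(0, T, size=100_000)
lengths = rng.exponential(true_mean, size=100_000)

# "interior" waits are those that also finish before the window closes;
# their density is proportional to f(t) * (T - t), biasing them short
interior = lengths[starts + lengths <= T]
print(interior.mean() < true_mean)  # True: the naive mean underestimates
```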

A typical workflow, given an array of interior times, t, and an array of the window sizes each time was observed within, w, is to extract the CDF exactly, then optionally convert that to a PDF to display a regular histogram

>>> x, cdf = cdf_exact_given_windows(t, w, pad_left_at_x=0)
>>> xp, pdf = bars_given_cdf(x, cdf)
>>> confint = simultaneous_confint_from_cdf(0.05, len(t), x, cdf)
>>> xc, confs = bars_given_confint(x, confint)
>>> plt.plot(xp, pdf, 'k', xc, confs, 'r-.')


The remainder of the functions herein are related to extracting confidence intervals for these distributional estimates.

For details of the derivation, see the Biophysical Journal paper in preparation by Beltran and Spakowitz.

## Tutorial

### Generating AB Trajectories

We use the following functions to generate example data (for the case when the means of the distribution are not known or hard to calculate, see the function multi_locus_analysis.finite_window.ab_window()):

multi_locus_analysis.finite_window.ab_window_fast(rands, means, window_size, num_replicates=1, states=[0, 1], seed=None, random_state=None)[source]

Simulate a two-state system switching between states A and B.

In addition to the callables that generate random waiting times for each state, this “fast” version of the code requires the average waiting time of each state, passed as means.

Warning

Apparently, bruno_util.random.strong_default_seed() is broken (or this function is), because passing a seed does not make the output reproducible.

Parameters:

- **rands** ((2,) List[scipy.stats.rv_continuous]) – One of the random variables defined in scipy.stats. Alternatively, any callable that takes `random_state` and `size` kwargs. `random_state` should accept a np.random.RandomState seed. `size` will be a tuple specifying the output shape of the random number array requested.
- **means** ((2,) array_like) – average waiting times for each of the states.
- **window_size** (float) – the width of the window over which the observation takes place.
- **num_replicates** (int) – number of times to run the simulation, defaults to 1.
- **states** ((2,) array_like) – the “names” of each state, defaults to [0, 1].
- **seed** (np.random.RandomState) – state to start the simulation with.

Returns:

- **df** (pd.DataFrame) – The start/end times of each waiting time simulated. This data frame has columns=['replicate', 'state', 'start_time', 'end_time', 'window_start', 'window_end'].

Notes

Consider the waiting time intersecting the left boundary of the observation window. The left boundary will be a uniform fraction of the way through this wait time. This can easily be seen in the case of finite-variance wait times by using the CLT and starting the switching process arbitrarily far to the left of the observation window, or it can in general be imposed by requiring time-homogeneity of the experiment.

We use this fact here to speed up correct simulation of time-homogeneous windows by directly simulating only the waiting times within the windows, instead of also simulating a long run of “pre-equilibrating” waiting times some offset before the window, as in ab_window().
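As a rough sketch of this idea (a hypothetical illustration of the approach, not the module’s actual implementation: it assumes the window always opens in state A, and approximates the required length-biased draw for the straddling wait by weighted sampling from a finite pool):

```python
import numpy as np

def simulate_window_sketch(sample_A, sample_B, T, rng, n_pool=10_000):
    """Directly simulate the waits covering an observation window [0, T].

    sample_A / sample_B: callables taking `size` and returning wait times.
    Returns a list of (state, start, end) tuples tiling the window.
    """
    # The wait straddling the left window edge is length-biased, so draw
    # it by picking from a pool with probability proportional to length.
    pool = sample_A(n_pool)
    first = rng.choice(pool, p=pool / pool.sum())
    # The window's left edge falls a uniform fraction through that wait.
    t = -rng.random() * first
    end = t + first
    spans, state = [], 'A'
    while t < T:
        spans.append((state, t, end))
        t = end
        state = 'B' if state == 'A' else 'A'
        end = t + (sample_A if state == 'A' else sample_B)(1)[0]
    return spans

rng = np.random.default_rng(42)
spans = simulate_window_sketch(lambda size: rng.exponential(4, size),
                               lambda size: rng.exponential(6, size),
                               T=20, rng=rng)
# the spans tile [0, 20], straddling both window edges
print(spans[0][1] <= 0 < spans[0][2], spans[-1][2] >= 20)
```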

Here’s an example of a so-called time-homogeneous process that switches between states “exp(4,4)” and “exp(1,6)”. We observe 1000 copies of the process for 20 time units.

>>> import scipy.stats
>>> from multi_locus_analysis import finite_window as fw
>>> e44 = scipy.stats.expon(scale=4, loc=4)
>>> e16 = scipy.stats.expon(scale=6, loc=1)
>>> trajs = fw.ab_window_fast([e44.rvs, e16.rvs], [e44.mean(), e16.mean()],
...     window_size=20, num_replicates=10000, states=['exp(4,4)', 'exp(1,6)'])
>>> trajs.head()
replicate     state  start_time   end_time  window_start  window_end
0          0  exp(1,6)   -3.327818   0.788432             0          20
1          0  exp(4,4)    0.788432   6.865933             0          20
2          0  exp(1,6)    6.865933  11.175979             0          20
3          0  exp(4,4)   11.175979  17.382677             0          20
4          0  exp(1,6)   17.382677  18.787130             0          20


From each trajectory, we can extract the wait times:

>>> waits = trajs.groupby('replicate').apply(fw.traj_to_waits)
>>> waits.head()
replicate     state  start_time   end_time  window_start  window_end  wait_time  window_size      wait_type
rank_order
0                   0  exp(1,6)    0.000000   0.788432             0          20   0.788432           20  left exterior
1                   0  exp(4,4)    0.788432   6.865933             0          20   6.077500           20       interior
2                   0  exp(1,6)    6.865933  11.175979             0          20   4.310046           20       interior
3                   0  exp(4,4)   11.175979  17.382677             0          20   6.206698           20       interior
4                   0  exp(1,6)   17.382677  18.787130             0          20   1.404453           20       interior


### Rescaling Exact Waiting Times

Below, we plot three curves alongside each other. In black, we show the actual distribution of “waiting times” in state “exp(4,4)” from the simulation above. In green is the empirical distribution function of the “interior times”. If you didn’t have specific experience with lifetime data, it might seem like one could simply histogram the interior times in order to reproduce the actual distribution (since you observed the whole waiting time). However, as you can see, even if we only take “interior” times, the empirical CDF of the data does not match the distribution that we put in (expon(loc=4, scale=4) in this case, from above), even if we are extra generous and rescale the CDFs to match at t=window_size.

>>> # plot Kaplan-Meier estimate first; the library creates a new figure
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import lifelines
>>> waits44 = waits[waits['state'] == 'exp(4,4)']
>>> kmf = lifelines.KaplanMeierFitter().fit(
...     waits44['wait_time'].values,
...     event_observed=(waits44['wait_type'] == 'interior').values,
...     label=r'Kaplan-Meier Estimator, $\pm$95% conf int'
... )
>>> kmf.plot_cumulative_density()
>>>
>>> # plot actual distribution
>>> t = np.linspace(0, 20, 100)
>>> plt.plot(t, e44.cdf(t), 'k-.', label='Actual CDF, exp(4,4)')
>>>
>>> # now compute the empirical distribution of the "interior" times
>>> interior_44 = waits.loc[
...     (waits['state'] == 'exp(4,4)') & (waits['wait_type'] == 'interior'),
...     'wait_time'
... ].values
>>> x44, cdf44 = fw.ecdf(interior_44, pad_left_at_x=0)
>>> plt.plot(x44, cdf44*e44.cdf(x44[-1]), label='Empirical CDF, exp(4,4)')
>>> # prettify the plot
>>> plt.xlabel('t')
>>> plt.ylabel(r'$P(\mathrm{wait} <= t)$')
>>> plt.legend()

The classical solution to this problem is to use the “Kaplan-Meier” correction. This is the blue line in the figure above, along with 95% (pointwise) confidence intervals. However, if the interval of observation is large enough that multiple state changes can be observed in one trajectory, then the Kaplan-Meier correction will under-correct, because there is negative autocorrelation between “successful” observations of times within a given trajectory (i.e. if we observe a long time, then we will be near the end of our observation window, so the next observation must be small). The Kaplan-Meier correction was designed to work with right-censored data, but only with right-censored data where every single observed wait time is perfectly independent.

In order to correctly reproduce the actual underlying distribution, we can apply a more general correction using multi_locus_analysis.finite_window.ecdf_windowed().

>>> plt.figure(figsize=[4,3])
>>> x, cdf = fw.ecdf_windowed(interior_44, 20)
>>> plt.plot(x, cdf, label='Empirical CDF, state 1')
>>> plt.plot(t, e44.cdf(t), 'k-.', label='Actual CDF')
>>> plt.xlabel('t')
>>> plt.ylabel(r'$P(\mathrm{wait} <= t)$')

As you can see, this still does not perfectly reproduce the distribution, but the bias is effectively zero for quite a large range of times. We plot the probability density functions (using a Gaussian kernel density estimator) below, and compare them to the result of using the Kaplan-Meier corrector. Notice that the Kaplan-Meier corrector mis-estimates the “slope” of the PDF in semilog space (which would lead to a mis-estimate of the parameter of the distribution), while our estimator has the correct slope for quite some time, so a fit to the PDF (correctly cut off where the variance in the estimator starts to increase) would correctly estimate the parameters of the actual distribution.

>>> plt.figure(figsize=[4,3])
>>> kernel = fw.smooth_pdf(x, cdf)
>>> plt.plot(t, kernel(t), label='Gaussian KDE, exp(4,4)')
>>> plt.plot(t, e44.pdf(t), 'k-.', label='Actual PDF')
>>> plt.yscale('log')
>>> plt.ylim([e44.pdf(20), 0.4])
>>> plt.xlabel('t')
>>> plt.ylabel(r'$P(\mathrm{wait} <= t)$')

>>> plt.figure(figsize=[4,3])
>>> kcdf = kmf.cumulative_density_at_times(x)
>>> kernel = fw.smooth_pdf(x, kcdf)
>>> plt.plot(t, kernel(t), label='Kaplan-Meier + Gaussian KDE')
>>> plt.plot(t, e44.pdf(t), 'k-.', label='Actual PDF')
>>> plt.yscale('log')
>>> plt.ylim([e44.pdf(20), 0.4])
>>> plt.xlabel('t')
>>> plt.ylabel(r'$P(\mathrm{wait} <= t)$')

### A Caveat for Discrete Movies

While we would ideally have an exact measurement of when transitions between the A and B states happen, it is more often the case that we have a “movie” of sorts, in which we measure the state of the system at a fixed set of times.

This only provides us with upper and lower bounds for the actual waiting time that we’re trying to observe. For example, consider the trajectory depicted below.

    A A A A A B B B B B B B A A A A B
    |-------|-------|-------|-------|
    0s      1s      2s      3s      4s


This trajectory, when measured at the discrete times shown, would look like

>>> pd.Series({0: 'A', 1: 'A', 2: 'B', 3: 'A', 4: 'B'}).head()
0    A
1    A
2    B
3    A
4    B
dtype: object


Naively, if you only had this movie in front of you, with no knowledge of the actual underlying state change times, it might seem to suggest that there was one exterior-censored “A” time of length 2, two interior-censored times of length 1 each (one “B” and one “A”), and one exterior-censored “B” time of length 1. However, by looking at the true trajectory above, we see that the first “A” wait was much shorter than 2s, and the first “B” wait was much longer than 1s, whereas the last “A” wait just happened to match up with our prediction of 1s.

Because our normalization factor depends non-linearly on the observed waiting time, one might guess that simply using the “naive” times might cause bias. We will show that this is the case by generating some artificial movies ourselves.
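Here is a small sketch of that effect (illustrative only, with hypothetical variable names; frames are taken at integer times and each wait starts a uniform fraction of the way through a frame):

```python
import numpy as np

rng = np.random.default_rng(1)
waits = rng.exponential(3.0, size=100_000)
starts = rng.uniform(0, 1, size=100_000)

# with frames at integer times, the movie reports each wait as the gap
# between the first frame inside the state and the first frame after it
naive = np.ceil(starts + waits) - np.ceil(starts)

# the mean happens to survive discretization, but the distribution is
# distorted: naive times are integers, and short waits that never span
# a frame are reported as zero and lost entirely
print(abs(naive.mean() - waits.mean()) < 0.05)  # means roughly agree
print((naive == 0).mean() > 0.1)                # many waits vanish
```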

### Generating Discrete Trajectories (Movies)

multi_locus_analysis.finite_window includes a convenience method for generating “movies” from the output of the ab_window* functions.

multi_locus_analysis.finite_window.state_changes_to_trajectory(traj, times, state_col='state', start_times_col='start_time', end_times_col=None)[source]

Converts a series of state change times into a Series containing observations at the times requested. The times become the index.

Parameters:

- **times** ((N,) array_like) – times at which to “measure” what state we’re in to make the new trajectories.
- **traj** (pd.DataFrame) – should have state_col and start_times_col columns. The values of state_col will be copied over verbatim.
- **state_col** (string) – name of the column containing the state being transitioned out of for each measurement in traj.
- **start_times_col** (string) – name of the column containing the times at which traj changed state.
- **end_times_col** (string, optional) – by default, the function assumes that times after the last start time are in the same state. If passed, this column is used to determine at what time the last state “finished”. Times after this will be labeled as NaN.

Returns:

- **movie** (pd.Series) – Series defining the “movie” with frames taken at the requested times, which simply measures what state traj is in at each frame. The index is times; state_col is used to name the Series.

Notes

A start time means that if we observe at that time, the state transition will have already happened (right-continuity). This is confusing in words, but simple to see in the examples below.

Examples

For the DataFrame

>>> df = pd.DataFrame([['A',  -1, 0.1], ['B', 0.1, 1.0]],
...     columns=['state', 'start_time', 'end_time'])


the discretization into tenths of seconds would give

>>> state_changes_to_trajectory(df, times=np.linspace(0, 1, 11),
...     end_times_col='end_time')
t
0.0      A
0.1      B
0.2      B
0.3      B
0.4      B
0.5      B
0.6      B
0.7      B
0.8      B
0.9      B
1.0    NaN
Name: state, dtype: object


Notice in particular how at 0.1, the state is already ‘B’. Similarly at time 1.0 the state is already “unknown”. This is what is meant by the Notes section above.

If the end_times_col argument is omitted, then the last observed state is assumed to continue for all times requested from then on:

>>> state_changes_to_trajectory(df, times=np.linspace(0, 1, 11))
t
0.0    A
0.1    B
0.2    B
0.3    B
0.4    B
0.5    B
0.6    B
0.7    B
0.8    B
0.9    B
1.0    B
Name: state, dtype: object


This function has an alias for convenience (multi_locus_analysis.finite_window.traj_to_movie()).

>>> movies = trajs.groupby('replicate').apply(fw.traj_to_movie,
...     times=np.linspace(0, 5, 11))
>>> movies.head()
t          0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0
replicate
0            0    0    1    1    1    1    0    0    0    1    1
1            0    1    1    1    1    1    1    1    1    1    1
2            1    1    1    1    0    1    1    1    0    1    1
3            0    0    1    1    1    1    1    1    1    1    0
4            0    0    0    1    1    1    1    1    1    1    0


We can get a long-form Series by simply doing

>>> movies = movies.T.unstack()
>>> movies.name = 'state'  # name the resulting column
>>> movies.head()
replicate  t
0          0.0    0
           0.5    0
           1.0    1
           1.5    1
           2.0    1
Name: state, dtype: int64


As is clear from the following plot, effectively discretizing the data in this way creates a bias in the tail of the distribution, even when the times are corrected with our method.

## Theoretical Details

The following section contains a complete derivation of the framework used to generate the corrections used in this module.

### Motivating (A)synchronicity

We first motivate our definition of “(a)synchronicity”, the critical property that allows us to correct for the effects of observing in a finite window.

Suppose a process starts at $$-t_\text{inf}$$ (WLOG, assume it starts in state $$A$$). For times after $$-t_\text{inf} \lll 0$$, the process switches between states $$A$$ and $$B$$. The times spent in each state before switching are IID, distributed like $$f_A(t)$$ and $$f_B(t)$$, respectively. We are then able to observe the process during the interval of time $$[0, T]$$. This can be thought of as a renewal(-reward) process that started far in the past. As long as the starting point, $$-t_\text{inf}$$, is sufficiently far in the past, and the distributions $$f_*(t)$$ have finite variance, various convenient properties hold true for the observed state switching times between $$0$$ and $$T$$. We use the same example as in the “tutorial” section in what follows.

The first convenient property is that the switching times are uniformly distributed within the observation interval (as $$-t_\text{inf}\to-\infty$$). Intuitively, this just means that $$-t_\text{inf}$$ is far enough in the past that, independently of the distribution, we are not biased towards the switching times being early or late in our observation interval (i.e. we have lost all memory of the “real” start time).

Now let’s label the state switch times as $$t_0,\ldots{},t_n$$, with $$t_0$$ and $$t_n$$ corresponding to the “actual” (unobserved) state switch times flanking the observation interval, so that $$t_1,\ldots{},t_{n-1}$$ are the observed switches. The next useful property is that the start of the observation interval ($$t=0$$) is uniformly distributed within $$[t_0, t_1]$$ (similarly, the end of the observation interval, $$t=T$$, is uniformly distributed in $$[t_{n-1}, t_n]$$).

For the interior times, we can simply use the first fact to derive our interior time correction. Since we know the starting times of each state are uniformly distributed, we immediately can tell that if a waiting time of length $$\tau$$ has a start time within the interval, then the fraction of times that this waiting time will end up being an interior time is just $$(T - \tau)/T$$. More precisely, we have that

$P(t_{i+1} \leq T | t_{i+1}-t_i=\tau, t_i \in [0,T]) = \int_0^T 1_{t_i + \tau \leq T} f_{\text{Unif}[0,T]}(t_i) dt_i$

which is just equal to $$(T - \tau)/T$$.
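This probability is easy to verify numerically (a quick Monte Carlo sketch with hypothetical values T = 10 and τ = 3):

```python
import numpy as np

rng = np.random.default_rng(2)
T, tau = 10.0, 3.0

# a wait of fixed length tau whose start time is uniform in the window
starts = rng.uniform(0, T, size=200_000)
p_interior = (starts + tau <= T).mean()

print(abs(p_interior - (T - tau) / T) < 0.01)  # True: matches (T - tau)/T
```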

This correction factor can be visualized as simply counting what fraction of “start” times of a given length lead to “end” times still inside the interval (the green part of the interval in the accompanying diagram).

On the other hand, we have to be careful about the distribution of exterior times, even if we do somehow magically have the values of $$t_0$$ and the state at $$t=0$$. You can’t simply assume that $$t_1 - t_0$$ is distributed like $$f_A(t)$$ or $$f_B(t)$$; in fact, it is distributed like $$t f_*(t)$$ (appropriately normalized). This is because (loosely speaking) if you fill the real line with a bunch of intervals whose lengths are distributed like $$f(t)$$ and then choose a point on the real line at random, you are more likely to land in an interval of length $$t$$ the larger $$t$$ is.
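This length-biasing (the “inspection paradox”) is also easy to see numerically; the sketch below tiles the line with exponential intervals and drops points onto it at random (illustrative code, our own variable names):

```python
import numpy as np

rng = np.random.default_rng(3)
mean = 2.0

# tile the half-line with intervals whose lengths are drawn from f(t)
lengths = rng.exponential(mean, size=200_000)
edges = np.cumsum(lengths)

# drop points uniformly at random and find the interval containing each
points = rng.uniform(0, edges[-1], size=50_000)
hits = lengths[np.searchsorted(edges, points)]

# the containing interval is distributed like t f(t) / E[t]; for the
# exponential its mean is E[t^2] / E[t] = 2 * mean, twice the naive mean
print(abs(hits.mean() - 2 * mean) < 0.1)  # True
```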

### (A)synchronicity

While the explicit framework presented above is a useful tool, it is ill-defined for heavy-tailed processes, which are precisely the cases of primary concern when making these types of corrections. In order to retain the useful properties of the system that made it possible to derive the interior and exterior time distributions, we simply notice that the real property that we want to hold when measuring these systems is asynchronicity, or what a physicist might call “symmetry under time translations” or “time homogeneity”. In short, we want to impose the constraint that we are only interested in scientific measurements where changing the interval of observation from $$[0, T]$$ to $$[0+\tau,T+\tau]$$ for any $$\tau$$ will not change any properties of the measurement.

Note

We leave as an exercise to the reader to show that:

1. the renewal process of the previous section is a special case of an asynchronous process
2. this definition of asynchronicity produces all three properties we demonstrated for our renewal process above

On the other extreme from asynchronicity is the situation for which the Kaplan-Meier correction was originally designed. Namely, we could imagine that a perfectly synchronous process is one where $$t_0$$ is fixed to be at time $$t=0$$, meaning that $$t_1 - t_0$$ is distributed as just $$f_*(t)$$.

While in principle anything between asynchrony and synchrony is possible, in practice almost all scientific measurements are already done on either purely synchronous or purely asynchronous systems, since it is intuitively clear that a lack of understanding of the synchronicity of one’s system can lead to uninterpretable results.