multi_locus_analysis.stats.distributions

Tools for analysis of distributional statistics of trajectories.

Any property of a trajectory that is not a moment typically has to be analyzed distributionally (e.g. displacement distributions, waiting time distributions, etc.). This module contains code that facilitates working with probability and cumulative distribution functions and bootstrapping, as well as miscellaneous distributional tools such as normality tests.

multi_locus_analysis.stats.distributions.bars_given_cdf(x, cdf): Takes x, cdf from the cdf_exact* functions and makes a plottable histogram by tracing out the PDF. Works well for CDFs that come from observations on a fixed grid, but poorly for continuous observations (i.e. the output of discrete_trajectory_to_wait_times will work well, but the output of state_changes_to_wait_times will not).
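The idea of tracing out the PDF from a CDF can be sketched with NumPy alone. This is not the library's implementation: the helper name `bars_from_cdf_sketch` is hypothetical, and the choice to use bar heights equal to probability mass divided by bin width is an assumption made for illustration.

```python
import numpy as np


def bars_from_cdf_sketch(x, cdf):
    """Hypothetical helper: turn (x, cdf) into a step outline of the PDF."""
    x = np.asarray(x, dtype=float)
    cdf = np.asarray(cdf, dtype=float)
    # Assumption: bar height = probability mass in the bin / bin width.
    heights = np.diff(cdf) / np.diff(x)
    # Repeat every bin edge twice and every height twice, so that plotting
    # (bar_x, bar_y) as a line traces the outline of the histogram.
    bar_x = np.repeat(x, 2)
    bar_y = np.concatenate(([0.0], np.repeat(heights, 2), [0.0]))
    return bar_x, bar_y


x = np.array([0.0, 1.0, 2.0, 4.0])
cdf = np.array([0.0, 0.25, 0.75, 1.0])
bar_x, bar_y = bars_from_cdf_sketch(x, cdf)
```

Passing `bar_x, bar_y` to a line-plotting function such as matplotlib's `plot` would then draw the histogram outline.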

multi_locus_analysis.stats.distributions.bars_given_confint(x, confint): Takes x from the cdf_exact* functions and confint from binom.multinomial_proportions_confint, and rescales confint so that it fits correctly and aesthetically around the output of bars_given_cdf.

multi_locus_analysis.stats.distributions.bars_given_discrete_cdf(x, cdf): Like bars_given_cdf, for when you’ve used ecdf’s times_allowed arg.

multi_locus_analysis.stats.distributions.bootstrapped_pmf_confint(n_samples, alpha, x, cdf, num_bootstraps=1000, bonferroni=True)

Given an empirical CDF (x, cdf), this function generates bootstrapped error bars representing, pointwise, the interval within which a second observation of n_samples samples would lie with probability 1-alpha, if its true CDF were the continuous, linear interpolation of the empirical CDF. n_samples need not equal the number of samples used to generate (x, cdf).

Parameters

n_samples (int) – How many samples the secondary measurement has. This is the number of data points drawn in each bootstrap iteration.
alpha (float in [0,1]) – One minus the desired confidence level.
x ((N,) array_like) – Values at which the empirical CDF was measured
cdf ((N,) array_like) – Values of the empirical CDF
num_bootstraps ((optional) int) – Number of bootstrapping iterations to perform. WARNING: for now, the memory required scales with this number.
bonferroni ((optional) bool) – Whether to scale alpha based on the number of bins so that the plot can be used to visually assert pointwise statistical significance at the requested alpha.

Returns

confint – Upper and lower bounds of the confidence interval calculated.

Return type

(2,N-1) array_like
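The procedure described above can be sketched with NumPy alone. This is not the library's implementation: the function name, the `seed` parameter, the use of inverse-transform sampling through `np.interp`, and the row order (lower bound first) are all assumptions made for illustration.

```python
import numpy as np


def bootstrap_pmf_confint_sketch(n_samples, alpha, x, cdf,
                                 num_bootstraps=1000, bonferroni=True, seed=0):
    """Hypothetical sketch; `seed` is added here only for reproducibility."""
    x = np.asarray(x, dtype=float)
    cdf = np.asarray(cdf, dtype=float)
    rng = np.random.default_rng(seed)
    n_bins = len(x) - 1
    if bonferroni:
        # Bonferroni correction: split alpha evenly across the bins.
        alpha = alpha / n_bins
    pmfs = np.empty((num_bootstraps, n_bins))
    for i in range(num_bootstraps):
        # Inverse-transform sampling from the linearly interpolated CDF
        # (assumes cdf is strictly increasing).
        u = rng.uniform(cdf[0], cdf[-1], size=n_samples)
        samples = np.interp(u, cdf, x)
        counts, _ = np.histogram(samples, bins=x)
        pmfs[i] = counts / n_samples
    # Pointwise quantiles over the bootstrap replicates.
    lower = np.quantile(pmfs, alpha / 2, axis=0)
    upper = np.quantile(pmfs, 1 - alpha / 2, axis=0)
    return np.stack([lower, upper])  # assumed order: (lower, upper)


x = np.linspace(0, 5, 6)
cdf = np.array([0.0, 0.1, 0.3, 0.6, 0.9, 1.0])
confint = bootstrap_pmf_confint_sketch(200, 0.05, x, cdf, num_bootstraps=500)
```

The loop draws one replicate at a time, so in this sketch memory grows only with the stored `pmfs` array rather than with all resampled values at once.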

multi_locus_analysis.stats.distributions.bootstrapped_pmf_confint_bars(n_samples, x, cdf, num_bootstraps=1000): Same as bootstrapped_pmf_confint, but returns the actual bootstrap samples as pmf bars, ready to plot.

multi_locus_analysis.stats.distributions.bootstrapped_pmf_from_waits(times, window_sizes, times_allowed, n_samples=None, alpha=0.05, num_bootstraps=1000, bonferroni=True, progress_bar=False, **kwargs)

Takes n_samples and num_bootstraps (the number of iterations), and calculates the pmf of the data num_bootstraps times, each time using a sample of size n_samples drawn with replacement from the wait times and windows that were passed.

The internal call to cdf_exact_given_windows_quinn needs you to use the times_allowed argument, but all kwargs are forwarded to that function just in case.
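The resampling step on its own can be sketched as below. This is a deliberately simplified illustration, not the library's code: it ignores `window_sizes` entirely and does not reproduce whatever cdf_exact_given_windows_quinn computes. The function name, the `seed` parameter, and the choice to treat `n_samples=None` as "use `len(times)`" are assumptions.

```python
import numpy as np


def bootstrap_pmfs_sketch(times, times_allowed, n_samples=None,
                          num_bootstraps=1000, seed=0):
    """Hypothetical sketch: plain bootstrap of a pmf on a fixed grid.

    Assumes every entry of `times` appears in `times_allowed`, and
    ignores observation windows (unlike the documented function).
    """
    rng = np.random.default_rng(seed)
    times = np.asarray(times)
    times_allowed = np.sort(np.asarray(times_allowed))
    if n_samples is None:
        n_samples = len(times)  # assumption about the default
    pmfs = np.empty((num_bootstraps, len(times_allowed)))
    for i in range(num_bootstraps):
        # Draw n_samples wait times with replacement.
        resampled = rng.choice(times, size=n_samples, replace=True)
        # Map each resampled time to its index on the allowed grid.
        idx = np.searchsorted(times_allowed, resampled)
        pmfs[i] = np.bincount(idx, minlength=len(times_allowed)) / n_samples
    return pmfs


times = np.array([1, 1, 2, 3, 3, 3, 5])
times_allowed = np.arange(1, 6)
pmfs = bootstrap_pmfs_sketch(times, times_allowed, num_bootstraps=200)
```

Each row of `pmfs` is one bootstrap replicate; pointwise quantiles over the rows would give error bars.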

multi_locus_analysis.stats.distributions.bootstrapped_pmf_from_waits_(n_samples, num_bootstraps, times, window_sizes, times_allowed, progress_bar=False, **kwargs): Does the shared work of creating non-bar pmfs. Used by the other bootstrapped_pmf_from_waits_* functions.

multi_locus_analysis.stats.distributions.bootstrapped_pmf_from_waits_bars(times, window_sizes, times_allowed, n_samples=None, num_bootstraps=1000, **kwargs): Same as bootstrapped_pmf_from_waits, but returns the results in bars_given_cdf form, ready to plot.

multi_locus_analysis.stats.distributions.ecdf(y, y_allowed=None, auto_pad_left=False, pad_left_at_x=None)

Compute empirical cumulative distribution function (eCDF) from data.

Parameters

y ((N,) array_like) – Values of the data.
y_allowed ((M,) array_like) – Unique values that the data can take. Mostly useful for adding eCDF values at locations where data could or should have been observed but none was recorded.
auto_pad_left (bool) – If False, the output will not include a point where the eCDF equals zero. If True, the mean inter-data spacing is used to automatically generate an aesthetically reasonable such point.
pad_left_at_x (float) – If auto_pad_left is False, you may explicitly specify the value at which to add the leftmost extra point.

Returns

x ((M,) array_like) – The values at which the eCDF was computed. By default np.sort(np.unique(y)).
cdf ((M,) array_like) – Values of the eCDF at each x.

Notes

If using y_allowed, the pad_left parameters are redundant, and should typically be left False/None.
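The core eCDF computation, including the effect of y_allowed, can be sketched with NumPy as follows. The helper name `ecdf_sketch` is hypothetical, the left-padding options are omitted, and this is an illustration of the idea rather than the library's own code.

```python
import numpy as np


def ecdf_sketch(y, y_allowed=None):
    """Hypothetical sketch of an eCDF, without the left-padding options."""
    y = np.sort(np.asarray(y))
    if y_allowed is None:
        x = np.unique(y)  # np.unique already returns sorted values
    else:
        x = np.unique(y_allowed)
    # Fraction of observations less than or equal to each x.
    cdf = np.searchsorted(y, x, side="right") / len(y)
    return x, cdf


y = [3, 1, 2, 2, 5]
x, cdf = ecdf_sketch(y)
# With an explicit grid, values where nothing was observed (here 4 and 6)
# still get an eCDF entry.
x_grid, cdf_grid = ecdf_sketch(y, y_allowed=np.arange(1, 7))
```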

multi_locus_analysis.stats.distributions.sample_from_cdf(n, x, cdf)

Takes a sample count n and a CDF in the form x, cdf, such as the output of the cdf_exact* functions [i.e. pairs of (x, P(X<=x))]. Samples from the empirical distribution function at the maximum “x” resolution allowed by x.

Returns fraction of the resampled data that fell into each bin. In other words, it returns pmf, as if one had done:

>>> counts, _ = np.histogram(samples, bins=x)
>>> pmf = counts / len(samples)
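Because only the per-bin fractions are returned, drawing n samples and binning them is statistically equivalent to a single multinomial draw whose bin probabilities are the CDF increments. The sketch below uses that shortcut; the function name and `seed` parameter are hypothetical, and the library may well sample differently.

```python
import numpy as np


def sample_pmf_from_cdf_sketch(n, x, cdf, seed=0):
    """Hypothetical sketch: resampled per-bin fractions from (x, cdf)."""
    rng = np.random.default_rng(seed)
    cdf = np.asarray(cdf, dtype=float)
    # Probability of landing in each bin [x[i], x[i+1]).
    p = np.diff(cdf)
    p = p / p.sum()  # guard against cdf not spanning exactly [0, 1]
    # One multinomial draw gives the bin counts of n independent samples.
    counts = rng.multinomial(n, p)
    return counts / n


x = np.linspace(0, 5, 6)
cdf = np.array([0.0, 0.1, 0.3, 0.6, 0.9, 1.0])
pmf = sample_pmf_from_cdf_sketch(1000, x, cdf)
```

The result has one entry per bin, so its length is `len(x) - 1`.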

multi_locus_analysis.stats.distributions.smooth_pdf(x, cdf, bw_method=None)

Takes x, cdf from cdf_exact* and returns a kernel density estimator that can be evaluated at any X to get an estimate of pdf(X).

Use bw_method to specify how scipy.stats.gaussian_kde should determine the bandwidth of the Gaussian kernel.
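One way to build such an estimator from (x, cdf) is a weighted Gaussian KDE placed on the bin centers. Whether the library does exactly this is not stated here, so the helper name and the use of bin centers with CDF increments as weights are assumptions. `scipy.stats.gaussian_kde` and its `bw_method` and `weights` arguments are real SciPy API.

```python
import numpy as np
from scipy.stats import gaussian_kde


def smooth_pdf_sketch(x, cdf, bw_method=None):
    """Hypothetical sketch: weighted KDE built from an empirical CDF."""
    x = np.asarray(x, dtype=float)
    cdf = np.asarray(cdf, dtype=float)
    # Assumption: represent each bin by its center, weighted by its mass.
    centers = 0.5 * (x[1:] + x[:-1])
    weights = np.diff(cdf)
    return gaussian_kde(centers, bw_method=bw_method, weights=weights)


x = np.linspace(0, 5, 6)
cdf = np.array([0.0, 0.1, 0.3, 0.6, 0.9, 1.0])
kde = smooth_pdf_sketch(x, cdf, bw_method="scott")
grid = np.linspace(-5, 10, 301)
pdf = kde(grid)  # estimated density at each grid point
```

`bw_method` accepts `"scott"`, `"silverman"`, a scalar factor, or a callable, as documented for `gaussian_kde`.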