This notebook provides a discussion and demonstration of one way to proceed in performing probabilistic inference in dynamical systems. In particular we prepare and simulate a deterministic dynamical system and then perform probabilistic inference on the latent variables of the system. This illustrates how to combine these two approaches, which is essential for understanding Pyro-Velocity.
Here we describe the model we will use to demonstrate probabilistic inference in dynamical systems. We begin with a candidate effective theory[1, 2] of gene transcription and splicing following [3–5] for the model and [6] for the presentation of the analysis. In particular we show in Table 1 the variables and parameters of the model together with their units and rough order of magnitude estimates for their values [7].
Table 1: Variables and parameters of the transcription-splicing-degradation model of [3] with order of magnitude estimates for ranges based on [7] or references therein.
Symbol
Description
Units
\(O(-)\) Estimate
Note
\(u\)
Number of pre-mRNA
molecules/cell
\(10^0 - 10^4\)
Wide range accounts for low to high gene expression levels.
\(s\)
Number of mRNA
molecules/cell
\(10^0 - 10^5\)
Similar to \(u\); depends on gene expression and stability of mRNA.
\(t\)
Time
seconds (\(s\)) to hours (\(hr\))
\(5s\) - \(48hr\)
Depends on the duration of the experimental observation.
\(\alpha\)
Production rate of pre-mRNA
molecules/(cell·hr)
\(10^0 - 10^3\)
Many transcripts are produced at rates in the range or 1 to 1000 per hour.
\(\beta\)
Splicing rate of pre-mRNA
\(hr^{-1}\)
\(10^{-1} - 10^2\)
Many Pre-mRNA to mRNA splicing rates are in the range of 1 minute to 6 hours.
\(\gamma\)
Degradation rate of mRNA
\(hr^{-1}\)
\(10^{-2} - 10^0\)
Many mRNA half-lives are in the range of a half-hour to a day.
Given state variables representing concentrations of pre-mRNA, \(u\), and mRNA, \(s\), we have the following ordinary differential equations taken from [3] (please refer again to Table 1 for the meaning of the variables and parameters),
\[\begin{align}
\frac{du}{dt} & = \alpha - \beta u \label{eq-dudt}, \\
\frac{ds}{dt} & = \beta u - \gamma s \label{eq-dsdt},
\end{align}\]
which proposes a mean-field model for the dynamics of transcription and splicing on a continuous state space. It is important to note this is only one among an eventual ensemble of models that, with appropriate analysis and inference, may be organized into a hierarchy of such models according to their domains of validity relative to the scales involved in and resolution of observations of the relevant phenomenon [1]. Of course, in the context of single gene transcription without explicit modeling of interactions, this ensemble of models would include the various adaptations and extensions of the so-called random telegraph model and the experimental justification for its consideration [8–10]. As an example of a minimal extension, \(\eqref{eq-dudt}\) and \(\eqref{eq-dsdt}\) are usually presented with the concept that the parameter \(\alpha\) could better account for the external inputs to the regulation of transcription of the gene if it were allowed to be a function of time, \(\alpha(t)\). However, because this complicates or eliminates analytical tractability, and thus maximum likelihood inference, approximation of this function may be piecewise constant [5]. We will return to this point later, but, for the purpose of simply illustrating the relationsips among analyzing, simulating, and performing inference upon the parameters of a dynamical system, we will assume that \(\alpha\) and other parameters are constant values.
1.2 Dimensional analysis
It is common to perform a dimensional analysis of dynamical models to ensure the units of the variables and parameters are consistent and reduce the total number of parameters by the number of dimensions of the associated dilation group symmetry [11]. Intuitively, this maps sets of parameters to associated equivalence classes of similar dynamics. For reference, in addition to dimensional analysis and nondimensionalization, the procedure we make use of depends on what is frequently referred to as the Buckingham \(\Pi\) theorem[6].
Table 1 together with (\(\ref{eq-dudt}\)-\(\ref{eq-dsdt}\)) contain essentially all the information we require to get started. The dimensions of the variables and parameters together with dimensionless ones are provided in Table 2.
Table 2: Variables and parameters of the transcription-splicing-degradation model together with their fundamental units and their relations to dimensionless parameters.
Dimensioned Parameter
Relation to Rescaling Parameters
Fundamental Units
Production Rate (\(\alpha\))
\(\alpha = U_0 \beta\)
molecules/(cell·time)
Splicing Rate (\(\beta\))
Reference Scale for \(t^*\)
\(1/\text{time}\)
Degradation Rate (\(\gamma\))
\(\gamma = \gamma^* \beta\)
\(1/\text{time}\)
Pre-mRNA Concentration (\(u\))
\(u = u^* U_0\)
molecules/cell
mRNA Concentration (\(s\))
\(s = s^* U_0\)
molecules/cell
Time (\(t\))
\(t = t^* / \beta\)
time
Note that molecules are dimensionless numbers while cells have units associated to their volume (i.e. the cube of a distance or length \(L^3\)). Usually we would just write volume but we retain “cell” to associate to the object that determines the relevant volume in this case. The dimensionless parameters are defined in Table 3.
Table 3: Dimensionless Parameters, definitions, and intuitive Descriptions
Dimensionless Variables and Parameters
Definition
Description
\(u^{\ast}\) , \(s^{\ast}\)
\(u / U_0\), \(s / U_0\)
Characteristic scale of (pre-)mRNA concentration based on the balance between production and splicing rates.
\(t^*\)
\(\beta t\)
Characteristic time scale relative to the splicing rate.
\(\gamma^*\)
\(\gamma / \beta\)
Relative degradation rate, comparing the degradation rate of mRNA to the splicing rate, indicating the stability or turnover rate of mRNA relative to its production/splicing.
Given our previously defined parameter range for the splicing rate \(\beta\), we can interpret the dimensionless parameter \(t^{\ast}\) according to Table 4.
Table 4: This table interprets the dimensionless time \(t^{\ast}\) across various splicing rates \(\beta\) (hr\(^{-1}\)), demonstrating how real time and expected splicing events scale with changes in \(\beta\). Each unit of \(t^{\ast}\) represents a normalized measure of time relative to the splicing rate, with the final three columns indicating the expected number of splicing events per molecule of pre-mRNA for 1, 10, and 100 units of \(t^{\ast}\) is indeed the meaning of time units for this scaling rule. These columns confirm that the dimensionless time directly reflects the expected splicing activity, endowing otherwise meaningless or difficult to interpret numbers into a universally applicable framework based on scales intrinsic to a given instance of the model.
\(\beta\) (hr\(^{-1}\))
1 unit of \(t^*\) (hr)
10 units of \(t^*\) (hr)
100 units of \(t^*\) (hr)
1 unit of \(t^{\ast}\)
10 units of \(t^{\ast}\)
100 units of \(t^{\ast}\)
\(10^{-1}\)
10
100
1000
1
10
100
\(10^0\)
1
10
100
1
10
100
\(10^1\)
0.1
1
10
1
10
100
\(10^2\)
0.01
0.1
1
1
10
100
Combining variables and parameters gives six total. We have two fundamental dimensions (time and length) we can eliminate to arrive at four essential variables and parameters. To derive these, we begin declaring the equations in python using sympy and substitute the dimensionless variables and parameters to obtain the dimensionless equations.
in \(\eqref{eq-dustardtstar}\) and \(\eqref{eq-dsstardtstar}\), which contains the \(6 - 2 = 4\) variables and parameters: \(u^{\ast}, s^{\ast}, t^{\ast}, \gamma^{\ast}\).
We’ll solve the dimensionless system analytically, focusing on what can be understood about the system's dynamics. Given specific initial conditions \(u^{\ast}(0) = u^{\ast}_0\) and \(s^{\ast}(0) = s^{\ast}_0\), which we restore to arbitrary \(t^{\ast}_0\) at the end of the derivation, the first equation
\[
\frac{du^{\ast}}{dt^{\ast}} = 1 - u^{\ast},
\]
can be solved using separation of variables
\[
u^{\ast}(t^{\ast}) = C_1 e^{-t^{\ast}} + 1.
\]
\(C_1\) is of course the constant determined by the initial conditions, which can be shown to be \(C_1 = u^{\ast}_0 - 1,\) giving us
This equation describes how the concentration of pre-mRNA, in units of the characteristic concentration scale, \(U_0 = \alpha/\beta\), evolves over the timescale characterized by the inverse of the splicing rate, \(1/\beta\). The term \(e^{-t^{\ast}}\) indicates that any deviation of \(u^{\ast}\) from its steady state \(1\) decays exponentially over time.
Now, we substitute the expression for \(u^{\ast}(t^{\ast})\) into the differential equation for mRNA concentration
This equation is a non-homogeneous first-order linear ODE for \(s^{\ast}\). To solve it, we can use an integrating factor \(e^{\gamma^{\ast} t^{\ast}}\), which gives
where \(C_2\) is another integration constant determined by initial conditions. Incorporating the initial condition \(s^{\ast}(0) = s^{\ast}_0\) into the solution for \(s^{\ast}(t^{\ast})\) to determine \(C_2\)
For the case \(\gamma^{\ast} = 1\), we can solve \(\eqref{eq-dustardtstar}\) and \(\eqref{eq-dsstardtstar}\) to find the solution for \(u^{\ast}\) is unchanged and \(s^{\ast}\) is then given by
Note we can restore an arbitrary initial timepoint putting \(t^{\ast} \rightarrow t^{\ast} - t^{\ast}_0\).
The dimensionless pre-mRNA dynamics\(u^{\ast}(t^{\ast})\) given in \(\eqref{eq-ustar-dimless}\) settles to its steady state value of 1 at rate 1, regardless of the initial concentration, reflecting the balance between transcription and splicing rates. Note again that this does not refer to an arbitrary value of \(1\) but rather represents the equivalence class of ratios \(\alpha/\beta\).
The dimensionless mRNA dynamics\(s^{\ast}(t^{\ast})\) given in \(\eqref{eq-sstar-dimless}\) reveals a slightly more complicated balance of forces. The first transient term shows an exponential decay influenced by \(\gamma^{\ast}\),the degradation rate relative to the splicing rate, when \(\gamma^{\ast}\) is not precisely \(1\). The second transient term reflects how the initial deviation in pre-mRNA concentration from its steady state affects mRNA levels over time. The steady state of \(s^{\ast}\) simply reflects the balance of the splicing rate, which is the only source of mRNA production in this model, and the degradation rate also given, just as in the case of the pre-mRNA, in units of the transcription rate relative to the splicing rate.
The analytical solutions to the dimensionless system reveal how the pre-mRNA and mRNA concentrations evolve over time towards their steady states. The exponential terms highlight the system's inherent timescales: \(u^{\ast}\) relaxes to its steady state independently, while \(s^{\ast}\)’s dynamics are coupled to both \(u^{\ast}\) and its own degradation rate. While this system is extremely simplified relative to well-known biology, as a stylized model it still indicates the interplay among transcription, splicing, and degradation rates.
While it is trivial to solve this system analytically, in more general cases this will not be possible. So we will proceed to simulate the system to characterize the possible solution sets and then perform probabilistic inference on the latent variables.
2 Data
If we are given samples of pre-mRNA and mRNA counts along with the sampling time of each cell, we can describe the data set \(\mathcal{D}\) consisting of pre-mRNA and mRNA counts for a number of genes \(G\) across a number of cells \(N\)
\(t_j\) represents the sampling time for cell \(j\),
\(u_{ij}\) represents the count of pre-mRNA for gene \(i\) in cell \(j\),
\(s_{ij}\) represents the count of mRNA for the same gene \(i\) in the same cell \(j\),
\(G\) is the total number of genes in the study,
\(N\) is the total number of cells sampled.
Roughly speaking, the inference problem involves
Normalization: Transforming the observed counts \(u_{ij}\) and \(s_{ij}\) to their dimensionless counterparts \(u^{\ast}_{ij}\) and \(s^{\ast}_{ij}\) using a reasonable concentration scale \(U_0\).
Model Fitting: Applying statistical methods to estimate the latent variables \(\Theta = \left( \gamma^{\ast}, u^{\ast}_0, s^{\ast}_0 \right)\) that best fit the observed data.
Evaluation: Assessing the fit of the model and the estimated parameters’ biological plausibility and consistency across different genes and cells.
We will return later to the alternative version of this problem where \(t_j\) are not observed and must be inferred along with the latent variables \(\Theta = \left( \gamma^{\ast}, u^{\ast}_0, s^{\ast}_0, t^{\ast} \right)\).
This representation of the dataset and the associated objective of statistical learning provides a solid foundation for applying probabilistic modeling techniques to calibrate an inference procedure and evaluate the plausibility of this model in such a manner that we can eventually compare multiple candidate models to one another.
3 Simulation
We will primarily focus on simulating the system, since this will generalize to more complicated models. However, we will confirm our simulations for this first and simplest model recapitulate its analytical solution derived above.
As a simple example, we will simulate the system for a single set of initial conditions and parameters.
We can visualize the results of the simulation by plotting the pre-mRNA and mRNA concentrations over time as shown in Figure 1,
Code
from pyrovelocity.plots import plot_deterministic_simulation_trajectoriesplot_deterministic_simulation_trajectories( solution=solution, title_prefix="TSD Model Simulated", colormap_name=colormap_name,)
We see that if we plot the analytical solution, we get the same result as shown in Figure 2 to within the expected error of the numerical simulation (\(O(10^{-5} - 10^{-8})\)).
Code
from pyrovelocity.plots import plot_deterministic_simulation_trajectoriesplot_deterministic_simulation_trajectories( solution=analytical_simulation_error, title_prefix="TSD Model Analytical-Simulation Error", colormap_name=colormap_name,)
The same simulated trajectories from Figure 1 are shown in the phase space given by \(u^{\ast} \otimes s^{\ast}\) in Figure 3.
Code
from pyrovelocity.plots import plot_deterministic_simulation_phase_portraitplot_deterministic_simulation_phase_portrait( solution=solution, title_prefix="TSD Model", colormap_name=colormap_name,)
4 Inference
Now that we have illustrated how to simulate a prototypical system, we can proceed to define a probabilistic model and perform inference on its latent variables given data like \(\eqref{eq-dataset}\). In general we should consider multiple models with different assumptions and levels of complexity, but for now we will focus on one of the simpler models to illustrate the process. In particular, in most cases, when performing inference on the latent variables of a dynamical system, we assume that the observation time points are fixed or otherwise observed. We will begin with this assumption and then proceed to discuss how to relax it.
In the case where we can indeed estimate analytical solutions to the system, we can consider a model with the structure proposed in (\(\ref{eq-init-conds-priors-gene}\)–\(\ref{eq-s-obs-cell}\)) where, given a number of genes \(G\) and cells \(N\) that are measured at \(K_j\) distinct times, then for each \((i,j,k) \in \{1, \ldots, G\} \otimes \{1, \ldots, N\} \otimes \{1, \ldots, K_j\}\) we have
Of course, in general, we will not have access to the analytical solutions, so we will need to simulate the system and then perform inference on the latent variables as suggested in (\(\ref{eq-init-conds-priors-num}\)–\(\ref{eq-s-obs-num}\)).
Where \(\frac{du^{\ast}}{dt^{\ast}}\) and \(\frac{ds^{\ast}}{dt^{\ast}}\), are given by (\(\ref{eq-dustardtstar}\)-\(\ref{eq-dsstardtstar}\)) and noting that we suppress the indices from (\(\ref{eq-init-conds-priors-gene}\)–\(\ref{eq-s-obs-cell}\)) for brevity.
If, for example, we only observe one time point for each cell, then Figure 4 and its associated description in (\(\ref{eq-init-conds-priors-gene}\)–\(\ref{eq-s-obs-cell}\)) reduces to the graphical model in Figure 5.
La Manno, G., Soldatov, R., Zeisel, A., RNA velocity authors, Linnarsson, S., Kharchenko, P.V.: RNA velocity of single cells. Nature. (2018). https://doi.org/10.1038/s41586-018-0414-6
5.
Bergen, V., Lange, M., Peidli, S., Wolf, F.A., Theis, F.J.: Generalizing RNA velocity to transient cell states through dynamical modeling. Nat. Biotechnol. 38, 1408–1414 (2020). https://doi.org/10.1038/s41587-020-0591-3
6.
Cantwell, B.J.: Introduction to symmetry analysis. Cambridge University Press (2002)
7.
Phillips, R., Kondev, J., Theriot, J., Garcia, H.: Physical biology of the cell. Garland Science (2012)
8.
Ham, L., Schnoerr, D., Brackston, R.D., Stumpf, M.P.H.: Exactly solvable models of stochastic gene expression. J. Chem. Phys. 152, 144106 (2020). https://doi.org/10.1063/1.5143540
9.
Cao, Z., Grima, R.: Analytical distributions for detailed models of stochastic gene expression in eukaryotic cells. Proc. Natl. Acad. Sci. U. S. A. 117, 4682–4692 (2020). https://doi.org/10.1073/pnas.1910888117
10.
Bohrer, C.H., Larson, D.R.: The stochastic genome and its role in gene expression. Cold Spring Harb. Perspect. Biol. 13, (2021). https://doi.org/10.1101/cshperspect.a040386
11.
Pegoraro, F.: The dilation group and dimensionless quantities in classical and relativistic hydrodynamics. Meccanica. 8, 216–222 (1973). https://doi.org/10.1007/BF02342407