2008-03-15 - Robust stochastic chemical reaction networks and bounded Tau-leaping - soloveichik_robust_CRN2008arxiv.pdf
It is my understanding that the Gillespie algorithm is just a kinetic Monte Carlo equation. The algorithm itself consists of setting up the environment and all of the molecules (I think the rule is something like this: as the molecular count increases, the order proportionally increases (double check)). And once you set up all of the molecules, you then take a "cumulative function" (sum of the rates of events per each of the simulated molecules) Then, use a binary search (guess the median, then if that's too high, chop everything above, and if it's too low, chop everything below, choose the median again) (assume well-ordered list) to find the event/reaction/transition that would make the updated environmental ensemble piecewisely increased in terms of the arrow of time or increased entropy. Carry out that event, recalculate all of the rates for each of the remaining molecualr objects, And apparently the change in time has to be updated by the negative logarithm of the random number all over the sum of the rates of the system (tack this on to the total running time variable for the simulation, but not the system clock).
Kinetic Monte Carlo simulation pseudocode or python source code here
Gillespie offers tau-leaping methodology as an improvement on the kinetic Monte Carlo simulation idea. Instead of processing one time-tick or iterative unit at a time, he suggests taking larger time leaps. How? Instead of asking what discrete element will react next, the tau-leaping methodology asks how often does each reaction channel fire in the next specified time interval (time interval named tau)? You then throw in a Poisson random variable and you can calculate the change in the species population through the next step by approximating a reaction that would take out some discrete members because of a stoichiometric change vector for that particular reaction multiplied (just like before) the sum of the reaction rates (except this is a different function -- still need to read up on the deal with the Poisson function ... ). The trick here is that, while the computations might be time-consuming, ultimately the number of memory reads and writes is significantly reduced (fewer steps) and thus this results in faster simulations. Another trick is figuring out how large of a chunk of time to do at once such that computational benefits are made while not going beyond accumulated benefits of bypassing so many memory IO calls (the leap condition). Then calculate the net change in state (just like in the kinetic Monte Carlo simulation), and add in the time leap into the simulation clock, and all is well.
TODO: Poisson distributions, ...
According to Soloveichik in ref, chemical reaction networks are robust (able to computationally proceed despite unpredictable mutations in inputs or environmental conditions) with respect to state probabilities, to some degree independent of reaction rate deviations. Note that in the kinematic Monte Carlo methods and in the Gillespie algorithm and tau-leaping, the Poisson function for the summation of reaction rates is used for the simulation. What is the formalization of state probabilities in this context?
Re: Soloveichik's abstract. Some where in the past few days I read a line that said that 80% of all of the genes in the cell and even human only ever interact with no more than a handful of other genes (each network having their own genes to interact with) -- this may have been on the reactome database or in the Henry Markram articles re: the Blue Brain project (ah, or the Paul Allen Institute re: the human brain atlas to profile gene expression and SNPs). However: it is worth noting that the paper also mentions that the computational work of a chemical reaction network scales polylogarthmically in the total molecular count, suggesting that large scale ensembles of molecules directly preform computations via physical/chemical reactions and are genuinely powerful number-crunchers (perhaps more so than our Cray or BlueGene supercomputers). In fact, I just realized that the Soloveichik paper answers a fairly fundamental question in biology (re: robustness and how it all works out), perhaps "bridging" computer science and biology more formally than von Neumann did in his time. That "80%" observation makes for good observational confirmation of his idea. ("Finally, by noticing that even robust stochastic chemical reaction networks are capable of embedding complex computational problems, we argue that the linear dependence on simulated time and concentration is optimal.") Oops -- turns out the 80% observation is from this paper itself. So this is is self-confirming.
SCRN := stochastic chemical reaction network model of chemical kinetics
-- describes interactions as Markov jump processes, which if I recall correctly, just means that the history of the system is ignored and the latest states are conserved/remembered for proceeding forward with the simulation.
SSA := stochastic simulation algorithm
ODE/PDE solvers --> my list of computational fluid dynamic solvers (I have been using OpenFOAM)
In the paper, when it says numerical ODE solver-sim complexity does not scale with the number of reactions, but rather with simulation time and concentration, does this refer to Kolmogorov complexity? Or does it refer to Chaitin complexity, in the sense of algorithmic information theory (AIT)? I suspect this use of the word 'complexity' refers to entropy and information in the sense of Shannon information, so complexity might refer to the "lack of meaning" or -- indeed -- present stochasticism (which is proven somewhere else, some references are thrown to the reader). The mass action simulations seem to be good for homogenous, well-diluted solutions, which is not inclusive of digital/finite stochastic computations.
It would be interesting to have a proof that biological systems are Markov jump processes and that not all information of previous reactions are required to maintain biological livelihood. Perhaps this is proven by healing skin: a wound will heal, despite the lack of historical information on where the previous skin went etc.
Ref 24 - review of tau-leaping algorithms
robustness := perturbations in reaction propensities (propensities for, say, change in state)
bounded tau-leaping (BTL) --- in contrast to Gillespie's tau-leaping, the leap time of each step is a random variable and not a function of the current state.
The stochastic chemical reaction network definition that they are using seems to be based off of just the addition of stoichiometric vectors, so that each state is just the accumulation of certain vectors which either add or subtract different terms (molecules) from the ensemble. They are "state change vectors". Only uniary and bimolecular reactions, per Gillespie.
pg 4. prediction problem definition - this throws me for a loop, what is the Bernoulli random variable?
The perturbation idea makes sense. And the stochastic definition of the new SSA propensity functions. If the new process is not Markov and does not possess poisson transition probabilities, then it's a p-perturbation (and can this be handled by separate simulation packages?).
monotonicity -- is this the same use of monotonicity as in my notes on Taylor and Maclaurin power series in calculus? (2007-10-15)
-- If we are showing that robustness is an important component of cells and simulations, then does this mean that we can use "robust-generating" theorems from comp sci to set up evolutionary selection experiments?
Re: randomized turing machines, see nondeterministic algorithms in comp sci. "A shopping list can be viewed as a very simple nondeterministic algorithm. Every item on the list is a directive to find the indicated product, but the order in which to find them is not indicated."
Just where are the turing machines implemented in the case of cascading chemical reactions in cells? And can we prove that DNA is a Universal Turing Machine?
Enzyme-free nucleic acid logic circuits - DNA_logic_circuits2006.pdf - George Seelig
Huh. Apparently these guys have made basic logic gates that take single-strand nucleic acids as IO? They call them "nucleic acid devices". Ref #3 -- allosteric ribozymes that take small molecules as input have been shown to perform logical functions. Output is cleaved or ligated oligonucleotide, and their output is not their input ((and even if it was, what is the guarantee that the oligos wouldn't just double back around and become the inputs to the exact same device and process recursively??)) And, if I remember my oligonucleotide synthesis, ligation takes a long time.
"Recently, engineered nucleic acid logic switches based on hybridization and conformational changes have been successfully demonstrated in vivo (15,16)." <--- go fetch the references
high/low logic values are represented by high/low concentrations (respectively).
From the diagrams in the paper, it looks like they are expecting oligos to go from gate to gate and move "down stream" in a logical progression. But how robust is this functionality when the cell is in a centrifuge? Maybe this is (what I would call) gated concentration logic? Yes, this must be the case. The "toe-hold" seems to be a key/signal method for specifying which gate a certain oligo output is meant for, where each gate is then designed to rewrite this "key" parameter for the next gate in the sequence, regardless of the concentration levels of these gates since ultimately there will be a finite level of inputs and processing gates and thus bandwidth will be constrained (if not here then by RNA polymerase and gene expression rates).
A few weeks ago I found myself talking with Tom Knight or maybe Drew Endy on a mailing list and came across the idea of amorphous computing and the GPL or growth programming language (not GNU licensing). Specifically, this language has been designed to allow for situations where concentration and growth-direction can be determined, but there's not necessarily any global positioning system available and all of the agents or all of the logic gate units don't have any clue as to the "bigger picture" of the circuit. These circuits have to be probabilistic since there's no "physical connection" from one gate to the next (although maybe there's something that can be hacked together for extracellular matrice logic?).
Signal transduction simulation software: MCell, tools for modeling signal-transduction, ...
What are the RNA sequences for the Winfree logic gates? (see at the end of the 2006 article - Construction of an in vivo bistable circuit from synthetic transcriptional switches)
Construction of an in vitro bistable circuit from synthetic transcriptional switches.pdf
the Belousov-Zhabotinsky oscillator
cyanobacterial circadian clock operation is independent of transcription and translation (see Nakajima et al, 2005)
variety of interesting circuits can be constructed within cell-free transcription-translation systems (Noireaux et al, 2003; Isalan et al, 2005)
-- Given a well-characterized biochemical logic system, would it be possible to rewrite the genome in terms of new chemical processing routines as defined by our circuits? If this was to be the case, would there be any translation from historical genetic regulatory circuits to transcriptional circuits? And what would an internet database of these new circuits look like?
Is any of this steady-state?
"The regulatory domain is upstream of the promoter region; the output domain is downstream of the promoter region. This separation of domains allows us to design DNA templates that have any desired connectivity." Regulated DNA templates are called switches (Sw) and unregulated DNA templates are called sources (So). The OFF state of the switch consists of a double-stranded DNA template ('T') with a partially single-stranded (ss) and thus incompete promoter region. The switch is turned on by the addition of an ssDNA activator ('A') that completes the promoter region. The activator contains a 'toehold', a single-stranded overhang beyond the helical domain it forms with the DNA template, where an inhibitor can bind to initiate a toehold-mediated strand displacement reaction (Yurke and Mills, 2003). The switch can be turned off via addition of inhibitor strand of either ssRNA ('I') or ssDNA ('dl'). An ON state source template has a complete promoter sequence with a nick and an OFF-state source template is missing five bases of the promoter sequence on the template side. Source templates do not interact with activators, due to the hairpin stem permanently covering the branch migration sequence and therefore maintain their transcription efficiency in the presence of inibitors.
formalization of transcriptional circuits?
Michaelis-Menten enzyme reactions
Where is this taking place ? I thought the logic gates were in vitro, as in, floating pieces of oligonucleotides. But reading this doc leads me to think that each of the gates may be on the DNA molecule and thus expressed once and only once, therefore allowing very specific circuits and sequences to be executed. This would avoid protein folding since you just use a DNA nt seq that doesn't fold into anything, just does its function. How, then, do you design or generate or engineer the keys/signalers for keying into a certain logic gate? Are these transcriptional-regulatory molecules produced in massive selection experiments? Edit: maybe you don't have to, since everything can be largely localized by repressor-promoter inhibition circuits. Choose sequences that can do long-range feedback if you have to separate your logic circuits (but why would you?).
Note: evince is not good for reading papers. Had to go install kpdf so that I could select text properly. That was always an annoying problem with evince.
TODO: VHDL to transcriptional network translator program (use flex/bison, the compiler compiler).
VHDL for an adder circuit (found here):
entity adder is
Note that the RTL is the same as the binary-algebra representation of the adder logic table: just two outputs, the sum is the XOR of all of the inputs and the carry is any of the ANDed inputs. From here, go look at the numerous VHDL grammar definitions (like uc_vhdlgrammar.Z or VHDL93-BNF, VHDL93 syntax, ...). The GHDL source .tar.bz2 has significantly more content than the grammar from uc_vhdlgrammar.z, what's up with this? I'd have to wade through a lot of files if I was to convert GHDL into the VHDL->DNA compiler. Maybe I'll use Verilog, since that particular compiler for Verilog generates a netlist. Perhaps more simply: gcc just needs to be able to compile cross-platform with the target being DNA.
-- i0, i1 and the carry-in ci are inputs of the adder.
-- s is the sum output, co is the carry-out.
port (i0, i1 : in bit; ci : in bit; s : out bit; co : out bit);
architecture rtl of adder is
-- This full-adder architecture contains two concurrent assignment.
-- Compute the sum.
s <= i0 xor i1 xor ci;
-- Compute the carry.
co <= (i0 and i1) or (i0 and ci) or (i1 and ci);
The program xxdna should be the fundamental unit: it outputs the logic gate in DNA sequence form. The first input should be the type of logic gate (one of the 16 basic logic gates, or something like that), and then the second input should be the "input that the gate should recognize" -- and then the program will have to generate (1) the DNA sequence and (2) the output-key, in other words it's the next string that would serve as the second parameter to the next call of the xxdna program.
Vienna RNA fold package -- RNAcofold (see also: Wikipedia's list of RNA structure prediction software packages). There happens to be an online CGI web interface to the RNAinverse program where you can insert the desired structure.
Simulate: toehold-mediated strand displacement by branch migration (Yurke and Mills, 2003); toehold sequestering;
Some collected research on synbio logic gates
The major problem seems to be searching for RNA strands that we can use. Generating a list of all possible sequences is computationally intensive and would require up to 4^12 simulations.
See the 2008-03-16 emails with Winfree. Simulation ideas for determining appropriate RNA strands.
Winfree got back to me on the idea of a GA. He estimates 2^(4^N) checks/simulations, which is computationally infeasible. The 2^ has to be counted in there because you are trying to find mutually compatible sequences. He'd like an analytic method of generating (instead of selecting) RNA sequences from first principles. And a GA would work for small values of N, anywhere from 0 to 3 ... since at N=3 you're already doing 10^19 comparisons. On a 1 GHz computer, N=3 would take (at best) 316 years of computation. On a terahertz computer, that's 4.32 years, and on an inaccessible petahertz computer that's 2 minutes ... assuming there's no memory IO, which slows these machines down to a crawl. Edit -- note that we might be able to encode these directly into electronic circuits, the whole simulation, instead of on a microprocessor: but is it worth the effort? For example, JPEG decoders can be implemented in transistors, instead of just a software package that runs on some ISA.
Here's my attempt at formalization of the problem: in any set S_0 with 4^N different molecules (where N = length of RNA molecules in nt), there is a list X (of length L(X)) of specific "mostly-lonely RNA"s, RNA molecules which interaction only with one other RNA molecule (what we are looking for). In set N+1, the members of X -- the specific "lonely RNAs" -- are not necessarily in the new list X of set S_1, or N+1, even if L(X(S_0)) == L(X(S_1)). You can't compare the members of X of S_0 with the members of X of S_1 since membership will change. Instead, my idea is to try to predict what the addition of a certain nucleotide to a sequence does, knowing that all other possible changes are also being made at the same time, for all of the other RNA sequences.
For example, if sequence AAUG makes the cut for membership in X, then the addition of A, G, U, or C will do something with respect to the other members of the next iteration. Let's say we are checking AAUGA: we know the other three members are AAUGG, AAUGU, AAUGC. (Meanwhile, at this same N=5 length, all of the other possible combinations are being tested too, but we're just interested in whether or not one of the sequences *locally* qualifies for inclusion in X). This is like bracketing in hand-to-hand combat competitions. After all, if they do not locally, amongst their "brothern" possibilities, qualify, then they will definitely not qualify against any other molecule. **But** this does not mean that they do not work with (do not interact with) a large majority of other molecules ... it is obvious that I am imposing very strict requirements.
Once we have the 'winner' (if any) of the local-check, we move up the bracket to compare the winner (if any) of AAUG* and AAUA*, and then we move up again with the winner to check against the winner of AAUC* and AAUU* (where asterisk represents all of the combinations below that, bringing up only the qualifiers for local inclusion in X).
The benefits of this method is the significantly reduced computational checks. The downside is that each time you increase N, you have to start from the "bottom" and work your way up to compare the different 'brackets'. The good news is that this should be analytical in its nature: I bet we can find a way to determine whether or not a certain sequence is going to interact with one of its brothern at that scale. Without simulation. I am pretty sure of this because it should be just as predictable as Watson-Crick pairing. Using some molecular dynamics software packages, maybe I can explore the properties of the nucleotides and show to what extent the addition of a nucleotide alters the functionality of the entire strand.
Note that RNAcofold is a direct analytic program package -- as far as I can tell, they are not doing simulation to figure out what the solutions are. This is good news.
- Review of RNA software
- GA to simulate the evolution of RNA sequences that meet all of the specifications (type up my notes)
- My "analytical" approach.
NMR of nucleic acids -- if we were going to do this, how many molecules of each of the strands would we need? If we were doing N=5, then we would need 2^(4^(N)) strands or in other words 1E308 molecules. There's only 1E77 molecules in the known universe.
Winfree mentioned analytical methods [that still need to be made] of generating a next RNA sequence to use instead of simulation. Simulation becomes exponentially more computationally infeasible as you add nucleotides or if you want four times as many logic gates to work with (and semiconductor fabs do billions of logic gates on a surface, though granted for ridiculously inefficient functions). At length = 4, you start getting up into 10^19 territory, which is stuff like 319 years on a 1 GHz machine (or 2 min on a petahertz computer). In a GA, you still have to do the scoring function (interaction simulator), so you don't avoid these fundamental problems that way. So, I am having trouble figuring out where to start working on such an 'RNA generator', since I would need sample data of the RNA simulators in the first place to analyze and make sweeping generalizations about the addition of nucleotides altering the functionality of the molecule. So I have begun to do a review of online RNA software -- but I am not too hopeful about this DNA logic idea, really, as awesome as it would be. Neither simulation or evolution or physical interaction (I calculate I'd need 1E308 molecules, and there's only 1E77 atoms in the universe) can do it since 2^(4^(N)) is 2^(4^(N)) no matter the medium, and anything past N=2 is terrible, so at most we can do 16 gates per circuit, not including the RNA that doesn't qualify, and if we ever put these circuits in cells, then one circuit per cell. But then how did the Winfree minions get their RNA strands via simulation ? I am missing something --- something is wrong here. I think the Winfree minions came up with a giant specification and then had software compute RNA that would all work together that all meet the specifications, with a random seed to allow differentiation, and then from there random scoring of all of the possibilities per each RNA strand to see which one is optimal. Yeah, this makes sense. Much better.
- Domain lengths
- Complementarity requirements - recognition regions must be complimentary to their target and each gate strand must be able to correctly bind its neighboring gate strand(s).
- Terminate double-helical domains by G-C pairs to reduce fraying.
Input stand determined entirely by seq of target gate strand. Note: previous strand choices constrain future strand choices.
Generate a set of random sequences satisfying the above structural constraints. And then apply these soft constriants:
(1) Minimization of secondary structures in single-stranded species (input/messenger strands) as predicted by the minimum-free-energy (MFE) struct [S3] at 25 deg C using DNA params [S4];
(2) Minimization of cross-talk between all single-stranded species, as measured by the delta G deg of association between pairs of strands (estimated as intramolecular MFE for a 'virtual' strand linking the two seqs together via 5 unpaired nts);
(3) Especially avoiding secondary-structs in messenger strands and single strand portions of partially-triggered gates that hide the toe-hold binding region; ( -- are other secondary structs OK?)
(4) No more than 3 consecutive repeats of the same base.
(5) Sequence Symmetry Minimization [S5, S6];
(6) Making all toe-holds of similar strength (predicted delta G deg @ 25 deg C)
(7) Avoid branch migration of the bulge loop region in a bound output strand. 1st 4 ss nts @ base on one side of the loop must not = last 4 nts in the ds region directly adjacent to the loop on the other side.
Weight each parameter. Summation = score. Make random sequence mutations (that satisfy the struct-constraints) and select the mutations/seq that reduce overall score.
DNA logic gates in vivo: instead of encoding for amino acids, encode for logic gates, allowing mutation to occur on the sequential expression of logic functions to be later expressed. Then, have a way to make the associated ssDNA and mRNA strands; then, quickly export the ssDNA on to the surface of the cell via some transport proteins (all cells tend to try to break down the ssDNA, apparently; can they break down the ssDNA if it is crystallized or in a different state?).
2008-03-21: Bacterial transformation requires dsDNA binding proteins on the plasma membrane. Now, can I find a molecular chaperone that can bring DNA outside of the cell and immediately bind with the dsDNA binding proteins? And can they be used as ssDNA binding proteins? What about a membrane-embedded single-strand binding protein?
-- SSBP with translocation protein targetting to bind it to the plasma membrane. (note: there's a dsDNA binding protein required for bacterial transformation, but I don't know the name of the protein)
-- molecular chaperone to move ssDNA to the outer plasma membrane for binding on the SSBP. I need a mechanism for copying a specific sequence of the DNA molecule into a single-stranded DNA molecule *bound to* 'protein targetting' proteins. Note that there are PCR protocols for making ssDNA; encode these "ssDNA PCR protocols" (or just hack DNA replication) into the genome, and then add a step where each ssDNA strand (the gates) are bound to the membrane-targetting proteins. I get it; I can evolve alternative DNA replication proteins (just like Andy did), making the process that I need. He evolved DNA polymerase to work with alternative nucleic acids a while back. But how can I evolve this alternative DNA replication process? I need to do cellular DNA -> ssDNA -> capped with proteins to move it past the plasma membrane without being destroyed.
-- way to get the ssDNA for the logic gates while inside the cell
------ Manufacture ssDNA in vivo?
-- Now you can evolve logic by having cells replicate, exchange strings for new logic, etc.
---- This first requires a way to encode DNA logic in the DNA strand itself ("alternative nucleic acids").