Monday, May 11, 2009

Connecting the dots... "Let us begin anew"


As I've learned more about bio-systems, starting from water molecules and working up to synapses and networks of neurons, I've come to appreciate how incredibly powerful and compact life's molecular computing substrate is. Our most powerful supercomputers take days to calculate how a single protein molecule folds, while even the simplest bacteria perform millions of these operations in parallel in seconds. What these simulations give us, however, is insight into the special characteristics each protein has in the various shapes it can assume. Building up from this low-level understanding, we will hopefully be able to work out the larger-scale purpose of each of the signaling chains and genetic transcriptions taking place. Perhaps one day we can model these complex molecular interactions using state machines and logic, achieving a functionally equivalent set of operations without having to simulate cells precisely at the molecular level.

A number of new approaches aim at this level of understanding.
On the BrainScience podcast mentioned in the previous post, Seth Grant provided some nice descriptions of the differences and connections between the "trendy" terms "genetics", "genomics" and "proteomics":
Genetics is the study of gene function or the function of the biology as revealed by genes, and typically involves the study of cells or animals where there has been a mutation or an abnormality introduced into a gene and as a result of that, the function of the cell or animal is changed. And, of course, the readers will understand this, but a mutation in a gene effectively means a change in the DNA sequence that encodes that gene.
Genomics is a different thing. Genomics is the study of the organization of all of the DNA or the 'genome'. And, of course, the genome encodes roughly 20,000 genes in mammalian systems, and therefore, when one is studying the genomics of man or mouse, we're studying all of the genes. Typically in genetics you might only study one gene at a time in many cases. So that gives you a sense of the difference between the large scale features of genomics and the somewhat small scale features of genetics.
Proteomics is the study of the sets of proteins, or all of the proteins that perform biological functions or are found in cells or tissues. "Proteome" is to proteins what "genome" is to genes. Again, proteome is dealing with large sets of molecules. In our case, we were particularly interested in the 'proteome' (or all of the proteins) found in synapses. But you might be interested in all of the 'proteome' of red blood cells, in other words, all of the proteins that are found in a red blood cell.


There's a very good paper called "The Many Facets of Natural Computing" that looks at some of the interaction networks that are active in biological systems. The paper was written by
  • Lila Kari, Department of Computer Science, University of Western Ontario, London, ON, N6A 5B7, Canada, lila@csd.uwo.ca

  • Grzegorz Rozenberg, Leiden Inst. of Advanced Computer Science, Leiden University, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands, and Department of Computer Science, University of Colorado at Boulder, Boulder, CO 80309, USA, rozenber@liacs.nl

  • Their copyright notice:
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    ...
    [A]t the cell level, scientific research on organic components has focused strongly on four different interdependent interaction networks, based on four different “biochemical toolkits”: nucleic acids (DNA and RNA), proteins, lipids, carbohydrates, and their building blocks.

    The genome consists of DNA sequences, some of which are genes that can be transcribed into messenger RNA (mRNA), and then translated into proteins according to the genetic code that maps 3-letter DNA segments into amino acids. A protein is a sequence over the 20-letter alphabet of amino acids. Each gene is associated with other DNA segments (promoters, enhancers, or silencers) that act as binding sites for proteins which activate or repress the gene’s transcription. Genes interact with each other indirectly, either through their gene products (mRNA, proteins) which can act as transcription factors to regulate gene transcription – either as activators or repressors –, or through small RNA species that directly regulate genes.

    These gene-gene interactions, together with the genes’ interactions with other substances in the cell, form the most basic interaction network of an organism, the gene regulatory network. Gene regulatory networks perform information processing tasks within the cell, including the assembly and maintenance of the other networks. Research into modeling gene regulatory networks includes qualitative models such as random and probabilistic Boolean networks, asynchronous automata, and network motifs.(ref.)
    ...
    Proteins and their interactions form another interaction network in a cell, that of biochemical networks, which perform all mechanical and metabolic tasks inside a cell. Proteins are folded-up strings of amino acids that take three-dimensional shapes, with possible characteristic interaction sites accessible to other molecules. If the binding of interaction sites is energetically favourable, two or more proteins may specifically bind to each other to form a dynamic protein complex by a process called complexation. A protein complex may act as a catalyst by bringing together other compounds and facilitating chemical reactions between them. Proteins may also chemically modify each other by attaching or removing modifying groups, such as phosphate groups, at specific sites. Each such modification may reveal new interaction surfaces.

    There are tens of thousands of proteins in a cell. At any given moment, each of them has certain available binding sites (which means that they can bind to other proteins, DNA, or membranes), and each of them has modifying groups at specific sites either present or absent. Protein-protein interaction networks are large and complex, and finding a language to describe them is a difficult task. A significant progress in this direction was made by the introduction of Kohn-maps, a graphical notation that resulted in succinct pictures depicting molecular interactions. Other approaches include the textual biocalculus, or the recent use of existing process calculi (π-calculus), enriched with stochastic features, as the language to describe chemical interactions. (ref.)

    Yet another biological interaction network, and the last that we discuss here, is that of transport networks mediated by lipid membranes. Some lipids can self-assemble into membranes and contribute to the separation and transport of substances, forming transport networks. A biological membrane is more than a container: it consists of a lipid bilayer in which proteins and other molecules, such as glycolipids, are embedded. The membrane structural components, as well as the embedded proteins or glycolipids, can travel along this lipid bilayer. Proteins can interact with free-floating molecules, and some of these interactions trigger signal transduction pathways, leading to gene transcription. Basic operations of membranes include fusion of two membranes into one, and fission of a membrane into two. Other operations involve transport, for example transporting an object to an interior compartment where it can be degraded. Formalisms that depict the transport networks are few, and include membrane systems described earlier, and brane calculi.

    The gene regulatory networks, the protein-protein interaction networks, and the transport networks are all interlinked and interdependent. Genes code for proteins which, in turn, can regulate the transcription of other genes, membranes are separators but also embed active proteins in their surfaces. Currently there is no single formal general framework and notation able to describe all these networks and their interactions. Process calculus has been proposed for this purpose, but a generally accepted common language to describe these biological phenomena is still to be developed and universally accepted. It is indeed believed that one of the possible contributions of computer science to biology could be the development of a suitable language to accurately and succinctly describe, and reason about, biological concepts and phenomena.
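
The excerpt above mentions Boolean networks as one qualitative way to model gene regulation. As a minimal sketch of that idea (my own toy example, not taken from the paper), the code below treats each gene as either on or off and updates all genes at once from simple logic rules. The three genes `A`, `B` and `C`, their rules, and the helper names `step` and `find_attractor` are all invented for illustration and do not correspond to any real regulatory pathway.

```python
# Hypothetical three-gene Boolean network, invented for illustration only.
# Each rule computes a gene's next state (True = expressed) from the current state.
RULES = {
    "A": lambda s: not s["C"],         # C represses A
    "B": lambda s: s["A"],             # A activates B
    "C": lambda s: s["A"] and s["B"],  # A and B together activate C
}


def step(state):
    """Synchronously update every gene from the current state."""
    return {gene: rule(state) for gene, rule in RULES.items()}


def find_attractor(state, max_steps=100):
    """Iterate until a state repeats, then return the repeating cycle of states."""
    seen = []
    for _ in range(max_steps):
        key = tuple(state[g] for g in sorted(RULES))
        if key in seen:
            # Everything from the first visit of this state onward is the cycle.
            return seen[seen.index(key):]
        seen.append(key)
        state = step(state)
    raise RuntimeError("no attractor found within max_steps")


if __name__ == "__main__":
    start = {"A": False, "B": False, "C": False}
    for state in find_attractor(start):
        print(state)  # each tuple is the (A, B, C) expression pattern
```

In this toy network every starting state ends up in the same repeating cycle of expression patterns, which is the kind of stable behaviour ("attractor") that Boolean network models are used to look for. Real models would of course need rules derived from experimental data.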


    One problem in science is that, in order to understand things deeply, scientists typically need to specialize in one specific area of research. As Daphne Koller, a professor of computer science at Stanford University, relates in an interview about receiving the first-ever ACM-Infosys Foundation Award in the Computing Sciences(ref.):
    The world is very complex: people interact with other people as well as with objects and places. If you want to describe what’s going on, you have to think about networks of things that interact with one another. We’ve found that by opening the lens a little wider and thinking not just about a single object but about everything to which it’s tied, you can reach much more informed conclusions.

    [Interviewer] Which was an insight you brought to the field of artificial intelligence…
    Well, I wasn’t the only one involved. There had been two almost opposing threads of work in artificial intelligence: there were the traditional AI folks, who grew up on the idea of logic as the most expressive language for representing the complexities of our world. On the other side were people who came in from the cognitive reasoning and machine learning side, who said, “Look, the world is noisy and messy, and we need to somehow deal with the fact that we don’t know things with certainty.” And they were both right, and they both had important points to make, and that’s why they kept arguing with each other.

    How did probabilistic relational modeling help settle the dispute?
    The synthesis of logic and probability allows you to learn this type of holistic representation [of complex systems] from real-world data. It gives you the ability to learn higher-level patterns that talk about the relationships between different individuals in a reusable way.

    You’ve begun applying your techniques to the field of biology.
    Originally, it was a method in search of a problem. I had this technology that integrated logic and probability, and we had done a lot of work on understanding the patterns that underlay complex data sets. Initially, we were looking for rich data sets to motivate our work. But I quickly became interested in the problem in and of itself.

    What problem is that?
    Biology is undergoing a transition from a purely experimental science — where one studies small pieces of the system in a very hypothesis-driven way — to a field where enormous amounts of data about an entire cellular system can be collected in a matter of weeks. So we’ve got millions of data points that are telling us very important insights, and we have no idea how to get at them.

    What have you learned about interdisciplinary collaboration from your work with biologists?
    The important thing is to set up a collaborative effort where each side respects the skills, insights, and evaluation criteria of the other. For biologists to care about what you build, you need to convince them that it actually produces good biology. You have to train yourself to understand what things they care about, and at the same time you can train them in the methods of your community.

    So it’s not just learning a new scientific language, but training yourself to respect a different research process.
    It’s a question of finding people who are capable of learning enough of the other side’s language to make the collaboration productive.


    This sentiment is echoed in numerous papers I've come across, as well as in the poetic conclusion of "The Many Facets of Natural Computing":
    In these times brimming with excitement, our task is nothing less than to discover a new, broader, notion of computation, and to understand the world around us in terms of information processing.

    Let us step up to this challenge. Let us befriend our fellow the biologist, our fellow the chemist, our fellow the physicist, and let us together explore this new world. Let us, as computers in the future will, embrace uncertainty. Let us dare to ask afresh: “What is computation?”, “What is complexity?”, “What are the axioms that define life?”.

    Let us relax our hardened ways of thinking and, with deference to our scientific forebears, let us begin anew.

    More...
    Bulletin of the EATCS (2007): Machines of systems biology.
    Nature (Sept. 2002): Cellular abstractions: Cells as computation
    Cambridge University Press (1999): Computing and Mobile Systems - the π-Calculus
    Information Technology in Systems Biology (Kohn Maps)
    Developmental Biology (2007): The regulatory genome and the computer
    Science Signaling (2004): Molecular interaction map of the mammalian cell cycle control and DNA repair systems
    The Calculus of Looping Sequences for Modeling Biological Membranes
    IEEE (2007): A Uniform Framework of Molecular Interaction for an Artificial Chemistry with Compartments
