First published online 8 August 2007
doi: 10.1242/dev.001073


Development 134, 3227-3238 (2007)
Published by The Company of Biologists 2007


Automated data integration for developmental biological research

Weiwei Zhong and Paul W. Sternberg*

HHMI and Division of Biology, Caltech, 1200 E. California Blvd, Pasadena, CA 91125, USA.



 
Fig. 1. Experiments that establish relationships between genes, proteins, cells and functions. Most genome-wide data sets describe biological entities or draw connections between entities. For example, DNA sequence is linked to genes by gene prediction and experimental annotation (e.g. cDNA sequencing). Genes are associated with other genes by genetic interactions. Proteins are related by physical binding, e.g. as detected in yeast two-hybrid assays. Proteins are shown to interact with DNA through chromatin immunoprecipitation (ChIP) and yeast one-hybrid assays (e.g. Deplancke et al., 2006). Genes and proteins are assigned functions based on perturbations (mutations, overexpression, RNAi). Cells are associated with genes and proteins by gene expression. Cells (or tissues) are associated with functions by mechanical (e.g. laser ablation) or genetic (e.g. mutation) lesion experiments or by generating genetic mosaics.

 


 
Fig. 2. Examples of bio-ontologies. An ontology captures relationships among terms and their definitions in a structured way. A structure used in many of the current ontologies is a 'directed acyclic graph', which differs from a tree or outline in that one term can connect to many terms, but each connection is oriented (shown by arrows rather than by lines) and no cycles are allowed. Commonly used relationships are 'Is a' (term A is an example of term B) and 'Part of' (structure A is part of structure B); see www.geneontology.org or www.bioontology.org for more information. (A) Phenotype ontology. Reproductive system development defects include vulval developmental abnormalities, which include more-specific phenotypes, such as vulvaless and abnormal cell-fate specification. (B) Anatomy ontology. The intestine is part of the 'digestive tract' and 'alimentary system' and is an 'organ'. The intestine comprises intestinal cells, intestinal lumen and intestinal muscle. (A and B from WormBase WS180.) (C) Biological processes in the Gene Ontology (GO). 'Spinal cord development' is a case of 'anatomical structure development' and is part of 'central nervous system (CNS) development'. Spinal cord development comprises the development of sub-structures and includes both cell differentiation and patterning. (From GO Biological Process.)
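
As an illustration of the directed acyclic graph structure described in this caption, the following minimal Python sketch encodes the anatomy terms from panel B and collects every term reachable from a starting term. The dictionary layout, the relationship labels `is_a` and `part_of`, and the function name `ancestors` are assumptions made for this example; they are not taken from WormBase or GO software.

```python
# Hypothetical encoding of Fig. 2B: each term maps to a list of
# (relationship, parent term) pairs. Edges point from child to parent.
parents = {
    "intestinal cell": [("part_of", "intestine")],
    "intestinal lumen": [("part_of", "intestine")],
    "intestinal muscle": [("part_of", "intestine")],
    "intestine": [
        ("part_of", "digestive tract"),
        ("part_of", "alimentary system"),
        ("is_a", "organ"),
    ],
}


def ancestors(term, graph):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in graph.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("intestinal cell", parents)))
```

Because a term may have several parents, a single query for 'intestinal cell' returns the intestine together with all three of its parent terms, which a strict tree could not represent.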

 


 
Fig. 3. Correlating a spectrum of phenotypes. A set of 14 phenotypes for eight genes is indicated by the presence (blue) or absence (yellow) of the phenotype. In this example, genes A and B are perfectly correlated (14 of 14 phenotypes), genes C and D are tightly correlated (12 of 14 phenotypes), and genes A-D are more correlated with each other than with E-H. This data representation allows genes and phenotypes to be clustered and pairwise correlation coefficients to be calculated.
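
The pairwise comparison described in this caption can be sketched in a few lines of Python. The presence/absence profiles below are invented for illustration (the actual values in the figure are not given in the text); they are chosen only so that genes A and B agree on 14 of 14 phenotypes and genes C and D agree on 12 of 14. The function names `agreement` and `pearson` are likewise assumptions.

```python
from math import sqrt

# Hypothetical profiles: 1 = phenotype present, 0 = phenotype absent.
profiles = {
    "A": [1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
    "B": [1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
    "C": [1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    "D": [1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0],
}


def agreement(x, y):
    """Count the phenotypes on which two genes have the same call."""
    return sum(1 for a, b in zip(x, y) if a == b)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sqrt(var_x) * sqrt(var_y))


if __name__ == "__main__":
    for g1, g2 in [("A", "B"), ("C", "D"), ("A", "C")]:
        x, y = profiles[g1], profiles[g2]
        print(g1, g2, agreement(x, y), round(pearson(x, y), 3))
```

A matrix of such coefficients for all gene pairs is the usual input to a clustering step, although the clustering method itself is not specified here.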

 


 
Fig. 4. Three methods of assigning orthology relationships. Species are designated by letters and paralogs by numbers. (A) KOG. NCBI KOG identifies reciprocally best-matching proteins from BLAST searches as orthologs. An ortholog group is thus defined as the union of best BLAST hits among all pairwise comparisons of multiple species. In the example shown, species A and species B each have a 1:1 ortholog, but species C has two orthologous proteins. (B) InParanoid. Since inter-genome reciprocal best BLAST analysis forces a one-to-one relationship, InParanoid also detects intra-genome best BLAST hits as co-orthologs. Solid arrows, inter-genome BLAST; dashed arrows, intra-genome BLAST. (C) TreeFam. In this approach, the relationships among proteins are defined by phylogenetic analysis.
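
The reciprocal best-hit idea underlying panel A can be illustrated with a short Python sketch. It is not the KOG pipeline: the protein identifiers, the scores and the function names `best_hits` and `reciprocal_best_hits` are all hypothetical, and the input is assumed to be a precomputed table of similarity scores rather than raw BLAST output.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring subject protein."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _score) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) in which each protein is the other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical similarity scores between proteins of species A and species B.
a_vs_b = {("A1", "B1"): 250, ("A1", "B2"): 90, ("A2", "B2"): 180, ("A3", "B1"): 120}
b_vs_a = {("B1", "A1"): 245, ("B1", "A3"): 118, ("B2", "A2"): 175, ("B2", "A1"): 88}

if __name__ == "__main__":
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this toy input, A3 matches B1 best, but B1 matches A1 best, so A3 is left without an ortholog. This is the forced one-to-one behaviour that InParanoid addresses by also considering intra-genome hits.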

 


 
Fig. 5. Four examples of statistical models for data integration. (A) Voting system. Each circle represents one data set and has one vote. Gray numbers indicate total votes. Data that are confirmed by multiple data sets have multiple votes. In this example, there are three data sets; thus, three is the maximum number of possible votes. (B) Support vector machine. Blue circles indicate positives in the training set and yellow squares represent negatives. In this example, there are two attributes (as represented by the x- and y-axes) for each data point. The data are plotted based on the values of these attributes. A function f is used to convert the data points so that they become linearly separable. The training set is used to derive the line (red) that separates positives from negatives. (C) Decision tree. In this hypothetical tree, the goal is to classify the input items into two categories, X and Y, which are denoted as blue circles and yellow squares, respectively. The category of each item is hidden, but we know the values of its three attributes (A, B, C). We use a set of conditions (represented by pink diamonds) to evaluate these attributes. Based on their values, we separate the items into subsets. The separation continues until the final outcome of the items (leaf nodes, represented by green boxes) is reached. (D) Bayesian network. In Bayesian networks, nodes represent variables and edges represent variable dependencies. Here, each node represents a Boolean variable, the value of which is denoted as true (T) or false (F) in conditional probability tables. The edges indicate that the value of B is dependent on the value of A and that the value of D is influenced by the values of both A and C. The conditional probability tables detail such dependency. For example, the probability of B being true is 0.8 if A is true; the probability drops to 0.4 if A is false. This network enables us to derive probabilities from different attribute values, for example, the probability of A being true given that B is true.
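
The last inference mentioned for panel D, the probability of A being true given that B is true, follows from Bayes' rule. The Python sketch below uses the two conditional probabilities stated in the caption (0.8 and 0.4); the prior probability of A is not given in the text, so the value 0.5 used here is purely an assumption, as is the function name `posterior_a_given_b`.

```python
def posterior_a_given_b(p_a, p_b_given_a, p_b_given_not_a):
    """Apply Bayes' rule: P(A | B) = P(B | A) P(A) / P(B)."""
    # Total probability of B, summed over both states of A.
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)
    return p_b_given_a * p_a / p_b


# From the caption: P(B=T | A=T) = 0.8 and P(B=T | A=F) = 0.4.
P_B_GIVEN_A = 0.8
P_B_GIVEN_NOT_A = 0.4
# Assumed prior; the caption does not state P(A=T).
P_A = 0.5

if __name__ == "__main__":
    print(posterior_a_given_b(P_A, P_B_GIVEN_A, P_B_GIVEN_NOT_A))
```

Under the assumed prior of 0.5, observing that B is true raises the probability of A from 0.5 to 2/3; a different prior would give a different posterior.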

 

© The Company of Biologists Ltd 2007