TRANSPATH Professional Documentation

MOLECULE

Molecule

Molecules interact with each other to build pathways. A molecule is anything that is subject to reactions. Most molecules have a mass, be it a small molecule like ATP, or a protein. No difference is made between receptors, enzymes, second messengers, transcription factors or other special kinds of proteins. A molecule can also be a group of such entities, such as a protein family; a state of such an entity, such as the phosphorylated form; or a complex of several other molecules. Finally, a molecule can be part of another molecule, either non-covalently bound as in a complex, or covalently bound as in a structural motif of a protein. The reason for such a wide scope for this class is to capture anything that shows specific signaling behavior.

A molecule can serve in four fundamental roles in the context of a reaction: it can be an educt, a product, an enzyme or a modulator. Enzyme, educt and modulator are the inputs of the reactions, while products are the outputs. For semantic reactions molecules are only grouped into signal donors, which are the inputs, and signal acceptors, which are the outputs.
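The four roles and their grouping into inputs and outputs can be sketched as a small data structure. This is only an illustration: the names `Role`, `Reaction`, `inputs` and `outputs` are hypothetical and not part of TRANSPATH, and the sketch assumes that each molecule takes a single role per reaction.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    """The four roles a molecule can take in a mechanistic reaction."""
    EDUCT = "educt"
    PRODUCT = "product"
    ENZYME = "enzyme"
    MODULATOR = "modulator"


# Per the text: enzyme, educt and modulator are inputs; products are outputs.
INPUT_ROLES = frozenset({Role.EDUCT, Role.ENZYME, Role.MODULATOR})


@dataclass
class Reaction:
    """Hypothetical reaction record mapping each participant to its role."""
    name: str
    participants: dict[str, Role] = field(default_factory=dict)

    def inputs(self) -> list[str]:
        return sorted(m for m, r in self.participants.items() if r in INPUT_ROLES)

    def outputs(self) -> list[str]:
        return sorted(m for m, r in self.participants.items() if r is Role.PRODUCT)


# Illustrative example: adenylate cyclase (AC) converting ATP to cAMP.
reaction = Reaction(
    name="ATP -> cAMP",
    participants={
        "ATP": Role.EDUCT,
        "cAMP": Role.PRODUCT,
        "AC": Role.ENZYME,
        "G_sa": Role.MODULATOR,
    },
)
print(reaction.inputs())   # ['AC', 'ATP', 'G_sa']
print(reaction.outputs())  # ['cAMP']
```

For a semantic reaction, the same structure would need only two roles, signal donor and signal acceptor.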

We use two other collections, modified forms and complexes, to keep the states/modified forms and complexes separated from the family grouping hierarchy.

Family

Grouping

Important attributes of all components are the collections that allow expression of family relationships. Each component can act as a group of other components or as a member of such a group. In the case of molecules, this grouping represents family relationships between the molecules. In the case of reactions, this grouping can be used to assign reactions to a general class of reactions, as is done successfully in the EC nomenclature. The third possibility is the use of mixed groups, which contain both reactions and molecules, and correspond to nets or explicitly stated pathways. This is interesting since it would support the appealing idea of building networks from subnets in a modular fashion.

Grouping molecules and motifs into families is essential for a usable signal transduction database. When the family relationships are stated in a formal way, it is easy to write algorithms that exploit them in user queries, while marking inherited properties as inferred information.

Family specifically addresses the issues of

molecule classification
isoforms
orthologs

The family relation is implemented by the groups/members collections.
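A minimal sketch of such a pair of collections is shown below. The class `Component`, its method names and the example entries are assumptions made for illustration; the source states only that the relation is held in groups/members collections.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Component:
    """Hypothetical entry with the two family collections named in the text."""
    name: str
    groups: list["Component"] = field(default_factory=list)   # one level up
    members: list["Component"] = field(default_factory=list)  # one level down

    def add_member(self, child: "Component") -> None:
        # Keep both directions of the relation in sync.
        self.members.append(child)
        child.groups.append(self)

    def ancestors(self) -> list[str]:
        """Names of all groups above this entry, nearest first, no duplicates."""
        seen: list[str] = []
        queue = list(self.groups)
        while queue:
            group = queue.pop(0)
            if group.name not in seen:
                seen.append(group.name)
                queue.extend(group.groups)
        return seen


# Hypothetical entries: one molecule that sits in two groups at once.
family = Component("family X")
isoforms = Component("X isoforms")
orthologs = Component("X1 orthologs")
x1_human = Component("X1 (human)")

family.add_member(isoforms)
isoforms.add_member(x1_human)
orthologs.add_member(x1_human)

print(x1_human.ancestors())  # ['X isoforms', 'X1 orthologs', 'family X']
```

Storing the relation in both directions makes upward queries (classification) and downward queries (members of a family) equally cheap, at the cost of having to update two lists together.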

Isoforms

Due to gene duplication and mutation within species, multiple paralogues have often evolved. For a single gene, different splice variants may exist. Sometimes in the literature a signaling activity is first attributed to a single molecule, and later it is discovered that there is a whole group of similar molecules. Therefore, a special type of molecule entry is needed, which we term "isoform". This abstracted group entry collects all the non-specific information.

For example, consider adenylate cyclase (AC, Fig. 1). There are at least ten isoforms (AC1-9, sAC), all of which are activated by G_sa. Only isoforms AC1, AC3 and AC8 are Ca/CaM dependent; AC2, AC4 and AC7 have no inhibition mechanism; and AC5 and AC6 are inhibited by G_ia or calcium.

Figure 1: Adenylate cyclase family tree. AC group 1 sums up the calcium dependent forms, group 2 contains the forms without inhibition mechanism and in group 3 are the forms inhibited by G_ia or calcium.

Orthologs

In many cases, reactions between proteins have been shown for only one of several model species, such as rat, clawed frog or fruit fly. It is common to scan sequence databases for orthologous sequences and assume, if these can be found, that they play a similar role in signal transduction. This orthologous grouping makes it possible to stitch together pathways from bits and pieces that have been investigated in different species.

Classification methods

Since it is easier to examine the sequence of a gene or protein than to investigate its function, a certain behavior is usually shown in detail for only a few members of a protein family, and homologues are added to the functional group by sequence or structural similarity.

Based on this assumption, various good databases exist that try to classify proteins and map sequence motifs to functional annotation. They cluster proteins by multiple sequence alignments and use common structural motifs, "profile" patterns, or Hidden Markov Models derived from these alignments to classify new proteins. Sometimes these methods can correctly predict the function, but sometimes not. Thus it is common practice to group molecules into families on the basis of sequence similarities, even if they do not share common behavior.

For a signaling database, it is advantageous to group molecules that show common signaling behavior, that is, to group them by function. Since it is the function we are interested in, we would like to group only by function. On the other hand, we would like to draw as much as possible upon expert knowledge, and stay coherent with the groupings that already exist. To solve this dilemma, the best option seems to be that we group the molecules as it is done traditionally, but link signaling only to those molecules for which it has been shown.

Given the tree for adenylate cyclase in Fig. 1, if the calmodulin-dependent reaction has been demonstrated for AC1, we link that reaction to AC1 only. We link statements made on a generalized level, like the ones from review papers, to nodes on a higher level in the molecule hierarchy. Given the tree we can, for example, link the general G_sa activation to the AC group. The context of the original literature statement is preserved.
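The linking rule can be illustrated with a short lookup that walks up the hierarchy and marks anything found on a higher node as inferred. This is a sketch under simplifying assumptions: the dictionaries `PARENT` and `LINKED` and the function `reactions_for` are hypothetical, only part of the tree in Fig. 1 is reproduced, and each entry is given a single parent.

```python
# Hypothetical, simplified excerpt of the tree in Fig. 1: child -> parent.
PARENT = {
    "AC1": "AC group 1",
    "AC3": "AC group 1",
    "AC8": "AC group 1",
    "AC2": "AC group 2",
    "AC group 1": "AC",
    "AC group 2": "AC",
}

# Each statement is linked at the level at which the literature made it.
LINKED = {
    "AC": ["activation by G_sa"],
    "AC1": ["Ca/CaM-dependent activation"],
}


def reactions_for(molecule: str) -> list[tuple[str, str]]:
    """Return (reaction, status) pairs; status is 'direct' or 'inferred from X'."""
    result = [(r, "direct") for r in LINKED.get(molecule, [])]
    node = PARENT.get(molecule)
    while node is not None:
        result += [(r, f"inferred from {node}") for r in LINKED.get(node, [])]
        node = PARENT.get(node)
    return result


print(reactions_for("AC1"))
# [('Ca/CaM-dependent activation', 'direct'), ('activation by G_sa', 'inferred from AC')]
print(reactions_for("AC3"))
# [('activation by G_sa', 'inferred from AC')]
```

Because the calmodulin-dependent reaction is attached to AC1 alone, a query for AC3 does not return it, even though both entries belong to the same group.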

States (modified forms)

The unmodified form of a protein and all its modified forms are its states, where the modification can be by covalent binding, by complexation, or by change of environment. The protein per se is a concept that is based on the observation that there is only one gene coding for each protein sequence. All the states share the same gene and consequently part of their structure, the amino acid chain. They are functionally related, often even reversibly transformable into each other.

A basic molecule entry captures this concept and is the class of all states of a protein. These states are different molecules, and we store them as different molecule entries. As molecules, they can be used in a pathway assembly. We store general information like the amino acid sequence in the basic molecule entry and link its states to it.

In the simplest case there are only two states: an inactive one and an active one. In other cases, there are more. For example, a transcription factor can be

1. de-phosphorylated in the cytosol
2. phosphorylated in the cytosol or
3. phosphorylated and bound to DNA in the nucleus

to name a few possibilities. The same protein will exhibit distinctly different signaling functions in these three states. For example, in state 1 it will be susceptible to phosphorylation; in state 2, to translocation into the nucleus or dephosphorylation; and in state 3 it will activate transcription.

The number of states for a molecule is the product of the number of its modified forms and the number of locations where it is found. Only compounds which share the same location interact in nature.

It is impractical to enter a separate state for each location. Most molecules can be found in several tissues, at several developmental stages, in several cellular compartments, several organs and several cell types. Entering a state for each possible combination would lead to an explosion in the number of states, and to redundancy in the reactions. This problem is circumvented by using a list of positive and negative locations that is linked to the basic molecule.
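The following sketch contrasts the two options: enumerating one state per combination, and keeping positive and negative location lists on the basic molecule. All values and names (`basic_molecule`, `location_status`) are illustrative assumptions and are not taken from any TRANSPATH entry.

```python
from itertools import product

# Illustrative values only; none of these come from a TRANSPATH entry.
modified_forms = ["unmodified", "phosphorylated"]
locations = ["cytosol", "nucleus", "plasma membrane"]

# One state per (modified form, location) pair: the count is the product.
explicit_states = list(product(modified_forms, locations))
assert len(explicit_states) == len(modified_forms) * len(locations)  # 2 * 3 = 6

# Alternative from the text: keep locations as two lists on the basic molecule.
basic_molecule = {
    "name": "example TF",
    "modified_forms": modified_forms,
    "location_positive": {"cytosol", "nucleus"},
    "location_negative": {"plasma membrane"},
}


def location_status(molecule: dict, location: str) -> str:
    """Three-valued answer: 'found', 'not found', or 'unknown' (no entry)."""
    if location in molecule["location_positive"]:
        return "found"
    if location in molecule["location_negative"]:
        return "not found"
    return "unknown"


print(len(explicit_states))                              # 6
print(location_status(basic_molecule, "nucleus"))        # found
print(location_status(basic_molecule, "mitochondrion"))  # unknown
```

With the list approach, adding a location adds one list entry rather than one new state per modified form.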

In each state, the molecule is available only for a subset of all reactions for that molecule. Receiving a signal changes the molecule's state, usually leading to a new state from which reactions are triggered (Fig. 2).

Figure 2: State switching
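State switching can be pictured as a small transition table, here for the transcription factor example with its three states. The table layout and the names `TRANSITIONS`, `available_reactions` and `apply` are assumptions made for this sketch. It also simplifies by letting translocation lead directly to the DNA-bound nuclear state.

```python
# States from the transcription factor example; the reaction names are
# paraphrased from the text, and the table layout itself is an assumption.
TRANSITIONS = {
    "dephosphorylated, cytosol": {
        "phosphorylation": "phosphorylated, cytosol",
    },
    "phosphorylated, cytosol": {
        "translocation into the nucleus": "phosphorylated, DNA-bound, nucleus",
        "dephosphorylation": "dephosphorylated, cytosol",
    },
    "phosphorylated, DNA-bound, nucleus": {
        # Activating transcription does not change the state in this sketch.
        "activation of transcription": "phosphorylated, DNA-bound, nucleus",
    },
}


def available_reactions(state: str) -> list[str]:
    """Reactions the molecule is available for in the given state."""
    return sorted(TRANSITIONS[state])


def apply(state: str, reaction: str) -> str:
    """Return the state reached after the reaction, or fail if unavailable."""
    if reaction not in TRANSITIONS[state]:
        raise ValueError(f"{reaction!r} is not available in state {state!r}")
    return TRANSITIONS[state][reaction]


state = "dephosphorylated, cytosol"
for step in ("phosphorylation", "translocation into the nucleus"):
    state = apply(state, step)

print(state)                       # phosphorylated, DNA-bound, nucleus
print(available_reactions(state))  # ['activation of transcription']
```

Each state exposes only its own subset of reactions, so a request such as transcription activation from the cytosolic states is rejected.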

Motifs are structural and functional features of a protein and are often responsible for its signaling behavior. A single protein can have several signaling motifs. Motifs are listed in the feature field together with position information and references.

Figure 3 sums up all the various hierarchical relations that molecules in the database can have.

Figure 3: Hierarchical relations and roles for molecules, shown for a subset of the T-cell protein tyrosine phosphatase (TC-PTP) family. Family molecules are groups of related molecules or of other molecule groups. Orthologs group homologous molecules from different species, while isoforms group homologous molecules in one organism. Basic molecules are translated proteins or small molecules that have mass. Complexes and modified forms change the availability of the molecule for reactions.

Fields

It should be noted that in individual entries, some fields may be empty. In this case, these fields are not displayed.

Field		Content and format
AC	Accession number	The accession number is the unique identifier for each entry. Its format is "MO" in capital letters followed by nine digits (e.g. MO000012345). If two entries have to be merged, the AC of the primary name is retained. The other AC will be stored in the secondary accession numbers (AS) field.
AS	Accession numbers, secondary	The secondary accession numbers are optional alternate identifiers for each entry. They have the same form as the accession number (AC), are separated by semicolons, and are created when two entries are merged.
DT	Created by	The name of the curator who created the entry. E-mail your feedback directly to the curator.
DT	Updated by	The name of the curator who last updated the entry. E-mail your feedback directly to the curator.
CO	Copyright-information
NA	Name	A human-readable identifier for the component. The most common spelling in the literature is used, with a preference for forms with a dash (e.g. Grb-2 instead of Grb2) if both forms exist. Non-abbreviated molecule names are written according to their most common appearance, in most cases beginning with a lower-case letter. Note that Greek letters are expanded to alpha, beta, gamma, etc. Molecules (transcription factors) with a lot of detailed information available in TRANSFAC® have a hyperlink with the respective TRANSFAC accession number in front of their name. If a molecule is chemically modified and in a different state, the name is tagged with an abbreviation in {} brackets (see List of abbreviations and their meanings). There are also tags to differentiate the species the molecule comes from (see List of species tags). This short identifier is useful in reaction names, because the experimental evidence for the reactions is frequently based on molecules from different species. The tag (v.s.) for vertebrate species is used when the exact (vertebrate) species has not been described in the reference and could not be determined from cited references or websites. Stoichiometry is used only in complexes: the notation for homodimers is A:A (or A(2)); in larger complexes, the number of identical components is written in brackets after the respective molecule, for example A:B(2):C:D(4). Queries on the search field name automatically include the fields fullname and synonyms.
SY	Synonyms	Other names for the same component. This is needed, since the names in biosciences change often. The field lists other names or abbreviations. Orthologs from other species are not stored here. They are accessible via the group-member hierarchy. Different synonyms are separated by a semicolon. The field synonyms is automatically included if you run a query with the search field name. Since professional version 3.4 the synonyms field also contains the information from the fullname field.
GE	Encoding gene	The gene encoding this protein.
OS	Species	The species that the molecule entry belongs to. Given is the common name (if it exists), followed by the Latin denomination, as it is done in TRANSFAC.
CL	Classification	This sorts the entry into all the groups it belongs to. The classification stretches over all hierarchical levels that are "above" the entry. As a molecule can belong to more than one functional group, several classification paths can exist. Main paths are marked.
TY	Type	The type of this molecule entry. Possible values are: family, for superfamilies and families; group (XOR), a grouping entry for molecules with the same function (e.g. isoforms) in a pathway, to be interpreted such that exactly one member (exclusive OR) of this group is involved in a concrete instance of the pathway; ortholog, a group entry for homologous molecules in different species; isoform, a group entry for homologous molecules in one species that have emerged from gene duplication, alternative splicing of mRNA, etc.; basic, for real isoforms that have a polypeptide chain; complex, for complexes consisting of non-covalently bound molecules; other, for lipids and second messengers (such as DAG, IP3, NO, cAMP...).
NP	NetPro entry	The corresponding NetPro entry.
HP	Superfamilies	Lists all groups or families this component directly belongs to (one hierarchical level above). This is a very important field, since abstracting common signaling behaviour is needed to avoid an explosion of entries. For molecules, a group is a set of molecules which share a common signaling behaviour. For example, all isoforms of the gamma subunit of G-proteins could be grouped into one group. Also, orthologs from different species can be grouped together. At the bottom of the whole hierarchy, there should be proteins (and other molecules) which physically exist.
HC	Subfamilies	Lists all molecule entries that are a step downward in the hierarchy, e.g. splice variants of a molecule or members of a group/family.
SZ	Sequence length, molecular weight	The total number of amino acids in the given sequence. The calculated molecular weight of the protein in daltons or kilodaltons (derived from cDNA/genomic clones). Experimental molecular mass (or range) in kDa (experimental method, e.g. SDS-PAGE, GF/gel filtration), [reference].
IP	Isoelectric point	The calculated isoelectric point of the protein. Experimental isoelectric point (experimental method), [reference].
SC	Sequence source	Names the data source (e.g. database) the amino acid sequence has been derived from.
SQ	Sequence	Shows the amino acid sequence in one-letter-code.
DR	External database hyperlink	Database name (e.g. SwissProt): database accession number; identifier. The focus lies on linking TRANSPATH with SwissProt, EMBL, InterPro, LocusLink, UniGene, GO, DIP, BIND, and HyperCLDB. For some of the molecules, links to PDB, PROSITE, Flybase, MGD, and others are also provided. Also, corresponding Affymetrix micro-array probe set identifiers are listed. Data are available for the following chips: U95A, U95B, U95C, U95D, U95E, U95Av2, U133A, U133B, HuGeneFL. The format is AFFYMETRIX:chip:probeset. Except for those from chip HuGeneFL, the Affymetrix links are based on those in Ensembl, v.14.31 for human and v.14.30 for mouse.
DR	External database hyperlink (of encoding gene)	Hyperlinks of the encoding gene. They are included to retrieve matches in an array data analysis that is focussed on gene products.
FT	Features (motifs)	Lists all features/motifs/domains of the molecule (e.g. SH2 domain, signal peptide...) that are important in the various signaling contexts the molecule is involved in. Gives the first and last position of the feature/motif in N → C direction on the AA sequence. Structural and functional features/motifs that are annotated from the literature are named by the common motif name and have a reference link. Automatically annotated features/domains that were calculated using Pfam hidden Markov models (HMM) are also shown in this field. These features are characterized by their Pfam model name, a link to the corresponding Pfam entry, a raw score and an E-value (expectation value). For further explanations please take a look at the TRANSPATH® Report 1, 0001 (2003).
	Scheme of	Shows a graphical representation of the listed features and their positions.
CC	Comments	A list of comments. Further information about the different categories of comments is available at Annotate.
GO	GO: biological process, molecular function	A list of Gene Ontology (GO) terms from the ontologies 'molecular function' and 'biological process'. Associated terms from the third GO ontology, 'cellular component', can be found in the field 'Location positive and experiment(s)'.
CP	Location positive and experiment(s)	Lists all locations (tissues, physiological cell types, cell lines, cellular compartments, developmental stages) where the compound was found. Gives experiment abbreviation, the signal strength in the experiment as expression level and cites reference (if available). For further information please see Location. Abbreviations for the types of experiments (methods) used to verify the abundance of the molecule are given in brackets. For an acronym explanation table please see Methods.
CN	Location negative and experiment(s)	Lists all locations where the compound was NOT found. Because negative findings are recorded explicitly, the absence of any entry for a location means that nothing is known about the component's presence there. Abbreviations for the types of experiments (methods) used to verify the absence of the molecule are given in brackets. For an acronym explanation table please see Methods.
ST	Complex or modified form of	A list of molecules this molecule entry is a modified form of or, if it is a complex, a list of its subunits.
MF	Modified forms	A list of modified forms for this molecule. Modified forms do not have their own AA sequence; it is given in the linked "Modified form of" entry.
CX	Complexes	A list of complexes this molecule is engaged in.
XB	Reaction upstream	A list of reactions which produce this molecule (in the mechanistic view), or which lead to this molecule (in the semantic view). So the molecule serves either as a product or a signal acceptor.
XA	Reaction downstream	A list of reactions that consume this molecule (in the mechanistic view), or which go out from this molecule (in the semantic view). Here the molecule serves either as a substrate or as a signal donor.
XC	Reaction catalyzed	A list of reactions which are catalyzed by this molecule.
XI	Reaction inhibited	A list of reactions which are inhibited by this molecule.
PW	Pathways	Indicates all the pathways and chains in which the respective molecule is involved.
RN	Reference number	[consecutive entry reference number]. A list of the papers from which the information in this entry was extracted. For further information have a look at Reference.
RX	Pubmed database hyperlink	The PMID number in PubMed
RA	Reference author(s)	List of authors. See Reference.
RT	Reference title	See Reference.
RL	Reference publication	See Reference.
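
Several of the field formats described in the table are regular enough to check or parse mechanically. The sketch below covers the accession number (AC/AS), the stoichiometry notation of the Name field and the Affymetrix link format of the DR field. All function names are hypothetical, the probe set identifier in the example is made up, and the complex-name parser assumes that component names contain no colons or parentheses of their own, which will not hold for names carrying tags such as (v.s.).

```python
import re

AC_PATTERN = re.compile(r"MO\d{9}")  # "MO" plus nine digits, per the AC field


def is_valid_accession(value: str) -> bool:
    return AC_PATTERN.fullmatch(value) is not None


def parse_secondary_accessions(value: str) -> list[str]:
    """Split an AS field on semicolons and keep only well-formed accessions."""
    parts = [p.strip() for p in value.split(";")]
    return [p for p in parts if is_valid_accession(p)]


# A component name, optionally followed by a count in brackets, e.g. "B(2)".
COMPONENT = re.compile(r"(?P<name>[^:()]+)(?:\((?P<count>\d+)\))?")


def parse_complex_name(name: str) -> dict[str, int]:
    """Count components in names such as 'A:B(2):C:D(4)' or 'A:A'."""
    counts: dict[str, int] = {}
    for part in name.split(":"):
        match = COMPONENT.fullmatch(part)
        if match is None:
            raise ValueError(f"unrecognized component: {part!r}")
        n = int(match.group("count") or 1)
        counts[match.group("name")] = counts.get(match.group("name"), 0) + n
    return counts


def parse_affymetrix_link(value: str) -> tuple[str, str]:
    """Split 'AFFYMETRIX:chip:probeset' into (chip, probeset)."""
    prefix, chip, probeset = value.split(":", 2)
    if prefix != "AFFYMETRIX":
        raise ValueError(f"not an Affymetrix link: {value!r}")
    return chip, probeset


print(is_valid_accession("MO000012345"))               # True
print(parse_complex_name("A:B(2):C:D(4)"))             # {'A': 1, 'B': 2, 'C': 1, 'D': 4}
print(parse_complex_name("A:A"))                       # {'A': 2}
print(parse_affymetrix_link("AFFYMETRIX:U95A:1000_at"))  # ('U95A', '1000_at')
```

Both homodimer notations, A:A and A(2), yield the same count in this sketch, which makes complex names comparable regardless of which form a curator used.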