TRANSPATH Professional Documentation

MOLECULE

Molecule

Molecules interact with each other to build pathways. A molecule is anything that is subject to reactions. Most molecules have a mass, be it a small molecule like ATP, or a protein. No difference is made between receptors, enzymes, second messengers, transcription factors or other special kinds of proteins. A molecule can also be a group of such entities, like a protein family, a state of such an entity, like the phosphorylated form or a complex of several other molecules. And finally a molecule can be part of another molecule, either non-covalently bound as in a complex, or covalently bound such as ubiquitin-like proteins. The reason for such a wide scope for this class is to catch anything that shows specific signaling behavior.

A molecule can serve in four fundamental roles in the context of a reaction: it can be an educt, a product, an enzyme or a modulator. Enzyme, educt and modulator are the inputs of the reactions, while products are the outputs. For semantic reactions molecules are only grouped into signal donors, which are the inputs, and signal acceptors, which are the outputs.
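The four roles and their split into inputs and outputs can be pictured with a small data structure. The following Python sketch is only an illustration: the class names, field names and the example molecules are assumptions, not part of the TRANSPATH schema.

```python
from dataclasses import dataclass, field


@dataclass
class Reaction:
    """Hypothetical container for one mechanistic reaction."""
    educts: list = field(default_factory=list)
    products: list = field(default_factory=list)
    enzymes: list = field(default_factory=list)
    modulators: list = field(default_factory=list)

    def inputs(self):
        # Enzymes, educts and modulators are the inputs of a reaction.
        return self.enzymes + self.educts + self.modulators

    def outputs(self):
        # Products are the outputs of a reaction.
        return list(self.products)


@dataclass
class SemanticReaction:
    """Hypothetical container for one semantic reaction."""
    signal_donors: list = field(default_factory=list)     # inputs
    signal_acceptors: list = field(default_factory=list)  # outputs


# Made-up example: enzyme A phosphorylates B using ATP.
r = Reaction(educts=["B", "ATP"], products=["B{p}", "ADP"], enzymes=["A"])
s = SemanticReaction(signal_donors=["A"], signal_acceptors=["B"])
```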

We use two other collections, modified forms and complexes, to keep the states/modified forms and complexes separated from the family grouping hierarchy.

Orthofamily/Family

Grouping

Important attributes of all components are the collections that allow expression of family relationships. Each component can act as a group of other components or as a member in such a group. In the case of molecule, this grouping represents family relationships between the molecules. In the case of reaction, this grouping can be used to assign reactions to a general class of reactions, as it is done successfully in the EC nomenclature. The third possibility is the use of mixed groups, which contain both reactions and molecules, and correspond to nets or explicitly stated pathways. This is interesting since it would support the appealing idea of building networks from subnets in a modular fashion.

Grouping molecules into families is essential for a usable signal transduction database. When the family relationships are stated in a formal way, it is easy to write algorithms that exploit them in user queries, while marking inherited properties as inferred information. The prefix "ortho" is used when the family entry is not specific for a certain species or higher taxon (Fig. 1).

The family relation is implemented by the groups/members collections.

Orthogroup/Isogroup

For a single gene, different isoforms such as splice variants may exist. Sometimes in the literature a signaling activity is first attributed to a single molecule, and later it is discovered that there is a whole group of similar molecules. Therefore, a special type of molecule entry is needed, which we term "isogroup" for taxon-specific entries and "orthogroup" for orthologous, non-species-specific ones. All information for which the specific isoform involved is not known can be assigned to these abstracted group entries.

Orthobasic/Basic

Molecules of the type "basic" contain data for a specific isoform, e.g. splice variant, to which an amino acid sequence can be assigned. Again, the prefix "ortho" is used to generalize information for orthologous isoforms from different species.

Figure 1 sums up all the various hierarchical grouping relations that molecules in the database can have.

Figure 1: Hierarchical grouping relations and roles for molecules. A definition for the different types is given below. A modification tag '_mod' is added to the type terms listed above if the entry carries a modification such as phosphorylation. Complexes and modified forms change availability of the molecule for reactions.

Classification methods

Since it is easier to examine the sequence of a gene or protein than to investigate its function, a certain behavior is usually shown in detail for only a few members of a protein family, and homologs are added to the functional group by sequence or structural similarity.

Based on this assumption, various good databases exist that try to classify proteins and map sequence motifs to functional annotation. They cluster proteins by multiple sequence alignments and use common structural motifs, "profile" patterns, or Hidden Markov Models derived from these alignments to classify new proteins. Sometimes these methods can correctly predict the function, but sometimes not. Thus it is common practice to group molecules into families on the basis of sequence similarities, even if they do not share common behavior.

For a signaling database, it is advantageous to group molecules that show common signaling behavior, that is, to group them by function. Since it is the function we are interested in, we would like to group only by function. On the other hand, we would like to draw as much as possible upon expert knowledge, and stay coherent with the groupings that already exist. To solve this dilemma, the best option seems to be that we group the molecules as it is done traditionally, but link signaling only to those molecules for which it has been shown.

Given the hierarchical tree in Fig. 1, if the reaction has been demonstrated for A(h), we link that reaction to A(h) only. We link statements made on a generalized level, like the ones from review papers, to nodes on a higher level in the molecule hierarchy. Given the tree, we can link the general activation of A-like proteins to the orthofamily entry "A-like". In this way, the context of the original literature statement is preserved.
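The linking policy can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the class `Molecule`, the helper functions and the reaction strings are hypothetical, and only the idea of groups/members collections and of marking inherited statements as inferred comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare by identity; groups and members reference each other
class Molecule:
    name: str
    entry_type: str
    groups: list = field(default_factory=list)     # entries one level above
    members: list = field(default_factory=list)    # entries one level below
    reactions: list = field(default_factory=list)  # reactions linked directly


def add_member(group, member):
    """Register the family relation in both collections."""
    group.members.append(member)
    member.groups.append(group)


def reactions_for(molecule):
    """Return (reaction, inferred) pairs for a molecule.

    Directly linked reactions are reported with inferred=False;
    reactions inherited from any group above are marked inferred=True.
    """
    result = [(reaction, False) for reaction in molecule.reactions]
    seen = {id(molecule)}
    stack = list(molecule.groups)
    while stack:
        group = stack.pop()
        if id(group) in seen:
            continue
        seen.add(id(group))
        result.extend((reaction, True) for reaction in group.reactions)
        stack.extend(group.groups)
    return result


# Made-up entries following the A(h) / "A-like" example.
a_like = Molecule("A-like", "orthofamily",
                  reactions=["general activation of A-like proteins"])
a_h = Molecule("A(h)", "basic", reactions=["reaction demonstrated for A(h)"])
add_member(a_like, a_h)
```

Querying `reactions_for(a_h)` returns the demonstrated reaction as direct evidence and the review-level statement as inferred, while `reactions_for(a_like)` returns only the general statement, so the specific finding is never generalized upward.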

States (modified forms)

The unmodified form of a protein and all its modified forms are its states, where the modification can be by covalent binding, by complexation, or by change of environment. The protein per se is a concept that is based on the observation that there is only one gene coding for each protein sequence. All the states share the same gene and consequently part of their structure, the amino acid chain. They are functionally related, often even reversibly transformable into each other.

A basic molecule entry captures this concept and is the class of all states of a protein. These states are different molecules, and we store them as different molecule entries. As molecules, they can be used in a pathway assembly. We store general information like the amino acid sequence in the basic molecule entry and link its states to it.

In the simplest case there are only two states: an inactive one and an active one. In other cases, there are more. For example, a transcription factor can be

1. dephosphorylated in the cytosol,
2. phosphorylated in the cytosol, or
3. phosphorylated and bound to DNA in the nucleus,

to name a few possibilities. The same protein will exhibit distinctly different signaling functions in these three states. For example, in state 1 it will be susceptible to phosphorylation, in state 2 to translocation into the nucleus or dephosphorylation, and in state 3 it will activate transcription.

The number of states for a molecule is the product of the number of its modified forms and the number of locations where it is found. Only compounds which share the same location interact in nature.

It is impractical to enter a separate state for each location. Most molecules can be found in several tissues, at several developmental stages, in several cellular compartments, several organs and several cell types. To enter a state for each possible combination would lead to an explosion in the number of states, and redundancy in the reactions. This problem is circumvented by using a list of positive and negative locations that is linked to the basic molecule.
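The arithmetic behind this explosion, and the alternative of location lists, can be sketched as follows. The modified forms, locations, class name and molecule name are hypothetical example values, not data from a TRANSPATH entry.

```python
from dataclasses import dataclass, field
from itertools import product

# Hypothetical example values.
modified_forms = ["unmodified", "phosphorylated"]
locations = ["cytosol", "nucleus", "plasma membrane"]

# One explicit state per (modified form, location) pair:
# the number of states is the product of both counts.
explicit_states = list(product(modified_forms, locations))
print(len(explicit_states))  # 2 * 3 = 6


@dataclass
class BasicMolecule:
    """Hypothetical basic entry carrying location lists instead of states."""
    name: str
    positive_locations: set = field(default_factory=set)
    negative_locations: set = field(default_factory=set)


def location_status(molecule, location):
    """Distinguish 'found', 'not found' and 'unknown' for a location."""
    if location in molecule.positive_locations:
        return "found"
    if location in molecule.negative_locations:
        return "not found"
    return "unknown"  # no entry: nothing is known about this location


tf = BasicMolecule("TF-X", {"cytosol", "nucleus"}, {"plasma membrane"})
```

With the lists attached to the basic molecule, only the modified forms need separate entries, and adding a location extends one set instead of multiplying the number of states.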

In each state, the molecule is available only for a subset of all reactions for that molecule. Receiving a signal changes the molecule's state, usually leading to a new state from which reactions are triggered (Fig. 2).

Figure 2: State switching
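
State switching can be read as a small state machine. The sketch below encodes the three transcription factor states from the example above. The table layout and function names are assumptions, and two simplifications are made for illustration: translocation leads directly to the DNA-bound nuclear state, and activating transcription leaves the state unchanged.

```python
# For each state: the reactions available in it and the state each one leads to.
AVAILABLE = {
    "dephosphorylated, cytosol": {
        "phosphorylation": "phosphorylated, cytosol",
    },
    "phosphorylated, cytosol": {
        "dephosphorylation": "dephosphorylated, cytosol",
        "translocation into the nucleus": "phosphorylated, DNA-bound, nucleus",
    },
    "phosphorylated, DNA-bound, nucleus": {
        # Simplifying assumption: the state does not change here.
        "activation of transcription": "phosphorylated, DNA-bound, nucleus",
    },
}


def available_reactions(state):
    """Return the subset of reactions the molecule is available for."""
    return sorted(AVAILABLE[state])


def switch(state, reaction):
    """Apply a reaction and return the resulting state."""
    options = AVAILABLE[state]
    if reaction not in options:
        raise ValueError(f"{reaction!r} is not available in state {state!r}")
    return options[reaction]


state = "dephosphorylated, cytosol"
state = switch(state, "phosphorylation")
state = switch(state, "translocation into the nucleus")
```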

Motifs are structural and functional features of a protein and are often responsible for its signaling behavior. A single protein can have several signaling motifs. Motifs are listed in the feature field together with position information and references.

More detailed information about the data model of TRANSPATH can be found at:

Choi, C., Crass, T., Kel, A., Kel-Margoulis, O., Krull, M., Pistor, S., Potapov, A., Voss, N., Wingender, E. (2004)
"Consistent re-modeling of signaling pathways and its implementation in the TRANSPATH database"
Genome Inf. Ser. 15, 244-254.

Fields

It should be noted that in individual entries, some fields may be empty. In this case, these fields are not displayed.

Field		Content and format
AC	Accession number	The accession number is the unique identifier for each entry. Its format is "MO" in capital letters followed by nine digits (e.g. MO000012345). If two entries have to be merged, the AC of the primary name is retained. The other AC will be stored in the secondary accession numbers (AS) field.
AS	Accession numbers, secondary	The secondary accession numbers are optional alternate identifiers for each entry. They have the same format as the accession number (AC), are separated by semicolons, and are created when two entries are merged.
DT	Created by	The name of the curator who created the entry. E-mail your feedback directly to the curator.
DT	Updated by	The name of the curator who last updated the entry. E-mail your feedback directly to the curator.
CO	Copyright-information
NA	Name	A human-readable identifier for the component. The most common spelling in the literature is used, with a preference for forms with a dash (e.g. Grb-2 instead of Grb2) if both forms exist. Non-abbreviated molecule names are written according to their most common appearance, in most cases beginning with a lower-case letter. Note that Greek letters are expanded to alpha, beta, gamma etc. Molecules (transcription factors) with a lot of detailed information available in TRANSFAC® have a hyperlink with the respective TRANSFAC accession number in front of their name. If a molecule is chemically modified and in a different state, the name is tagged with an abbreviation in {} brackets. See List of abbreviations and their meanings. There are also tags to differentiate the species the molecule comes from. This short identifier is useful in reaction names, because the experimental evidence for the reactions is frequently based on molecules from different species. The tag (v.s.) for vertebrate species is used when the exact (vertebrate) species has not been described in the reference and could not be determined from cited references or websites. See List of species tags. Stoichiometric factors are used if quantitative details are known. The notation for homodimers is A:A (or (A)2); in larger complexes the number of identical components is written after the respective molecule, for example A:(B)2:C:(D)4. Queries with the search field name automatically include the fields fullname and synonyms.
SY	Synonyms	Other names for the same component. This is needed, since the names in biosciences change often. The field lists other names or abbreviations. Specific names from orthologous species are not stored here. They are accessible via the group-member hierarchy. Different synonyms are separated by a semicolon. The field synonyms is automatically included if you run a query with the search field name. Since professional version 3.4 the synonyms field also contains the information from the fullname field.
GE	Encoding gene	The corresponding gene encoding for this protein.
OS	Species	The species that the molecule entry belongs to. Given is the common name (if it exists), followed by the Latin denomination, as it is done in TRANSFAC.
CL	Classification	This sorts the entry into all the groups it belongs to. The classification stretches over all hierarchical levels, that are "above" the entry. As a molecule can belong to more than one functional group several classification paths can exist. Main paths are marked.
TY	Type	The type of this molecule entry. Possible values are: orthofamily, group entry for homologous superfamilies and families; family, for species-specific superfamilies and families; orthogroup, group entry for the products of orthologous genes (equivalent to orthogene entries in the gene table); isogroup, group entry for the products of one gene in one taxon (usually species), which have emerged from gene duplication, alternative splicing of mRNA etc.; orthobasic, entry for orthologous isoforms (e.g. splice variants); basic, for taxon-specific isoforms (usually species); orthocomplex, group entry for orthologous complexes, consisting of non-covalently bound molecules; complex, for taxon-specific complexes, consisting of non-covalently bound molecules. A modification tag '_mod' is added to the type terms listed above if the entry carries a modification such as phosphorylation. Further values are: group (XOR), a grouping entry for molecules with the same function (e.g. isoforms) in a pathway, to be interpreted such that exactly one member (exclusive OR) of this group is involved in a concrete instance of the pathway; other, for lipids and second messengers (such as DAG, IP3, NO, cAMP).
NP	NetPro entry	The corresponding NetPro™ entry.
HP	Superfamilies	Lists all groups or families this component belongs to directly (one hierarchical level above). This is a very important field, since abstracting common signaling behaviour is needed to avoid an explosion of entries. For molecules, a group is a set of molecules which share a common signaling behaviour. For example, all isoforms of the gamma subunit of G-proteins could be grouped into one group. Also, orthologous forms from different species can be grouped together on hierarchical levels. At the bottom of the whole hierarchy, there should be proteins (and other molecules) which physically exist.
HC	Subfamilies	Lists all molecule entries that are a step downward in the hierarchy, e.g. splice variants of a molecule or members of a group/family.
SZ	Sequence length, molecular weight	The total number of amino acids in the given sequence. The calculated molecular weight of the protein in Dalton or Kilodalton (derived from cDNA /genomic clones). Experimental molecular mass (or range) in kDa (experimental method, e. g. SDS PAGE, GF/gel filtration), [reference].
IP	Isoelectric point	The calculated isoelectric point of the protein. Experimental isoelectric point (experimental method), [reference].
SC	Sequence source	Names the data source (e.g. database) the amino acid sequence has been derived from.
SQ	Sequence	Shows the amino acid sequence in one-letter-code.
DR	External database hyperlink	Database name (e. g. SwissProt): database accession number; identifier. The focus lies on linking TRANSPATH with SwissProt, EMBL, InterPro, Entrez Gene, UniGene, GO, DIP, BIND, and HyperCLDB. For some of the molecules links to PDB, PROSITE, Flybase, MGD, and others are also provided. Also, corresponding Affymetrix micro-array probe set identifiers are listed. For the following chips data is available: AFFY_HG_FOCUS AFFY_HG_U133A AFFY_HG_U133A_2 AFFY_HG_U133B AFFY_HG_U133_PLUS_2 AFFY_HG_U95AV2 AFFY_HG_U95B AFFY_HG_U95C AFFY_HG_U95D AFFY_HG_U95E AFFY_MG_U74AV2 AFFY_MG_U74BV2 AFFY_MG_U74CV2 AFFY_MOUSE430_2 AFFY_MOUSE430A_2 AFFY_MU11KSUBA AFFY_MU11KSUBB AFFY_RAT230_2 AFFY_RG_U34A AFFY_RG_U34B AFFY_RG_U34C AFFY_U133_X3P HuGeneFL The format is AFFYMETRIX:chip:probeset. Except for those from chip HuGeneFL, the Affymetrix links are based on those in Ensembl, version 27.35a for human, 27.33c for mouse, and 27.3e for rat.
DR	External database hyperlink (of encoding gene)	Hyperlinks of the encoding gene. They are included to retrieve matches in an array data analysis that is focussed on gene products.
FT	Features (motifs)	Lists all features/motifs/domains of the molecule (e.g. SH2 domain, signal peptide) that are important in the various signaling contexts the molecule is involved in. Gives the first and last position of the feature/motif in N→C direction on the amino acid sequence. Structural and functional features/motifs that are annotated from the literature are named by the common motif name and have a reference link. Automatically annotated features/domains that were calculated using Pfam hidden Markov models (HMM) are also shown in this field. These features are characterized by their Pfam model name, a link to the corresponding Pfam entry, a raw score and an E-value (expectation value). For further explanations please take a look at the TRANSPATH® Report 1, 0001 (2003).
	Scheme of	Shows a graphical representation of the listed features and their positions.
CC	Comments	A list of comments. Further information about the different categories of comments is available at Annotate.
GO	GO: biological process, molecular function	A list of Gene Ontology (GO) terms from the ontologies 'molecular function' and 'biological process'. Associated terms from the third GO ontology, 'cellular component', can be found in the field 'Location positive and experiment(s)'. All links to GO terms have been retrieved from Ensembl.
CP	Location positive and experiment(s)	Lists all locations (tissues, physiological cell types, cell lines, cellular compartments, developmental stages) where the compound was found. Gives experiment abbreviation, the signal strength in the experiment as expression level and cites reference (if available). For further information please see Location. Abbreviations for the types of experiments (methods) used to verify the abundance of the molecule are given in brackets. For an acronym explanation table please see Methods.
CN	Location negative and experiment(s)	Lists all locations where the compound was NOT found. Because negative findings are recorded explicitly, the absence of an entry indicates that nothing is known about the component's location. Abbreviations for the types of experiments (methods) used to verify the absence of the molecule are given in brackets. For an acronym explanation table please see Methods.
ST	Complex or modified form of	A list of molecules this molecule entry is a modified form of or, if it is a complex, a list of its subunits.
MF	Modified forms	A list of modified forms for this molecule. Modified forms have no AA sequence of their own; the sequence is given in the linked "Modified form of" entry.
CX	Complexes	A list of complexes this molecule is engaged in.
	Reaction	A list of mechanistic reactions this molecule is involved in.
XB	Reaction upstream	A list of reactions which lead to this molecule (in the semantic view). So the molecule serves as a signal acceptor.
XA	Reaction downstream	A list of reactions which go out from this molecule (in the semantic view). Here the molecule serves as a signal donor.
XC	Reaction catalyzed	A list of reactions which are catalyzed by this molecule.
XI	Reaction inhibited	A list of reactions which are inhibited by this molecule.
PW	Pathways	Indicates all the pathways and chains in which the respective molecule is involved.
RN	Reference number	[consecutive entry reference number]. A list of the papers from which the information in this entry was extracted. For further information have a look at Reference.
RX	Pubmed database hyperlink	The PMID number in PubMed.
RA	Reference author(s)	List of authors. See Reference.
RT	Reference title	See Reference.
RL	Reference publication	See Reference.
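
Several of the field formats described in the table are regular enough to check mechanically. The following Python sketch covers the AC format, the semicolon-separated AS list, the AFFYMETRIX:chip:probeset format of DR lines, and the stoichiometric notation of complex names. All function names are hypothetical, the probe set identifier in the example is made up, and the complex-name parser assumes that component names contain no colons.

```python
import re
from collections import Counter

# "MO" in capital letters followed by nine digits, e.g. MO000012345.
AC_PATTERN = re.compile(r"MO\d{9}")
# A component with a stoichiometric factor, e.g. (B)2.
COMPONENT = re.compile(r"\((?P<name>.+)\)(?P<count>\d+)")


def is_valid_ac(value):
    """Check a molecule accession number against the documented format."""
    return AC_PATTERN.fullmatch(value) is not None


def parse_secondary_acs(as_field):
    """Split the content of an AS field at semicolons."""
    return [part.strip() for part in as_field.split(";") if part.strip()]


def parse_affymetrix_link(value):
    """Split an AFFYMETRIX:chip:probeset link into (chip, probeset)."""
    prefix, chip, probeset = value.split(":", 2)
    if prefix != "AFFYMETRIX":
        raise ValueError(f"not an Affymetrix link: {value!r}")
    return chip, probeset


def parse_complex_name(name):
    """Count the components of a complex name such as A:(B)2:C:(D)4."""
    counts = Counter()
    for part in name.split(":"):
        match = COMPONENT.fullmatch(part)
        if match:
            counts[match["name"]] += int(match["count"])
        else:
            counts[part] += 1
    return dict(counts)


print(is_valid_ac("MO000012345"))
print(parse_complex_name("A:(B)2:C:(D)4"))
```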