TRANSFAC® Release 7.0 - Documentation
Site
As outlined above, SITE gives information on individual (putatively)
regulatory protein binding sites. In this release, it contains 7915
entries, 6360 of them referring to sites within 1504 eukaryotic genes
from species ranging from yeast to human. Additionally, this
table comprises 1295 artificial sequences that resulted from
mutagenesis studies, from in vitro selection procedures starting from
random oligonucleotide mixtures, or from specific theoretical considerations.
Finally, there are 260 entries with consensus binding sequences
given in the IUPAC code, many of them taken from the compilation
of Faisst and Meyer (Nucleic Acids Res. 20:3-26, 1992). The symbols used
in addition to A, C, G, and T for these consensus sequences are:
    W = A or T            S = C or G
    R = A or G            Y = C or T
    K = G or T            M = A or C
    B = C, G, or T        D = A, G, or T
    H = A, C, or T        V = A, C, or G
    N = A, C, G, or T

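To illustrate how these symbols can be used in practice, the following minimal sketch expands an IUPAC consensus string into a regular expression and reports where it matches in a DNA sequence. The function names `consensus_to_regex` and `find_sites`, as well as the example consensus and sequence, are illustrative assumptions and are not part of TRANSFAC®.

```python
import re

# IUPAC nucleotide codes as listed in the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "R": "AG", "Y": "CT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a compiled regular expression."""
    parts = []
    for symbol in consensus.upper():
        bases = IUPAC[symbol]  # raises KeyError for non-IUPAC symbols
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


def find_sites(consensus, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    pattern = consensus_to_regex(consensus).pattern
    # A lookahead group lets the regex engine report overlapping matches too.
    lookahead = re.compile("(?=(" + pattern + "))")
    return [m.start() for m in lookahead.finditer(sequence.upper())]
```

For example, `find_sites("TGASTCA", "ccTGACTCAggTGAGTCAtt")` returns `[2, 11]`, because S matches both the C and the G variant of the site. The sketch scans one strand only; searching the reverse complement would be a separate step.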
A number of consensus sequences have been generated by the TRANSFAC® team,
generally derived from the profiles stored in the MATRIX table.
Here, the use of degenerate codes follows these rules
(adapted from Cavener, Nucleic Acids Res. 15:1353-1361, 1987):
A single nucleotide is shown if its frequency is greater than 50%
and at least twice as high as that of the second most frequent nucleotide.
A double-degenerate code indicates that the corresponding two nucleotides
together occur in more than 75% of the underlying sequences but each of
them is present in less than 50%. Triple-degenerate codes are restricted
to those positions where one of the nucleotides does not occur at all in
the sequence set and neither of the aforementioned rules applies.
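The rules above can be sketched as a small function that assigns a consensus symbol to one matrix position from its nucleotide counts. This is a hedged illustration, not the TRANSFAC® team's implementation: the name `consensus_symbol` is hypothetical, the 75% threshold is read as applying to the two nucleotides combined, and the fallback to N when no rule applies is an assumption that the text does not state explicitly.

```python
# Two-letter and three-letter IUPAC codes used by the rules.
DOUBLE = {
    frozenset("AT"): "W", frozenset("CG"): "S", frozenset("AG"): "R",
    frozenset("CT"): "Y", frozenset("GT"): "K", frozenset("AC"): "M",
}
# Triple code keyed by the single nucleotide that is absent.
TRIPLE_MISSING = {"A": "B", "C": "D", "G": "H", "T": "V"}


def consensus_symbol(counts):
    """Return the consensus symbol for one position given counts of A, C, G, T."""
    counts = {base: counts.get(base, 0) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("position has no observations")

    ranked = sorted("ACGT", key=lambda base: counts[base], reverse=True)
    first, second = ranked[0], ranked[1]
    n1, n2 = counts[first], counts[second]

    # Rule 1: frequency > 50% and at least twice the second most frequent.
    if 2 * n1 > total and n1 >= 2 * n2:
        return first
    # Rule 2: the top two together > 75%, each of them < 50%.
    if 4 * (n1 + n2) > 3 * total and 2 * n1 < total and 2 * n2 < total:
        return DOUBLE[frozenset((first, second))]
    # Rule 3: exactly one nucleotide never occurs and rules 1 and 2 failed.
    absent = [base for base in "ACGT" if counts[base] == 0]
    if len(absent) == 1:
        return TRIPLE_MISSING[absent[0]]
    # Assumption: anything else is reported as N.
    return "N"
```

Integer comparisons are used instead of floating-point frequencies so that the thresholds behave exactly at the boundaries. With ten underlying sequences, counts of A=8, C=1, G=1, T=0 yield A, while A=4, C=4, G=1, T=1 yield M.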
Collectively, SITE contains 6321 sequences with 116035 nucleotides.