Database Description
UniProt Knowledgebase (UniProtKB)
UniProtKB is the central hub for the collection of functional information on proteins, with accurate, consistent, and rich annotation. In addition to capturing the core data mandatory for each UniProtKB entry (principally, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added. This includes widely accepted biological ontologies, classifications and cross-references, and clear indications of the quality of annotation in the form of evidence attribution of experimental and computational data.
Created by merging the data in Swiss-Prot, TrEMBL and PIR-PSD, individual UniProt Knowledgebase entries may contain more information than was available in any given separate source database. The UniProt Knowledgebase consists of two sections: a section containing manually annotated records with information extracted from literature and curator-evaluated computational analysis, and a section with computationally analyzed records that await full manual annotation. For the sake of continuity and name recognition, the two sections are referred to as "Swiss-Prot" and "TrEMBL", respectively.
UniProt Non-redundant Reference (UniRef) Databases
The UniRef databases provide clustered sets of sequences from UniProt Knowledgebase (including splice variants and isoforms) and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences (but not their descriptions) from view. The UniRef100 database combines identical sequences and sub-fragments with 11 or more residues (from any organism*) into a single UniRef entry, displaying the sequence of a representative protein, the accession numbers of all the merged UniProtKB entries, and links to the corresponding UniProtKB and UniParc records. UniRef90 and UniRef50 are built by clustering UniRef100 sequences with 11 or more residues using the CD-HIT algorithm (Li W., Jaroszewski L., and Godzik A., Bioinformatics, 17: 282-283, 2001) such that each cluster is composed of sequences that have at least 90% or 50% sequence identity, respectively, to the representative sequence. UniRef90 and UniRef50 yield a database size reduction of approximately 40% and 65%, respectively, providing for significantly faster sequence searches.
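The UniRef100 merging step described above can be illustrated with a small sketch. It is not the production UniRef pipeline: the function name, the input shape (a dictionary of accession to sequence), and the naive substring search are assumptions made for illustration. Only the 11-residue threshold comes from the text.

```python
MIN_FRAGMENT_LENGTH = 11  # from the text: sub-fragments with 11 or more residues


def merge_identical_and_fragments(entries):
    """Group accessions whose sequences are identical or are sub-fragments.

    `entries` maps accession -> sequence (hypothetical input shape).
    Returns a dict mapping each representative sequence to a sorted list
    of the accessions merged under it.
    """
    # Step 1: collapse identical sequences.
    by_sequence = {}
    for accession, sequence in entries.items():
        by_sequence.setdefault(sequence, []).append(accession)

    # Step 2: visit sequences from longest to shortest, folding each
    # sufficiently long fragment into the first longer sequence containing it.
    clusters = {}
    for sequence in sorted(by_sequence, key=lambda s: (-len(s), s)):
        host = None
        if len(sequence) >= MIN_FRAGMENT_LENGTH:
            host = next((rep for rep in clusters if sequence in rep), None)
        if host is None:
            clusters[sequence] = list(by_sequence[sequence])
        else:
            clusters[host].extend(by_sequence[sequence])

    return {rep: sorted(members) for rep, members in clusters.items()}


example = {
    "A1": "MKTAYIAKQRQISFVKSHFSRQ",
    "A2": "MKTAYIAKQRQISFVKSHFSRQ",  # identical to A1
    "A3": "AKQRQISFVKSH",            # 12-residue sub-fragment of A1
    "A4": "AKQRQ",                   # too short to be merged as a fragment
}
merged = merge_identical_and_fragments(example)
```

In this sketch the longest sequence of each group stands in as the representative, and sequences shorter than the threshold are left in clusters of their own. Both choices are simplifications of the sketch, not statements about UniRef.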
All the sequences in each cluster are ranked to facilitate the selection of a representative sequence. The sequences are ranked as follows: (1) quality of the entry: member entries from UniProtKB/Swiss-Prot are preferred, (2) meaningful name (entries with names that do not contain words such as hypothetical, probable, etc. are preferred), (3) organism (entries from model organisms preferred), and (4) length of the sequence (longest sequence preferred).
* Prior to Release 1.8, identical sequences and sub-fragments were combined only if they were derived from the same species.
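The four ranking criteria for choosing a representative sequence can be expressed as a sort key. The sketch below is one possible reading, assuming the criteria apply in the listed order of priority. The `Member` class, its field names, and the list of model organisms are hypothetical; only the words "hypothetical" and "probable" are taken from the text.

```python
from dataclasses import dataclass

# Words named in the text as marking a less meaningful protein name.
UNINFORMATIVE_WORDS = ("hypothetical", "probable")

# Illustrative assumption: the text does not list the model organisms.
MODEL_ORGANISMS = {"Homo sapiens", "Mus musculus", "Escherichia coli"}


@dataclass(frozen=True)
class Member:
    accession: str
    reviewed: bool   # True for UniProtKB/Swiss-Prot entries
    name: str
    organism: str
    sequence: str


def ranking_key(member):
    """Return a tuple that sorts higher for better representatives."""
    name = member.name.lower()
    meaningful_name = not any(word in name for word in UNINFORMATIVE_WORDS)
    return (
        member.reviewed,                      # (1) quality of the entry
        meaningful_name,                      # (2) meaningful name
        member.organism in MODEL_ORGANISMS,   # (3) model organism
        len(member.sequence),                 # (4) longest sequence
    )


def choose_representative(members):
    """Pick the highest-ranked member of a cluster."""
    return max(members, key=ranking_key)


cluster = [
    Member("T1", False, "Hypothetical protein", "Homo sapiens", "M" * 300),
    Member("T2", False, "Kinase A", "Danio rerio", "M" * 250),
    Member("S1", True, "Kinase A", "Danio rerio", "M" * 200),
]
representative = choose_representative(cluster)
```

Because Python compares tuples element by element, an earlier criterion always outweighs a later one. Here `S1` is chosen despite having the shortest sequence.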
UniProt Archive (UniParc)
UniParc is a comprehensive non-redundant protein sequence collection. Protein sequences are loaded daily from many different publicly accessible sources, including not only the UniProt Consortium databases Swiss-Prot, TrEMBL and PIR-PSD, but also translations from the EMBL-Bank/DDBJ/GenBank nucleotide sequence databases, the EnsEMBL database of animal genomes, the International Protein Index (IPI), the Protein Data Bank (PDB), NCBI's Reference Sequence Collection (RefSeq), model organism databases such as FlyBase and WormBase, and protein sequences from the European, American and Japanese Patent Offices. While a protein sequence may exist in multiple databases, and even more than once in a given database (with different identifiers), UniParc stores each unique sequence only once and assigns it a unique UniParc identifier. Cross-references back to the source databases are provided, and include source accession numbers, sequence versions, and status (active or obsolete). A UniParc sequence version is also provided and incremented each time the underlying sequence changes, making it possible to observe the history of sequence changes in all source databases.
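The storage model described above (one record per unique sequence, with cross-references carrying a version and an active or obsolete status) can be sketched as a small in-memory class. This is an illustration only, not UniParc's implementation: the class and method names, the `ARC` identifier format, and the rule that a source entry's version is incremented when its sequence changes are all assumptions of the sketch.

```python
class SequenceArchive:
    """Toy archive that stores each unique sequence exactly once."""

    def __init__(self):
        self._id_by_sequence = {}  # sequence -> archive identifier
        self._xrefs = {}           # archive id -> {(database, accession): info}
        self._current = {}         # (database, accession) -> archive id

    def load(self, database, accession, sequence):
        """Record that a source entry currently has the given sequence."""
        archive_id = self._id_by_sequence.get(sequence)
        if archive_id is None:
            # Hypothetical identifier format, chosen for this sketch.
            archive_id = f"ARC{len(self._id_by_sequence) + 1:010X}"
            self._id_by_sequence[sequence] = archive_id
            self._xrefs[archive_id] = {}

        key = (database, accession)
        previous_id = self._current.get(key)
        if previous_id == archive_id:
            return archive_id  # sequence unchanged, nothing to update

        version = 1
        if previous_id is not None:
            # The source entry's sequence changed: mark the old
            # cross-reference obsolete and bump the version.
            old = self._xrefs[previous_id][key]
            old["active"] = False
            version = old["version"] + 1

        self._xrefs[archive_id][key] = {"version": version, "active": True}
        self._current[key] = archive_id
        return archive_id

    def cross_references(self, archive_id):
        """Return the cross-references attached to one archived sequence."""
        return self._xrefs[archive_id]


archive = SequenceArchive()
first = archive.load("Swiss-Prot", "P00001", "MKTAYIAK")
same = archive.load("PDB", "1ABC", "MKTAYIAK")         # same sequence, other source
changed = archive.load("Swiss-Prot", "P00001", "MKTAYIAKQ")  # sequence revised
```

After these calls, `first` and `same` are the same identifier, while the obsolete cross-reference kept under `first` preserves the history of the revised Swiss-Prot entry.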