Release 12.7

Published January 15, 2008

Headlines

Addition of more than 40'000 microbial entries derived from automated annotation in UniProtKB

Thanks to genome sequencing efforts, there has been a tremendous rise in the number of submitted protein sequences. This is only the beginning, as faster and cheaper sequencing methods will greatly increase the rate at which new genomes are sequenced.

Semi-automated annotation methods are necessary to provide users with as many annotated protein sequences as possible. The approach used by UniProtKB/Swiss-Prot differs from most other automated methods in that the bulk of the annotation procedure is still performed manually, since we want to make sure that we produce high-quality annotation with as few incorrect inferences as possible.

Our first automatic annotation project is called HAMAP, which stands for High-quality Automated and Manual Annotation of microbial Proteomes. In the context of this project, proteins from complete bacterial and archaeal proteomes, together with the related plastid proteins, are automatically annotated based on manually created family rules for complete protein annotation, with template-based feature propagation. We are very aware of the danger posed by automatic annotation procedures and have been extremely careful in the implementation of the pipeline, establishing many checks and conditional propagation in order to ensure that automatic annotation produces data of a quality comparable to that of manual curation.

With this release, we have begun automatically integrating into UniProtKB/Swiss-Prot the entries annotated by the HAMAP automated pipeline; over 40'000 bacterial and archaeal entries were integrated. This is the largest number of entries ever integrated in a single release.

Note that the planned introduction of 'evidence tags' should allow us to flag unambiguously whether a piece of information has been derived manually or automatically. For the time being, all entries annotated by the HAMAP pipeline have a cross-reference to HAMAP (for an example see entry Q02JM4).

UniProtKB News

Cross-references to dictyBase

The DictyBase database was renamed dictyBase. We changed the database name in the relevant cross-references (DR lines in the flat file) accordingly.

Example:

DR   dictyBase; DDB0201569; manA.
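
Files saved before this release may still carry the old database name. The following is a minimal sketch, not part of any UniProt tool, of how such lines could be brought in line with the new name. The function name rename_dictybase and the regular expression are illustrative assumptions; the only facts taken from the text above are the old and new database names and the DR line layout shown in the example.

```python
import re

# Assumption: a DR line starts with "DR", some whitespace, then the database
# name followed by a semicolon, as in the example above.
_OLD_NAME = re.compile(r"^(DR\s+)DictyBase;")


def rename_dictybase(line: str) -> str:
    """Rewrite the old database name to dictyBase; leave other lines untouched."""
    return _OLD_NAME.sub(r"\1dictyBase;", line)


if __name__ == "__main__":
    print(rename_dictybase("DR   DictyBase; DDB0201569; manA."))
```

Lines that already use the new name, or that refer to another database, pass through unchanged.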

Cross-references to PDBsum

Cross-references have been added to the PDBsum database. PDBsum provides an overview of every macromolecular structure deposited in the Protein Data Bank (PDB), giving schematic diagrams of the molecules in each structure and of the interactions between them.

The PDBsum database is available at http://www.ebi.ac.uk/pdbsum.

The format of the explicit link in the flat file is:

Data bank identifier: PDBsum
Primary identifier: The primary identifier consists of a PDB entry name.
Secondary identifier: None; a dash '-' is stored in that field.
Examples:
Q07540:
        DR   PDBsum; 2FQL; -.
        DR   PDBsum; 2GA5; -.
       
P78536:
        DR   PDBsum; 1BKC; -.
        DR   PDBsum; 1ZXC; -.
        DR   PDBsum; 2A8H; -.
        DR   PDBsum; 2DDF; -.
        DR   PDBsum; 2FV5; -.
        DR   PDBsum; 2FV9; -.
        DR   PDBsum; 2I47; -.
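
As an illustration of the format described above, the sketch below splits a DR line into its data bank identifier, primary identifier and secondary identifier. It assumes exactly three semicolon-separated fields and a trailing period, which holds for the examples in these notes but is not guaranteed for every database. The names CrossRef and parse_dr_line are hypothetical and do not belong to any UniProt library.

```python
from typing import NamedTuple, Optional


class CrossRef(NamedTuple):
    database: str
    primary: str
    secondary: Optional[str]  # None when the field holds the '-' placeholder


def parse_dr_line(line: str) -> CrossRef:
    """Split one DR line into its three fields (assumes a three-field line)."""
    text = line.strip()
    if not text.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    body = text[2:].strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, found {len(fields)}: {line!r}")
    database, primary, secondary = fields
    return CrossRef(database, primary, None if secondary == "-" else secondary)


if __name__ == "__main__":
    print(parse_dr_line("        DR   PDBsum; 2FQL; -."))
```

Mapping the dash to None keeps the "no secondary identifier" case distinct from a real value, so downstream code does not have to compare against the placeholder string.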
       

Cross-references to VectorBase

Cross-references have been added to VectorBase, the Invertebrate Vectors of Human Pathogens database. VectorBase is a NIAID Bioinformatics Resource Center that annotates and maintains vector genomes, providing an integrated resource for the research community.

The VectorBase database is available at http://www.vectorbase.org/index.php.

The format of the explicit link in the flat file is:

Data bank identifier: VectorBase
Primary identifier: The primary identifier consists of a VectorBase Gene ID.
Secondary identifier: The secondary identifier consists of a species name.
Examples:
Q17KX3:
        DR   VectorBase; AAEL001551; Aedes aegypti.
       
Q7PD39:
        DR   VectorBase; AGAP005024; Anopheles gambiae.
        DR   VectorBase; AGAP005025; Anopheles gambiae.
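
Because the secondary identifier carries the species name, the VectorBase cross-references of an entry can be grouped by species. The sketch below shows one way to do this under the assumption that each VectorBase DR line has the three-field layout shown in the examples; the function name vectorbase_genes_by_species is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def vectorbase_genes_by_species(dr_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect VectorBase gene IDs per species; other lines are skipped."""
    genes: Dict[str, List[str]] = defaultdict(list)
    for line in dr_lines:
        parts = line.strip().rstrip(".").split(";")
        # parts[0] is expected to look like "DR   VectorBase"
        if len(parts) != 3 or parts[0].split() != ["DR", "VectorBase"]:
            continue
        gene_id, species = parts[1].strip(), parts[2].strip()
        genes[species].append(gene_id)
    return dict(genes)


if __name__ == "__main__":
    lines = [
        "DR   VectorBase; AGAP005024; Anopheles gambiae.",
        "DR   VectorBase; AGAP005025; Anopheles gambiae.",
    ]
    print(vectorbase_genes_by_species(lines))
```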
       

Release of new species-specific documents

There are 9 new documents for several Brucella, Rickettsia and Coxiella complete proteomes, listing all the UniProtKB/Swiss-Prot entries from these proteomes and their corresponding gene designations.

The documents contain, for each relevant UniProtKB/Swiss-Prot entry, the corresponding ordered locus name, entry name, accession number, sequence length and gene name(s).

Changes concerning keywords

New keywords:

Modified keywords:

Changes in subcellular location controlled vocabulary

New subcellular location:

UniMES News

New clustered sequence sets

The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data.

We now provide UniMES clusters, i.e. clustered sets (unimes_cluster100.fasta and unimes_cluster90.fasta) of sequences at two resolutions (100% and >90%). In unimes_cluster100.fasta, identical sequences and subfragments from unimes.fasta are placed into a single cluster.

The unimes_cluster90.fasta file is built by clustering the representative sequences of unimes_cluster100.fasta (the longest sequence in each cluster) using the CD-HIT algorithm (Li W., Jaroszewski L., and Godzik A., Bioinformatics, 17: 282-283, 2001) such that each cluster is composed of sequences that have at least 90% sequence identity to the representative sequence. Only the representative sequences of the clusters are present in these files.
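
To make the 100% resolution step concrete, the sketch below groups identical sequences and subfragments under the longest sequence that contains them. It is a naive quadratic illustration of the idea only, not the procedure actually used to build unimes_cluster100.fasta, and it does not cover the 90% step, which relies on CD-HIT. The function name and the rule of assigning a fragment to the first representative that contains it are assumptions made for the example.

```python
from typing import Dict, List


def cluster_identical_and_fragments(seqs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each representative ID to its member IDs (representative first)."""
    clusters: Dict[str, List[str]] = {}
    # Visit the longest sequences first, so that every representative is the
    # longest sequence of its cluster; ties are broken by identifier.
    ordered = sorted(seqs, key=lambda sid: (-len(seqs[sid]), sid))
    for sid in ordered:
        for rep in clusters:
            # An identical sequence or a subfragment is a substring of the
            # representative.
            if seqs[sid] in seqs[rep]:
                clusters[rep].append(sid)
                break
        else:
            clusters[sid] = [sid]  # no containing sequence found: new cluster
    return clusters


if __name__ == "__main__":
    example = {
        "a": "MKTAYIAKQR",
        "b": "TAYIA",        # subfragment of "a"
        "c": "MKTAYIAKQR",   # identical to "a"
        "d": "GGSLP",
    }
    print(cluster_identical_and_fragments(example))
```

Writing out only the keys of the returned mapping would correspond to keeping only the representative sequences, as in the distributed files.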

UniMES is available in the subdirectory current_release/unimes of the UniProt FTP servers (UniProt, EBI and ExPASy).