UniProt
Contributors: Aeman Zahra, Shahina Hayat, Mubah Shaheen
The Universal Protein Resource (UniProt) is the largest provider of data on the proteomes of organisms. The data are based on submissions to the DNA sequence databases and have undergone varying levels of curation. It is a good way to quickly assess a vast set of protein properties, particularly for species that are not model organisms.
UniProt was created in collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR)[1][2].
UniProt provides four core databases: UniProtKB, UniRef, UniParc and UniMES.
UniProt Knowledgebase (UniProtKB)
UniProtKB is the central database of protein sequences with accurate, consistent, and rich sequence and functional annotation[3].
The knowledgebase has two sections.
- Reviewed and manually annotated section
This section contains manually annotated records with information extracted from the literature and from curator-evaluated computational analysis. It is referred to as "UniProtKB/Swiss-Prot"[4].
- Unreviewed and automatically annotated section
This section contains computationally analyzed records that await full manual annotation. It is referred to as "UniProtKB/TrEMBL"[4].
UniProt Non-redundant Reference (UniRef)
The UniRef databases combine closely related sequences into a single record to speed searches. The UniRef100 database combines identical sequences and sub-fragments of the UniProt Knowledgebase (from any species) into a single UniRef entry, displaying the sequence of a representative protein, the accession numbers of all the merged UniProt entries, and links to the corresponding UniProt and UniParc records. UniRef90 and UniRef50 are built by clustering UniRef100 sequences with 11 or more residues such that each cluster is composed of sequences that have at least 90% or 50% sequence identity, respectively, to the representative sequence. UniRef90 and UniRef50 yield a database size reduction of approximately 40% and 65%, respectively, providing for significantly faster sequence searches [5].
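The clustering idea behind UniRef90 and UniRef50 can be illustrated with a minimal sketch. It is not the procedure UniProt actually runs: the function names `identity` and `greedy_cluster`, the greedy longest-first strategy, and the ungapped, left-anchored identity measure are all assumptions made for illustration, whereas production pipelines rely on alignment-based tools. Only the 11-residue minimum length and the 90% and 50% thresholds come from the text above.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions in an ungapped, left-anchored comparison,
    relative to the shorter sequence. A toy stand-in for alignment-based identity."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs: dict, threshold: float, min_len: int = 11) -> dict:
    """Return {representative_id: [member_ids]}.

    Sequences are visited longest first (ties broken by identifier); each one
    joins the first cluster whose representative it matches at or above
    `threshold`, otherwise it starts a new cluster of its own.
    """
    clusters = {}
    ordered = sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for sid, seq in ordered:
        if len(seq) < min_len:
            continue  # sequences shorter than min_len are not clustered
        for rep in clusters:
            if identity(seq, seqs[rep]) >= threshold:
                clusters[rep].append(sid)
                break
        else:
            clusters[sid] = [sid]
    return clusters


# Hypothetical example sequences, invented for this sketch.
example = {
    "P1": "MKTAYIAKQRQISFVKSHFS",
    "P2": "MKTAYIAKQRQISFVKSHFA",  # 19 of 20 positions match P1
    "P3": "MKTAYIAKQRGGGGGGGGGG",  # 10 of 20 positions match P1
    "P4": "WWWWWWWWWWWWWWWWWWWW",  # no positions match P1
    "P5": "MKTAY",                 # shorter than 11 residues, skipped
}

clusters_90 = greedy_cluster(example, 0.90)
clusters_50 = greedy_cluster(example, 0.50)
print(clusters_90)  # {'P1': ['P1', 'P2'], 'P3': ['P3'], 'P4': ['P4']}
print(clusters_50)  # {'P1': ['P1', 'P2', 'P3'], 'P4': ['P4']}
```

Lowering the threshold from 90% to 50% merges more sequences into each cluster, which is why the 50% level gives the larger reduction in database size.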
UniProt Archive (UniParc)
UniParc is a comprehensive and non-redundant database that contains most of the publicly available protein sequences in the world. The same protein may exist in several source databases, and in multiple copies within a single database. UniParc avoids such redundancy by storing each unique sequence only once and giving it a stable and unique identifier (UPI), which makes it possible to identify the same protein across different source databases [6].
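The principle of storing each unique sequence once while recording every source it came from can be sketched as follows. This is an illustrative toy, not UniParc's implementation: the `SequenceArchive` class, its methods, the counter-based identifier format, and the database names and accessions in the example are all hypothetical.

```python
class SequenceArchive:
    """Toy archive: one record per unique sequence, many source cross-references."""

    def __init__(self):
        self._ids = {}    # normalized sequence -> identifier
        self._xrefs = {}  # identifier -> set of (database, accession) pairs

    def add(self, sequence: str, database: str, accession: str) -> str:
        """Register a sequence from a source database and return its identifier."""
        seq = sequence.strip().upper()
        upi = self._ids.get(seq)
        if upi is None:
            # Hypothetical identifier scheme: "UPI" plus a counter as 10 hex digits.
            upi = f"UPI{len(self._ids) + 1:010X}"
            self._ids[seq] = upi
            self._xrefs[upi] = set()
        self._xrefs[upi].add((database, accession))
        return upi

    def sources(self, upi: str) -> list:
        """Return all (database, accession) pairs recorded for an identifier."""
        return sorted(self._xrefs[upi])


archive = SequenceArchive()
first = archive.add("MKTAYIAK", "DB_A", "A001")
second = archive.add("mktayiak", "DB_B", "B777")  # same sequence, other source
third = archive.add("MKTAYIAQ", "DB_A", "A002")   # different sequence

print(first == second)         # True
print(archive.sources(first))  # [('DB_A', 'A001'), ('DB_B', 'B777')]
```

Because the identifier is tied to the sequence rather than to any one source record, the same protein submitted to two databases resolves to a single entry with two cross-references.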
The UniProt Metagenomics and Environmental Sequences (UniMES)
The UniMES database is a repository developed specifically for metagenomic and environmental sequence data. It currently contains data from the Global Ocean Sampling Expedition (GOS). The initial GOS dataset is composed of 28 million DNA sequences from oceanic microbes and predicts nearly 6 million proteins [7].