Expressed Sequence Tags
In our help pages and training course material, we suggest Expressed Sequence Tags (ESTs) as a source of sequence data for organisms that are poorly represented in the protein databases. For example, even though the Atlantic cod (Gadus morhua) is of vital importance to the British Fish & Chip industry, there are only 1,207 distinct protein sequences in NCBInr, compared with 257,435 ESTs in GenBank. The EST sequences are highly redundant and may be just fragments of full-length proteins, but this can be compensated for by using a UniGene index to map the identified accessions into gene families. Doing so gives a result report that is as concise and meaningful as one from searching a non-redundant protein database. Details can be found in this talk or module 3 of the webcast training course.
ESTs may be very useful for protein identification, but they are old hat in the DNA sequencing world. The UniGene index for Atlantic cod was last updated in July 2012 and only 1,962 ESTs have been added to GenBank since that date. The reason is that almost everyone has switched from Sanger sequencing to next-generation sequencing. A torrent of next-generation data is being added to the NCBI and EBI repositories. At NCBI, raw read data goes into the Sequence Read Archive (SRA). The EBI equivalent is the European Nucleotide Archive (ENA).
Unfortunately, raw read data is really not suitable for direct searching. Some degree of assembly is needed to remove most of the redundancy and reduce the volume of data. In general, reads are assembled into contigs, contigs into scaffolds, and scaffolds into chromosomes. For Atlantic cod you can download most of the genome as 6,467 scaffolds. The size of the Fasta file is approximately 600 MB, slightly more manageable than the underlying 110 GB of raw read data. Note that individual scaffolds can be very large, so you may need to use the entry splitting utility described on the generic database help page.
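To illustrate what splitting very large entries involves, here is a minimal Python sketch. It is not the entry splitting utility mentioned above, and it makes several assumptions: the function names, the `_partN` accession suffix, and the default `max_len` and `overlap` values are all invented for illustration. Consecutive pieces share a short overlap so that a peptide straddling a cut point is not lost.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of Fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_entry(header: str, seq: str, max_len: int = 10000,
                overlap: int = 100) -> Iterator[Tuple[str, str]]:
    """Yield (header, piece) pairs no longer than max_len bases.

    Consecutive pieces share `overlap` bases. Both defaults are
    illustrative assumptions, not recommended settings.
    """
    if max_len <= overlap:
        raise ValueError("max_len must be greater than overlap")
    if len(seq) <= max_len:
        yield header, seq
        return
    accession, _, description = header.partition(" ")
    step = max_len - overlap
    start, part = 0, 1
    while True:
        piece = seq[start:start + max_len]
        # Hypothetical naming scheme: append _part1, _part2, ... to the accession
        yield f"{accession}_part{part} {description}".strip(), piece
        if start + max_len >= len(seq):
            break
        start += step
        part += 1


if __name__ == "__main__":
    demo = [">scaffold_1 demo entry", "ACGTACGTACGTACGTACGTACGTA"]
    for hdr, seq in read_fasta(demo):
        for new_hdr, piece in split_entry(hdr, seq, max_len=10, overlap=2):
            print(f">{new_hdr}\n{piece}")
```

For a real 600 MB file you would pass an open file handle to `read_fasta` and write the pieces to a new Fasta file, rather than holding everything in memory.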
From the perspective of protein ID, a collection of scaffolds is not as attractive as a database of ESTs of similar size:
- A genome tends to represent a single individual, while a collection of ESTs from many individuals is more likely to contain some of the same variants as your sample
- ESTs are 100% coding sequence
- If the genome contains introns, a proportion of peptide candidates are unavailable because they span exon-intron boundaries
- EST accessions can easily be mapped to gene families using UniGene. Figuring out what you’ve matched from an unannotated genome assembly is a substantial amount of work
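The last point, mapping EST accessions to gene families, amounts to grouping hits by cluster. The Python sketch below shows only the idea. It assumes you already have a lookup from EST accession to cluster identifier (for example, parsed from a UniGene index); the accessions, cluster labels, and the function name `group_by_cluster` are placeholders invented for illustration, and this is not how any particular search engine implements the mapping.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_cluster(hit_accessions: Iterable[str],
                     accession_to_cluster: Dict[str, str]) -> Dict[str, List[str]]:
    """Group matched EST accessions by the cluster they belong to.

    Accessions missing from the lookup are collected under "unmapped".
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for accession in hit_accessions:
        cluster = accession_to_cluster.get(accession, "unmapped")
        groups[cluster].append(accession)
    return dict(groups)


if __name__ == "__main__":
    # Placeholder data: not real GenBank accessions or UniGene clusters
    lookup = {"EST0001": "cluster_A", "EST0002": "cluster_A", "EST0003": "cluster_B"}
    hits = ["EST0001", "EST0003", "EST0002", "EST0009"]
    for cluster, members in group_by_cluster(hits, lookup).items():
        print(cluster, members)
```

Reporting one line per cluster rather than one per EST is what collapses a highly redundant hit list into something closer to a protein database report.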
In summary, first choice is protein, second is EST, and third is genomic DNA. The NCBI taxonomy browser will show how many proteins are present in GenBank for your organism. The actual number of entries in NCBInr will be fewer because the count of proteins includes entries that are identical to one another, while NCBInr is a non-identical database. If the count is low compared with the expected number of proteins, take a look at the count of ESTs. If there are tens of thousands, try these. If not, hopefully, there is a link to at least one Assembly. Where there is a choice, download the most highly assembled sequences in Fasta format.
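The selection order above can be written down as a small decision function. This is only a sketch of the reasoning: the text gives no exact cut-offs, so the thresholds `min_protein_fraction` and `min_ests`, the function name, and the expected proteome size used in the example are all assumptions chosen for illustration.

```python
from typing import Optional


def choose_database(protein_count: int, expected_proteins: int,
                    est_count: int, has_assembly: bool,
                    min_protein_fraction: float = 0.5,
                    min_ests: int = 10000) -> Optional[str]:
    """Return "protein", "EST", "genomic", or None if nothing is suitable.

    min_protein_fraction and min_ests are illustrative assumptions; the
    text only says "low compared with the expected number of proteins"
    and "tens of thousands".
    """
    if expected_proteins > 0 and protein_count / expected_proteins >= min_protein_fraction:
        return "protein"
    if est_count >= min_ests:
        return "EST"
    if has_assembly:
        return "genomic"
    return None


if __name__ == "__main__":
    # Counts for Atlantic cod from the text; expected_proteins=20000 is an
    # assumed figure for illustration only.
    print(choose_database(protein_count=1207, expected_proteins=20000,
                          est_count=257435, has_assembly=True))
```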