Suggested Format for GenBank Record Entries

Why enter my GenBank records using the LLNL suggested format?

GenBank records are entered in a loose format. While this allows a great deal of freedom and accommodates the needs of many users, it makes the data difficult for computers to process. IMAGE uses the sequences stored in GenBank for many purposes, including our Imagene clustering software and our QC efforts. There are certain important pieces of data that we try to determine from a GenBank record. While we have tried to make our criteria for parsing a GenBank record as robust as possible, our software relies on certain assumptions, which are explained here.

How we retrieve data regarding an IMAGE clone from a GenBank record.

Example GenBank record:

LOCUS       AA099559      436 bp    mRNA            EST       28-OCT-1996
DEFINITION  zl78a03.s1 Stratagene colon (#937204) Homo sapiens cDNA clone
            IMAGE:510700 3' similar to gb:D11086 CYTOKINE RECEPTOR COMMON GAMMA
            CHAIN PRECURSOR (HUMAN);, mRNA sequence.
NID         g1645633
VERSION     AA099559.1  GI:1645633
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 436)
  AUTHORS   Hillier,L., Lennon,G., Becker,M., Bonaldo,M.F., Chiapelli,B.,
            Chissoe,S., Dietrich,N., DuBuque,T., Favello,A., Gish,W.,
            Hawkins,M., Hultman,M., Kucaba,T., Lacy,M., Le,M., Le,N.,
            Mardis,E., Moore,B., Morris,M., Parsons,J., Prange,C., Rifkin,L.,
            Rohlfing,T., Schellenberg,K., Soares,M.B., Tan,F., Thierry-Meg,J.,
            Trevaskis,E., Underwood,K., Wohldmann,P., Waterston,R., Wilson,R.
            and Marra,M.
  TITLE     Generation and analysis of 280,000 human expressed sequence tags
  JOURNAL   Genome Res. 6 (9), 807-828 (1996)
  MEDLINE   97044478
COMMENT     Contact: Wilson RK
            Washington University School of Medicine
            4444 Forest Park Parkway, Box 8501, St. Louis, MO 63108
            Tel: 314 286 1800
            Fax: 314 286 1810
            This clone is available royalty-free through LLNL ; contact the
            Hypothetical Example Institution ( for further information.
            Seq primer: -40M13 fwd. from Amersham
            High quality sequence stop: 353.
FEATURES             Location/Qualifiers
     source          1..436
                     /organism="Homo sapiens"
                     /clone_lib="Stratagene colon (#937204)"
                     /lab_host="SOLR cells (kanamycin resistant)"
                     /note="Organ: colon; Vector: pBluescript SK-; Site_1:
                     EcoRI; Site_2: XhoI; Cloned unidirectionally.  Primer:
                     Oligo dT. T-84 colonic epithelial cell line.  Average
                     insert size: 1.0 kb; Uni-ZAP XR Vector; ~5' adaptor
                     sequence: 5' GAATTCGGCACGAG 3' ~3' adaptor sequence: 5'
                     CTCGAGTTTTTTTTTTTTTTTTTT 3'"
BASE COUNT      118 a     72 c    153 g     91 t      2 others
        1 tttttttgat gattatcaac agaaacttta tttctcatcg gttcaggaac aatcggaggg
       61 tagatggaaa gaggaaggga gggaaagagg gagggaggaa gaatcctgcg aaaaggaagg
      121 gccagactga gggagaagaa aaacatgttc ggggcaaaag ggtaattctc aagtggggaa
      181 tgccaaatga aggggtgctt acatgggggc acaaaattcc aaatcagcca cagtggggtg
      241 aggtgagtat gagacgcagg tgggttgaat gaaggaaagt tagtaccact tagggctaca
      301 ggaccctggg gttcttcttg tcagaggatt gggggttcag gtttcaggct ttagggtgta
      361 acattggggg ggcccagtta ggggctattg ctggttngca tggngggggg ccccaggccc
      421 cctcccccaa gggccc

The LOCUS field:

This is where we determine our GenBank accession number, the record type, and the date, which in this example are "AA099559", "EST", and "28-OCT-1996" respectively. Currently IMAGE only processes records of type "EST" and "PRI".
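
As a rough illustration of the LOCUS step, the sketch below splits the LOCUS line of the example record on whitespace. The function name parse_locus, the ACCEPTED_TYPES set, and the assumption that the record type and date are always the last two whitespace-separated tokens are ours for illustration only; they are not taken from the actual IMAGE software.

```python
def parse_locus(line):
    """Return (accession, record_type, date) from a LOCUS line.

    Hypothetical helper. Assumed layout, based on the example record:
    LOCUS <accession> <length> bp <molecule> <type> <date>
    """
    fields = line.split()
    if len(fields) < 4 or fields[0] != "LOCUS":
        raise ValueError("not a usable LOCUS line: %r" % line)
    # The accession follows the keyword; type and date are the last two tokens.
    return fields[1], fields[-2], fields[-1]


# Record types named in the text as the only ones currently processed.
ACCEPTED_TYPES = {"EST", "PRI"}

locus = "LOCUS       AA099559      436 bp    mRNA            EST       28-OCT-1996"
accession, record_type, date = parse_locus(locus)
is_processed = record_type in ACCEPTED_TYPES
```

A record whose LOCUS line deviates from this layout (for example, one missing the type column) would be misread by such a simple split, which is one reason a consistent format matters.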


The DEFINITION field:

This is where we determine the orientation of a clone by searching for the first occurrence of 5' or 3'. In this case the EST is the 3 prime end.
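
The orientation search can be sketched as a single regular-expression lookup. The helper name find_orientation is hypothetical, and the rule that the marker must not be preceded by another digit (so that a number such as 13' is not mistaken for 3') is our own assumption rather than documented behavior.

```python
import re

# Match 5' or 3' only when it is not the tail of a longer number (assumption).
END_MARKER = re.compile(r"(?<!\d)[53]'")


def find_orientation(text):
    """Return "5'" or "3'" for the first end marker in text, or None."""
    match = END_MARKER.search(text)
    return match.group(0) if match else None


definition = (
    "zl78a03.s1 Stratagene colon (#937204) Homo sapiens cDNA clone "
    "IMAGE:510700 3' similar to gb:D11086 CYTOKINE RECEPTOR COMMON GAMMA "
    "CHAIN PRECURSOR (HUMAN);, mRNA sequence."
)
orientation = find_orientation(definition)
```

Because only the first occurrence counts, any 5' or 3' that appears earlier in the text for another reason would decide the result, so the end marker should come before any other such mention.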

The COMMENT field:

IMAGE also makes note of how much of the sequence is considered poor quality by searching this field for the phrase "High quality sequence stop: ###" as in the example below:

            High quality sequence stop: 353.
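
A minimal sketch of this search follows. The names HQ_STOP, high_quality_stop, and poor_quality_bases are hypothetical, and treating everything after the stop position as the poor-quality portion is our reading of the phrase, not a documented rule.

```python
import re

# The phrase from the text, followed by the stop position as digits.
HQ_STOP = re.compile(r"High quality sequence stop:\s*(\d+)")


def high_quality_stop(comment):
    """Return the high-quality stop position as an int, or None if absent."""
    match = HQ_STOP.search(comment)
    return int(match.group(1)) if match else None


def poor_quality_bases(sequence_length, stop):
    """Bases after the stop position (assumed to be the poor-quality tail)."""
    return max(sequence_length - stop, 0)


stop = high_quality_stop("            High quality sequence stop: 353.")
tail = poor_quality_bases(436, stop)  # 436 bp is the length on the LOCUS line
```

A phrase that is reworded, or split so that the number lands on the next line without whitespace handling, may not be found by a pattern like this one.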

The FEATURES field:

This field is the most important part of the record to us.

In a single line of this field we are able to determine our internal clone ID and that this GenBank record describes an IMAGE clone. While we also try some secondary methods to determine this information, this is the safest way to ensure that data regarding an IMAGE clone can be retrieved by IMAGE.
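
The lookup might be sketched as below. This is an assumption-heavy illustration: we assume the line in question is a /clone qualifier whose value has the form IMAGE:<number>, matching the way the clone is written in the DEFINITION field of the example record. The helper name image_clone_id is hypothetical.

```python
import re

# Assumed qualifier form: /clone="IMAGE:<digits>"
CLONE_QUALIFIER = re.compile(r'/clone="IMAGE:(\d+)"')


def image_clone_id(features_text):
    """Return the IMAGE clone ID as an int, or None if no such qualifier exists."""
    match = CLONE_QUALIFIER.search(features_text)
    return int(match.group(1)) if match else None


# Hypothetical qualifier line, using the clone number from the example DEFINITION.
clone_id = image_clone_id('                     /clone="IMAGE:510700"')
```

A return value of None would mean the record cannot be tied to an IMAGE clone by this primary route, leaving only the less reliable secondary methods.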

Information determined by searching the entire record:

We also try to determine further data about a clone by searching the entire record for certain key phrases. The phrase "considered poor quality" marks a clone as low quality and the phrase "reversed clone" marks a clone as reversed.
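
The whole-record search could look roughly like the following. The function name record_flags is hypothetical, and both the whitespace normalization (so that a phrase wrapped across two record lines is still found) and the case-insensitive comparison are our assumptions rather than documented behavior.

```python
def record_flags(record_text):
    """Flag a record as low quality and/or reversed based on key phrases."""
    # Collapse line breaks and indentation so wrapped phrases still match.
    text = " ".join(record_text.split()).lower()
    return {
        "low_quality": "considered poor quality" in text,
        "reversed": "reversed clone" in text,
    }


sample = (
    "            This sequence is considered poor\n"
    "            quality by the submitter.\n"
)
flags = record_flags(sample)
```

Since the match depends on exact wording, submitters who want a clone flagged should use these phrases verbatim.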

