As part of DBI-0421717, we have generated random genomic and Cot-filtered moderately repetitive (M) and single/low-copy (S) sequence libraries. A high-copy Cot (H) library was also constructed, but sequencing of this library proved unfruitful. Clones were sequenced using ABI 3730xl DNA analyzers. Sequences are available through the National Center for Biotechnology Information (NCBI) and can be retrieved from NCBI's GenBank by searching with the accession strings listed in the table below.
Alternatively, FASTA files containing the sequences can be downloaded via the links in the table below.
DNA source | Number of reads | Mean read length (bp)ᵃ | Total base pairs | GenBank Accession String | Downloadable data foldersᵇ
Genomic (random)ᶜ | 1,007 | 876 | 881,730 | ET181630:ET182636 [ACCN] | PT_7GC.zip
Cot-filtered moderately repetitive (M) | 266 | 442 | 117,631 | ET182637:ET182902 [ACCN] | PT_7MC.zip
Cot-filtered single/low-copy (S) | 2,328 | 213 | 495,667 | ET182903:ET185230 [ACCN] | PT_7SC.zip
TOTAL | 3,601 | | 1,495,028 | ET181630:ET185230 [ACCN] |
ᵃ After trimming.
ᵇ Each folder contains all trimmed reads in a single FASTA file.
ᶜ Paired end reads with significant overlap (i.e., representing a continuous sequence) were assigned a single GenBank accession. Consequently, while there were 1,007 genomic reads, there are only 600 GenBank entries.
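The accession strings in the table are ranges of consecutive accession numbers. As a minimal sketch, the Python below expands one such string into individual accessions and builds a request URL for NCBI's E-utilities `efetch` service. The function names are hypothetical, and the choice of `db="nuccore"` is an assumption about where these records are currently served; neither comes from the project itself.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def expand_accession_range(accession_string):
    """Expand e.g. 'ET182637:ET182902 [ACCN]' into a list of accessions.

    Hypothetical helper: assumes both ends share the same letter prefix
    and the same number of digits.
    """
    range_part = accession_string.replace("[ACCN]", "").strip()
    first, last = (part.strip() for part in range_part.split(":"))
    prefix = first.rstrip("0123456789")
    if last.rstrip("0123456789") != prefix:
        raise ValueError("range ends use different accession prefixes")
    width = len(first) - len(prefix)
    start, end = int(first[len(prefix):]), int(last[len(prefix):])
    return [f"{prefix}{number:0{width}d}" for number in range(start, end + 1)]


def efetch_fasta_url(accessions, db="nuccore"):
    """Build an efetch URL requesting the given accessions as FASTA text.

    The database name is an assumption; adjust it if the records are
    served from a different Entrez database.
    """
    query = urlencode({
        "db": db,
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    })
    return f"{EFETCH_URL}?{query}"


# Example: the moderately repetitive (M) library from the table above.
m_accessions = expand_accession_range("ET182637:ET182902 [ACCN]")
first_batch_url = efetch_fasta_url(m_accessions[:50])
```

The M range expands to 266 accessions, which matches the read count in the table. Requesting the accessions in small batches, as in `m_accessions[:50]`, keeps each URL short; the batch size here is arbitrary.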
A collaboration with John E. Carlson (Penn State) allowed us to sequence uncloned kinetic components and genomic DNA using the 454/Roche Applied Sciences GS20 platform. The 454 sequencing, performed with the permission of the NSF and requiring only minor re-budgeting, has produced more than 100 Mb of pine genomic sequence, i.e., 18 times more sequence than proposed in the original funded version of DBI-0421717. This bonus sequence has provided many new opportunities and challenges. To facilitate timely characterization of these sequences, we have developed an automated "Sequence Read Classification Pipeline" (see Publications and Bioinformatics Tools). All 454 sequence data have been archived in the NCBI Short Read Archive. They can also be downloaded via the links in the table below.
DNA source | Number of reads | Mean read length (bp)ᵃ | Total base pairs | Short Read Archive Accession | Downloadable data foldersᵇ
Genomic (random) | 275,038 | 102 | 28,038,360 | SRX001948 | PT_7G4.zip
Cot-filtered highly repetitive (H) | 216,921 | 97 | 21,029,350 | SRX001949 | PT_7H4.zip
Cot-filtered moderately repetitive (M) | 206,402 | 97 | 20,017,474 | SRX001950 | PT_7M4.zip
Cot-filtered single/low-copy (S) | 102,708 | 93 | 9,544,980 | SRX001951 | PT_7S4.zip
Cot-filtered theoretical single-copy (T)ᶜ | 215,387 | 101 | 21,801,502 | SRX001952 | PT_7T4.zip
TOTAL | 1,016,456 | | 100,431,666 | |
ᵃ After trimming.
ᵇ Each folder contains sequence files in FASTA format and their corresponding quality files.
ᶜ A "T" sequence is isolated from any DNA that remains single-stranded at 0.1 × the theoretical Cot value for single-copy DNA as predicted from genome size (see Sequence Names for further explanation).
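The read counts, mean read lengths, and base-pair totals in the tables can be recomputed from the downloaded files. The sketch below parses a FASTA file and its quality file and reports those three figures, plus any reads whose quality-score count does not match their length. It assumes the quality files use the common 454 layout (FASTA-style headers followed by space-separated integer scores). The function names and the tiny example records are hypothetical, since the file names inside the zip folders are not given here.

```python
def parse_records(lines):
    """Yield (name, body_lines) for each FASTA-style record.

    Works for both sequence files and quality files, since both are
    assumed to use '>' header lines.
    """
    name, body = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, body
            fields = line[1:].split()
            name, body = (fields[0] if fields else ""), []
        else:
            body.append(line)
    if name is not None:
        yield name, body


def fasta_lengths(lines):
    """Map read name -> number of bases."""
    return {name: sum(len(part) for part in body)
            for name, body in parse_records(lines)}


def qual_scores(lines):
    """Map read name -> list of integer quality scores."""
    return {name: [int(q) for part in body for q in part.split()]
            for name, body in parse_records(lines)}


def summarize(lengths):
    """Return read count, total bases, and rounded mean read length."""
    reads = len(lengths)
    total = sum(lengths.values())
    mean = round(total / reads) if reads else 0
    return {"reads": reads, "total_bp": total, "mean_length": mean}


def mismatched_reads(lengths, scores):
    """Names of reads whose score count differs from their base count."""
    return sorted(name for name in lengths
                  if len(scores.get(name, [])) != lengths[name])


# Made-up records for illustration only; read2 is deliberately one score short.
example_fasta = """>read1 length=8
ACGTACGT
>read2 length=4
GGCC
""".splitlines()

example_qual = """>read1 length=8
30 30 28 27 25 20 18 10
>read2 length=4
35 33 31
""".splitlines()

lengths = fasta_lengths(example_fasta)
scores = qual_scores(example_qual)
summary = summarize(lengths)
problems = mismatched_reads(lengths, scores)
```

To check a real library, pass an open file handle for the extracted sequence file to `fasta_lengths` and compare the resulting summary with the corresponding table row. Small differences could arise if the files were trimmed differently from the figures reported above.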
The sequence data from both ABI 3730xl and 454 sequencing-by-synthesis runs should facilitate many aspects of pine genomics, including detailed characterization of pine repeat sequences (see, e.g., the GenomeWeb News article).
Please feel free to contact us if you have any questions.