Indexing Genomic Sequence Libraries
Bioinformatics, Genomics, Information retrieval, Mumps, Sequence retrieval
Information Processing and Management
This paper describes an extensible, open-source (GPL) data repository and retrieval system that supports fast, efficient, keyword-based retrieval of genomic sequences from multiple libraries; retrieved sequences are post-processed by FASTA, Smith-Waterman, and other analysis software. The application is implemented for Linux and is written in Mumps, C, and C++, with supporting components that include the Berkeley Data Base, the Perl Compatible Regular Expression Library, GLADE, and tools such as FASTA, Smith-Waterman, and modules from EMBOSS. The package can quickly index data sets of up to 256 terabytes using a B-tree-based multi-dimensional data model. An example is presented that indexes the text of the full NCBI GenBank library. © 2003 Elsevier Ltd. All rights reserved.
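As a rough illustration of the keyword-based, multi-dimensional indexing idea mentioned in the abstract, the following minimal Python sketch keeps (keyword, library, accession) tuples in sorted order, much as a B-tree keeps its keys ordered. It is not the system's Mumps implementation. The class `SortedKeyIndex`, the helper `index_record`, and the library names and record identifiers are all hypothetical, introduced only for this example.

```python
import bisect
import re


class SortedKeyIndex:
    """Toy stand-in for a B-tree: tuple keys held in one sorted list.

    Hypothetical illustration only; the system described in the paper
    is written in Mumps, C, and C++ and is not reproduced here.
    """

    def __init__(self):
        # Sorted list of (keyword, library, accession) tuples.
        self._keys = []

    def add(self, keyword, library, accession):
        """Insert one multi-part key, skipping exact duplicates."""
        key = (keyword.lower(), library, accession)
        pos = bisect.bisect_left(self._keys, key)
        if pos == len(self._keys) or self._keys[pos] != key:
            self._keys.insert(pos, key)

    def lookup(self, keyword):
        """Return every (library, accession) pair filed under keyword."""
        kw = keyword.lower()
        # A one-element tuple sorts before any longer tuple that shares
        # its first element, so this finds the first key for the keyword.
        pos = bisect.bisect_left(self._keys, (kw,))
        hits = []
        while pos < len(self._keys) and self._keys[pos][0] == kw:
            hits.append(self._keys[pos][1:])
            pos += 1
        return hits


def index_record(index, library, accession, description):
    """Index each distinct alphanumeric word of a record's description."""
    for word in set(re.findall(r"[a-z0-9]+", description.lower())):
        index.add(word, library, accession)


if __name__ == "__main__":
    idx = SortedKeyIndex()
    # Placeholder library names and identifiers, not real records.
    index_record(idx, "libA", "ID2", "Human protein kinase")
    index_record(idx, "libA", "ID1", "kinase inhibitor")
    index_record(idx, "libB", "ID9", "KINASE domain")
    print(idx.lookup("kinase"))
```

Because the keys are ordered, all entries for one keyword are contiguous, so a lookup is a binary search followed by a short scan. The pairs returned would then identify the sequences to pass to a post-processing tool.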
Department of Computer Science
Original Publication Date
DOI of published version
O'Kane, Kevin C. and Lockner, Matthew J., "Indexing Genomic Sequence Libraries" (2005). Faculty Publications. 2961.