Faculty Publications
Indexing Genomic Sequence Libraries
Document Type
Article
Keywords
Bioinformatics, Genomics, Information retrieval, Mumps, Sequence retrieval
Journal/Book/Conference Title
Information Processing and Management
Volume
41
Issue
2
First Page
265
Last Page
274
Abstract
This paper describes an extensible, open-source (GPL) data repository and retrieval system that supports fast, efficient, keyword-based retrieval of genomic sequences from multiple libraries. Retrieved sequences can then be post-processed by FASTA, Smith-Waterman, and other analysis software. The application is implemented for Linux and written in Mumps, C, and C++, with supporting components that include the Berkeley Data Base, the Perl Compatible Regular Expression Library, and GLADE, along with tools such as FASTA, Smith-Waterman, and modules from EMBOSS. The package described here can quickly index data sets of up to 256 terabytes using a B-tree-based multi-dimensional data model. An example is presented that indexes the text of the full NCBI GenBank library. © 2003 Elsevier Ltd. All rights reserved.
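The abstract does not give the index layout, so the following is only a minimal sketch of the general idea of keyword retrieval over a sorted, multi-dimensional key space. It assumes keys of the form (keyword, library, accession) mapped to a record offset; that key layout, the class and function names, and the accession numbers are all hypothetical, and a sorted Python list stands in for the B-tree.

```python
import bisect


class SortedKeyIndex:
    """Toy stand-in for a B-tree: tuple keys are kept in sorted order."""

    def __init__(self):
        self._keys = []      # sorted list of tuple keys
        self._values = {}    # key -> stored value

    def set(self, key, value):
        # Insert the key in sorted position the first time it is seen.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan_prefix(self, prefix):
        """Yield (key, value) for every key that starts with `prefix`."""
        n = len(prefix)
        # A shorter tuple sorts before any longer tuple it is a prefix of,
        # so bisect_left lands on the first matching key, if there is one.
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if key[:n] != prefix:
                break
            yield key, self._values[key]


def index_record(index, library, accession, description, offset):
    """Index each distinct lowercase word of a description line.

    Assumed key layout (not taken from the paper):
    (keyword, library, accession) -> byte offset of the record.
    """
    for word in set(description.lower().split()):
        index.set((word, library, accession), offset)


if __name__ == "__main__":
    idx = SortedKeyIndex()
    # Hypothetical records; accessions and offsets are made up.
    index_record(idx, "genbank", "AB000001", "Homo sapiens kinase mRNA", 0)
    index_record(idx, "embl", "X00002", "Mus musculus kinase gene", 512)
    for key, offset in idx.scan_prefix(("kinase",)):
        print(key, offset)
```

Because the keys are ordered, a one-element prefix such as ("kinase",) returns matches from every library, while a two-element prefix such as ("kinase", "genbank") restricts the scan to a single library without a separate index.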
Department
Department of Computer Science
Original Publication Date
3-1-2005
DOI of published version
10.1016/j.ipm.2003.09.001
Recommended Citation
O'Kane, Kevin C. and Lockner, Matthew J., "Indexing Genomic Sequence Libraries" (2005). Faculty Publications. 2961.
https://scholarworks.uni.edu/facpub/2961