TY - GEN
T1 - Prefix-free parsing for building big BWTs
AU - Boucher, Christina
AU - Gagie, Travis
AU - Kuhnle, Alan
AU - Manzini, Giovanni
N1 - Publisher Copyright:
© Christina Boucher, Travis Gagie, Alan Kuhnle, and Giovanni Manzini.
PY - 2018/8/1
Y1 - 2018/8/1
N2 - High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive - a characteristic that can be exploited and enable the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input, and in one-pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|)-time. Our experiments show that D and P are significantly smaller than T in practice, and thus, can fit in a reasonable internal memory even when T is very large. Therefore, prefix-free parsing eases BWT construction, which is pertinent to many bioinformatics applications.
AB - High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive - a characteristic that can be exploited and enable the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input, and in one-pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|)-time. Our experiments show that D and P are significantly smaller than T in practice, and thus, can fit in a reasonable internal memory even when T is very large. Therefore, prefix-free parsing eases BWT construction, which is pertinent to many bioinformatics applications.
KW - Burrows-Wheeler Transform
KW - Compression-aware algorithms
KW - Genomic databases
KW - Prefix-free parsing
UR - http://www.scopus.com/inward/record.url?scp=85051345740&partnerID=8YFLogxK
U2 - 10.4230/LIPIcs.WABI.2018.2
DO - 10.4230/LIPIcs.WABI.2018.2
M3 - Conference contribution
AN - SCOPUS:85051345740
SN - 9783959770828
T3 - Leibniz International Proceedings in Informatics, LIPIcs
BT - 18th International Workshop on Algorithms in Bioinformatics, WABI 2018
A2 - Parida, Laxmi
A2 - Ukkonen, Esko
PB - Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
T2 - 18th International Workshop on Algorithms in Bioinformatics, WABI 2018
Y2 - 20 August 2018 through 22 August 2018
ER -