TY - GEN
T1 - Efficient Construction of a Complete Index for Pan-Genomics Read Alignment
AU - Kuhnle, Alan
AU - Mun, Taher
AU - Boucher, Christina
AU - Gagie, Travis
AU - Langmead, Ben
AU - Manzini, Giovanni
N1 - Publisher Copyright: © 2019, Springer Nature Switzerland AG.
PY - 2019
Y1 - 2019
N2 - While short read aligners, which predominantly use the FM-index, are able to easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the two main components of the FM-index in more detail: first, a rank data structure over the Burrows-Wheeler Transform (BWT) of the string, which allows us to find the interval in the string’s suffix array (SA) containing pointers to the starting positions of occurrences of a given pattern; second, a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases by run-length compressing the BWT, but until recently there was no known means of keeping the SA sample small without greatly slowing down access to the SA. Now that Gagie et al. (SODA 2018) have defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building Gagie et al.’s SA sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method to indexing partial and whole human genomes and show that it improves over Bowtie with respect to both memory and time. Availability: The implementations of our methods can be found at https://gitlab.com/manzai/Big-BWT (BWT and SA sample construction) and at https://github.com/alshai/r-index (indexing).
AB - While short read aligners, which predominantly use the FM-index, are able to easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the two main components of the FM-index in more detail: first, a rank data structure over the Burrows-Wheeler Transform (BWT) of the string, which allows us to find the interval in the string’s suffix array (SA) containing pointers to the starting positions of occurrences of a given pattern; second, a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases by run-length compressing the BWT, but until recently there was no known means of keeping the SA sample small without greatly slowing down access to the SA. Now that Gagie et al. (SODA 2018) have defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building Gagie et al.’s SA sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method to indexing partial and whole human genomes and show that it improves over Bowtie with respect to both memory and time. Availability: The implementations of our methods can be found at https://gitlab.com/manzai/Big-BWT (BWT and SA sample construction) and at https://github.com/alshai/r-index (indexing).
UR - https://www.scopus.com/pages/publications/85065534874
U2 - 10.1007/978-3-030-17083-7_10
DO - 10.1007/978-3-030-17083-7_10
M3 - Conference contribution
AN - SCOPUS:85065534874
SN - 9783030170820
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 158
EP - 173
BT - Research in Computational Molecular Biology - 23rd Annual International Conference, RECOMB 2019, Proceedings
A2 - Cowen, Lenore J.
PB - Springer Verlag
T2 - 23rd International Conference on Research in Computational Molecular Biology, RECOMB 2019
Y2 - 5 May 2019 through 8 May 2019
ER -