Document Type
Dissertation
Degree
Doctor of Philosophy (PhD)
Major/Program
Computer Science
First Advisor's Name
Fahad Saeed
First Advisor's Committee Title
Committee chair
Second Advisor's Name
Ananda Mondal
Second Advisor's Committee Title
Committee member
Third Advisor's Name
Jason Liu
Third Advisor's Committee Title
Committee member
Fourth Advisor's Name
Janki Bhimani
Fourth Advisor's Committee Title
Committee member
Fifth Advisor's Name
Jun Li
Fifth Advisor's Committee Title
Committee member
Sixth Advisor's Name
Ashok Srinivasan
Sixth Advisor's Committee Title
Committee member
Keywords
bioinformatics, computational engineering, computer and systems architecture, computer sciences, systems architecture, theory and algorithms
Date of Defense
3-15-2023
Abstract
Fast and accurate identification of peptides and proteins from mass spectrometry (MS) data is a critical problem in modern systems biology. Database peptide search is the most commonly used computational method for identifying peptide sequences from MS data. In this method, gigabytes of experimentally generated MS data are compared against terabyte-sized databases of theoretically simulated MS data, resulting in a compute- and data-intensive problem that requires days or weeks of computation time on desktop machines. Existing serial and high-performance computing (HPC) algorithms strive to accelerate the search and improve its computational efficiency, but exhibit suboptimal performance due to inefficient parallelization models, low resource utilization, and high overhead costs.
In this dissertation, we design and develop data- and architecture-aware algorithms and optimizations to accelerate database peptide search on heterogeneous distributed-memory (TOP500) supercomputers. We first present an HPC framework that efficiently parallelizes both the compute- and memory-intensive portions of database peptide search workloads across homogeneous supercomputers, achieving a 10x speedup over state-of-the-art algorithms. To achieve maximum performance, we also develop several optimizations, including a low-overhead algorithm for balanced distribution of the voluminous theoretical MS databases and a novel data structure that reduces the memory footprint of these databases by 2x without compromising query speed. We also develop GPU-accelerated algorithms, data pipelines, and optimizations that leverage heterogeneous (CPU-GPU) supercomputing architectures and further accelerate our HPC framework by 4x, for a combined speedup of 40x over existing shared-memory, distributed-memory, and GPU-accelerated software infrastructure. Furthermore, we extensively analyze the performance of our methods and show near-optimal results for several metrics, including throughput, resource utilization, and overheads. Finally, we explore possible extensions of our methods to accelerate existing and new numerical, machine-learning, and deep-learning based peptide identification algorithms.
Our advancements in HPC software infrastructure for ultrafast peptide identification have key applications in meta-proteomics, multiomics, and cancer research. These fields require very large computational resources to process terabyte-scale raw MS data quickly enough to support useful scientific investigations and discoveries.
Identifier
FIDC010999
ORCID
0000-0002-0697-6894
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 4.0 License.
Recommended Citation
Haseeb, Muhammad, "High-Performance Computing Algorithms for Accelerating Peptide Identification from Mass-Spectrometry Data using Heterogeneous Supercomputers" (2023). FIU Electronic Theses and Dissertations. 5324.
https://digitalcommons.fiu.edu/etd/5324
Included in
Bioinformatics Commons, Computational Engineering Commons, Computer and Systems Architecture Commons, Systems Architecture Commons, Theory and Algorithms Commons
Rights Statement
In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/
This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).