Document Type

Dissertation

Degree

Doctor of Philosophy (PhD)

Major/Program

Computer Science

First Advisor's Name

Fahad Saeed

First Advisor's Committee Title

Committee chair

Second Advisor's Name

Ananda Mondal

Second Advisor's Committee Title

Committee member

Third Advisor's Name

Jason Liu

Third Advisor's Committee Title

Committee member

Fourth Advisor's Name

Janki Bhimani

Fourth Advisor's Committee Title

Committee member

Fifth Advisor's Name

Jun Li

Fifth Advisor's Committee Title

Committee member

Sixth Advisor's Name

Ashok Srinivasan

Sixth Advisor's Committee Title

Committee member

Keywords

bioinformatics, computational engineering, computer and systems architecture, computer sciences, systems architecture, theory and algorithms

Date of Defense

3-15-2023

Abstract

Fast and accurate identification of peptides and proteins from mass spectrometry (MS) data is a critical problem in modern systems biology. Database peptide search is the most commonly used computational method for identifying peptide sequences from MS data. In this method, gigabytes of experimentally generated MS data are compared against terabyte-sized databases of theoretically simulated MS data, resulting in a compute- and data-intensive problem that requires days or weeks of computation time on desktop machines. Existing serial and high-performance computing (HPC) algorithms strive to accelerate the search and improve its computational efficiency, but exhibit sub-optimal performance due to inefficient parallelization models, low resource utilization, and high overhead costs.

In this dissertation, we design and develop data- and architecture-aware algorithms and optimizations to accelerate database peptide search on heterogeneous distributed-memory (TOP500) supercomputers. We first present an HPC framework that efficiently parallelizes both the compute- and memory-intensive portions of database peptide search workloads across homogeneous supercomputers, achieving a 10x speedup over state-of-the-art algorithms. To achieve maximum performance, we also develop several optimizations, including a low-overhead algorithm for balanced distribution of the voluminous theoretical MS databases and a novel data structure that reduces the memory footprint of these databases by 2x without compromising query speed. We further develop GPU-accelerated algorithms, data pipelines, and optimizations that leverage heterogeneous (CPU-GPU) supercomputing architectures and accelerate our HPC framework by an additional 4x, providing a combined acceleration of 40x over existing shared-memory, distributed-memory, and GPU-accelerated software infrastructure. Furthermore, we extensively analyze the performance of our methods and show near-optimal results for several metrics, including throughput, resource utilization, and overheads. Finally, we explore possible extensions of our methods to accelerate existing and new numerical, machine-learning, and deep-learning-based peptide identification algorithms.

Our advancements in HPC software infrastructure for ultrafast peptide identification have key applications in meta-proteomics, multiomics, and cancer research, which require enormous computational resources to process terabyte-scale raw MS data quickly enough to enable useful scientific investigations and discoveries in these domains.

Identifier

FIDC010999

ORCID

0000-0002-0697-6894

Creative Commons License

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 License.

Rights Statement

In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/
This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).