Document Type



Doctor of Philosophy (PhD)


Computer Science

First Advisor's Name

Giri Narasimhan

First Advisor's Committee Title

Committee Chair

Second Advisor's Name

Ruogu Fang

Second Advisor's Committee Title

Committee Member

Third Advisor's Name

Jennifer Clarke

Third Advisor's Committee Title

Committee Member

Fourth Advisor's Name

Kalai Mathee

Fourth Advisor's Committee Title

Committee Member

Fifth Advisor's Name

Leonardo Bobadilla

Fifth Advisor's Committee Title

Committee Member


Keywords

Cloud Computing, MapReduce, Hilbert Curve, Deep Learning, Metagenomics, Microbiome, DNA Sequencing, Image Analysis, Neural Networks, Genomics

Date of Defense



Metagenomics is the study of the combined genetic material found in microbiome samples, and it serves as an instrument for studying microbial communities, their biodiversity, and their relationships to their host environments. Creating, interpreting, and understanding microbial community profiles produced from microbiome samples is challenging: it requires large computational resources along with innovative techniques to process and analyze datasets that can contain terabytes of information.

The community profiles are critical because they provide information about what microorganisms are present in the sample, and in what proportions. This is particularly important as many human diseases and environmental disasters are linked to changes in microbiome compositions.

In this work, we propose novel approaches for the creation and interpretation of microbial community profiles. These include: (a) a cloud-based, distributed computational system that generates detailed community profiles by processing large DNA sequencing datasets against large reference genome collections, (b) the creation of Microbiome Maps: interpretable, high-resolution visualizations of community profiles, and (c) a machine learning framework for characterizing microbiomes from the Microbiome Maps, delivering deep insights into microbial communities.

The proposed approaches have been implemented in three software solutions: Flint, a large-scale profiling framework for commercial cloud systems that can process millions of DNA sequencing fragments and produce microbial community profiles at very low cost; Jasper, a novel method for creating Microbiome Maps, which visualize abundance profiles using the Hilbert curve; and Amber, a machine learning framework that characterizes microbiomes with high accuracy using the Microbiome Maps generated by Jasper.

Results show that Flint scales well to reference genome collections that are an order of magnitude larger than those used by competing tools, while taking less than a minute to profile a million reads on the cloud with 65 commodity processors. Microbiome Maps produced by Jasper are compact, scalable representations of extremely complex microbial community profiles with numerous demonstrable advantages, including the ability to display latent relationships that are otherwise hard to elicit. Finally, experiments show that by using images as input instead of unstructured tabular input, Amber can outperform other sophisticated machine learning tools available for the classification of microbiomes.
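
To illustrate the Hilbert-curve layout that the abstract attributes to Jasper, the following is a minimal sketch and not Jasper's actual implementation. It uses the standard index-to-coordinate conversion for a Hilbert curve and places a one-dimensional list of abundance values onto a square grid so that values adjacent in the list stay adjacent in the image. The function names (`hilbert_d2xy`, `abundance_map`), the input ordering, and the example values are all assumptions made for illustration; the abstract does not state how genomes are ordered along the curve.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order by 2**order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the quadrant so sub-curves join end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def abundance_map(abundances, order):
    """Lay out abundance values along the curve (hypothetical helper).

    Assumes `abundances` is already sorted in whatever linear order the
    caller wants preserved, e.g. a taxonomic ordering of reference genomes.
    Cells beyond the end of the list are left at 0.0.
    """
    n = 1 << order
    if len(abundances) > n * n:
        raise ValueError("grid too small for the number of abundance values")
    grid = [[0.0] * n for _ in range(n)]
    for d, value in enumerate(abundances):
        x, y = hilbert_d2xy(order, d)
        grid[y][x] = value
    return grid


if __name__ == "__main__":
    # Four made-up relative abundances on a 2 x 2 grid.
    for row in abundance_map([0.4, 0.3, 0.2, 0.1], order=1):
        print(row)
```

The point of the layout is locality: neighbouring positions in the linear ordering land in neighbouring pixels, which is what allows an image-based classifier to exploit the spatial structure of the map.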




Previously Published In

Valdes, C., Stebliankin, V., & Narasimhan, G. (2019). Large scale microbiome profiling in the cloud. Bioinformatics (Oxford, England), 35(14), i13–i22.

Creative Commons License

Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License




Rights Statement


In Copyright. URI:
This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).