Document Type
Dissertation
Degree
Doctor of Philosophy (PhD)
Major/Program
Computer Science
First Advisor's Name
Ananda M. Mondal
First Advisor's Committee Title
Committee Chair
Second Advisor's Name
Giri Narasimhan
Second Advisor's Committee Title
Committee member
Third Advisor's Name
Fahad Saeed
Third Advisor's Committee Title
Committee member
Fourth Advisor's Name
Leonardo Bobadilla
Fourth Advisor's Committee Title
Committee member
Fifth Advisor's Name
Wenrui Duan
Fifth Advisor's Committee Title
Committee member
Keywords
Feature Selection, Deep Learning, Cancer, TCGA, Classification
Date of Defense
3-23-2022
Abstract
Cancer is a complex molecular process driven by abnormal changes in the genome, such as mutations and copy number variations, and by epigenetic aberrations, such as dysregulation of long non-coding RNAs (lncRNAs). These changes are reflected in the transcriptome, where oncogenes are turned on and tumor suppressor genes are turned off; such genes are considered cancer biomarkers.
However, transcriptomic data are high-dimensional, and finding the best subset of genes (features) related to causing cancer is computationally challenging and expensive. Thus, developing a feature selection framework to discover molecular biomarkers for cancer is critical.
Traditional approaches to biomarker discovery calculate the fold change of each gene individually by comparing expression profiles between tumor and healthy samples, and therefore fail to capture the combined effect of the whole gene set. In addition, these approaches do not always evaluate how well the discovered biomarkers predict cancer type.
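As a rough illustration of the per-gene fold-change calculation described above, the following minimal sketch computes a log2 fold change for each gene independently. The function name, the pseudocount, and the toy expression values are assumptions made for this example and are not taken from the dissertation.

```python
import math
from statistics import mean


def log2_fold_change(tumor, normal, pseudocount=1.0):
    """Per-gene log2 ratio of mean tumor to mean normal expression.

    `tumor` and `normal` map gene names to lists of expression values.
    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    result = {}
    # Only genes measured in both groups can be compared.
    for gene in tumor.keys() & normal.keys():
        t = mean(tumor[gene]) + pseudocount
        n = mean(normal[gene]) + pseudocount
        result[gene] = math.log2(t / n)
    return result


# Toy expression values (hypothetical, not TCGA data).
tumor = {"GENE_A": [7.0, 7.0], "GENE_B": [1.0, 1.0]}
normal = {"GENE_A": [1.0, 1.0], "GENE_B": [3.0, 3.0]}
fc = log2_fold_change(tumor, normal)
```

Each gene is scored in isolation here, which is why an approach of this kind cannot reflect the joint effect of a gene set.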
In this work, we proposed a machine learning-based framework to address all of the above challenges in discovering lncRNA biomarkers. First, we developed a machine learning pipeline that takes lncRNA expression profiles of cancer samples as input and outputs a small set of key lncRNAs that can accurately predict multiple cancer types. A significant innovation of our work is its ability to identify biomarkers without using healthy samples. However, this initial framework cannot identify cancer-specific lncRNAs. Second, we extended our framework to identify cancer type-specific and subtype-specific lncRNAs. Third, we proposed using a state-of-the-art deep learning algorithm, the concrete autoencoder (CAE), in an unsupervised setting to efficiently identify a subset of the most informative features. However, because of its stochastic nature, CAE does not select the same features in different runs. To address this issue, we proposed a multi-run CAE (mrCAE) that identifies a stable set of features. Our deep learning-based pipeline significantly extended the previous state-of-the-art feature selection techniques.
Finally, we showed that the discovered biomarkers are biologically relevant, based on a literature review, and prognostically significant, based on survival analyses. The discovered novel biomarkers could be used as a screening tool for diagnosing different cancers and as therapeutic targets.
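The idea of stabilizing a stochastic feature selector by running it several times can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's mrCAE implementation: the aggregation rule (keep the features chosen most often across runs) is one plausible choice, and `toy_selector` is a hypothetical stand-in for a single CAE run.

```python
import random
from collections import Counter


def stable_features(select_once, n_runs, top_k):
    """Run a stochastic selector n_runs times and keep the top_k
    features that were selected most frequently across runs."""
    counts = Counter()
    for run in range(n_runs):
        # Count each feature at most once per run.
        counts.update(set(select_once(run)))
    # Sort by frequency (descending), then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_k]]


def toy_selector(seed):
    """Hypothetical stand-in for one CAE run: always returns two 'core'
    features plus one randomly chosen 'noise' feature."""
    rng = random.Random(seed)
    noise = rng.choice(["lnc_3", "lnc_4", "lnc_5", "lnc_6"])
    return ["lnc_1", "lnc_2", noise]


selected = stable_features(toy_selector, n_runs=20, top_k=2)
```

In this toy setup the two core features appear in every run, so they are the ones retained, while the noise feature that varies from run to run is filtered out.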
Identifier
FIDC010688
ORCID
https://orcid.org/0000-0002-0610-3057
Previously Published In
Al Mamun, A., Tanvir, R.B., Sobhan, M., Mathee, K., Narasimhan, G., Holt, G.E. and Mondal, A.M., 2021. Multi-run Concrete Autoencoder to Identify Prognostic lncRNAs for 12 Cancers. International Journal of Molecular Sciences, 22(21), p.11919. (Impact Factor 5.92)
Al Mamun, A., W. Duan and A. M. Mondal, "Pan-cancer Feature Selection and Classification Reveals Important Long Non-coding RNAs," 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Seoul, Korea (South), 2020, pp. 2417-2424.
Al Mamun, Abdullah, Masrur Sobhan, Raihanul Bari Tanvir, Charles J. Dimitroff, and Ananda M. Mondal. "Deep Learning to Discover Cancer Glycome Genes Signifying the Origins of Cancer." In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 2425-2431. IEEE, 2020.
Al Mamun, A., and Ananda Mohan Mondal. "Long non-coding RNA based cancer classification using deep neural networks." Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. 2019.
Al Mamun, A. and A. M. Mondal, "Feature Selection and Classification Reveal Key lncRNAs for Multiple Cancers," 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 2019, pp. 2825-2831.
Recommended Citation
Mamun, Md Abdullah Al, "A Machine Learning Framework for Identifying Molecular Biomarkers from Transcriptomic Cancer Data" (2022). FIU Electronic Theses and Dissertations. 4973.
https://digitalcommons.fiu.edu/etd/4973
Included in
Bioinformatics Commons, Biomedical Commons, Biostatistics Commons, Cancer Biology Commons, Computational Biology Commons, Computer Engineering Commons, Computer Sciences Commons, Data Science Commons, Health Information Technology Commons, Molecular, Cellular, and Tissue Engineering Commons
Rights Statement
In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/
This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).