Document Type

Dissertation

Degree

Doctor of Philosophy (PhD)

Major/Program

Computer Science

First Advisor's Name

Ananda M. Mondal

First Advisor's Committee Title

Committee Chair

Second Advisor's Name

Giri Narasimhan

Second Advisor's Committee Title

Committee Member

Third Advisor's Name

Fahad Saeed

Third Advisor's Committee Title

Committee Member

Fourth Advisor's Name

Leonardo Bobadilla

Fourth Advisor's Committee Title

Committee Member

Fifth Advisor's Name

Wenrui Duan

Fifth Advisor's Committee Title

Committee Member

Keywords

Feature Selection, Deep Learning, Cancer, TCGA, Classification

Date of Defense

3-23-2022

Abstract

Cancer is a complex molecular process driven by abnormal changes in the genome, such as mutations and copy number variations, and by epigenetic aberrations, such as dysregulation of long non-coding RNA (lncRNA). These changes are reflected in the transcriptome as oncogenes being turned on and tumor suppressor genes being turned off; such genes are considered cancer biomarkers.

However, transcriptomic data are high-dimensional, and finding the best subset of genes (features) implicated in causing cancer is computationally challenging and expensive. Developing a feature selection framework to discover molecular biomarkers for cancer is therefore critical.

Traditional approaches to biomarker discovery calculate the fold change for each gene by comparing expression profiles between tumor and healthy samples, and therefore fail to capture the combined effect of the whole gene set. These approaches also do not always investigate how well the discovered biomarkers predict cancer type.
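As an illustration of the per-gene fold-change calculation mentioned above, the following is a minimal sketch, not the dissertation's code. The function name `log2_fold_change`, the `pseudocount` parameter, and the toy expression values are all assumptions made for this example.

```python
import math
from statistics import mean


def log2_fold_change(tumor, healthy, pseudocount=1.0):
    """Per-gene log2 ratio of mean tumor to mean healthy expression.

    tumor, healthy: dict mapping gene name -> list of expression values.
    pseudocount: hypothetical smoothing constant that avoids division by zero.
    """
    result = {}
    for gene in tumor:
        t = mean(tumor[gene]) + pseudocount
        h = mean(healthy[gene]) + pseudocount
        result[gene] = math.log2(t / h)
    return result


# Made-up expression values, for illustration only.
tumor = {"geneA": [7.0, 7.0], "geneB": [1.0, 1.0]}
healthy = {"geneA": [1.0, 1.0], "geneB": [1.0, 1.0]}
lfc = log2_fold_change(tumor, healthy)
```

Each gene is scored independently of all others, which is why a ranking of this kind cannot reflect the combined effect of a gene set.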

In this work, we proposed a machine learning-based framework to address all of the above challenges in discovering lncRNA biomarkers. First, we developed a machine learning pipeline that takes lncRNA expression profiles of cancer samples as input and outputs a small set of key lncRNAs that can accurately predict multiple cancer types. A significant innovation of our work is its ability to identify biomarkers without using healthy samples. However, this initial framework cannot identify cancer-specific lncRNAs. Second, we extended our framework to identify cancer type-specific and subtype-specific lncRNAs. Third, we proposed using a state-of-the-art deep learning algorithm, the concrete autoencoder (CAE), in an unsupervised setting to efficiently identify a subset of the most informative features. However, because of its stochastic nature, CAE does not select the same features reproducibly across runs. To address this issue, we proposed a multi-run CAE (mrCAE) that identifies a stable set of features. Our deep learning-based pipeline significantly extended the previous state-of-the-art feature selection techniques.
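The idea of aggregating several stochastic runs into a stable feature set can be sketched as follows. This is an assumption-laden illustration rather than the mrCAE implementation: the selector below is a simple noisy top-k stand-in, not a concrete autoencoder, and the frequency threshold `min_fraction`, the function names, and the toy importance scores are hypothetical.

```python
import random
from collections import Counter


def stochastic_select(importance, k, rng):
    """Stand-in for one run of a stochastic selector (NOT a concrete autoencoder).

    Adds uniform noise in [0, 1) to each score and returns the top-k indices.
    """
    noisy = [(score + rng.random(), idx) for idx, score in enumerate(importance)]
    noisy.sort(reverse=True)
    return [idx for _, idx in noisy[:k]]


def stable_features(importance, k, n_runs, min_fraction, seed=0):
    """Keep features selected in at least min_fraction of n_runs runs."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        counts.update(stochastic_select(importance, k, rng))
    return sorted(f for f, c in counts.items() if c / n_runs >= min_fraction)


# Toy scores: features 0 and 1 are clearly informative, the rest are noise.
importance = [10.0, 10.0] + [0.0] * 8
stable = stable_features(importance, k=3, n_runs=20, min_fraction=0.8)
```

In this toy setup, features 0 and 1 are chosen in every run, while the third slot varies from run to run, so counting selections across runs separates the reproducible features from the incidental ones.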

Finally, we showed that the discovered biomarkers are biologically relevant, through a literature review, and prognostically significant, through survival analyses. The novel biomarkers could be used as a screening tool for diagnosing different cancers and as therapeutic targets.

Identifier

FIDC010688

ORCID

https://orcid.org/0000-0002-0610-3057

Previously Published In

Al Mamun, A., Tanvir, R.B., Sobhan, M., Mathee, K., Narasimhan, G., Holt, G.E. and Mondal, A.M., 2021. Multi-run Concrete Autoencoder to Identify Prognostic lncRNAs for 12 Cancers. International Journal of Molecular Sciences, 22(21), p.11919. (Impact Factor 5.92)

Al Mamun, A., W. Duan and A. M. Mondal, "Pan-cancer Feature Selection and Classification Reveals Important Long Non-coding RNAs," 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Seoul, Korea (South), 2020, pp. 2417-2424.

Al Mamun, Abdullah, Masrur Sobhan, Raihanul Bari Tanvir, Charles J. Dimitroff, and Ananda M. Mondal. "Deep Learning to Discover Cancer Glycome Genes Signifying the Origins of Cancer." In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 2425-2431. IEEE, 2020.

Al Mamun, A., and Ananda Mohan Mondal. "Long non-coding RNA based cancer classification using deep neural networks." Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. 2019.

Al Mamun, A. and A. M. Mondal, "Feature Selection and Classification Reveal Key lncRNAs for Multiple Cancers," 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 2019, pp. 2825-2831.

Rights Statement

In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/
This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).