Microbiome Depiction Through User-Adapted Bioinformatic Pipelines and Parameters
Abstract
Introduction and Objective. The microbiome’s role in human health is well recognized. However, substantial variability in the bioinformatic protocols used to analyze microbial genomic data has impeded the incorporation of microbiomics into the clinical setting. Few evidence-based recommendations exist for setting the parameters of programs that infer microbiota composition, even though these parameters significantly affect the accuracy of the analysis. Our aim was to compare three programs (DADA2, QIIME2, and mothur) and optimize them into four user-adapted pipelines for processing paired-end amplicon reads. We hope to increase the accuracy of microbiota compositional analysis and help standardize microbiomic protocols.
Methods. Two key parameters were independently evaluated at varying values across the four pipelines: filtering of sequence reads based on a whole-number error threshold (maxEE) and truncation of read ends based on a quality score threshold (QTrim). First, the loss of input sequence reads was determined. Then, closeness of sample inference was evaluated by comparing the community profiles generated from a mock community with the mock’s known composition, using weighted UniFrac distance. The mock community contained 20 bacterial species in equal proportions. Lastly, pipeline sensitivity and spurious taxa call rate were evaluated at the genus and species phylogenetic levels.
Results. The quantity of raw genomic data lost varied by pipeline but, overall, read retention decreased as parameters were set more stringently: it was inversely related to QTrim and directly related to maxEE. While all pipelines were 100% sensitive at the genus level, DADA2 achieved the highest species-level sensitivity when the maxEE and QTrim parameters were evaluated in isolation (40% and 35%, respectively). Of the pipelines, mothur falsely detected the most taxa (28 genera). Accuracy of overall sample inference increased with sequence read retention. DADA2 with maxEE set to four yielded the lowest UniFrac distance from the mock community, 0.183. There was no significant difference among the average UniFrac scores of each pipeline’s aggregate output (p = 0.152).
Conclusions-Implications. To improve microbial community profiling, bioinformatic protocols must be user-adapted. We found DADA2 to be the best pipeline for microbial compositional analysis, and short-read 16S sequencing to be appropriate only for identifying bacterial genera and higher phylogenetic ranks.
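As an illustration of how the two parameters named in the Methods can drive read retention, the following is a minimal Python sketch, not the pipelines' actual code. It assumes that maxEE refers to the expected number of errors computed from Phred scores (the sum of 10^(-Q/10) over a read, as DADA2 documents for its maxEE option) and that QTrim truncates a read at the first base whose score is at or below the threshold (as DADA2 documents for truncQ); the other programs may apply different rules. The function names and the `min_len` parameter are hypothetical and serve only this example.

```python
from typing import List, Sequence


def expected_errors(quals: Sequence[int]) -> float:
    """Expected number of errors in a read: sum of 10^(-Q/10) over its Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def truncate_at_quality(quals: Sequence[int], q_trim: int) -> List[int]:
    """Cut the read at the first base whose score is <= q_trim (assumed rule)."""
    for i, q in enumerate(quals):
        if q <= q_trim:
            return list(quals[:i])
    return list(quals)


def read_retention(
    reads: Sequence[Sequence[int]],
    q_trim: int,
    max_ee: float,
    min_len: int = 1,  # hypothetical minimum length required after truncation
) -> float:
    """Fraction of reads that survive truncation followed by the maxEE filter."""
    if not reads:
        return 0.0
    kept = 0
    for quals in reads:
        trimmed = truncate_at_quality(quals, q_trim)
        # A read is kept only if it is still long enough and its
        # expected errors do not exceed the maxEE threshold.
        if len(trimmed) >= min_len and expected_errors(trimmed) <= max_ee:
            kept += 1
    return kept / len(reads)
```

In this sketch, lowering `max_ee` or raising `q_trim` can only discard more reads, which matches the direction of the retention trends reported in the Results.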
Keywords
Microbiology, Genetics; Basic Sciences
Presentation Type
Poster Presentation