A Sophisticated Bioinformatic Pipeline
The MicroGenDX bioinformatic pipeline separates high-quality reads from procedural noise and performs taxonomic assignment as accurately as possible. Its main stages are:
- Assess the quality scores assigned by the MiSeq sequencer
- Quality trimming and read merging
- Chimera detection – Chimeric sequences occur when an aborted sequence extension is mistaken for a primer and is then incorrectly extended in subsequent PCR cycles
- Read cleanup
- OTU (operational taxonomic unit) selection – Dereplicate sequences; remove all singleton clusters; trim all sequences to the same length; perform OTU clustering using UPARSE; map original reads to OTUs
- Taxonomic assignment – Perform global alignment against the MicroGenDX curated database of known microbes
- Diversity analysis – Analyze confidence values; merge matching OTUs and calculate percentages
The MicroGenDX analysis pipeline accomplishes denoising by employing the following steps on each sequencing run:
- The forward and reverse reads are taken in FASTQ format and are merged together using the PEAR Illumina paired-end read merger.
- The FASTQ formatted files are converted into FASTA-formatted sequence and quality files.
- Reads are run through an internally developed quality trimming algorithm. During this stage, a running average of the quality scores is taken across each read, and the read is trimmed back to the last base at which the running average is greater than 25.
- Sequence reads are then sorted by length from longest to shortest.
- Prefix dereplication is performed using the USEARCH algorithm. Prefix dereplication groups reads into clusters so that each sequence of equal or shorter length than the centroid sequence must match the centroid exactly over its own full length. Each cluster is marked with its total number of member sequences. Sequences shorter than 100 bp are not written to the output file; however, no minimum cluster size is applied, so singleton clusters can exist in the output.
- Clustering at 4% divergence is performed using the USEARCH clustering algorithm. The result of this stage is the consensus sequence from each new cluster, each tagged with its total number of member sequences (dereplicated + clustered). Clusters that contain fewer than 2 members (singleton clusters) are not added to the output file, which removes them from the data set.
- OTU Selection is performed using the UPARSE OTU selection algorithm to classify the large number of clusters into OTUs.
- Chimera checking is performed on the selected OTUs using the UCHIME chimera detection software executed in de novo mode.
- Each cluster centroid from the 4% clustering step (step 6 above) is then mapped to its corresponding OTU and marked as either chimeric or non-chimeric. All chimeric sequences are then removed.
- Each read from the quality trimming step (step 3 above) is then mapped to its corresponding non-chimeric cluster using the USEARCH global alignment algorithm.
- Using the consensus sequence for each centroid as a guide, each sequence in a cluster is then aligned to the consensus sequence, and each base is corrected using the following rules, where “C” is the consensus sequence and “S” is the aligned sequence:
  - If the alignment marks the current base in S for deletion, the base is removed from the sequence if its quality score is less than 30.
  - If the alignment marks the current position in S for insertion of a base from C, the base is inserted into the sequence if the mean quality score across all sequences that contain that base is greater than 30.
  - If the current position in S is marked as a match to C but the bases are different, the base in S is changed to the base in C if its quality score is less than 30.
  - If a base was inserted or changed, the quality score for that position is updated. If the base was deleted, the quality score for that position is removed.
  - Otherwise, the base in S is left alone and the algorithm moves to the next position.
- The corrected sequences are then written to the output file.
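The base-correction rules above can be sketched in Python as follows. This is a minimal illustration, not the MicroGenDX implementation: the function name `correct_read`, the gapped-string alignment representation, and the `column_support` input (the quality scores of other cluster members that have a base in each alignment column) are all hypothetical. The source does not say what the updated quality score should be, so the sketch assumes it is the rounded mean of the supporting scores.

```python
from statistics import mean

Q_THRESHOLD = 30  # quality cutoff named in the correction rules


def correct_read(aligned_s, aligned_c, quals, column_support):
    """Correct one read (S) against its cluster consensus (C).

    aligned_s, aligned_c: gapped alignment rows of equal length ('-' is a gap).
    quals: quality scores for the ungapped bases of S, in order.
    column_support: per alignment column, the quality scores of other reads
        in the cluster that have a base there (hypothetical input).
    Returns (corrected_sequence, corrected_quality_scores).
    """
    out_seq, out_qual = [], []
    qi = 0  # index into quals; advances whenever S has a base
    for col, (s, c) in enumerate(zip(aligned_s, aligned_c)):
        support = column_support[col]
        if c == "-":
            # S has a base the consensus lacks: candidate deletion.
            q = quals[qi]
            qi += 1
            if q < Q_THRESHOLD:
                continue  # drop the base and its quality score
            out_seq.append(s)
            out_qual.append(q)
        elif s == "-":
            # The consensus has a base S lacks: candidate insertion.
            if support and mean(support) > Q_THRESHOLD:
                out_seq.append(c)
                out_qual.append(round(mean(support)))  # assumed update rule
        else:
            q = quals[qi]
            qi += 1
            if s != c and q < Q_THRESHOLD:
                # Low-confidence mismatch: adopt the consensus base.
                out_seq.append(c)
                out_qual.append(round(mean(support)) if support else q)
            else:
                out_seq.append(s)
                out_qual.append(q)
    return "".join(out_seq), out_qual


seq, quals = correct_read(
    "ACG-TA",
    "ATGCT-",
    [35, 20, 38, 36, 10],
    [[36], [38, 40], [37], [34, 36], [35], []],
)
print(seq, quals)
```

In this example the low-quality mismatch in column 2 is changed to the consensus base, the well-supported consensus base in column 4 is inserted, and the low-quality trailing base is deleted.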
OTU selection groups sequences into clusters using an OTU selection program. OTU selection is performed using the guidelines discussed in the paper “UPARSE: Highly accurate OTU sequences from microbial amplicon reads” by Robert Edgar. That paper lays out the following methodology for selecting OTUs:
- Perform dereplication on the sequences
- Remove all singleton clusters from the data set and sort the data by abundance
- Trim all sequences to the same length
- Perform OTU clustering using UPARSE
- Map original reads to the OTUs
Dereplication of sequences is performed using the USEARCH prefix dereplication method. Once complete, we remove all singleton clusters and sort the remaining sequences by cluster size from largest to smallest. The sequences are then run through a trimming algorithm that trims each sequence down to the same length. Note that the sequences are trimmed only for UPARSE; the final taxonomic analysis is based on the full-length sequences. Next, we use the UPARSE algorithm to select OTUs. Using the USEARCH global alignment algorithm, we then assign each of the original reads back to its OTU and write the mapping data to an OTU map and an OTU table file.
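The preparation steps before UPARSE (prefix dereplication, singleton removal, sorting by abundance, and trimming to a common length) can be illustrated with a small pure-Python sketch. It is not USEARCH and is far slower; the function names and the `trim_len` parameter are hypothetical, a read that is a prefix of several centroids is simply assigned to the first one found, and dropping sequences shorter than `trim_len` is an assumption the source does not state.

```python
def prefix_dereplicate(reads):
    """Group reads so that each member is an exact prefix of its centroid.

    Returns a dict mapping centroid sequence -> number of member reads.
    """
    clusters = {}
    # Longest reads first, so centroids exist before their shorter prefixes.
    for read in sorted(reads, key=len, reverse=True):
        for centroid in clusters:
            if centroid.startswith(read):
                clusters[centroid] += 1
                break
        else:
            clusters[read] = 1  # no centroid matched: start a new cluster
    return clusters


def prepare_for_otu_clustering(reads, trim_len):
    """Dereplicate, drop singletons, sort by abundance, trim to trim_len."""
    clusters = prefix_dereplicate(reads)
    kept = [(seq, size) for seq, size in clusters.items() if size >= 2]
    kept.sort(key=lambda item: item[1], reverse=True)
    # Assumption: sequences shorter than trim_len are discarded.
    return [(seq[:trim_len], size) for seq, size in kept if len(seq) >= trim_len]


reads = ["ACGTACGT", "ACGT", "ACGTAC", "TTGCA", "TTGCA", "GGGG"]
derep = prefix_dereplicate(reads)
prepared = prepare_for_otu_clustering(reads, trim_len=5)
print(derep)
print(prepared)
```

Here the singleton cluster `GGGG` is removed before clustering, and the two remaining centroids are ordered by size and cut to five bases.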
The global search method combines the USEARCH global search algorithm with a Python program to determine the taxonomic assignment for each read.
This method is described in the paper “An extensible framework for optimizing classification enhances short-amplicon taxonomic assignments” by Nicholas Bokulich, et al.
The paper describes a methodology in which a high-quality database is paired with USEARCH to rapidly find the top 6 matches in the database for a given sequence. From these 6 sequences we then assign a confidence value to each taxonomic level (kingdom, phylum, class, order, family, genus and species) by taking the number of taxonomic matches that agree with the top match and dividing by the total number of matches. For example, if Bacteria is the top kingdom match, with five matches showing Bacteria and one match showing Archaea, our algorithm assigns the kingdom Bacteria a confidence of 5/6 = 0.83.
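The confidence calculation can be sketched as below. The function name `rank_confidences` and the representation of each hit as a seven-element lineage tuple are hypothetical. The sketch compares only the name at each level with the top match, which is one reading of "agree with the top match"; the actual program may compare full lineages instead.

```python
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def rank_confidences(hits):
    """Compute a confidence for each taxonomic level.

    hits: list of lineages (one name per rank in RANKS), best match first.
    Returns a dict mapping rank -> (name from the top match, confidence).
    """
    top = hits[0]
    result = {}
    for i, rank in enumerate(RANKS):
        # Count the hits whose name at this level agrees with the top match.
        agree = sum(1 for lineage in hits if lineage[i] == top[i])
        result[rank] = (top[i], agree / len(hits))
    return result


bacterial = ("Bacteria", "Firmicutes", "Bacilli", "Bacillales",
             "Staphylococcaceae", "Staphylococcus", "aureus")
archaeal = ("Archaea", "Euryarchaeota", "Methanobacteria", "Methanobacteriales",
            "Methanobacteriaceae", "Methanobrevibacter", "smithii")
conf = rank_confidences([bacterial] * 5 + [archaeal])
print(conf["kingdom"])
```

With five bacterial hits and one archaeal hit, the kingdom-level confidence for Bacteria is 5/6, matching the worked example above.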
The diversity analysis program takes the OTU/Derep table output from sequence clustering, along with the output generated during taxonomic identification, and generates a new OTU table with the taxonomic information tied to each cluster.
This updated OTU table is then written to the output analysis folder with both the trimmed and full taxonomic information for each cluster. For each taxonomic level (kingdom, phylum, class, order, family, genus and species), four files are generated that contain the number of sequences per full taxonomic match per sample, the percentage per full taxonomic match per sample, the number of sequences per trimmed taxonomic match per sample, and the percentage per trimmed taxonomic match per sample.
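As a rough illustration of the per-sample counts and percentages described above, the sketch below aggregates an OTU table at one taxonomic level. The function name `summarize_rank` and the input layout (an OTU table as nested dicts and a lineage tuple per OTU) are hypothetical, and the sketch does not model the distinction between trimmed and full taxonomic matches.

```python
from collections import defaultdict


def summarize_rank(otu_table, taxonomy, rank_index):
    """Aggregate read counts and percentages per taxon per sample.

    otu_table: {otu_id: {sample: read_count}}
    taxonomy: {otu_id: lineage tuple}, indexed by rank_index
    Returns (counts, percentages), each shaped {sample: {taxon: value}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for otu_id, per_sample in otu_table.items():
        taxon = taxonomy[otu_id][rank_index]
        for sample, n in per_sample.items():
            counts[sample][taxon] += n  # merge OTUs that share a taxon

    percentages = {}
    for sample, per_taxon in counts.items():
        total = sum(per_taxon.values())
        percentages[sample] = {
            taxon: (100.0 * n / total if total else 0.0)
            for taxon, n in per_taxon.items()
        }
    return {s: dict(t) for s, t in counts.items()}, percentages


otu_table = {
    "OTU1": {"A": 60, "B": 10},
    "OTU2": {"A": 30, "B": 0},
    "OTU3": {"A": 10, "B": 30},
}
taxonomy = {
    "OTU1": ("Bacteria", "Staphylococcus"),
    "OTU2": ("Bacteria", "Staphylococcus"),
    "OTU3": ("Bacteria", "Pseudomonas"),
}
counts, percentages = summarize_rank(otu_table, taxonomy, rank_index=1)
print(counts)
print(percentages)
```

OTU1 and OTU2 share a genus, so their reads are merged before the percentages are calculated for each sample.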
- Edgar R C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. pp. 1-3. 12 August 2010.
- Edgar R C. UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nature Methods. vol. 10, pp. 996-998. 2013.
- Edgar R C, Haas B J, et al. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics. vol. 27, no. 16, pp. 2194-2200. 2011.
- Bokulich N A, Rideout J R, et al. An extensible framework for optimizing classification enhances short-amplicon taxonomic assignments. Not Yet Published. 2014.
- Zhang J, Kobert K, et al. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics. 2013.