This bioinformatics pipeline protocol is intended for processing the V4 region of the 16S rRNA gene amplicons sequenced by Illumina MiSeq sequencing 2x250 bp. The pipeline uses DADA2 (Callahan et al 2016) to infer amplicon sequence variants (ASV) from the amplicon sequences and “resolves differences of as little as 1 nucleotide”. The pipeline also uses DADA2 to assign taxonomy to the ASVs using using the Silva reference database. In addition, the pipeline uses decontam to “provide simple statistical methods to identify and visualize contaminating DNA features, allowing them to be removed and a more accurate picture of sampled communities to be constructed from marker-gene…data” (Davis et al 2018). Thus, this pipeline generates ASV and taxonomy tables with statistically identified contaminating taxa removed that will be ready for downstream analyses.
This protocol is written for use in a r-markdown document. I usually start the pipeline by adding a brief description of the project and the sample processing methods. I also like to specify the people who helped collect samples and process them. I find doing this helps when it comes to writing the manuscript.
Please provide a brief description of the project here.
Please provide a list of the people who have worked in the project and their affiliations. Don’t forgot those who have collected the samples, people who have helped with the sample processing and so on. hanges for your datasets.
Please provide a brief description of the sample processing here, a basic summary of the lab methods.
Here is an example:
DNA was extracted from the samples using the DNeasy PowerBiofilm kit (Qiagen, Germantown, MD, USA). The v4 region of the 16srRNA gene was amplified with barcoded 515FY and 806RB primers using the MiSeq protocol on the thermocycler for 37-38 cycles depending on the sample. 1 or 2 µL sample DNA (volume was sample dependent) was used in duplicate 25 µL reactions. PCR product was purified by gel purification with the Qiagen MinElute Gel Extraction Kit (Qiagen Inc., Germantown, MD, USA). Purified product was quantified using Qubit High Sensitivity dsDNA fluorometric assay and pooled to equimolar concentrations such that 5 ng purified amplicons per sample was put into the pooled library. Samples were sequenced in two sequencing runs (one for each pool of samples) with Illumina MiSeq sequencing (250 bp paired end) by the Roy J. Carver Biotechnology Center Core Sequencing Facility at the University of Illinois Urbana-Champaign as part of Apprill Lab sequencing library P2. The library prep details can be found in an excel workbook at [INSERT PATH or LINK]
Please provide the locations of the fastq files:
An example: The 2022 fastq files are currently located in two places:
1. Apprill Lab Google Drive /Sequencing related/Sequencing_Data/MiSeq_Sequencing_Data
2. /Users/cmiller/
Please provide the location of this Rmd file:
An example:
Location of this Rmd file: [INSERT PATH TO YOUR RMD FILE]
Details about DADA2 can be found here: https://benjjneb.github.io/dada2/index.html
For our purposes, the DADA2 Tutorial will be followed almost line by line. The tutorial can be located at https://benjjneb.github.io/dada2/tutorial.html.
You will need to first install DADA2, if you have not already
DADA2 requires the following:
All criteria are met.
IMPORTANT NOTE REGARDING MULTIPLE SEQUENCING RUNS: If you have multiple sequencing libraries that you would like to run through DADA2 to use in the same downstream analysis, you will run each sequencing run through the DADA2 pipeline separately until just before the chimera checking step. Once you have each run through the pipeline until that point, you will merge sequence tables with the command: mergetab <- mergeSequenceTables(seqtab2_run1, seqtab2_run2) and then check for chimeras and proceed with the rest of the pipeline
Here you will need to fill in your specific directories
#change directory to the one with the fastq files (example commented out below)
#cd /Users/cmiller/Dropbox/Miller_Documents/Projects/RightWhale_BlowAndPhotogrammetry/DADA2_Analyses/2022Sequences_RightWhaleBlowProject
#unzip the fastq files
#gunzip *.fastq.gz
# These packages are needed for dada2
library(dada2); packageVersion("dada2")
## Loading required package: Rcpp
## [1] '1.20.0'
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(phyloseq); packageVersion("phyloseq") # load phyloseq library
## Warning: package 'phyloseq' was built under R version 4.0.3
## [1] '1.34.0'
library(ggplot2)
library(decontam); packageVersion("decontam")
## Warning: package 'decontam' was built under R version 4.0.3
## [1] '1.10.0'
#Clear the global environment for a fresh start to ensure I do not use any carry-over variables.
rm(list=ls())
Here is an example - you will need to define your own path.
path <- "/Users/cmiller/Dropbox/Miller_Documents/Projects/RightWhale_BlowAndPhotogrammetry/DADA2_Analyses/2022Sequences_RightWhaleBlowProject"
head(list.files(path))
## [1] "CCB_2019_1_TGAGTACG-TACGAGAC_L001_R1_001.fastq"
## [2] "CCB_2019_1_TGAGTACG-TACGAGAC_L001_R2_001.fastq"
## [3] "CCB_2019_2_CTGCGTAG-TACGAGAC_L001_R1_001.fastq"
## [4] "CCB_2019_2_CTGCGTAG-TACGAGAC_L001_R2_001.fastq"
## [5] "CCB_2019_3_TAGTCTCC-TACGAGAC_L001_R1_001.fastq"
## [6] "CCB_2019_3_TAGTCTCC-TACGAGAC_L001_R2_001.fastq"
This step reads in the fastq filenames and creates a matched list of the forward and reverse filenames.
The second part of this step “extracting sample names” assumes the sample names have the format: SAMPLENAME_XXX.fastq
If your sample names look different, you will need to do some string manipulations to extract the sample name from the fastq filename.
A helpful reference for string manipulation is: https://r4ds.had.co.nz/strings.html#strings
and within that reference: https://r4ds.had.co.nz/strings.html#splitting
# Comment: DADA2 tutorial (you will need to remove the # to use):
# Comment: Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq
#fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))
#fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))
# Comment: Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
#sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
# Comment: Modified from the DADA2 tutorial to accommodate the differences in my sample names
# Comment: Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq
fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))
# Comment: Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sample.names22_1 <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
sample.names22_2 <- sapply(strsplit(basename(fnFs), "_"), `[`, 2)
sample.names22_3 <- sapply(strsplit(basename(fnFs), "_"), `[`, 3)
sample.names <- paste(sample.names22_1, sample.names22_2, sample.names22_3, sep='_')
# Comment: clean up work space
rm(sample.names22_1)
rm(sample.names22_2)
rm(sample.names22_3)
The command for doing this in the DADA2 tutorial is: plotQualityProfile(fnFs[1:2]) which plots the read quality profiles of the first two samples in the list of files.
Things to consider for your own data: You might consider choosing representative samples from each type of sample in your dataset instead of the first two samples on the list. Make sure you are viewing samples and not controls (controls can be messy looking).
# Comment: This is the command copied from DADA2 tutorial:
#plotQualityProfile(fnFs[1:2])
# Comment: This is an example of choosing and plotting specific samples
a <- c(1,6,28:30)
plotQualityProfile(fnFs[a])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
# Comment: clean up work space
rm(a)
# Comment: This is the command copied from DADA2 tutorial:
#plotQualityProfile(fnFs[1:2])
# Comment: This is an example of choosing and plotting specific samples
a <- c(8,9,24)
plotQualityProfile(fnFs[a])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
# Comment: clean up work space
rm(a)
# Comment: This is the command copied from DADA2 tutorial:
#plotQualityProfile(fnFs[1:2])
# Comment: This is an example of choosing and plotting specific samples
a <- c(1,6,28:30)
plotQualityProfile(fnRs[a])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
# Comment: clean up work space
rm(a)
# Comment: This is the command copied from DADA2 tutorial:
#plotQualityProfile(fnFs[1:2])
# Comment: This is an example of choosing and plotting specific samples
a <- c(8,9,24)
plotQualityProfile(fnRs[a])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
# Comment: clean up work space
rm(a)
DADA2 tutorial: “Green is median quality and orange is the quartiles of quality distribution.”In gray-scale is a heat map of the frequency of each quality score at each base position. The median quality score at each position is shown by the green line, and the quartiles of the quality score distribution by the orange lines. The red line shows the scaled proportion of reads that extend to at least that position (this is more useful for other sequencing technologies, as Illumina reads are typically all the same length, hence the flat red line)."
The forward reads in the tutorial were of good quality. The tutorial recommends trimming the last 10 nucleotides, i.e., truncating the read at position 240. The forward read quality profiles look like those in the tutorial, with the quartiles of the quality score distribution falling off-ish at the last 10 bp so I will trim the forward reads to 240 as recommended.
Like the reads in the tutorial, the reverse reads are lower and more variable quality (apparently common in Illumina sequencing).
DADA2 tutorial: “The reverse reads are of significantly worse quality, especially at the end, which is common in Illumina sequencing. This is not too worrisome, as DADA2 incorporates quality information into its error model which makes the algorithm robust to lower quality sequence, but trimming as the average qualities crash will improve the algorithm’s sensitivity to rare sequence variants. Based on these profiles, we will truncate the reverse reads at position 160 where the quality distribution crashes.”
Some of the reverse reads look better than those in the tutorial and do not look like they need as much trimming so I think I will trim at 200, which is different from the tutorial, which trims to 160.
Look at your plots and decide what the best trim length would be for your own dataset.
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names
maxN=
is the number of Ns (dada2 requires none)
maxEE=
is the number of expected error allowed in a read.
This example uses the same settings as the tutorial.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,200),
maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=TRUE)
#head(out)
out
## reads.in reads.out
## CCB_2019_1_TGAGTACG-TACGAGAC_L001_R1_001.fastq 32890 29937
## CCB_2019_2_CTGCGTAG-TACGAGAC_L001_R1_001.fastq 45172 38017
## CCB_2019_3_TAGTCTCC-TACGAGAC_L001_R1_001.fastq 38901 36755
## CCB_2019_4_CGAGCGAC-TACGAGAC_L001_R1_001.fastq 40037 35708
## CCB_2019_5_CGAGAGTT-ACGTCTCG_L001_R1_001.fastq 35530 29127
## CCB_2022_1_ACTACGAC-ATCGTACG_L001_R1_001.fastq 75900 69565
## CCB_2022_10_GTCTGCTA-ACTATCTG_L001_R1_001.fastq 63288 59417
## CCB_2022_11_GTCTGCTA-TAGCGAGT_L001_R1_001.fastq 53171 50774
## CCB_2022_12_GTCTGCTA-CTGCGTGT_L001_R1_001.fastq 47290 44704
## CCB_2022_13_CTGCGTAG-ACGTCTCG_L001_R1_001.fastq 70220 66746
## CCB_2022_14_TAGTCTCC-ACGTCTCG_L001_R1_001.fastq 84342 81356
## CCB_2022_15_GTCTGCTA-TCATCGAG_L001_R1_001.fastq 56500 51148
## CCB_2022_16_GTCTGCTA-CTACTATA_L001_R1_001.fastq 51796 48003
## CCB_2022_17_GTCTGCTA-CGTTACTA_L001_R1_001.fastq 44088 40909
## CCB_2022_18_GTCTGCTA-AGAGTCAC_L001_R1_001.fastq 57927 52837
## CCB_2022_19_GTCTATGA-ATCGTACG_L001_R1_001.fastq 54083 51041
## CCB_2022_2_ACTACGAC-ACTATCTG_L001_R1_001.fastq 68442 62604
## CCB_2022_20_CGAGCGAC-ACGTCTCG_L001_R1_001.fastq 70641 67014
## CCB_2022_21_ACTACGAC-ACGTCTCG_L001_R1_001.fastq 72506 69005
## CCB_2022_22_GTCTATGA-ACTATCTG_L001_R1_001.fastq 56156 51735
## CCB_2022_23_GTCTATGA-TAGCGAGT_L001_R1_001.fastq 59604 54747
## CCB_2022_24_GTCTATGA-CTGCGTGT_L001_R1_001.fastq 49110 44391
## CCB_2022_25_GTCTATGA-TCATCGAG_L001_R1_001.fastq 59803 56832
## CCB_2022_26_GTCTGCTA-ACGTCTCG_L001_R1_001.fastq 64911 62229
## CCB_2022_28_GTCTATGA-CTACTATA_L001_R1_001.fastq 76814 70287
## CCB_2022_29_GTCTATGA-CGTTACTA_L001_R1_001.fastq 59206 53937
## CCB_2022_3_ACTACGAC-TAGCGAGT_L001_R1_001.fastq 66206 62081
## CCB_2022_30_GTCTATGA-AGAGTCAC_L001_R1_001.fastq 49946 45401
## CCB_2022_31_TATAGCGA-ATCGTACG_L001_R1_001.fastq 69596 61912
## CCB_2022_32_TATAGCGA-ACTATCTG_L001_R1_001.fastq 61732 57233
## CCB_2022_33_TATAGCGA-TAGCGAGT_L001_R1_001.fastq 55396 50664
## CCB_2022_34_TATAGCGA-CTGCGTGT_L001_R1_001.fastq 46146 42258
## CCB_2022_35_TATAGCGA-TCATCGAG_L001_R1_001.fastq 52883 49396
## CCB_2022_38_TATAGCGA-CTACTATA_L001_R1_001.fastq 59770 56069
## CCB_2022_39_TATAGCGA-CGTTACTA_L001_R1_001.fastq 34410 32040
## CCB_2022_4_ACTACGAC-CTGCGTGT_L001_R1_001.fastq 79567 75071
## CCB_2022_40_TATAGCGA-AGAGTCAC_L001_R1_001.fastq 42126 40024
## CCB_2022_41_CGAGAGTT-TACGAGAC_L001_R1_001.fastq 44416 41769
## CCB_2022_42_ACTACGAC-TACGAGAC_L001_R1_001.fastq 89727 86273
## CCB_2022_43_GTCTGCTA-TACGAGAC_L001_R1_001.fastq 83425 80224
## CCB_2022_44_GACATAGT-TACGAGAC_L001_R1_001.fastq 62971 58905
## CCB_2022_45_ACGCTACT-TACGAGAC_L001_R1_001.fastq 21261 19075
## CCB_2022_46_ACTCACTG-TACGAGAC_L001_R1_001.fastq 45270 39151
## CCB_2022_47_GTCTATGA-TACGAGAC_L001_R1_001.fastq 70879 68547
## CCB_2022_48_CGAGCGAC-AGAGTCAC_L001_R1_001.fastq 73419 70831
## CCB_2022_5_ACTACGAC-TCATCGAG_L001_R1_001.fastq 65706 61235
## CCB_2022_6_ACTACGAC-CTACTATA_L001_R1_001.fastq 50579 48407
## CCB_2022_7_ACTACGAC-CGTTACTA_L001_R1_001.fastq 80768 77504
## CCB_2022_8_ACTACGAC-AGAGTCAC_L001_R1_001.fastq 60839 55952
## CCB_2022_9_GTCTGCTA-ATCGTACG_L001_R1_001.fastq 56304 51753
## DNA_Blank_1_GACATAGT-ACGTCTCG_L001_R1_001.fastq 40182 37618
## DNA_Blank_2_ACGCTACT-ACGTCTCG_L001_R1_001.fastq 39087 35754
## DNA_Blank_3_ACTCACTG-ACGTCTCG_L001_R1_001.fastq 8118 6319
## MockEven_CCB_2022_27_GTCTATGA-ACGTCTCG_L001_R1_001.fastq 49580 43342
## PCR_Ctrl_1_TGAGTACG-ACGTCTCG_L001_R1_001.fastq 915 516
## PCR_Ctrl_2_CGAGCGAC-CGTTACTA_L001_R1_001.fastq 686 423
DADA2 tutorial: “Considerations for your own data: The standard filtering parameters are starting points, not set in stone. If you want to speed up downstream computation, consider tightening maxEE. If too few reads are passing the filter, consider relaxing maxEE, perhaps especially on the reverse reads (eg. maxEE=c(2,5)), and reducing the truncLen to remove low quality tails. Remember though, when choosing truncLen for paired-end reads you must maintain overlap after truncation in order to merge them later.”
DADA2 tutorial: “The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).”
Note: This parameter learning can be computationally intense.
Note: The minimum number of total bases to use for error rate learning is 1e8. Basically, samples are read into memory in the provided order (the order in the sample list above) until at least this number of total bases has been reached. If you would like to increase the number of bases used, you can do that by adding nbases= to the command (make sure you do the same for both forward and reverse). You can also tell the command to randomize the order the samples are read into the memory. I had a dataset in which all lot of my controls were at the top of the order, so I chose to adjust the command so the algorithm was learning the error rates on my samples, not my controls. You can also point the command to specific fastq files to use. The default settings are generally sufficient but I thought you should have a heads-up in case you have a situation like mine where the algorithm only learned errors on controls, not samples. See https://rdrr.io/bioc/dada2/man/learnErrors.html for all of the options for this command.
errF <- learnErrors(filtFs, multithread=TRUE)
## 110580000 total bases in 460750 reads from 10 samples will be used for learning the error rates.
errR <- learnErrors(filtRs, multithread=TRUE)
## 108421200 total bases in 542106 reads from 11 samples will be used for learning the error rates.
DADA2 tutorial: “It is always worthwhile, as a sanity check if nothing else, to visualize the estimated error rates:”
plotErrors(errF, nominalQ=TRUE)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
DADA2 tutorial: “The error rates for each possible transition (A→C, A→G, …) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence of the machine-learning algorithm. The red line shows the error rates expected under the nominal definition of the Q-score.”
DADA2 tutorial: “Considerations for your own data: Parameter learning is computationally intensive, so by default the learnErrors function uses only a subset of the data (the first 100M bases). If you are working with a large dataset and the plotted error model does not look like a good fit, you can try increasing the nbases parameter to see if the fit improves.”
The plots here look very much like those in the tutorial, which says the following about the tutorial plots: “Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected. Everything looks reasonable and we proceed with confidence.” Thus, the plots here also look like the error rates are a good fit to the observed rates and will proceed with confidence as suggested by the tutorial.
This applies the DADA2 sample inference algorithm to the filtered and trimmed sequences. This is a computationally intensive step.
dadaFs <- dada(filtFs, err=errF, multithread=TRUE)
## Sample 1 - 29937 reads in 12610 unique sequences.
## Sample 2 - 38017 reads in 18075 unique sequences.
## Sample 3 - 36755 reads in 15964 unique sequences.
## Sample 4 - 35708 reads in 18186 unique sequences.
## Sample 5 - 29127 reads in 14009 unique sequences.
## Sample 6 - 69565 reads in 27969 unique sequences.
## Sample 7 - 59417 reads in 25277 unique sequences.
## Sample 8 - 50774 reads in 21486 unique sequences.
## Sample 9 - 44704 reads in 19851 unique sequences.
## Sample 10 - 66746 reads in 23051 unique sequences.
## Sample 11 - 81356 reads in 27130 unique sequences.
## Sample 12 - 51148 reads in 22558 unique sequences.
## Sample 13 - 48003 reads in 17339 unique sequences.
## Sample 14 - 40909 reads in 14537 unique sequences.
## Sample 15 - 52837 reads in 22357 unique sequences.
## Sample 16 - 51041 reads in 22122 unique sequences.
## Sample 17 - 62604 reads in 26058 unique sequences.
## Sample 18 - 67014 reads in 24532 unique sequences.
## Sample 19 - 69005 reads in 24117 unique sequences.
## Sample 20 - 51735 reads in 23187 unique sequences.
## Sample 21 - 54747 reads in 24990 unique sequences.
## Sample 22 - 44391 reads in 20221 unique sequences.
## Sample 23 - 56832 reads in 23111 unique sequences.
## Sample 24 - 62229 reads in 20706 unique sequences.
## Sample 25 - 70287 reads in 26908 unique sequences.
## Sample 26 - 53937 reads in 24401 unique sequences.
## Sample 27 - 62081 reads in 28058 unique sequences.
## Sample 28 - 45401 reads in 21286 unique sequences.
## Sample 29 - 61912 reads in 23816 unique sequences.
## Sample 30 - 57233 reads in 25632 unique sequences.
## Sample 31 - 50664 reads in 22048 unique sequences.
## Sample 32 - 42258 reads in 18419 unique sequences.
## Sample 33 - 49396 reads in 21501 unique sequences.
## Sample 34 - 56069 reads in 24288 unique sequences.
## Sample 35 - 32040 reads in 13205 unique sequences.
## Sample 36 - 75071 reads in 32508 unique sequences.
## Sample 37 - 40024 reads in 18248 unique sequences.
## Sample 38 - 41769 reads in 18235 unique sequences.
## Sample 39 - 86273 reads in 28486 unique sequences.
## Sample 40 - 80224 reads in 27696 unique sequences.
## Sample 41 - 58905 reads in 24843 unique sequences.
## Sample 42 - 19075 reads in 7206 unique sequences.
## Sample 43 - 39151 reads in 16445 unique sequences.
## Sample 44 - 68547 reads in 14475 unique sequences.
## Sample 45 - 70831 reads in 17593 unique sequences.
## Sample 46 - 61235 reads in 29097 unique sequences.
## Sample 47 - 48407 reads in 20651 unique sequences.
## Sample 48 - 77504 reads in 27370 unique sequences.
## Sample 49 - 55952 reads in 26206 unique sequences.
## Sample 50 - 51753 reads in 23820 unique sequences.
## Sample 51 - 37618 reads in 15617 unique sequences.
## Sample 52 - 35754 reads in 12769 unique sequences.
## Sample 53 - 6319 reads in 3231 unique sequences.
## Sample 54 - 43342 reads in 17872 unique sequences.
## Sample 55 - 516 reads in 304 unique sequences.
## Sample 56 - 423 reads in 316 unique sequences.
dadaRs <- dada(filtRs, err=errR, multithread=TRUE)
## Sample 1 - 29937 reads in 11883 unique sequences.
## Sample 2 - 38017 reads in 14384 unique sequences.
## Sample 3 - 36755 reads in 13830 unique sequences.
## Sample 4 - 35708 reads in 15247 unique sequences.
## Sample 5 - 29127 reads in 10809 unique sequences.
## Sample 6 - 69565 reads in 24330 unique sequences.
## Sample 7 - 59417 reads in 22738 unique sequences.
## Sample 8 - 50774 reads in 19692 unique sequences.
## Sample 9 - 44704 reads in 18430 unique sequences.
## Sample 10 - 66746 reads in 24375 unique sequences.
## Sample 11 - 81356 reads in 25942 unique sequences.
## Sample 12 - 51148 reads in 20821 unique sequences.
## Sample 13 - 48003 reads in 16557 unique sequences.
## Sample 14 - 40909 reads in 14041 unique sequences.
## Sample 15 - 52837 reads in 20502 unique sequences.
## Sample 16 - 51041 reads in 20997 unique sequences.
## Sample 17 - 62604 reads in 23051 unique sequences.
## Sample 18 - 67014 reads in 24377 unique sequences.
## Sample 19 - 69005 reads in 23464 unique sequences.
## Sample 20 - 51735 reads in 21324 unique sequences.
## Sample 21 - 54747 reads in 23637 unique sequences.
## Sample 22 - 44391 reads in 19928 unique sequences.
## Sample 23 - 56832 reads in 21801 unique sequences.
## Sample 24 - 62229 reads in 21432 unique sequences.
## Sample 25 - 70287 reads in 23562 unique sequences.
## Sample 26 - 53937 reads in 23490 unique sequences.
## Sample 27 - 62081 reads in 23931 unique sequences.
## Sample 28 - 45401 reads in 20383 unique sequences.
## Sample 29 - 61912 reads in 21320 unique sequences.
## Sample 30 - 57233 reads in 23470 unique sequences.
## Sample 31 - 50664 reads in 20952 unique sequences.
## Sample 32 - 42258 reads in 17966 unique sequences.
## Sample 33 - 49396 reads in 20194 unique sequences.
## Sample 34 - 56069 reads in 21985 unique sequences.
## Sample 35 - 32040 reads in 12233 unique sequences.
## Sample 36 - 75071 reads in 27352 unique sequences.
## Sample 37 - 40024 reads in 16458 unique sequences.
## Sample 38 - 41769 reads in 15457 unique sequences.
## Sample 39 - 86273 reads in 24559 unique sequences.
## Sample 40 - 80224 reads in 24618 unique sequences.
## Sample 41 - 58905 reads in 22368 unique sequences.
## Sample 42 - 19075 reads in 5834 unique sequences.
## Sample 43 - 39151 reads in 13078 unique sequences.
## Sample 44 - 68547 reads in 13323 unique sequences.
## Sample 45 - 70831 reads in 16377 unique sequences.
## Sample 46 - 61235 reads in 25549 unique sequences.
## Sample 47 - 48407 reads in 18561 unique sequences.
## Sample 48 - 77504 reads in 23097 unique sequences.
## Sample 49 - 55952 reads in 23124 unique sequences.
## Sample 50 - 51753 reads in 21012 unique sequences.
## Sample 51 - 37618 reads in 15121 unique sequences.
## Sample 52 - 35754 reads in 11388 unique sequences.
## Sample 53 - 6319 reads in 2560 unique sequences.
## Sample 54 - 43342 reads in 16248 unique sequences.
## Sample 55 - 516 reads in 351 unique sequences.
## Sample 56 - 423 reads in 383 unique sequences.
Here I will inspect the output of the algorithm for several samples. The output will show the number of unique sequences in each specified sample and the number of sequence variants that were inferred from those unique sequences.
DADA2 tutorial: “There is much more to the dada-class return object than this (see help(”dada-class“) for some info), including multiple diagnostics about the quality of each denoised sequence variant, but that is beyond the scope of an introductory tutorial.”
#Information about the output of the first sample
cat("Information about the output of the first sample:", "\n", "\n")
## Information about the output of the first sample:
##
dadaFs[[1]]
## dada-class: object describing DADA2 denoising results
## 143 sequence variants were inferred from 12610 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
The DADA2 algorithm inferred 143 true sequence variants from the 12610 unique sequences in the first sample.
DADA2 tutorial: “There is much more to the dada-class return object than this (see help(”dada-class“) for some info), including multiple diagnostics about the quality of each denoised sequence variant, but that is beyond the scope of an introductory tutorial.”
DADA2 tutorial: “Extensions: By default, the dada function processes each sample independently. However, pooling information across samples can increase sensitivity to sequence variants that may be present at very low frequencies in multiple samples. The dada2 package offers two types of pooling. dada(…, pool=TRUE) performs standard pooled processing, in which all samples are pooled together for sample inference. dada(…, pool=”pseudo“) performs pseudo-pooling, in which samples are processed independently after sharing information between samples, approximating pooled sample inference in linear time.”
NOTE: Generally, the Apprill Lab does not run pool = TRUE or pool = pseudo
DADA2 tutorial: “We now merge the forward and reverse reads together to obtain the full denoised sequences. Merging is performed by aligning the denoised forward reads with the reverse-complement of the corresponding denoised reverse reads, and then constructing the merged “contig” sequences. By default, merged sequences are only output if the forward and reverse reads overlap by at least 12 bases, and are identical to each other in the overlap region (but these conditions can be changed via function arguments)."
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
## 28788 paired-reads (in 126 unique pairings) successfully merged out of 29616 (in 194 pairings) input.
## 37208 paired-reads (in 95 unique pairings) successfully merged out of 37706 (in 134 pairings) input.
## 36113 paired-reads (in 166 unique pairings) successfully merged out of 36466 (in 216 pairings) input.
## 34352 paired-reads (in 242 unique pairings) successfully merged out of 35255 (in 374 pairings) input.
## 28313 paired-reads (in 74 unique pairings) successfully merged out of 28790 (in 119 pairings) input.
## 66693 paired-reads (in 231 unique pairings) successfully merged out of 68647 (in 418 pairings) input.
## 57961 paired-reads (in 268 unique pairings) successfully merged out of 58901 (in 437 pairings) input.
## 49082 paired-reads (in 268 unique pairings) successfully merged out of 50049 (in 458 pairings) input.
## 43244 paired-reads (in 227 unique pairings) successfully merged out of 44051 (in 401 pairings) input.
## 64351 paired-reads (in 228 unique pairings) successfully merged out of 65787 (in 490 pairings) input.
## 78981 paired-reads (in 233 unique pairings) successfully merged out of 80302 (in 450 pairings) input.
## 48545 paired-reads (in 196 unique pairings) successfully merged out of 50241 (in 385 pairings) input.
## 46490 paired-reads (in 114 unique pairings) successfully merged out of 47498 (in 237 pairings) input.
## 39568 paired-reads (in 98 unique pairings) successfully merged out of 40416 (in 226 pairings) input.
## 51433 paired-reads (in 183 unique pairings) successfully merged out of 52356 (in 316 pairings) input.
## 49679 paired-reads (in 207 unique pairings) successfully merged out of 50485 (in 401 pairings) input.
## 59536 paired-reads (in 280 unique pairings) successfully merged out of 61662 (in 483 pairings) input.
## 65499 paired-reads (in 211 unique pairings) successfully merged out of 66256 (in 462 pairings) input.
## 67001 paired-reads (in 218 unique pairings) successfully merged out of 68062 (in 475 pairings) input.
## 49543 paired-reads (in 214 unique pairings) successfully merged out of 51028 (in 403 pairings) input.
## 52341 paired-reads (in 204 unique pairings) successfully merged out of 53809 (in 411 pairings) input.
## 42353 paired-reads (in 180 unique pairings) successfully merged out of 43733 (in 367 pairings) input.
## 54835 paired-reads (in 241 unique pairings) successfully merged out of 55953 (in 447 pairings) input.
## 60176 paired-reads (in 244 unique pairings) successfully merged out of 61384 (in 497 pairings) input.
## 69059 paired-reads (in 125 unique pairings) successfully merged out of 69786 (in 241 pairings) input.
## 51588 paired-reads (in 237 unique pairings) successfully merged out of 53181 (in 447 pairings) input.
## 59041 paired-reads (in 343 unique pairings) successfully merged out of 61108 (in 610 pairings) input.
## 43424 paired-reads (in 169 unique pairings) successfully merged out of 44635 (in 341 pairings) input.
## 61061 paired-reads (in 92 unique pairings) successfully merged out of 61546 (in 160 pairings) input.
## 55369 paired-reads (in 235 unique pairings) successfully merged out of 56588 (in 375 pairings) input.
## 48169 paired-reads (in 160 unique pairings) successfully merged out of 50002 (in 301 pairings) input.
## 40694 paired-reads (in 136 unique pairings) successfully merged out of 41678 (in 258 pairings) input.
## 47571 paired-reads (in 163 unique pairings) successfully merged out of 48727 (in 308 pairings) input.
## 54201 paired-reads (in 210 unique pairings) successfully merged out of 55487 (in 369 pairings) input.
## 31653 paired-reads (in 95 unique pairings) successfully merged out of 31799 (in 145 pairings) input.
## 72653 paired-reads (in 314 unique pairings) successfully merged out of 74092 (in 574 pairings) input.
## 38600 paired-reads (in 241 unique pairings) successfully merged out of 39461 (in 343 pairings) input.
## 40464 paired-reads (in 167 unique pairings) successfully merged out of 41312 (in 281 pairings) input.
## 84295 paired-reads (in 295 unique pairings) successfully merged out of 85321 (in 527 pairings) input.
## 78390 paired-reads (in 266 unique pairings) successfully merged out of 79345 (in 518 pairings) input.
## 57166 paired-reads (in 196 unique pairings) successfully merged out of 58435 (in 348 pairings) input.
## 18691 paired-reads (in 37 unique pairings) successfully merged out of 18914 (in 65 pairings) input.
## 37736 paired-reads (in 78 unique pairings) successfully merged out of 38661 (in 111 pairings) input.
## 67646 paired-reads (in 162 unique pairings) successfully merged out of 68052 (in 244 pairings) input.
## 69575 paired-reads (in 215 unique pairings) successfully merged out of 70244 (in 341 pairings) input.
## 58918 paired-reads (in 469 unique pairings) successfully merged out of 60280 (in 691 pairings) input.
## 46755 paired-reads (in 214 unique pairings) successfully merged out of 47832 (in 363 pairings) input.
## 75876 paired-reads (in 232 unique pairings) successfully merged out of 76878 (in 394 pairings) input.
## 53178 paired-reads (in 343 unique pairings) successfully merged out of 55024 (in 610 pairings) input.
## 49565 paired-reads (in 431 unique pairings) successfully merged out of 50862 (in 666 pairings) input.
## 36360 paired-reads (in 136 unique pairings) successfully merged out of 37209 (in 224 pairings) input.
## 35351 paired-reads (in 54 unique pairings) successfully merged out of 35495 (in 90 pairings) input.
## 6018 paired-reads (in 22 unique pairings) successfully merged out of 6148 (in 40 pairings) input.
## 42648 paired-reads (in 34 unique pairings) successfully merged out of 42898 (in 111 pairings) input.
## 393 paired-reads (in 8 unique pairings) successfully merged out of 400 (in 12 pairings) input.
## 191 paired-reads (in 6 unique pairings) successfully merged out of 227 (in 14 pairings) input.
# Inspect the merger data.frame from the 30th sample
head(mergers[[30]])
## sequence
## 1 TACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTCGTCTGTGAAATTCCGGGGCTTAACTCCGGGCGTGCAGGCGATACGGGCATAACTTGAGTGCTGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCATGGGTAGCGAACAGG
## 2 TACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCGGATTAATAAGTTAGAGGTGAAATCCCAGGGCTCAACCCTGGAACTGCCTTTAAAACTGTTAGTCTTGAGATCGAGAGAGGTGAGTGGAATTCCAAGTGTAGAGGTGAAATTCGTAGATATTTGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTACGAAAGTGTGGGGAGCAAACAGG
## 3 TACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGAACGTAGGTGGTTACCTAAGTCAGATGTGAAATCCCTAGGCTCAACCTAGGAACTGCATCTGATACTGGGTGACTAGAGTAGGTGAGAGGAAAGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCTTTCTGGCATCATACTGACACTGAGGTTCGAAAGCGTGGGTAGCAAACAGG
## 4 TACGAAGGGACCTAGCGTAGTTCGGAATTACTGGGCTTAAAGAGTTCGTAGGTGGTTGAAAAAGTTGGTGGTGAAATCCCAGAGCTTAACTCTGGAACTGCCATCAAAACTTTTCAGCTAGAGTATGATAGAGGAAAGCAGAATTTCTAGTGTAGAGGTGAAATTCGTAGATATTAGAAAGAATACCAATTGCGAAGGCAGCTTTCTGGATCATTACTGACACTGAGGAACGAAAGCATGGGTAGCGAAGAGG
## 5 TACGGAGGGTGCAAGCGTTATCCGGAATCATTGGGTTTAAAGGGTCCGCAGGCGGTCGATTAAGTCAGGGGTGAAATCTCACAGCTTAACTGTGAACCTGCCTTTGATACTGGTTGACTTGAGTTATATGGAAGTAGATAGAATAAGTAGTGTAGCGGTGAAATGCATAGATATTACTTAGAATACCGATTGCGAAGGCAGTCTACTACGTATATACTGACGCTCATGGACGAAAGCGTGGGGAGCGAACAGG
## 6 TACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTTGCGTCACTTGTGAAATTCCGGGGCTTAACTCCGGGTGTGCAGGTGATACGGGCATAACTTGAGTGCTGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCAGTTACTGACGCTGAGGAGCGAAAGCATGGGGAGCGAACAGG
## abundance forward reverse nmatch nmismatch nindel prefer accept
## 1 3413 1 1 186 0 0 1 TRUE
## 2 2259 2 2 187 0 0 2 TRUE
## 3 2225 3 3 187 0 0 2 TRUE
## 4 1539 4 4 187 0 0 2 TRUE
## 5 1354 5 5 187 0 0 2 TRUE
## 6 1129 8 14 186 0 0 1 TRUE
DADA2 tutorial: “The mergers object is a list of data.frames from each sample. Each data.frame contains the merged $sequence, its $abundance, and the indices of the $forward and $reverse sequence variants that were merged. Paired reads that did not exactly overlap were removed by mergePairs, further reducing spurious output.”
DADA2 tutorial: “Considerations for your own data: Most of your reads should successfully merge. If that is not the case upstream parameters may need to be revisited: Did you trim away the overlap between your reads?”
DADA2 tutorial: “We can now construct an amplicon sequence variant table (ASV) table, a higher-resolution version of the OTU table produced by traditional methods.”
seqtab <- makeSequenceTable(mergers)
#dim(seqtab)
cat("This table contains", dim(seqtab)[2], "ASVs and the lengths of our merged sequences for the", dim(seqtab)[1], "samples in the dataset \n")
## This table contains 3393 ASVs and the lengths of our merged sequences for the 56 samples in the dataset
Here we can check if the length of the sequences is what we expect.
# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab)))
##
## 240 244 246 249 250 251 252 253 254 255 256 257 258 259 260 265
## 27 1 2 1 1 13 230 2951 134 10 5 4 3 1 1 1
## 274 286 308 316 321 350 392
## 1 1 1 1 1 1 2
DADA2 tutorial: “The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants.”
cat("Not all lengths of our merged sequences for the", dim(seqtab)[2], "ASVs fall within the expected range for this V4 amplicon. \n")
## Not all lengths of our merged sequences for the 3393 ASVs fall within the expected range for this V4 amplicon.
DADA2 tutorial: “Considerations for your own data: Sequences that are much longer or shorter than expected may be the result of non-specific priming. You can remove non-target-length sequences from your sequence table (eg. seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% 250:256]). This is analogous to “cutting a band” in-silico to get amplicons of the targeted length."
Because not all of the ASV lengths fall within the expected range for the V4 amplicon, we will trim to the length we want:
seqtab2<- seqtab[,nchar(colnames(seqtab)) %in% 250:256]
dim(seqtab2)
## [1] 56 3344
table(nchar(getSequences(seqtab2)))# Inspect distribution of sequence lengths
##
## 250 251 252 253 254 255 256
## 1 13 230 2951 134 10 5
IMPORTANT NOTE REGARDING MULTIPLE SEQUENCING RUNS: If you have multiple sequencing libraries that you would like to run through DADA2 to use in the same downstream analysis, you will run each sequencing run through the DADA2 pipeline separately until this point (just before the chimera checking step). Once you have each run through the pipeline until this point, you will merge sequence tables with the command: mergetab <- mergeSequenceTables(seqtab2_run1, seqtab2_run2) and then check for chimeras and proceed with the rest of the pipeline
DADA2 tutorial: “The core dada method corrects substitution and indel errors, but chimeras remain. Fortunately, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences."
seqtab2.nochim <- removeBimeraDenovo(seqtab2, method="consensus", multithread=TRUE, verbose=TRUE)
## Identified 149 bimeras out of 3344 input sequences.
# dim(seqtab2.nochim)
cat(round((sum(seqtab2.nochim)/sum(seqtab2))*100, digits=2),"% of the ASVs were not chimeric. \n")
## 99.38 % of the ASVs were not chimeric.
a <- (1-(sum(seqtab2.nochim)/sum(seqtab2)))*100
a <- round(a, digits = 2)
cat(a, "% of the ASVs were chimeric and removed. \n")
## 0.62 % of the ASVs were chimeric and removed.
cat("There are now", dim(seqtab2.nochim)[2], "ASVs with sequence lengths of 250-256 base pairs remaining after chimeras were removed")
## There are now 3195 ASVs with sequence lengths of 250-256 base pairs remaining after chimeras were removed
rm(a)
DADA2 tutorial: “The frequency of chimeric sequences varies substantially from dataset to dataset, and depends on on factors including experimental procedures and sample complexity.”
Here you see what percentage of the merged sequence variants were chimeric.
DADA2 tutorial: “Considerations for your own data: Most of your reads should remain after chimera removal (it is not uncommon for a majority of sequence variants to be removed though). If most of your reads were removed as chimeric, upstream processing may need to be revisited. In almost all cases this is caused by primer sequences with ambiguous nucleotides that were not removed prior to beginning the DADA2 pipeline.”
DADA2 tutorial: “As a final check of our progress, we’ll look at the number of reads that made it through each step in the pipeline:”
Thus, we will track reads through the pipeline so we can make sure that everything went as expected. A red flag would be if one particular step in the pipeline removed more reads than one would expect
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab2.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
#head(track)
track
## input filtered denoisedF denoisedR merged nonchim
## CCB_2019_1 32890 29937 29696 29740 28788 28729
## CCB_2019_2 45172 38017 37785 37865 37208 35768
## CCB_2019_3 38901 36755 36598 36555 36113 35947
## CCB_2019_4 40037 35708 35407 35413 34352 33669
## CCB_2019_5 35530 29127 28961 28854 28313 28141
## CCB_2022_1 75900 69565 68907 69013 66693 65654
## CCB_2022_10 63288 59417 59029 59176 57961 56726
## CCB_2022_11 53171 50774 50262 50251 49082 47323
## CCB_2022_12 47290 44704 44192 44288 43244 42347
## CCB_2022_13 70220 66746 66054 66105 64351 62130
## CCB_2022_14 84342 81356 80609 80680 78981 75742
## CCB_2022_15 56500 51148 50462 50614 48545 48473
## CCB_2022_16 51796 48003 47609 47659 46490 46040
## CCB_2022_17 44088 40909 40573 40530 39568 39007
## CCB_2022_18 57927 52837 52470 52557 51433 50802
## CCB_2022_19 54083 51041 50639 50681 49679 49034
## CCB_2022_2 68442 62604 61937 62043 59536 58769
## CCB_2022_20 70641 67014 66491 66509 65499 65290
## CCB_2022_21 72506 69005 68353 68386 67001 66851
## CCB_2022_22 56156 51735 51168 51404 49543 48968
## CCB_2022_23 59604 54747 54062 54190 52341 51955
## CCB_2022_24 49110 44391 43918 43969 42353 42162
## CCB_2022_25 59803 56832 56278 56251 54835 54248
## CCB_2022_26 64911 62229 61627 61678 60176 59715
## CCB_2022_28 76814 70287 69985 69974 69059 68401
## CCB_2022_29 59206 53937 53431 53467 51588 51497
## CCB_2022_3 66206 62081 61334 61596 59041 58175
## CCB_2022_30 49946 45401 44874 44883 43424 43385
## CCB_2022_31 69596 61912 61657 61676 61061 60960
## CCB_2022_32 61732 57233 56763 56862 55369 54124
## CCB_2022_33 55396 50664 50178 50254 48169 48026
## CCB_2022_34 46146 42258 41833 41865 40694 40610
## CCB_2022_35 52883 49396 48940 48959 47571 47405
## CCB_2022_38 59770 56069 55782 55611 54201 53595
## CCB_2022_39 34410 32040 31871 31895 31653 31016
## CCB_2022_4 79567 75071 74335 74582 72653 71233
## CCB_2022_40 42126 40024 39627 39755 38600 38424
## CCB_2022_41 44416 41769 41505 41488 40464 39726
## CCB_2022_42 89727 86273 85573 85696 84295 83371
## CCB_2022_43 83425 80224 79576 79645 78390 77853
## CCB_2022_44 62971 58905 58609 58581 57166 56658
## CCB_2022_45 21261 19075 18955 18946 18691 18691
## CCB_2022_46 45270 39151 38824 38881 37736 37185
## CCB_2022_47 70879 68547 68159 68248 67646 67369
## CCB_2022_48 73419 70831 70442 70388 69575 69055
## CCB_2022_5 65706 61235 60561 60735 58918 57913
## CCB_2022_6 50579 48407 47989 48065 46755 45758
## CCB_2022_7 80768 77504 77143 77077 75876 74988
## CCB_2022_8 60839 55952 55286 55438 53178 52652
## CCB_2022_9 56304 51753 51160 51207 49565 48795
## DNA_Blank_1 40182 37618 37358 37390 36360 36180
## DNA_Blank_2 39087 35754 35618 35543 35351 33651
## DNA_Blank_3 8118 6319 6212 6168 6018 6018
## MockEven_CCB_2022 49580 43342 43033 43075 42648 42479
## PCR_Ctrl_1 915 516 438 404 393 393
## PCR_Ctrl_2 686 423 305 238 191 191
DADA2 tutorial: “Considerations for your own data: This is a great place to do a last sanity check. Outside of filtering, there should no step in which a majority of reads are lost. If a majority of reads failed to merge, you may need to revisit the truncLen parameter used in the filtering step and make sure that the truncated reads span your amplicon. If a majority of reads were removed as chimeric, you may need to revisit the removal of primers, as the ambiguous nucleotides in unremoved primers interfere with chimera identification.”
DADA2 tutorial: “It is common at this point, especially in 16S/18S/ITS amplicon sequencing, to assign taxonomy to the sequence variants. The DADA2 package provides a native implementation of the naive Bayesian classifier method for this purpose. The assignTaxonomy function takes as input a set of sequences to be classified and a training set of reference sequences with known taxonomy, and outputs taxonomic assignments with at least minBoot bootstrap confidence.”
“We maintain formatted training fastas for the RDP training set, GreenGenes clustered at 97% identity, and the Silva reference database, and additional trainings fastas suitable for protists and certain specific environments have been contributed.”
Extensions: “The dada2 package also implements a method to make species level assignments based on exact matching between ASVs and sequenced reference strains. Recent analysis suggests that exact matching (or 100% identity) is the only appropriate way to assign species to 16S gene fragments. Currently, species-assignment training fastas are available for the Silva and RDP 16S databases. To follow the optional species addition step, download the silva_species_assignment_v132.fa.gz file, and place it in the directory with the fastq files.”
The Apprill Lab uses The Silva reference database to assign taxonomy. Download the most recent Silva database formatted for use with DADA2 and the file formatted for the species assignment command from https://zenodo.org/record/4587955#.Ya-b3lNOlTZ
It is important this is cited in the methods - see the instructions in the aforementioned link
This document uses the Silva 138.1 reference database.
Here is what the intro on the website says regarding the Silva 138.1 prokaryotic SSU taxonomic training data formatted for DADA2
McLaren, Michael R.; Callahan, Benjamin J.
These training fasta files are derived from the Silva Project’s version 138.1 release and formatted for use with DADA2. These files are intended for use in classifying prokaryotic 16S sequencing data and are not appropriate for classifying eukaryotic ASVs.
See https://benjjneb.github.io/dada2/training.html for information about DADA2 reference databases and https://www.arb-silva.de/documentation/release-138.1/ for database and citation information for Silva 138.1. The Silva 138.1 database is licensed under Creative Commons Attribution 4.0 (CC-BY 4.0); see file “SILVA_LICENSE.txt”. These fasta database files were generated and checked for consistency using the R markdown documents in the silva-138.1 folder in https://zenodo.org/record/4587946.
If you use these files, please cite one or both of the Silva references below (or at the above link) and the DADA2 paper (reference below). I also recommend citing or linking to the Zenodo record for this specific version in your Methods or published source code to record the specific taxonomic database files used in your analysis.
NOTE: These database files have a known problem in 3/895 families and 59/3936 genera. See https://github.com/mikemc/dada2-reference-databases/blob/main/silva-138.1/v1/bad-taxa.csv for a list of affected taxa and https://github.com/benjjneb/dada2/issues/1293 for more information."
NOTE: You will need to set your own path in the following code block to wherever you downloaded the databases
taxa <- assignTaxonomy(seqtab2.nochim, "~/Documents/dada2/silva_nr99_v138.1_train_set.fa.gz", multithread=TRUE)
taxa <- addSpecies(taxa, "~/Documents/dada2/silva_species_assignment_v138.1.fa.gz")
taxa.print <- taxa # Removing sequence rownames for display only
rownames(taxa.print) <- NULL
head(taxa.print)
## Kingdom Phylum Class Order
## [1,] "Bacteria" "Cyanobacteria" "Cyanobacteriia" "Chloroplast"
## [2,] "Bacteria" "Proteobacteria" "Alphaproteobacteria" "SAR11 clade"
## [3,] "Bacteria" "Actinobacteriota" "Actinobacteria" "Corynebacteriales"
## [4,] "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Pseudomonadales"
## [5,] "Bacteria" "Proteobacteria" "Alphaproteobacteria" "Rhodobacterales"
## [6,] "Bacteria" "Proteobacteria" "Alphaproteobacteria" "SAR11 clade"
## Family Genus Species
## [1,] NA NA NA
## [2,] "Clade I" "Clade Ia" NA
## [3,] "Corynebacteriaceae" "Corynebacterium" NA
## [4,] "Moraxellaceae" "Moraxella" NA
## [5,] "Rhodobacteraceae" "Amylibacter" "ulvae"
## [6,] "Clade I" "Clade Ia" NA
DADA2 tutorial: “Considerations for your own data: If your reads do not seem to be appropriately assigned, for example lots of your bacterial 16S sequences are being assigned as Eukaryota NA NA NA NA NA, your reads may be in the opposite orientation as the reference database. Tell dada2 to try the reverse-complement orientation with assignTaxonomy(…, tryRC=TRUE) and see if this fixes the assignments.”
There are 20 strains in the mock that we use (HM-782D, BEI Resources, https://www.beiresources.org/Catalog/otherProducts/HM-782D.aspx), so we would expect to see 20 ASVs in the mock community
DADA2 tutorial: “One of the samples included here was a “mock community”, in which a mixture of 20 known strains was sequenced (this mock community is supposed to be 21 strains, but P. acnes is absent from the raw data). Reference sequences corresponding to these strains were provided in the downloaded zip archive. We return to that sample and compare the sequence variants inferred by DADA2 to the expected composition of the community.
NOTE: To evaluate the mock community, you will need to download the reference file https://drive.google.com/file/d/1Ryrhk92v6x8WYRnUAo6YHmJxVlwPCxG_/view?usp=sharing
NOTE: Here the sample name given to the mock is “MockEven_CCB_2022”. You will need to change to the sample name of the mock in your dataset in the following code block. You will also need to adjust the path to wherever you downloaded the file
unqs.mock <- seqtab2.nochim["MockEven_CCB_2022",]
unqs.mock <- sort(unqs.mock[unqs.mock>0], decreasing=TRUE) # Drop ASVs absent in the Mock
cat("DADA2 inferred", length(unqs.mock), "sample sequences present in the Mock community.\n")
## DADA2 inferred 29 sample sequences present in the Mock community.
path2Mock <- "~/Documents/dada2/"
mock.ref <- getSequences(file.path(path2Mock, "HMP_MOCK.v35.fasta"))
match.ref <- sum(sapply(names(unqs.mock), function(x) any(grepl(x, mock.ref))))
cat("Of those,", sum(match.ref), "were exact matches to the expected reference sequences.\n")
## Of those, 21 were exact matches to the expected reference sequences.
uniqs.mock.df <- data.frame(unqs.mock)
Typically with our methods, DADA2 infers between 20-30 sample sequences. An exact match of 19-22 to the reference database is typical and acceptable.
The next steps of this pipeline will include a little more cleanup before we move on to removing contaminants with decontam. Phyloseq will be used to help with the cleanup.
We will:
When assigning taxonomy, taxonomy may only be matched to a certain level. When the code for removing unwanted taxa runs, it will remove anything with NAs at the level at which you are removing. Thus, NAs in the taxonomy matrix need to be replaced with either the lowest level to which a sequence matched or with the character string “unknown”. Here is how to do both ways. The method showing replacing with the lowest level to which a sample matched is commented out and the example replacing NAs with the word “unknown” is run below.
# Create a dataframe of the taxonomy table
taxa_df <- data.frame(taxa)
# METHOD 1. Fill the NAs with the lowest level taxonomy assigned (otherwise when NAs might be removed when removing unwanted taxa)
# taxa_df$Kingdom <- as.character(taxa_df$Kingdom)
# king_na <- which(is.na(taxa_df$Kingdom))
# taxa_df[king_na, "Kingdom"] <- 'Unknown'
#
# taxa_df$Phylum <- as.character(taxa_df$Phylum)
# phy_na <- which(is.na(taxa_df$Phylum))
# taxa_df[phy_na, "Phylum"] <- taxa_df$Kingdom[phy_na]
#
# taxa_df$Class <- as.character(taxa_df$Class)
# cl_na <- which(is.na(taxa_df$Class))
# taxa_df[cl_na, "Class"] <- taxa_df$Phylum[cl_na]
#
# taxa_df$Order <- as.character(taxa_df$Order)
# ord_na <- which(is.na(taxa_df$Order))
# taxa_df[ord_na, "Order"] <- taxa_df$Class[ord_na]
#
# taxa_df$Family <- as.character(taxa_df$Family)
# fam_na <- which(is.na(taxa_df$Family))
# taxa_df[fam_na, "Family"] <- taxa_df$Order[fam_na]
#
# taxa_df$Genus <- as.character(taxa_df$Genus)
# gen_na <- which(is.na(taxa_df$Genus))
# taxa_df[gen_na, "Genus"] <- taxa_df$Family[gen_na]
# METHOD 2. Fill all NAs with the word 'unknown' (otherwise when NAs might be removed when removing unwanted taxa)
taxa_df$Kingdom <- as.character(taxa_df$Kingdom)
king_na <- which(is.na(taxa_df$Kingdom))
taxa_df[king_na, "Kingdom"] <- 'unknown'
taxa_df$Phylum <- as.character(taxa_df$Phylum)
phy_na <- which(is.na(taxa_df$Phylum))
taxa_df[phy_na, "Phylum"] <- 'unknown'
taxa_df$Class <- as.character(taxa_df$Class)
cl_na <- which(is.na(taxa_df$Class))
taxa_df[cl_na, "Class"] <- 'unknown'
taxa_df$Order <- as.character(taxa_df$Order)
ord_na <- which(is.na(taxa_df$Order))
taxa_df[ord_na, "Order"] <- 'unknown'
taxa_df$Family <- as.character(taxa_df$Family)
fam_na <- which(is.na(taxa_df$Family))
taxa_df[fam_na, "Family"] <- 'unknown'
taxa_df$Genus <- as.character(taxa_df$Genus)
gen_na <- which(is.na(taxa_df$Genus))
taxa_df[gen_na, "Genus"] <- 'unknown'
taxa_df$Species <- as.character(taxa_df$Species)
spe_na <- which(is.na(taxa_df$Species))
taxa_df[spe_na, "Species"] <- 'unknown'
# Convert the taxonomy table back to matrix form so it can be used by phyloseq
taxa <- as.matrix(taxa_df)
# You may want to conduct a preliminary exploration of your data to see how the controls fall out, make a temporary metadata file. Here I create one with just the sample names.
metadata <- rownames(seqtab2.nochim)
metadata <- data.frame(metadata)
rownames(metadata) <- metadata[,1] # assign values from all rows of the first column as the dataframe rownames
# Now I will create a phyloseq object
OTU <- otu_table(seqtab2.nochim, taxa_are_rows = FALSE)
TAX <- tax_table(taxa)
META = sample_data(metadata)
###CREATE PHYLOSEQ OBJECT and CHECK THAT IT IS CORRECT
ps <- phyloseq(OTU, TAX, META)
ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 3195 taxa and 56 samples ]
## sample_data() Sample Data: [ 56 samples by 1 sample variables ]
## tax_table() Taxonomy Table: [ 3195 taxa by 7 taxonomic ranks ]
#remove variables that aren't needed
rm(OTU)
rm(TAX)
rm(META)
pre <- ntaxa(ps)
cat("The total number of taxa in the dataset is", ntaxa(ps), "\n")
## The total number of taxa in the dataset is 3195
cat("There are: ", nrow(subset(taxa_df, Kingdom =="Eukaryota")), "Eukaryotes,", nrow(subset(taxa_df, Order =="Chloroplast")), "Chloroplasts,", "and", nrow(subset(taxa_df, Family =="Mitochondria")), "Mitochondria in the dataset \n")
## There are: 0 Eukaryotes, 157 Chloroplasts, and 63 Mitochondria in the dataset
#Checking to make sure that is the correct amount of taxa that should have been removed
# If you have Eukaryotes, run the next line. If not, run the following line
chck <- nrow(subset(taxa_df, Kingdom =="Eukaryota")) + nrow(subset(taxa_df, Order =="Chloroplast")) + nrow(subset(taxa_df, Family =="Mitochondria"))
#chck <- nrow(subset(taxa_df, Order =="Chloroplast")) + nrow(subset(taxa_df, Family =="Mitochondria"))
ps <- subset_taxa(ps, Kingdom!="Eukaroyta")
ps <- subset_taxa(ps, Family!="Mitochondria")
ps <- subset_taxa(ps, Order!="Chloroplast")
cat("The total number of taxa in the dataset after removing unwanted taxa is", ntaxa(ps), "\nThus,", pre-ntaxa(ps), "were removed from the dataset.\n")
## The total number of taxa in the dataset after removing unwanted taxa is 2975
## Thus, 220 were removed from the dataset.
cat("This should be equal to", chck)
## This should be equal to 220
rm(pre)
rm(chck)
NOTE: Here the sample name given to the mock is “MockEven_CCB_2022”. You will need to change to the sample name of the mock in your dataset in the following code block.
ps <- prune_samples(sample_names(ps) != "MockEven_CCB_2022", ps) # Remove mock samples
#testmergetab <- rownames(mergetab.nochim[,]) != "Mock"
#mergetab.nochim_test <- mergetab.nochim[testmergetab,]
#dim(mergetab.nochim_test)
ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 2975 taxa and 55 samples ]
## sample_data() Sample Data: [ 55 samples by 1 sample variables ]
## tax_table() Taxonomy Table: [ 2975 taxa by 7 taxonomic ranks ]
cat("After removing the mock community sample, there are", nsamples(ps), "in the dataset.\n")
## After removing the mock community sample, there are 55 in the dataset.
Sometimes certain ASVs are only detected in the samples that will be removed, such as the mock and PCR controls. These ASVs will be removed in this step
ps <- prune_taxa(taxa_sums(ps) > 0, ps)
cat("There are now", ntaxa(ps), "ASVs in the dataset.\n")
## There are now 2948 ASVs in the dataset.
cat("There are", nsamples(ps), "samples and", ntaxa(ps), "ASVs in the datasets.\n")
## There are 55 samples and 2948 ASVs in the datasets.
ASV_CountTable <- data.frame(otu_table(ps))
ASV_Taxonomy <- data.frame(tax_table(ps))
idx <- match(rownames(ASV_Taxonomy), colnames(ASV_CountTable)) #check that both columns are in the same order
if (any(is.na(idx)) == FALSE)
{cat("All ASV names match between the row names of the taxonomy table and the column names of the count table", '\n', '\n')
#looks like they were all aligned, but doesn't hurt just to make it very safe.
ASV_CountTable <- ASV_CountTable[,idx]
} else
{cat("WARNING! There are mismatches in ASV names between row names of the taxonomy table and the column names of the count table in positions", which(is.na(idx) == TRUE), '\n', '\n')
}
## All ASV names match between the row names of the taxonomy table and the column names of the count table
##
#make a dataframe so you can reference the sequences in the future
ASVseqs <- data.frame("ASV_ID" = paste0("ASV", seq(from = 1, to = ncol(ASV_CountTable), by = 1)), "ASV_Sequence" = rownames(ASV_Taxonomy))
#rename ASV and taxa dataframe so they are easier to read
colnames(ASV_CountTable) <- ASVseqs$ASV_ID
rownames(ASV_Taxonomy) <- ASVseqs$ASV_ID
# Add the taxonomy information to the table with the ASV sequences
ASV_Taxonomy <- tibble::rownames_to_column(ASV_Taxonomy, var = "ASV_ID")# First need to add row names to a variable (column) called 'ASV_ID' in the dataframe called ASV_Taxonomy
ASVseqs <- left_join(ASVseqs, ASV_Taxonomy, by = "ASV_ID") #then join the tables
rownames(ASV_Taxonomy) <- ASV_Taxonomy[,1] # assign values from all rows of the first column as the dataframe rownames
#ASV_Taxonomy[,1] <- NULL # delete the first column because now the values are rownames
ASV_CountTable <- tibble::rownames_to_column(ASV_CountTable, var = "SampleID")# add row names to a variable (column) called 'Sample_ID' in the dataframe called ASV_CountTable
Save these tables and put them in a directory made for the project
Here I commented them out because I already have them saved
#write out these as .txt files to my local computer - done on [INSERT DATE]
# write.table(ASV_CountTable, file = "/Users/cmiller/Dropbox/Miller_Documents/Projects/RightWhale_BlowAndPhotogrammetry/DADA2_Analyses/ASV_CountTable_MiSeq_Library_L2_RightWhaleBlowProject.txt", sep = "\t", row.names = TRUE, col.names = TRUE)
#
# write.table(ASV_Taxonomy, file = "/Users/cmiller/Dropbox/Miller_Documents/Projects/RightWhale_BlowAndPhotogrammetry/DADA2_Analyses/ASV_Taxonomy_MiSeq_Library_L2_RightWhaleBlowProject.txt", sep = "\t", row.names = TRUE, col.names = TRUE)
#
# write.table(ASVseqs, file = "/Users/cmiller/Dropbox/Miller_Documents/Projects/RightWhale_BlowAndPhotogrammetry/ASV_Sequences_MiSeq_Library_L2_RightWhaleBlowProject.txt", sep = "\t", row.names = TRUE, col.names = TRUE)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Identifying and removing contaminant sequences in the samples from the CDA project using the ‘decontam’ R package. Please Note: This is a different project than the one used as an example in the DADA2 pipeline above because it offers a better example.
Contaminating DNA sequences may have been introduced at during the process of collecting and processing samples. The control samples collected from the experiment (both biological and technical replicates) and during DNA extraction and PCR amplification can be used to help statistically identify contaminants in the samples.
Following the decontam tutorial: https://benjjneb.github.io/decontam/vignettes/decontam_intro.html
decontam tutorial:
“The first decontam ingredient is a feature table derived from your raw data, i.e. a table of the relative abundances of sequence features (columns) in each sample (rows). These “sequence features” can be any of a wide variety of feature types, including amplicon sequence variants (ASVs), operational taxonomic units (OTUs), taxonomic groups or phylotypes (e.g. genera), orthologous genes or metagenome-assembled-genomes (MAGs) – anything with a quantitative abundance derived from marker-gene or metagenomics data.""The second decontam ingredient is one of two types of metadata:
DNA quantitation data recording the concentration of DNA in each sample. Most often this is collected in the form of fluorescent intensities measured prior to mixing samples into equimolar ratios for sequencing (and after PCR in amplicon sequencing), but sometimes it is collected via qPCR or other approach on extracted DNA.
A defined set of “negative control” samples in which sequencing was performed on blanks without any biological sample added. Extraction controls are preferred, and in amplicon sequencing the negative controls should also be carried through the PCR step, as each step in the workflow has the potential to introduce new contaminants.
“Finally, this data needs to be imported into R. The feature table should be imported as a sample-by-feature matrix and the sample metadata as a vector with length the number of samples. Alternatively, you can use the phyloseq package to import the data as a phyloseq object.”
If you are starting a fresh session in r-studio and need to re-import your data, here is an example of how to do that. But also you will need to import your metadata file with your post PCR-purification concentrations. In case you have any not detectables in your column concentrations, this code block has a line to change those to 0.001 because decontam requires a number that is not zero.
NOTE SUMMARY: decontam uses one of two methods to identify contaminants:
More details can be found in the sections below.
IMPORTANT NOTE: It is important to have a look at your controls (both in terms of taxonomy and similarity) before you remove them. One way to get an idea for the similarity among the controls and between the controls and the samples is to run decontam but before removing the contaminants that were identified and before removing your controls from the dataset, run an ordination with the decontam scores as colors so you can see how the controls fall out in terms of similarity.
#Clear the global environment for a fresh start to ensure I do not use any carry-over variables.
rm(list=ls())
# Import Count Table
MATRIX_COUNT <- read.table("/Users/cmiller/Dropbox/Miller_Documents/Projects_NonWhale/CDA_main/dada2_CDA/ASV_and_Tax_Tables/pre_decontam/ASV_CountTable_MiSeq_Libraries_O1_P1_CDAProject.txt", header=TRUE)
#Import Taxonomy Table
taxonomy <- read.table("/Users/cmiller/Dropbox/Miller_Documents/Projects_NonWhale/CDA_main/dada2_CDA/ASV_and_Tax_Tables/pre_decontam/ASV_Taxonomy_MiSeq_Libraries_O1_P1_CDAProject.txt", header=TRUE)
#taxonomy table needs to be a matrix for phyloseq
taxonomy <- as.matrix(taxonomy)
# Import Sample Data
metadata <- read.table("/Users/cmiller/Dropbox/Miller_Documents/Projects_NonWhale/CDA_main/dada2_CDA/ASV_and_Tax_Tables/CDA_metadata.txt",sep="\t",header=T,row.names=1)
#remove mock community - will do this in phyloseq
#one of the PCR controls did not result in any fastq files so I will remove it
metadata <- subset(metadata, Sample_LabID != "PCR_ctrl_Lib_P1") # now fatsq file from the sequencer for this sample so will remove from metadata
#decontam is going to need the concentrations of the amplicons to be positive numerics so we will have to substitute the 'nd' we used for not detectable to a zero:
metadata$Amplicon_Concentration_ng_per_uL <- gsub('nd', "0.001", metadata$Amplicon_Concentration_ng_per_uL)
metadata$Amplicon_Concentration_ng_per_uL <- as.numeric(metadata$Amplicon_Concentration_ng_per_uL)
# Verify that all samples match between the relative abundance matrix and the metadata
MATRIX_COUNT_forcheck <- MATRIX_COUNT
MATRIX_COUNT_forcheck <- tibble::rownames_to_column(MATRIX_COUNT, var = "Sample_LabID")
metadata_forcheck <- metadata
#metadata_forcheck <- tibble::rownames_to_column(metadata, var = "SampleID")
if (is.na(anti_join(metadata_forcheck, MATRIX_COUNT_forcheck, by = "Sample_LabID")[1,1]) == TRUE){
print("All samples that are in the metadata are in the count table")
} else {
print("WARNING! Not all samples match between metadata and relative abundance data")
}
## [1] "All samples that are in the metadata are in the count table"
rm(MATRIX_COUNT_forcheck)
rm(metadata_forcheck)
decontam tutorial: “The decontam package works with feature tables in the form of a standard R matrix, but it is even easier to work with phyloseq objects from the phyloseq package, which is designed to ease the analysis of marker-gene and metagenomics datasets.”
OTU <- otu_table(MATRIX_COUNT, taxa_are_rows = FALSE)
TAX <- tax_table(taxonomy)
META = sample_data(metadata)
###CREATE PHYLOSEQ OBJECT and CHECK THAT IT IS CORRECT
ps <- phyloseq(OTU, TAX, META)
ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 31103 taxa and 210 samples ]
## sample_data() Sample Data: [ 210 samples by 19 sample variables ]
## tax_table() Taxonomy Table: [ 31103 taxa by 6 taxonomic ranks ]
rm(OTU)
rm(TAX)
rm(META)
cat("\n")
cat("This phyloseq objects has a table of", ntaxa(ps), "amplicon sequence variants (ASVs) inferred by the DADA2 algorithm from amplicon sequencing data of the V4 region of the 16S rRNA gene. The phyloseq objects also includes the sample metadata information needed to use decontam.")
## This phyloseq objects has a table of 31103 amplicon sequence variants (ASVs) inferred by the DADA2 algorithm from amplicon sequencing data of the V4 region of the 16S rRNA gene. The phyloseq objects also includes the sample metadata information needed to use decontam.
Make sure the concentrations are numeric
head(sample_data(ps))
## Sample_LabID SampleName SampleCode FreezerBox_Number
## DNA_Blank_1 DNA_Blank_1 DNA_Blank_1 DNA_Blank_1 NA
## DNA_Blank_10 DNA_Blank_10 DNA_Blank_10 DNA_Blank_10 NA
## DNA_Blank_11 DNA_Blank_11 DNA_Blank_11 DNA_Blank_11 NA
## DNA_Blank_12 DNA_Blank_12 DNA_Blank_12 DNA_Blank_12 NA
## DNA_Blank_13 DNA_Blank_13 DNA_Blank_13 DNA_Blank_13 NA
## DNA_Blank_14 DNA_Blank_14 DNA_Blank_14 DNA_Blank_14 NA
## Material_Type Sample_or_Control Time_w
## DNA_Blank_1 DNA extraction blank Control NA
## DNA_Blank_10 DNA extraction blank Control NA
## DNA_Blank_11 DNA extraction blank Control NA
## DNA_Blank_12 DNA extraction blank Control NA
## DNA_Blank_13 DNA extraction blank Control NA
## DNA_Blank_14 DNA extraction blank Control NA
## Biological_replicate_number Technical_replicate_number
## DNA_Blank_1 NA NA
## DNA_Blank_10 NA NA
## DNA_Blank_11 NA NA
## DNA_Blank_12 NA NA
## DNA_Blank_13 NA NA
## DNA_Blank_14 NA NA
## DNA_Extract_Batch DNA_Extract_Date PCR_Number PCR_Date
## DNA_Blank_1 2 4_29_2021 Barcode_2 06_Sept_2021
## DNA_Blank_10 11 6_11_2021 Barcode_4 21_Sept_2021
## DNA_Blank_11 12 6_14_2021 Barcode_2 06_Sept_2021
## DNA_Blank_12 13 6_15_2021 Barcode_4 21_Sept_2021
## DNA_Blank_13 14 6_16_2021 Barcode_5_Lib1 18_Oct_2021
## DNA_Blank_14 15 6_18_2021 Barcode_4 21_Sept_2021
## PCR_Plate_Location Forward_Primer Reverse_Primer
## DNA_Blank_1 Plate_2_8C B504 703
## DNA_Blank_10 Plate_4_2H B503 705
## DNA_Blank_11 Plate_2_8H B504 708
## DNA_Blank_12 Plate_4_3H B503 706
## DNA_Blank_13 Plate_5_5G B504 709
## DNA_Blank_14 Plate_4_4H B503 707
## Quantification_Date Amplicon_Concentration_ng_per_uL
## DNA_Blank_1 23_Sept_2021 0.001
## DNA_Blank_10 24_Sept_2021 0.001
## DNA_Blank_11 23_Sept_2021 0.001
## DNA_Blank_12 28_Sept_2021 0.001
## DNA_Blank_13 22_Oct_2021 14.200
## DNA_Blank_14 29_Sept_2021 0.296
## Library_Assignment
## DNA_Blank_1 Library_O1
## DNA_Blank_10 Library_P1
## DNA_Blank_11 Library_O1
## DNA_Blank_12 Library_P1
## DNA_Blank_13 Library_O1
## DNA_Blank_14 Library_P1
The key sample variables are Amplicon_Concentration_ng_per_µL, which is the DNA concentration in each sample as measured by fluorescent intensity after PCR purification, and Sample_or_Control which tells us which samples are the negative controls.
df <- as.data.frame(sample_data(ps)) # Put sample_data into a ggplot-friendly data.frame
df$LibrarySize <- sample_sums(ps)
df <- df[order(df$LibrarySize),]
df$Index <- seq(nrow(df))
ggplot(data=df, aes(x=Index, y=LibrarySize, color=Sample_or_Control)) + geom_point()
decontam tutorial: "The library sizes of the positive samples primarily fall from 15,000 to 40,000 reads, but there are some low-read outliers. The negative control samples have fewer reads as expected. Note: It is important keep the low-read samples for now, because we want to use those negative controls to help identify contaminants!
cat("The library sizes here range from", range(df$LibrarySize)[1], "to", range(df$LibrarySize)[2], "reads. Just like the tutorial, there are some low-read outliers. And, the negative controls have fewer reads as expected. Please look at the plot in the decontam tutorial https://benjjneb.github.io/decontam/vignettes/decontam_intro.html")
## The library sizes here range from 37 to 155070 reads. Just like the tutorial, there are some low-read outliers. And, the negative controls have fewer reads as expected. Please look at the plot in the decontam tutorial https://benjjneb.github.io/decontam/vignettes/decontam_intro.html
decontam tutorial: “In this method, the distribution of the frequency of each sequence feature as a function of the input DNA concentration is used to identify contaminants.”
IMPORTANT NOTES for consideration about the frequency method from Ben Callahan’s lecture st STAMPS summer 2023:
Before we can use the frequency method. we need to remove the controls:
ps.noctrl <- subset_samples(ps, Sample_or_Control != "Control")
One of the considerations from Ben Callahan’s lecture and from the decontam issue tracker https://github.com/benjjneb/decontam/issues/91 indicates that there should be a three-fold range in the amplicon concentrations for the frequency method. In the issue tracker, B. Callahan wrote, “B. Callahan wrote the following about the range presented in the issue tracker:”a wider concentration would make frequency-based contaminant classification more powerful and accurate"
Thus, let’s run summary stats on our concentrations to see what the concentrations look like:
# provides the summary stats on the amplicon concentrations
summary(sample_data(ps.noctrl)$Amplicon_Concentration_ng_per_uL)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.180 5.165 6.570 6.735 7.855 16.200
# calculates the fold range in the concentrations
cat("\n","The amplicon concentrations have a", round(max(sample_data(ps.noctrl)$Amplicon_Concentration_ng_per_uL)/min(sample_data(ps.noctrl)$Amplicon_Concentration_ng_per_uL), digits = 1), "fold range.")
##
## The amplicon concentrations have a 7.4 fold range.
Relative to the DNA concentrations in the aforementioned issue in the decontam issue tracker (https://github.com/benjjneb/decontam/issues/91), we have a limited range in our amplicon DNA concentrations but we do have a 7.4 fold range so it might be worth a try.
contamdf.freq <- isContaminant(ps.noctrl, method="frequency", conc="Amplicon_Concentration_ng_per_uL")
head(contamdf.freq)
## freq prev p.freq p.prev p contaminant
## ASV1 0.14005666 169 0.7679470 NA 0.7679470 FALSE
## ASV2 0.05561513 126 0.7152153 NA 0.7152153 FALSE
## ASV3 0.06167253 132 0.3821514 NA 0.3821514 FALSE
## ASV4 0.03046965 145 0.4543270 NA 0.4543270 FALSE
## ASV5 0.01599135 145 0.5573144 NA 0.5573144 FALSE
## ASV6 0.02414447 42 0.4667109 NA 0.4667109 FALSE
decontam tutorial: “This calculation has returned a data.frame with several columns, the most important being $p which contains the probability that was used for classifying contaminants, and $contaminant which contains TRUE/FALSE classification values with TRUE indicating that the statistical evidence that the associated sequence feature is a contaminant exceeds the user-settable threshold. As we did not specify the threshold, the default value of threshold = 0.1 was used, and $contaminant=TRUE if $p < 0.1.”
table(contamdf.freq$contaminant)
##
## FALSE TRUE
## 30999 104
cat("\n")
#head(which(contamdf.freq$contaminant))
which(contamdf.freq$contaminant)
## [1] 863 1439 2060 2397 2975 3053 3104 3218 3227 3320 3353 3449
## [13] 3451 3521 3788 3874 3999 4031 4051 4171 4438 4513 4602 4629
## [25] 4645 4656 4700 4936 5038 5067 5088 5206 5387 5409 5507 5528
## [37] 5586 5744 5794 6004 6020 6069 6109 6198 6348 6480 6481 6499
## [49] 6815 6902 6980 7100 7186 7224 7406 7522 7636 7647 7696 7835
## [61] 7851 7984 8020 8037 8074 8077 8365 8380 8868 8886 9317 9418
## [73] 9465 9644 9692 10759 10952 11062 11610 11698 11905 12087 12111 12380
## [85] 12418 12419 12425 12597 12628 12831 13011 13028 13352 13452 14099 14231
## [97] 14531 14945 15346 15723 15817 16133 16251 16533
cat("\n")
cat(sum(contamdf.freq$contaminant),"ASVs were identified as contaminants using the ferquency method with the default settings ", "\n")
## 104 ASVs were identified as contaminants using the ferquency method with the default settings
The decontam tutorial visualizes an example of a clear non-contaminant (the 1st ASV) and a clear contaminant (the 3rd ASV).
We will do the same here:
Here we are plotting 2 non-contaminants (ASV 1 and ASV 2) and 7 of the identified contaminants
plot_frequency(ps.noctrl, taxa_names(ps)[c(1,863, 1439, 2060, 2397, 2975, 3053, 3104, 3218, 3227, 3320, 3353)], conc="Amplicon_Concentration_ng_per_uL") +
xlab("DNA Concentration (ng/µL)")
## Warning: Transformation introduced infinite values in continuous y-axis
decontam tutorial: “in this plot the dashed black line shows the model of a noncontaminant sequence feature for which frequency is expected to be independent of the input DNA concentration. The red line shows the model of a contaminant sequence feature, for which frequency is expected to be inversely proportional to input DNA concentration, as contaminating DNA will make up a larger fraction of the total DNA in samples with very little total DNA.”
Here I plotted an ASV that was NOT identified as contaminants (ASV1) and eleven that were identified as contaminants.
And now let’s plot a few more non-contaminants with which we can compare
plot_frequency(ps.noctrl, taxa_names(ps)[c(1,2,3, 4,5, 6, 7, 8, 90)], conc="Amplicon_Concentration_ng_per_uL") +
xlab("DNA Concentration (ng/uL)")
## Warning: Transformation introduced infinite values in continuous y-axis
The frequency contaminant identification method generates plots that are somewhat different for the identified contaminants than the decontam tutorial. However, the plots of the identified contaminants do look quite different from those of non-contaminants.
Let’s try the prevalence method.
decontam tutorial: “In this method, the prevalence (presence/absence across samples) of each sequence feature in true positive samples is compared to the prevalence in negative controls to identify contaminants. In our phyloseq object,”Sample_or_Control" is the sample variable that holds the negative control information. We’ll summarize that data as a logical variable, with TRUE for control samples, as that is the form required by isContaminant."
IMPORTANT NOTES for consideration about the prevalence method from Ben Callahan’s lecture st STAMPS summer 2023:
the ideal negative control, is one that goes through the entire process, including the sampling environment. Such a control will capture any contaminants in the field AND DNA extraction AND PCR.
One should have five negative controls/96 well plate but if you work in low biomass environments, then you want about one to four ratio of negative controls to actual samples
sample_data(ps)$is.neg <- sample_data(ps)$Sample_or_Control == "Control"
contamdf.prev <- isContaminant(ps, method="prevalence", neg="is.neg")
table(contamdf.prev$contaminant)
##
## FALSE TRUE
## 31039 64
cat("\n")
#head(which(contamdf.prev$contaminant))
which(contamdf.prev$contaminant)
## [1] 48 113 160 173 185 198 204 223 240 244 299 307 314 360 369
## [16] 374 513 611 640 695 777 841 853 866 979 1056 1128 1460 1479 1562
## [31] 1566 1595 1607 1654 1665 1669 1726 1727 1781 1833 2034 2175 2283 2442 2457
## [46] 2491 2501 2525 2698 2876 2903 2931 3166 3470 3500 3530 3550 3552 3947 4332
## [61] 4634 4964 5989 6448
cat("\n")
cat(sum(contamdf.prev$contaminant),"ASVs were identified as contaminants using the prevalence method with the default settings ", "\n")
## 64 ASVs were identified as contaminants using the prevalence method with the default settings
Interestingly, all the contaminants identified by the prevalence method were different from those identified by the frequency method.
Let’s visualize these results by plotting the number of times the ASVs were observed in negative controls and true samples.
ps.pa <- transform_sample_counts(ps, function(abund) 1*(abund>0))
ps.pa.neg <- prune_samples(sample_data(ps.pa)$Sample_or_Control == "Control", ps.pa)
ps.pa.pos <- prune_samples(sample_data(ps.pa)$Sample_or_Control == "True_sample", ps.pa)
# Make data.frame of prevalence in positive and negative samples
df.pa <- data.frame(pa.pos=taxa_sums(ps.pa.pos), pa.neg=taxa_sums(ps.pa.neg),
contaminant=contamdf.prev$contaminant)
ggplot(data=df.pa, aes(x=pa.neg, y=pa.pos, color=contaminant)) + geom_point() +
xlab("Prevalence (Negative Controls)") + ylab("Prevalence (True Samples)")
The ASVs appear to separate into a branch that shows up mostly in positive samples and another mostly in negative samples. There is the concerning ASVs that shows up in most/all of the negative controls and most of the samples
decontam tutorial: “Note that as before, the default threshold for a contaminant is that it reaches a probability of 0.1 in the statistical test being performed. In the prevalence test there is a special value worth knowing, threshold=0.5, that will identify as contaminants all sequences thare are more prevalent in negative controls than in positive samples. Let’s try using this more aggressive classification threshold rather than the default.”
contamdf.prev05 <- isContaminant(ps, method="prevalence", neg="is.neg", threshold=0.5)
table(contamdf.prev05$contaminant)
##
## FALSE TRUE
## 30897 206
#head(which(contamdf.prev05$contaminant))
which(contamdf.prev05$contaminant)
## [1] 48 62 113 160 173 185 198 204 223 240 244 299
## [13] 307 314 360 369 374 513 606 611 640 695 737 777
## [25] 797 841 853 866 979 1056 1096 1128 1175 1210 1316 1325
## [37] 1371 1389 1439 1460 1479 1562 1566 1595 1607 1624 1654 1665
## [49] 1669 1677 1726 1727 1742 1748 1781 1801 1833 1860 1878 1884
## [61] 1893 1941 1989 1998 2020 2026 2034 2071 2073 2083 2090 2175
## [73] 2203 2227 2241 2247 2283 2356 2401 2422 2442 2457 2465 2491
## [85] 2501 2525 2584 2607 2608 2652 2692 2698 2740 2776 2824 2876
## [97] 2903 2931 2952 2996 3046 3056 3104 3110 3117 3118 3120 3128
## [109] 3130 3166 3167 3180 3187 3209 3230 3231 3233 3255 3256 3316
## [121] 3369 3416 3470 3500 3530 3545 3546 3550 3551 3552 3583 3599
## [133] 3613 3614 3628 3644 3648 3681 3737 3790 3804 3811 3850 3883
## [145] 3888 3947 4058 4149 4157 4164 4174 4245 4332 4364 4378 4380
## [157] 4407 4437 4485 4486 4513 4514 4569 4603 4605 4621 4634 4635
## [169] 4732 4740 4815 4850 4964 4965 5054 5073 5079 5147 5238 5291
## [181] 5393 5503 5552 5764 5819 5989 6197 6208 6209 6211 6354 6448
## [193] 6531 6605 6670 6770 6887 7243 7244 7394 7962 8212 8796 10059
## [205] 10713 15582
cat("\n")
cat(sum(contamdf.prev05$contaminant),"ASVs were identified as contaminants using the prevalence method with the threshold=0.5 setting ", "\n")
## 206 ASVs were identified as contaminants using the prevalence method with the threshold=0.5 setting
Let’s again visualize these results by plotting the number of times the ASVs were observed in negative controls and true samples.
#ps.pa <- transform_sample_counts(ps, function(abund) 1*(abund>0))
#ps.pa.neg <- prune_samples(sample_data(ps.pa)$Sample_or_Control == "Control", ps.pa)
#ps.pa.pos <- prune_samples(sample_data(ps.pa)$Sample_or_Control == "True_sample", ps.pa)
# Make data.frame of prevalence in positive and negative samples
df.pa.prev05 <- data.frame(pa.pos=taxa_sums(ps.pa.pos), pa.neg=taxa_sums(ps.pa.neg),
contaminant=contamdf.prev05$contaminant)
ggplot(data=df.pa.prev05, aes(x=pa.neg, y=pa.pos, color=contaminant)) + geom_point() +
xlab("Prevalence (Negative Controls)") + ylab("Prevalence (True Samples)")
For our purposes, not much is gained by increasing the threshold to 0.5 so we will proceed with results of the prevalence method with the default settings (threshold = 0.1)
Before we remove contaminants, let’s do a little cleanup of the environment
rm(contamdf.freq)
rm(contamdf.prev05)
rm(df.pa.prev05)
Here we will remove the contaminants identified by the prevalence method (threshold = 0.1).
ps.nocontam <- prune_taxa(!contamdf.prev$contaminant, ps)
# Verify that the contaminants were removed:
cat("Number of ASVs in the dataset prior to removing decontam identified contaminants =", ntaxa(ps), "\n")
## Number of ASVs in the dataset prior to removing decontam identified contaminants = 31103
cat("Number of ASVs considered contaminants by decontam =",length(which(contamdf.prev$contaminant)), "\n")
## Number of ASVs considered contaminants by decontam = 64
cat("The difference between these two numbers =", ntaxa(ps)-length(which(contamdf.prev$contaminant)), ",","\n")
## The difference between these two numbers = 31039 ,
#print("...which should equal the total number of ASVs in the dataset after removing the contaminants, and that is", ntaxa(ps.nocontam), "ASVs")
#ntaxa(ps.nocontam)
cat("which should equal the number of ASVs in the dataset from which contaminants were removed =", ntaxa(ps.nocontam), "\n", "\n")
## which should equal the number of ASVs in the dataset from which contaminants were removed = 31039
##
cat("Another check: The difference between the original dataset and the all-contaminants-removed dataset =", ntaxa(ps)-ntaxa(ps.nocontam),",", "\n")
## Another check: The difference between the original dataset and the all-contaminants-removed dataset = 64 ,
cat("which should equal the total number of ASVs identified as contaminants by decontam =", length(which(contamdf.prev$contaminant)), "\n", "\n")
## which should equal the total number of ASVs identified as contaminants by decontam = 64
##
cat("check the code if the numbers above don't match where they are supposed to match!", "\n")
## check the code if the numbers above don't match where they are supposed to match!
Removing ASVs not present in any sample might be important for some of your upcoming analyses
ps.nocontam.filt <- filter_taxa(ps.nocontam, function(x) sum(x) > 0, TRUE)
ps.nocontam.filt
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 31039 taxa and 210 samples ]
## sample_data() Sample Data: [ 210 samples by 20 sample variables ]
## tax_table() Taxonomy Table: [ 31039 taxa by 6 taxonomic ranks ]
cat("Number of ASVs in the dataset (without contaminants) prior to removing ASVs not present in any sample =", ntaxa(ps.nocontam), "\n", "\n")
## Number of ASVs in the dataset (without contaminants) prior to removing ASVs not present in any sample = 31039
##
cat("Number of ASVs in the dataset (without contaminants) after removing ASVs not present in any sample =", ntaxa(ps.nocontam.filt), "\n", "\n")
## Number of ASVs in the dataset (without contaminants) after removing ASVs not present in any sample = 31039
##
cat("The number of ASVs not present in any sample that were removed =", ntaxa(ps.nocontam)-ntaxa(ps.nocontam.filt), "\n", "\n")
## The number of ASVs not present in any sample that were removed = 0
##
Pulling it all together - decontam
decontam tutorial: “The two basic methods implemented are the frequency and prevalence methods shown above, but a number of additional ways to utilize those methods are available. The combined, minimum and independent modes all use both the frequency and prevalence methods to identify contaminants, but combine the results of the two methods in different ways (see ?isContaminant for more information). There is also a batch functionality that allows contaminants to be identified independently in different batches (e.g. sequencing runs, or different studies) that are expected to have different contaminant profiles.”
NOTE: Deciding which method you want to use to identify contaminants will be project specific.
decontam Conclusions
decontam tutorial: "We all know that contaminants are a problem in marker-gene and metagenomics data. Right now, it is widespread practice to address contamition by sequencing negative controls, and then doing nothing, in large part because there haven’t been easy to use tools available to identify contaminants in this kind of data. That’s where the decontam package comes in. decontam provides a simple interface that takes in your table of sequence features (ASVs, OTUs, MAGs, genera, etc), and classifies each sequence feature as a contaminant based on signatures of contamination that have been demonstrated over many previous studies.
Removing contaminants makes the characterizations provided by marker-gene and metagenomics sequencing more accurate. It prevent false positives in exploratory analyses. It reduces batch effects between different studies and sequencing runs. It increases statistical power by reducing the additional hypotheses spent on contaminant sequencing features.
Removing contaminants makes your data better, and that’s always worth doing."
Here is an example of how to save your tables to text files for future analyses
# Comment: write out these as .txt files to my local computer - done on INSERT DATE
# ASV_Count <- data.frame(otu_table(ps.nocontam.filt))
# write.table(ASV_Count, file = "~/Documents/Projects_NonWhale/CDA_main_local/ASV_and_Tax_Tables/post_decontam/ASV_CountTableNoContam_MiSeq_Libraries_O1_P1_CDAProject.txt", sep = "\t", row.names = TRUE, col.names = TRUE)
#
# ASV_Taxonomy <- data.frame(tax_table(ps.nocontam.filt))
# write.table(ASV_Taxonomy, file = "~/Documents/Projects_NonWhale/CDA_main_local/ASV_and_Tax_Tables/post_decontam/ASV_TaxonomyNoContam_MiSeq_Libraries_O1_P1_CDAProject.txt", sep = "\t", row.names = TRUE, col.names = TRUE)
#
# rm(ASV_Count)
# rm(ASV_Taxonomy)
V1 - drafted 26 Jan - 01 Feb 2024 Carolyn Miller