Apprill Lab Protocol: Bioinformatic pipeline for 16S rRNA gene sequences using DADA2 and decontam

This bioinformatics pipeline protocol is intended for processing the V4 region of the 16S rRNA gene amplicons sequenced by Illumina MiSeq sequencing 2x250 bp. The pipeline uses DADA2 (Callahan et al 2016) to infer amplicon sequence variants (ASV) from the amplicon sequences and “resolves differences of as little as 1 nucleotide”. The pipeline also uses DADA2 to assign taxonomy to the ASVs using using the Silva reference database. In addition, the pipeline uses decontam to “provide simple statistical methods to identify and visualize contaminating DNA features, allowing them to be removed and a more accurate picture of sampled communities to be constructed from marker-gene…data” (Davis et al 2018). Thus, this pipeline generates ASV and taxonomy tables with statistically identified contaminating taxa removed that will be ready for downstream analyses.

This protocol is written for use in a r-markdown document. I usually start the pipeline by adding a brief description of the project and the sample processing methods. I also like to specify the people who helped collect samples and process them. I find doing this helps when it comes to writing the manuscript.

Project Description

Please provide a brief description of the project here.

People who have worked on the project

Please provide a list of the people who have worked in the project and their affiliations. Don’t forgot those who have collected the samples, people who have helped with the sample processing and so on. hanges for your datasets.

Brief summary of sample processing of the samples:

Please provide a brief description of the sample processing here, a basic summary of the lab methods.

Here is an example:
DNA was extracted from the samples using the DNeasy PowerBiofilm kit (Qiagen, Germantown, MD, USA). The v4 region of the 16srRNA gene was amplified with barcoded 515FY and 806RB primers using the MiSeq protocol on the thermocycler for 37-38 cycles depending on the sample. 1 or 2 µL sample DNA (volume was sample dependent) was used in duplicate 25 µL reactions. PCR product was purified by gel purification with the Qiagen MinElute Gel Extraction Kit (Qiagen Inc., Germantown, MD, USA). Purified product was quantified using Qubit High Sensitivity dsDNA fluorometric assay and pooled to equimolar concentrations such that 5 ng purified amplicons per sample was put into the pooled library. Samples were sequenced in two sequencing runs (one for each pool of samples) with Illumina MiSeq sequencing (250 bp paired end) by the Roy J. Carver Biotechnology Center Core Sequencing Facility at the University of Illinois Urbana-Champaign as part of Apprill Lab sequencing library P2. The library prep details can be found in an excel workbook at [INSERT PATH or LINK]

Please provide the locations of the fastq files:
An example: The 2022 fastq files are currently located in two places:
1. Apprill Lab Google Drive /Sequencing related/Sequencing_Data/MiSeq_Sequencing_Data
2. /Users/cmiller/

Please provide the location of this Rmd file:
An example:
Location of this Rmd file: [INSERT PATH TO YOUR RMD FILE]

Preparing for DADA2

Details about DADA2 can be found here: https://benjjneb.github.io/dada2/index.html

For our purposes, the DADA2 Tutorial will be followed almost line by line. The tutorial can be located at https://benjjneb.github.io/dada2/tutorial.html.

You will need to first install DADA2, if you have not already

DADA2 requires the following:

Samples have been demultiplexed, i.e. split into individual per-sample fastq files.
Non-biological nucleotides have been removed, e.g. primers, adapters, linkers, etc.
If paired-end sequencing data, the forward and reverse fastq files contain reads in matched order.

All criteria are met.

IMPORTANT NOTE REGARDING MULTIPLE SEQUENCING RUNS: If you have multiple sequencing libraries that you would like to run through DADA2 to use in the same downstream analysis, you will run each sequencing run through the DADA2 pipeline separately until just before the chimera checking step. Once you have each run through the pipeline until that point, you will merge sequence tables with the command: mergetab <- mergeSequenceTables(seqtab2_run1, seqtab2_run2) and then check for chimeras and proceed with the rest of the pipeline

Prepare fastq files for dada2

Here you will need to fill in your specific directories


#change directory to the one with the fastq files (example commented out below)
   #cd /Users/cmiller/Dropbox/Miller_Documents/Projects/RightWhale_BlowAndPhotogrammetry/DADA2_Analyses/2022Sequences_RightWhaleBlowProject

#unzip the fastq files
    #gunzip *.fastq.gz

# These packages are needed for dada2

library(dada2); packageVersion("dada2")

## Loading required package: Rcpp

## [1] '1.20.0'

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(phyloseq); packageVersion("phyloseq") # load phyloseq library

## Warning: package 'phyloseq' was built under R version 4.0.3

## [1] '1.34.0'

library(ggplot2)
library(decontam); packageVersion("decontam")

## Warning: package 'decontam' was built under R version 4.0.3

## [1] '1.10.0'

#Clear the global environment for a fresh start to ensure I do not use any carry-over variables.
rm(list=ls())

Define path to sequences

Here is an example - you will need to define your own path.

path <- "/Users/cmiller/Dropbox/Miller_Documents/Projects/RightWhale_BlowAndPhotogrammetry/DADA2_Analyses/2022Sequences_RightWhaleBlowProject"

head(list.files(path))

## [1] "CCB_2019_1_TGAGTACG-TACGAGAC_L001_R1_001.fastq"
## [2] "CCB_2019_1_TGAGTACG-TACGAGAC_L001_R2_001.fastq"
## [3] "CCB_2019_2_CTGCGTAG-TACGAGAC_L001_R1_001.fastq"
## [4] "CCB_2019_2_CTGCGTAG-TACGAGAC_L001_R2_001.fastq"
## [5] "CCB_2019_3_TAGTCTCC-TACGAGAC_L001_R1_001.fastq"
## [6] "CCB_2019_3_TAGTCTCC-TACGAGAC_L001_R2_001.fastq"

Read names of fastq files and create matched list of forward and reverse reads

This step reads in the fastq filenames and creates a matched list of the forward and reverse filenames.

The second part of this step “extracting sample names” assumes the sample names have the format: SAMPLENAME_XXX.fastq
If your sample names look different, you will need to do some string manipulations to extract the sample name from the fastq filename.

A helpful reference for string manipulation is: https://r4ds.had.co.nz/strings.html#strings
and within that reference: https://r4ds.had.co.nz/strings.html#splitting

# Comment: DADA2 tutorial (you will need to remove the # to use):
    # Comment: Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq
    #fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))
    #fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))
    
    # Comment: Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
    #sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)

# Comment: Modified from the DADA2 tutorial to accommodate the differences in my sample names
    # Comment: Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq
    fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))
    fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))
    # Comment: Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
    sample.names22_1 <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
    sample.names22_2 <- sapply(strsplit(basename(fnFs), "_"), `[`, 2)
    sample.names22_3 <- sapply(strsplit(basename(fnFs), "_"), `[`, 3)
    sample.names <- paste(sample.names22_1, sample.names22_2, sample.names22_3, sep='_')

# Comment: clean up work space
    rm(sample.names22_1)
    rm(sample.names22_2)
    rm(sample.names22_3)

Inspect the read quality profiles

The command for doing this in the DADA2 tutorial is: plotQualityProfile(fnFs[1:2]) which plots the read quality profiles of the first two samples in the list of files.

Things to consider for your own data: You might consider choosing representative samples from each type of sample in your dataset instead of the first two samples on the list. Make sure you are viewing samples and not controls (controls can be messy looking).

Forward reads of blow samples

# Comment: This is the command copied from DADA2 tutorial:
    #plotQualityProfile(fnFs[1:2])

# Comment: This is an example of choosing and plotting specific samples
    a <- c(1,6,28:30)
    plotQualityProfile(fnFs[a])

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

# Comment: clean up work space
    rm(a)

Forward reads of seawater samples

# Comment: This is the command copied from DADA2 tutorial:
    #plotQualityProfile(fnFs[1:2])

# Comment: This is an example of choosing and plotting specific samples
    a <- c(8,9,24)
    plotQualityProfile(fnFs[a])

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

# Comment: clean up work space
    rm(a)

Reverse reads of blow samples

# Comment: This is the command copied from DADA2 tutorial:
    #plotQualityProfile(fnFs[1:2])

# Comment: This is an example of choosing and plotting specific samples
    a <- c(1,6,28:30)
    plotQualityProfile(fnRs[a])

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

# Comment: clean up work space
    rm(a)

Reverse reads of seawater samples

# Comment: This is the command copied from DADA2 tutorial:
    #plotQualityProfile(fnFs[1:2])

# Comment: This is an example of choosing and plotting specific samples
    a <- c(8,9,24)
    plotQualityProfile(fnRs[a])

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

# Comment: clean up work space
    rm(a)

DADA2 tutorial: “Green is median quality and orange is the quartiles of quality distribution.”In gray-scale is a heat map of the frequency of each quality score at each base position. The median quality score at each position is shown by the green line, and the quartiles of the quality score distribution by the orange lines. The red line shows the scaled proportion of reads that extend to at least that position (this is more useful for other sequencing technologies, as Illumina reads are typically all the same length, hence the flat red line)."

The forward reads in the tutorial were of good quality. The tutorial recommends trimming the last 10 nucleotides, i.e., truncating the read at position 240. The forward read quality profiles look like those in the tutorial, with the quartiles of the quality score distribution falling off-ish at the last 10 bp so I will trim the forward reads to 240 as recommended.

Like the reads in the tutorial, the reverse reads are lower and more variable quality (apparently common in Illumina sequencing).

DADA2 tutorial: “The reverse reads are of significantly worse quality, especially at the end, which is common in Illumina sequencing. This is not too worrisome, as DADA2 incorporates quality information into its error model which makes the algorithm robust to lower quality sequence, but trimming as the average qualities crash will improve the algorithm’s sensitivity to rare sequence variants. Based on these profiles, we will truncate the reverse reads at position 160 where the quality distribution crashes.”

Some of the reverse reads look better than those in the tutorial and do not look like they need as much trimming so I think I will trim at 200, which is different from the tutorial, which trims to 160.

Look at your plots and decide what the best trim length would be for your own dataset.

Filter and trim files

Put all the files in a sub-directory called filtered

filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names

maxN= is the number of Ns (dada2 requires none)
maxEE= is the number of expected error allowed in a read.
This example uses the same settings as the tutorial.

Filter and trim

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,200),
              maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
              compress=TRUE, multithread=TRUE) 
#head(out)
out

##                                                          reads.in reads.out
## CCB_2019_1_TGAGTACG-TACGAGAC_L001_R1_001.fastq              32890     29937
## CCB_2019_2_CTGCGTAG-TACGAGAC_L001_R1_001.fastq              45172     38017
## CCB_2019_3_TAGTCTCC-TACGAGAC_L001_R1_001.fastq              38901     36755
## CCB_2019_4_CGAGCGAC-TACGAGAC_L001_R1_001.fastq              40037     35708
## CCB_2019_5_CGAGAGTT-ACGTCTCG_L001_R1_001.fastq              35530     29127
## CCB_2022_1_ACTACGAC-ATCGTACG_L001_R1_001.fastq              75900     69565
## CCB_2022_10_GTCTGCTA-ACTATCTG_L001_R1_001.fastq             63288     59417
## CCB_2022_11_GTCTGCTA-TAGCGAGT_L001_R1_001.fastq             53171     50774
## CCB_2022_12_GTCTGCTA-CTGCGTGT_L001_R1_001.fastq             47290     44704
## CCB_2022_13_CTGCGTAG-ACGTCTCG_L001_R1_001.fastq             70220     66746
## CCB_2022_14_TAGTCTCC-ACGTCTCG_L001_R1_001.fastq             84342     81356
## CCB_2022_15_GTCTGCTA-TCATCGAG_L001_R1_001.fastq             56500     51148
## CCB_2022_16_GTCTGCTA-CTACTATA_L001_R1_001.fastq             51796     48003
## CCB_2022_17_GTCTGCTA-CGTTACTA_L001_R1_001.fastq             44088     40909
## CCB_2022_18_GTCTGCTA-AGAGTCAC_L001_R1_001.fastq             57927     52837
## CCB_2022_19_GTCTATGA-ATCGTACG_L001_R1_001.fastq             54083     51041
## CCB_2022_2_ACTACGAC-ACTATCTG_L001_R1_001.fastq              68442     62604
## CCB_2022_20_CGAGCGAC-ACGTCTCG_L001_R1_001.fastq             70641     67014
## CCB_2022_21_ACTACGAC-ACGTCTCG_L001_R1_001.fastq             72506     69005
## CCB_2022_22_GTCTATGA-ACTATCTG_L001_R1_001.fastq             56156     51735
## CCB_2022_23_GTCTATGA-TAGCGAGT_L001_R1_001.fastq             59604     54747
## CCB_2022_24_GTCTATGA-CTGCGTGT_L001_R1_001.fastq             49110     44391
## CCB_2022_25_GTCTATGA-TCATCGAG_L001_R1_001.fastq             59803     56832
## CCB_2022_26_GTCTGCTA-ACGTCTCG_L001_R1_001.fastq             64911     62229
## CCB_2022_28_GTCTATGA-CTACTATA_L001_R1_001.fastq             76814     70287
## CCB_2022_29_GTCTATGA-CGTTACTA_L001_R1_001.fastq             59206     53937
## CCB_2022_3_ACTACGAC-TAGCGAGT_L001_R1_001.fastq              66206     62081
## CCB_2022_30_GTCTATGA-AGAGTCAC_L001_R1_001.fastq             49946     45401
## CCB_2022_31_TATAGCGA-ATCGTACG_L001_R1_001.fastq             69596     61912
## CCB_2022_32_TATAGCGA-ACTATCTG_L001_R1_001.fastq             61732     57233
## CCB_2022_33_TATAGCGA-TAGCGAGT_L001_R1_001.fastq             55396     50664
## CCB_2022_34_TATAGCGA-CTGCGTGT_L001_R1_001.fastq             46146     42258
## CCB_2022_35_TATAGCGA-TCATCGAG_L001_R1_001.fastq             52883     49396
## CCB_2022_38_TATAGCGA-CTACTATA_L001_R1_001.fastq             59770     56069
## CCB_2022_39_TATAGCGA-CGTTACTA_L001_R1_001.fastq             34410     32040
## CCB_2022_4_ACTACGAC-CTGCGTGT_L001_R1_001.fastq              79567     75071
## CCB_2022_40_TATAGCGA-AGAGTCAC_L001_R1_001.fastq             42126     40024
## CCB_2022_41_CGAGAGTT-TACGAGAC_L001_R1_001.fastq             44416     41769
## CCB_2022_42_ACTACGAC-TACGAGAC_L001_R1_001.fastq             89727     86273
## CCB_2022_43_GTCTGCTA-TACGAGAC_L001_R1_001.fastq             83425     80224
## CCB_2022_44_GACATAGT-TACGAGAC_L001_R1_001.fastq             62971     58905
## CCB_2022_45_ACGCTACT-TACGAGAC_L001_R1_001.fastq             21261     19075
## CCB_2022_46_ACTCACTG-TACGAGAC_L001_R1_001.fastq             45270     39151
## CCB_2022_47_GTCTATGA-TACGAGAC_L001_R1_001.fastq             70879     68547
## CCB_2022_48_CGAGCGAC-AGAGTCAC_L001_R1_001.fastq             73419     70831
## CCB_2022_5_ACTACGAC-TCATCGAG_L001_R1_001.fastq              65706     61235
## CCB_2022_6_ACTACGAC-CTACTATA_L001_R1_001.fastq              50579     48407
## CCB_2022_7_ACTACGAC-CGTTACTA_L001_R1_001.fastq              80768     77504
## CCB_2022_8_ACTACGAC-AGAGTCAC_L001_R1_001.fastq              60839     55952
## CCB_2022_9_GTCTGCTA-ATCGTACG_L001_R1_001.fastq              56304     51753
## DNA_Blank_1_GACATAGT-ACGTCTCG_L001_R1_001.fastq             40182     37618
## DNA_Blank_2_ACGCTACT-ACGTCTCG_L001_R1_001.fastq             39087     35754
## DNA_Blank_3_ACTCACTG-ACGTCTCG_L001_R1_001.fastq              8118      6319
## MockEven_CCB_2022_27_GTCTATGA-ACGTCTCG_L001_R1_001.fastq    49580     43342
## PCR_Ctrl_1_TGAGTACG-ACGTCTCG_L001_R1_001.fastq                915       516
## PCR_Ctrl_2_CGAGCGAC-CGTTACTA_L001_R1_001.fastq                686       423

DADA2 tutorial: “Considerations for your own data: The standard filtering parameters are starting points, not set in stone. If you want to speed up downstream computation, consider tightening maxEE. If too few reads are passing the filter, consider relaxing maxEE, perhaps especially on the reverse reads (eg. maxEE=c(2,5)), and reducing the truncLen to remove low quality tails. Remember though, when choosing truncLen for paired-end reads you must maintain overlap after truncation in order to merge them later.”

Learn the error rates

DADA2 tutorial: “The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).”

Note: This parameter learning can be computationally intense.

Note: The minimum number of total bases to use for error rate learning is 1e8. Basically, samples are read into memory in the provided order (the order in the sample list above) until at least this number of total bases has been reached. If you would like to increase the number of bases used, you can do that by adding nbases= to the command (make sure you do the same for both forward and reverse). You can also tell the command to randomize the order the samples are read into the memory. I had a dataset in which all lot of my controls were at the top of the order, so I chose to adjust the command so the algorithm was learning the error rates on my samples, not my controls. You can also point the command to specific fastq files to use. The default settings are generally sufficient but I thought you should have a heads-up in case you have a situation like mine where the algorithm only learned errors on controls, not samples. See https://rdrr.io/bioc/dada2/man/learnErrors.html for all of the options for this command.

Learn the error rates of forward reads

errF <- learnErrors(filtFs, multithread=TRUE)

## 110580000 total bases in 460750 reads from 10 samples will be used for learning the error rates.

Learn the error rates of reverse reads

errR <- learnErrors(filtRs, multithread=TRUE)

## 108421200 total bases in 542106 reads from 11 samples will be used for learning the error rates.

Visualize the estimated error rates of forward reads

DADA2 tutorial: “It is always worthwhile, as a sanity check if nothing else, to visualize the estimated error rates:”

plotErrors(errF, nominalQ=TRUE)

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous y-axis

DADA2 tutorial: “The error rates for each possible transition (A→C, A→G, …) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence of the machine-learning algorithm. The red line shows the error rates expected under the nominal definition of the Q-score.”

DADA2 tutorial: “Considerations for your own data: Parameter learning is computationally intensive, so by default the learnErrors function uses only a subset of the data (the first 100M bases). If you are working with a large dataset and the plotted error model does not look like a good fit, you can try increasing the nbases parameter to see if the fit improves.”

The plots here look very much like those in the tutorial, which says the following about the tutorial plots: “Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected. Everything looks reasonable and we proceed with confidence.” Thus, the plots here also look like the error rates are a good fit to the observed rates and will proceed with confidence as suggested by the tutorial.

Sample Inference

This applies the DADA2 sample inference algorithm to the filtered and trimmed sequences. This is a computationally intensive step.

Sample Inference - forward reads

dadaFs <- dada(filtFs, err=errF, multithread=TRUE)

## Sample 1 - 29937 reads in 12610 unique sequences.
## Sample 2 - 38017 reads in 18075 unique sequences.
## Sample 3 - 36755 reads in 15964 unique sequences.
## Sample 4 - 35708 reads in 18186 unique sequences.
## Sample 5 - 29127 reads in 14009 unique sequences.
## Sample 6 - 69565 reads in 27969 unique sequences.
## Sample 7 - 59417 reads in 25277 unique sequences.
## Sample 8 - 50774 reads in 21486 unique sequences.
## Sample 9 - 44704 reads in 19851 unique sequences.
## Sample 10 - 66746 reads in 23051 unique sequences.
## Sample 11 - 81356 reads in 27130 unique sequences.
## Sample 12 - 51148 reads in 22558 unique sequences.
## Sample 13 - 48003 reads in 17339 unique sequences.
## Sample 14 - 40909 reads in 14537 unique sequences.
## Sample 15 - 52837 reads in 22357 unique sequences.
## Sample 16 - 51041 reads in 22122 unique sequences.
## Sample 17 - 62604 reads in 26058 unique sequences.
## Sample 18 - 67014 reads in 24532 unique sequences.
## Sample 19 - 69005 reads in 24117 unique sequences.
## Sample 20 - 51735 reads in 23187 unique sequences.
## Sample 21 - 54747 reads in 24990 unique sequences.
## Sample 22 - 44391 reads in 20221 unique sequences.
## Sample 23 - 56832 reads in 23111 unique sequences.
## Sample 24 - 62229 reads in 20706 unique sequences.
## Sample 25 - 70287 reads in 26908 unique sequences.
## Sample 26 - 53937 reads in 24401 unique sequences.
## Sample 27 - 62081 reads in 28058 unique sequences.
## Sample 28 - 45401 reads in 21286 unique sequences.
## Sample 29 - 61912 reads in 23816 unique sequences.
## Sample 30 - 57233 reads in 25632 unique sequences.
## Sample 31 - 50664 reads in 22048 unique sequences.
## Sample 32 - 42258 reads in 18419 unique sequences.
## Sample 33 - 49396 reads in 21501 unique sequences.
## Sample 34 - 56069 reads in 24288 unique sequences.
## Sample 35 - 32040 reads in 13205 unique sequences.
## Sample 36 - 75071 reads in 32508 unique sequences.
## Sample 37 - 40024 reads in 18248 unique sequences.
## Sample 38 - 41769 reads in 18235 unique sequences.
## Sample 39 - 86273 reads in 28486 unique sequences.
## Sample 40 - 80224 reads in 27696 unique sequences.
## Sample 41 - 58905 reads in 24843 unique sequences.
## Sample 42 - 19075 reads in 7206 unique sequences.
## Sample 43 - 39151 reads in 16445 unique sequences.
## Sample 44 - 68547 reads in 14475 unique sequences.
## Sample 45 - 70831 reads in 17593 unique sequences.
## Sample 46 - 61235 reads in 29097 unique sequences.
## Sample 47 - 48407 reads in 20651 unique sequences.
## Sample 48 - 77504 reads in 27370 unique sequences.
## Sample 49 - 55952 reads in 26206 unique sequences.
## Sample 50 - 51753 reads in 23820 unique sequences.
## Sample 51 - 37618 reads in 15617 unique sequences.
## Sample 52 - 35754 reads in 12769 unique sequences.
## Sample 53 - 6319 reads in 3231 unique sequences.
## Sample 54 - 43342 reads in 17872 unique sequences.
## Sample 55 - 516 reads in 304 unique sequences.
## Sample 56 - 423 reads in 316 unique sequences.

Sample Inference - reverse reads

dadaRs <- dada(filtRs, err=errR, multithread=TRUE)

## Sample 1 - 29937 reads in 11883 unique sequences.
## Sample 2 - 38017 reads in 14384 unique sequences.
## Sample 3 - 36755 reads in 13830 unique sequences.
## Sample 4 - 35708 reads in 15247 unique sequences.
## Sample 5 - 29127 reads in 10809 unique sequences.
## Sample 6 - 69565 reads in 24330 unique sequences.
## Sample 7 - 59417 reads in 22738 unique sequences.
## Sample 8 - 50774 reads in 19692 unique sequences.
## Sample 9 - 44704 reads in 18430 unique sequences.
## Sample 10 - 66746 reads in 24375 unique sequences.
## Sample 11 - 81356 reads in 25942 unique sequences.
## Sample 12 - 51148 reads in 20821 unique sequences.
## Sample 13 - 48003 reads in 16557 unique sequences.
## Sample 14 - 40909 reads in 14041 unique sequences.
## Sample 15 - 52837 reads in 20502 unique sequences.
## Sample 16 - 51041 reads in 20997 unique sequences.
## Sample 17 - 62604 reads in 23051 unique sequences.
## Sample 18 - 67014 reads in 24377 unique sequences.
## Sample 19 - 69005 reads in 23464 unique sequences.
## Sample 20 - 51735 reads in 21324 unique sequences.
## Sample 21 - 54747 reads in 23637 unique sequences.
## Sample 22 - 44391 reads in 19928 unique sequences.
## Sample 23 - 56832 reads in 21801 unique sequences.
## Sample 24 - 62229 reads in 21432 unique sequences.
## Sample 25 - 70287 reads in 23562 unique sequences.
## Sample 26 - 53937 reads in 23490 unique sequences.
## Sample 27 - 62081 reads in 23931 unique sequences.
## Sample 28 - 45401 reads in 20383 unique sequences.
## Sample 29 - 61912 reads in 21320 unique sequences.
## Sample 30 - 57233 reads in 23470 unique sequences.
## Sample 31 - 50664 reads in 20952 unique sequences.
## Sample 32 - 42258 reads in 17966 unique sequences.
## Sample 33 - 49396 reads in 20194 unique sequences.
## Sample 34 - 56069 reads in 21985 unique sequences.
## Sample 35 - 32040 reads in 12233 unique sequences.
## Sample 36 - 75071 reads in 27352 unique sequences.
## Sample 37 - 40024 reads in 16458 unique sequences.
## Sample 38 - 41769 reads in 15457 unique sequences.
## Sample 39 - 86273 reads in 24559 unique sequences.
## Sample 40 - 80224 reads in 24618 unique sequences.
## Sample 41 - 58905 reads in 22368 unique sequences.
## Sample 42 - 19075 reads in 5834 unique sequences.
## Sample 43 - 39151 reads in 13078 unique sequences.
## Sample 44 - 68547 reads in 13323 unique sequences.
## Sample 45 - 70831 reads in 16377 unique sequences.
## Sample 46 - 61235 reads in 25549 unique sequences.
## Sample 47 - 48407 reads in 18561 unique sequences.
## Sample 48 - 77504 reads in 23097 unique sequences.
## Sample 49 - 55952 reads in 23124 unique sequences.
## Sample 50 - 51753 reads in 21012 unique sequences.
## Sample 51 - 37618 reads in 15121 unique sequences.
## Sample 52 - 35754 reads in 11388 unique sequences.
## Sample 53 - 6319 reads in 2560 unique sequences.
## Sample 54 - 43342 reads in 16248 unique sequences.
## Sample 55 - 516 reads in 351 unique sequences.
## Sample 56 - 423 reads in 383 unique sequences.

Inspect algorithm output

Here I will inspect the output of the algorithm for several samples. The output will show the number of unique sequences in each specified sample and the number of sequence variants that were inferred from those unique sequences.

DADA2 tutorial: “There is much more to the dada-class return object than this (see help(”dada-class“) for some info), including multiple diagnostics about the quality of each denoised sequence variant, but that is beyond the scope of an introductory tutorial.”

#Information about the output of the first sample
cat("Information about the output of the first sample:", "\n", "\n")

## Information about the output of the first sample: 
##

dadaFs[[1]]

## dada-class: object describing DADA2 denoising results
## 143 sequence variants were inferred from 12610 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

The DADA2 algorithm inferred 143 true sequence variants from the 12610 unique sequences in the first sample.

DADA2 tutorial: “There is much more to the dada-class return object than this (see help(”dada-class“) for some info), including multiple diagnostics about the quality of each denoised sequence variant, but that is beyond the scope of an introductory tutorial.”

DADA2 tutorial: “Extensions: By default, the dada function processes each sample independently. However, pooling information across samples can increase sensitivity to sequence variants that may be present at very low frequencies in multiple samples. The dada2 package offers two types of pooling. dada(…, pool=TRUE) performs standard pooled processing, in which all samples are pooled together for sample inference. dada(…, pool=”pseudo“) performs pseudo-pooling, in which samples are processed independently after sharing information between samples, approximating pooled sample inference in linear time.”

NOTE: Generally, the Apprill Lab does not run pool = TRUE or pool = pseudo

Merging paired reads

DADA2 tutorial: “We now merge the forward and reverse reads together to obtain the full denoised sequences. Merging is performed by aligning the denoised forward reads with the reverse-complement of the corresponding denoised reverse reads, and then constructing the merged “contig” sequences. By default, merged sequences are only output if the forward and reverse reads overlap by at least 12 bases, and are identical to each other in the overlap region (but these conditions can be changed via function arguments)."

mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)

## 28788 paired-reads (in 126 unique pairings) successfully merged out of 29616 (in 194 pairings) input.

## 37208 paired-reads (in 95 unique pairings) successfully merged out of 37706 (in 134 pairings) input.

## 36113 paired-reads (in 166 unique pairings) successfully merged out of 36466 (in 216 pairings) input.

## 34352 paired-reads (in 242 unique pairings) successfully merged out of 35255 (in 374 pairings) input.

## 28313 paired-reads (in 74 unique pairings) successfully merged out of 28790 (in 119 pairings) input.

## 66693 paired-reads (in 231 unique pairings) successfully merged out of 68647 (in 418 pairings) input.

## 57961 paired-reads (in 268 unique pairings) successfully merged out of 58901 (in 437 pairings) input.

## 49082 paired-reads (in 268 unique pairings) successfully merged out of 50049 (in 458 pairings) input.

## 43244 paired-reads (in 227 unique pairings) successfully merged out of 44051 (in 401 pairings) input.

## 64351 paired-reads (in 228 unique pairings) successfully merged out of 65787 (in 490 pairings) input.

## 78981 paired-reads (in 233 unique pairings) successfully merged out of 80302 (in 450 pairings) input.

## 48545 paired-reads (in 196 unique pairings) successfully merged out of 50241 (in 385 pairings) input.

## 46490 paired-reads (in 114 unique pairings) successfully merged out of 47498 (in 237 pairings) input.

## 39568 paired-reads (in 98 unique pairings) successfully merged out of 40416 (in 226 pairings) input.

## 51433 paired-reads (in 183 unique pairings) successfully merged out of 52356 (in 316 pairings) input.

## 49679 paired-reads (in 207 unique pairings) successfully merged out of 50485 (in 401 pairings) input.

## 59536 paired-reads (in 280 unique pairings) successfully merged out of 61662 (in 483 pairings) input.

## 65499 paired-reads (in 211 unique pairings) successfully merged out of 66256 (in 462 pairings) input.

## 67001 paired-reads (in 218 unique pairings) successfully merged out of 68062 (in 475 pairings) input.

## 49543 paired-reads (in 214 unique pairings) successfully merged out of 51028 (in 403 pairings) input.

## 52341 paired-reads (in 204 unique pairings) successfully merged out of 53809 (in 411 pairings) input.

## 42353 paired-reads (in 180 unique pairings) successfully merged out of 43733 (in 367 pairings) input.

## 54835 paired-reads (in 241 unique pairings) successfully merged out of 55953 (in 447 pairings) input.

## 60176 paired-reads (in 244 unique pairings) successfully merged out of 61384 (in 497 pairings) input.

## 69059 paired-reads (in 125 unique pairings) successfully merged out of 69786 (in 241 pairings) input.

## 51588 paired-reads (in 237 unique pairings) successfully merged out of 53181 (in 447 pairings) input.

## 59041 paired-reads (in 343 unique pairings) successfully merged out of 61108 (in 610 pairings) input.

## 43424 paired-reads (in 169 unique pairings) successfully merged out of 44635 (in 341 pairings) input.

## 61061 paired-reads (in 92 unique pairings) successfully merged out of 61546 (in 160 pairings) input.

## 55369 paired-reads (in 235 unique pairings) successfully merged out of 56588 (in 375 pairings) input.

## 48169 paired-reads (in 160 unique pairings) successfully merged out of 50002 (in 301 pairings) input.

## 40694 paired-reads (in 136 unique pairings) successfully merged out of 41678 (in 258 pairings) input.

## 47571 paired-reads (in 163 unique pairings) successfully merged out of 48727 (in 308 pairings) input.

## 54201 paired-reads (in 210 unique pairings) successfully merged out of 55487 (in 369 pairings) input.

## 31653 paired-reads (in 95 unique pairings) successfully merged out of 31799 (in 145 pairings) input.

## 72653 paired-reads (in 314 unique pairings) successfully merged out of 74092 (in 574 pairings) input.

## 38600 paired-reads (in 241 unique pairings) successfully merged out of 39461 (in 343 pairings) input.

## 40464 paired-reads (in 167 unique pairings) successfully merged out of 41312 (in 281 pairings) input.

## 84295 paired-reads (in 295 unique pairings) successfully merged out of 85321 (in 527 pairings) input.

## 78390 paired-reads (in 266 unique pairings) successfully merged out of 79345 (in 518 pairings) input.

## 57166 paired-reads (in 196 unique pairings) successfully merged out of 58435 (in 348 pairings) input.

## 18691 paired-reads (in 37 unique pairings) successfully merged out of 18914 (in 65 pairings) input.

## 37736 paired-reads (in 78 unique pairings) successfully merged out of 38661 (in 111 pairings) input.

## 67646 paired-reads (in 162 unique pairings) successfully merged out of 68052 (in 244 pairings) input.

## 69575 paired-reads (in 215 unique pairings) successfully merged out of 70244 (in 341 pairings) input.

## 58918 paired-reads (in 469 unique pairings) successfully merged out of 60280 (in 691 pairings) input.

## 46755 paired-reads (in 214 unique pairings) successfully merged out of 47832 (in 363 pairings) input.

## 75876 paired-reads (in 232 unique pairings) successfully merged out of 76878 (in 394 pairings) input.

## 53178 paired-reads (in 343 unique pairings) successfully merged out of 55024 (in 610 pairings) input.

## 49565 paired-reads (in 431 unique pairings) successfully merged out of 50862 (in 666 pairings) input.

## 36360 paired-reads (in 136 unique pairings) successfully merged out of 37209 (in 224 pairings) input.

## 35351 paired-reads (in 54 unique pairings) successfully merged out of 35495 (in 90 pairings) input.

## 6018 paired-reads (in 22 unique pairings) successfully merged out of 6148 (in 40 pairings) input.

## 42648 paired-reads (in 34 unique pairings) successfully merged out of 42898 (in 111 pairings) input.

## 393 paired-reads (in 8 unique pairings) successfully merged out of 400 (in 12 pairings) input.

## 191 paired-reads (in 6 unique pairings) successfully merged out of 227 (in 14 pairings) input.

# Inspect the merger data.frame from the 30th sample
head(mergers[[30]])

##                                                                                                                                                                                                                                                         sequence
## 1 TACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTCGTCTGTGAAATTCCGGGGCTTAACTCCGGGCGTGCAGGCGATACGGGCATAACTTGAGTGCTGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCATGGGTAGCGAACAGG
## 2  TACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCGGATTAATAAGTTAGAGGTGAAATCCCAGGGCTCAACCCTGGAACTGCCTTTAAAACTGTTAGTCTTGAGATCGAGAGAGGTGAGTGGAATTCCAAGTGTAGAGGTGAAATTCGTAGATATTTGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTACGAAAGTGTGGGGAGCAAACAGG
## 3  TACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGAACGTAGGTGGTTACCTAAGTCAGATGTGAAATCCCTAGGCTCAACCTAGGAACTGCATCTGATACTGGGTGACTAGAGTAGGTGAGAGGAAAGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCTTTCTGGCATCATACTGACACTGAGGTTCGAAAGCGTGGGTAGCAAACAGG
## 4  TACGAAGGGACCTAGCGTAGTTCGGAATTACTGGGCTTAAAGAGTTCGTAGGTGGTTGAAAAAGTTGGTGGTGAAATCCCAGAGCTTAACTCTGGAACTGCCATCAAAACTTTTCAGCTAGAGTATGATAGAGGAAAGCAGAATTTCTAGTGTAGAGGTGAAATTCGTAGATATTAGAAAGAATACCAATTGCGAAGGCAGCTTTCTGGATCATTACTGACACTGAGGAACGAAAGCATGGGTAGCGAAGAGG
## 5  TACGGAGGGTGCAAGCGTTATCCGGAATCATTGGGTTTAAAGGGTCCGCAGGCGGTCGATTAAGTCAGGGGTGAAATCTCACAGCTTAACTGTGAACCTGCCTTTGATACTGGTTGACTTGAGTTATATGGAAGTAGATAGAATAAGTAGTGTAGCGGTGAAATGCATAGATATTACTTAGAATACCGATTGCGAAGGCAGTCTACTACGTATATACTGACGCTCATGGACGAAAGCGTGGGGAGCGAACAGG
## 6 TACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTTGCGTCACTTGTGAAATTCCGGGGCTTAACTCCGGGTGTGCAGGTGATACGGGCATAACTTGAGTGCTGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCAGTTACTGACGCTGAGGAGCGAAAGCATGGGGAGCGAACAGG
##   abundance forward reverse nmatch nmismatch nindel prefer accept
## 1      3413       1       1    186         0      0      1   TRUE
## 2      2259       2       2    187         0      0      2   TRUE
## 3      2225       3       3    187         0      0      2   TRUE
## 4      1539       4       4    187         0      0      2   TRUE
## 5      1354       5       5    187         0      0      2   TRUE
## 6      1129       8      14    186         0      0      1   TRUE

DADA2 tutorial: “The mergers object is a list of data.frames from each sample. Each data.frame contains the merged $sequence, its $abundance, and the indices of the $forward and $reverse sequence variants that were merged. Paired reads that did not exactly overlap were removed by mergePairs, further reducing spurious output.”

DADA2 tutorial: “Considerations for your own data: Most of your reads should successfully merge. If that is not the case upstream parameters may need to be revisited: Did you trim away the overlap between your reads?”

Make the ASV sequence count table

DADA2 tutorial: “We can now construct an amplicon sequence variant table (ASV) table, a higher-resolution version of the OTU table produced by traditional methods.”

seqtab <- makeSequenceTable(mergers)
#dim(seqtab)
cat("This table contains", dim(seqtab)[2], "ASVs and the lengths of our merged sequences for the", dim(seqtab)[1], "samples in the dataset  \n")

## This table contains 3393 ASVs and the lengths of our merged sequences for the 56 samples in the dataset

Inspect distribution of sequence lengths

Here we can check if the length of the sequences is what we expect.

# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab)))

## 
##  240  244  246  249  250  251  252  253  254  255  256  257  258  259  260  265 
##   27    1    2    1    1   13  230 2951  134   10    5    4    3    1    1    1 
##  274  286  308  316  321  350  392 
##    1    1    1    1    1    1    2

DADA2 tutorial: “The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants.”

cat("Not all lengths of our merged sequences for the", dim(seqtab)[2], "ASVs fall within the expected range for this V4 amplicon. \n")

## Not all lengths of our merged sequences for the 3393 ASVs fall within the expected range for this V4 amplicon.

DADA2 tutorial: “Considerations for your own data: Sequences that are much longer or shorter than expected may be the result of non-specific priming. You can remove non-target-length sequences from your sequence table (eg. seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% 250:256]). This is analogous to “cutting a band” in-silico to get amplicons of the targeted length."

Because not all of the ASV lengths fall within the expected range for the V4 amplicon, we will trim to the length we want:

Re-make the ASV table while removing the non target length sequences

seqtab2<- seqtab[,nchar(colnames(seqtab)) %in% 250:256]
dim(seqtab2)

## [1]   56 3344

table(nchar(getSequences(seqtab2)))# Inspect distribution of sequence lengths

## 
##  250  251  252  253  254  255  256 
##    1   13  230 2951  134   10    5

IMPORTANT NOTE REGARDING MULTIPLE SEQUENCING RUNS: If you have multiple sequencing libraries that you would like to run through DADA2 to use in the same downstream analysis, you will run each sequencing run through the DADA2 pipeline separately until this point (just before the chimera checking step). Once you have each run through the pipeline until this point, you will merge sequence tables with the command: mergetab <- mergeSequenceTables(seqtab2_run1, seqtab2_run2) and then check for chimeras and proceed with the rest of the pipeline

Remove Chimeras

DADA2 tutorial: “The core dada method corrects substitution and indel errors, but chimeras remain. Fortunately, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences."

seqtab2.nochim <- removeBimeraDenovo(seqtab2, method="consensus", multithread=TRUE, verbose=TRUE)

## Identified 149 bimeras out of 3344 input sequences.

# dim(seqtab2.nochim)

cat(round((sum(seqtab2.nochim)/sum(seqtab2))*100, digits=2),"% of the ASVs were not chimeric. \n")

## 99.38 % of the ASVs were not chimeric.

a <- (1-(sum(seqtab2.nochim)/sum(seqtab2)))*100
a <- round(a, digits = 2)
cat(a, "% of the ASVs were chimeric and removed. \n")

## 0.62 % of the ASVs were chimeric and removed.

cat("There are now", dim(seqtab2.nochim)[2], "ASVs with sequence lengths of 250-256 base pairs remaining after chimeras were removed")

## There are now 3195 ASVs with sequence lengths of 250-256 base pairs remaining after chimeras were removed

rm(a)

DADA2 tutorial: “The frequency of chimeric sequences varies substantially from dataset to dataset, and depends on on factors including experimental procedures and sample complexity.”

Here you see what percentage of the merged sequence variants were chimeric.

DADA2 tutorial: “Considerations for your own data: Most of your reads should remain after chimera removal (it is not uncommon for a majority of sequence variants to be removed though). If most of your reads were removed as chimeric, upstream processing may need to be revisited. In almost all cases this is caused by primer sequences with ambiguous nucleotides that were not removed prior to beginning the DADA2 pipeline.”

Track reads through the pipeline

DADA2 tutorial: “As a final check of our progress, we’ll look at the number of reads that made it through each step in the pipeline:”

Thus, we will track reads through the pipeline so we can make sure that everything went as expected. A red flag would be if one particular step in the pipeline removed more reads than one would expect

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab2.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
#head(track)
track

##                   input filtered denoisedF denoisedR merged nonchim
## CCB_2019_1        32890    29937     29696     29740  28788   28729
## CCB_2019_2        45172    38017     37785     37865  37208   35768
## CCB_2019_3        38901    36755     36598     36555  36113   35947
## CCB_2019_4        40037    35708     35407     35413  34352   33669
## CCB_2019_5        35530    29127     28961     28854  28313   28141
## CCB_2022_1        75900    69565     68907     69013  66693   65654
## CCB_2022_10       63288    59417     59029     59176  57961   56726
## CCB_2022_11       53171    50774     50262     50251  49082   47323
## CCB_2022_12       47290    44704     44192     44288  43244   42347
## CCB_2022_13       70220    66746     66054     66105  64351   62130
## CCB_2022_14       84342    81356     80609     80680  78981   75742
## CCB_2022_15       56500    51148     50462     50614  48545   48473
## CCB_2022_16       51796    48003     47609     47659  46490   46040
## CCB_2022_17       44088    40909     40573     40530  39568   39007
## CCB_2022_18       57927    52837     52470     52557  51433   50802
## CCB_2022_19       54083    51041     50639     50681  49679   49034
## CCB_2022_2        68442    62604     61937     62043  59536   58769
## CCB_2022_20       70641    67014     66491     66509  65499   65290
## CCB_2022_21       72506    69005     68353     68386  67001   66851
## CCB_2022_22       56156    51735     51168     51404  49543   48968
## CCB_2022_23       59604    54747     54062     54190  52341   51955
## CCB_2022_24       49110    44391     43918     43969  42353   42162
## CCB_2022_25       59803    56832     56278     56251  54835   54248
## CCB_2022_26       64911    62229     61627     61678  60176   59715
## CCB_2022_28       76814    70287     69985     69974  69059   68401
## CCB_2022_29       59206    53937     53431     53467  51588   51497
## CCB_2022_3        66206    62081     61334     61596  59041   58175
## CCB_2022_30       49946    45401     44874     44883  43424   43385
## CCB_2022_31       69596    61912     61657     61676  61061   60960
## CCB_2022_32       61732    57233     56763     56862  55369   54124
## CCB_2022_33       55396    50664     50178     50254  48169   48026
## CCB_2022_34       46146    42258     41833     41865  40694   40610
## CCB_2022_35       52883    49396     48940     48959  47571   47405
## CCB_2022_38       59770    56069     55782     55611  54201   53595
## CCB_2022_39       34410    32040     31871     31895  31653   31016
## CCB_2022_4        79567    75071     74335     74582  72653   71233
## CCB_2022_40       42126    40024     39627     39755  38600   38424
## CCB_2022_41       44416    41769     41505     41488  40464   39726
## CCB_2022_42       89727    86273     85573     85696  84295   83371
## CCB_2022_43       83425    80224     79576     79645  78390   77853
## CCB_2022_44       62971    58905     58609     58581  57166   56658
## CCB_2022_45       21261    19075     18955     18946  18691   18691
## CCB_2022_46       45270    39151     38824     38881  37736   37185
## CCB_2022_47       70879    68547     68159     68248  67646   67369
## CCB_2022_48       73419    70831     70442     70388  69575   69055
## CCB_2022_5        65706    61235     60561     60735  58918   57913
## CCB_2022_6        50579    48407     47989     48065  46755   45758
## CCB_2022_7        80768    77504     77143     77077  75876   74988
## CCB_2022_8        60839    55952     55286     55438  53178   52652
## CCB_2022_9        56304    51753     51160     51207  49565   48795
## DNA_Blank_1       40182    37618     37358     37390  36360   36180
## DNA_Blank_2       39087    35754     35618     35543  35351   33651
## DNA_Blank_3        8118     6319      6212      6168   6018    6018
## MockEven_CCB_2022 49580    43342     43033     43075  42648   42479
## PCR_Ctrl_1          915      516       438       404    393     393
## PCR_Ctrl_2          686      423       305       238    191     191

DADA2 tutorial: “Considerations for your own data: This is a great place to do a last sanity check. Outside of filtering, there should no step in which a majority of reads are lost. If a majority of reads failed to merge, you may need to revisit the truncLen parameter used in the filtering step and make sure that the truncated reads span your amplicon. If a majority of reads were removed as chimeric, you may need to revisit the removal of primers, as the ambiguous nucleotides in unremoved primers interfere with chimera identification.”

Assign taxonomy to the ASVs for the merged datasets

DADA2 tutorial: “It is common at this point, especially in 16S/18S/ITS amplicon sequencing, to assign taxonomy to the sequence variants. The DADA2 package provides a native implementation of the naive Bayesian classifier method for this purpose. The assignTaxonomy function takes as input a set of sequences to be classified and a training set of reference sequences with known taxonomy, and outputs taxonomic assignments with at least minBoot bootstrap confidence.”

“We maintain formatted training fastas for the RDP training set, GreenGenes clustered at 97% identity, and the Silva reference database, and additional trainings fastas suitable for protists and certain specific environments have been contributed.”

Extensions: “The dada2 package also implements a method to make species level assignments based on exact matching between ASVs and sequenced reference strains. Recent analysis suggests that exact matching (or 100% identity) is the only appropriate way to assign species to 16S gene fragments. Currently, species-assignment training fastas are available for the Silva and RDP 16S databases. To follow the optional species addition step, download the silva_species_assignment_v132.fa.gz file, and place it in the directory with the fastq files.”

The Apprill Lab uses The Silva reference database to assign taxonomy. Download the most recent Silva database formatted for use with DADA2 and the file formatted for the species assignment command from https://zenodo.org/record/4587955#.Ya-b3lNOlTZ

It is important this is cited in the methods - see the instructions in the aforementioned link

This document uses the Silva 138.1 reference database.

Here is what the intro on the website says regarding the Silva 138.1 prokaryotic SSU taxonomic training data formatted for DADA2

McLaren, Michael R.; Callahan, Benjamin J.

These training fasta files are derived from the Silva Project’s version 138.1 release and formatted for use with DADA2. These files are intended for use in classifying prokaryotic 16S sequencing data and are not appropriate for classifying eukaryotic ASVs.

See https://benjjneb.github.io/dada2/training.html for information about DADA2 reference databases and https://www.arb-silva.de/documentation/release-138.1/ for database and citation information for Silva 138.1. The Silva 138.1 database is licensed under Creative Commons Attribution 4.0 (CC-BY 4.0); see file “SILVA_LICENSE.txt”. These fasta database files were generated and checked for consistency using the R markdown documents in the silva-138.1 folder in https://zenodo.org/record/4587946.

If you use these files, please cite one or both of the Silva references below (or at the above link) and the DADA2 paper (reference below). I also recommend citing or linking to the Zenodo record for this specific version in your Methods or published source code to record the specific taxonomic database files used in your analysis.

NOTE: These database files have a known problem in 3/895 families and 59/3936 genera. See https://github.com/mikemc/dada2-reference-databases/blob/main/silva-138.1/v1/bad-taxa.csv for a list of affected taxa and https://github.com/benjjneb/dada2/issues/1293 for more information."

NOTE: You will need to set your own path in the following code block to wherever you downloaded the databases

taxa <- assignTaxonomy(seqtab2.nochim, "~/Documents/dada2/silva_nr99_v138.1_train_set.fa.gz", multithread=TRUE)
taxa <- addSpecies(taxa, "~/Documents/dada2/silva_species_assignment_v138.1.fa.gz")

Inspect the taxonomic assignments:

taxa.print <- taxa # Removing sequence rownames for display only
rownames(taxa.print) <- NULL
head(taxa.print)

##      Kingdom    Phylum             Class                 Order              
## [1,] "Bacteria" "Cyanobacteria"    "Cyanobacteriia"      "Chloroplast"      
## [2,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "SAR11 clade"      
## [3,] "Bacteria" "Actinobacteriota" "Actinobacteria"      "Corynebacteriales"
## [4,] "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Pseudomonadales"  
## [5,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Rhodobacterales"  
## [6,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "SAR11 clade"      
##      Family               Genus             Species
## [1,] NA                   NA                NA     
## [2,] "Clade I"            "Clade Ia"        NA     
## [3,] "Corynebacteriaceae" "Corynebacterium" NA     
## [4,] "Moraxellaceae"      "Moraxella"       NA     
## [5,] "Rhodobacteraceae"   "Amylibacter"     "ulvae"
## [6,] "Clade I"            "Clade Ia"        NA

DADA2 tutorial: “Considerations for your own data: If your reads do not seem to be appropriately assigned, for example lots of your bacterial 16S sequences are being assigned as Eukaryota NA NA NA NA NA, your reads may be in the opposite orientation as the reference database. Tell dada2 to try the reverse-complement orientation with assignTaxonomy(…, tryRC=TRUE) and see if this fixes the assignments.”

Evaluate accuracy of the mock community

There are 20 strains in the mock that we use (HM-782D, BEI Resources, https://www.beiresources.org/Catalog/otherProducts/HM-782D.aspx), so we would expect to see 20 ASVs in the mock community

DADA2 tutorial: “One of the samples included here was a “mock community”, in which a mixture of 20 known strains was sequenced (this mock community is supposed to be 21 strains, but P. acnes is absent from the raw data). Reference sequences corresponding to these strains were provided in the downloaded zip archive. We return to that sample and compare the sequence variants inferred by DADA2 to the expected composition of the community.

NOTE: To evaluate the mock community, you will need to download the reference file https://drive.google.com/file/d/1Ryrhk92v6x8WYRnUAo6YHmJxVlwPCxG_/view?usp=sharing

NOTE: Here the sample name given to the mock is “MockEven_CCB_2022”. You will need to change to the sample name of the mock in your dataset in the following code block. You will also need to adjust the path to wherever you downloaded the file

unqs.mock <- seqtab2.nochim["MockEven_CCB_2022",]
unqs.mock <- sort(unqs.mock[unqs.mock>0], decreasing=TRUE) # Drop ASVs absent in the Mock
cat("DADA2 inferred", length(unqs.mock), "sample sequences present in the Mock community.\n")

## DADA2 inferred 29 sample sequences present in the Mock community.

path2Mock <- "~/Documents/dada2/"
mock.ref <- getSequences(file.path(path2Mock, "HMP_MOCK.v35.fasta"))
match.ref <- sum(sapply(names(unqs.mock), function(x) any(grepl(x, mock.ref))))
cat("Of those,", sum(match.ref), "were exact matches to the expected reference sequences.\n")

## Of those, 21 were exact matches to the expected reference sequences.

uniqs.mock.df <- data.frame(unqs.mock)

Typically with our methods, DADA2 infers between 20-30 sample sequences. An exact match of 19-22 to the reference database is typical and acceptable.

End of the DADA2 portion of the pipeline

Further cleanup of dataset

The next steps of this pipeline will include a little more cleanup before we move on to removing contaminants with decontam. Phyloseq will be used to help with the cleanup.

We will:

Prepare to remove unwanted taxa
Create phyloseq objects
Remove unwanted taxa, such as Eukaroyta, Chloroplasts, and Mitochondria
Remove mock community
Remove samples with less than a certain number of sequences

Preparing for removing unwanted taxa

When assigning taxonomy, taxonomy may only be matched to a certain level. When the code for removing unwanted taxa runs, it will remove anything with NAs at the level at which you are removing. Thus, NAs in the taxonomy matrix need to be replaced with either the lowest level to which a sequence matched or with the character string “unknown”. Here is how to do both ways. The method showing replacing with the lowest level to which a sample matched is commented out and the example replacing NAs with the word “unknown” is run below.

# Create a dataframe of the taxonomy table
  taxa_df <- data.frame(taxa)

# METHOD 1. Fill the NAs with the lowest level taxonomy assigned (otherwise when NAs might be removed when removing unwanted taxa)

    # taxa_df$Kingdom <- as.character(taxa_df$Kingdom)
    # king_na <- which(is.na(taxa_df$Kingdom))
    # taxa_df[king_na, "Kingdom"] <- 'Unknown'
    # 
    # taxa_df$Phylum <- as.character(taxa_df$Phylum)
    # phy_na <- which(is.na(taxa_df$Phylum))
    # taxa_df[phy_na, "Phylum"] <- taxa_df$Kingdom[phy_na]
    # 
    # taxa_df$Class <- as.character(taxa_df$Class)
    # cl_na <- which(is.na(taxa_df$Class))
    # taxa_df[cl_na, "Class"] <- taxa_df$Phylum[cl_na]
    # 
    # taxa_df$Order <- as.character(taxa_df$Order)
    # ord_na <- which(is.na(taxa_df$Order))
    # taxa_df[ord_na, "Order"] <- taxa_df$Class[ord_na]
    # 
    # taxa_df$Family <- as.character(taxa_df$Family)
    # fam_na <- which(is.na(taxa_df$Family))
    # taxa_df[fam_na, "Family"] <- taxa_df$Order[fam_na]
    # 
    # taxa_df$Genus <- as.character(taxa_df$Genus)
    # gen_na <- which(is.na(taxa_df$Genus))
    # taxa_df[gen_na, "Genus"] <- taxa_df$Family[gen_na]


# METHOD 2. Fill all NAs with the word 'unknown' (otherwise when NAs might be removed when removing unwanted taxa) 

        taxa_df$Kingdom <- as.character(taxa_df$Kingdom)
        king_na <- which(is.na(taxa_df$Kingdom))
        taxa_df[king_na, "Kingdom"] <- 'unknown'
        
        taxa_df$Phylum <- as.character(taxa_df$Phylum)
        phy_na <- which(is.na(taxa_df$Phylum))
        taxa_df[phy_na, "Phylum"] <- 'unknown'
        
        taxa_df$Class <- as.character(taxa_df$Class)
        cl_na <- which(is.na(taxa_df$Class))
        taxa_df[cl_na, "Class"] <- 'unknown'
        
        taxa_df$Order <- as.character(taxa_df$Order)
        ord_na <- which(is.na(taxa_df$Order))
        taxa_df[ord_na, "Order"] <- 'unknown'
        
        taxa_df$Family <- as.character(taxa_df$Family)
        fam_na <- which(is.na(taxa_df$Family))
        taxa_df[fam_na, "Family"] <- 'unknown'
        
        taxa_df$Genus <- as.character(taxa_df$Genus)
        gen_na <- which(is.na(taxa_df$Genus))
        taxa_df[gen_na, "Genus"] <- 'unknown'

        taxa_df$Species <- as.character(taxa_df$Species)
        spe_na <- which(is.na(taxa_df$Species))
        taxa_df[spe_na, "Species"] <- 'unknown'

# Convert the taxonomy table back to matrix form so it can be used by phyloseq
    taxa <- as.matrix(taxa_df)

Create Phyloseq objects

# You may want to conduct a preliminary exploration of your data to see how the controls fall out, make a temporary metadata file. Here I create one with just the sample names. 

    metadata <- rownames(seqtab2.nochim)
    metadata <- data.frame(metadata)
    rownames(metadata) <- metadata[,1] # assign values from all rows of the first column as the dataframe rownames 

  

# Now I will create a phyloseq object    
    
    OTU <- otu_table(seqtab2.nochim, taxa_are_rows = FALSE)
    
    TAX <- tax_table(taxa)
    
    META = sample_data(metadata)

    ###CREATE PHYLOSEQ OBJECT and CHECK THAT IT IS CORRECT
    ps <- phyloseq(OTU, TAX, META)
    ps

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 3195 taxa and 56 samples ]
## sample_data() Sample Data:       [ 56 samples by 1 sample variables ]
## tax_table()   Taxonomy Table:    [ 3195 taxa by 7 taxonomic ranks ]

    #remove variables that aren't needed
    rm(OTU)
    rm(TAX)
    rm(META)

Remove unwanted taxa, such as Eukaryota, chloroplasts and mitochondria

pre <- ntaxa(ps)
cat("The total number of taxa in the dataset is", ntaxa(ps), "\n")

## The total number of taxa in the dataset is 3195

cat("There are: ", nrow(subset(taxa_df, Kingdom =="Eukaryota")), "Eukaryotes,", nrow(subset(taxa_df, Order =="Chloroplast")), "Chloroplasts,", "and", nrow(subset(taxa_df, Family =="Mitochondria")), "Mitochondria in the dataset \n")

## There are:  0 Eukaryotes, 157 Chloroplasts, and 63 Mitochondria in the dataset

#Checking to make sure that is the correct amount of taxa that should have been removed
  # If you have Eukaryotes, run the next line. If not, run the following line
  chck <- nrow(subset(taxa_df, Kingdom =="Eukaryota")) + nrow(subset(taxa_df, Order =="Chloroplast")) + nrow(subset(taxa_df, Family =="Mitochondria"))

#chck <- nrow(subset(taxa_df, Order =="Chloroplast")) + nrow(subset(taxa_df, Family =="Mitochondria"))

ps <- subset_taxa(ps, Kingdom!="Eukaroyta")  
ps <- subset_taxa(ps, Family!="Mitochondria")
ps <- subset_taxa(ps, Order!="Chloroplast")

cat("The total number of taxa in the dataset after removing unwanted taxa is", ntaxa(ps), "\nThus,", pre-ntaxa(ps), "were removed from the dataset.\n")

## The total number of taxa in the dataset after removing unwanted taxa is 2975 
## Thus, 220 were removed from the dataset.

cat("This should be equal to", chck)

## This should be equal to 220

rm(pre)
rm(chck)

Remove the mock community from the count table

NOTE: Here the sample name given to the mock is “MockEven_CCB_2022”. You will need to change to the sample name of the mock in your dataset in the following code block.

ps <- prune_samples(sample_names(ps) != "MockEven_CCB_2022", ps) # Remove mock samples

#testmergetab <- rownames(mergetab.nochim[,]) != "Mock"
#mergetab.nochim_test <- mergetab.nochim[testmergetab,]
#dim(mergetab.nochim_test)

ps

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 2975 taxa and 55 samples ]
## sample_data() Sample Data:       [ 55 samples by 1 sample variables ]
## tax_table()   Taxonomy Table:    [ 2975 taxa by 7 taxonomic ranks ]

cat("After removing the mock community sample, there are", nsamples(ps), "in the dataset.\n")

## After removing the mock community sample, there are 55 in the dataset.

remove ASVs not present in any samples

Sometimes certain ASVs are only detected in the samples that will be removed, such as the mock and PCR controls. These ASVs will be removed in this step

ps <- prune_taxa(taxa_sums(ps) > 0, ps)

cat("There are now", ntaxa(ps), "ASVs in the dataset.\n")

## There are now 2948 ASVs in the dataset.

summary of the dataset

cat("There are", nsamples(ps), "samples and", ntaxa(ps), "ASVs in the datasets.\n")

## There are 55 samples and 2948 ASVs in the datasets.

Make ASV sequence count, taxonomy and ASV sequence tables

ASV_CountTable <- data.frame(otu_table(ps))
ASV_Taxonomy <- data.frame(tax_table(ps))

idx <- match(rownames(ASV_Taxonomy), colnames(ASV_CountTable)) #check that both columns are in the same order

    if (any(is.na(idx)) == FALSE)
      {cat("All ASV names match between the row names of the taxonomy table and the column names of the count table", '\n', '\n')
      #looks like they were all aligned, but doesn't hurt just to make it very safe.
      ASV_CountTable <- ASV_CountTable[,idx] 
      } else
      {cat("WARNING! There are mismatches in ASV names between row names of the taxonomy table and the column names of the count table in positions", which(is.na(idx) == TRUE), '\n', '\n')
          }

## All ASV names match between the row names of the taxonomy table and the column names of the count table 
##

    #make a dataframe so you can reference the sequences in the future
    ASVseqs <- data.frame("ASV_ID" = paste0("ASV", seq(from = 1, to = ncol(ASV_CountTable), by = 1)), "ASV_Sequence" = rownames(ASV_Taxonomy))

    #rename ASV and taxa dataframe so they are easier to read
    colnames(ASV_CountTable) <- ASVseqs$ASV_ID
    rownames(ASV_Taxonomy) <- ASVseqs$ASV_ID


# Add the taxonomy information to the table with the ASV sequences
      
      ASV_Taxonomy <- tibble::rownames_to_column(ASV_Taxonomy, var = "ASV_ID")# First need to add row names to a variable (column) called 'ASV_ID' in the dataframe called ASV_Taxonomy
      ASVseqs <- left_join(ASVseqs, ASV_Taxonomy, by = "ASV_ID") #then join the tables

      rownames(ASV_Taxonomy) <- ASV_Taxonomy[,1] # assign values from all rows of the first column as the dataframe rownames 
      #ASV_Taxonomy[,1] <- NULL # delete the first column because now the values are rownames

      ASV_CountTable <- tibble::rownames_to_column(ASV_CountTable, var = "SampleID")# add row names to a variable (column) called 'Sample_ID' in the dataframe called ASV_CountTable

Save these tables and put them in a directory made for the project
Here I commented them out because I already have them saved

#write out these as .txt files to my local computer - done on [INSERT DATE]

# write.table(ASV_CountTable, file = "/Users/cmiller/Dropbox/Miller_Documents/Projects/RightWhale_BlowAndPhotogrammetry/DADA2_Analyses/ASV_CountTable_MiSeq_Library_L2_RightWhaleBlowProject.txt", sep = "\t", row.names = TRUE, col.names = TRUE)
# 
# write.table(ASV_Taxonomy, file = "/Users/cmiller/Dropbox/Miller_Documents/Projects/RightWhale_BlowAndPhotogrammetry/DADA2_Analyses/ASV_Taxonomy_MiSeq_Library_L2_RightWhaleBlowProject.txt", sep = "\t", row.names = TRUE, col.names = TRUE)
# 
# write.table(ASVseqs, file = "/Users/cmiller/Dropbox/Miller_Documents/Projects/RightWhale_BlowAndPhotogrammetry/ASV_Sequences_MiSeq_Library_L2_RightWhaleBlowProject.txt", sep = "\t", row.names = TRUE, col.names = TRUE)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

decontam

Identifying and removing contaminant sequences in the samples from the CDA project using the ‘decontam’ R package. Please Note: This is a different project than the one used as an example in the DADA2 pipeline above because it offers a better example.

Contaminating DNA sequences may have been introduced at during the process of collecting and processing samples. The control samples collected from the experiment (both biological and technical replicates) and during DNA extraction and PCR amplification can be used to help statistically identify contaminants in the samples.

Following the decontam tutorial: https://benjjneb.github.io/decontam/vignettes/decontam_intro.html

Necessary Ingredients:

decontam tutorial:
“The first decontam ingredient is a feature table derived from your raw data, i.e. a table of the relative abundances of sequence features (columns) in each sample (rows). These “sequence features” can be any of a wide variety of feature types, including amplicon sequence variants (ASVs), operational taxonomic units (OTUs), taxonomic groups or phylotypes (e.g. genera), orthologous genes or metagenome-assembled-genomes (MAGs) – anything with a quantitative abundance derived from marker-gene or metagenomics data."

"The second decontam ingredient is one of two types of metadata:

DNA quantitation data recording the concentration of DNA in each sample. Most often this is collected in the form of fluorescent intensities measured prior to mixing samples into equimolar ratios for sequencing (and after PCR in amplicon sequencing), but sometimes it is collected via qPCR or other approach on extracted DNA.

A defined set of “negative control” samples in which sequencing was performed on blanks without any biological sample added. Extraction controls are preferred, and in amplicon sequencing the negative controls should also be carried through the PCR step, as each step in the workflow has the potential to introduce new contaminants.

“Finally, this data needs to be imported into R. The feature table should be imported as a sample-by-feature matrix and the sample metadata as a vector with length the number of samples. Alternatively, you can use the phyloseq package to import the data as a phyloseq object.”

If you are starting a fresh session in r-studio and need to re-import your data, here is an example of how to do that. But also you will need to import your metadata file with your post PCR-purification concentrations. In case you have any not detectables in your column concentrations, this code block has a line to change those to 0.001 because decontam requires a number that is not zero.

NOTE SUMMARY: decontam uses one of two methods to identify contaminants:

the frequency method, which is applied to only your samples (NOT controls) and identifies based on DNA concentrations
the prevalence method, which uses both the samples and the controls to identify contaminants.

More details can be found in the sections below.

IMPORTANT NOTE: It is important to have a look at your controls (both in terms of taxonomy and similarity) before you remove them. One way to get an idea for the similarity among the controls and between the controls and the samples is to run decontam but before removing the contaminants that were identified and before removing your controls from the dataset, run an ordination with the decontam scores as colors so you can see how the controls fall out in terms of similarity.

#Clear the global environment for a fresh start to ensure I do not use any carry-over variables.
rm(list=ls())

# Import Count Table
MATRIX_COUNT <- read.table("/Users/cmiller/Dropbox/Miller_Documents/Projects_NonWhale/CDA_main/dada2_CDA/ASV_and_Tax_Tables/pre_decontam/ASV_CountTable_MiSeq_Libraries_O1_P1_CDAProject.txt", header=TRUE)

#Import Taxonomy Table
taxonomy <- read.table("/Users/cmiller/Dropbox/Miller_Documents/Projects_NonWhale/CDA_main/dada2_CDA/ASV_and_Tax_Tables/pre_decontam/ASV_Taxonomy_MiSeq_Libraries_O1_P1_CDAProject.txt", header=TRUE)

  #taxonomy table needs to be a matrix for phyloseq
  taxonomy <- as.matrix(taxonomy) 

# Import Sample Data  
metadata <- read.table("/Users/cmiller/Dropbox/Miller_Documents/Projects_NonWhale/CDA_main/dada2_CDA/ASV_and_Tax_Tables/CDA_metadata.txt",sep="\t",header=T,row.names=1)
  #remove mock community - will do this in phyloseq

  #one of the PCR controls did not result in any fastq files so I will remove it
  metadata <- subset(metadata, Sample_LabID != "PCR_ctrl_Lib_P1") # now fatsq file from the sequencer for this sample so will remove from metadata

    #decontam is going to need the concentrations of the amplicons to be positive numerics so we will have to substitute the 'nd' we used for not detectable to a zero:
  metadata$Amplicon_Concentration_ng_per_uL <- gsub('nd', "0.001", metadata$Amplicon_Concentration_ng_per_uL)
  metadata$Amplicon_Concentration_ng_per_uL <- as.numeric(metadata$Amplicon_Concentration_ng_per_uL)
  
  

# Verify that all samples match between the relative abundance matrix and the metadata
MATRIX_COUNT_forcheck <- MATRIX_COUNT
MATRIX_COUNT_forcheck <- tibble::rownames_to_column(MATRIX_COUNT, var = "Sample_LabID")

metadata_forcheck <- metadata
#metadata_forcheck <- tibble::rownames_to_column(metadata, var = "SampleID")

if (is.na(anti_join(metadata_forcheck, MATRIX_COUNT_forcheck, by = "Sample_LabID")[1,1]) == TRUE){
print("All samples that are in the metadata are in the count table")
} else {
print("WARNING! Not all samples match between metadata and relative abundance data")
}

## [1] "All samples that are in the metadata are in the count table"

rm(MATRIX_COUNT_forcheck)
rm(metadata_forcheck)

Setting up

decontam tutorial: “The decontam package works with feature tables in the form of a standard R matrix, but it is even easier to work with phyloseq objects from the phyloseq package, which is designed to ease the analysis of marker-gene and metagenomics datasets.”

We start by creating a phyloseq object to work with:

OTU <- otu_table(MATRIX_COUNT, taxa_are_rows = FALSE)

TAX <- tax_table(taxonomy)

META = sample_data(metadata)


###CREATE PHYLOSEQ OBJECT and CHECK THAT IT IS CORRECT
ps <- phyloseq(OTU, TAX, META)
ps

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 31103 taxa and 210 samples ]
## sample_data() Sample Data:       [ 210 samples by 19 sample variables ]
## tax_table()   Taxonomy Table:    [ 31103 taxa by 6 taxonomic ranks ]

rm(OTU)
rm(TAX)
rm(META)


cat("\n")

cat("This phyloseq objects has a table of", ntaxa(ps), "amplicon sequence variants (ASVs) inferred by the DADA2 algorithm from amplicon sequencing data of the V4 region of the 16S rRNA gene. The phyloseq objects also includes the sample metadata information needed to use decontam.")

## This phyloseq objects has a table of 31103 amplicon sequence variants (ASVs) inferred by the DADA2 algorithm from amplicon sequencing data of the V4 region of the 16S rRNA gene. The phyloseq objects also includes the sample metadata information needed to use decontam.

Check the metadata

Make sure the concentrations are numeric

head(sample_data(ps))

##              Sample_LabID   SampleName   SampleCode FreezerBox_Number
## DNA_Blank_1   DNA_Blank_1  DNA_Blank_1  DNA_Blank_1                NA
## DNA_Blank_10 DNA_Blank_10 DNA_Blank_10 DNA_Blank_10                NA
## DNA_Blank_11 DNA_Blank_11 DNA_Blank_11 DNA_Blank_11                NA
## DNA_Blank_12 DNA_Blank_12 DNA_Blank_12 DNA_Blank_12                NA
## DNA_Blank_13 DNA_Blank_13 DNA_Blank_13 DNA_Blank_13                NA
## DNA_Blank_14 DNA_Blank_14 DNA_Blank_14 DNA_Blank_14                NA
##                     Material_Type Sample_or_Control Time_w
## DNA_Blank_1  DNA extraction blank           Control     NA
## DNA_Blank_10 DNA extraction blank           Control     NA
## DNA_Blank_11 DNA extraction blank           Control     NA
## DNA_Blank_12 DNA extraction blank           Control     NA
## DNA_Blank_13 DNA extraction blank           Control     NA
## DNA_Blank_14 DNA extraction blank           Control     NA
##              Biological_replicate_number Technical_replicate_number
## DNA_Blank_1                           NA                         NA
## DNA_Blank_10                          NA                         NA
## DNA_Blank_11                          NA                         NA
## DNA_Blank_12                          NA                         NA
## DNA_Blank_13                          NA                         NA
## DNA_Blank_14                          NA                         NA
##              DNA_Extract_Batch DNA_Extract_Date     PCR_Number     PCR_Date
## DNA_Blank_1                  2        4_29_2021      Barcode_2 06_Sept_2021
## DNA_Blank_10                11        6_11_2021      Barcode_4 21_Sept_2021
## DNA_Blank_11                12        6_14_2021      Barcode_2 06_Sept_2021
## DNA_Blank_12                13        6_15_2021      Barcode_4 21_Sept_2021
## DNA_Blank_13                14        6_16_2021 Barcode_5_Lib1  18_Oct_2021
## DNA_Blank_14                15        6_18_2021      Barcode_4 21_Sept_2021
##              PCR_Plate_Location Forward_Primer Reverse_Primer
## DNA_Blank_1          Plate_2_8C           B504            703
## DNA_Blank_10         Plate_4_2H           B503            705
## DNA_Blank_11         Plate_2_8H           B504            708
## DNA_Blank_12         Plate_4_3H           B503            706
## DNA_Blank_13         Plate_5_5G           B504            709
## DNA_Blank_14         Plate_4_4H           B503            707
##              Quantification_Date Amplicon_Concentration_ng_per_uL
## DNA_Blank_1         23_Sept_2021                            0.001
## DNA_Blank_10        24_Sept_2021                            0.001
## DNA_Blank_11        23_Sept_2021                            0.001
## DNA_Blank_12        28_Sept_2021                            0.001
## DNA_Blank_13         22_Oct_2021                           14.200
## DNA_Blank_14        29_Sept_2021                            0.296
##              Library_Assignment
## DNA_Blank_1          Library_O1
## DNA_Blank_10         Library_P1
## DNA_Blank_11         Library_O1
## DNA_Blank_12         Library_P1
## DNA_Blank_13         Library_O1
## DNA_Blank_14         Library_P1

The key sample variables are Amplicon_Concentration_ng_per_µL, which is the DNA concentration in each sample as measured by fluorescent intensity after PCR purification, and Sample_or_Control which tells us which samples are the negative controls.

Inspect Library Sizes

df <- as.data.frame(sample_data(ps)) # Put sample_data into a ggplot-friendly data.frame
df$LibrarySize <- sample_sums(ps)
df <- df[order(df$LibrarySize),]
df$Index <- seq(nrow(df))
ggplot(data=df, aes(x=Index, y=LibrarySize, color=Sample_or_Control)) + geom_point()

decontam tutorial: "The library sizes of the positive samples primarily fall from 15,000 to 40,000 reads, but there are some low-read outliers. The negative control samples have fewer reads as expected. Note: It is important keep the low-read samples for now, because we want to use those negative controls to help identify contaminants!

cat("The library sizes here range from",  range(df$LibrarySize)[1], "to", range(df$LibrarySize)[2], "reads. Just like the tutorial, there are some low-read outliers. And, the negative controls have fewer reads as expected. Please look at the plot in the decontam tutorial https://benjjneb.github.io/decontam/vignettes/decontam_intro.html")

## The library sizes here range from 37 to 155070 reads. Just like the tutorial, there are some low-read outliers. And, the negative controls have fewer reads as expected. Please look at the plot in the decontam tutorial https://benjjneb.github.io/decontam/vignettes/decontam_intro.html

Identify contaminants using the frequency method

decontam tutorial: “In this method, the distribution of the frequency of each sequence feature as a function of the input DNA concentration is used to identify contaminants.”

IMPORTANT NOTES for consideration about the frequency method from Ben Callahan’s lecture st STAMPS summer 2023:

Any blanks/controls should not be included in this method – only samples should be run with this method
One should have a three-fold range in amplicon concentrations across the sample set
One should NOT combine samples that have differences in biomasses (e.g. a fecal sample versus a sample of air)
it would be great if a DNA dilution series was included in the sequencing library (Callahan said to see https://journals.asm.org/doi/full/10.1128/msystems.00290-19)

Before we can use the frequency method. we need to remove the controls:

Remove controls for using the frequency method

ps.noctrl <- subset_samples(ps, Sample_or_Control != "Control")

One of the considerations from Ben Callahan’s lecture and from the decontam issue tracker https://github.com/benjjneb/decontam/issues/91 indicates that there should be a three-fold range in the amplicon concentrations for the frequency method. In the issue tracker, B. Callahan wrote, “B. Callahan wrote the following about the range presented in the issue tracker:”a wider concentration would make frequency-based contaminant classification more powerful and accurate"

Thus, let’s run summary stats on our concentrations to see what the concentrations look like:

Summary stats of amplicon concentrations

# provides the summary stats on the amplicon concentrations
  summary(sample_data(ps.noctrl)$Amplicon_Concentration_ng_per_uL)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.180   5.165   6.570   6.735   7.855  16.200

# calculates the fold range in the concentrations
  cat("\n","The amplicon concentrations have a", round(max(sample_data(ps.noctrl)$Amplicon_Concentration_ng_per_uL)/min(sample_data(ps.noctrl)$Amplicon_Concentration_ng_per_uL), digits = 1), "fold range.")

## 
##  The amplicon concentrations have a 7.4 fold range.

Relative to the DNA concentrations in the aforementioned issue in the decontam issue tracker (https://github.com/benjjneb/decontam/issues/91), we have a limited range in our amplicon DNA concentrations but we do have a 7.4 fold range so it might be worth a try.

identify contaminants using the frequency method with default settings

contamdf.freq <- isContaminant(ps.noctrl, method="frequency", conc="Amplicon_Concentration_ng_per_uL")
head(contamdf.freq)

##            freq prev    p.freq p.prev         p contaminant
## ASV1 0.14005666  169 0.7679470     NA 0.7679470       FALSE
## ASV2 0.05561513  126 0.7152153     NA 0.7152153       FALSE
## ASV3 0.06167253  132 0.3821514     NA 0.3821514       FALSE
## ASV4 0.03046965  145 0.4543270     NA 0.4543270       FALSE
## ASV5 0.01599135  145 0.5573144     NA 0.5573144       FALSE
## ASV6 0.02414447   42 0.4667109     NA 0.4667109       FALSE

decontam tutorial: “This calculation has returned a data.frame with several columns, the most important being $p which contains the probability that was used for classifying contaminants, and $contaminant which contains TRUE/FALSE classification values with TRUE indicating that the statistical evidence that the associated sequence feature is a contaminant exceeds the user-settable threshold. As we did not specify the threshold, the default value of threshold = 0.1 was used, and $contaminant=TRUE if $p < 0.1.”

show results of frequency method

table(contamdf.freq$contaminant)

## 
## FALSE  TRUE 
## 30999   104

cat("\n")

#head(which(contamdf.freq$contaminant))
which(contamdf.freq$contaminant)

##   [1]   863  1439  2060  2397  2975  3053  3104  3218  3227  3320  3353  3449
##  [13]  3451  3521  3788  3874  3999  4031  4051  4171  4438  4513  4602  4629
##  [25]  4645  4656  4700  4936  5038  5067  5088  5206  5387  5409  5507  5528
##  [37]  5586  5744  5794  6004  6020  6069  6109  6198  6348  6480  6481  6499
##  [49]  6815  6902  6980  7100  7186  7224  7406  7522  7636  7647  7696  7835
##  [61]  7851  7984  8020  8037  8074  8077  8365  8380  8868  8886  9317  9418
##  [73]  9465  9644  9692 10759 10952 11062 11610 11698 11905 12087 12111 12380
##  [85] 12418 12419 12425 12597 12628 12831 13011 13028 13352 13452 14099 14231
##  [97] 14531 14945 15346 15723 15817 16133 16251 16533

    cat("\n")

    cat(sum(contamdf.freq$contaminant),"ASVs were identified as contaminants using the ferquency method with the default settings ", "\n")

## 104 ASVs were identified as contaminants using the ferquency method with the default settings

The decontam tutorial visualizes an example of a clear non-contaminant (the 1st ASV) and a clear contaminant (the 3rd ASV).

We will do the same here:

Plot some of the results of frequency method

Here we are plotting 2 non-contaminants (ASV 1 and ASV 2) and 7 of the identified contaminants

plot_frequency(ps.noctrl, taxa_names(ps)[c(1,863, 1439, 2060, 2397, 2975, 3053, 3104, 3218, 3227, 3320, 3353)], conc="Amplicon_Concentration_ng_per_uL") + 
  xlab("DNA Concentration (ng/µL)")

## Warning: Transformation introduced infinite values in continuous y-axis

decontam tutorial: “in this plot the dashed black line shows the model of a noncontaminant sequence feature for which frequency is expected to be independent of the input DNA concentration. The red line shows the model of a contaminant sequence feature, for which frequency is expected to be inversely proportional to input DNA concentration, as contaminating DNA will make up a larger fraction of the total DNA in samples with very little total DNA.”

Here I plotted an ASV that was NOT identified as contaminants (ASV1) and eleven that were identified as contaminants.

And now let’s plot a few more non-contaminants with which we can compare

plot_frequency(ps.noctrl, taxa_names(ps)[c(1,2,3, 4,5, 6, 7, 8, 90)], conc="Amplicon_Concentration_ng_per_uL") + 
  xlab("DNA Concentration (ng/uL)")

## Warning: Transformation introduced infinite values in continuous y-axis

The frequency contaminant identification method generates plots that are somewhat different for the identified contaminants than the decontam tutorial. However, the plots of the identified contaminants do look quite different from those of non-contaminants.

Let’s try the prevalence method.

Identify contaminants using the prevalence method

decontam tutorial: “In this method, the prevalence (presence/absence across samples) of each sequence feature in true positive samples is compared to the prevalence in negative controls to identify contaminants. In our phyloseq object,”Sample_or_Control" is the sample variable that holds the negative control information. We’ll summarize that data as a logical variable, with TRUE for control samples, as that is the form required by isContaminant."

IMPORTANT NOTES for consideration about the prevalence method from Ben Callahan’s lecture st STAMPS summer 2023:

the ideal negative control, is one that goes through the entire process, including the sampling environment. Such a control will capture any contaminants in the field AND DNA extraction AND PCR.
One should have five negative controls/96 well plate but if you work in low biomass environments, then you want about one to four ratio of negative controls to actual samples

identify contaminants - prevalence method using default settings

sample_data(ps)$is.neg <- sample_data(ps)$Sample_or_Control == "Control"
contamdf.prev <- isContaminant(ps, method="prevalence", neg="is.neg")

show results of prevalence method - default settings

table(contamdf.prev$contaminant)

## 
## FALSE  TRUE 
## 31039    64

cat("\n")

#head(which(contamdf.prev$contaminant))
which(contamdf.prev$contaminant)

##  [1]   48  113  160  173  185  198  204  223  240  244  299  307  314  360  369
## [16]  374  513  611  640  695  777  841  853  866  979 1056 1128 1460 1479 1562
## [31] 1566 1595 1607 1654 1665 1669 1726 1727 1781 1833 2034 2175 2283 2442 2457
## [46] 2491 2501 2525 2698 2876 2903 2931 3166 3470 3500 3530 3550 3552 3947 4332
## [61] 4634 4964 5989 6448

    cat("\n")

    cat(sum(contamdf.prev$contaminant),"ASVs were identified as contaminants using the prevalence method with the default settings ", "\n")

## 64 ASVs were identified as contaminants using the prevalence method with the default settings

Interestingly, all the contaminants identified by the prevalence method were different from those identified by the frequency method.

Plot some of the results of the prevalence method - default settings

Let’s visualize these results by plotting the number of times the ASVs were observed in negative controls and true samples.

ps.pa <- transform_sample_counts(ps, function(abund) 1*(abund>0))
ps.pa.neg <- prune_samples(sample_data(ps.pa)$Sample_or_Control == "Control", ps.pa)
ps.pa.pos <- prune_samples(sample_data(ps.pa)$Sample_or_Control == "True_sample", ps.pa)
# Make data.frame of prevalence in positive and negative samples
df.pa <- data.frame(pa.pos=taxa_sums(ps.pa.pos), pa.neg=taxa_sums(ps.pa.neg),
                      contaminant=contamdf.prev$contaminant)
ggplot(data=df.pa, aes(x=pa.neg, y=pa.pos, color=contaminant)) + geom_point() +
  xlab("Prevalence (Negative Controls)") + ylab("Prevalence (True Samples)")

The ASVs appear to separate into a branch that shows up mostly in positive samples and another mostly in negative samples. There is the concerning ASVs that shows up in most/all of the negative controls and most of the samples

decontam tutorial: “Note that as before, the default threshold for a contaminant is that it reaches a probability of 0.1 in the statistical test being performed. In the prevalence test there is a special value worth knowing, threshold=0.5, that will identify as contaminants all sequences thare are more prevalent in negative controls than in positive samples. Let’s try using this more aggressive classification threshold rather than the default.”

identify contaminants - prevalence method using more aggressive threshold of 0.5:

contamdf.prev05 <- isContaminant(ps, method="prevalence", neg="is.neg", threshold=0.5)

show results of prevalence method - threshold = 0.5:

table(contamdf.prev05$contaminant)

## 
## FALSE  TRUE 
## 30897   206

#head(which(contamdf.prev05$contaminant))
which(contamdf.prev05$contaminant)

##   [1]    48    62   113   160   173   185   198   204   223   240   244   299
##  [13]   307   314   360   369   374   513   606   611   640   695   737   777
##  [25]   797   841   853   866   979  1056  1096  1128  1175  1210  1316  1325
##  [37]  1371  1389  1439  1460  1479  1562  1566  1595  1607  1624  1654  1665
##  [49]  1669  1677  1726  1727  1742  1748  1781  1801  1833  1860  1878  1884
##  [61]  1893  1941  1989  1998  2020  2026  2034  2071  2073  2083  2090  2175
##  [73]  2203  2227  2241  2247  2283  2356  2401  2422  2442  2457  2465  2491
##  [85]  2501  2525  2584  2607  2608  2652  2692  2698  2740  2776  2824  2876
##  [97]  2903  2931  2952  2996  3046  3056  3104  3110  3117  3118  3120  3128
## [109]  3130  3166  3167  3180  3187  3209  3230  3231  3233  3255  3256  3316
## [121]  3369  3416  3470  3500  3530  3545  3546  3550  3551  3552  3583  3599
## [133]  3613  3614  3628  3644  3648  3681  3737  3790  3804  3811  3850  3883
## [145]  3888  3947  4058  4149  4157  4164  4174  4245  4332  4364  4378  4380
## [157]  4407  4437  4485  4486  4513  4514  4569  4603  4605  4621  4634  4635
## [169]  4732  4740  4815  4850  4964  4965  5054  5073  5079  5147  5238  5291
## [181]  5393  5503  5552  5764  5819  5989  6197  6208  6209  6211  6354  6448
## [193]  6531  6605  6670  6770  6887  7243  7244  7394  7962  8212  8796 10059
## [205] 10713 15582

    cat("\n")

    cat(sum(contamdf.prev05$contaminant),"ASVs were identified as contaminants using the prevalence method with the threshold=0.5 setting ", "\n")

## 206 ASVs were identified as contaminants using the prevalence method with the threshold=0.5 setting

Plot some of the results of the prevalence method - threshold = 0.5:

Let’s again visualize these results by plotting the number of times the ASVs were observed in negative controls and true samples.

#ps.pa <- transform_sample_counts(ps, function(abund) 1*(abund>0))
#ps.pa.neg <- prune_samples(sample_data(ps.pa)$Sample_or_Control == "Control", ps.pa)
#ps.pa.pos <- prune_samples(sample_data(ps.pa)$Sample_or_Control == "True_sample", ps.pa)
# Make data.frame of prevalence in positive and negative samples
df.pa.prev05 <- data.frame(pa.pos=taxa_sums(ps.pa.pos), pa.neg=taxa_sums(ps.pa.neg),
                      contaminant=contamdf.prev05$contaminant)
ggplot(data=df.pa.prev05, aes(x=pa.neg, y=pa.pos, color=contaminant)) + geom_point() +
  xlab("Prevalence (Negative Controls)") + ylab("Prevalence (True Samples)")

For our purposes, not much is gained by increasing the threshold to 0.5 so we will proceed with results of the prevalence method with the default settings (threshold = 0.1)

Before we remove contaminants, let’s do a little cleanup of the environment

rm(contamdf.freq)
rm(contamdf.prev05)
rm(df.pa.prev05)

Remove contaminants from the dataset

Here we will remove the contaminants identified by the prevalence method (threshold = 0.1).

ps.nocontam <- prune_taxa(!contamdf.prev$contaminant, ps)

# Verify that the contaminants were removed:  
    cat("Number of ASVs in the dataset prior to removing decontam identified contaminants =", ntaxa(ps), "\n")

## Number of ASVs in the dataset prior to removing decontam identified contaminants = 31103

    cat("Number of ASVs considered contaminants by decontam =",length(which(contamdf.prev$contaminant)), "\n")

## Number of ASVs considered contaminants by decontam = 64

    cat("The difference between these two numbers =", ntaxa(ps)-length(which(contamdf.prev$contaminant)), ",","\n")

## The difference between these two numbers = 31039 ,

    #print("...which should equal the total number of ASVs in the dataset after removing the contaminants, and that is", ntaxa(ps.nocontam), "ASVs")
    #ntaxa(ps.nocontam)
    
    cat("which should equal the number of ASVs in the dataset from which contaminants were removed =", ntaxa(ps.nocontam), "\n", "\n")

## which should equal the number of ASVs in the dataset from which contaminants were removed = 31039 
##

    cat("Another check: The difference between the original dataset and the all-contaminants-removed dataset =", ntaxa(ps)-ntaxa(ps.nocontam),",", "\n")

## Another check: The difference between the original dataset and the all-contaminants-removed dataset = 64 ,

    cat("which should equal the total number of ASVs identified as contaminants by decontam =", length(which(contamdf.prev$contaminant)), "\n", "\n")

## which should equal the total number of ASVs identified as contaminants by decontam = 64 
##

    cat("check the code if the numbers above don't match where they are supposed to match!", "\n")

## check the code if the numbers above don't match where they are supposed to match!

Remove ASVs not present in any samples

Removing ASVs not present in any sample might be important for some of your upcoming analyses

ps.nocontam.filt <- filter_taxa(ps.nocontam, function(x) sum(x) > 0, TRUE) 
ps.nocontam.filt

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 31039 taxa and 210 samples ]
## sample_data() Sample Data:       [ 210 samples by 20 sample variables ]
## tax_table()   Taxonomy Table:    [ 31039 taxa by 6 taxonomic ranks ]

    cat("Number of ASVs in the dataset (without contaminants) prior to removing ASVs not present in any sample =", ntaxa(ps.nocontam), "\n", "\n")

## Number of ASVs in the dataset (without contaminants) prior to removing ASVs not present in any sample = 31039 
##

    cat("Number of ASVs in the dataset (without contaminants) after removing ASVs not present in any sample =", ntaxa(ps.nocontam.filt), "\n", "\n")

## Number of ASVs in the dataset (without contaminants) after removing ASVs not present in any sample = 31039 
##

    cat("The number of ASVs not present in any sample that were removed =", ntaxa(ps.nocontam)-ntaxa(ps.nocontam.filt), "\n", "\n")

## The number of ASVs not present in any sample that were removed = 0 
##

Pulling it all together - decontam

decontam tutorial: “The two basic methods implemented are the frequency and prevalence methods shown above, but a number of additional ways to utilize those methods are available. The combined, minimum and independent modes all use both the frequency and prevalence methods to identify contaminants, but combine the results of the two methods in different ways (see ?isContaminant for more information). There is also a batch functionality that allows contaminants to be identified independently in different batches (e.g. sequencing runs, or different studies) that are expected to have different contaminant profiles.”

NOTE: Deciding which method you want to use to identify contaminants will be project specific.

decontam Conclusions

decontam tutorial: "We all know that contaminants are a problem in marker-gene and metagenomics data. Right now, it is widespread practice to address contamition by sequencing negative controls, and then doing nothing, in large part because there haven’t been easy to use tools available to identify contaminants in this kind of data. That’s where the decontam package comes in. decontam provides a simple interface that takes in your table of sequence features (ASVs, OTUs, MAGs, genera, etc), and classifies each sequence feature as a contaminant based on signatures of contamination that have been demonstrated over many previous studies.

Removing contaminants makes the characterizations provided by marker-gene and metagenomics sequencing more accurate. It prevent false positives in exploratory analyses. It reduces batch effects between different studies and sequencing runs. It increases statistical power by reducing the additional hypotheses spent on contaminant sequencing features.

Removing contaminants makes your data better, and that’s always worth doing."

Output the tables to your project directory

Here is an example of how to save your tables to text files for future analyses

# Comment: write out these as .txt files to my local computer - done on INSERT DATE
# ASV_Count <- data.frame(otu_table(ps.nocontam.filt))
# write.table(ASV_Count, file = "~/Documents/Projects_NonWhale/CDA_main_local/ASV_and_Tax_Tables/post_decontam/ASV_CountTableNoContam_MiSeq_Libraries_O1_P1_CDAProject.txt", sep = "\t", row.names = TRUE, col.names = TRUE)
# 
# ASV_Taxonomy <- data.frame(tax_table(ps.nocontam.filt))
# write.table(ASV_Taxonomy, file = "~/Documents/Projects_NonWhale/CDA_main_local/ASV_and_Tax_Tables/post_decontam/ASV_TaxonomyNoContam_MiSeq_Libraries_O1_P1_CDAProject.txt", sep = "\t", row.names = TRUE, col.names = TRUE)
# 
# rm(ASV_Count)
# rm(ASV_Taxonomy)

Version:

V1 - drafted 26 Jan - 01 Feb 2024 Carolyn Miller