core biodiversity analysis (e.g., alpha-diversity, beta-diversity and differential abundance analyses) and comparisons between groups
classification analysis
Getting started
If you are new to shotgun metagenomics processing, or have not heard of Apptainer/Singularity before, then start with “What is a Container Image?”
1 - What is a Container Image?
A brief introduction to container images
…in a nutshell
A container image is an image file that packages code (software applications) together with its dependencies (e.g., a specific operating system, system libraries) so that it can be deployed consistently as a container in any environment.
Because the image is immutable once built, deployments are consistent and repeatable. For example, say an image was built with Ubuntu as the operating system and you deploy this image on your Windows 11 PC (the host). An Ubuntu “container” is created and the programs inside this container run under the Ubuntu settings.
Different platforms are available for building and deploying these container images, for example, Apptainer, SingularityCE or Docker.
For the MIMA Pipeline we use Apptainer.
Apptainer == Singularity
We recently changed from Singularity to Apptainer; any overlooked mention of ‘Singularity’ in the tutorials can be read as ‘Apptainer’.
Noteworthy points
Build a Sandbox
Every time you deploy a container from an image file (*.sif extension), the platform needs to unpack the system files. If the container is complex and has many dependencies or programs, this can be time consuming.
You can bypass this by building a sandbox, which unpacks the files once into a directory (a container in a directory) that you then reuse. This removes the need to unpack on every deployment.
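For example, a sandbox can be built from the image file with Apptainer's build --sandbox option; a minimal sketch (the image filename follows the naming scheme described in the Installation section and is illustrative here):
# Unpack the image once into a sandbox directory named mima-pipeline
apptainer build --sandbox mima-pipeline mimaX_vX.X.X_yyyyddmm.sif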
Containers need explicit instructions to access your files
A container is deployed with the bare minimum filesystem to operate, meaning that it only has access to your home directory by default.
It does not automatically have access to other filesystems or disks. If your data is located on another drive or path, then you need to explicitly provide this information using either
Option 1) -B parameter: the binding only exists during the life of the container
apptainer run -B /home/jsmith/studyABC:/study/data,/srv/shared/refDB:/reference/data
Option 2) APPTAINER_BIND environment variable: the binding exists during your terminal session (covers multiple deployed containers)
The path binding format is <host-PC>:<container>, where:
Left of the colon (:) is the absolute path of the directory found on the host PC
Right of the colon (:) is the path that the Container sees when it’s deployed
The above example:
Path on Host PC (left)
Path in Container (right)
Description
/home/jsmith/studyABC
/study/data
The studyABC directory on the host PC is in the home directory of user jsmith (/home/jsmith). Applications running in the Container can access the data from /study/data
/srv/shared/refDB
/reference/data
The refDB directory on the host PC is in a shared location, /srv/shared. Applications running in the Container can access this data from /reference/data
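The same bindings can instead be set once per terminal session using the APPTAINER_BIND environment variable (Option 2); a minimal sketch using the paths from the example above:
# Set once; all containers deployed in this terminal session inherit these bindings
export APPTAINER_BIND=/home/jsmith/studyABC:/study/data,/srv/shared/refDB:/reference/data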
Shortcuts/softlinks also need binding
If you have shortcuts or softlinks in your home directory pointing to elsewhere, you also need to bind the original locations using the -B parameter or the environment variable setting.
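For example, if ~/refdb in your home directory is a softlink, find its real target and bind that location (the paths below are illustrative):
# Find the real location the softlink points to
readlink -f ~/refdb
# e.g., prints /srv/shared/refDB; bind that original location so the container can follow the link
export APPTAINER_BIND=/srv/shared/refDB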
2 - Installation
MIMA Container image
All tools required by the MIMA pipeline are encapsulated into a Container image file with the mimaX_vX.X.X_yyyyddmm.sif naming scheme, where the X's denote version numbers.
The latest MIMA container version (mima4_v2.0.0) supports the HUMAnN v3.8 and MetaPhlAn v4.1.0 tools and their reference databases (older reference databases will not work).
We assume you are on an HPC environment with a job scheduling system. Many HPC environments will not allow running containers from the login or head node, so you need to start an interactive PBS job first.
In OpenPBS, specify the following to request an interactive job with 4 CPUs and 4GB RAM for 6 hours
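A typical OpenPBS request for such an interactive session looks like the following; the exact resource syntax and queue names may differ on your system:
qsub -I -l select=1:ncpus=4:mem=4gb -l walltime=06:00:00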
Create a SANDBOX environment variable to store the full directory path (saves typing long paths!)
export SANDBOX=`pwd`/mima-pipeline
Confirm installation
Test that the SANDBOX environment variable is working
If this command does not work, check your SANDBOX environment variable using echo $SANDBOX, which should output the path where you built the sandbox
apptainer run $SANDBOX
Below is the output; check that the line active environment : mima matches what is shown below
----
source: /opt/miniconda/envs/mima/etc/conda/activate.d/activate-binutils_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/activate.d/activate-gcc_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/activate.d/activate-gfortran_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/activate.d/activate-gxx_linux-64.sh:10:40: parameter expansion requires a literal
Auto-activate MIMA conda environment
source: /opt/miniconda/envs/mima/etc/conda/deactivate.d/deactivate-gxx_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/deactivate.d/deactivate-gfortran_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/deactivate.d/deactivate-gcc_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/deactivate.d/deactivate-binutils_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/activate.d/activate-binutils_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/activate.d/activate-gcc_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/activate.d/activate-gfortran_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/activate.d/activate-gxx_linux-64.sh:10:40: parameter expansion requires a literal
----
This is the MIMA pipeline Container
v2.0.0 - build: 20240409
active environment : mima
active env location : /opt/miniconda/envs/mima
shell level : 1
user config file : /home/z3534482/.condarc
populated config files : /home/z3534482/.condarc
conda version : 24.1.2
conda-build version : not installed
python version : 3.12.1.final.0
solver : libmamba (default)
virtual packages : __archspec=1=zen2
__conda=24.1.2=0
__glibc=2.35=0
__linux=4.18.0=0
__unix=0=0
base environment : /opt/miniconda (read only)
conda av data dir : /opt/miniconda/etc/conda
conda av metadata url : None
channel URLs : https://conda.anaconda.org/biobakery/linux-64
https://conda.anaconda.org/biobakery/noarch
https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
https://conda.anaconda.org/bioconda/linux-64
https://conda.anaconda.org/bioconda/noarch
https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
https://conda.anaconda.org/default/linux-64
https://conda.anaconda.org/default/noarch
package cache : /opt/miniconda/pkgs
/home/z3534482/.conda/pkgs
envs directories : /home/z3534482/.conda/envs
/opt/miniconda/envs
platform : linux-64
user-agent : conda/24.1.2 requests/2.31.0 CPython/3.12.1 Linux/4.18.0-513.24.1.el8_9.x86_64 ubuntu/22.04.4 glibc/2.35 solver/libmamba conda-libmamba-solver/23.12.0 libmambapy/1.5.3
UID:GID : 13534482:5000
netrc file : None
offline mode : False
Python 3.10.8
Rscript (R) version 4.2.1 (2022-06-23)
humann v3.8
MetaPhlAn version 4.1.0 (23 Aug 2023)
java -ea -Xmx97655m -Xms97655m -cp /opt/miniconda/envs/mima/opt/bbmap-39.01-1/current/ clump.Clumpify --version
BBMap version 39.01
For help, please run the shellscript with no parameters, or look in /docs/.
fastp 0.23.4
2.28-r1209
Kraken version 2.1.3
Copyright 2013-2023, Derrick Wood (dwood@cs.jhu.edu)
Reminder
By default, containers are deployed with minimal filesystem access and you might need to bind paths
Many steps in the pipeline require access to reference databases. These reference databases are very big and are often already downloaded by the system administrators. As such they are not included in the container images.
To run the pipeline you need to know the absolute paths for the below reference databases.
minimap2
kraken2 and bracken
humann
CHOCOPhlAn
uniref
metaphlan CHOCOPhlAn database
You might also need to set up path binding when deploying the containers.
If you are missing any reference datasets, see below for download information.
QC: decontamination step
Tool
Description
URL
Minimap2
requires a host reference genome; we used the human reference genome GRCh38.p14 from NCBI (~800MB)
Tourlousse, et al. (2022), Characterization and Demonstration of Mock Communities as Control Reagents for Accurate Human Microbiome Community Measures, Microbiology Spectrum.
This data set consists of two mock communities: DNA-mock and Cell-mock. The mock communities consist of bacteria that are mainly detected in the human gastrointestinal tract ecosystem, with a small mixture of skin microbiota. The data was processed in three different labs: A, B and C. In the previous tutorial, we only processed a subset of the samples (n=9). In this tutorial we will be working with the full data set, which has been pre-processed using the same pipeline. In total there were 56 samples, of which 4 fell below the abundance threshold, and therefore the final taxonomy abundance table has 52 samples. We will train the random forest classifier to distinguish between the three labs.
The raw reads are available from NCBI SRA Project PRJNA747117, and there are 56 paired-end samples (112 fastq files)
As the data set is very large, we will work with a subset (n=9 samples, 18 fastq files)
This tutorial teaches you how to prepare the required raw fastq files
Note
You will need about 80GB of disk space depending on which option you used for downloading.
This tutorial assumes the ~/mima_tutorial directory is located in your home directory as indicated by the ~ (tilde) sign. If you have put the files in another location then replace all occurrences of ~/mima_tutorial with your location (remember to use absolute paths).
Below is the output, wait until all files are downloaded
2022-09-08T05:50:42 prefetch.3.0.0: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2022-09-08T05:50:42 prefetch.3.0.0: 1) Downloading 'SRR17380209'...
2022-09-08T05:50:42 prefetch.3.0.0: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.
2022-09-08T05:50:42 prefetch.3.0.0: Downloading via HTTPS...
...
After the download finishes, check the downloaded files with the tree command
the first row is the header and is case sensitive; it must have the three columns: Sample_ID,FileID_R1,FileID_R2
the filenames in columns 2 and 3 do not need to be absolute paths as the directory where the files are located will be specified during quality checking
After downloading the tutorial data, we assume that mima_tutorial is the working directory located in your home directory (indicated by the tilde, ~). Hence, always make sure you are in the right directory before executing a command; for example, run the following commands:
$ cd ~/mima_tutorial
$ tree .
the starting directory structure for mima_tutorial should look something like:
From here on, ~/mima_tutorial will refer to the project directory as depicted above. Replace this path if you saved the tutorial data in another location.
TIP: Text editors
There are several places where you may need to edit the commands, scripts or files. You can use the vim text editor to edit text files directly in the terminal.
For example, the command below lets you edit the pbs_header_qc.cfg text file
vim pbs_header_qc.cfg
Containers and binding paths
When deploying images, make sure you check if you need to bind any paths.
PBS configuration files
The three modules (QC, Taxonomy profiling and Function profiling) in the data-processing pipeline require access to a job queue and instructions about the resources required for the job, for example, the number of CPUs, the RAM size and the time required for execution. These parameters are defined in PBS configuration text files.
Three such configuration files are provided with the tutorial data, one for each module, as the modules require different PBS settings (indicated by lines starting with the #PBS tag).
APPTAINER_BIND is the environment variable you set when binding file paths to the container.
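As a rough sketch, such a configuration file might contain resource request lines like the following; the values are illustrative only, so check the files shipped with the tutorial data and adjust them for your system:
#PBS -N mima-qc
#PBS -l select=1:ncpus=8:mem=64gb
#PBS -l walltime=12:00:00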
Use absolute paths
When running the pipeline it is best to use full paths when specifying the locations of input files, output files and reference data to avoid any ambiguity.
Absolute/Full paths
always start with the root directory, indicated by the forward slash (/) on Unix-based systems.
e.g., the command below changes directory (cd) to a folder named scripts that is located in the user jsmith’s home directory. Provided this folder exists, this command will work from anywhere on the filesystem.
[~/metadata] $ cd /home/jsmith/scripts
Relative paths
are relative to the current working directory
Now imagine the following file system structure in the user john_doe’s home directory
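A minimal sketch of such a layout (only the directories needed for this example are shown):
/home/user/john_doe/
└── studyABC/
    └── metadata/   (*)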
The asterisk marks his current location, which is inside the /home/user/john_doe/studyABC/metadata sub-folder
In this example we are currently in the metadata directory, and change directory to a folder named data that is located in the parent directory (..)
This command only works provided there is a data folder in the parent directory above metadata
According to the filesystem above, the parent directory of metadata is studyABC and there is no data subfolder in this directory, so this command will fail with an error message
[/home/user/john_doe/studyABC/metadata] $ cd ../data
-bash: cd: ../data: No such file or directory
Now that you have installed the data and know the basics, you can begin data processing with quality control.
4.1.3 - Quality control (QC)
QC module in MIMA: check that reads are of a high standard
This step must be done before Taxonomy and Function profiling.
Introduction: QC module
Quality control (QC) checks that the sequenced reads obtained from the sequencing machine are of good quality. Poor-quality reads or artefacts due to sequencing errors, if not removed, can lead to spurious results and affect downstream analyses. A number of tools are available for checking the read quality of high-throughput sequencing data.
This module generates one bash (*.sh) script per sample that performs the following steps. The module also generates N PBS scripts (N is set using the --num-pbs-jobs parameter).
This tutorial has 9 samples spread across 3 PBS jobs, so each PBS job executes three bash scripts, one per sample.
Step
1. repair.sh from the BBTools/BBMap tool suite repairs the sequenced reads and separates out any singleton reads (orphaned reads that are missing either the forward or reverse partner read) into their own text file.
2. dereplication using clumpify.sh from the BBTools/BBMap tool suite removes duplicate reads and clusters reads for faster downstream processing
3. quality checking with fastp.sh checks the quality of the reads and removes any reads that are of low quality, too long or too short
4. decontamination with minimap2.sh maps the sequenced reads against a reference genome (e.g., human) defined by the user, removes these host reads from the data and generates a cleaned output file. This becomes the input for the taxonomy profiling and function profiling modules.
The (optional) last step generates a QC report text file with summary statistics for each file. This step is run after all PBS scripts have been executed for all samples in the study.
Find the absolute path to the host reference genome file (e.g., Human host might use GRCh38_latest_genome.fna)
Replace the highlighted line --ref <path/to/human/GRCh38_latest_genome.fna> with the location of your host reference genome file
Enter the following command
the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
note if you enter the command as one line, remove the backslashes
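Putting this together, the QC-module command takes roughly the following shape. The --app name below is a placeholder for illustration (use the QC app name from your MIMA installation), and the input, manifest and configuration paths assume the tutorial layout; adjust them for your setup:
apptainer run --app <mima-QC-app> $SANDBOX \
  -i <absolute/path/to/raw-fastq-files> \
  -o ~/mima_tutorial/output \
  -m ~/mima_tutorial/manifest.csv \
  --num-pbs-jobs 3 \
  --ref <path/to/human/GRCh38_latest_genome.fna> \
  --mode container \
  --pbs-config ~/mima_tutorial/pbs_header_qc.cfg
The parameters are explained below.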
-i <input>
yes
must be the absolute path to the directory where the fastq files are stored. FastQ files often have the *.fastq.gz or *.fq.gz extension. This path is used to find the files named in the FileID_R1 and FileID_R2 columns of the manifest.csv file
-o <output>
yes
absolute path to the output directory where results will be saved. The <output> path will be created if it does not exist.
Note: if there are already existing subdirectories in the <output> path, then this step will fail.
-m <manifest.csv>
yes
a comma-separated file (*.csv) that has three columns with the headers: Sample_ID, FileID_R1, FileID_R2 (see the example below).
Note: the FileID_R1 and FileID_R2 filenames are relative to the -i <input> path provided.
--num-pbs-jobs
no (default=4)
number of PBS scripts to generate; by default 4 jobs are created with samples split equally between the jobs
--ref
yes
path to the host genome file GRCh38_latest_genome.fna
--mode container
no (default=‘single’)
set this if you are running as a Container
--pbs-config
yes if --mode container
path to the pbs configuration file, must specify this parameter if --mode container is set
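For reference, a manifest.csv for this tutorial might look like the following; the file names are illustrative and should match the actual fastq files in your input directory:
Sample_ID,FileID_R1,FileID_R2
SRR17380209,SRR17380209_1.fastq.gz,SRR17380209_2.fastq.gz
...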
tip
These commands can get really long; you can compose them in a text editor (e.g., Notepad) first and then paste the full command into the terminal.
cd ~/mima_tutorial/output/QC_module
qsub qcModule_0.pbs
You can check that the jobs have been submitted and their status with qstat
Repeat the qsub command for each of the qcModule_1.pbs and qcModule_2.pbs files
Wait until all PBS jobs have completed
PBS log files
Usually a PBS job will generate a log file in the same directory from where the job was submitted
This is why we changed directory to where the PBS scripts are saved before submitting the jobs: the log files are required for Step 5, generating the QC report.
Step 4. Check QC outputs
After all PBS jobs complete, you should have the following outputs
tree ~/mima_tutorial/output/QC_module
We only show outputs for one sample below with ... meaning “and others”
You’ll have a set of output files for each sample listed in the manifest.csv (provided the corresponding *.fastq files exist)
assign reads to taxa, generating a taxonomy feature table ready for analysis
Introduction to Taxonomy Profiling
Taxonomy profiling takes the quality controlled (clean) sequenced reads as input and matches them against a reference database of previously characterised sequences for taxonomy classification.
There are many different classification tools, for example: Kraken2, MetaPhlAn, Clark, Centrifuge, MEGAN, and many more.
This pipeline uses Kraken2 with Bracken for abundance estimation. Both require access to a reference database, or you can generate your own reference data. In this pipeline, we use the GTDB database (release 95), from which we have built a Kraken2 database.
Workflow description
Kraken2 requires a large amount of memory (~300GB) to run, so only one PBS script is generated and it executes each sample sequentially.
In this tutorial, we have 9 samples which will be executed within one PBS job.
Steps in the taxonomy profiling module:
Step
Kraken2 assigns each read to a lowest common ancestor by matching the sequenced reads against a reference database of previously classified species
Bracken takes the Kraken2 output to estimate abundances for a given taxonomic rank. This is repeated from Phylum to Species rank.
Generate feature table is performed after all samples have been assigned to taxa and their abundances estimated. This step combines the output from all samples and generates a feature table for a given taxonomic rank. The feature table contains discrete counts and relative abundances of "taxon X occurring in sample Y".
Step 1. Generate PBS script
Before starting, have you understood the need to know points?
After QC checking your sequence samples and generating a set of CleanReads
Find the absolute paths for the Kraken2 and Bracken reference databases on your HPC system (!! they need to be in the same directory)
Replace the highlighted line --reference-path <path/to/Kraken2_db> with the Kraken2 absolute path
the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
note if you enter the command as one line, remove the backslashes
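Putting this together, the taxonomy-profiling command takes roughly the following shape; the --app name is a placeholder for illustration (use the taxonomy-profiling app name from your MIMA installation) and the database and configuration paths must be replaced with the locations on your system:
apptainer run --app <mima-taxonomy-app> $SANDBOX \
  -i ~/mima_tutorial/output/QC_module/CleanReads \
  -o ~/mima_tutorial/output \
  --reference-path <path/to/Kraken2_db> \
  --read-length 150 \
  --threshold 1000 \
  --mode container \
  --pbs-config <path/to/pbs_config_taxonomy.cfg>
The parameters are explained below.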
-i <input>
yes
full path to the ~/mima_tutorial/output/QC_module/CleanReads directory that was generated from Step 1) QC, above. This directory should hold all the *_clean.fastq files
-o <output>
yes
full path to the ~/mima_tutorial/output directory where you would like the output files to be saved, can be the same as Step 1) QC
--reference-path
yes
full path to the reference database (this pipeline uses the GTDB release 95 reference database)
--read-length
no (default=150)
read length for Bracken estimation, choose the value closest to your sequenced read length (choose from 50, 75, 100 and 150)
--threshold
no (default=1000)
Bracken filtering threshold; features with counts below this value are filtered out of the abundance estimation
--mode container
no (default=single)
set this if you are running as a Container. By default, the PBS scripts generated are for the ‘standalone’ option, that is without Singularity
--pbs-config
yes if --mode container
path to the pbs configuration file (see below). You must specify this parameter if --mode container is set. You do not need to set this parameter if running outside of Singularity
If you changed the file extension of the cleaned files or are working with already cleaned files from somewhere else, you can specify the forward and reverse suffix using:
Parameter
Required?
Description
--fwd-suffix
default=_clean_1.fq.gz
file suffix for cleaned forward reads from QC module
--rev-suffix
default=_clean_2.fq.gz
file suffix for cleaned reverse reads from QC module
Step 2. Check generated scripts
After Step 1, you should see in the output directory: ~/mima_tutorial/output/Taxonomy_profiling
one PBS script
one bash script/sample (total of 9 bash scripts in this tutorial)
Have a look at the directory structure using tree
tree ~/mima_tutorial/output/Taxonomy_profiling
Expected directory structure
bracken/ and kraken2/ are subdirectories created by Step 2 to store the output files after PBS job is executed
this is the directory that contains the *_bracken.log files, by default these should be in the output/Taxonomy_profiling directory
update this path if you have the *_bracken.log files in another location
apptainer run --app mima-bracken-report $SANDBOX \
  -i ~/mima_tutorial/output/Taxonomy_profiling
The output is a tab separated text file
By default, the output file bracken-summary.tsv will be in the same location as the input directory. You can override this using the -o parameter to specify the full path to the output file.
Examine the output report using head
column formats the text file as columns like a table for readability
head ~/mima_tutorial/output/Taxonomy_profiling/bracken-summary.tsv | column -t
Your output should resemble something like below (only the first 9 columns are shown in the example below)
numbers might be different depending on the threshold setting or tool version
assign reads to gene families and pathways, generating a function feature table ready for analysis
Introduction to Functional Profiling
Functional profiling, like taxonomy profiling, takes the cleaned sequenced reads after QC checking as input and matches them against a reference database of previously characterised gene sequences.
There are different types of functional classification tools available. This pipeline uses the HUMAnN3 pipeline, which also comes with its own reference databases.
Workflow description
One PBS script per sample is generated by this module. In this tutorial, we have 9 samples, so there will be 9 PBS scripts.
Steps in the functional profiling module:
Steps
HUMAnN3 is used for processing and generates three outputs for each sample:
gene families
pathway abundances
pathway coverage
Generate feature table is performed after all samples have been processed. This combines the output and generates a feature table that contains the abundance of gene/pathway X in sample Y.
Step 1. Generate PBS scripts
Before starting, have you understood the need to know points?
After QC checking your sequence samples and generating a set of CleanReads
Replace the highlighted lines with the absolute path for the reference databases
path/to/humann3/chocophlan
path/to/humann3/uniref
path/to/humann3/utility_mapping
path/to/metaphlan_databases
the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
note if you enter the command as one line, remove the backslashes
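Putting this together, the function-profiling command takes roughly the following shape; the --app name is a placeholder for illustration (use the function-profiling app name from your MIMA installation) and the reference database paths must be replaced with the locations on your system:
apptainer run --app <mima-function-app> $SANDBOX \
  -i ~/mima_tutorial/output/QC_module/CleanReads \
  -o ~/mima_tutorial/output \
  --nucleotide-database <path/to/humann3/chocophlan> \
  --protein-database <path/to/humann3/uniref> \
  --utility-database <path/to/humann3/utility_mapping> \
  --metaphlan-options "++bowtie2db <path/to/metaphlan_databases>" \
  --mode container \
  --pbs-config <path/to/pbs_config_function.cfg>
The parameters are explained below.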
full path to the ~/mima_tutorial/output/QC_module/CleanReads directory that was generated from Step 1) QC, above. This directory should hold all the *_clean.fastq files
-o <output>
yes
full path to the ~/mima_tutorial/output directory where you would like the output files to be saved, can be the same as Step 1) QC
--nucleotide-database <path>
yes
directory containing the nucleotide database, (default=/refdb/humann/chocophlan)
--protein-database <path>
yes
directory containing the protein database, (default=/refdb/humann/uniref)
--utility-database <path>
yes
directory containing the utility mapping database (default=/refdb/humann/utility_mapping)
--metaphlan-options <string>
yes
additional MetaPhlAn settings, like specifying the bowtie2 reference database (default="++bowtie2db /refdb/humann/metaphlan_databases"). Enclose parameters in double quotes and use ++ for parameter settings.
--mode container
no (default=‘single’)
set this if you are running in the Container mode
--pbs-config
yes if --mode container
path to the pbs configuration file (see below). You must specify this parameter if --mode container is set. You do not need to set this parameter if running outside of Singularity
Your PBS script should look something like below, with some differences
line 10: IMAGE_DIR should be where you installed MIMA and built the sandbox
line 11: APPTAINER_BIND should be set up during installation when binding paths
make sure to include the path where the host reference genome file is located
line 14: /home/user is replaced with the absolute path to your actual home directory
lines 22-25: ensure the paths to the reference databases are all replaced (the example below uses the vJan21 database version, and the index parameter is passed to MetaPhlAn to disable the automatic index check)
In the file generate_func_feature_tables.sh, the last step in the humann_regroup_table command uses the KEGG orthology mapping file, that is the -c parameter:
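As a rough sketch of what that last command looks like (the file names are illustrative; the -c mapping file comes from the utility_mapping reference database):
humann_regroup_table \
  -i <sample>_genefamilies.tsv \
  -c <path/to/humann3/utility_mapping/map_ko_uniref90.txt.gz> \
  -o <sample>_ko.tsv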
generate bar plots to visualise the overall biodiversity in your samples
Introduction
When starting a new metagenomics study, it is useful to visualise the microbial diversity present in the individual samples. You can also average the relative abundances of the most abundant species present and observe how they are distributed between groups in your study (e.g., controls versus disease or placebo versus treatment).
For taxonomy features, the biodiversity bar plots can be generated for each taxonomic rank from Phylum to Species.
Example data
The data set used for this tutorial is based on the full data set from the data processing tutorials (Tourlousse, et al., 2022) that was pre-processed with the same pipeline. There are 56 samples in total from the study, but after taxonomy profiling with Kraken2 and Bracken, 4 samples fell below the abundance threshold and therefore the final taxonomy abundance table has 52 samples.
This data set consists of two mock communities: (i) DNA-mock and (ii) Cell-mock processed in three different labs: A, B and C.
Check list
For this set of tutorials, you need
System with Apptainer or Singularity installed (currently only tested on UNIX-based system)
This tutorial assumes the vis-tutorial directory is in your home directory as indicated by the ~ (tilde) sign.
If you have put the files in another location then replace all occurrences of ~/vis-tutorial with your location
Data files explained
Directory / File
Description
metadata.tsv
is a tab-separated text file of the study metadata
There are feature tables for each taxonomy rank from Phylum to Species. As some analysis tools might require feature table input to be discrete counts while others might require relative abundances, two formats are provided with the suffixes:
*_count files have discrete counts of features (row) per samples (columns)
*_relAbund files have the relative abundances of features (row) per samples (columns)
Confirm you have the following 3 files (if they don’t exist, there will be an error message)
ls metadata.tsv
ls Taxonomy_profiling/featureTables/bracken_FT_genus_counts
ls Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund
Step 2) Examine input files
Examine metadata
Examine the metadata.tsv file:
head shows the first 10 rows
column formats the output into columns using tab as the separator (prettier view)
head metadata.tsv | column -t -s $'\t'
Sample_ID biosample experiment instrument library.name sample.name lab type n_species
SRR17380113 SAMN20256237 SRX13554346 Illumina HiSeq 2500 metagenome_mockCell_labC_lib016 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380114 SAMN20256237 SRX13554345 Illumina HiSeq 2500 metagenome_mockCell_labC_lib015 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380115 SAMN20256237 SRX13554344 Illumina HiSeq 2500 metagenome_mockCell_labC_lib014 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380116 SAMN20256237 SRX13554343 Illumina HiSeq 2500 metagenome_mockCell_labC_lib013 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380117 SAMN20256237 SRX13554342 Illumina HiSeq 2500 metagenome_mockCell_labC_lib012 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380118 SAMN20256237 SRX13554341 Illumina HiSeq 2500 metagenome_mockCell_labC_lib011 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380119 SAMN20256237 SRX13554340 Illumina HiSeq 2500 metagenome_mockCell_labC_lib010 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380120 SAMN20256237 SRX13554339 Illumina HiSeq 2500 metagenome_mockCell_labC_lib009 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380121 SAMN20256237 SRX13554338 Illumina HiSeq 2500 metagenome_mockCell_labC_lib008 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
There should be 9 columns in the metadata.tsv file
row 1 is the header and the columns are pretty much self-explanatory
column 1 is the Sample_ID, which must correspond with the sample names in the feature tables
column 8 is the lab where the sample was processed, this will be our grouping factor in later steps
Examine taxonomy feature table
You can have a quick look at the taxonomy input table
this command will show the first 10 rows and 6 columns
head Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund | cut -f1-6 | column -t -s $'\t'
For example, check the sample IDs (columns) are the same as the metadata.tsv
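A quick way to run this check on the command line; this sketch assumes the sample IDs are the column headers of the feature table and the first column of metadata.tsv:
# Sample IDs from the feature table header (the first field is the feature-name column and will show up as a difference)
head -1 Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund | tr '\t' '\n' | sort > table_ids.txt
# Sample IDs from the metadata (first column, skipping the header row)
cut -f1 metadata.tsv | tail -n +2 | sort > metadata_ids.txt
# Show any IDs present in one file but not the other
comm -3 table_ids.txt metadata_ids.txt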
the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
note if you enter the command as one line, remove the backslashes
file path of the study metadata table. Expects a tab-separated text file with one row per sample, where the first row is the header and the columns are the different metadata fields.
4
<column_group:g1:g2>
column name in metadata.tsv that holds the grouping information and the group levels. Format is group_column:group1:group2:.... Note: the values are case sensitive. In this tutorial, the lab column is used and has 3 groups: labA, labB and labC.
5
<output_dir>
output directory path where results are stored
6-1*
<output_prefix>
prefix that will be used for output files. In the example, LAB is set as the prefix for output files.
6-2*
<top_N>
the number of top taxa to colour in the stack bar plot, remaining detected taxa are grouped together as ‘Other’.
6-3*
<figure_width>
output figure width (inches)
6-4*
<figure_height>
output figure height (inches)
*the last 4 parameters are all on line 6 separated by a space
Note! Parameter order matters
The parameters must be in the same order described above.
Step 4) Check output files
You should have the following output files after running Step 3
mean group abundances for the overall top 7 taxa across the entire study
*.average.top_*_separategroup
mean group abundances for the top 7 taxa within each group
*.meta.txt
study metadata specific for the plots
*.table_top_*
sample abundances of the overall top 7 taxa across the entire study
*.table_top_*_separategroup.*
sample abundances of the top 7 taxa within each group
Example plots
Relative abundances across all samples
This plot shows the relative abundances of the top 8 genera for each sample, where the top 8 genera are determined across all samples in the study.
File: LAB.select.top_8_.barchart.pdf
Mean relative abundances by group
This plot shows the mean relative abundances of the top 8 genera occurring across all samples in the study. All other detected taxa not in the top 8 are aggregated as Others. In this example, Pseudomonas E is the most abundant in samples processed by Lab-C.
File: LAB.select.average.top_8_.barchart.pdf
Plots from Phylum to Species
You can repeat Step 3 and change the input file from Phylum to Species feature tables to generate bar plots for the different or desired taxonomic ranks.
Alternatively, if you have other grouping variables in your study design, you can change the column_group parameter in the command.
4.2.2 - Core diversity analysis and visualisation
comparisons of groups in cross-sectional studies
Introduction
This tutorial runs through three common core biodiversity analyses for cross-sectional study designs that include group comparisons.
Alpha-diversity analysis
describes the biodiversity of a single sample/community. There are many metrics such as Richness (number of species detected), Evenness (describing the distribution of abundances across the detected species in a sample), Shannon Diversity (combines richness and abundances) and others.
the pipeline will calculate the alpha-diversity for each sample present in the feature table and then perform hypothesis testing to assess if there are any significant (p < 0.05) differences between groups. The pipeline also generates a set of plots to compare the alpha-diversity between groups.
Beta-diversity analysis
measures the community (dis)similarity between two or more samples/communities. Two common metrics are:
Jaccard (dis)similarity which only considers presence/absences of species
Bray-Curtis dissimilarity which also considers the abundances of the detected species
the pipeline will calculate the pairwise beta-diversities between all samples present in the feature table and then perform PERMANOVA testing using the adonis function in vegan to assess for any significant (p < 0.05) differences between groups. The pipeline also generates a set of plots to help visualise the similarity between samples coloured by the groups using multiple ordinations.
Differential abundance analysis
identifies which features (e.g., species, genes, pathways) are significantly different between groups based on their relative abundances
Example study data
The data set used for this tutorial is based on the full data set from the [data processing tutorials](https://mrcbioinfo.github.io/mima-pipeline/docs/tutorials/data-processing/) (Tourlousse, et al., 2022) that was pre-processed with the same pipeline. There are 56 samples in total from the study, but after taxonomy profiling with Kraken2 and Bracken, 4 samples fell below the abundance threshold and therefore the final taxonomy abundance table has 52 samples.
This data set consists of two mock communities: (i) DNA-mock and (ii) Cell-mock processed in three different labs: A, B and C.
Analysis objective
In this tutorial, our main questions are:
is there a difference between the two types of mock communities? and
are there any differences between the processing labs?
Check list
For this set of tutorials, you need
System with Apptainer or Singularity installed (currently only tested on UNIX-based system)
This tutorial assumes the vis-tutorial directory is in your home directory as indicated by the ~ (tilde) sign.
If you have put the files in another location then replace all occurrences of ~/vis-tutorial with your location.
Data files explained
Directory / File
Description
metadata.tsv
is a tab-separated text file of the study metadata
There are feature tables for each taxonomy rank from Phylum to Species. As some analysis tools might require feature table input to be discrete counts while others might require relative abundances, two formats are provided with the suffixes:
*_count files have discrete counts of features (row) per samples (columns)
*_relAbund files have the relative abundances of features (row) per samples (columns)
Confirm you have the following 3 files (if they don’t exist, there will be an error message)
ls metadata.tsv
ls Taxonomy_profiling/featureTables/bracken_FT_genus_counts
ls Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund
Step 2) Examine input files
Examine metadata
Examine the metadata.tsv file:
head shows the first 10 rows
column formats the output into columns using tab as the separator (prettier view)
head metadata.tsv | column -t -s $'\t'
Sample_ID biosample experiment instrument library.name sample.name lab type n_species
SRR17380113 SAMN20256237 SRX13554346 Illumina HiSeq 2500 metagenome_mockCell_labC_lib016 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380114 SAMN20256237 SRX13554345 Illumina HiSeq 2500 metagenome_mockCell_labC_lib015 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380115 SAMN20256237 SRX13554344 Illumina HiSeq 2500 metagenome_mockCell_labC_lib014 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380116 SAMN20256237 SRX13554343 Illumina HiSeq 2500 metagenome_mockCell_labC_lib013 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380117 SAMN20256237 SRX13554342 Illumina HiSeq 2500 metagenome_mockCell_labC_lib012 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380118 SAMN20256237 SRX13554341 Illumina HiSeq 2500 metagenome_mockCell_labC_lib011 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380119 SAMN20256237 SRX13554340 Illumina HiSeq 2500 metagenome_mockCell_labC_lib010 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380120 SAMN20256237 SRX13554339 Illumina HiSeq 2500 metagenome_mockCell_labC_lib009 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380121 SAMN20256237 SRX13554338 Illumina HiSeq 2500 metagenome_mockCell_labC_lib008 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
There should be 9 columns in the metadata.tsv file
row 1 is the header and the columns are pretty much self-explanatory
column 1 is the Sample_ID, which must correspond with the sample names in the feature tables
column 8 is the lab where the sample was processed, this will be our grouping factor in later steps
Examine taxonomy feature table
You can have a quick look at the taxonomy input table
this command will show the first 10 rows and 6 columns
head Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund | cut -f1-6 | column -t -s $'\t'
For example, check the sample IDs (columns) are the same as the metadata.tsv
the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
note if you enter the command as one line, remove the backslashes
Comma-separated list of column names found in the --metadata file. These columns contain the study groups of interest. Each column should have at least 2 levels for comparisons.
--output-dir <path>
output directory path
In this tutorial, we are performing two sets of comparisons:
comparing between labs (3 levels: A, B, and C)
comparing between mock community type (2 levels: DNA and Cell)
Since we ran 2 sets of comparisons, between lab and type, there are 2 sets of results as indicated by the 2nd part of the filename.
When comparing between labs, the output files with *_lab_* in the filename (highlighted rows), you should have:
4 significant results after multiple-comparison adjustment, using the chao1, chao2, evenness and observed richness diversity indices
3 non-significant results after adjustment using: inverse simpson, shannon and simpson indices
0 results that were significant before adjustment
Sub-directories explained
Directory
Description
Kraken_alpha/boxplots/
boxplots comparing groups defined by the --study-groups parameter. If there are two groups, the Wilcox rank-sum test is performed; if there are more than two groups, the Kruskal-Wallis test is performed.
Comparisons are adjusted for multiple comparisons and results are separated further into sub-directories.
.../boxplots/corrected_sig/
alpha-diversities that show significant group differences after adjustment
.../boxplots/non_sig/
non-significant alpha-diversities after adjustment
.../boxplots/uncorrected_sig/
results that are significant before adjustment
Kraken_alpha/wilcox-pairwise/
text output from Wilcox rank-sum tests for two-group comparisons defined by the --study-groups parameter
File
alpha_diversity_raw_data.txt
tab-separated text file of calculated alpha-diversities (columns) for each sample (row)
*_filter_table.txt
filtered alpha_diversity_raw_data.txt table
*_kw_stat_summary.txt
output from Kruskal-Wallis test when there are > 2 groups being compared
*_pairwise_dunn_test.txt
output from post-hoc pairwise comparisons when there are > 2 groups being compared
Example figures
Comparing Observed Richness between Labs
There are significant differences between Labs A-vs-B, A-vs-C and B-vs-C when comparing their Observed richness (number of species detected in a sample).
Comparing Evenness between Labs
There are significant differences between Labs A-vs-B, A-vs-C and B-vs-C when comparing their Evenness (a measure describing the abundance distribution of the species detected in a sample/community, e.g., their dominance/rareness).
*_results.tsv are text files of results from the PERMANOVA tests
Multiple ordination plots are generated in the plots directory
As we performed 2 sets of comparisons (the --study-groups parameter had the value lab,type), we have 2 files per ordination, with the column name used as the suffix for the output file
Example figures
Principal co-ordinate analysis plot of the beta-diversity measured using Bray-Curtis dissimilarity between samples grouped by the three labs.
tree ~/vis-tutorial/analysis/Kraken_diff-abundance/
warning! there will be a lot of files because we performed 2 sets of comparisons between lab and type. Below only shows a snapshot with ... meaning “and others”
The univariate comparison suggests there are significant lab differences after adjustment for multiple comparisons in the species *Acetivibrio A ethanolgignens*.
Example marker 2
Another example of significant lab differences after adjustment for multiple comparisons, in the species *Bifidobacterium callitrichos*.
bar plot of each input feature and the measured Hodges-Lehmann estimator. Suitable for 2 groups only; if you have >2 groups in your study, this will only compare the first two groups.
Kraken_diff-abundance/boxplots/
one boxplot for every feature (e.g., species, gene, pathway) in the input feature table, comparing between groups specified by the columns in --study-groups.
Comparisons are adjusted for multiple comparisons and results are separated further into sub-directories
.../boxplots/corrected_sig/
features that show significant group differences after adjustment
.../boxplots/non_sig/
non-significant features after adjustment
.../boxplots/uncorrected_sig/
features that are significant before adjustment
Kraken_diff-abundance/volcano/
volcano plots of comparisons, a scatterplot showing the statistical significance (p-value) versus the magnitude of change (fold change).
Kraken_diff-abundance/wilcox-pairwise/
text output from Wilcox rank-sum tests for two-group comparisons defined by the --study-groups parameter
File
*_filter_table.txt
filtered alpha_diversity_raw_data.txt table
*_kw_stat_summary.txt
output from Kruskal-Wallis test when there are > 2 groups being compared
*_pairwise_dunn_test.txt
output from post-hoc pairwise comparisons when there are > 2 groups being compared
4.2.3 - Classification with Random Forest
assessing the classification using Random Forest for cross-sectional studies
Introduction
This tutorial shows how to run Random Forest classification using the MIMA Container.
Briefly, Random Forest is a supervised machine learning approach for classification and regression. It uses labelled data (which samples belong to which group) to learn which key data features are useful in differentiating samples between classes (groups), hence "supervised learning", so that when new data are presented it can make a prediction.
Input data required for training a Random Forest classifier:
one or more feature tables: a text file with features (e.g., species, genes, or pathways) as rows and samples as columns. The cells denote the relative abundance of feature i for sample j.
metadata: informs the classifier which samples belong to which group; samples must belong to mutually exclusive groups.
One common approach to evaluating classifier performance is cross-validation (CV), which splits the input data into a training set and a testing set. MIMA uses the K-fold cross-validation approach: the input data is split into K partitions; one partition is held out for testing and the remaining K-1 partitions are used for training. So with 5-fold CV, the data is split into 5 parts, i.e., an 80/20 split with 80% of the data used for training and 20% used for testing.
In MIMA, K ranges from 3 to 10, incrementing by one each time, so for each input data you will get 8 outputs: cv_3, cv_4, …, cv_9, cv_10.
Our data consists of N=205 stool samples, of which 99 are from females and 106 are from males. Only samples from their first visit are used. The taxonomy profiles have been annotated using the MetaPhlAn2 pipeline.
Analysis objective
Our question is “how well can we distinguish between Sex using the human stool microbiome?”
Check list
For this set of tutorials, you need
System with Apptainer or Singularity installed (currently only tested on UNIX-based system)
There should be 9 tab-separated text files (*.tsv extension), of which one is the metadata and the others are feature tables from Kingdom to Strain ranks.
Note! assumed working directory
This tutorial assumes the randforest-tut directory is in your home directory as indicated by the ~ (tilde) sign.
If you have put the files in another location then replace all occurrences of ~/randforest-tut with your location.
Step 2) Examine input files
Examine the metadata
head HMPv1-stool-samples/hmp1-stool-visit-1.metadata.tsv | column -t
SN RANDSID VISNO STArea STSite SNPRNT Gender WMSPhase SRS
700014562 158458797 1 Gut Stool 700014555 Female 1 SRS011061
700014724 158479027 1 Gut Stool 700014718 Male 1 SRS011084
700014837 158499257 1 Gut Stool 700014832 Male 1 SRS011134
700015181 158742018 1 Gut Stool 700015179 Female 1 SRS011239
700015250 158802708 1 Gut Stool 700015245 Male 1 SRS011271
700015415 158944319 1 Gut Stool 700015394 Female 1 SRS011302
700015981 159247771 1 Gut Stool 700015979 Female 1 SRS011405
700016142 159146620 1 Gut Stool 700016136 Male 1 SRS011452
700016610 159166850 1 Gut Stool 700016608 Male 1 SRS011529
There are 9 columns, where SN is the Sample ID and should correspond with the columns of the feature table
Examine the CLASS feature table
head HMPv1-stool-samples/hmp1-stool-visit-1.features_3-class.tsv | cut -f1-6 | column -t
if you used different data preparation settings, then replace all ~/randforest-tut with your location.
the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
note if you enter the command as one line, remove the backslashes
comma separated list of input text files (e.g. taxonomy abundance tables).
10
-m/--metadata <path>
file path of the study metadata table. Expects a tab-separated text file with one row per sample, where the first row is the header and the columns are the different metadata fields.
11
-c/--column <column_name>
column header from the metadata table that contains the grouping categories for classification
12
-o/--output <output_dir>
output directory path where results are stored
N+1 output
If N input files are provided, there will be N+1 sets of output:
one classifier for each input data
a classifier trained on data_all which combines all the input data together as a big table
In this tutorial, we provided 8 input taxonomy files from Kingdom to Strain level features. This will generate 9 output files: one for each taxonomy rank and the last is a big table combining all the previous 8 together.
Step 4) Check output files
As mentioned above, MIMA uses K-fold cross-validation (CV) to evaluate the trained Random Forest classifier's overall performance. For each input data set there will be 8 outputs, from cv_3 to cv_10.
there is one file per input data set (8 in total, from Kingdom to Strain taxonomy abundance tables) and the last output, data_all, is where all input tables are combined.
Each PDF (data set) will have 8 pages: from 3- to 10-fold CV
cv_auc.pdf
1
9
one page per input data: 8 from Kingdom to Strain and the last page for data_all where all the input tables are combined
roc_classif.randomForest_dataset.pdf
1
8
one page per cross-validation (from 3- to 10-fold CV)
roc_data_*.pdf
9
8
same as above
Example figures
There is one roc_data_*_classifier.pdf file for each input data. Each PDF contains one page per N-fold cross validation (CV) that includes 3-folds to 10-folds CV.
Below we’ve only shown the classification results when using Species (data_7) as input features and for 4 out of 8 CVs.
Receiver operating characteristic curve (ROC curve) is a graph that shows the overall performance of a classification model. Ideally, good classifiers will have high specificity (x-axis) and high sensitivity (y-axis); therefore the desired curve is towards the top-left, above the diagonal line. If the curve follows the diagonal line, then the classifier is no better than randomly guessing the class (group) a sample belongs to.
ROC curve characteristics [source: Wikipedia]
note
In the plots generated by MIMA Random Forest Classification, the x-axis is shown in reverse order from 1 to 0, since the false positive rate is \(FPR = 1-Specificity\)
The area under the ROC curve (AUC) is the overall accuracy of a classifier, ranging between 0-1 (or reported as percentages).
Below shows a subset of the figures from the Species (input 7) abundance table roc_data_7_classifier.pdf. There is one page per cross-validation from 3- to 10-fold CV.
Note the AUC annotation at the top of each figure
This shows the ROC curve for the Species relative abundance table (data_7) when using 3-fold cross-validation.
This figure shows the ROC curve for the Species relative abundance table (data_7) when using 4-fold cross-validation.
This figure shows the ROC curve for the Species relative abundance table (data_7) when using 8-fold cross-validation.
This figure shows the ROC curve for the Species relative abundance table (data_7) when using 10-fold cross-validation.
The cv_auc.pdf file shows an aggregation of the above roc_data_*_classifier.pdf plots. There is one page per input data set and the plots show the AUC against the K-fold CV.
In this plot, we observe that when using Species (data_7) abundances as the feature table, the classifier accuracy ranges between 0.7 to 0.8 with a peak performance occurring at cv_8 (8-fold cross validation). In general, when there is more data for training (higher cv) there is better performance (higher AUC).
This plot compares accuracy (AUC) for Random Forest classifiers trained on the Species abundance tables (input 7 == page 7). The AUC values correspond to the area under the curve of the above roc_data_7_classifier.pdf file.
roc_classif.randomForest_dataset.pdf is a summary PDF with one page per K-fold CV. Each plot shows the ROC curves for each input data set at a specified K-fold CV. Below shows the performance for classifiers trained using 8-fold CV (page 8) across the 8 input tables (Kingdom to Strain) plus the final data table that combines all eight tables together as one big dataset, giving the ninth data set.
This PDF combines the individual plots from the set of roc_data_*_classifier.pdf files.
The plot below tells us that when using 8-fold CV for training, the best performing classifier to distinguish between males and females in this set is data_7, with an accuracy of 0.787 (or 78.7%). This classifier was trained using the Species rank abundance table. The Strain level is comparable with an accuracy of 0.779, while the Kingdom classifier performs very poorly (accuracy = 0.519), which is pretty much random guessing. There was no improvement to the classifier if we combined all the features (data_9) together.
The ROC curves of Random Forest classifiers trained using 8-fold cross-validation (cv_8) for input data from Kingdom (data_1) to Strain (data_8) and the combined dataset which merges all input data as one table (data_9). When using 8-fold CV the best performing classifier has accuracy of 0.787 using the Species input data (data_7). See text description for further interpretation.