core biodiversity analysis (e.g., alpha-diversity, beta-diversity and differential abundance analyses) and comparisons between groups
classification analysis
Getting started
If you are new to shotgun metagenomics processing, or have not heard of Apptainer/Singularity before, then start with “What is a Container Image?”
1 - What is a Container Image?
A brief introduction to container images
…in a nutshell
A container image is an image file that packages code (software applications) together with its dependencies (e.g., a specific operating system, system libraries) so that it can be deployed consistently as a container in any environment.
Because the image is immutable once built, deployments are consistent and repeatable. For example, say an image was built with Ubuntu as the operating system and you deploy this image on your Windows 11 PC (the host). An Ubuntu “container” is created and the programs inside this container run under the Ubuntu settings.
Different platforms are available for building and deploying these container images, for example, Apptainer, SingularityCE or Docker.
For the MIMA Pipeline we use Apptainer.
Apptainer == Singularity
We recently changed from Singularity to Apptainer; any overlooked mention of ‘Singularity’ in the tutorials can be read as ‘Apptainer’.
Noteworthy points
Build a Sandbox
Every time you deploy a container from an image file (*.sif extension), the platform needs to unpack the system files. If the container is complex and has many dependencies or programs, this can be time consuming.
You can bypass this by building a sandbox, which unpacks the files once into a directory (a container in a directory) that you then reuse. This removes the need to unpack on every deployment.
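For example, a sandbox can be built from the image file with Apptainer's build --sandbox option; a minimal sketch (the image filename follows the naming scheme described in the Installation section and is illustrative here):
# Unpack the image once into a sandbox directory named mima-pipeline
apptainer build --sandbox mima-pipeline mimaX_vX.X.X_yyyyddmm.sif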
Containers need explicit instructions to access your files
A container is deployed with the bare minimum filesystem to operate, meaning that it only has access to your home directory by default.
It does not automatically have access to other filesystems or disks. If your data is located on another drive or path, then you need to explicitly provide this information using either
Option 1) -B parameter: the binding only exists during the life of the container
apptainer run -B /home/jsmith/studyABC:/study/data,/srv/shared/refDB:/reference/data
Option 2) APPTAINER_BIND environment variable: the binding exists during your terminal session (covers multiple deployed containers)
The path binding format is <host-PC>:<container>, where:
Left of the colon (:) is the absolute path of the directory found on the host PC
Right of the colon (:) is the path that the Container sees when it’s deployed
The above example:
Path on Host PC (left)
Path in Container (right)
Description
/home/jsmith/studyABC
/study/data
The studyABC directory on the host PC is in the home directory of user jsmith (/home/jsmith). Applications running in the Container can access the data from /study/data
/srv/shared/refDB
/reference/data
The refDB directory on the host PC is in a shared location, /srv/shared. Applications running in the Container can access this data from /reference/data
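The same bindings can instead be set once per terminal session using the APPTAINER_BIND environment variable (Option 2); a minimal sketch using the paths from the example above:
# Set once; all containers deployed in this terminal session inherit these bindings
export APPTAINER_BIND=/home/jsmith/studyABC:/study/data,/srv/shared/refDB:/reference/data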
Shortcuts/softlinks also need binding
If you have shortcuts or softlinks in your home directory pointing to elsewhere, you also need to bind the original locations using the -B parameter or the environment variable setting.
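For example, if ~/refdb in your home directory is a softlink, find its real target and bind that location (the paths below are illustrative):
# Find the real location the softlink points to
readlink -f ~/refdb
# e.g., prints /srv/shared/refDB; bind that original location so the container can follow the link
export APPTAINER_BIND=/srv/shared/refDB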
2 - Installation
MIMA Container image
All tools required by the MIMA pipeline are encapsulated into a Container image file with the mimaX_vX.X.X_yyyyddmm.sif naming scheme, where the X's denote version numbers.
The latest MIMA container version (mima4_v2.0.0) supports the HUMAnN v3.8 and MetaPhlAn v4.1.0 tools and their reference databases (older reference databases will not work).
We assume you are on an HPC environment with a job scheduling system. Many HPC environments will not allow running containers from the login or head node, so you need to start an interactive PBS job first.
In OpenPBS, specify the following to request an interactive job with 4 CPUs and 4GB RAM for 6 hours
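A typical OpenPBS request for such an interactive session looks like the following; the exact resource syntax and queue names may differ on your system:
qsub -I -l select=1:ncpus=4:mem=4gb -l walltime=06:00:00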
Create a SANDBOX environment variable to store the full directory path (saves typing long paths!)
export SANDBOX=`pwd`/mima-pipeline
Confirm installation
Test that the SANDBOX environment variable is working
If this command does not work, check your SANDBOX environment variable using echo $SANDBOX, which should output the path where you built the sandbox
apptainer run $SANDBOX
Below is the output; check that the line active environment : mima matches what is shown below
----
source: /opt/miniconda/envs/mima/etc/conda/activate.d/activate-binutils_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/activate.d/activate-gcc_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/activate.d/activate-gfortran_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/activate.d/activate-gxx_linux-64.sh:10:40: parameter expansion requires a literal
Auto-activate MIMA conda environment
source: /opt/miniconda/envs/mima/etc/conda/deactivate.d/deactivate-gxx_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/deactivate.d/deactivate-gfortran_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/deactivate.d/deactivate-gcc_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/deactivate.d/deactivate-binutils_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/activate.d/activate-binutils_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/activate.d/activate-gcc_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/activate.d/activate-gfortran_linux-64.sh:10:40: parameter expansion requires a literal
source: /opt/miniconda/envs/mima/etc/conda/activate.d/activate-gxx_linux-64.sh:10:40: parameter expansion requires a literal
----
This is the MIMA pipeline Container
v2.0.0 - build: 20240409
active environment : mima
active env location : /opt/miniconda/envs/mima
shell level : 1
user config file : /home/z3534482/.condarc
populated config files : /home/z3534482/.condarc
conda version : 24.1.2
conda-build version : not installed
python version : 3.12.1.final.0
solver : libmamba (default)
virtual packages : __archspec=1=zen2
__conda=24.1.2=0
__glibc=2.35=0
__linux=4.18.0=0
__unix=0=0
base environment : /opt/miniconda (read only)
conda av data dir : /opt/miniconda/etc/conda
conda av metadata url : None
channel URLs : https://conda.anaconda.org/biobakery/linux-64
https://conda.anaconda.org/biobakery/noarch
https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
https://conda.anaconda.org/bioconda/linux-64
https://conda.anaconda.org/bioconda/noarch
https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
https://conda.anaconda.org/default/linux-64
https://conda.anaconda.org/default/noarch
package cache : /opt/miniconda/pkgs
/home/z3534482/.conda/pkgs
envs directories : /home/z3534482/.conda/envs
/opt/miniconda/envs
platform : linux-64
user-agent : conda/24.1.2 requests/2.31.0 CPython/3.12.1 Linux/4.18.0-513.24.1.el8_9.x86_64 ubuntu/22.04.4 glibc/2.35 solver/libmamba conda-libmamba-solver/23.12.0 libmambapy/1.5.3
UID:GID : 13534482:5000
netrc file : None
offline mode : False
Python 3.10.8
Rscript (R) version 4.2.1 (2022-06-23)
humann v3.8
MetaPhlAn version 4.1.0 (23 Aug 2023)
java -ea -Xmx97655m -Xms97655m -cp /opt/miniconda/envs/mima/opt/bbmap-39.01-1/current/ clump.Clumpify --version
BBMap version 39.01
For help, please run the shellscript with no parameters, or look in /docs/.
fastp 0.23.4
2.28-r1209
Kraken version 2.1.3
Copyright 2013-2023, Derrick Wood (dwood@cs.jhu.edu)
Reminder
By default, containers are deployed with minimal filesystem access and you might need to bind paths
Many steps in the pipeline require access to reference databases. These reference databases are very big and are often already downloaded by the system administrators. As such they are not included in the container images.
To run the pipeline you need to know the absolute paths for the below reference databases.
minimap2
kraken2 and bracken
humann
CHOCOPhlAn
uniref
metaphlan CHOCOPhlAn database
You might also need to set up path binding when deploying the containers.
If you are missing any reference datasets, see below for download information.
QC: decontamination step
Tool
Description
URL
Minimap2
requires a host reference genome; we used the human reference genome GRCh38.p14 from NCBI (~800MB)
Tourlousse, et al. (2022), Characterization and Demonstration of Mock Communities as Control Reagents for Accurate Human Microbiome Community Measures, Microbiology Spectrum.
This data set consists of two mock communities: DNA-mock and Cell-mock. The mock communities consist of bacteria that are mainly detected in the human gastrointestinal tract ecosystem, with a small mixture of skin microbiota. The data was processed in three different labs: A, B and C. In the previous tutorial, we only processed a subset of the samples (n=9). In this tutorial we will be working with the full data set, which has been pre-processed using the same pipeline. In total there were 56 samples, of which 4 fell below the abundance threshold, and therefore the final taxonomy abundance table has 52 samples. We will train the random forest classifier to distinguish between the three labs.
The raw reads are available from NCBI SRA Project PRJNA747117, and there are 56 paired-end samples (112 fastq files)
As the data set is very large, we will work with a subset (n=9 samples, 18 fastq files)
This tutorial teaches you how to prepare the required raw fastq files
Note
You will need about 80GB of disk space depending on which option you used for downloading.
This tutorial assumes the ~/mima_tutorial directory is located in your home directory as indicated by the ~ (tilde) sign. If you have put the files in another location then replace all occurrences of ~/mima_tutorial with your location (remember to use absolute paths).
Below is the output, wait until all files are downloaded
2022-09-08T05:50:42 prefetch.3.0.0: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2022-09-08T05:50:42 prefetch.3.0.0: 1) Downloading 'SRR17380209'...
2022-09-08T05:50:42 prefetch.3.0.0: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.
2022-09-08T05:50:42 prefetch.3.0.0: Downloading via HTTPS...
...
After the download finishes, check the downloaded files with the tree command
the first row is the header and is case sensitive; it must have the three columns: Sample_ID,FileID_R1,FileID_R2
the filenames in columns 2 and 3 do not need to be absolute paths as the directory where the files are located will be specified during quality checking
After downloading the tutorial data, we assume that mima_tutorial is the working directory located in your home directory (indicated by the tilde, ~). Hence, always make sure you are in the right directory before executing a command; for example, run the following commands:
$ cd ~/mima_tutorial
$ tree .
the starting directory structure for mima_tutorial should look something like:
From here on, ~/mima_tutorial will refer to the project directory as depicted above. Replace this path if you saved the tutorial data in another location.
TIP: Text editors
There are several places where you may need to edit the commands, scripts or files. You can use the vim text editor to edit text files directly in the terminal.
For example, the command below lets you edit the pbs_header_qc.cfg text file
vim pbs_header_qc.cfg
Containers and binding paths
When deploying images, make sure you check if you need to bind any paths.
PBS configuration files
The three modules (QC, Taxonomy profiling and Function profiling) in the data-processing pipeline require access to a job queue and instructions about the resources required for the job, for example, the number of CPUs, the RAM size and the time required for execution. These parameters are defined in PBS configuration text files.
Three such configuration files are provided with the tutorial data, one for each module, as the modules require different PBS settings (indicated by lines starting with the #PBS tag).
APPTAINER_BIND is the environment variable you set when binding file paths to the container.
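As a rough sketch, such a configuration file might contain resource request lines like the following; the values are illustrative only, so check the files shipped with the tutorial data and adjust them for your system:
#PBS -N mima-qc
#PBS -l select=1:ncpus=8:mem=64gb
#PBS -l walltime=12:00:00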
Use absolute paths
When running the pipeline it is best to use full paths when specifying the locations of input files, output files and reference data to avoid any ambiguity.
Absolute/Full paths
always start with the root directory, indicated by the forward slash (/) on Unix-based systems.
e.g., the command below changes directory (cd) to a folder named scripts that is located in the user jsmith’s home directory. Provided this folder exists, this command will work from anywhere on the filesystem.
[~/metadata] $ cd /home/jsmith/scripts
Relative paths
are relative to the current working directory
Now imagine the following file system structure in the user john_doe’s home directory
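A minimal sketch of such a layout (only the directories needed for this example are shown):
/home/user/john_doe/
└── studyABC/
    └── metadata/   (*)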
The asterisk marks his current location, which is inside the /home/user/john_doe/studyABC/metadata sub-folder
In this example we are currently in the metadata directory, and change directory to a folder named data that is located in the parent directory (..)
This command only works provided there is a data folder in the parent directory above metadata
According to the filesystem above, the parent directory of metadata is studyABC and there is no data subfolder in this directory, so this command will fail with an error message
[/home/user/john_doe/studyABC/metadata] $ cd ../data
-bash: cd: ../data: No such file or directory
Now that you have installed the data and know the basics, you can begin data processing with quality control.
4.1.3 - Quality control (QC)
QC module in MIMA: check that reads are of a high standard
This step must be done before Taxonomy and Function profiling.
Introduction: QC module
Quality control (QC) checks that the sequenced reads obtained from the sequencing machine are of good quality. Poor-quality reads or artefacts due to sequencing errors, if not removed, can lead to spurious results and affect downstream analyses. A number of tools are available for checking the read quality of high-throughput sequencing data.
This module generates one bash (*.sh) script per sample that performs the following steps. The module also generates N PBS scripts (N is set using the --num-pbs-jobs parameter).
This tutorial has 9 samples spread across 3 PBS jobs, so each PBS job executes three bash scripts, one per sample.
Step
1. repair.sh from the BBTools/BBMap tool suite repairs the sequenced reads and separates out any singleton reads (orphaned reads that are missing either the forward or reverse partner read) into their own text file.
2. dereplication using clumpify.sh from the BBTools/BBMap tool suite removes duplicate reads and clusters reads for faster downstream processing
3. quality checking with fastp.sh checks the quality of the reads and removes any reads that are of low quality, too long or too short
4. decontamination with minimap2.sh maps the sequenced reads against a reference genome (e.g., human) defined by the user, removes these host reads from the data and generates a cleaned output file. This becomes the input for the taxonomy profiling and function profiling modules.
The (optional) last step generates a QC report text file with summary statistics for each file. This step is run after all PBS scripts have been executed for all samples in the study.
Find the absolute path to the host reference genome file (e.g., Human host might use GRCh38_latest_genome.fna)
Replace the highlighted line --ref <path/to/human/GRCh38_latest_genome.fna> with the location of your host reference genome file
Enter the following command
the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
note if you enter the command as one line, remove the backslashes
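Putting this together, the QC-module command takes roughly the following shape. The --app name below is a placeholder for illustration (use the QC app name from your MIMA installation), and the input, manifest and configuration paths assume the tutorial layout; adjust them for your setup:
apptainer run --app <mima-QC-app> $SANDBOX \
  -i <absolute/path/to/raw-fastq-files> \
  -o ~/mima_tutorial/output \
  -m ~/mima_tutorial/manifest.csv \
  --num-pbs-jobs 3 \
  --ref <path/to/human/GRCh38_latest_genome.fna> \
  --mode container \
  --pbs-config ~/mima_tutorial/pbs_header_qc.cfg
The parameters are explained below.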
-i <input>
yes
must be the absolute path to the directory where the fastq files are stored. FastQ files often have the *.fastq.gz or *.fq.gz extension. This path is used to find the files named in the FileID_R1 and FileID_R2 columns of the manifest.csv file
-o <output>
yes
absolute path to the output directory where results will be saved. The <output> path will be created if it does not exist.
Note: if there are already existing subdirectories in the <output> path, then this step will fail.
-m <manifest.csv>
yes
a comma-separated file (*.csv) that has three columns with the headers: Sample_ID, FileID_R1, FileID_R2 (see the example below).
Note: the FileID_R1 and FileID_R2 filenames are relative to the -i <input> path provided.
--num-pbs-jobs
no (default=4)
number of PBS scripts to generate; by default 4 jobs are created with samples split equally between the jobs
--ref
yes
path to the host genome file GRCh38_latest_genome.fna
--mode container
no (default=‘single’)
set this if you are running as a Container
--pbs-config
yes if --mode container
path to the pbs configuration file, must specify this parameter if --mode container is set
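For reference, a manifest.csv for this tutorial might look like the following; the file names are illustrative and should match the actual fastq files in your input directory:
Sample_ID,FileID_R1,FileID_R2
SRR17380209,SRR17380209_1.fastq.gz,SRR17380209_2.fastq.gz
...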
tip
These commands can get really long; you can compose them in a text editor (e.g., Notepad) first and then paste the full command into the terminal.
cd ~/mima_tutorial/output/QC_module
qsub qcModule_0.pbs
You can check that the jobs have been submitted and their status with qstat
Repeat the qsub command for each of the qcModule_1.pbs and qcModule_2.pbs files
Wait until all PBS jobs have completed
PBS log files
Usually a PBS job will generate a log file in the same directory from where the job was submitted
This is why we changed directory to where the PBS scripts are saved before submitting the jobs: the log files are required for Step 5, generating the QC report.
Step 4. Check QC outputs
After all PBS jobs complete, you should have the following outputs
tree ~/mima_tutorial/output/QC_module
We only show outputs for one sample below with ... meaning “and others”
You’ll have a set of output files for each sample listed in the manifest.csv (provided the corresponding *.fastq files exist)
assign reads to taxa, generating a taxonomy feature table ready for analysis
Introduction to Taxonomy Profiling
Taxonomy profiling takes the quality controlled (clean) sequenced reads as input and matches them against a reference database of previously characterised sequences for taxonomy classification.
There are many different classification tools, for example: Kraken2, MetaPhlAn, Clark, Centrifuge, MEGAN, and many more.
This pipeline uses Kraken2 with Bracken for abundance estimation. Both require access to a reference database, or you can generate your own reference data. In this pipeline, we use the GTDB database (release 95), from which we have built a Kraken2 database.
Workflow description
Kraken2 requires a large amount of memory (~300GB) to run, so only one PBS script is generated and it executes each sample sequentially.
In this tutorial, we have 9 samples which will be executed within one PBS job.
Steps in the taxonomy profiling module:
Step
Kraken2 assigns each read to a lowest common ancestor by matching the sequenced reads against a reference database of previously classified species
Bracken takes the Kraken2 output to estimate abundances for a given taxonomic rank. This is repeated from Phylum to Species rank.
Generate feature table is performed after all samples have been assigned to taxa and their abundances estimated. This step combines the output from all samples and generates a feature table for a given taxonomic rank. The feature table contains discrete counts and relative abundances of "taxon X occurring in sample Y".
Step 1. Generate PBS script
Before starting, have you understood the need to know points?
After QC checking your sequence samples and generating a set of CleanReads
Find the absolute paths for the Kraken2 and Bracken reference databases on your HPC system (!! they need to be in the same directory)
Replace the highlighted line --reference-path <path/to/Kraken2_db> with the Kraken2 absolute path
the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
note if you enter the command as one line, remove the backslashes
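Putting this together, the taxonomy-profiling command takes roughly the following shape; the --app name is a placeholder for illustration (use the taxonomy-profiling app name from your MIMA installation) and the database and configuration paths must be replaced with the locations on your system:
apptainer run --app <mima-taxonomy-app> $SANDBOX \
  -i ~/mima_tutorial/output/QC_module/CleanReads \
  -o ~/mima_tutorial/output \
  --reference-path <path/to/Kraken2_db> \
  --read-length 150 \
  --threshold 1000 \
  --mode container \
  --pbs-config <path/to/pbs_config_taxonomy.cfg>
The parameters are explained below.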
-i <input>
yes
full path to the ~/mima_tutorial/output/QC_module/CleanReads directory that was generated from Step 1) QC, above. This directory should hold all the *_clean.fastq files
-o <output>
yes
full path to the ~/mima_tutorial/output directory where you would like the output files to be saved, can be the same as Step 1) QC
--reference-path
yes
full path to the reference database (this pipeline uses the GTDB release 95 reference database)
--read-length
no (default=150)
read length for Bracken estimation, choose the value closest to your sequenced read length (choose from 50, 75, 100 and 150)
--threshold
no (default=1000)
Bracken filtering threshold; features with counts below this value are filtered out of the abundance estimation
--mode container
no (default=single)
set this if you are running as a Container. By default, the PBS scripts generated are for the ‘standalone’ option, that is without Singularity
--pbs-config
yes if --mode container
path to the pbs configuration file (see below). You must specify this parameter if --mode container is set. You do not need to set this parameter if running outside of Singularity
If you changed the file extension of the cleaned files or are working with already cleaned files from somewhere else, you can specify the forward and reverse suffix using:
Parameter
Required?
Description
--fwd-suffix
default=_clean_1.fq.gz
file suffix for cleaned forward reads from QC module
--rev-suffix
default=_clean_2.fq.gz
file suffix for cleaned reverse reads from QC module
Step 2. Check generated scripts
After Step 1, you should see in the output directory: ~/mima_tutorial/output/Taxonomy_profiling
one PBS script
one bash script/sample (total of 9 bash scripts in this tutorial)
Have a look at the directory structure using tree
tree ~/mima_tutorial/output/Taxonomy_profiling
Expected directory structure
bracken/ and kraken2/ are subdirectories created by Step 2 to store the output files after PBS job is executed
this is the directory that contains the *_bracken.log files, by default these should be in the output/Taxonomy_profiling directory
update this path if you have the *_bracken.log files in another location
apptainer run --app mima-bracken-report $SANDBOX \
  -i ~/mima_tutorial/output/Taxonomy_profiling
The output is a tab separated text file
By default, the output file bracken-summary.tsv will be in the same location as the input directory. You can override this using the -o parameter to specify the full path to the output file.
Examine the output report using head
column formats the text file as columns like a table for readability
head ~/mima_tutorial/output/Taxonomy_profiling/bracken-summary.tsv | column -t
Your output should resemble something like below (only the first 9 columns are shown in the example below)
numbers might be different depending on the threshold setting or tool version
assign reads to gene families and pathways, generating a function feature table ready for analysis
Introduction to Functional Profiling
Functional profiling, like taxonomy profiling, takes the cleaned sequenced reads after QC checking as input and matches them against a reference database of previously characterised gene sequences.
There are different types of functional classification tools available. This pipeline uses the HUMAnN3 pipeline, which also comes with its own reference databases.
Workflow description
One PBS script per sample is generated by this module. In this tutorial, we have 9 samples, so there will be 9 PBS scripts.
Steps in the functional profiling module:
Steps
HUMAnN3 is used for processing and generates three outputs for each sample:
gene families
pathway abundances
pathway coverage
Generate feature table is performed after all samples have been processed. This combines the output and generates a feature table that contains the abundance of gene/pathway X in sample Y.
Step 1. Generate PBS scripts
Before starting, have you understood the need to know points?
After QC checking your sequence samples and generating a set of CleanReads
Replace the highlighted lines with the absolute path for the reference databases
path/to/humann3/chocophlan
path/to/humann3/uniref
path/to/humann3/utility_mapping
path/to/metaphlan_databases
the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
note if you enter the command as one line, remove the backslashes
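Putting this together, the function-profiling command takes roughly the following shape; the --app name is a placeholder for illustration (use the function-profiling app name from your MIMA installation) and the reference database paths must be replaced with the locations on your system:
apptainer run --app <mima-function-app> $SANDBOX \
  -i ~/mima_tutorial/output/QC_module/CleanReads \
  -o ~/mima_tutorial/output \
  --nucleotide-database <path/to/humann3/chocophlan> \
  --protein-database <path/to/humann3/uniref> \
  --utility-database <path/to/humann3/utility_mapping> \
  --metaphlan-options "++bowtie2db <path/to/metaphlan_databases>" \
  --mode container \
  --pbs-config <path/to/pbs_config_function.cfg>
The parameters are explained below.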
full path to the ~/mima_tutorial/output/QC_module/CleanReads directory that was generated from Step 1) QC, above. This directory should hold all the *_clean.fastq files
-o <output>
yes
full path to the ~/mima_tutorial/output directory where you would like the output files to be saved, can be the same as Step 1) QC
--nucleotide-database <path>
yes
directory containing the nucleotide database, (default=/refdb/humann/chocophlan)
--protein-database <path>
yes
directory containing the protein database, (default=/refdb/humann/uniref)
--utility-database <path>
yes
directory containing the utility mapping database (default=/refdb/humann/utility_mapping)
--metaphlan-options <string>
yes
additional MetaPhlAn settings, like specifying the bowtie2 reference database (default="++bowtie2db /refdb/humann/metaphlan_databases"). Enclose parameters in double quotes and use ++ for parameter settings.
--mode container
no (default=‘single’)
set this if you are running in the Container mode
--pbs-config
yes if --mode container
path to the pbs configuration file (see below). You must specify this parameter if --mode container is set. You do not need to set this parameter if running outside of Singularity
Your PBS script should look something like below, with some differences
line 10: IMAGE_DIR should be where you installed MIMA and built the sandbox
line 11: APPTAINER_BIND should be set up during installation when binding paths
make sure to include the path where the host reference genome file is located
line 14: /home/user is replaced with the absolute path to your actual home directory
lines 22-25: ensure the paths to the reference databases are all replaced (the example below uses the vJan21 database version, and the index parameter is passed to MetaPhlAn to disable the automatic index check)
In the file generate_func_feature_tables.sh, the last step in the humann_regroup_table command uses the KEGG orthology mapping file, that is the -c parameter:
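As a rough sketch of what that last command looks like (the file names are illustrative; the -c mapping file comes from the utility_mapping reference database):
humann_regroup_table \
  -i <sample>_genefamilies.tsv \
  -c <path/to/humann3/utility_mapping/map_ko_uniref90.txt.gz> \
  -o <sample>_ko.tsv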
generate bar plots to visualise the overall biodiversity in your samples
Introduction
When starting a new metagenomics study, it is useful to visualise the microbial diversity present in the individual samples. You can also average the relative abundances of the most abundant species present and observe how they are distributed between groups in your study (e.g., controls versus disease or placebo versus treatment).
For taxonomy features, the biodiversity bar plots can be generated for each taxonomic rank from Phylum to Species.
Example data
The data set used for this tutorial is based on the full data set from the data processing tutorials (Tourlousse, et al., 2022) that was pre-processed with the same pipeline. There are 56 samples in total from the study, but after taxonomy profiling with Kraken2 and Bracken, 4 samples fell below the abundance threshold and therefore the final taxonomy abundance table has 52 samples.
This data set consists of two mock communities: (i) DNA-mock and (ii) Cell-mock processed in three different labs: A, B and C.
Check list
For this set of tutorials, you need
System with Apptainer or Singularity installed (currently only tested on UNIX-based system)
This tutorial assumes the vis-tutorial directory is in your home directory as indicated by the ~ (tilde) sign.
If you have put the files in another location then replace all occurrences of ~/vis-tutorial with your location
Data files explained
Directory / File
Description
metadata.tsv
is a tab-separated text file of the study metadata
There are feature tables for each taxonomy rank from Phylum to Species. As some analysis tools might require feature table input to be discrete counts while others might require relative abundances, two formats are provided with the suffixes:
*_count files have discrete counts of features (row) per samples (columns)
*_relAbund files have the relative abundances of features (row) per samples (columns)
Confirm you have the following 3 files (if they don’t exist, there will be an error message)
ls metadata.tsv
ls Taxonomy_profiling/featureTables/bracken_FT_genus_counts
ls Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund
Step 2) Examine input files
Examine metadata
Examine the metadata.tsv file:
head shows the first 10 rows
column formats the output into columns using tab as the separator (prettier view)
head metadata.tsv | column -t -s $'\t'
Sample_ID biosample experiment instrument library.name sample.name lab type n_species
SRR17380113 SAMN20256237 SRX13554346 Illumina HiSeq 2500 metagenome_mockCell_labC_lib016 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380114 SAMN20256237 SRX13554345 Illumina HiSeq 2500 metagenome_mockCell_labC_lib015 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380115 SAMN20256237 SRX13554344 Illumina HiSeq 2500 metagenome_mockCell_labC_lib014 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380116 SAMN20256237 SRX13554343 Illumina HiSeq 2500 metagenome_mockCell_labC_lib013 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380117 SAMN20256237 SRX13554342 Illumina HiSeq 2500 metagenome_mockCell_labC_lib012 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380118 SAMN20256237 SRX13554341 Illumina HiSeq 2500 metagenome_mockCell_labC_lib011 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380119 SAMN20256237 SRX13554340 Illumina HiSeq 2500 metagenome_mockCell_labC_lib010 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380120 SAMN20256237 SRX13554339 Illumina HiSeq 2500 metagenome_mockCell_labC_lib009 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380121 SAMN20256237 SRX13554338 Illumina HiSeq 2500 metagenome_mockCell_labC_lib008 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
There should be 9 columns in the metadata.tsv file
row 1 is the header and the columns are pretty much self-explanatory
column 1 is the Sample_ID, which must correspond with the sample names in the feature tables
column 8 is the lab where the sample was processed, this will be our grouping factor in later steps
Examine taxonomy feature table
You can have a quick look at the taxonomy input table
this command will show the first 10 rows and 6 columns
head Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund | cut -f1-6 | column -t -s $'\t'
For example, check the sample IDs (columns) are the same as the metadata.tsv
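A quick way to run this check on the command line; this sketch assumes the sample IDs are the column headers of the feature table and the first column of metadata.tsv:
# Sample IDs from the feature table header (the first field is the feature-name column and will show up as a difference)
head -1 Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund | tr '\t' '\n' | sort > table_ids.txt
# Sample IDs from the metadata (first column, skipping the header row)
cut -f1 metadata.tsv | tail -n +2 | sort > metadata_ids.txt
# Show any IDs present in one file but not the other
comm -3 table_ids.txt metadata_ids.txt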
the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
note if you enter the command as one line, remove the backslashes
file path of the study metadata table. Expects a tab-separated text file with one row per sample, where the first row is the header and the columns are the different metadata fields.
4
<column_group:g1:g2>
column name in metadata.tsv that holds the grouping information and the group levels. Format is group_column:group1:group2:.... Note: the values are case sensitive. In this tutorial, the lab column is used and has 3 groups: labA, labB and labC.
5
<output_dir>
output directory path where results are stored
6-1*
<output_prefix>
prefix that will be used for output files. In the example, LAB is set as the prefix for output files.
6-2*
<top_N>
the number of top taxa to colour in the stack bar plot, remaining detected taxa are grouped together as ‘Other’.
6-3*
<figure_width>
output figure width (inches)
6-4*
<figure_height>
output figure height (inches)
*the last 4 parameters are all on line 6 separated by a space
Note! Parameter order matters
The parameters must be in the same order described above.
Step 4) Check output files
You should have the following output files after running Step 3
mean group abundances for the overall top 7 taxa across the entire study
*.average.top_*_separategroup
mean group abundances for the top 7 taxa within each group
*.meta.txt
study metadata specific for the plots
*.table_top_*
sample abundances of the overall top 7 taxa across the entire study
*.table_top_*_separategroup.*
sample abundances of the top 7 taxa within each group
Example plots
Relative abundances across all samples
This plot shows the relative abundances of the top 8 genera for each sample, where the top 8 genera are determined across all samples in the study.
File: LAB.select.top_8_.barchart.pdf
Mean relative abundances by group
This plot shows the mean relative abundances of the top 8 genera occurring across all samples in the study. All other detected taxa not in the top 8 are aggregated as Others. In this example, Pseudomonas E is the most abundant in samples processed by Lab-C.
File: LAB.select.average.top_8_.barchart.pdf
Plots from Phylum to Species
You can repeat Step 3 and change the input file from Phylum to Species feature tables to generate bar plots for the different or desired taxonomic ranks.
Alternatively, if you have other grouping variables in your study design, you can change the column_group parameter in the command.
4.2.2 - Core diversity analysis and visualisation
comparisons of groups in cross-sectional studies
Introduction
This tutorial runs through three common core biodiversity analyses for cross-sectional study designs that include group comparisons.
Alpha-diversity analysis
describes the biodiversity of a single sample/community. There are many metrics such as Richness (number of species detected), Evenness (describing the distribution of abundances across the detected species in a sample), Shannon Diversity (combines richness and abundances) and others.
the pipeline will calculate the alpha-diversity for each sample present in the feature table and then perform hypothesis testing to assess if there are any significant (p < 0.05) differences between groups. The pipeline also generates a set of plots to compare the alpha-diversity between groups.
Beta-diversity analysis
measures the community (dis)similarity between two or more samples/communities. Two common metrics are:
Jaccard (dis)similarity which only considers presence/absences of species
Bray-Curtis dissimilarity which also considers the abundances of the detected species
the pipeline will calculate the pairwise beta-diversities between all samples present in the feature table and then perform PERMANOVA testing using the adonis function in vegan to assess for any significant (p < 0.05) differences between groups. The pipeline also generates a set of plots to help visualise the similarity between samples coloured by the groups using multiple ordinations.
Differential abundance analysis
identifies which features (e.g., species, genes, pathways) are significantly different between groups based on their relative abundances
Example study data
The data set used for this tutorial is based on the full data set from the [data processing tutorials](https://mrcbioinfo.github.io/mima-pipeline/docs/tutorials/data-processing/) (Tourlousse, et al., 2022) that was pre-processed with the same pipeline. There are 56 samples in total from the study, but after taxonomy profiling with Kraken2 and Bracken, 4 samples fell below the abundance threshold and therefore the final taxonomy abundance table has 52 samples.
This data set consists of two mock communities: (i) DNA-mock and (ii) Cell-mock processed in three different labs: A, B and C.
Analysis objective
In this tutorial, our main questions are:
is there a difference between the two types of mock communities? and
are there any differences between the processing labs?
Check list
For this set of tutorials, you need
System with Apptainer or Singularity installed (currently only tested on UNIX-based system)
This tutorial assumes the vis-tutorial directory is in your home directory as indicated by the ~ (tilde) sign.
If you have put the files in another location then replace all occurrences of ~/vis-tutorial with your location.
Data files explained
Directory / File
Description
metadata.tsv
is a tab-separated text file of the study metadata
There are feature tables for each taxonomy rank from Phylum to Species. As some analysis tools might require feature table input to be discrete counts while others might require relative abundances, two formats are provided with the suffixes:
*_count files have discrete counts of features (row) per samples (columns)
*_relAbund files have the relative abundances of features (row) per samples (columns)
Confirm you have the following 3 files (if they don’t exist, there will be an error message)
ls metadata.tsv
ls Taxonomy_profiling/featureTables/bracken_FT_genus_counts
ls Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund
Step 2) Examine input files
Examine metadata
Examine the metadata.tsv file:
head shows the first 10 rows
column formats the output into columns using tab as the separator (prettier view)
head metadata.tsv | column -t -s $'\t'
Sample_ID biosample experiment instrument library.name sample.name lab type n_species
SRR17380113 SAMN20256237 SRX13554346 Illumina HiSeq 2500 metagenome_mockCell_labC_lib016 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380114 SAMN20256237 SRX13554345 Illumina HiSeq 2500 metagenome_mockCell_labC_lib015 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380115 SAMN20256237 SRX13554344 Illumina HiSeq 2500 metagenome_mockCell_labC_lib014 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380116 SAMN20256237 SRX13554343 Illumina HiSeq 2500 metagenome_mockCell_labC_lib013 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380117 SAMN20256237 SRX13554342 Illumina HiSeq 2500 metagenome_mockCell_labC_lib012 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380118 SAMN20256237 SRX13554341 Illumina HiSeq 2500 metagenome_mockCell_labC_lib011 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380119 SAMN20256237 SRX13554340 Illumina HiSeq 2500 metagenome_mockCell_labC_lib010 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380120 SAMN20256237 SRX13554339 Illumina HiSeq 2500 metagenome_mockCell_labC_lib009 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380121 SAMN20256237 SRX13554338 Illumina HiSeq 2500 metagenome_mockCell_labC_lib008 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
There should be 9 columns in the metadata.tsv file
row 1 is the header and the columns are pretty much self-explanatory
column 1 is the Sample_ID, which must correspond with the sample names in the feature tables
column 8 is the lab where the sample was processed, this will be our grouping factor in later steps
Examine taxonomy feature table
You can have a quick look at the taxonomy input table
this command will show the first 10 rows and 6 columns
head Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund | cut -f1-6 | column -t -s $'\t'
For example, check the sample IDs (columns) are the same as the metadata.tsv
the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
note if you enter the command as one line, remove the backslashes
Comma-separated list of column names found in the --metadata file. These columns contain the study groups of interest. Each column should have at least 2 levels for comparisons.
--output-dir <path>
output directory path
In this tutorial, we are performing two sets of comparisons:
comparing between labs (3 levels: A, B, and C)
comparing between mock community type (2 levels: DNA and Cell)
Since we ran 2 sets of comparisons, between lab and type, there are 2 sets of results as indicated by the 2nd part of the filename.
When comparing between labs, the output files with *_lab_* in the filename (highlighted rows), you should have:
4 significant results after multiple-comparison adjustment, using the chao1, chao2, evenness and observed richness diversity indices
3 non-significant results after adjustment using: inverse simpson, shannon and simpson indices
0 results that were significant before adjustment
Sub-directories explained
Directory
Description
Kraken_alpha/boxplots/
boxplots comparing groups defined by the --study-groups parameter. If there are two groups, the Wilcox rank-sum test is performed; if there are more than two groups, the Kruskal-Wallis test is performed.
Comparisons are adjusted for multiple comparisons and results are separated further into sub-directories.
.../boxplots/corrected_sig/
alpha-diversities that show significant group differences after adjustment
.../boxplots/non_sig/
non-significant alpha-diversities after adjustment
.../boxplots/uncorrected_sig/
results that are significant before adjustment
Kraken_alpha/wilcox-pairwise/
text output from Wilcox rank-sum tests for two-group comparisons defined by the --study-groups parameter
File
alpha_diversity_raw_data.txt
tab-separated text file of calculated alpha-diversities (columns) for each sample (row)
*_filter_table.txt
filtered alpha_diversity_raw_data.txt table
*_kw_stat_summary.txt
output from Kruskal-Wallis test when there are > 2 groups being compared
*_pairwise_dunn_test.txt
output from post-hoc pairwise comparisons when there are > 2 groups being compared
Example figures
Comparing Observed Richness between Labs
There are significant differences between Labs A-vs-B, A-vs-C and B-vs-C when comparing their Observed richness (number of species detected in a sample).
Comparing Evenness between Labs
There are significant differences between Labs A-vs-B, A-vs-C and B-vs-C when comparing their Evenness (a measure describing the abundance distribution of the species detected in a sample/community, e.g., their dominance/rareness).
*_results.tsv are text files of results from the PERMANOVA tests
Multiple ordination plots are generated in the plots directory
As we performed 2 sets of comparisons (the --study-groups parameter had the value lab,type), we have 2 files per ordination, with the column name used as the suffix for the output file
Example figures
Principal co-ordinate analysis plot of the beta-diversity measured using Bray-Curtis dissimilarity between samples grouped by the three labs.
tree ~/vis-tutorial/analysis/Kraken_diff-abundance/
warning! there will be a lot of files because we performed 2 sets of comparisons between lab and type. Below only shows a snapshot with ... meaning “and others”
The univariate comparison suggests there are significant lab differences after adjustment for multiple comparisons in the species *Acetivibrio A ethanolgignens*.
Example marker 2
Another example of significant lab differences after adjustment for multiple comparisons, in the species *Bifidobacterium callitrichos*.
bar plot of each input feature and the measured Hodges-Lehmann estimator. Suitable for 2 groups only; if you have >2 groups in your study, this will only compare the first two groups.
Kraken_diff-abundance/boxplots/
one boxplot for every feature (e.g., species, gene, pathway) in the input feature table, comparing between groups specified by the columns in --study-groups.
Comparisons are adjusted for multiple comparisons and results are separated further into sub-directories
.../boxplots/corrected_sig/
features that show significant group differences after adjustment
.../boxplots/non_sig/
non-significant features after adjustment
.../boxplots/uncorrected_sig/
features that are significant before adjustment
Kraken_diff-abundance/volcano/
volcano plots of comparisons, a scatterplot showing the statistical significance (p-value) versus the magnitude of change (fold change).
Kraken_diff-abundance/wilcox-pairwise/
text output from Wilcox rank-sum tests for two-group comparisons defined by the --study-groups parameter
File
*_filter_table.txt
filtered alpha_diversity_raw_data.txt table
*_kw_stat_summary.txt
output from Kruskal-Wallis test when there are > 2 groups being compared
*_pairwise_dunn_test.txt
output from post-hoc pairwise comparisons when there are > 2 groups being compared
4.2.3 - Classification with Random Forest
assessing the classification using Random Forest for cross-sectional studies
Introduction
This tutorial shows how to run Random Forest classification using the MIMA Container.
Briefly, Random Forest is a supervised machine learning approach for classification and regression. It uses labelled data (which samples belong to which group) to learn which key data features are useful in differentiating samples between classes (groups), hence "supervised learning", so that when new data are presented it can make a prediction.
Input data required for training a Random Forest classifier:
one or more feature tables: a text file with features (e.g., species, genes, or pathways) as rows and samples as columns. The cells denote the relative abundance of feature i for sample j.
metadata: informs the classifier which samples belong to which group; samples must belong to mutually exclusive groups.
One common approach to evaluating classifier performance is cross-validation (CV), which splits the input data into a training set and a testing set. MIMA uses the K-fold cross-validation approach: the input data is split into K partitions; one partition is held out for testing and the remaining K-1 partitions are used for training. So with 5-fold CV, the data is split into 5 parts, i.e., an 80/20 split with 80% of the data used for training and 20% used for testing.
In MIMA, K ranges from 3 to 10, incrementing by one each time, so for each input data you will get 8 outputs: cv_3, cv_4, …, cv_9, cv_10.
Our data consists of N=205 stool samples, of which 99 are from females and 106 are from males. Only samples from their first visit are used. The taxonomy profiles have been annotated using the MetaPhlAn2 pipeline.
Analysis objective
Our question is “how well can we distinguish between Sex using the human stool microbiome?”
Check list
For this set of tutorials, you need
System with Apptainer or Singularity installed (currently only tested on UNIX-based system)
There should be 9 tab-separated text files (*.tsv extension), of which one is the metadata and the others are feature tables from Kingdom to Strain ranks.
Note! assumed working directory
This tutorial assumes the randforest-tut directory is in your home directory as indicated by the ~ (tilde) sign.
If you have put the files in another location then replace all occurrences of ~/randforest-tut with your location.
Step 2) Examine input files
Examine the metadata
head HMPv1-stool-samples/hmp1-stool-visit-1.metadata.tsv | column -t
SN RANDSID VISNO STArea STSite SNPRNT Gender WMSPhase SRS
700014562 158458797 1 Gut Stool 700014555 Female 1 SRS011061
700014724 158479027 1 Gut Stool 700014718 Male 1 SRS011084
700014837 158499257 1 Gut Stool 700014832 Male 1 SRS011134
700015181 158742018 1 Gut Stool 700015179 Female 1 SRS011239
700015250 158802708 1 Gut Stool 700015245 Male 1 SRS011271
700015415 158944319 1 Gut Stool 700015394 Female 1 SRS011302
700015981 159247771 1 Gut Stool 700015979 Female 1 SRS011405
700016142 159146620 1 Gut Stool 700016136 Male 1 SRS011452
700016610 159166850 1 Gut Stool 700016608 Male 1 SRS011529
There are 9 columns, where SN is the Sample ID and should correspond with the columns of the feature table
Examine the CLASS feature table
head HMPv1-stool-samples/hmp1-stool-visit-1.features_3-class.tsv | cut -f1-6 | column -t
if you used different data preparation settings, then replace all ~/randforest-tut with your location.
the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
note if you enter the command as one line, remove the backslashes
comma separated list of input text files (e.g. taxonomy abundance tables).
10
-m/--metadata <path>
file path of the study metadata table. Expects a tab-separated text file with one row per sample, where the first row is the header and the columns are the different metadata fields.
11
-c/--column <column_name>
column header from the metadata table that contains the grouping categories for classification
12
-o/--output <output_dir>
output directory path where results are stored
N+1 output
If N input files are provided, there will be N+1 sets of output:
one classifier for each input data
a classifier trained on data_all which combines all the input data together as a big table
In this tutorial, we provided 8 input taxonomy files from Kingdom to Strain level features. This will generate 9 output files: one for each taxonomy rank and the last is a big table combining all the previous 8 together.
Step 4) Check output files
As mentioned above, MIMA uses K-fold cross-validation (CV) to evaluate the trained Random Forest classifier's overall performance. For each input data set there will be 8 outputs, from cv_3 to cv_10.
there is one file per input data set (8 in total, from Kingdom to Strain taxonomy abundance tables) and the last output, data_all, is where all input tables are combined.
Each PDF (data set) will have 8 pages: from 3- to 10-fold CV
cv_auc.pdf
1
9
one page per input data: 8 from Kingdom to Strain and the last page for data_all where all the input tables are combined
roc_classif.randomForest_dataset.pdf
1
8
one page per cross-validation (from 3- to 10-fold CV)
roc_data_*.pdf
9
8
same as above
Example figures
There is one roc_data_*_classifier.pdf file for each input data. Each PDF contains one page per N-fold cross validation (CV) that includes 3-folds to 10-folds CV.
Below we’ve only shown the classification results when using Species (data_7) as input features and for 4 out of 8 CVs.
Receiver operating characteristic curve (ROC curve) is a graph that shows the overall performance of a classification model. Ideally, good classifiers will have high specificity (x-axis) and high sensitivity (y-axis); therefore the desired curve is towards the top-left, above the diagonal line. If the curve follows the diagonal line, then the classifier is no better than randomly guessing the class (group) a sample belongs to.
ROC curve characteristics [source: Wikipedia]
note
In the plots generated by MIMA Random Forest Classification, the x-axis is shown in reverse order from 1 to 0, since the false positive rate is \(FPR = 1-Specificity\)
The area under the ROC curve (AUC) is the overall accuracy of a classifier, ranging between 0-1 (or reported as percentages).
Below shows a subset of the figures from the Species (input 7) abundance table roc_data_7_classifier.pdf. There is one page per cross-validation from 3- to 10-fold CV.
Note the AUC annotation at the top of each figure
This shows the ROC curve for the Species relative abundance table (data_7) when using 3-fold cross-validation.
This figure shows the ROC curve for the Species relative abundance table (data_7) when using 4-fold cross-validation.
This figure shows the ROC curve for the Species relative abundance table (data_7) when using 8-fold cross-validation.
This figure shows the ROC curve for the Species relative abundance table (data_7) when using 10-fold cross-validation.
The cv_auc.pdf file shows an aggregation of the above roc_data_*_classifier.pdf plots. There is one page per input data set and the plots show the AUC against the K-fold CV.
In this plot, we observe that when using Species (data_7) abundances as the feature table, the classifier accuracy ranges between 0.7 to 0.8 with a peak performance occurring at cv_8 (8-fold cross validation). In general, when there is more data for training (higher cv) there is better performance (higher AUC).
This plot compares accuracy (AUC) for Random Forest classifiers trained on the Species abundance tables (input 7 == page 7). The AUC values correspond to the area under the curve of the above roc_data_7_classifier.pdf file.
roc_classif.randomForest_dataset.pdf is a summary PDF with one page per K-fold CV. Each plot shows the ROC curves for each input data set at a specified K-fold CV. Below shows the performance for classifiers trained using 8-fold CV (page 8) across the 8 input tables (Kingdom to Strain) plus the final data table that combines all eight tables together as one big dataset, giving the ninth data set.
This PDF combines the individual plots from the set of roc_data_*_classifier.pdf files.
The plot below tells us that when using 8-fold CV for training, the best performing classifier to distinguish between males and females in this set is data_7, with an accuracy of 0.787 (or 78.7%). This classifier was trained using the Species rank abundance table. The Strain level is comparable with an accuracy of 0.779, while the Kingdom classifier performs very poorly (accuracy = 0.519), which is pretty much random guessing. There was no improvement to the classifier if we combined all the features (data_9) together.
The ROC curves of Random Forest classifiers trained using 8-fold cross-validation (cv_8) for input data from Kingdom (data_1) to Strain (data_8) and the combined dataset which merges all input data as one table (data_9). When using 8-fold CV the best performing classifier has accuracy of 0.787 using the Species input data (data_7). See text description for further interpretation.