This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Analytical tutorials

Use MIMA to run core biodiversity analyses.

1: Biodiversity bar plots
2: Core diversity analysis and visualisation
3: Classification with Random Forest

This section covers standalone tutorials that analyses of the taxonomy biodiversity or functional biodiversity of samples in your study.

The tutorial will take in either the taxonomy abundance tables or functional feature tables.

You can start with:

1 - Biodiversity bar plots

generate bar plots to visualise the overall biodiversity in your samples

Introduction

When starting a new metagenomics study, it is useful to visualise the microbial diversity present in the individual samples. You can also average the relative abundances of the most abundant species present and observe how they are distributed between groups in your study (e.g., controls versus disease or placebo versus treatment).

For taxonomy features, the biodiversity bar plots can be generated for each taxonomic rank from Phylum to Species.

Example data

The data set used for this tutorial is based on the full data set from the data processing tutorials (Tourlousse, et al., 2022) that was pre-process with the same pipeline. There are 56 samples in total from the study, but after taxonomy profiling with Kraken2 and Bracken, 4 samples fell below the abundance threshold and therefore the final taxonomy abundance table has 52 samples.

This data set consists of two mock communities: (i) DNA-mock and (ii) Cell-mock processed in three different labs: A, B and C.

Check list

For this set of tutorial, you need

System with Apptainer or Singularity installed (currently only tested on UNIX-based system)
Install MIMA Pipeline Singularity container and check that you have
- started an interactive PBS job
- build the sandbox container
- set the SANDBOX environment variable

Step 1) Data preparation

Create a directory vis-tutorial
Change into this directory

mkdir vis-tutorial
cd vis-tutorial

Download taxonomy-feature-table.tar.gz
- Note only the processed feature tables are provided in this download; intermediate files from the data processing pipeline are not provided.

wget https://github.com/mrcbioinfo/mima-pipeline/raw/master/examples/taxonomy-feature-table.tar.gz

Extract the archived file

tar xf taxonomy-feature-table.tar.gz

Check the directory structure matches
Only a subset of the files are shown below and ... means “and others”

tree .

~/vis-tutorial/
├── metadata.tsv
├── taxonomy-feature-table.tar.gz
└── Taxonomy_profiling
    └── featureTables
        ├── bracken_FT_class
        ├── bracken_FT_class_counts
        ├── bracken_FT_class_relAbund
        ├── ...
        ├── bracken_FT_genus
        ├── bracken_FT_genus_counts
        ├── bracken_FT_genus_relAbund
        ├── combine_bracken_class.log
        ├── ...
        ├── combine_bracken_genus.log
        └── generate_bracken_feature_table.py

Note! Assumed working directory

This tutorial assumes the vis-tutorial directory is in your home directory as indicated by the ~ (tilde) sign. If you have put the files in another location then replace all occurrences of ~/vis-tutorial with your location

Data files explained

Directory / File Description

metadata.tsv is a tab-separated text file of the study metadata

Directory / File	Description
`metadata.tsv`	is a tab-separated text file of the study metadata
`Taxonomy_profiling/featureTables`	directory contains the taxonomy feature tables using Kraken2 and Bracken. There are feature tables for each taxonomy rank from Phylum to Species. As some analysis tools might require feature table input to be discrete counts while others might require relative abundances, two formats are provided with the suffixes: `_count` files have discrete counts of features (row) per samples (columns) `_relAbund` files ahve the relative abundances of features (row) per samples (columns)

Taxonomy_profiling/featureTables

directory contains the taxonomy feature tables using Kraken2 and Bracken.

There are feature tables for each taxonomy rank from Phylum to Species. As some analysis tools might require feature table input to be discrete counts while others might require relative abundances, two formats are provided with the suffixes:

*_count files have discrete counts of features (row) per samples (columns)
*_relAbund files ahve the relative abundances of features (row) per samples (columns)

Confirm you have the following 3 files (if they don’t exists, there will be an error message)

ls metadata.tsv
ls Taxonomy_profiling/featureTables/bracken_FT_genus_counts
ls Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund

Step 2) Examine input files

Examine metadata

Examine the metadata.csv file:
- head shows the first 10 rows
- column formats the output into columns using tab as the separator (prettier view)

head metadata.tsv | column -t -s $'\t'

Sample_ID    biosample     experiment   instrument           library.name                     sample.name                                         lab   type                 n_species
SRR17380113  SAMN20256237  SRX13554346  Illumina HiSeq 2500  metagenome_mockCell_labC_lib016  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18
SRR17380114  SAMN20256237  SRX13554345  Illumina HiSeq 2500  metagenome_mockCell_labC_lib015  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18
SRR17380115  SAMN20256237  SRX13554344  Illumina HiSeq 2500  metagenome_mockCell_labC_lib014  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18
SRR17380116  SAMN20256237  SRX13554343  Illumina HiSeq 2500  metagenome_mockCell_labC_lib013  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18
SRR17380117  SAMN20256237  SRX13554342  Illumina HiSeq 2500  metagenome_mockCell_labC_lib012  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18
SRR17380118  SAMN20256237  SRX13554341  Illumina HiSeq 2500  metagenome_mockCell_labC_lib011  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18
SRR17380119  SAMN20256237  SRX13554340  Illumina HiSeq 2500  metagenome_mockCell_labC_lib010  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18
SRR17380120  SAMN20256237  SRX13554339  Illumina HiSeq 2500  metagenome_mockCell_labC_lib009  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18
SRR17380121  SAMN20256237  SRX13554338  Illumina HiSeq 2500  metagenome_mockCell_labC_lib008  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18

There should be 9 columns in the metadata.tsv file
- row 1 is the header and the columns are pretty much self-explanatory
- column 1 is the Sample_ID, which must correspond with the sample names in the feature tables
- column 8 is the lab where the sample was processed, this will be our grouping factor in later steps

Examine taxonomy feature table

You can have a quick look at the taxonomy input table
- this command will show the first 10 rows and 6 columns

head Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund | cut -f1-6 | column -t -s $'\t'

For example, check the sample IDs (columns) are the same as the metadata.tsv

name                        SRR17380113  SRR17380114  SRR17380116  SRR17380117  SRR17380119
s__Escherichia flexneri     0.06616      0.06187      0.06694      0.06276      0.0674
s__Escherichia dysenteriae  0.0346       0.03221      0.03489      0.03252      0.03523
s__Escherichia coli_D       0.03155      0.02893      0.03205      0.02955      0.03223
s__Escherichia coli         0.0185       0.01747      0.01875      0.01766      0.01872
s__Escherichia coli_C       0.00975      0.00943      0.00986      0.00956      0.00994
s__Escherichia sp004211955  0.00458      0.00473      0.0046       0.00474      0.00463
s__Escherichia sp002965065  0.00375      0.00399      0.00381      0.00402      0.00379
s__Escherichia fergusonii   0.00398      0.00371      0.00405      0.00375      0.00405
s__Escherichia sp000208585  0.00321      0.0033       0.00326      0.00334      0.00328

Step 3) Generate taxonomy bar plots

Enter the following command
- if /vis-tutorial is not in your home directory (~), change the ~/vis-tutorial/ to the absolute path where you downloaded the tutorial data
- the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
- note if you enter the command as one line, remove the backslashes

1
2
3
4
5
6
apptainer run --app mima-vis-taxa $SANDBOX \
~/vis-tutorial/Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund \
~/vis-tutorial/metadata.tsv \
lab:labA:labB:labC \
~/vis-tutorial/analysis \
LAB 8 10 6

Parameters explained

Line	Parameters	Description
2	`<feature_table.tsv>`	file path of the relative abundance feature table
3	`<metadata.tsv>`	file path of the study metadata table. Expect a tab separated text file with one row per sample with the first row being the header, and columns are the different metadata.
4	`<column_group:g1:g2>`	column name in `metadata.tsv` that holds the grouping information and the group levels. Format is `group_column:group1:group2:...`. Note: the values are case sensitive. In this tutorial, the `lab` column is used and has 3 groups: `labA`, `labB` and `labC`.
5	`<output_dir>`	output directory path where results are stored
6-1*	`<output_prefix>`	prefix that will be used for output files. In the example, `LAB` is set as the prefix for output files.
6-2*	`<top_N>`	the number of top taxa to colour in the stack bar plot, remaining detected taxa are grouped together as ‘Other’.
6-3*	`<figure_width>`	output figure width (inches)
6-4*	`<figure_height>`	output figure height (inches)

*the last 4 parameters are all on line 6 separated by a space

Note! Parameter order matters

The parameters must be in the same order described above.

Step 4) Check output files

You should have the following output files after running Step 2

tree ~/vis-tutorial/analysis

analysis_v3/
├── LAB.select.average.table.txt
├── LAB.select.average.top_7_.barchart.pdf
├── LAB.select.average.top_7_seperategroup.barchart.pdf
├── LAB.select.average.top_7_seperategroup.table.txt
├── LAB.select.average.top_7_seperategroup.table.txt.7.R
├── LAB.select.average.top_7.table.txt
├── LAB.select.average.top_7.table.txt.7.R
├── LAB.select.meta.txt
├── LAB.select.table_top_7_seperategroup.txt
├── LAB.select.table_top_7_seperategroup.txt.7.R
├── LAB.select.table_top_7.txt
├── LAB.select.table_top_7.txt.7.R
├── LAB.select.table.txt
├── LAB.select.top_7_.barchart.pdf
└── LAB.select.top_7_seperategroup.barchart.pdf

There are 15 output files
- 4 PDF files - plots
- 7 text files - data
- 4 R scripts that generate plots

Output files	Description
`.average.top_`	mean group abundances for the overall top 7 taxa across the entire study
`.average.top__separategroup`	mean group abundances for the top 7 taxa within each group
`*.meta.txt`	study metadata specific for the plots
`.table_top_`	sample abundances of the overall top 7 taxa across the entire study
`.table_top__separategroup.*`	sample abundances of the top 7 taxa within each group

Example plots

Relative abundances across all samples

This plot shows the relative abundances for the top 8 genera for each sample, where the top 8 genera is across all samples in the study.

Mean relative abundances by group

This plot shows the mean relative abundances for the top 8 genera occurring across all samples in the study. All other detected taxa not in the top 7 are aggregated as Others. In this example, Pseudomonas E is the most abundant in samples processed by Lab-C."

File: LAB.select.average.top_8_.barchart.pdf

Plots from Phylum to Species

You can repeat Step 3 and change the input file from Phylum to Species feature tables to generate bar plots for the different or desired taxonomic ranks.

Alternatively if you have other grouping variables in your study design, you can change column_group parameter in the command.

2 - Core diversity analysis and visualisation

comparisons of groups in cross-sectional studies

Introduction

This tutorial runs through three common core biodiversity analyses for cross-sectional study designs that include group comparisons.

Alpha-diversity analysis

describes the biodiversity of a single sample/community. There are many metrics such as Richness (number of species detected), Evenness (describing the distribution of abundances across the detected species in a sample), Shannon Diversity (combines richness and abundances) and others.
the pipeline will calculate the alpha-diversity for each sample present in the feature table and then perform hypothesis testing to assess if there are any significant (p < 0.05) differences between groups. The pipeline also generates a set of plots to compare the alph-diversity between groups.

Beta-diversity analysis

measures the community (dis)similarity between two or more samples/communities. Two common metrics are:
- Jaccard (dis)similarity which only considers presence/absences of species
- Bray-Curtis dissimilarity which also considers the abundances of the detected species
the pipeline will calculate the pairwise beta-diversities between all samples present in the feature table and then perform PERMANOVA testing using the adonis function in vegan to assess for any significant (p < 0.05) differences between groups. The pipeline also generates a set of plots to help visualise the similarity between samples coloured by the groups using multiple ordinations.

Differential abundance analysis

identifies which features (e.g., species, genes, pathways) are significant different between groups based on their relative abundances

Example study data

The data set used for this tutorial is based on the full data set from the [data processing tutorials](https://mrcbioinfo.github.io/mima-pipeline/docs/tutorials/data-processing/ (Tourlousse, et al., 2022) that was pre-process with the same pipeline. There are 56 samples in total from the study, but after taxonomy profiling with Kraken2 and Bracken, 4 samples fell below the abundance threshold and therefore the final taxonomy abundance table has 52 samples.

This data set consists of two mock communities: (i) DNA-mock and (ii) Cell-mock processed in three different labs: A, B and C.

Analysis objective

In this tutorial, our main questions are:

is there a difference between the two types of mock communities? and
are there any differences between the processing labs?

Check list

For this set of tutorial, you need

System with Apptainer or Singularity installed (currently only tested on UNIX-based system)
Install MIMA Pipeline Singularity container and check that you have
- started an interactive PBS job
- build the sandbox container
- set the SANDBOX environment variable

Step 1) Data preparation

Skip if ...

You can skip this section if you have already run the Biodiversity bar plot tutorial and go directly to Step 2-to refresh yourself with the data or proceed to Step 3, running the analysis.

Use the vis-tutorial directory from Biodiversity bar plots or create a directory vis-tutorial
Change into this directory

mkdir vis-tutorial
cd vis-tutorial

Download taxonomy-feature-table.tar.gz
- Note only the processed feature tables are provided in this download; intermediate files from the data processing pipeline are not provided.

wget https://github.com/mrcbioinfo/mima-pipeline/raw/master/examples/taxonomy-feature-table.tar.gz

Extract the archived file

tar xf taxonomy-feature-table.tar.gz

Check the directory structure matches
Only a subset of the files are shown below and ... means “and others”

tree .

~/vis-tutorial/
├── metadata.tsv
├── taxonomy-feature-table.tar.gz
└── Taxonomy_profiling
    └── featureTables
        ├── bracken_FT_class
        ├── bracken_FT_class_counts
        ├── bracken_FT_class_relAbund
        ├── ...
        ├── bracken_FT_genus
        ├── bracken_FT_genus_counts
        ├── bracken_FT_genus_relAbund
        ├── combine_bracken_class.log
        ├── ...
        ├── combine_bracken_genus.log
        └── generate_bracken_feature_table.py

Note! Assumed working directory

Data files explained

Directory / File Description

metadata.tsv is a tab-separated text file of the study metadata

Directory / File	Description
`metadata.tsv`	is a tab-separated text file of the study metadata
`Taxonomy_profiling/featureTables`	directory contains the taxonomy feature tables using Kraken2 and Bracken. There are feature tables for each taxonomy rank from Phylum to Species. As some analysis tools might require feature table input to be discrete counts while others might require relative abundances, two formats are provided with the suffixes: `_count` files have discrete counts of features (row) per samples (columns) `_relAbund` files ahve the relative abundances of features (row) per samples (columns)

Taxonomy_profiling/featureTables

directory contains the taxonomy feature tables using Kraken2 and Bracken.

*_count files have discrete counts of features (row) per samples (columns)
*_relAbund files ahve the relative abundances of features (row) per samples (columns)

Confirm you have the following 3 files (if they don’t exists, there will be an error message)

ls metadata.tsv
ls Taxonomy_profiling/featureTables/bracken_FT_genus_counts
ls Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund

Step 2) Examine input files

Examine metadata

Examine the metadata.csv file:
- head shows the first 10 rows
- column formats the output into columns using tab as the separator (prettier view)

head metadata.tsv | column -t -s $'\t'

Sample_ID    biosample     experiment   instrument           library.name                     sample.name                                         lab   type                 n_species
SRR17380113  SAMN20256237  SRX13554346  Illumina HiSeq 2500  metagenome_mockCell_labC_lib016  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18
SRR17380114  SAMN20256237  SRX13554345  Illumina HiSeq 2500  metagenome_mockCell_labC_lib015  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18
SRR17380115  SAMN20256237  SRX13554344  Illumina HiSeq 2500  metagenome_mockCell_labC_lib014  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18
SRR17380116  SAMN20256237  SRX13554343  Illumina HiSeq 2500  metagenome_mockCell_labC_lib013  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18
SRR17380117  SAMN20256237  SRX13554342  Illumina HiSeq 2500  metagenome_mockCell_labC_lib012  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18
SRR17380118  SAMN20256237  SRX13554341  Illumina HiSeq 2500  metagenome_mockCell_labC_lib011  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18
SRR17380119  SAMN20256237  SRX13554340  Illumina HiSeq 2500  metagenome_mockCell_labC_lib010  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18
SRR17380120  SAMN20256237  SRX13554339  Illumina HiSeq 2500  metagenome_mockCell_labC_lib009  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18
SRR17380121  SAMN20256237  SRX13554338  Illumina HiSeq 2500  metagenome_mockCell_labC_lib008  Cell mock community, blend of 18 bacterial species  labC  Cell mock community  18

There should be 9 columns in the metadata.tsv file
- row 1 is the header and the columns are pretty much self-explanatory
- column 1 is the Sample_ID, which must correspond with the sample names in the feature tables
- column 8 is the lab where the sample was processed, this will be our grouping factor in later steps

Examine taxonomy feature table

You can have a quick look at the taxonomy input table
- this command will show the first 10 rows and 6 columns

head Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund | cut -f1-6 | column -t -s $'\t'

For example, check the sample IDs (columns) are the same as the metadata.tsv

name                        SRR17380113  SRR17380114  SRR17380116  SRR17380117  SRR17380119
s__Escherichia flexneri     0.06616      0.06187      0.06694      0.06276      0.0674
s__Escherichia dysenteriae  0.0346       0.03221      0.03489      0.03252      0.03523
s__Escherichia coli_D       0.03155      0.02893      0.03205      0.02955      0.03223
s__Escherichia coli         0.0185       0.01747      0.01875      0.01766      0.01872
s__Escherichia coli_C       0.00975      0.00943      0.00986      0.00956      0.00994
s__Escherichia sp004211955  0.00458      0.00473      0.0046       0.00474      0.00463
s__Escherichia sp002965065  0.00375      0.00399      0.00381      0.00402      0.00379
s__Escherichia fergusonii   0.00398      0.00371      0.00405      0.00375      0.00405
s__Escherichia sp000208585  0.00321      0.0033       0.00326      0.00334      0.00328

Step 3) Run core diversity analysis

Enter the following command
- if /vis-tutorial is not in your home directory (~), change the ~/vis-tutorial/ to the absolute path where you downloaded the tutorial data
- the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
- note if you enter the command as one line, remove the backslashes

1
2
3
4
5
apptainer run --app mima-visualisation $SANDBOX \
--feature-table ~/vis-tutorial/Taxonomy_profiling/featureTables/bracken_FT_species_counts \
--metadata ~/vis-tutorial/metadata.tsv \
--study-groups lab,type \
--output-dir ~/vis-tutorial/analysis > ~/vis-tutorial/visualisation.log

Parameters explained

Parameter	Description
`--feature-table <path>`	absolute filepath to the taxonomy abundance table
`--metadata <path>`	filepath to the study metadata table
`--study-groups <col1>,<col2>`	Comma separated list of column names found in the `--metadata` file. These columns contain the study groups of interest. Each column should have at least 2 or more levels for comparisons.
`--output-dir <path>`	output directory path

In this tutorial, we are performing two sets of comparisons:
- comparing between labs (3 levels: A, B, and C)
- comparing between mock community type (2 levels: DNA and Cell)

Confirm output directories

tree -d ~/vis-tutorial/analysis

~/vis-tutorial/analysis/
├── Kraken
├── Kraken_alpha
│   ├── boxplots
│   │   ├── corrected_sig
│   │   ├── non_sig
│   │   └── uncorrected_sig
│   └── wilcox-pairwise
├── Kraken_beta
│   ├── permanova
│   └── plots
└── Kraken_diff-abundance
    ├── barplot
    ├── boxplots
    │   ├── corrected_sig
    │   └── uncorrected_sig
    ├── volcano
    └── wilcox-pairwise

Your output directory (only folders are shown) should resemble something like above, where first level sub-directories:

Directory	Description
`Kraken/`	directory contains two text files: `otu_matrix.txt` - the feature abundance table `TaxProfileWithMeta.txt` - the feature abundance table concatenated with the study metadata provided by the `--metadata` parameter
`Kraken_alpha/`	outputs from alpha-diversity analysis
`Kraken_beta/`	outputs from beta-diversity analysis
`Kraken_diff-abundance/`	outputs from differential abundance analysis

Step 4) Understanding the output files

Alpha-diversity output

Examine the output files for alpha-diversity analysis

tree ~/vis-tutorial/analysis/Kraken_alpha

~/vis-tutorial/analysis/Kraken_alpha
├── alpha_diversity_raw_data.txt
├── boxplots
│   ├── corrected_sig
│   │   ├── chao1_lab_dotplot.png
│   │   ├── chao2_lab_dotplot.png
│   │   ├── evenness_lab_dotplot.png
│   │   ├── invsimperson_type_dotplot.png
│   │   ├── observed_lab_dotplot.png
│   │   ├── shannon_type_dotplot.png
│   │   └── simperson_type_dotplot.png
│   ├── non_sig
│   │   ├── chao1_type_dotplot.png
│   │   ├── chao2_type_dotplot.png
│   │   ├── evenness_type_dotplot.png
│   │   ├── invsimperson_lab_dotplot.png
│   │   ├── observed_type_dotplot.png
│   │   ├── shannon_lab_dotplot.png
│   │   └── simperson_lab_dotplot.png
│   └── uncorrected_sig
├── lab_filter_table.txt
├── lab_kw_stat_summary.txt
├── lab_pairwise_dunn_test.txt
├── type_filter_table.txt
├── type_kw_stat_summary.txt
└── wilcox-pairwise
    ├── lab_labA_labB.wil.stat_summary.txt
    ├── lab_labA_labC.wil.stat_summary.txt
    ├── lab_labB_labC.wil.stat_summary.txt
    └── type_Cell mock community_DNA mock community.wil.stat_summary.txt

5 directories, 24 files

Since we ran 2 sets of comparisons, between lab and type, there are 2 sets of results as indicated by the 2nd part of the filename.
When comparing between labs, the output files with *_lab_* in the filename (highlighted rows), you should have:
- 4 significant results after adjusting for multiple comparison adjustment using: chao1, chao2, evenness and observed richness diversity indcies
- 3 non-significant results after adjustment using: inverse simpson, shannon and simpson indices
- 0 results that were significant before adjustment

Sub-directories explained

Directory	Description
`Kraken_alpha/boxplots/`	boxplots comparing groups defined by the `--study-groups` parameter. If there are two groups then the Wilcox rank-sum test is performed, if there are more than two groups than the Kruskal-Wallis test is performed. Comparisons are adjusted for multiple comparisons and results are separated further into sub-directories.
`.../boxplots/corrected_sig/`	alpha-diversities that show significant group differences after adjustment
`.../boxplots/non_sig/`	non-significant alpha-diversities after adjustment
`.../boxplots/uncorrected_sig/`	results that are significant before adjustment
`Kraken_alpha/wilcox-pairwise/`	text output from Wilcox rank-sum tests for two-group comparisons defined by the `--study-groups` parameter
File
`alpha_diversity_raw_data.txt`	tab-separated text file of calculated alpha-diversities (columns) for each sample (row)
`*_filter_table.txt`	filtered `alpha_diversity_raw_data.txt` table
`*_kw_stat_summary.txt`	output from Kruskal-Wallis test when there are > 2 groups being compared
`*_pairwise_dunn_test.txt`	output from post-hoc pairwise comparisons when there are > 2 groups being compared

Example figures

Comparing Observed Richness between Labs

There is significant differences between Labs A-vs-B, A-vs-C and B-vs-C when comparing their Observed richness (number of species detected in a sample).

File: observed_lab_dotplot.png

Comparing Evenness between Labs

There is significant differences between Labs A-vs-B, A-vs-C and B-vs-C when comparing their Evenness (a measure to describe the abundance distribution of species detected in a sample/community, e.g., their dominance/rareness).

File: evenness_lab_dotplot.png

Beta-diversity analysis output

Examine the output files for beta-diversity analysis

tree ~/vis-tutorial/analysis/Kraken_beta

~/vis-tutorial/analysis/Kraken_beta/
├── permanova
│   ├── lab_pairwise_adonis.results.tsv
│   └── type_pairwise_adonis.results.tsv
└── plots
    ├── CA-Bray-Curtis-lab.png
    ├── CA-Bray-Curtis-type.png
    ├── CCA-Bray-Curtis-lab.png
    ├── CCA-Bray-Curtis-type.png
    ├── DCA-Bray-Curtis-lab.png
    ├── DCA-Bray-Curtis-type.png
    ├── NMDS-Bray-Curtis-lab.png
    ├── NMDS-Bray-Curtis-type.png
    ├── PCA-Bray-Curtis-lab.png
    ├── PCA-Bray-Curtis-type.png
    ├── PCoA-Bray-Curtis-lab.png
    ├── PCoA-Bray-Curtis-type.png
    ├── RDA-Bray-Curtis-lab.png
    └── RDA-Bray-Curtis-type.png

2 directories, 16 files

*_results.tsv are text files of results from the PERMANOVA tests
Multiple ordination plots are generated in the plots directory
As we preformed 2 sets of comparisons (the --study-groups parameter had value lab,type) we have 2 files per ordination with the column name used as the suffix for the output file

Example figures

Principal co-ordinate analysis plot of the beta-diversity measured using Bray-Curtis dissimilarity between samples grouped by the three labs.

Differential abundance analysis output

Examine the output files for differential abundance analysis

tree ~/vis-tutorial/analysis/Kraken_diff-abundance/

warning! there will be a lot of files because we performed 2 sets of comparisons between lab and type. Below only shows a snapshot with ... meaning “and others”

~/vis-tutorial/analysis/Kraken_diff-abundance/
├── barplot
│   ├── lablabA_labB_barplot_FDR.png
│   ├── lablabA_labB_barplot_p.png
│   ├── typeCell mock community_DNA mock community_barplot_FDR.png
│   └── typeCell mock community_DNA mock community_barplot_p.png
├── boxplots
│   ├── corrected_sig
│   │   ├── ...
│   │   ├── s__Bifidobacterium.callitrichos_lab_dotplot.png
│   │   ├── s__Bifidobacterium.callitrichos_type_dotplot.png
│   │   ├── ...
│   │   └── s__Zag1.sp900556295_lab_dotplot.png
│   └── uncorrected_sig
│       ├── ...
│       ├── s__Acetatifactor.muris_type_dotplot.png
│       ├── ...
│       └── s__Zag1.sp900556295_type_dotplot.png
├── lab_filter_table.txt
├── lab_kw_stat_summary.txt
├── lab_pairwise_dunn_test.txt
├── type_filter_table.txt
├── type_kw_stat_summary.txt
├── volcano
│   ├── Cell mock community_DNA mock community.volcanoplot.d.p.png
│   ├── labA_labB.volcanoplot.d.p.png
│   ├── lablabA_labB.volcanoplot.d.FDR.png
│   ├── lablabA_labB.volcanoplot.Log2FC.FDR.png
│   ├── lablabA_labB_volcanoplot_Log2FC_p.png
│   ├── typeCell mock community_DNA mock community.volcanoplot.d.FDR.png
│   ├── typeCell mock community_DNA mock community.volcanoplot.Log2FC.FDR.png
│   └── typeCell mock community_DNA mock community_volcanoplot_Log2FC_p.png
└── wilcox-pairwise
    ├── lab_labA_labB.wil.stat_summary.txt
    ├── lab_labA_labC.wil.stat_summary.txt
    ├── lab_labB_labC.wil.stat_summary.txt
    └── type_Cell mock community_DNA mock community.wil.stat_summary.txt

6 directories, 2301 files

Example figures

Example marker 1

The univariate comparison suggests there is significant lab differences after adjustment for multiple comparison in species *Acetivibrio A ethanolgignens*.

File: s__Acetivibrio_A.ethanolgignens_lab_dotplot.png

Example marker 2

Another example of significant lab differences after adjustment for multiple comparison in species *Bifidobacterium callitrichos*.

File: s__Bifidobacterium.callitrichos_lab_dotplot.png

Sub-directories explained

The output files are organised in a similar fashion as per alpha-diversity output

Directory	Description
`Kraken_diff-abundance/barplot/`	bar plot of each input feature and the measured Hodges-Lehmann estimator. suitable for 2 groups only, if you have >2 groups in your study, this will only compare the first two groups.
`Kraken_diff-abundance/boxplots/`	one boxplot for every feature (e.g., species, gene, pathway) in the input feature table, comparing between groups specified by the columns in `--study-groups`. Comparisons are adjusted for multiple comparisons and results are separated further into sub-directories
`.../boxplots/corrected_sig/`	features that show significant group differences after adjustment
`.../boxplots/non_sig/`	non-significant features after adjustment
`.../boxplots/uncorrected_sig/`	features that are significant before adjustment
`Kraken_diff-abundance/volcano/`	volcano plots of comparisons, a scatterplot showing the statistical significance (p-value) versus the magnitude of change (fold change).
`Kraken_diff-abundance/wilcox-pairwise/`	text output from Wilcox rank-sum tests for two-group comparisons defined by the `--study-groups` parameter
File
`*_filter_table.txt`	filtered `alpha_diversity_raw_data.txt` table
`*_kw_stat_summary.txt`	output from Kruskal-Wallis test when there are > 2 groups being compared
`*_pairwise_dunn_test.txt`	output from post-hoc pairwise comparisons when there are > 2 groups being compared

3 - Classification with Random Forest

assessing the classification using Random Forest for cross-sectional studies

Introduction

This tutorial shows how to run Random Forest classification using the MIMA Container.

Briefly, Random Forest is a supervised machine learning approach for classification and regression. It uses labelled data (what samples below to what group) to learn which key data features are useful in differentiating samples between classes (groups)—hence “supervised learning”—so that when new data are presented it can make a prediction.

Input data required for training a Random Forest classifier:

one or more feature tables: a text file with features (e.g., species, genes, or pathways) as rows and samples as columns. The cells denote the relative abundance of feature i for sample j.
metadata: informs the classifier which samples belong to which group; samples must belong to mutually exclusive groups.

One common approach to evaluate classifier performance is via cross-validation (CV), which basically splits the input data into a training set and a testing set. MIMA uses the K-fold cross-validation approach. This splits the input data into K number of partitions; hold one partition for testing and the remaining K-1 partitions are used for training. So when you have 5-fold CV, you split data into 5 parts or 80/20% split with 80% of the data used for training and 20% used for testing.

In MIMA, K ranges from 3 to 10, incrementing by one each time, so for each input data you will get 8 outputs: cv_3, cv_4, …, cv_9, cv_10.

Example data

For this tutorial, we will use data from the Human Microbiome Project V1.

Our data consists of N=205 stool samples, of which 99 are from females and 106 are from males. Only samples from their first visit are used. The taxonomy profiles have been annotated using the MetaPhlAn2 pipeline.

Analysis objective

Our question is “how well can we distinguish between Sex using the human stool microbiome?”

Check list

For this set of tutorial, you need

System with Apptainer or Singularity installed (currently only tested on UNIX-based system)
Install MIMA Pipeline Singularity container and check that you have
- started an interactive PBS job
- build the sandbox container
- set the SANDBOX environment variable

Step 1) Data preparation

Create a directory randforest-tut
Change directory

mkdir randforest-tut
cd randforest-tut

Download data HMPv1-stool-samples.tar.gz

wget https://github.com/mrcbioinfo/mima-pipeline/raw/master/examples/HMPv1-stool-samples.tar.gz

Extract the archived file

tar -xf HMPv1-stool-samples.tar.gz

Check directory structure

tree .

~/random-forest
├── HMPv1-stool-samples
│   ├── hmp1-stool-visit-1.features_1-kingdom.tsv
│   ├── hmp1-stool-visit-1.features_2-phylum.tsv
│   ├── hmp1-stool-visit-1.features_3-class.tsv
│   ├── hmp1-stool-visit-1.features_4-order.tsv
│   ├── hmp1-stool-visit-1.features_5-family.tsv
│   ├── hmp1-stool-visit-1.features_6-genus.tsv
│   ├── hmp1-stool-visit-1.features_7-species.tsv
│   ├── hmp1-stool-visit-1.features_8-strain.tsv
│   └── hmp1-stool-visit-1.metadata.tsv
└── HMPv1-stool-samples.tar.gz

1 directory, 10 files

There should be 9 tab-separated text files (*.tsv extension) of which one is the metadata and the other are feature tables from Kingdom to Strain ranks.

Note! assumed working directory

This tutorial assumes the randforest-tut directory is in your home directory as indicated by the ~ (tilde) sign. If you have put the files in another location then replace all occurrences of ~/randforest-tut with your location.

Step 2) Examine input files

Examine the metadata

head HMPv1-stool-samples/hmp1-stool-visit-1.metadata.tsv | column -t

SN         RANDSID    VISNO  STArea  STSite  SNPRNT     Gender  WMSPhase  SRS
700014562  158458797  1      Gut     Stool   700014555  Female  1         SRS011061
700014724  158479027  1      Gut     Stool   700014718  Male    1         SRS011084
700014837  158499257  1      Gut     Stool   700014832  Male    1         SRS011134
700015181  158742018  1      Gut     Stool   700015179  Female  1         SRS011239
700015250  158802708  1      Gut     Stool   700015245  Male    1         SRS011271
700015415  158944319  1      Gut     Stool   700015394  Female  1         SRS011302
700015981  159247771  1      Gut     Stool   700015979  Female  1         SRS011405
700016142  159146620  1      Gut     Stool   700016136  Male    1         SRS011452
700016610  159166850  1      Gut     Stool   700016608  Male    1         SRS011529

There are 9 columns, where SN is the Sample ID and should correspond with the columns of the feature table

Examine the CLASS feature table

head HMPv1-stool-samples/hmp1-stool-visit-1.features_3-class.tsv | cut -f1-6 | column -t

lineage                                               700014562  700014724  700014837  700015181  700015250  700015415  700015981  700016142  700016610
k__Archaea|p__Euryarchaeota|c__Methanobacteria        0          0          0.0008246  0.0006299  0          0.0013228  0          0          0.0005494
k__Archaea|p__Euryarchaeota|c__Methanococci           0          0          0          0          0          0          0          0          0
k__Bacteria|p__Acidobacteria|c__Acidobacteria_noname  0          0          0          0          0          0          0          0          0
k__Bacteria|p__Acidobacteria|c__Acidobacteriia        0          0          0          0          0          0          0          0          0
k__Bacteria|p__Actinobacteria|c__Actinobacteria       0.00042    0.0038802  0.0006441  0.0082703  0.0020957  0.0042049  0.0001552  0.0012545  0.0031392
k__Bacteria|p__Bacteroidetes|c__Bacteroidetes_noname  0          0          0          0          0          0          0          0          0
k__Bacteria|p__Bacteroidetes|c__Bacteroidia           0.910796   0.837664   0.600565   0.844245   0.80352    0.451224   0.924745   0.887734   0.838053
k__Bacteria|p__Bacteroidetes|c__Cytophagia            0          0          0          0          0          0          0          0          0
k__Bacteria|p__Bacteroidetes|c__Flavobacteriia        0          0          0          0          0          0          0          0          0

Step 3) Run Random Forest Classifier

First create an output directory

mkdir classifer-output

Enter the following command
- if you used different data preparation settings, than replace all ~/randforest-tut with your location.
- the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
- note if you enter the command as one line, remove the backslashes

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
$ apptainer run --app mima-classifier-RF $SANDBOX \
-i ~/randforest-tut/HMPv1-stool-samples/hmp1-stool-visit-1.features_1-kingdom.tsv,\
~/randforest-tut/HMPv1-stool-samples/hmp1-stool-visit-1.features_2-phylum.tsv,\
~/randforest-tut/HMPv1-stool-samples/hmp1-stool-visit-1.features_3-class.tsv,\
~/randforest-tut/HMPv1-stool-samples/hmp1-stool-visit-1.features_4-order.tsv,\
~/randforest-tut/HMPv1-stool-samples/hmp1-stool-visit-1.features_5-family.tsv,\
~/randforest-tut/HMPv1-stool-samples/hmp1-stool-visit-1.features_6-genus.tsv,\
~/randforest-tut/HMPv1-stool-samples/hmp1-stool-visit-1.features_7-species.tsv,\
~/randforest-tut/HMPv1-stool-samples/hmp1-stool-visit-1.features_8-strain.tsv \
-m ~/randforest-tut/HMPv1-stool-samples/hmp1-stool-visit-1.metadata.tsv \
-c Gender \
-o classifier-output

Parameters explained

Line	Parameters	Description
2	`-i/--input <path>`	comma separated list of input text files (e.g. taxonomy abundance tables).
10	`-m/--metadata <path>`	file path of the study metadata table. Expect a tab separated text file with one row per sample with the first row being the header, and columns are the different metadata.
11	`-c/--column <column_name>`	column header from the metadata table that contains the grouping categories for classification
12	`-o/--output <output_dir>`	output directory path where results are stored

N+1 output

If N input files are provided, there will be N+1 sets of output:

one classifier for each input data
a classifier trained on data_all which combines all the input data together as a big table

In this tutorial, we provided 8 input taxonomy files from Kingdom to Strain level features. This will generate 9 output files: one for each taxonomy rank and the last is a big table combining all the previous 8 together.

Step 4) Check output files

As mentioned above, MIMA uses K-fold cross validation (CV/cv) to evaluate the trained Random Forest classifier overall performance. For each input data there will be 8 outputs from cv_3 to cv_10.

Examine the output files

tree classifier-output

~/randforest-tut/classifier-output
├── cv_auc.pdf
├── roc_classif.randomForest_dataset.pdf
├── roc_data_1_classifier.pdf
├── roc_data_1.pdf
├── roc_data_2_classifier.pdf
├── roc_data_2.pdf
├── roc_data_3_classifier.pdf
├── roc_data_3.pdf
├── roc_data_4_classifier.pdf
├── roc_data_4.pdf
├── roc_data_5_classifier.pdf
├── roc_data_5.pdf
├── roc_data_6_classifier.pdf
├── roc_data_6.pdf
├── roc_data_7_classifier.pdf
├── roc_data_7.pdf
├── roc_data_8_classifier.pdf
├── roc_data_8.pdf
├── roc_data_all_classifier.pdf
├── roc_data_all.pdf
└── socket_cluster_log.txt

0 directories, 21 files

You should have 21 files in total

File	N files	N pages/file	Description
`socket_cluster_log.txt`	1	n/a	log text file
`roc_data_*_classifier.pdf`	9	8	there is on file one per input data (8 in total from Kingdom to Strain taxonomy abundance tables) and the last output `data_all` is where all input tables are combined. Each PDF (data set) will have 8 pages: from 3- to 10-fold CV
`cv_auc.pdf`	1	9	one page per input data: 8 from Kingdom to Strain and the last page for `data_all` where all the input tables are combined
`roc_classif.randomForest_dataset.pdf`	1	8	one page per cross-validation (from 3- to 10-fold CV)
`roc_data_*.pdf`	9	8	same as above

Example figures

There is one roc_data_*_classifier.pdf file for each input data. Each PDF contains one page per N-fold cross validation (CV) that includes 3-folds to 10-folds CV.

Below we’ve only shown the classification results when using Species (data_7) as input features and for 4 out of 8 CVs.

Receiver operating characteristic curve (ROC curve) is a graph that shows the overall performance of a classification model. Ideally, good classifiers will have high specificity (x-axis) and high sensitivity (y-axis), therefore the desired curve is towards to top-left above the diagonal line. If the curve follows the diagonal line then the classifier is no better than randomly guessing the class (groups) a sample belongs.

ROC curve characteristics [source: Wikipedia]

note

In the plots generated by MIMA Random Forest Classification, the x-axis shown is in reverse order from 1 to 0, since false positive rate, \(FPR = 1-Specificity\)

The area under the ROC curve (AUC) is the overall accuracy of a classifier, ranging between 0-1 (or reported as percentages).

Below shows a subset of the figures from the Species (input 7) abundance table roc_data_7_classifier.pdf. There is one page per cross-validation from 3- to 10-fold CV.
Note the AUC annotation at the top of each figure

This shows the ROC curve for the Species relative abundance table (data_7) when using 3-fold cross-validation.

This figure shows the ROC curve for the Species relative abundance table (data_7) when using 4-fold cross-validation.

This figure shows the ROC curve for the Species relative abundance table (data_7) when using 8-fold cross-validation.

This figure shows the ROC curve for the Species relative abundance table (data_7) when using 10-fold cross-validation.

cv.auc.pdf files shows an aggregation of the above roc_data_*_classifier.pdf plots. There is one page per input data and the plots shows the AUC against the K-fold CV.

In this plot, we observe that when using Species (data_7) abundances as the feature table, the classifier accuracy ranges between 0.7 to 0.8 with a peak performance occurring at cv_8 (8-fold cross validation). In general, when there is more data for training (higher cv) there is better performance (higher AUC).

This plot compares accuracy (AUC) for Random Forest classifiers trained on the Species abundance tables (input 7 == page 7). The AUC values correspond to the area under the curve of the above roc_data_7_classifier.pdf file. In this plot, we observe that when using Species (data_7) abundances as the feature table, the classifier accuracy ranges between 0.7 to 0.8 with a peak performance occurring at cv_8 (8-fold cross validation). In general, when there is more data for training (higher cv) there is better performance (higher AUC). — This plot compares accuracy (AUC) for Random Forest classifiers trained on the Species abundance tables (input 7 == page 7). The AUC values correspond to the area under the curve of the above `roc_data_7_classifier.pdf` file. In this plot, we observe that when using Species (data_7) abundances as the feature table, the classifier accuracy ranges between 0.7 to 0.8 with a peak performance occurring at cv_8 (8-fold cross validation). In general, when there is more data for training (higher cv) there is better performance (higher AUC).

roc_classif.randomForest_dataset.pdf is summary PDF with one page per K-fold CV. Each plot shows the ROC curves for each input data for a specified K-fold CV. Below shows the performance for classifiers trained using 8-fold CV (page 8) across the 8 input tables (Kingdom to Strain) plus the final data table that combine all the eight tables together as one big dataset giving the ninth data set.

This PDF combines the individual plots from the set of roc_data_*_classifier.pdf files.

The plot below tells us that when using 8-fold CV for training, the best performing classifier to distinguish between males and females, in this set is data_7 with an accuracy of 0.787 (or 78.7%). This classifier was trained using Species rank abundance table. Strain level is comparable with accuracy of 0.779, while Kingdom classifiers perform very poorly (accuracy = 0.519) which are pretty much random guesses. There was no improvement to the classify if we combined all the features (data_9) together.

The ROC curves of Random Forest classifiers trained using 8-fold cross-validation (cv_8) for input data from Kingdom (data_1) to Strain (data_8) and the combined dataset which merges all input data as one table (data_9). When using 8-fold CV the best performing classifier has accuracy of 0.787 using the Species input data (data_7). See text description for further interpretation.