This section covers standalone tutorials that analyses of the taxonomy biodiversity or functional biodiversity of samples in your study.
The tutorial will take in either the taxonomy abundance tables or functional feature tables.
You can start with:
This is the multi-page printable view of this section. Click here to print.
This section covers standalone tutorials that analyses of the taxonomy biodiversity or functional biodiversity of samples in your study.
The tutorial will take in either the taxonomy abundance tables or functional feature tables.
You can start with:
When starting a new metagenomics study, it is useful to visualise the microbial diversity present in the individual samples. You can also average the relative abundances of the most abundant species present and observe how they are distributed between groups in your study (e.g., controls versus disease or placebo versus treatment).
For taxonomy features, the biodiversity bar plots can be generated for each taxonomic rank from Phylum to Species.
The data set used for this tutorial is based on the full data set from the data processing tutorials (Tourlousse, et al., 2022) that was pre-process with the same pipeline. There are 56 samples in total from the study, but after taxonomy profiling with Kraken2 and Bracken, 4 samples fell below the abundance threshold and therefore the final taxonomy abundance table has 52 samples.
This data set consists of two mock communities: (i) DNA-mock and (ii) Cell-mock processed in three different labs: A, B and C.
For this set of tutorial, you need
Apptainer
or Singularity
installed (currently only tested on UNIX-based system)SANDBOX
environment variablevis-tutorial
mkdir vis-tutorial
cd vis-tutorial
wget https://github.com/mrcbioinfo/mima-pipeline/raw/master/examples/taxonomy-feature-table.tar.gz
tar xf taxonomy-feature-table.tar.gz
...
means “and others”tree .
~/vis-tutorial/
├── metadata.tsv
├── taxonomy-feature-table.tar.gz
└── Taxonomy_profiling
└── featureTables
├── bracken_FT_class
├── bracken_FT_class_counts
├── bracken_FT_class_relAbund
├── ...
├── bracken_FT_genus
├── bracken_FT_genus_counts
├── bracken_FT_genus_relAbund
├── combine_bracken_class.log
├── ...
├── combine_bracken_genus.log
└── generate_bracken_feature_table.py
vis-tutorial
directory is in your home directory as indicated by the ~
(tilde) sign.
If you have put the files in another location then replace all occurrences of ~/vis-tutorial
with your locationDirectory / File | Description |
---|---|
metadata.tsv | is a tab-separated text file of the study metadata |
Taxonomy_profiling/featureTables | directory contains the taxonomy feature tables using Kraken2 and Bracken. There are feature tables for each taxonomy rank from Phylum to Species. As some analysis tools might require feature table input to be discrete counts while others might require relative abundances, two formats are provided with the suffixes:
|
ls metadata.tsv
ls Taxonomy_profiling/featureTables/bracken_FT_genus_counts
ls Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund
metadata.csv
file:head
shows the first 10 rowscolumn
formats the output into columns using tab as the separator (prettier view)head metadata.tsv | column -t -s $'\t'
Sample_ID biosample experiment instrument library.name sample.name lab type n_species
SRR17380113 SAMN20256237 SRX13554346 Illumina HiSeq 2500 metagenome_mockCell_labC_lib016 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380114 SAMN20256237 SRX13554345 Illumina HiSeq 2500 metagenome_mockCell_labC_lib015 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380115 SAMN20256237 SRX13554344 Illumina HiSeq 2500 metagenome_mockCell_labC_lib014 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380116 SAMN20256237 SRX13554343 Illumina HiSeq 2500 metagenome_mockCell_labC_lib013 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380117 SAMN20256237 SRX13554342 Illumina HiSeq 2500 metagenome_mockCell_labC_lib012 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380118 SAMN20256237 SRX13554341 Illumina HiSeq 2500 metagenome_mockCell_labC_lib011 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380119 SAMN20256237 SRX13554340 Illumina HiSeq 2500 metagenome_mockCell_labC_lib010 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380120 SAMN20256237 SRX13554339 Illumina HiSeq 2500 metagenome_mockCell_labC_lib009 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380121 SAMN20256237 SRX13554338 Illumina HiSeq 2500 metagenome_mockCell_labC_lib008 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
metadata.tsv
fileSample_ID
, which must correspond with the sample names in the feature tableslab
where the sample was processed, this will be our grouping factor in later stepshead Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund | cut -f1-6 | column -t -s $'\t'
metadata.tsv
name SRR17380113 SRR17380114 SRR17380116 SRR17380117 SRR17380119
s__Escherichia flexneri 0.06616 0.06187 0.06694 0.06276 0.0674
s__Escherichia dysenteriae 0.0346 0.03221 0.03489 0.03252 0.03523
s__Escherichia coli_D 0.03155 0.02893 0.03205 0.02955 0.03223
s__Escherichia coli 0.0185 0.01747 0.01875 0.01766 0.01872
s__Escherichia coli_C 0.00975 0.00943 0.00986 0.00956 0.00994
s__Escherichia sp004211955 0.00458 0.00473 0.0046 0.00474 0.00463
s__Escherichia sp002965065 0.00375 0.00399 0.00381 0.00402 0.00379
s__Escherichia fergusonii 0.00398 0.00371 0.00405 0.00375 0.00405
s__Escherichia sp000208585 0.00321 0.0033 0.00326 0.00334 0.00328
/vis-tutorial
is not in your home directory (~
), change the ~/vis-tutorial/
to the absolute path where you downloaded the tutorial data\
) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
|
|
Line | Parameters | Description |
---|---|---|
2 | <feature_table.tsv> | file path of the relative abundance feature table |
3 | <metadata.tsv> | file path of the study metadata table. Expect a tab separated text file with one row per sample with the first row being the header, and columns are the different metadata. |
4 | <column_group:g1:g2> | column name in metadata.tsv that holds the grouping information and the group levels. Format is group_column:group1:group2:... .Note: the values are case sensitive. In this tutorial, the lab column is used and has 3 groups: labA , labB and labC . |
5 | <output_dir> | output directory path where results are stored |
6-1* | <output_prefix> | prefix that will be used for output files. In the example, LAB is set as the prefix for output files. |
6-2* | <top_N> | the number of top taxa to colour in the stack bar plot, remaining detected taxa are grouped together as ‘Other’. |
6-3* | <figure_width> | output figure width (inches) |
6-4* | <figure_height> | output figure height (inches) |
*the last 4 parameters are all on line 6 separated by a space
tree ~/vis-tutorial/analysis
analysis_v3/
├── LAB.select.average.table.txt
├── LAB.select.average.top_7_.barchart.pdf
├── LAB.select.average.top_7_seperategroup.barchart.pdf
├── LAB.select.average.top_7_seperategroup.table.txt
├── LAB.select.average.top_7_seperategroup.table.txt.7.R
├── LAB.select.average.top_7.table.txt
├── LAB.select.average.top_7.table.txt.7.R
├── LAB.select.meta.txt
├── LAB.select.table_top_7_seperategroup.txt
├── LAB.select.table_top_7_seperategroup.txt.7.R
├── LAB.select.table_top_7.txt
├── LAB.select.table_top_7.txt.7.R
├── LAB.select.table.txt
├── LAB.select.top_7_.barchart.pdf
└── LAB.select.top_7_seperategroup.barchart.pdf
Output files | Description |
---|---|
*.average.top_* | mean group abundances for the overall top 7 taxa across the entire study |
*.average.top_*_separategroup | mean group abundances for the top 7 taxa within each group |
*.meta.txt | study metadata specific for the plots |
*.table_top_* | sample abundances of the overall top 7 taxa across the entire study |
*.table_top_*_separategroup.* | sample abundances of the top 7 taxa within each group |
Relative abundances across all samples
This plot shows the relative abundances for the top 8 genera for each sample, where the top 8 genera is across all samples in the study.
File: LAB.select.top_8_.barchart.pdf
Mean relative abundances by group
This plot shows the mean relative abundances for the top 8 genera occurring across all samples in the study. All other detected taxa not in the top 7 are aggregated as Others. In this example, Pseudomonas E is the most abundant in samples processed by Lab-C."
File: LAB.select.average.top_8_.barchart.pdf
You can repeat Step 3 and change the input file from Phylum to Species feature tables to generate bar plots for the different or desired taxonomic ranks.
Alternatively if you have other grouping variables in your study design, you can change column_group
parameter in the command.
This tutorial runs through three common core biodiversity analyses for cross-sectional study designs that include group comparisons.
adonis
function in vegan
to assess for any significant (p < 0.05) differences between groups. The pipeline also generates a set of plots to help visualise the similarity between samples coloured by the groups using multiple ordinations.The data set used for this tutorial is based on the full data set from the [data processing tutorials](https://mrcbioinfo.github.io/mima-pipeline/docs/tutorials/data-processing/ (Tourlousse, et al., 2022) that was pre-process with the same pipeline. There are 56 samples in total from the study, but after taxonomy profiling with Kraken2 and Bracken, 4 samples fell below the abundance threshold and therefore the final taxonomy abundance table has 52 samples.
This data set consists of two mock communities: (i) DNA-mock and (ii) Cell-mock processed in three different labs: A, B and C.
In this tutorial, our main questions are:
For this set of tutorial, you need
Apptainer
or Singularity
installed (currently only tested on UNIX-based system)SANDBOX
environment variablevis-tutorial
directory from Biodiversity bar plots or create a directory vis-tutorial
mkdir vis-tutorial
cd vis-tutorial
wget https://github.com/mrcbioinfo/mima-pipeline/raw/master/examples/taxonomy-feature-table.tar.gz
tar xf taxonomy-feature-table.tar.gz
...
means “and others”tree .
~/vis-tutorial/
├── metadata.tsv
├── taxonomy-feature-table.tar.gz
└── Taxonomy_profiling
└── featureTables
├── bracken_FT_class
├── bracken_FT_class_counts
├── bracken_FT_class_relAbund
├── ...
├── bracken_FT_genus
├── bracken_FT_genus_counts
├── bracken_FT_genus_relAbund
├── combine_bracken_class.log
├── ...
├── combine_bracken_genus.log
└── generate_bracken_feature_table.py
vis-tutorial
directory is in your home directory as indicated by the ~
(tilde) sign.
If you have put the files in another location then replace all occurrences of ~/vis-tutorial
with your location.Directory / File | Description |
---|---|
metadata.tsv | is a tab-separated text file of the study metadata |
Taxonomy_profiling/featureTables | directory contains the taxonomy feature tables using Kraken2 and Bracken. There are feature tables for each taxonomy rank from Phylum to Species. As some analysis tools might require feature table input to be discrete counts while others might require relative abundances, two formats are provided with the suffixes:
|
ls metadata.tsv
ls Taxonomy_profiling/featureTables/bracken_FT_genus_counts
ls Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund
metadata.csv
file:head
shows the first 10 rowscolumn
formats the output into columns using tab as the separator (prettier view)head metadata.tsv | column -t -s $'\t'
Sample_ID biosample experiment instrument library.name sample.name lab type n_species
SRR17380113 SAMN20256237 SRX13554346 Illumina HiSeq 2500 metagenome_mockCell_labC_lib016 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380114 SAMN20256237 SRX13554345 Illumina HiSeq 2500 metagenome_mockCell_labC_lib015 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380115 SAMN20256237 SRX13554344 Illumina HiSeq 2500 metagenome_mockCell_labC_lib014 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380116 SAMN20256237 SRX13554343 Illumina HiSeq 2500 metagenome_mockCell_labC_lib013 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380117 SAMN20256237 SRX13554342 Illumina HiSeq 2500 metagenome_mockCell_labC_lib012 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380118 SAMN20256237 SRX13554341 Illumina HiSeq 2500 metagenome_mockCell_labC_lib011 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380119 SAMN20256237 SRX13554340 Illumina HiSeq 2500 metagenome_mockCell_labC_lib010 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380120 SAMN20256237 SRX13554339 Illumina HiSeq 2500 metagenome_mockCell_labC_lib009 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
SRR17380121 SAMN20256237 SRX13554338 Illumina HiSeq 2500 metagenome_mockCell_labC_lib008 Cell mock community, blend of 18 bacterial species labC Cell mock community 18
metadata.tsv
fileSample_ID
, which must correspond with the sample names in the feature tableslab
where the sample was processed, this will be our grouping factor in later stepshead Taxonomy_profiling/featureTables/bracken_FT_genus_relAbund | cut -f1-6 | column -t -s $'\t'
metadata.tsv
name SRR17380113 SRR17380114 SRR17380116 SRR17380117 SRR17380119
s__Escherichia flexneri 0.06616 0.06187 0.06694 0.06276 0.0674
s__Escherichia dysenteriae 0.0346 0.03221 0.03489 0.03252 0.03523
s__Escherichia coli_D 0.03155 0.02893 0.03205 0.02955 0.03223
s__Escherichia coli 0.0185 0.01747 0.01875 0.01766 0.01872
s__Escherichia coli_C 0.00975 0.00943 0.00986 0.00956 0.00994
s__Escherichia sp004211955 0.00458 0.00473 0.0046 0.00474 0.00463
s__Escherichia sp002965065 0.00375 0.00399 0.00381 0.00402 0.00379
s__Escherichia fergusonii 0.00398 0.00371 0.00405 0.00375 0.00405
s__Escherichia sp000208585 0.00321 0.0033 0.00326 0.00334 0.00328
/vis-tutorial
is not in your home directory (~
), change the ~/vis-tutorial/
to the absolute path where you downloaded the tutorial data\
) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
|
|
Parameter | Description |
---|---|
--feature-table <path> | absolute filepath to the taxonomy abundance table |
--metadata <path> | filepath to the study metadata table |
--study-groups <col1>,<col2> | Comma separated list of column names found in the --metadata file. These columns contain the study groups of interest. Each column should have at least 2 or more levels for comparisons. |
--output-dir <path> | output directory path |
lab
s (3 levels: A, B, and C)type
(2 levels: DNA and Cell)tree -d ~/vis-tutorial/analysis
~/vis-tutorial/analysis/
├── Kraken
├── Kraken_alpha
│ ├── boxplots
│ │ ├── corrected_sig
│ │ ├── non_sig
│ │ └── uncorrected_sig
│ └── wilcox-pairwise
├── Kraken_beta
│ ├── permanova
│ └── plots
└── Kraken_diff-abundance
├── barplot
├── boxplots
│ ├── corrected_sig
│ └── uncorrected_sig
├── volcano
└── wilcox-pairwise
Your output directory (only folders are shown) should resemble something like above, where first level sub-directories:
Directory | Description |
---|---|
Kraken/ | directory contains two text files:
|
Kraken_alpha/ | outputs from alpha-diversity analysis |
Kraken_beta/ | outputs from beta-diversity analysis |
Kraken_diff-abundance/ | outputs from differential abundance analysis |
tree ~/vis-tutorial/analysis/Kraken_alpha
~/vis-tutorial/analysis/Kraken_alpha
├── alpha_diversity_raw_data.txt
├── boxplots
│ ├── corrected_sig
│ │ ├── chao1_lab_dotplot.png
│ │ ├── chao2_lab_dotplot.png
│ │ ├── evenness_lab_dotplot.png
│ │ ├── invsimperson_type_dotplot.png
│ │ ├── observed_lab_dotplot.png
│ │ ├── shannon_type_dotplot.png
│ │ └── simperson_type_dotplot.png
│ ├── non_sig
│ │ ├── chao1_type_dotplot.png
│ │ ├── chao2_type_dotplot.png
│ │ ├── evenness_type_dotplot.png
│ │ ├── invsimperson_lab_dotplot.png
│ │ ├── observed_type_dotplot.png
│ │ ├── shannon_lab_dotplot.png
│ │ └── simperson_lab_dotplot.png
│ └── uncorrected_sig
├── lab_filter_table.txt
├── lab_kw_stat_summary.txt
├── lab_pairwise_dunn_test.txt
├── type_filter_table.txt
├── type_kw_stat_summary.txt
└── wilcox-pairwise
├── lab_labA_labB.wil.stat_summary.txt
├── lab_labA_labC.wil.stat_summary.txt
├── lab_labB_labC.wil.stat_summary.txt
└── type_Cell mock community_DNA mock community.wil.stat_summary.txt
5 directories, 24 files
lab
and type
, there are 2 sets of results as indicated by the 2nd part of the filename.*_lab_*
in the filename (highlighted rows), you should have:Directory | Description |
---|---|
Kraken_alpha/boxplots/ | boxplots comparing groups defined by the Comparisons are adjusted for multiple comparisons and results are separated further into sub-directories. |
.../boxplots/corrected_sig/ | alpha-diversities that show significant group differences after adjustment |
.../boxplots/non_sig/ | non-significant alpha-diversities after adjustment |
.../boxplots/uncorrected_sig/ | results that are significant before adjustment |
Kraken_alpha/wilcox-pairwise/ | text output from Wilcox rank-sum tests for two-group comparisons defined by the --study-groups parameter |
File | |
alpha_diversity_raw_data.txt | tab-separated text file of calculated alpha-diversities (columns) for each sample (row) |
*_filter_table.txt | filtered alpha_diversity_raw_data.txt table |
*_kw_stat_summary.txt | output from Kruskal-Wallis test when there are > 2 groups being compared |
*_pairwise_dunn_test.txt | output from post-hoc pairwise comparisons when there are > 2 groups being compared |
There is significant differences between Labs A-vs-B, A-vs-C and B-vs-C when comparing their Observed richness (number of species detected in a sample).
There is significant differences between Labs A-vs-B, A-vs-C and B-vs-C when comparing their Evenness (a measure to describe the abundance distribution of species detected in a sample/community, e.g., their dominance/rareness).
tree ~/vis-tutorial/analysis/Kraken_beta
~/vis-tutorial/analysis/Kraken_beta/
├── permanova
│ ├── lab_pairwise_adonis.results.tsv
│ └── type_pairwise_adonis.results.tsv
└── plots
├── CA-Bray-Curtis-lab.png
├── CA-Bray-Curtis-type.png
├── CCA-Bray-Curtis-lab.png
├── CCA-Bray-Curtis-type.png
├── DCA-Bray-Curtis-lab.png
├── DCA-Bray-Curtis-type.png
├── NMDS-Bray-Curtis-lab.png
├── NMDS-Bray-Curtis-type.png
├── PCA-Bray-Curtis-lab.png
├── PCA-Bray-Curtis-type.png
├── PCoA-Bray-Curtis-lab.png
├── PCoA-Bray-Curtis-type.png
├── RDA-Bray-Curtis-lab.png
└── RDA-Bray-Curtis-type.png
2 directories, 16 files
*_results.tsv
are text files of results from the PERMANOVA testsplots
directory--study-groups
parameter had value lab,type
) we have 2 files per ordination with the column name used as the suffix for the output filePrincipal co-ordinate analysis plot of the beta-diversity measured using Bray-Curtis dissimilarity between samples grouped by the three labs.
File: PCoA-Bray-Curtis-lab.png.
tree ~/vis-tutorial/analysis/Kraken_diff-abundance/
warning! there will be a lot of files because we performed 2 sets of comparisons between lab
and type
. Below only shows a snapshot with ...
meaning “and others”
~/vis-tutorial/analysis/Kraken_diff-abundance/
├── barplot
│ ├── lablabA_labB_barplot_FDR.png
│ ├── lablabA_labB_barplot_p.png
│ ├── typeCell mock community_DNA mock community_barplot_FDR.png
│ └── typeCell mock community_DNA mock community_barplot_p.png
├── boxplots
│ ├── corrected_sig
│ │ ├── ...
│ │ ├── s__Bifidobacterium.callitrichos_lab_dotplot.png
│ │ ├── s__Bifidobacterium.callitrichos_type_dotplot.png
│ │ ├── ...
│ │ └── s__Zag1.sp900556295_lab_dotplot.png
│ └── uncorrected_sig
│ ├── ...
│ ├── s__Acetatifactor.muris_type_dotplot.png
│ ├── ...
│ └── s__Zag1.sp900556295_type_dotplot.png
├── lab_filter_table.txt
├── lab_kw_stat_summary.txt
├── lab_pairwise_dunn_test.txt
├── type_filter_table.txt
├── type_kw_stat_summary.txt
├── volcano
│ ├── Cell mock community_DNA mock community.volcanoplot.d.p.png
│ ├── labA_labB.volcanoplot.d.p.png
│ ├── lablabA_labB.volcanoplot.d.FDR.png
│ ├── lablabA_labB.volcanoplot.Log2FC.FDR.png
│ ├── lablabA_labB_volcanoplot_Log2FC_p.png
│ ├── typeCell mock community_DNA mock community.volcanoplot.d.FDR.png
│ ├── typeCell mock community_DNA mock community.volcanoplot.Log2FC.FDR.png
│ └── typeCell mock community_DNA mock community_volcanoplot_Log2FC_p.png
└── wilcox-pairwise
├── lab_labA_labB.wil.stat_summary.txt
├── lab_labA_labC.wil.stat_summary.txt
├── lab_labB_labC.wil.stat_summary.txt
└── type_Cell mock community_DNA mock community.wil.stat_summary.txt
6 directories, 2301 files
The univariate comparison suggests there is significant lab differences after adjustment for multiple comparison in species *Acetivibrio A ethanolgignens*.
Another example of significant lab differences after adjustment for multiple comparison in species *Bifidobacterium callitrichos*.
The output files are organised in a similar fashion as per alpha-diversity output
Directory | Description |
---|---|
Kraken_diff-abundance/barplot/ | bar plot of each input feature and the measured Hodges-Lehmann estimator. suitable for 2 groups only, if you have >2 groups in your study, this will only compare the first two groups. |
Kraken_diff-abundance/boxplots/ | one boxplot for every feature (e.g., species, gene, pathway) in the input feature table, comparing between groups specified by the columns in Comparisons are adjusted for multiple comparisons and results are separated further into sub-directories |
.../boxplots/corrected_sig/ | features that show significant group differences after adjustment |
.../boxplots/non_sig/ | non-significant features after adjustment |
.../boxplots/uncorrected_sig/ | features that are significant before adjustment |
Kraken_diff-abundance/volcano/ | volcano plots of comparisons, a scatterplot showing the statistical significance (p-value) versus the magnitude of change (fold change). |
Kraken_diff-abundance/wilcox-pairwise/ | text output from Wilcox rank-sum tests for two-group comparisons defined by the --study-groups parameter |
File | |
*_filter_table.txt | filtered alpha_diversity_raw_data.txt table |
*_kw_stat_summary.txt | output from Kruskal-Wallis test when there are > 2 groups being compared |
*_pairwise_dunn_test.txt | output from post-hoc pairwise comparisons when there are > 2 groups being compared |
This tutorial shows how to run Random Forest classification using the MIMA Container.
Briefly, Random Forest is a supervised machine learning approach for classification and regression. It uses labelled data (what samples below to what group) to learn which key data features are useful in differentiating samples between classes (groups)—hence “supervised learning”—so that when new data are presented it can make a prediction.
Input data required for training a Random Forest classifier:
One common approach to evaluate classifier performance is via cross-validation (CV), which basically splits the input data into a training set and a testing set. MIMA uses the K-fold cross-validation approach. This splits the input data into K number of partitions; hold one partition for testing and the remaining K-1 partitions are used for training. So when you have 5-fold CV, you split data into 5 parts or 80/20% split with 80% of the data used for training and 20% used for testing.
In MIMA, K ranges from 3 to 10, incrementing by one each time, so for each input data you will get 8 outputs: cv_3, cv_4, …, cv_9, cv_10.
For this tutorial, we will use data from the Human Microbiome Project V1.
Our data consists of N=205 stool samples, of which 99 are from females and 106 are from males. Only samples from their first visit are used. The taxonomy profiles have been annotated using the MetaPhlAn2 pipeline.
For this set of tutorial, you need
Apptainer
or Singularity
installed (currently only tested on UNIX-based system)SANDBOX
environment variablerandforest-tut
mkdir randforest-tut
cd randforest-tut
wget https://github.com/mrcbioinfo/mima-pipeline/raw/master/examples/HMPv1-stool-samples.tar.gz
tar -xf HMPv1-stool-samples.tar.gz
tree .
~/random-forest
├── HMPv1-stool-samples
│ ├── hmp1-stool-visit-1.features_1-kingdom.tsv
│ ├── hmp1-stool-visit-1.features_2-phylum.tsv
│ ├── hmp1-stool-visit-1.features_3-class.tsv
│ ├── hmp1-stool-visit-1.features_4-order.tsv
│ ├── hmp1-stool-visit-1.features_5-family.tsv
│ ├── hmp1-stool-visit-1.features_6-genus.tsv
│ ├── hmp1-stool-visit-1.features_7-species.tsv
│ ├── hmp1-stool-visit-1.features_8-strain.tsv
│ └── hmp1-stool-visit-1.metadata.tsv
└── HMPv1-stool-samples.tar.gz
1 directory, 10 files
There should be 9 tab-separated text files (*.tsv extension) of which one is the metadata and the other are feature tables from Kingdom to Strain ranks.
randforest-tut
directory is in your home directory as indicated by the ~
(tilde) sign.
If you have put the files in another location then replace all occurrences of ~/randforest-tut
with your location.head HMPv1-stool-samples/hmp1-stool-visit-1.metadata.tsv | column -t
SN RANDSID VISNO STArea STSite SNPRNT Gender WMSPhase SRS
700014562 158458797 1 Gut Stool 700014555 Female 1 SRS011061
700014724 158479027 1 Gut Stool 700014718 Male 1 SRS011084
700014837 158499257 1 Gut Stool 700014832 Male 1 SRS011134
700015181 158742018 1 Gut Stool 700015179 Female 1 SRS011239
700015250 158802708 1 Gut Stool 700015245 Male 1 SRS011271
700015415 158944319 1 Gut Stool 700015394 Female 1 SRS011302
700015981 159247771 1 Gut Stool 700015979 Female 1 SRS011405
700016142 159146620 1 Gut Stool 700016136 Male 1 SRS011452
700016610 159166850 1 Gut Stool 700016608 Male 1 SRS011529
There are 9 columns, where SN
is the Sample ID and should correspond with the columns of the feature table
head HMPv1-stool-samples/hmp1-stool-visit-1.features_3-class.tsv | cut -f1-6 | column -t
lineage 700014562 700014724 700014837 700015181 700015250 700015415 700015981 700016142 700016610
k__Archaea|p__Euryarchaeota|c__Methanobacteria 0 0 0.0008246 0.0006299 0 0.0013228 0 0 0.0005494
k__Archaea|p__Euryarchaeota|c__Methanococci 0 0 0 0 0 0 0 0 0
k__Bacteria|p__Acidobacteria|c__Acidobacteria_noname 0 0 0 0 0 0 0 0 0
k__Bacteria|p__Acidobacteria|c__Acidobacteriia 0 0 0 0 0 0 0 0 0
k__Bacteria|p__Actinobacteria|c__Actinobacteria 0.00042 0.0038802 0.0006441 0.0082703 0.0020957 0.0042049 0.0001552 0.0012545 0.0031392
k__Bacteria|p__Bacteroidetes|c__Bacteroidetes_noname 0 0 0 0 0 0 0 0 0
k__Bacteria|p__Bacteroidetes|c__Bacteroidia 0.910796 0.837664 0.600565 0.844245 0.80352 0.451224 0.924745 0.887734 0.838053
k__Bacteria|p__Bacteroidetes|c__Cytophagia 0 0 0 0 0 0 0 0 0
k__Bacteria|p__Bacteroidetes|c__Flavobacteriia 0 0 0 0 0 0 0 0 0
mkdir classifer-output
~/randforest-tut
with your location.\
) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
|
|
Line | Parameters | Description |
---|---|---|
2 | -i/--input <path> | comma separated list of input text files (e.g. taxonomy abundance tables). |
10 | -m/--metadata <path> | file path of the study metadata table. Expect a tab separated text file with one row per sample with the first row being the header, and columns are the different metadata. |
11 | -c/--column <column_name> | column header from the metadata table that contains the grouping categories for classification |
12 | -o/--output <output_dir> | output directory path where results are stored |
If N input files are provided, there will be N+1 sets of output:
data_all
which combines all the input data together as a big tableIn this tutorial, we provided 8 input taxonomy files from Kingdom to Strain level features. This will generate 9 output files: one for each taxonomy rank and the last is a big table combining all the previous 8 together.
As mentioned above, MIMA uses K-fold cross validation (CV/cv) to evaluate the trained Random Forest classifier overall performance. For each input data there will be 8 outputs from cv_3 to cv_10.
tree classifier-output
~/randforest-tut/classifier-output
├── cv_auc.pdf
├── roc_classif.randomForest_dataset.pdf
├── roc_data_1_classifier.pdf
├── roc_data_1.pdf
├── roc_data_2_classifier.pdf
├── roc_data_2.pdf
├── roc_data_3_classifier.pdf
├── roc_data_3.pdf
├── roc_data_4_classifier.pdf
├── roc_data_4.pdf
├── roc_data_5_classifier.pdf
├── roc_data_5.pdf
├── roc_data_6_classifier.pdf
├── roc_data_6.pdf
├── roc_data_7_classifier.pdf
├── roc_data_7.pdf
├── roc_data_8_classifier.pdf
├── roc_data_8.pdf
├── roc_data_all_classifier.pdf
├── roc_data_all.pdf
└── socket_cluster_log.txt
0 directories, 21 files
You should have 21 files in total
File | N files | N pages/file | Description |
---|---|---|---|
socket_cluster_log.txt | 1 | n/a | log text file |
roc_data_*_classifier.pdf | 9 | 8 | there is on file one per input data (8 in total from Kingdom to Strain taxonomy abundance tables) and the last output Each PDF (data set) will have 8 pages: from 3- to 10-fold CV |
cv_auc.pdf | 1 | 9 | one page per input data: 8 from Kingdom to Strain and the last page for data_all where all the input tables are combined |
roc_classif.randomForest_dataset.pdf | 1 | 8 | one page per cross-validation (from 3- to 10-fold CV) |
roc_data_*.pdf | 9 | 8 | same as above |
There is one roc_data_*_classifier.pdf
file for each input data. Each PDF contains one page per N-fold cross validation (CV) that includes 3-folds to 10-folds CV.
Below we’ve only shown the classification results when using Species (data_7
) as input features and for 4 out of 8 CVs.
Receiver operating characteristic curve (ROC curve) is a graph that shows the overall performance of a classification model. Ideally, good classifiers will have high specificity (x-axis) and high sensitivity (y-axis), therefore the desired curve is towards to top-left above the diagonal line. If the curve follows the diagonal line then the classifier is no better than randomly guessing the class (groups) a sample belongs.
ROC curve characteristics [source: Wikipedia]
The area under the ROC curve (AUC) is the overall accuracy of a classifier, ranging between 0-1 (or reported as percentages).
roc_data_7_classifier.pdf
. There is one page per cross-validation from 3- to 10-fold CV.This shows the ROC curve for the Species relative abundance table (data_7) when using 3-fold cross-validation.
This figure shows the ROC curve for the Species relative abundance table (data_7) when using 4-fold cross-validation.
This figure shows the ROC curve for the Species relative abundance table (data_7) when using 8-fold cross-validation.
This figure shows the ROC curve for the Species relative abundance table (data_7) when using 10-fold cross-validation.
cv.auc.pdf files shows an aggregation of the above roc_data_*_classifier.pdf
plots. There is one page per input data and the plots shows the AUC against the K-fold CV.
In this plot, we observe that when using Species (data_7) abundances as the feature table, the classifier accuracy ranges between 0.7 to 0.8 with a peak performance occurring at cv_8 (8-fold cross validation). In general, when there is more data for training (higher cv) there is better performance (higher AUC).
This plot compares accuracy (AUC) for Random Forest classifiers trained on the Species abundance tables (input 7 == page 7). The AUC values correspond to the area under the curve of the above roc_data_7_classifier.pdf
file. In this plot, we observe that when using Species (data_7) abundances as the feature table, the classifier accuracy ranges between 0.7 to 0.8 with a peak performance occurring at cv_8 (8-fold cross validation). In general, when there is more data for training (higher cv) there is better performance (higher AUC).
roc_classif.randomForest_dataset.pdf is summary PDF with one page per K-fold CV. Each plot shows the ROC curves for each input data for a specified K-fold CV. Below shows the performance for classifiers trained using 8-fold CV (page 8) across the 8 input tables (Kingdom to Strain) plus the final data table that combine all the eight tables together as one big dataset giving the ninth data set.
This PDF combines the individual plots from the set of roc_data_*_classifier.pdf
files.
The plot below tells us that when using 8-fold CV for training, the best performing classifier to distinguish between males and females, in this set is data_7
with an accuracy of 0.787 (or 78.7%). This classifier was trained using Species rank abundance table. Strain level is comparable with accuracy of 0.779, while Kingdom classifiers perform very poorly (accuracy = 0.519) which are pretty much random guesses. There was no improvement to the classify if we combined all the features (data_9) together.
The ROC curves of Random Forest classifiers trained using 8-fold cross-validation (cv_8) for input data from Kingdom (data_1) to Strain (data_8) and the combined dataset which merges all input data as one table (data_9). When using 8-fold CV the best performing classifier has accuracy of 0.787 using the Species input data (data_7). See text description for further interpretation.