Data dependencies

Many steps in the pipeline require access to reference databases. These databases are large (tens of gigabytes in some cases) and are often already installed by system administrators, so they are not included in the container images.

To run the pipeline, you need the absolute paths to the following reference databases:

- minimap2
- kraken2 and bracken
- humann
  - CHOCOPhlAn
  - uniref
- metaphlan CHOCOPhlAn database
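
Before launching the pipeline, it can help to confirm that each database path is absolute and actually exists. The following is a minimal sketch, not part of MIMA: the function name check_db_path and the paths in the usage example are hypothetical.

```shell
#!/usr/bin/env bash
# check_db_path NAME PATH (hypothetical helper, not part of MIMA)
# Succeeds only if PATH is absolute and exists on disk.
check_db_path() {
    local name="$1" path="$2"
    case "$path" in
        /*) ;;  # absolute path, continue
        *)  echo "$name: '$path' is not an absolute path" >&2
            return 1 ;;
    esac
    if [ ! -e "$path" ]; then
        echo "$name: '$path' does not exist" >&2
        return 1
    fi
    echo "$name: OK ($path)"
}

# Example usage (paths are placeholders, adjust to your system):
#   check_db_path minimap2 /refdb/human/GRCh38.p14.fna
#   check_db_path kraken2  /refdb/kraken2
```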

You might also need to set up path binding when deploying the containers.

If you are missing any reference datasets, see below for download information.

QC: decontamination step

Tool	Description	URL
Minimap2	Requires a reference genome; we used the human reference genome GRCh38.p14 from NCBI (~800MB)	https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000001405.40/ (download the genomic sequence FASTA file)

Taxonomy profiling

Tool	Description	URL
Kraken2	requires taxonomy reference database	Pre-built: https://benlangmead.github.io/aws-indexes/k2
Bracken	build indexes from Kraken2 database	Pre-built: https://benlangmead.github.io/aws-indexes/k2 or see the Bracken build tutorial
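
If a Kraken2 database is already present on your system, a quick check can confirm that it also contains a Bracken index. The sketch below assumes the file names that Kraken2 and Bracken builds typically produce (hash.k2d, opts.k2d, taxo.k2d and database<READ_LEN>mers.kmer_distrib). The helper name check_kraken_db and the default read length of 150 are assumptions for illustration.

```shell
#!/usr/bin/env bash
# check_kraken_db DIR [READ_LEN] (hypothetical helper)
# Reports any expected Kraken2/Bracken file missing from DIR.
# READ_LEN defaults to 150 (an assumption; use your own read length).
check_kraken_db() {
    local db="$1" read_len="${2:-150}" missing=0 f
    for f in hash.k2d opts.k2d taxo.k2d "database${read_len}mers.kmer_distrib"; do
        if [ ! -f "$db/$f" ]; then
            echo "missing: $db/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (placeholder path):
#   check_kraken_db /refdb/kraken2 150
```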

Functional profiling

HUMAnN database

Requires 3 reference databases. The instructions below download them using the MIMA container setup; alternatively, refer to the developer’s documentation
Estimated disk space: ~53GB (downloading can take several hours)
After installing the MIMA container, ensure you have set the SANDBOX environment variable
You can list the available databases and their versions using the command:

apptainer exec $SANDBOX humann_databases --available

output:

HUMAnN Databases ( database : build = location )
chocophlan : full = http://huttenhower.sph.harvard.edu/humann_data/chocophlan/full_chocophlan.v201901_v31.tar.gz
chocophlan : DEMO = http://huttenhower.sph.harvard.edu/humann_data/chocophlan/DEMO_chocophlan.v201901_v31.tar.gz
uniref : uniref50_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_annotated/uniref50_annotated_v201901b_full.tar.gz
uniref : uniref90_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_annotated/uniref90_annotated_v201901b_full.tar.gz
uniref : uniref50_ec_filtered_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_ec_filtered/uniref50_ec_filtered_201901b_subset.tar.gz
uniref : uniref90_ec_filtered_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_ec_filtered/uniref90_ec_filtered_201901b_subset.tar.gz
uniref : DEMO_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_annotated/uniref90_DEMO_diamond_v201901b.tar.gz
utility_mapping : full = http://huttenhower.sph.harvard.edu/humann_data/full_mapping_v201901b.tar.gz
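
Each line of the listing above follows the pattern "database : build = location", so the URL for a given database and build can be extracted with a short awk filter. This is a sketch only; the function name db_url is hypothetical, and it assumes the output format stays as shown.

```shell
#!/usr/bin/env bash
# db_url DATABASE BUILD (hypothetical helper)
# Reads `humann_databases --available` output on stdin and prints the
# download URL of the matching "DATABASE : BUILD = URL" line.
db_url() {
    awk -v db="$1" -v build="$2" '
        $1 == db && $2 == ":" && $3 == build && $4 == "=" { print $5 }
    '
}

# Example usage:
#   apptainer exec $SANDBOX humann_databases --available | db_url uniref uniref90_diamond
```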

The first command below creates a new folder ~/refDB/humann3 in your home directory
The next three commands install the three required databases
- note that many HPC systems have limited space in your home directory (~).
Replace ~/refDB/humann3 with your preferred location as needed. If installing to an external path, remember to set path binding.

mkdir -p ~/refDB/humann3
apptainer exec $SANDBOX humann_databases --download chocophlan full ~/refDB/humann3
apptainer exec $SANDBOX humann_databases --download uniref uniref90_diamond ~/refDB/humann3
apptainer exec $SANDBOX humann_databases --download utility_mapping full ~/refDB/humann3
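
Because the HUMAnN databases need roughly 53GB and home directories on HPC systems are often small, it may be worth checking free space on the target filesystem before starting the downloads. The sketch below uses the POSIX output format of df; the function name has_free_gb is hypothetical.

```shell
#!/usr/bin/env bash
# has_free_gb DIR GB (hypothetical helper)
# Succeeds if the filesystem holding DIR has at least GB gigabytes available.
has_free_gb() {
    local dir="$1" need_gb="$2" avail_kb
    # df -P prints one line per filesystem; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Example usage:
#   has_free_gb ~/refDB 53 || echo "not enough space for the HUMAnN databases" >&2
```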

After installation, check that the three database folders exist:

tree -d ~/refDB/humann3

~/refDB/humann3
├── chocophlan
├── uniref
└── utility_mapping
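
If tree is not installed, the same check can be scripted. The sketch below confirms that the three folders shown above exist and are not empty; the function name check_humann_db is hypothetical.

```shell
#!/usr/bin/env bash
# check_humann_db ROOT (hypothetical helper)
# Verifies that chocophlan, uniref and utility_mapping exist under ROOT
# and that each contains at least one entry.
check_humann_db() {
    local root="$1" sub status=0
    for sub in chocophlan uniref utility_mapping; do
        if [ -d "$root/$sub" ] && [ -n "$(ls -A "$root/$sub")" ]; then
            echo "ok: $sub"
        else
            echo "missing or empty: $sub" >&2
            status=1
        fi
    done
    return "$status"
}

# Example usage:
#   check_humann_db ~/refDB/humann3
```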

MetaPhlAn database

Estimated disk space: ~26GB; see also the developer’s documentation
After installing the MIMA container, ensure you have set the SANDBOX environment variable
The command below installs the required Bowtie2 database; the --bowtie2db parameter sets the installation path
- the example below installs it in your home directory (~)
Replace ~/refDB/metaphlan_databases/ with your preferred location as needed. If installing to an external path, remember to set path binding.

mkdir -p ~/refDB/metaphlan_databases
apptainer exec $SANDBOX metaphlan --install --bowtie2db ~/refDB/metaphlan_databases/
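
Both the HUMAnN and MetaPhlAn installs rely on the SANDBOX environment variable. A small guard such as the one sketched below can catch an unset or stale value before a long download starts. The function name require_sandbox is hypothetical.

```shell
#!/usr/bin/env bash
# require_sandbox (hypothetical helper)
# Fails with a message if SANDBOX is unset, empty, or points to a
# location that does not exist.
require_sandbox() {
    if [ -z "${SANDBOX:-}" ]; then
        echo "SANDBOX is not set; point it at the MIMA container" >&2
        return 1
    fi
    if [ ! -e "$SANDBOX" ]; then
        echo "SANDBOX='$SANDBOX' does not exist" >&2
        return 1
    fi
}

# Example usage:
#   require_sandbox && apptainer exec $SANDBOX metaphlan --install --bowtie2db ~/refDB/metaphlan_databases/
```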

Installing to external path

If you are using the MIMA container to install reference databases to a location other than your home directory, remember to set path binding.

The example below uses the -B parameter to bind the external path into the container:

apptainer exec -B /another/loc/metaphlan_databases:/another/loc/metaphlan_databases $SANDBOX metaphlan --install --bowtie2db /another/loc/metaphlan_databases
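
When several databases live outside your home directory, writing each -B source:destination pair by hand becomes repetitive. The sketch below builds the pairs from a list of absolute paths, binding each path to the same location inside the container as in the example above. The function name bind_args is hypothetical, and the sketch assumes that your home directory is already available inside the container and that the paths contain no spaces.

```shell
#!/usr/bin/env bash
# bind_args PATH... (hypothetical helper)
# Prints one "-B path:path" pair per line for every absolute path that
# lies outside $HOME. Paths under $HOME are skipped on the assumption
# that the home directory is already visible inside the container.
bind_args() {
    local p
    for p in "$@"; do
        case "$p" in
            "$HOME"|"$HOME"/*) ;;  # assumed to be bound already
            /*) printf '%s %s:%s\n' -B "$p" "$p" ;;
            *)  echo "not an absolute path: $p" >&2
                return 1 ;;
        esac
    done
}

# Example usage (relies on word splitting, so paths must not contain spaces):
#   apptainer exec $(bind_args /another/loc/metaphlan_databases) $SANDBOX \
#       metaphlan --install --bowtie2db /another/loc/metaphlan_databases
```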

Last modified 25.09.2024