Download tutorial data
In the data processing tutorials, we will use data from the study by
Tourlousse, et al. (2022), Characterization and Demonstration of Mock Communities as Control Reagents for Accurate Human Microbiome Community Measures, Microbiology Spectrum.
This data set consists of two mock communities: DNA-mock and Cell-mock. The mock communities consist of bacteria that are mainly detected in the human gastrointestinal tract ecosystem, with a small mixture of some skin microbiota. The data was processed in three different labs: A, B and C. In the previous tutorial, we only processed a subset of the samples (n=9). In this tutorial we will be working with the full data set, which has been pre-processed using the same pipeline. In total there were 56 samples, of which 4 fell below the abundance threshold, so the final taxonomy abundance table has 52 samples. We will train the random forest classifier to distinguish between the three labs.
- The raw reads are available from NCBI SRA Project PRJNA747117, and there are 56 paired-end samples (112 fastq files)
- As the data is very big we will work with a subset (n=9 samples, 18 fastq files)
- This tutorial teaches you how to prepare the required raw fastq files
Note
You will need about 80GB of disk space, depending on which option you use for downloading.
Step 1) Download tutorial files
- Download the zip file mima_tutorial.zip via wget
wget https://github.com/mrcbioinfo/mima-pipeline/raw/master/examples/mima_tutorial.zip
- Extract the archived file using unzip
unzip mima_tutorial.zip
- Check the directory structure matches using tree
tree mima_tutorial
~/mima_tutorial
├── ftp_download_files.sh
├── manifest.csv
├── pbs_header_func.cfg
├── pbs_header_qc.cfg
├── pbs_header_taxa.cfg
├── raw_data
└── SRA_files
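If `tree` is not installed, the same check can be sketched as a small shell function. This is only an illustration: the function name `check_tutorial_layout` is hypothetical, and it assumes `raw_data` is a directory while every other entry in the listing above is a regular file.

```shell
# Hypothetical helper: report whether the expected tutorial files exist.
# Usage: check_tutorial_layout [directory]   (default: ~/mima_tutorial)
check_tutorial_layout() {
    local dir="${1:-$HOME/mima_tutorial}"
    local f missing=0
    # Entries assumed to be regular files (taken from the tree listing)
    for f in ftp_download_files.sh manifest.csv SRA_files \
             pbs_header_func.cfg pbs_header_qc.cfg pbs_header_taxa.cfg; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing file: $dir/$f" >&2
            missing=1
        fi
    done
    # raw_data is assumed to be a directory
    if [ ! -d "$dir/raw_data" ]; then
        echo "missing directory: $dir/raw_data" >&2
        missing=1
    fi
    return "$missing"
}
```

Calling `check_tutorial_layout` with no arguments checks `~/mima_tutorial`; it returns a non-zero status and prints the missing entries to standard error if anything is absent.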
Note! Assumed working directory
This tutorial assumes the ~/mima_tutorial directory is located in your home directory, as indicated by the ~ (tilde) sign. If you have put the files in another location, replace all occurrences of ~/mima_tutorial with your location (remember to use absolute paths).
Data files
File | Description |
---|---|
ftp_download_files.sh | direct download FTP links used in Option A: direct download below |
SRA_files | contains the SRA identifiers of the 9 samples used in Option B and Option C below |
manifest.csv | comma separated file of 3 columns, that lists the sampleID, forward_filename, reverse_filename |
pbs_header_*.cfg | PBS configuration files that contain specific parameters for job submissions |
Step 2) Download Sequence files
Choose from the 3 options for downloading the tutorial data depending on your environment setup:
Option | Tool | Est. size | Description |
---|---|---|---|
A | curl | 24GB | direct download using curl command, files are already compressed |
B | sratoolkit (installed on system) | 75GB | download using sratoolkit; the files are not compressed when downloaded, so compression is a post-processing step |
C | sratoolkit (installed in MIMA) | 75GB | sratoolkit is installed in the MIMA container; same as Option B, downloaded files are not compressed |
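Before choosing an option, it may help to confirm that the target filesystem has enough free space for the estimated sizes in the table. The following is a minimal sketch; the function name `has_free_space` is hypothetical, and it relies on the POSIX `df -P -k` output format, where the fourth column of the second line is the available space in 1024-byte blocks.

```shell
# Hypothetical helper: succeed if the filesystem holding DIR has at least
# NEED_GB gigabytes available.
# Usage: has_free_space DIR NEED_GB
has_free_space() {
    local dir="$1" need_gb="$2"
    local avail_kb
    # df -P gives one line per filesystem; column 4 is "Available" in KB
    avail_kb=$(df -P -k "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}
```

For example, `has_free_space ~/mima_tutorial 75 || echo "not enough space for Option B or C"` prints a warning when less than 75GB is available.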
Option A: direct download
- Run the following command for direct download
bash ftp_download_files.sh
Option B: download with sratoolkit
- Download the SRA files using the prefetch command
prefetch --option-file SRA_files --output-directory raw_data
Below is the output, wait until all files are downloaded
2022-09-08T05:50:42 prefetch.3.0.0: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2022-09-08T05:50:42 prefetch.3.0.0: 1) Downloading 'SRR17380209'...
2022-09-08T05:50:42 prefetch.3.0.0: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.
2022-09-08T05:50:42 prefetch.3.0.0: Downloading via HTTPS...
...
- After the download finishes, check the downloaded files with the tree command
tree raw_data
raw_data/
├── SRR17380115
│ └── SRR17380115.sra
├── SRR17380118
│ └── SRR17380118.sra
├── SRR17380122
│ └── SRR17380122.sra
├── SRR17380209
│ └── SRR17380209.sra
├── SRR17380218
│ └── SRR17380218.sra
├── SRR17380222
│ └── SRR17380222.sra
├── SRR17380231
│ └── SRR17380231.sra
├── SRR17380232
│ └── SRR17380232.sra
└── SRR17380236
└── SRR17380236.sra
- Extract the fastq files using the fasterq-dump command
- We’ll also save some disk space by compressing the fastq files using gzip (or pigz)
cd ~/mima_tutorial/raw_data
fasterq-dump --split-files */*.sra
gzip *.fastq
tree .
.
├── SRR17380115
│ └── SRR17380115.sra
├── SRR17380115_1.fastq.gz
├── SRR17380115_2.fastq.gz
├── SRR17380118
│ └── SRR17380118.sra
├── SRR17380118_1.fastq.gz
├── SRR17380118_2.fastq.gz
├── SRR17380122
│ └── SRR17380122.sra
├── SRR17380122_1.fastq.gz
├── SRR17380122_2.fastq.gz
├── SRR17380209
│ └── SRR17380209.sra
├── SRR17380209_1.fastq.gz
├── SRR17380209_2.fastq.gz
├── SRR17380218
│ └── SRR17380218.sra
├── SRR17380218_1.fastq.gz
├── SRR17380218_2.fastq.gz
├── SRR17380222
│ └── SRR17380222.sra
├── SRR17380222_1.fastq.gz
├── SRR17380222_2.fastq.gz
├── SRR17380231
│ └── SRR17380231.sra
├── SRR17380231_1.fastq.gz
├── SRR17380231_2.fastq.gz
├── SRR17380232
│ └── SRR17380232.sra
├── SRR17380232_1.fastq.gz
├── SRR17380232_2.fastq.gz
├── SRR17380236
│ └── SRR17380236.sra
├── SRR17380236_1.fastq.gz
└── SRR17380236_2.fastq.gz
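Rather than reading the listing by eye, the presence of both paired files for each sample can be checked with a short function. This is a sketch under assumptions: the name `check_fastq_pairs` is hypothetical, and the accession list is assumed to hold one SRA identifier per line.

```shell
# Hypothetical helper: for every accession listed (one per line) in LIST,
# confirm that both paired fastq.gz files exist and are non-empty in DIR.
# Usage: check_fastq_pairs LIST DIR
check_fastq_pairs() {
    local list="$1" dir="$2"
    local acc mate missing=0
    # The "|| [ -n ... ]" part also handles a last line without a newline
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue          # skip blank lines
        for mate in 1 2; do
            if [ ! -s "$dir/${acc}_${mate}.fastq.gz" ]; then
                echo "missing or empty: ${acc}_${mate}.fastq.gz" >&2
                missing=1
            fi
        done
    done < "$list"
    return "$missing"
}
```

A call such as `check_fastq_pairs ~/mima_tutorial/SRA_files ~/mima_tutorial/raw_data` returns a non-zero status if any of the expected files is missing or empty.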
Option C: download via MIMA
- This option assumes you have installed MIMA and set up the SANDBOX environment variable
- Download the SRA files using the following command
apptainer exec $SANDBOX prefetch --option-file SRA_files --output-directory raw_data
- Below is the output, wait until all files are downloaded
2022-09-08T05:50:42 prefetch.3.0.0: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2022-09-08T05:50:42 prefetch.3.0.0: 1) Downloading 'SRR17380209'...
2022-09-08T05:50:42 prefetch.3.0.0: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.
2022-09-08T05:50:42 prefetch.3.0.0: Downloading via HTTPS...
...
- After download finishes, check the downloaded files
tree raw_data
raw_data/
├── SRR17380115
│ └── SRR17380115.sra
├── SRR17380118
│ └── SRR17380118.sra
├── SRR17380122
│ └── SRR17380122.sra
├── SRR17380209
│ └── SRR17380209.sra
├── SRR17380218
│ └── SRR17380218.sra
├── SRR17380222
│ └── SRR17380222.sra
├── SRR17380231
│ └── SRR17380231.sra
├── SRR17380232
│ └── SRR17380232.sra
└── SRR17380236
└── SRR17380236.sra
- Extract the fastq files using the fasterq-dump command
- We’ll also save some disk space by compressing the fastq files using gzip (or pigz)
cd ~/mima_tutorial/raw_data
apptainer exec $SANDBOX fasterq-dump --split-files */*.sra
apptainer exec $SANDBOX gzip *.fastq
tree .
.
├── SRR17380115
│ └── SRR17380115.sra
├── SRR17380115_1.fastq.gz
├── SRR17380115_2.fastq.gz
├── SRR17380118
│ └── SRR17380118.sra
├── SRR17380118_1.fastq.gz
├── SRR17380118_2.fastq.gz
├── SRR17380122
│ └── SRR17380122.sra
├── SRR17380122_1.fastq.gz
├── SRR17380122_2.fastq.gz
├── SRR17380209
│ └── SRR17380209.sra
├── SRR17380209_1.fastq.gz
├── SRR17380209_2.fastq.gz
├── SRR17380218
│ └── SRR17380218.sra
├── SRR17380218_1.fastq.gz
├── SRR17380218_2.fastq.gz
├── SRR17380222
│ └── SRR17380222.sra
├── SRR17380222_1.fastq.gz
├── SRR17380222_2.fastq.gz
├── SRR17380231
│ └── SRR17380231.sra
├── SRR17380231_1.fastq.gz
├── SRR17380231_2.fastq.gz
├── SRR17380232
│ └── SRR17380232.sra
├── SRR17380232_1.fastq.gz
├── SRR17380232_2.fastq.gz
├── SRR17380236
│ └── SRR17380236.sra
├── SRR17380236_1.fastq.gz
└── SRR17380236_2.fastq.gz
Step 3) Check manifest
- Examine the manifest file
cat ~/mima_tutorial/manifest.csv
- Your output should look something like the below
- Check that column 2 (FileID_R1) and column 3 (FileID_R2) match the names of the files in raw_data
- Update the manifest file as required
Sample_ID,FileID_R1,FileID_R2
SRR17380209,SRR17380209_1.fastq.gz,SRR17380209_2.fastq.gz
SRR17380232,SRR17380232_1.fastq.gz,SRR17380232_2.fastq.gz
SRR17380236,SRR17380236_1.fastq.gz,SRR17380236_2.fastq.gz
SRR17380231,SRR17380231_1.fastq.gz,SRR17380231_2.fastq.gz
SRR17380218,SRR17380218_1.fastq.gz,SRR17380218_2.fastq.gz
SRR17380222,SRR17380222_1.fastq.gz,SRR17380222_2.fastq.gz
SRR17380118,SRR17380118_1.fastq.gz,SRR17380118_2.fastq.gz
SRR17380115,SRR17380115_1.fastq.gz,SRR17380115_2.fastq.gz
SRR17380122,SRR17380122_1.fastq.gz,SRR17380122_2.fastq.gz
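Comparing columns 2 and 3 against the contents of raw_data can also be scripted. The sketch below is illustrative only; `check_manifest_files` is a hypothetical name, and it assumes a plain comma-separated manifest with a single header row and no quoted fields.

```shell
# Hypothetical helper: confirm every file named in columns 2 and 3 of the
# manifest exists in the raw data directory.
# Usage: check_manifest_files MANIFEST RAW_DIR
check_manifest_files() {
    local manifest="$1" raw_dir="$2"
    local sample r1 r2 f missing=0
    while IFS=, read -r sample r1 r2; do
        [ -z "$sample" ] && continue       # skip blank lines
        for f in "$r1" "$r2"; do
            if [ ! -f "$raw_dir/$f" ]; then
                echo "$sample: '$f' not found in $raw_dir" >&2
                missing=1
            fi
        done
    done <<EOF
$(tail -n +2 "$manifest")
EOF
    return "$missing"
}
```

The here-document feeds every row after the header into the loop without starting a subshell, so the `missing` flag is still set when the function returns. For this tutorial the call would be `check_manifest_files ~/mima_tutorial/manifest.csv ~/mima_tutorial/raw_data`.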
Manifest file formats
- the first row is the header and is case sensitive; it must have the three columns:
Sample_ID,FileID_R1,FileID_R2
- the filenames in columns 2 and 3 do not need to be absolute paths as the directory where the files are located will be specified during quality checking
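Because the header is case sensitive, a quick exact-match test on the first row can catch a mistyped header early. This is a minimal sketch with a hypothetical function name, `check_manifest_header`; it compares the first line byte for byte, so a file with Windows line endings would also fail the check.

```shell
# Hypothetical helper: succeed only if the first row of the manifest is
# exactly the expected, case-sensitive header.
# Usage: check_manifest_header MANIFEST
check_manifest_header() {
    local header
    header=$(head -n 1 "$1")
    [ "$header" = "Sample_ID,FileID_R1,FileID_R2" ]
}
```

For example, `check_manifest_header ~/mima_tutorial/manifest.csv || echo "unexpected manifest header"` prints a warning when the header differs.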
Remember to check out what else you need to know before jumping into quality checking