Quality control (QC)

QC module in MIMA, check reads are of high standard

This step must be done before Taxonomy and Function profiling.

Introduction: QC module

Quality control (QC) checks the sequenced reads obtained from the sequencing machine is of good quality. Bad quality reads or artefacts due to sequencing error if not removed can lead to spurious results and affect downstream analyses. There are a number of tools available for checking the read quality of high-throughput sequencing data.

This pipeline uses the following tools:

Workflow description

One bash (*.sh) script per sample is generated by this module that performs the following steps. The module also generates $N$ number of PBS scripts (set using the --num-pbs-jobs parameter).

This tutorial has 9 samples spread across 3 PBS jobs. One PBS job will execute three bash scripts, one per sample.

Step
1. repair.sh from BBTools/BBMap tool suite, repairs the sequenced reads and separates out any singleton reads (orphaned reads that are missing either the forward or reverse partner read) into its own text file.
2. dereplication using clumpify.sh from BBTools/BBMap tool suite, removes duplicate reads and clusters reads for faster downstream processing
3. quality checking is done with fastp.sh and checks the quality of the reads and removes any reads that are of low quality, too long or too short
4. decontamination with minimap2.sh maps the sequenced reads against a reference genome (e.g., human) defined by the user. It removes these host-reads from the data and generates a cleaned output file. This becomes the input for taxonomy profiling and function profiling modules.
The (optional) last step generates a QC report text file with summary statistics for each file. This step is run after all PBS scripts have been executed for all samples in the study.

For details about the tool versions, you can check the MIMA container image installed

Step 1. Generate QC PBS scripts

Tick off the check list
Find the absolute path to the host reference genome file (e.g., Human host might use GRCh38_latest_genome.fna)
Replace the highlighted line --ref <path/to/human/GRCh38_latest_genome.fna> to the location of your host reference genome file
Enter the following command
- the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
- note if you enter the command as one line, remove the backslashes

1
2
3
4
5
6
7
8
apptainer run --app mima-qc $SANDBOX \
-i ~/mima_tutorial/raw_data \
-o ~/mima_tutorial/output \
-m ~/mima_tutorial/manifest.csv \
--num-pbs-jobs 3 \
--ref <path/to/human/GRCh38_latest_genome.fna> \
--mode container \
--pbs-config ~/mima_tutorial/pbs_header_qc.cfg

For MRC users, see here for file locations

Parameters explained

Parameter	Required?	Description
`-i <input>`	yes	must be absolute path to the directory where fastq files are stored. FastQ files often have the .fastq.gz or .fq.gz extension. This path is used to find the FileID_R1 and FileID_R2 columns specified in the `manifest.csv` file
`-o <output>`	yes	absolute path output directory where results will be saved. The `<output>` path will be created if it does not exists. Note if there are already existing subdirectories in the `<output>` path, then this step will fail.
`-m <manifest.csv>`	yes	a comma-separated file (.csv) that has three columns with the headers: Sample_ID, FileID_R1, FileID_R2* see the example below. Note the fileID_R1 and fileID_R2 are relative to the `-i <input>` path provided.
`--num-pbs-jobs`	no (default=4)	number of PBS scripts to generate by default 4 jobs are created with samples split equally between the jobs
`--ref`	yes	path to the host genome file GRCh38_latest_genome.fna
`--mode container`	no (default=‘single’)	set this if you are running as a Container
`--pbs-config`	yes if `--mode container`	path to the pbs configuration file, must specify this parameter if `--mode container` is set

tip

These commands can get really long, you can use Notepad to

copy & paste the commands
change the paths and parameters as desired
copy & paste to the terminal

Or you can explore using the vim text editor.

Step 2. Check generated scripts

After step 1, you should see the new directory: ~/mima_tutorial/output
Inside output/, there should be a sub-directory QC_module
Inside output/QC_module, there should be 3 PBS scripts with the *.pbs file extension

tree ~/mima_tutorial

only the output folder is shown in the snapshot below

~/mima_tutorial
└── output/
    └──  QC_module
      ├── CleanReads
      ├── qcModule_0.pbs
      ├── qcModule_1.pbs
      ├── qcModule_2.pbs
      ├── QCReport
      ├── SRR17380115.sh
      ├── SRR17380118.sh
      ├── SRR17380122.sh
      ├── SRR17380209.sh
      ├── SRR17380218.sh
      ├── SRR17380222.sh
      ├── SRR17380231.sh
      ├── SRR17380232.sh
      └── SRR17380236.sh

Step 3. Submit QC jobs

Let’s examine one of the PBS scripts to be submitted

cat ~/mima_tutorial/output/QC_module/qcModule_0.pbs

Your PBS script should look something like below, with some differences
- line 10: IMAGE_DIR should be where you installed MIMA and build the sandbox
- line 11: APPTAINER_BIND should be setup during installation when binding paths
  - make sure to include the path where the host reference genome file is located
- line 14: /home/user is replaced with the absolute path to your actual home directory

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
#!/bin/bash
#PBS -N QC_module_0
#PBS -l ncpus=8
#PBS -l walltime=2:00:00
#PBS -l mem=64GB
#PBS -j oe

set -x

IMAGE_DIR=~/mima-pipeline
export APPTAINER_BIND="/path/to/human/GRCh38:/path/to/human/GRCh38"


cd /home/user/mima_tutorial/output/QC_module/

apptainer exec ${IMAGE_DIR} bash /home/user/mima_tutorial/output/QC_module/SRR17380209.sh > SRR17380209_qc_module.log 2>&1
apptainer exec ${IMAGE_DIR} bash /home/user/mima_tutorial/output/QC_module/SRR17380232.sh > SRR17380232_qc_module.log 2>&1
apptainer exec ${IMAGE_DIR} bash /home/user/mima_tutorial/output/QC_module/SRR17380236.sh > SRR17380236_qc_module.log 2>&1

Navigate to the ~/mima_tutorial/output/QC_module
Submit the PBS job using qsub

cd ~/mima_tutorial/output/QC_module
qsub qcModule_0.pbs

You can check the job has been submitted and their status with qstat
Repeat the qsub command for each of the qcModule_1.pbs and qcModule_2.pbs files
Wait until all PBS jobs have completed

PBS log files

Usually a PBS job will generate a log file in the same directory from where the job was submitted
As such we changed directory to where the PBS scripts are saved to submit the jobs because the log files are required for the Step 5. Generating the QC report.

Step 4. Check QC outputs

After all PBS job completes, you should have the following outputs

tree ~/mima_tutorial/output/QC_module

We only show outputs for one sample below with ... meaning “and others”
You’ll have a set of output files for each sample listed in the manifest.csv (provided the corresponding *.fastq files exists)

 ~/mima_tutorial/output/QC_module
├── CleanReads
│   ├── SRR17380209_clean_1.fq.gz
│   ├── SRR17380209_clean_2.fq.gz
│   └── ...
├── QC_module_0.o3492023
├── qcModule_0.pbs
├── qcModule_1.pbs
├── qcModule_2.pbs
├── QCReport
│   ├── SRR17380209.json
│   ├── SRR17380209.outreport.html
│   └── ...
├── ...
├── SRR17380209_qc_module.log
├── SRR17380209.sh
├── SRR17380209_singletons.fq.gz
└── ...

Step 5. (optional) Generate QC report

You can generate a summary QC Report after all samples have been quality checked
This step can be run directly from the command line
First navigate to the QC_module directory that have the qc_module.log files

cd ~/mima_tutorial/output/QC_module

apptainer run --app mima-qc-report $SANDBOX \
-i ~/mima_tutorial/output/QC_module \
--manifest ~/mima_tutorial/manifest.csv

Troubleshoot

If your sequence read files are not saved in your home directory, check you have set up the APPTAINER_BIND environment variable when binding paths.

The output is a comma separated text file located in ~/mima_tutorial/output/QC_module/QC_report.csv.

Examine the output report using head
- column formats the text file as columns like a table for readabilty

head ~/mima_tutorial/output/QC_module/QC_report.csv | column -t -s ','

Your output should resemble something like below
- the numbers can be different due to different tool versions
The log_status column specify if log files were found, other values mean
- missing QC log: missing *qc_module.log file for the corresponding sample
- missing JSON log: missing fastp *.json file for the corresponding sample

SampleId     log_status      Rawreads_seqs  Derep_seqs  dedup_percentage  Post_QC_seqs  low_quality_reads(%)  Host_seqs  Host(%)                 Clean_reads
SRR17380209  OK              14072002       14045392.0  0.0809091         9568710       31.87295876113675     426.0      0.004452010772611982    9568284.0
SRR17380232  OK              11822934       11713130.0  0.0706387         11102358      5.214421764293575     62.0       0.0005584399278063273   11102296.0
SRR17380236  OK              11756456       11656800.0  0.0688868         10846582      6.950603939331549     46.0       0.00042409673388354047  10846536.0
SRR17380231  OK              12223354       12104874.0  0.069757          11414164      5.7060486544510916    14.0       0.00012265462455244203  11414150.0
SRR17380218  OK              14913690       14874174.0  0.0856518         11505850      22.645452446636703    234.0      0.002033748049905048    11505616.0
SRR17380222  OK              16927002       16905928.0  0.0980011         11692516      30.83777477344042     294.0      0.002514428887674817    11692222.0
SRR17380118  OK              40393998       40186880.0  0.234584          40148912      0.09447859599949038   100242.0   0.2496755080187478      40048670.0

Next: you can now proceed with either Taxonomy Profiling with MIMA or Function Profiling with MIMA.

Feedback

Was this page helpful?

Glad to hear it! Please tell us how we can improve.

Sorry to hear that. Please tell us how we can improve.

Last modified 25.09.2024