Quality control (QC)

QC module in MIMA, check reads are of high standard

This step must be done before Taxonomy and Function profiling.

Introduction: QC module

Quality control (QC) checks the sequenced reads obtained from the sequencing machine is of good quality. Bad quality reads or artefacts due to sequencing error if not removed can lead to spurious results and affect downstream analyses. There are a number of tools available for checking the read quality of high-throughput sequencing data.

This pipeline uses the following tools:

Workflow description

One bash (*.sh) script per sample is generated by this module that performs the following steps. The module also generates $N$ number of PBS scripts (set using the --num-pbs-jobs parameter).

This tutorial has 9 samples spread across 3 PBS jobs. One PBS job will execute three bash scripts, one per sample.

Step
1. repair.sh from BBTools/BBMap tool suite, repairs the sequenced reads and separates out any singleton reads (orphaned reads that are missing either the forward or reverse partner read) into its own text file.
2. dereplication using clumpify.sh from BBTools/BBMap tool suite, removes duplicate reads and clusters reads for faster downstream processing
3. quality checking is done with fastp.sh and checks the quality of the reads and removes any reads that are of low quality, too long or too short
4. decontamination with minimap2.sh maps the sequenced reads against a reference genome (e.g., human) defined by the user. It removes these host-reads from the data and generates a cleaned output file. This becomes the input for taxonomy profiling and function profiling modules.
The (optional) last step generates a QC report text file with summary statistics for each file. This step is run after all PBS scripts have been executed for all samples in the study.

For details about the tool versions, you can check the MIMA container image installed

Step 1. Generate QC PBS scripts

  • Tick off the check list
  • Find the absolute path to the host reference genome file (e.g., Human host might use GRCh38_latest_genome.fna)
  • Replace the highlighted line --ref <path/to/human/GRCh38_latest_genome.fna> to the location of your host reference genome file
  • Enter the following command
    • the backslash (\) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below)
    • note if you enter the command as one line, remove the backslashes
1
2
3
4
5
6
7
8
apptainer run --app mima-qc $SANDBOX \
-i ~/mima_tutorial/raw_data \
-o ~/mima_tutorial/output \
-m ~/mima_tutorial/manifest.csv \
--num-pbs-jobs 3 \
--ref <path/to/human/GRCh38_latest_genome.fna> \
--mode container \
--pbs-config ~/mima_tutorial/pbs_header_qc.cfg

Parameters explained

Parameter
Required?
Description
-i <input>yesmust be absolute path to the directory where fastq files are stored. FastQ files often have the *.fastq.gz or *.fq.gz extension. This path is used to find the FileID_R1 and FileID_R2 columns specified in the manifest.csv file
-o <output>yes

absolute path output directory where results will be saved. The <output> path will be created if it does not exists.

Note if there are already existing subdirectories in the <output> path, then this step will fail.

-m <manifest.csv>yes

a comma-separated file (*.csv) that has three columns with the headers: Sample_ID, FileID_R1, FileID_R2 see the example below.

Note the fileID_R1 and fileID_R2 are relative to the -i <input> path provided.

--num-pbs-jobsno
(default=4)
number of PBS scripts to generate by default 4 jobs are created with samples split equally between the jobs
--refyespath to the host genome file GRCh38_latest_genome.fna
--mode containerno
(default=‘single’)
set this if you are running as a Container
--pbs-configyes if
--mode container
path to the pbs configuration file, must specify this parameter if --mode container is set

Step 2. Check generated scripts

  • After step 1, you should see the new directory: ~/mima_tutorial/output
  • Inside output/, there should be a sub-directory QC_module
  • Inside output/QC_module, there should be 3 PBS scripts with the *.pbs file extension
tree ~/mima_tutorial
  • only the output folder is shown in the snapshot below
~/mima_tutorial
└── output/
    └──  QC_module
      ├── CleanReads
      ├── qcModule_0.pbs
      ├── qcModule_1.pbs
      ├── qcModule_2.pbs
      ├── QCReport
      ├── SRR17380115.sh
      ├── SRR17380118.sh
      ├── SRR17380122.sh
      ├── SRR17380209.sh
      ├── SRR17380218.sh
      ├── SRR17380222.sh
      ├── SRR17380231.sh
      ├── SRR17380232.sh
      └── SRR17380236.sh

Step 3. Submit QC jobs

  • Let’s examine one of the PBS scripts to be submitted
cat ~/mima_tutorial/output/QC_module/qcModule_0.pbs
  • Your PBS script should look something like below, with some differences
    • line 10: IMAGE_DIR should be where you installed MIMA and build the sandbox
    • line 11: APPTAINER_BIND should be setup during installation when binding paths
      • make sure to include the path where the host reference genome file is located
    • line 14: /home/user is replaced with the absolute path to your actual home directory
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
#!/bin/bash
#PBS -N QC_module_0
#PBS -l ncpus=8
#PBS -l walltime=2:00:00
#PBS -l mem=64GB
#PBS -j oe

set -x

IMAGE_DIR=~/mima-pipeline
export APPTAINER_BIND="/path/to/human/GRCh38:/path/to/human/GRCh38"


cd /home/user/mima_tutorial/output/QC_module/

apptainer exec ${IMAGE_DIR} bash /home/user/mima_tutorial/output/QC_module/SRR17380209.sh > SRR17380209_qc_module.log 2>&1
apptainer exec ${IMAGE_DIR} bash /home/user/mima_tutorial/output/QC_module/SRR17380232.sh > SRR17380232_qc_module.log 2>&1
apptainer exec ${IMAGE_DIR} bash /home/user/mima_tutorial/output/QC_module/SRR17380236.sh > SRR17380236_qc_module.log 2>&1
  • Navigate to the ~/mima_tutorial/output/QC_module
  • Submit the PBS job using qsub
cd ~/mima_tutorial/output/QC_module
qsub qcModule_0.pbs
  • You can check the job has been submitted and their status with qstat
  • Repeat the qsub command for each of the qcModule_1.pbs and qcModule_2.pbs files
  • Wait until all PBS jobs have completed

Step 4. Check QC outputs

  • After all PBS job completes, you should have the following outputs
tree ~/mima_tutorial/output/QC_module
  • We only show outputs for one sample below with ... meaning “and others”
  • You’ll have a set of output files for each sample listed in the manifest.csv (provided the corresponding *.fastq files exists)
 ~/mima_tutorial/output/QC_module
├── CleanReads
│   ├── SRR17380209_clean_1.fq.gz
│   ├── SRR17380209_clean_2.fq.gz
│   └── ...
├── QC_module_0.o3492023
├── qcModule_0.pbs
├── qcModule_1.pbs
├── qcModule_2.pbs
├── QCReport
│   ├── SRR17380209.json
│   ├── SRR17380209.outreport.html
│   └── ...
├── ...
├── SRR17380209_qc_module.log
├── SRR17380209.sh
├── SRR17380209_singletons.fq.gz
└── ...

Step 5. (optional) Generate QC report

  • You can generate a summary QC Report after all samples have been quality checked
  • This step can be run directly from the command line
  • First navigate to the QC_module directory that have the qc_module.log files
cd ~/mima_tutorial/output/QC_module
apptainer run --app mima-qc-report $SANDBOX \
-i ~/mima_tutorial/output/QC_module \
--manifest ~/mima_tutorial/manifest.csv

The output is a comma separated text file located in ~/mima_tutorial/output/QC_module/QC_report.csv.

  • Examine the output report using head
    • column formats the text file as columns like a table for readabilty
head ~/mima_tutorial/output/QC_module/QC_report.csv | column -t -s ','
  • Your output should resemble something like below
    • the numbers can be different due to different tool versions
  • The log_status column specify if log files were found, other values mean
    • missing QC log: missing *qc_module.log file for the corresponding sample
    • missing JSON log: missing fastp *.json file for the corresponding sample
SampleId     log_status      Rawreads_seqs  Derep_seqs  dedup_percentage  Post_QC_seqs  low_quality_reads(%)  Host_seqs  Host(%)                 Clean_reads
SRR17380209  OK              14072002       14045392.0  0.0809091         9568710       31.87295876113675     426.0      0.004452010772611982    9568284.0
SRR17380232  OK              11822934       11713130.0  0.0706387         11102358      5.214421764293575     62.0       0.0005584399278063273   11102296.0
SRR17380236  OK              11756456       11656800.0  0.0688868         10846582      6.950603939331549     46.0       0.00042409673388354047  10846536.0
SRR17380231  OK              12223354       12104874.0  0.069757          11414164      5.7060486544510916    14.0       0.00012265462455244203  11414150.0
SRR17380218  OK              14913690       14874174.0  0.0856518         11505850      22.645452446636703    234.0      0.002033748049905048    11505616.0
SRR17380222  OK              16927002       16905928.0  0.0980011         11692516      30.83777477344042     294.0      0.002514428887674817    11692222.0
SRR17380118  OK              40393998       40186880.0  0.234584          40148912      0.09447859599949038   100242.0   0.2496755080187478      40048670.0

Next: you can now proceed with either Taxonomy Profiling with MIMA or Function Profiling with MIMA.



Last modified 25.09.2024