Quality control (QC)
This step must be done before Taxonomy and Function profiling.
Introduction: QC module
Quality control (QC) checks the sequenced reads obtained from the sequencing machine is of good quality. Bad quality reads or artefacts due to sequencing error if not removed can lead to spurious results and affect downstream analyses. There are a number of tools available for checking the read quality of high-throughput sequencing data.
This pipeline uses the following tools:
Workflow description
One bash (*.sh) script per sample is generated by this module that performs the following steps. The module also generates $N$ number of PBS scripts (set using the --num-pbs-jobs
parameter).
This tutorial has 9 samples spread across 3 PBS jobs. One PBS job will execute three bash scripts, one per sample.
Step | |
---|---|
1. repair.sh from BBTools/BBMap tool suite, repairs the sequenced reads and separates out any singleton reads (orphaned reads that are missing either the forward or reverse partner read) into its own text file. | ![]() |
2. dereplication using clumpify.sh from BBTools/BBMap tool suite, removes duplicate reads and clusters reads for faster downstream processing | |
3. quality checking is done with fastp.sh and checks the quality of the reads and removes any reads that are of low quality, too long or too short | |
4. decontamination with minimap2.sh maps the sequenced reads against a reference genome (e.g., human) defined by the user. It removes these host-reads from the data and generates a cleaned output file. This becomes the input for taxonomy profiling and function profiling modules. | |
The (optional) last step generates a QC report text file with summary statistics for each file. This step is run after all PBS scripts have been executed for all samples in the study. |
For details about the tool versions, you can check the MIMA container image installed
Step 1. Generate QC PBS scripts
- Tick off the check list
- Find the absolute path to the host reference genome file (e.g., Human host might use GRCh38_latest_genome.fna)
- Replace the highlighted line
--ref <path/to/human/GRCh38_latest_genome.fna>
to the location of your host reference genome file - Enter the following command
- the backslash (
\
) at the end of each line informs the terminal that the command has not finished and there’s more to come (we broke up the command for readability purposes to explain each parameter below) - note if you enter the command as one line, remove the backslashes
- the backslash (
|
|
Parameters explained
Parameter | Required? | Description |
---|---|---|
-i <input> | yes | must be absolute path to the directory where fastq files are stored. FastQ files often have the *.fastq.gz or *.fq.gz extension. This path is used to find the FileID_R1 and FileID_R2 columns specified in the manifest.csv file |
-o <output> | yes | absolute path output directory where results will be saved. The Note if there are already existing subdirectories in the |
-m <manifest.csv> | yes | a comma-separated file (*.csv) that has three columns with the headers: Sample_ID, FileID_R1, FileID_R2 see the example below. Note the fileID_R1 and fileID_R2 are relative to the |
--num-pbs-jobs | no (default=4) | number of PBS scripts to generate by default 4 jobs are created with samples split equally between the jobs |
--ref | yes | path to the host genome file GRCh38_latest_genome.fna |
--mode container | no (default=‘single’) | set this if you are running as a Container |
--pbs-config | yes if--mode container | path to the pbs configuration file, must specify this parameter if --mode container is set |
tip
These commands can get really long, you can use Notepad to
- copy & paste the commands
- change the paths and parameters as desired
- copy & paste to the terminal
Or you can explore using the vim text editor.
Step 2. Check generated scripts
- After step 1, you should see the new directory:
~/mima_tutorial/output
- Inside
output/
, there should be a sub-directoryQC_module
- Inside
output/QC_module
, there should be 3 PBS scripts with the *.pbs file extension
tree ~/mima_tutorial
- only the
output
folder is shown in the snapshot below
~/mima_tutorial
└── output/
└── QC_module
├── CleanReads
├── qcModule_0.pbs
├── qcModule_1.pbs
├── qcModule_2.pbs
├── QCReport
├── SRR17380115.sh
├── SRR17380118.sh
├── SRR17380122.sh
├── SRR17380209.sh
├── SRR17380218.sh
├── SRR17380222.sh
├── SRR17380231.sh
├── SRR17380232.sh
└── SRR17380236.sh
Step 3. Submit QC jobs
- Let’s examine one of the PBS scripts to be submitted
cat ~/mima_tutorial/output/QC_module/qcModule_0.pbs
- Your PBS script should look something like below, with some differences
- line 10:
IMAGE_DIR
should be where you installed MIMA and build the sandbox - line 11:
APPTAINER_BIND
should be setup during installation when binding paths- make sure to include the path where the host reference genome file is located
- line 14:
/home/user
is replaced with the absolute path to your actual home directory
- line 10:
|
|
- Navigate to the
~/mima_tutorial/output/QC_module
- Submit the PBS job using
qsub
cd ~/mima_tutorial/output/QC_module
qsub qcModule_0.pbs
- You can check the job has been submitted and their status with
qstat
- Repeat the
qsub
command for each of theqcModule_1.pbs
andqcModule_2.pbs
files - Wait until all PBS jobs have completed
PBS log files
- Usually a PBS job will generate a log file in the same directory from where the job was submitted
- As such we changed directory to where the PBS scripts are saved to submit the jobs because the log files are required for the Step 5. Generating the QC report.
Step 4. Check QC outputs
- After all PBS job completes, you should have the following outputs
tree ~/mima_tutorial/output/QC_module
- We only show outputs for one sample below with
...
meaning “and others” - You’ll have a set of output files for each sample listed in the
manifest.csv
(provided the corresponding *.fastq files exists)
~/mima_tutorial/output/QC_module
├── CleanReads
│ ├── SRR17380209_clean_1.fq.gz
│ ├── SRR17380209_clean_2.fq.gz
│ └── ...
├── QC_module_0.o3492023
├── qcModule_0.pbs
├── qcModule_1.pbs
├── qcModule_2.pbs
├── QCReport
│ ├── SRR17380209.json
│ ├── SRR17380209.outreport.html
│ └── ...
├── ...
├── SRR17380209_qc_module.log
├── SRR17380209.sh
├── SRR17380209_singletons.fq.gz
└── ...
Step 5. (optional) Generate QC report
- You can generate a summary QC Report after all samples have been quality checked
- This step can be run directly from the command line
- First navigate to the
QC_module
directory that have theqc_module.log
files
cd ~/mima_tutorial/output/QC_module
apptainer run --app mima-qc-report $SANDBOX \
-i ~/mima_tutorial/output/QC_module \
--manifest ~/mima_tutorial/manifest.csv
Troubleshoot
If your sequence read files are not saved in your home directory, check you have set up the APPTAINER_BIND environment variable when binding paths.The output is a comma separated text file located in ~/mima_tutorial/output/QC_module/QC_report.csv
.
- Examine the output report using
head
column
formats the text file as columns like a table for readabilty
head ~/mima_tutorial/output/QC_module/QC_report.csv | column -t -s ','
- Your output should resemble something like below
- the numbers can be different due to different tool versions
- The
log_status
column specify if log files were found, other values mean- missing QC log: missing *qc_module.log file for the corresponding sample
- missing JSON log: missing fastp *.json file for the corresponding sample
SampleId log_status Rawreads_seqs Derep_seqs dedup_percentage Post_QC_seqs low_quality_reads(%) Host_seqs Host(%) Clean_reads
SRR17380209 OK 14072002 14045392.0 0.0809091 9568710 31.87295876113675 426.0 0.004452010772611982 9568284.0
SRR17380232 OK 11822934 11713130.0 0.0706387 11102358 5.214421764293575 62.0 0.0005584399278063273 11102296.0
SRR17380236 OK 11756456 11656800.0 0.0688868 10846582 6.950603939331549 46.0 0.00042409673388354047 10846536.0
SRR17380231 OK 12223354 12104874.0 0.069757 11414164 5.7060486544510916 14.0 0.00012265462455244203 11414150.0
SRR17380218 OK 14913690 14874174.0 0.0856518 11505850 22.645452446636703 234.0 0.002033748049905048 11505616.0
SRR17380222 OK 16927002 16905928.0 0.0980011 11692516 30.83777477344042 294.0 0.002514428887674817 11692222.0
SRR17380118 OK 40393998 40186880.0 0.234584 40148912 0.09447859599949038 100242.0 0.2496755080187478 40048670.0
Next: you can now proceed with either Taxonomy Profiling with MIMA or Function Profiling with MIMA.
Feedback
Was this page helpful?
Glad to hear it! Please tell us how we can improve.
Sorry to hear that. Please tell us how we can improve.