Need to know

Preparation for data-processing tutorials

Project working directory

After downloading the tutorial data, we assume that mima_tutorial is the working directory and that it is located in your home directory (denoted by the tilde, ~). Before executing a command, we will always make sure that we are in the right directory first. For example, run the following commands:

$ cd ~/mima_tutorial
$ tree .
  • the starting directory structure for mima_tutorial should look something like:
mima_tutorial
├── ftp_download_files.sh
├── manifest.csv
├── pbs_header_func.cfg
├── pbs_header_qc.cfg
├── pbs_header_taxa.cfg
├── raw_data/
    ├── SRR17380115_1.fastq.gz
    ├── SRR17380115_2.fastq.gz
    ├── ...
...

From here on, ~/mima_tutorial will refer to the project directory as depicted above. Replace this path if you saved the tutorial data in another location.
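
Before moving on, it can help to confirm that the project directory contains the entries listed above. The following is a minimal sketch, assuming the file names shown in the directory tree; the function name check_project_dir is hypothetical and is not part of the tutorial data.

```shell
# Hypothetical helper: report any expected entries that are missing
# from a project directory. Returns 0 if everything is present.
check_project_dir() {
    local project_dir="$1"
    local missing=0
    local entry
    for entry in ftp_download_files.sh manifest.csv pbs_header_func.cfg \
                 pbs_header_qc.cfg pbs_header_taxa.cfg raw_data; do
        if [ ! -e "$project_dir/$entry" ]; then
            echo "missing: $project_dir/$entry" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call (adjust the path if you saved the data elsewhere):
#   check_project_dir ~/mima_tutorial && echo "project directory looks complete"
```

The check only tests that each entry exists; it does not verify the contents of the files.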

Containers and binding paths

When deploying container images, check whether you need to bind any paths so that the host directories holding your data are accessible inside the container.
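
The PBS configuration example later on this page sets the APPTAINER_BIND variable to a comma-separated list of source:destination pairs. As a rough sketch of how such a value could be assembled, the helper below joins the pairs for you. The function name make_bind_spec and all paths are placeholders chosen for illustration, not values from the tutorial.

```shell
# Hypothetical helper: join "source:destination" pairs into the
# comma-separated form used for APPTAINER_BIND.
make_bind_spec() {
    local IFS=,
    echo "$*"
}

# Placeholder paths for illustration only; substitute your own.
APPTAINER_BIND="$(make_bind_spec \
    "/path/to/reference_data:/refdata" \
    "/path/to/scratch:/scratch")"
export APPTAINER_BIND
echo "$APPTAINER_BIND"
```

Each source is a path on the host and each destination is where that path should appear inside the container.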

PBS configuration files

The three modules (QC, Taxonomy profiling and Function profiling) in the data-processing pipeline require access to a job queue and instructions about the resources required for the job. For example, the number of CPUs, the RAM size, the time required for execution etc. These parameters are defined in PBS configuration text files.

Three such configuration files are provided with the tutorial data, one for each module, because the modules require different PBS settings. These settings are given on the lines starting with the #PBS tag.

#!/bin/bash
#PBS -N mima-qc
#PBS -l ncpus=8
#PBS -l walltime=2:00:00
#PBS -l mem=64GB
#PBS -j oe

set -x

IMAGE_DIR=~/mima-pipeline
export APPTAINER_BIND="</path/to/source1>:</path/to/destination1>,</path/to/source2>:</path/to/destination2>"
PBS settings        Description
#PBS -N             name of the job
#PBS -l ncpus       number of CPUs required
#PBS -l walltime    how long the job will take, here it’s 2 hours. Note: check the log files to see whether your jobs completed correctly or failed because they ran out of time
#PBS -l mem=64GB    how much RAM the job needs, here it’s 64GB
#PBS -j oe          standard output logs and error logs are concatenated into one file
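
To review the settings in a configuration file before submitting a job, you could print only the lines that start with the #PBS tag. This is a minimal sketch; the function name show_pbs_settings is hypothetical, and the file names in the example call are the ones listed in the project directory above.

```shell
# Hypothetical helper: print the #PBS directive lines of each
# configuration file given as an argument.
show_pbs_settings() {
    local cfg
    for cfg in "$@"; do
        echo "== $cfg"
        grep '^#PBS' "$cfg"
    done
}

# Example call:
#   cd ~/mima_tutorial
#   show_pbs_settings pbs_header_qc.cfg pbs_header_taxa.cfg pbs_header_func.cfg
```

If a job fails because it ran out of time or memory, these are the lines to adjust in the corresponding file.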

Use absolute paths

When running the pipeline it is best to use full paths when specifying the locations of input files, output files and reference data to avoid any ambiguity.

Absolute/Full paths

always start with the root directory, indicated by the forward slash (/) on Unix based systems.

  • e.g., the command below changes directory (cd) to a folder named scripts located in the user jsmith’s home directory. Provided this folder exists, the command will work from anywhere on the filesystem.
[~/metadata] $ cd /home/jsmith/scripts
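
If you only have a relative path at hand, one way to obtain the corresponding absolute path is to change into the directory in a subshell and print the working directory. The sketch below assumes the directory already exists; the function name abs_dir is hypothetical.

```shell
# Hypothetical helper: print the absolute path of an existing directory.
# The parentheses run cd in a subshell, so the caller's working
# directory is left unchanged.
abs_dir() {
    (cd "$1" >/dev/null && pwd)
}

# Example call:
#   abs_dir raw_data
```

The printed path starts with a forward slash and can be passed to the pipeline wherever an input, output or reference location is required.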

Relative paths

are relative to the current working directory

  • Now imagine the following file system structure in the user john_doe’s home directory
  • The asterisks mark his current location, which is inside the /home/user/john_doe/studyABC/metadata sub-folder
/home/user/john_doe
├── apps
├── reference_data
├── studyABC
│   ├── metadata **
│   ├── raw_reads
│   └── analysis
├── study_123
├── scripts
└── templates
  • In this example we are currently in the metadata directory and try to change directory to a folder named data located in the parent directory (..)
  • This command only works provided there is a data folder in the parent directory above metadata
  • According to the filesystem above, the parent directory of metadata is studyABC and there is no data subfolder in this directory, so this command will fail with an error message
[/home/user/john_doe/studyABC/metadata] $ cd ../data
-bash: cd: ../data: No such file or directory
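
In a script, a failed cd like the one above can leave later commands running in the wrong directory. One possible guard, sketched here with a hypothetical function name safe_cd, is to test that the target directory exists before changing into it.

```shell
# Hypothetical helper: change directory only if the target exists,
# otherwise report where the lookup started and return an error.
safe_cd() {
    if [ -d "$1" ]; then
        cd "$1"
    else
        echo "directory not found: $1 (current directory: $(pwd))" >&2
        return 1
    fi
}

# Example call from inside studyABC/metadata:
#   safe_cd ../raw_reads    # succeeds, raw_reads is a sibling of metadata
#   safe_cd ../data         # fails, there is no data folder in studyABC
```

Printing the current directory in the error message makes it easier to see which location the relative path was resolved against.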

Now that you have downloaded the data and know the basics, you can begin data processing with quality control.



Last modified 25.09.2024