Quality Control of RNAseq files for Madracis decactis (Mdec) ENCORE 2022 TPC project
About: This post details the QC of the Mdec RNAseq files from the ENCORE 2022 thermal performance curve (TPC) project. See here for the project summary for the Mdec DNA and RNA extractions, and my notebook post for the de novo transcriptome that this QC was used for.
1) Write and run script with raw data for checking quality with FastQC on Andromeda (untrimmed and unfiltered)
nano /data/putnamlab/flofields/denovo_transcriptome/scripts/fastqc.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ffields@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/flofields/denovo_transcriptome/data/raw
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file
module load FastQC/0.11.9-Java-11
for file in /data/putnamlab/flofields/denovo_transcriptome/data/raw/MDEC*
do
fastqc $file --outdir /data/putnamlab/flofields/denovo_transcriptome/data/fastqc_results/
done
Run the script.
sbatch /data/putnamlab/flofields/denovo_transcriptome/scripts/fastqc.sh
Submitted batch job 289531 on Nov 28 2023
Finished Nov 28 2023
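Once the job has finished, it can be worth confirming that every input file produced a report before moving on to MultiQC. The sketch below is not part of the original workflow: `expected_fastqc_zip` and `missing_fastqc_reports` are hypothetical helper names, and the naming rule (strip `.gz` and `.fastq`/`.fq`, then append `_fastqc.zip`) is an assumption based on FastQC's usual output names.

```shell
# Hypothetical helper: print the report name FastQC is expected to write for one input.
# Assumes the usual naming: strip .gz and .fastq/.fq, then append _fastqc.zip.
expected_fastqc_zip() {
  local base
  base=$(basename "$1")
  base=${base%.gz}
  base=${base%.fastq}
  base=${base%.fq}
  printf '%s_fastqc.zip\n' "$base"
}

# Hypothetical helper: list MDEC* inputs in directory $1 that have no report in directory $2.
missing_fastqc_reports() {
  local in_dir=$1 out_dir=$2 f
  for f in "$in_dir"/MDEC*; do
    [ -e "$f" ] || continue   # the glob matched nothing
    [ -e "$out_dir/$(expected_fastqc_zip "$f")" ] || printf '%s\n' "$f"
  done
}

# Example call with the directories used in this post (prints nothing if all reports exist):
# missing_fastqc_reports /data/putnamlab/flofields/denovo_transcriptome/data/raw \
#   /data/putnamlab/flofields/denovo_transcriptome/data/fastqc_results
```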
Combined the QC output into 1 file with MultiQC. A script is not needed because it runs quickly.
#load module
module load MultiQC/1.9-intel-2020a-Python-3.8.2
#Combined files
multiqc /data/putnamlab/flofields/denovo_transcriptome/data/fastqc_results/*fastqc.zip -o /data/putnamlab/flofields/denovo_transcriptome/data/fastqc_results/multiqc/
Copied the MultiQC and FastQC reports to my computer. Run this in the computer’s terminal, not on the server.
scp -r ffields@ssh3.hac.uri.edu:/data/putnamlab/flofields/denovo_transcriptome/data/fastqc_results/multiqc/multiqc_report.html /Users/flo_f/Putnam-lab/bioinformatics/MDEC_transcriptome/original_fastqc
scp -r ffields@ssh3.hac.uri.edu:/data/putnamlab/flofields/denovo_transcriptome/data/fastqc_results/*.html /Users/flo_f/Putnam-lab/bioinformatics/MDEC_transcriptome/original_fastqc
The raw sequence MultiQC Report can be found here on GitHub
Understanding a MultiQC Report and Fastp
Secondary Fastp source
These Mdec samples were pooled and had RNA concentrations of 67.20 ng/ul (Qubit) and 95.60 ng/ul (Nanodrop)
| Sample Name | % Dups | % GC | M Seqs |
|---|---|---|---|
| MDEC_R1_001 | 69.6% | 43% | 223.6 |
| MDEC_R2_001 | 65.1% | 43% | 223.6 |
- Adapter content present in sequences. Adapters have not been removed yet via trimming.

- Warnings were attached to the GC content. This could be a result of poly-G tails from Illumina NextSeq.

- Sequence counts show that there is a high number of over-represented sequences. This can occur when there are highly expressed genes.

- Quality scores are good

- Low base N content
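
One rough way to see whether poly-G tails are actually present in the raw reads is to count sequences that end in a run of G's. The sketch below is not part of the original workflow: `count_polyg_tails` is a hypothetical helper, the default threshold of 10 is an arbitrary illustrative value, and it assumes an uncompressed FASTQ with four lines per record.

```shell
# Hypothetical helper: count reads whose sequence ends in at least N consecutive G's.
# $1 = uncompressed FASTQ (4 lines per record), $2 = N (defaults to 10, an arbitrary choice).
count_polyg_tails() {
  local fastq=$1 n=${2:-10}
  # Keep only the sequence line of each record, then count the lines ending in G{n,}.
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the function's status clean.
  awk 'NR % 4 == 2' "$fastq" | grep -Ec "G{$n,}\$" || true
}

# Example (not run here): count_polyg_tails MDEC_R1_001.fastq
```

Dividing the result by the total number of reads in the file would give the fraction of reads affected.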

2) Trimming
Trimming steps below were taken then another QC report was generated to decide if other trimming decisions needed to be made.
A new folder and script were created for the trimmed data files
mkdir /data/putnamlab/flofields/denovo_transcriptome/data/trimmed
nano /data/putnamlab/flofields/denovo_transcriptome/scripts/trim.sh
The trimming settings in fastp
- --detect_adapter_for_pe Enables adapter sequence auto-detection for paired-end data. This trim is a result of the presence of adapter content in the MultiQC report.
- --trim_poly_g Enables trimming of the polyG tails that occur from signal degradation. This trim is a result of the sequence GC-content warning.
- --trim_tail1 15 Trims 15 bases from the tail (3’ end) of the forward reads (R1).
- --trim_tail2 15 Trims 15 bases from the tail (3’ end) of the reverse reads (R2).
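To make the effect of the tail-trimming settings concrete, here is a minimal sketch of what removing N bases from the 3’ end of every read looks like on a FASTQ file. It is an illustration only, not fastp's implementation: `trim_tail` is a hypothetical helper, it assumes an uncompressed file with four lines per record, and it ignores adapters, length filtering and read pairing.

```shell
# Hypothetical helper: drop the last N bases (and their quality scores) from every read.
# $1 = number of bases to remove, $2 = uncompressed FASTQ with 4 lines per record.
trim_tail() {
  local n=$1 fastq=$2
  awk -v n="$n" '
    NR % 4 == 2 || NR % 4 == 0 {            # sequence line or quality line
      len = length($0) - n
      print (len > 0 ? substr($0, 1, len) : "")
      next
    }
    { print }                                # header and "+" lines pass through unchanged
  ' "$fastq"
}

# Example (not run here): trim_tail 15 MDEC_R1_001.fastq | head -n 4
```

The sequence and quality lines have to be shortened by the same amount, otherwise the record is no longer valid FASTQ.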
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ffields@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/flofields/denovo_transcriptome/data/raw
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file
cd /data/putnamlab/flofields/denovo_transcriptome/data/raw
module load fastp/0.19.7-foss-2018b
fastp --in1 MDEC_R1_001.fastq --in2 MDEC_R2_001.fastq --detect_adapter_for_pe --trim_poly_g --trim_tail1 15 --trim_tail2 15 --out1 /data/putnamlab/flofields/denovo_transcriptome/data/trimmed/MDEC_001_trim_R1.fastq --out2 /data/putnamlab/flofields/denovo_transcriptome/data/trimmed/MDEC_001_trim_R2.fastq
sbatch /data/putnamlab/flofields/denovo_transcriptome/scripts/trim.sh
Submitted batch job 290641 on Dec 11 2023
Finished Dec 11 2023
Check the trimmed files by counting the number of reads before and after trimming and by looking at the raw reads
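This check comes down to counting FASTQ records before and after trimming. Because each record spans four lines, the count can also be taken without knowing the instrument ID in the read headers, which avoids relying on a header pattern (quality lines are allowed to start with `@` as well). The sketch below is an alternative under that assumption, not the method used in this post; `count_reads` is a hypothetical helper and it assumes sequences are not wrapped across lines.

```shell
# Hypothetical helper: number of reads in a FASTQ file, plain or gzipped.
# Assumes standard 4-line records (no wrapped sequence lines).
count_reads() {
  local lines
  # gzip -cdf decompresses gzipped input and passes plain text through unchanged.
  lines=$(gzip -cdf "$1" | wc -l)
  echo $(( lines / 4 ))
}

# Example (not run here):
# for f in MDEC*; do printf '%s\t%s\n' "$f" "$(count_reads "$f")"; done > seq_counts
```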
- original files
#get sequence name
less MDEC_001_trim2_R2.fastq
#press q to exit less
zgrep -c "@A01587" MDEC* > seq_counts
- Trimmed files
zgrep -c "@A01587" MDEC* > trimmed_seq_counts
3i) Fastqc and MultiQC on trimmed sequences
Run fastqc on trimmed data
mkdir fastqc_results_trimmed
nano /data/putnamlab/flofields/denovo_transcriptome/scripts/fastqc_trimmed.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ffields@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/flofields/denovo_transcriptome/data/trimmed
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file
module load FastQC/0.11.9-Java-11
for file in /data/putnamlab/flofields/denovo_transcriptome/data/trimmed/MDEC*
do
fastqc $file --outdir /data/putnamlab/flofields/denovo_transcriptome/data/fastqc_results_trimmed/
done
sbatch /data/putnamlab/flofields/denovo_transcriptome/scripts/fastqc_trimmed.sh
Submitted Batch Job 293773 Jan 29 2024
Finished Jan 29 2024
Combined the QC output into 1 file with MultiQC and copied it to my desktop to look at the trimming information
#load module
module load MultiQC/1.9-intel-2020a-Python-3.8.2
multiqc /data/putnamlab/flofields/denovo_transcriptome/data/fastqc_results_trimmed/*fastqc.zip -o /data/putnamlab/flofields/denovo_transcriptome/data/fastqc_results_trimmed/trimmed_multiqc
scp -r ffields@ssh3.hac.uri.edu://data/putnamlab/flofields/denovo_transcriptome/data/fastqc_results_trimmed/trimmed_multiqc/multiqc_report.html /Users/flo_f/OneDrive/Desktop/Putnam-lab/bioinformatics/MDEC_transcriptome/trimmed_fastqc
scp -r ffields@ssh3.hac.uri.edu://data/putnamlab/flofields/denovo_transcriptome/data/fastqc_results_trimmed/*.html /Users/flo_f/OneDrive/Desktop/Putnam-lab/bioinformatics/MDEC_transcriptome/trimmed_fastqc
The trimmed sequence MultiQC report can be found here on GitHub
The MultiQC report showed that the number of unique reads dropped from 68,038,679 to 66,625,214 after trimming. Sequence quality, per sequence quality scores, overrepresented sequences and per base N content were still good, and the adapter content was now good. However, the sequence length distribution changed from normal to slightly abnormal, as did the forward read’s per tile sequence quality.
See the status check heat map for a general overview of the MultiQC report for trim 1
| Sample Name | % Dups | % GC | M Seqs |
|---|---|---|---|
| MDEC_R1_001 | 69.6% | 43% | 219.0 |
| MDEC_R2_001 | 65.1% | 43% | 219.0 |
The adapters were removed, which meant it was not necessary to also remove bases from the ends of the forward and reverse reads, so I trimmed the raw data again, removing only the adapters and polyG tails
A new folder and script were created for the trim 2 data files
mkdir /data/putnamlab/flofields/denovo_transcriptome/data/trim2
nano /data/putnamlab/flofields/denovo_transcriptome/scripts/trim2.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ffields@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/flofields/denovo_transcriptome/data/raw
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file
cd /data/putnamlab/flofields/denovo_transcriptome/data/raw
module load fastp/0.19.7-foss-2018b
fastp --in1 MDEC_R1_001.fastq --in2 MDEC_R2_001.fastq --detect_adapter_for_pe -D --trim_poly_g --out1 /data/putnamlab/flofields/denovo_transcriptome/data/trim2/MDEC_001_trim2_R1.fastq --out2 /data/putnamlab/flofields/denovo_transcriptome/data/trim2/MDEC_001_trim2_R2.fastq
sbatch /data/putnamlab/flofields/denovo_transcriptome/scripts/trim2.sh
Submitted batch job 293922 on Jan 31 2024
Finished on Jan 31 2024
I then downloaded the fastp.html report to look at the trimming information
scp -r ffields@ssh3.hac.uri.edu://data/putnamlab/flofields/denovo_transcriptome/data/raw/fastp.html /Users/flo_f/OneDrive/Desktop/Putnam-Lab/mdec-rnaseq
This file can be found here on Github
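Both trimming scripts change into the raw data directory and do not name their reports, so each run writes to fastp's default `fastp.html` and `fastp.json` there, and the second run replaces the first. A minimal sketch of naming the reports per run is shown below. It only prints the command rather than running it; `fastp_cmd` is a hypothetical helper, the `fastp_<tag>` report names are an assumption, and the `-D` flag from the trim 2 command is left out of the sketch.

```shell
# Hypothetical helper: print (not run) a fastp command whose reports are named after the run.
# $1 = run tag (e.g. trim2), $2 = output directory.
# --html and --json are fastp options; the file naming scheme here is an assumption.
fastp_cmd() {
  local tag=$1 out_dir=$2
  printf '%s %s %s\n' \
    "fastp --in1 MDEC_R1_001.fastq --in2 MDEC_R2_001.fastq --detect_adapter_for_pe --trim_poly_g" \
    "--out1 $out_dir/MDEC_001_${tag}_R1.fastq --out2 $out_dir/MDEC_001_${tag}_R2.fastq" \
    "--html $out_dir/fastp_${tag}.html --json $out_dir/fastp_${tag}.json"
}

# Print the command for the second trimming run.
fastp_cmd trim2 /data/putnamlab/flofields/denovo_transcriptome/data/trim2
```

Keeping the JSON report next to the trimmed reads also gives MultiQC a fastp file to pick up from that directory.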
Here are the results
General statistics
| fastp version: | 0.19.7 (https://github.com/OpenGene/fastp) |
|---|---|
| sequencing: | paired end (150 cycles + 150 cycles) |
| mean length before filtering: | 150bp, 150bp |
| mean length after filtering: | 147bp, 147bp |
| duplication rate: | 23.583228% |
| Insert size peak: | 176 |
| Detected read1 adapter: | AGATCGGAAGAGCACACGTCTGAACTCCAGTCA |
| Detected read2 adapter: | AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT |
Before filtering
| total reads: | 447.181454 M |
|---|---|
| total bases: | 67.077218 G |
| Q20 bases: | 64.582642 G (96.281038%) |
| Q30 bases: | 60.960502 G (90.881082%) |
| GC content: | 43.819885% |
After filtering
| total reads: | 437.584742 M |
|---|---|
| total bases: | 64.576101 G |
| Q20 bases: | 62.624206 G (96.977374%) |
| Q30 bases: | 59.253674 G (91.757900%) |
| GC content: | 43.430054% |
Filtering results
| reads passed filters: | 437.584742 M (97.853956%) |
|---|---|
| reads with low quality: | 9.263188 M (2.071461%) |
| reads with too many N: | 39.784000 K (0.008897%) |
| reads too short: | 293.740000 K (0.065687%) |
These results show that filtering improved the quality of the reads and removed about 2% of reads, mostly for low quality and a small fraction for being too short or having too many N bases. The share of Q30 bases improved from 90.9% to 91.8%. I will be running another round of FastQC and MultiQC to see how this changed our QC results.
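The percentages in the fastp report can be re-derived from the read and base totals in the tables above, which is a quick sanity check when copying numbers by hand. This is a minimal sketch; `pct` is a hypothetical helper and the inputs are the values reported by fastp (reads in M, bases in G).

```shell
# Hypothetical helper: part as a percentage of whole, rounded to two decimals.
pct() {
  awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.2f\n", 100 * part / whole }'
}

# Values copied from the fastp tables (reads in M, bases in G).
echo "reads passing filters: $(pct 437.584742 447.181454)%"   # 97.85
echo "Q30 before filtering:  $(pct 60.960502 67.077218)%"     # 90.88
echo "Q30 after filtering:   $(pct 59.253674 64.576101)%"     # 91.76
```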
3ii) Fastqc and MultiQC on trim2 sequences
Run fastqc on trim 2 data
mkdir fastqc_results_trim2
nano /data/putnamlab/flofields/denovo_transcriptome/scripts/fastqc_trim2.sh
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ffields@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/flofields/denovo_transcriptome/data/trim2
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file
module load FastQC/0.11.9-Java-11
for file in /data/putnamlab/flofields/denovo_transcriptome/data/trim2/MDEC*
do
fastqc $file --outdir /data/putnamlab/flofields/denovo_transcriptome/data/fastqc_results_trim2/
done
sbatch /data/putnamlab/flofields/denovo_transcriptome/scripts/fastqc_trim2.sh
Submitted batch job 294032 on Jan 31 2024
Finished on Jan 31 2024
Combined the QC output into 1 file with MultiQC and fastp and copied it to my desktop to look at the trim 2 information
#load module
module load MultiQC/1.9-intel-2020a-Python-3.8.2
#Combine files
multiqc /data/putnamlab/flofields/denovo_transcriptome/data/fastqc_results_trim2/*fastqc.zip -o /data/putnamlab/flofields/denovo_transcriptome/data/fastqc_results_trim2/trim2_multiqc
#Copy the multiqc html file to my computer
scp -r ffields@ssh3.hac.uri.edu:/data/putnamlab/flofields/denovo_transcriptome/data/fastqc_results_trim2/trim2_multiqc/multiqc_report.html /Users/flo_f/OneDrive/Desktop/Putnam-Lab/bioinformatics/MDEC_transcriptome/trimmed_fastqc
The trim 2 sequence MultiQC report can be found here on GitHub
The MultiQC report results
| Sample Name | % Duplication | GC content | % PF | % Adapter | % Dups | % GC | M Seqs |
|---|---|---|---|---|---|---|---|
| Fastp | 23.58 | 43.4 | 97.9 | | | | |
| MDEC_R1_001 | | | | | 69.6% | 43% | 218.0 |
| MDEC_R2_001 | | | | | 66.1% | 43% | 218.0 |
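
The FastQC % Dups column can be turned into an estimated percentage of unique reads by subtracting it from 100, which is a handy way to read the table. A minimal sketch, with `pct_unique` as a hypothetical helper:

```shell
# Hypothetical helper: estimated percent of unique reads, given FastQC's % Dups value.
pct_unique() {
  awk -v dups="$1" 'BEGIN { printf "%.1f\n", 100 - dups }'
}

pct_unique 69.6   # R1: prints 30.4
pct_unique 66.1   # R2: prints 33.9
```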
- Fastp filtering: most of the reads that were filtered out were removed due to low quality

- Sequence counts show that 30.4% of reads in R1 are unique and 33.9% in R2 are unique; however, duplication levels/over-represented sequences are high. This can occur when there are highly expressed genes. It is possible to have good libraries with small peaks at high duplication levels.

- Sequence quality is good

- Per Sequence GC Content came with warnings; this could mean that there are a lot of PCR duplicates

- Low base N content

- The status check below shows the overall status for each FastQC section, where green is normal, orange is slightly abnormal and red is very unusual.

Next I will be using the trim2 data to run Trinity. This QC was for the purpose of assembling a de novo transcriptome. The entire process can be found on GitHub here.