QC of RNAseq Files for Mcav ENCORE 2022 TPC Project

Quality Control of RNAseq files for Montastraea cavernosa (Mcav) ENCORE 2022 TPC project

About: This post details the QC of the Mcav RNAseq files from the ENCORE 2022 thermal performance curve (TPC) project. See here for the project summary for the Mcav DNA and RNA extractions and my notebook post for the de novo transcriptome which this QC was used for.

1) Write and run script with raw data for checking quality with FastQC on Andromeda (untrimmed and unfiltered)

mkdir scripts
mkdir fastqc_results

nano /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/scripts/fastqc.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ffields@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/raw
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file

module load FastQC/0.11.9-Java-11

for file in /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/raw/MCAV*
do
fastqc $file --outdir /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/fastqc_results/
done

sbatch /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/scripts/fastqc.sh

Submitted batch job 354836 on Jan 13 2025

Finished Jan 13 2025
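The FastQC loop in the script above can also be written a little more defensively. This is only a sketch, not what was run for this post: the function name run_qc and the QC_CMD variable are hypothetical, introduced here so the loop can be dry-run with echo before submitting the real job. It quotes the file paths, creates the output folder first (FastQC expects the --outdir folder to exist already), and skips the loop cleanly if the pattern matches no files.

```shell
#!/bin/bash
# Sketch only: run_qc and QC_CMD are hypothetical names, not part of FastQC.

# Command to run on each file; set QC_CMD=echo for a dry run.
QC_CMD="${QC_CMD:-fastqc}"

run_qc() {
    local in_dir="$1" out_dir="$2" prefix="$3"
    local file
    mkdir -p "$out_dir"                     # make sure the output folder exists
    for file in "$in_dir"/"$prefix"*; do
        [ -e "$file" ] || continue          # the pattern matched nothing
        "$QC_CMD" "$file" --outdir "$out_dir"
    done
}

# Example call (paths are placeholders):
# run_qc /path/to/data/raw /path/to/data/fastqc_results MCAV
```

With QC_CMD=echo the function prints one line per file it would process, which is a quick way to confirm the file pattern before using cluster time.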

Combined the QC output into one file with MultiQC. A script is not needed because MultiQC runs quickly.

#load module

module load MultiQC/1.9-intel-2020a-Python-3.8.2

#Combined files

multiqc /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/fastqc_results/*fastqc.zip -o /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/fastqc_results/multiqc
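Before combining reports, it can help to confirm that every input file actually produced a FastQC archive. The sketch below is an optional check, not something run for this post. The function name missing_reports is hypothetical, and it assumes FastQC's usual output naming, where MCAV_R1_001.fastq yields MCAV_R1_001_fastqc.zip. Only .fastq and .fastq.gz inputs are handled.

```shell
#!/bin/bash
# Sketch only: missing_reports is a hypothetical helper.
# Prints the base name of every FASTQ file that has no matching *_fastqc.zip.
missing_reports() {
    local raw_dir="$1" qc_dir="$2" file base
    for file in "$raw_dir"/*.fastq "$raw_dir"/*.fastq.gz; do
        [ -e "$file" ] || continue          # skip patterns that matched nothing
        base=$(basename "$file")
        base=${base%.gz}                    # drop .gz if present
        base=${base%.fastq}                 # drop .fastq
        [ -e "$qc_dir/${base}_fastqc.zip" ] || echo "$base"
    done
}

# Example call (paths are placeholders):
# missing_reports /path/to/data/raw /path/to/data/fastqc_results
```

No output means every input has a report and MultiQC will see the full set.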

Copied the MultiQC and FastQC reports to my computer. Run this in your computer's terminal, not on the server.

scp -r ffields@ssh3.hac.uri.edu:/data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/fastqc_results/multiqc /Users/flo_f/"OneDrive - University of RHode Island"/Github/ENCORE_Transcriptomes/MCAV_Reference_Transcriptome/data/fastqc_results
scp -r ffields@ssh3.hac.uri.edu:/data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/fastqc_results/*.html /Users/flo_f/"OneDrive - University of RHode Island"/Github/ENCORE_Transcriptomes/MCAV_Reference_Transcriptome/data/fastqc_results

The raw sequence MultiQC Report can be found here on GitHub

Understanding a MultiQC Report and Fastp

Secondary Fastp source

These Mcav samples were pooled and had RNA concentrations of 96.40 ng/µL (Qubit) and 148.20 ng/µL (NanoDrop)

Sample Name	% Dups	% GC	M Seqs
MCAV_R1_001	71.1%	44%	206.8
MCAV_R2_001	66.1%	44%	206.8

Adapter content present in sequences. Adapters have not been removed yet via trimming.

GC content is good.

Sequence counts show that there is a high level of duplication, which can indicate overrepresented sequences (FastQC gives a warning for overrepresented sequences). This can occur when genes are highly expressed.

Quality scores are good

Low base N content

2) Trimming

The trimming steps below were run, then another QC report was generated to decide whether further trimming was needed.

A new folder and script were created for the trimmed data files

mkdir /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/trimmed

nano /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/scripts/trim.sh

The trimming settings in fastp

--detect_adapter_for_pe Enables adapter sequence auto-detection for paired-end data. This setting was chosen because of the adapter content in the MultiQC report
--trim_poly_g Enables trimming of the polyG tails that result from signal degradation. This setting was chosen because of the sequence GC-content warning

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ffields@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/raw
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file

cd /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/raw

module load fastp/0.19.7-foss-2018b

fastp --in1 MCAV_R1_001.fastq --in2 MCAV_R2_001.fastq --detect_adapter_for_pe --trim_poly_g --out1 /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/trimmed/MCAV_001_trim_R1.fastq --out2 /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/trimmed/MCAV_001_trim_R2.fastq

sbatch /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/scripts/trim.sh

Submitted batch job 354870 on Jan 14 2025

Finished Jan 14 2025

I then downloaded the fastp.html report into the designated folder to look at the trimming information. I checked the quality of the trimmed files by confirming the number of reads that were trimmed and comparing against the raw reads.

scp -r ffields@ssh3.hac.uri.edu://data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/raw/fastp.html /Users/flo_f/"OneDrive - University of RHode Island"/Github/ENCORE_Transcriptomes/Mcav_Reference_Transcriptome/data/fastp_stats

This file can be found here on Github

Here are the results

General statistics

fastp version:	0.19.7 (https://github.com/OpenGene/fastp)
sequencing:	paired end (150 cycles + 150 cycles)
mean length before filtering:	150bp, 150bp
mean length after filtering:	147bp, 147bp
duplication rate:	24.067353%
Insert size peak:	197
Detected read1 adapter:	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Detected read2 adapter:	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Before filtering

total reads:	413.551540 M
total bases:	62.032731 G
Q20 bases:	59.635489 G (96.135520%)
Q30 bases:	56.269454 G (90.709296%)
GC content:	44.767978%

After filtering

total reads:	402.949174 M
total bases:	59.516609 G
Q20 bases:	57.706508 G (96.958663%)
Q30 bases:	54.597586 G (91.735041%)
GC content:	44.324584%

Filtering results

reads passed filters:	402.949174 M (97.436265%)
reads with low quality:	10.205084 M (2.467669%)
reads with too many N:	37.686000 K (0.009113%)
reads too short:	359.596000 K (0.086953%)

These results show that filtering slightly improved read quality and removed about 2.6% of reads, mostly due to low quality, with a small fraction removed for being too short or containing too many Ns. The share of Q30 bases improved from 90.7% to 91.7%. I will run another round of FastQC and MultiQC to see how this changed the QC results.
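The percentages in the summary above can be recomputed from the fastp numbers as a sanity check. This is a small sketch using awk for the decimal arithmetic. The values are copied from the report, and the variable names are my own.

```shell
#!/bin/bash
# Values copied from the fastp report (read counts in millions).
reads_before=413.551540
reads_after=402.949174

# Percentage of reads removed by filtering.
pct_removed=$(awk -v b="$reads_before" -v a="$reads_after" \
    'BEGIN { printf "%.2f", (b - a) / b * 100 }')

# The three filter categories (low quality, too many N, too short), in millions.
removed_sum=$(awk 'BEGIN { printf "%.6f", 10.205084 + 0.037686 + 0.359596 }')

# Change in the share of Q30 bases, in percentage points.
q30_gain=$(awk 'BEGIN { printf "%.2f", 91.735041 - 90.709296 }')

echo "reads removed: ${pct_removed}%"
echo "reads removed (sum of categories): ${removed_sum} M"
echo "Q30 gain: ${q30_gain} percentage points"
```

The sum of the three filter categories equals the difference between total reads before and after filtering, so the report is internally consistent.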

Original files

#get sequence name
less MCAV_001_trim_R1.fastq

zgrep -c "@A01587" MCAV* > seq_counts

Trimmed files
Run the command in the background so you don't have to wait before running FastQC

The & allows the line of code to run in the background

nohup allows the process to keep running after you close the terminal or log out

nohup zgrep -c "@A01587" MCAV* > trimmed_seq_counts &

job 45213
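Counting header lines with zgrep depends on knowing the instrument ID at the start of each read name. An alternative sketch, assuming well-formed FASTQ files with exactly four lines per read, is to count lines and divide by four. The function name count_reads is hypothetical, and this is not what was run for this post.

```shell
#!/bin/bash
# Sketch only: count_reads is a hypothetical helper.
# Assumes standard 4-line FASTQ records; works on plain or gzipped files.
count_reads() {
    local file="$1" lines
    case "$file" in
        *.gz) lines=$(gzip -cd "$file" | wc -l) ;;   # decompress to stdout
        *)    lines=$(wc -l < "$file") ;;
    esac
    echo $(( lines / 4 ))                            # four lines per read
}

# Example call (file name is a placeholder):
# count_reads MCAV_001_trim_R1.fastq
```

For paired-end data the R1 and R2 counts should match after trimming with fastp, so a mismatch between the two files would be worth investigating.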

3) FastQC and MultiQC on trimmed sequences

Run fastqc on trimmed data

mkdir /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/fastqc_results_trimmed

nano /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/scripts/fastqc_trimmed.sh

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --export=NONE
#SBATCH --mem=100GB
#SBATCH --mail-type=BEGIN,END,FAIL #email you when job starts, stops and/or fails
#SBATCH --mail-user=ffields@uri.edu #your email to send notifications
#SBATCH --account=putnamlab
#SBATCH -D /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/trimmed
#SBATCH --error="script_error" #if your job fails, the error report will be put in this file
#SBATCH --output="output_script" #once your job is completed, any final job report comments will be put in this file

module load FastQC/0.11.9-Java-11

for file in /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/trimmed/MCAV*
do
fastqc $file --outdir /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/fastqc_results_trimmed/
done

sbatch /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/scripts/fastqc_trimmed.sh

Submitted Batch Job 354903 Jan 15 2025

Finished Jan 15 2025

Combined the QC output into one file with MultiQC and copied it to my desktop to look at the trimming information

#load module 
module load MultiQC/1.9-intel-2020a-Python-3.8.2

multiqc /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/fastqc_results_trimmed/*fastqc.zip -o /data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/fastqc_results_trimmed/multiqc_trimmed

scp -r ffields@ssh3.hac.uri.edu://data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/fastqc_results_trimmed/multiqc_trimmed /Users/flo_f/"OneDrive - University of RHode Island"/Github/ENCORE_Transcriptomes/Mcav_Reference_Transcriptome/data/fastqc_results_trimmed
scp -r ffields@ssh3.hac.uri.edu://data/putnamlab/flofields/ENCORE_Mcav_denovo_transcriptome/data/fastqc_results_trimmed/*.html /Users/flo_f/"OneDrive - University of RHode Island"/Github/ENCORE_Transcriptomes/Mcav_Reference_Transcriptome/data/fastqc_results_trimmed

The trimmed sequence MultiQC report can be found here on GitHub

The MultiQC report results

Sample Name	%Duplication	GC content	%PF	% Dups	% GC	M Seqs
Fastp	24.067%	44.32	97.4
MCAV_001_trim_R1				71.1%	44%	201.5
MCAV_001_trim_R2				67.4%	44%	201.5

Fastp filtering: most of the filtered reads were removed due to low quality
Sequence counts show that 28.9% of reads in R1 and 32.6% in R2 are unique; duplication levels and overrepresented sequences are still high. This can occur when genes are highly expressed. It is possible to have good libraries with small peaks at high duplication levels.
Sequence quality is good
Per-sequence GC content is good
Low base N content
The status check below shows the overall status for each FastQC section, where green is normal, orange is slightly abnormal, and red is very unusual.
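The unique-read percentages quoted in the summary above are 100 minus the % Dups values in the MultiQC table. A minimal sketch of that arithmetic follows. The function name unique_pct is my own.

```shell
#!/bin/bash
# unique_pct is a hypothetical helper: unique % = 100 - duplication %.
unique_pct() {
    awk -v d="$1" 'BEGIN { printf "%.1f\n", 100 - d }'
}

unique_pct 71.1   # MCAV_001_trim_R1
unique_pct 67.4   # MCAV_001_trim_R2
```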

Written on January 13, 2025