Functional annotation


LAYOUT: post TITLE: Workflow for Diploria labyrinthiformis functional annotation CATEGORY: [ de novo transcriptome, RNA seq ] TAG: [ Diploria labyrinthiformis, Bioinfamatics, de novo transcriptome ] —

Designing a workflow for a functional annotation for Diploria labyrinthiformis de novo transcriptome.

About: This post documents the pipline used to identify orthologous genes across species for a comparative analyses and the functional annotation of the created de novo transcriptome for Diploria labyrinthiformis using samples obtained in August 2022 as part of the ENCORE project in the TPC-PI experiment.

Use Uniprot and Pfam databases after TransDecoder to convert the nucleotides into amino acid sequences then ensure the gene IDs are species specfic for easy identification when clustering orthologs from 3 species.

Step 1. Download UniProt and Pfam Databases

mkdir /scratch/workspace/ffields_uri_edu-transcriptomes/dlab/data/databases/uniprot
cd /databases/uniprot

# Download Swiss-Prot (curated subset)
wget -O /scratch3/workspace/ffields_uri_edu-transcriptomes/dlab/data/databases/uniprot/uniprot_sprot.fasta.gz \
	https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

  gunzip /scratch3/workspace/ffields_uri_edu-transcriptomes/dlab/data/databases/uniprot/uniprot_sprot.fasta.gz

  srun --nodes=1 --ntasks=4 --mem=2G --time=00:30:00 --pty bash
  module load uri/main BLAST+/2.15.0-gompi-2023a

  makeblastdb -in uniprot_sprot.fasta -dbtype prot

  cd ..
  mkdir pfam
  cd pfam/

 wget -O /scratch3/workspace/ffields_uri_edu-transcriptomes/dlab/data/databases/pfam/Pfam-A.hmm.gz \
  http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz

  # Unzip and index
  gunzip /scratch3/workspace/ffields_uri_edu-transcriptomes/dlab/data/databases/pfam/Pfam-A.hmm.gz
  hmmpress /scratch3/workspace/ffields_uri_edu-transcriptomes/dlab/data/databases/pfam/Pfam-A.hmm

Step 2: Obtain ORFs

Identifies all open reading frames i.e ORFs (coding regions) that meet a minimum length threshold (default is 100 amino acids) and translates them into peptide sequences

Dlab

module load uri/main TransDecoder/5.5.0-GCC-11.3.0

#file in dir /scratch/workspace/ffields_uri_edu-transcriptomes/dlab/data/cd-hit
# -redundancy-minimized set includes 3586 / 5000 = 71.72%
TransDecoder.LongOrfs -t dlab_cdhit.fasta

Mdec

#file in dir /scratch/workspace/ffields_uri_edu-transcriptomes/mdec/data/cd-hit
# -redundancy-minimized set includes 4054 / 5000 = 81.08%
TransDecoder.LongOrfs -t mdec_cdhit.fasta

Mcav

#file in dir /scratch/workspace/ffields_uri_edu-transcriptomes/mcav/data/cd-hit
TransDecoder.LongOrfs -t mcav_cdhit.fasta

Note: after this step a meaasge appears:

Use file: /scratch3/workspace/ffields_uri_edu-transcriptomes/mdec/data/cd-hit/mdec_cdhit.fasta.transdecoder_dir/longest_orfs.pep  for Pfam and/or BlastP searches to enable homology-based coding region identification.

    Then, run TransDecoder.Predict for your final coding region predictions.

Step 3: Run BLASTP and Pfam Searches (via SBATCH script)

Dlab

nano /scratch/workspace/ffields_uri_edu-transcriptomes/dlab/scripts/Blastp_Pfam_search

#!/bin/bash
#SBATCH --job-name=blastp_dlab
#SBATCH --nodes=1 --cpus-per-task=15
#SBATCH --mem=200G  # Requested Memory
#SBATCH --time=24:00:00
#SBATCH -o slurm-dlab.out  # %j = job ID
#SBATCH -e slurm-dlab.err  # %j = job ID
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=ffields@uri.edu
#SBATCH -D /scratch/workspace/ffields_uri_edu-transcriptomes/dlab/data/searches
#SBATCH --constraint=avx512

module load uri/main BLAST+/2.15.0-gompi-2023a
module load uri/main HMMER/3.4-gompi-2023a

echo "Blastp search" $(date)
blastp -query /scratch/workspace/ffields_uri_edu-transcriptomes/dlab/data/cd-hit/dlab_cdhit.fasta.transdecoder_dir/longest_orfs.pep \
	-db /scratch/workspace/ffields_uri_edu-transcriptomes/dlab/data/databases/uniprot/uniprot_sprot.fasta \
	-evalue 1e-5 \
	-max_target_seqs 5 \
	-outfmt 6 \
	-num_threads 8 > blastp.outfmt6

echo "pfam search" $(date)

hmmscan --cpu 8 \
	--domtblout pfam.domtblout \
	/scratch/workspace/ffields_uri_edu-transcriptomes/dlab/data/databases/pfam/Pfam-A.hmm \
	/scratch/workspace/ffields_uri_edu-transcriptomes/dlab/data/cd-hit/dlab_cdhit.fasta.transdecoder_dir/longest_orfs.pep

echo "Done. All outputs saved." $(date)

sbatch /scratch/workspace/ffields_uri_edu-transcriptomes/dlab/scripts/Blastp_Pfam_search Submitted batch job 44042346

Mdec

nano /scratch3/workspace/ffields_uri_edu-transcriptomes/mdec/scripts/Blastp_Pfam_search

#!/bin/bash
#SBATCH --job-name=blastp_mdec
#SBATCH --nodes=1 --cpus-per-task=15
#SBATCH --mem=200G  # Requested Memory
#SBATCH --time=24:00:00
#SBATCH -o slurm-mdec.out  # %j = job ID
#SBATCH -e slurm-mdec.err  # %j = job ID
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=ffields@uri.edu
#SBATCH -D /scratch3/workspace/ffields_uri_edu-transcriptomes/mdec/data/searches
#SBATCH --constraint=avx512

module load uri/main BLAST+/2.15.0-gompi-2023a
module load uri/main HMMER/3.4-gompi-2023a

echo "Blastp search" $(date)
blastp -query /scratch3/workspace/ffields_uri_edu-transcriptomes/mdec/data/cd-hit/mdec_cdhit.fasta.transdecoder_dir/longest_orfs.pep \
	-db /scratch3/workspace/ffields_uri_edu-transcriptomes/dlab/data/databases/uniprot/uniprot_sprot.fasta \
	-evalue 1e-5 \
	-max_target_seqs 5 \
	-outfmt 6 \
	-num_threads 8 > blastp.outfmt6

echo "pfam search" $(date)

hmmscan --cpu 8 \
	--domtblout pfam.domtblout \
	/scratch3/workspace/ffields_uri_edu-transcriptomes/dlab/data/databases/pfam/Pfam-A.hmm \
	/scratch3/workspace/ffields_uri_edu-transcriptomes/mdec/data/cd-hit/mdec_cdhit.fasta.transdecoder_dir/longest_orfs.pep

echo "Done. All outputs saved." $(date)
sbatch  /scratch3/workspace/ffields_uri_edu-transcriptomes/mdec/scripts/Blastp_Pfam_search
Submitted batch job 44041381

Mcav

nano /scratch3/workspace/ffields_uri_edu-transcriptomes/mcav/scripts/Blastp_Pfam_search

#!/bin/bash
#SBATCH --job-name=blastp_mcav
#SBATCH --nodes=1 --cpus-per-task=15
#SBATCH --mem=200G  # Requested Memory
#SBATCH --time=24:00:00
#SBATCH -o slurm-mcav.out  # %j = job ID
#SBATCH -e slurm-mcav.err  # %j = job ID
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=ffields@uri.edu
#SBATCH -D /scratch3/workspace/ffields_uri_edu-transcriptomes/mcav/data/searches
#SBATCH --constraint=avx512

module load uri/main BLAST+/2.15.0-gompi-2023a
module load uri/main HMMER/3.4-gompi-2023a

echo "Blastp search" $(date)
blastp -query /scratch3/workspace/ffields_uri_edu-transcriptomes/mcav/data/cd-hit/mcav_cdhit.fasta.transdecoder_dir/longest_orfs.pep \
-db /scratch3/workspace/ffields_uri_edu-transcriptomes/dlab/data/databases/uniprot/uniprot_sprot.fasta \
-evalue 1e-5 \
-max_target_seqs 5 \
-outfmt 6 \
-num_threads 8 > blastp.outfmt6

echo "pfam search" $(date)

hmmscan --cpu 8 \
--domtblout pfam.domtblout \
/scratch3/workspace/ffields_uri_edu-transcriptomes/dlab/data/databases/pfam/Pfam-A.hmm \
/scratch3/workspace/ffields_uri_edu-transcriptomes/mcav/data/cd-hit/mcav_cdhit.fasta.transdecoder_dir/longest_orfs.pep

echo "Done. All outputs saved." $(date)
sbatch  /scratch3/workspace/ffields_uri_edu-transcriptomes/mcav/scripts/Blastp_Pfam_search
Submitted batch job 44043202

Step 4: Renaming the gene ID headers for identification purposes

sed 's/^>/\>dlab/' dlab_cdhit.fasta > dlab_cdhit_m.fasta
sed 's/^>/\>mcav/' mcav_cdhit.fasta > mcav_cdhit_m.fasta
sed 's/^>/\>mdec/' mdec_cdhit.fasta > mdec_cdhit_m.fasta

Step 5: Run TransDecoder with Homology Support

It takes the ORFs identified from TransDecoder.LongOrfs and predicts which ones are coding sequences using the hits from blastp and pfam.

Dlab

srun --nodes=1 --ntasks=4 --mem=5G --time=00:30:00 --pty bash
module load uri/main TransDecoder/5.5.0-GCC-11.3.0

TransDecoder.Predict -t dlab_cdhit_m.fasta \
  --retain_blastp_hits /scratch/workspace/ffields_uri_edu-transcriptomes/dlab/data/searches/blastp.outfmt6 \
  --retain_pfam_hits /scratch/workspace/ffields_uri_edu-transcriptomes/dlab/data/searches/pfam.domtblout

Mdec

TransDecoder.Predict -t mdec_cdhit.fasta \
  --retain_blastp_hits /scratch/workspace/ffields_uri_edu-transcriptomes/mdec/data/searches/blastp.outfmt6 \
  --retain_pfam_hits /scratch/workspace/ffields_uri_edu-transcriptomes/mdec/data/searches/pfam.domtblout

#### Mcav

TransDecoder.Predict -t mcav_cdhit.fasta \
  --retain_blastp_hits /scratch/workspace/ffields_uri_edu-transcriptomes/mcav/data/searches/blastp.outfmt6 \
  --retain_pfam_hits /scratch/workspace/ffields_uri_edu-transcriptomes/mcav/data/searches/pfam.domtblout

### Step 5: Renaming the gene ID headers for identification purposes

sed 's/^>/dlab|/' dlab_cdhit.fasta.transdecoder.pep > dlab.faa
sed 's/^>/\>dlab_/' dlab_cdhit.fasta.transdecoder.pep > dlab.faa
sed 's/^>/\>mcav_/' mcav_cdhit.fasta.transdecoder.pep > mcav.faa
sed 's/^>/\>mdec_/' mdec_cdhit.fasta.transdecoder.pep > mdec.faa

TransDecoder.Predict -t mcav_cdhit.fasta sed ‘s/^>/>mcav_/’ longest_orfs.pep > mcav.faa

sed ‘s/^>/>mdec_/’ species1.Trinity.fasta > mdec.Trinity.ID.fasta sed ‘s/^>/>species1_/’ species1.faa > species1.ID.faa

TransDecoder.LongOrfs -t species1_transcripts.fasta TransDecoder.Predict -t dlab_cdhit.fasta

mv Trinity.fasta.transdecoder.pep species1.faa

#!/usr/bin/env bash #SBATCH –export=NONE #SBATCH –ntasks=1 –cpus-per-task=20 #split one task over multiple CPU #SBATCH –mem=500GB #SBATCH -p cpu-preempt #SBATCH -t 48:00:00 #SBATCH –mail-type=END,FAIL,TIME_LIMIT_80 #email you when job stops and/or fails or is nearing its time limit #SBATCH –error=scripts/outs_errs/”%x_error.%j” #if your job fails, the error report will be put in this file #SBATCH –output=scripts/outs_errs/”%x_output.%j” #once your job is completed, any final job report comments will be put in this file #SBATCH -D #load conda and activate conda environment module load conda/latest conda activate /work/pi_hputnam_uri_edu/conda/envs/env-broccoli

#load additional programs needed to run broccoli module load uri/main module load diamond/2.1.10 module load all/FastTree/2.1.11-GCCcore-12.3.0

mkdir -p output/broccoli

cd output/broccoli python ~/broccoli.py -dir ../../cnidarian_protein_fastas/ -ext ‘.faa’ -path_fasttree FastTree -threads 8

Functional anotation

#!/bin/bash #SBATCH –job-name=dlab-fa #SBATCH –nodes=1 –cpus-per-task=15 #SBATCH –mem=500G # Requested Memory #SBATCH -t 48:00:00 #SBATCH -o slurm-dlab-fa.out # %j = job ID #SBATCH -e slurm-dlab-fa.err # %j = job ID #SBATCH –mail-type=END,FAIL #email you when job starts, stops and/or fails #SBATCH –mail-user=federica.scucchia@uri.edu #your email to send notifications #SBATCH -D /scratch/workspace/ffields_uri_edu-transcriptomes/dlab/data/functional_annotation

load modules needed

#module load diamond/2.1.10 module load blast-plus/2.14.1 #module load uri/main all/InterProScan/5.60-92.0-foss-2021b

diamond blastp
-d uniprot_sprot.dmnd
-q /scratch/workspace/ffields_uri_edu-transcriptomes/dlab/data/cd-hit/dlab.faa
-o
-f 6 -b 4 –more-sensitive -e 0.00001 -k 1 -e 1e-5 –threads 8

Run InterProScan

#interproscan.sh -f TSV -i

note:look for packages that remove ribosomal rnas from transcriptomes

/scratch3/workspace/ffields_uri_edu-transcriptomes

Written on August 14, 2025