Alignment With Kallisto

Overview

Questions

  • How do I perform pseudo-alignment to map the transcriptome of my sample?

Objectives

  • Understand the difference between pseudo-alignment and alignment

  • Select the correct parameters for kallisto for your sample

  • Submit your job to the cluster

Introduction

Despite challenges of RNA-seq, it is a very useful experiment and researchers have worked through many of the challenges above to develop software that can help us infer what is happening in the transcriptome.

Kallisto is a quick, highly-efficient software for quantifying transcript abundances in an RNA-Seq experiment. Even on a typical laptop, Kallisto can quantify 30 million reads in less than 3 minutes.

In order to analyze data with Kallisto we need several inputs:

  1. Trimmed and filtered FASTQ files
  2. A Reference transcriptome This is a file that has the sequences for all the known expressed genes. Reference transcriptomes are usually available from repositories like Ensembl and NCBI. We will be using the human reference transcriptome. Unlike a genome, the transcriptome is only coding genes.

Step-by-Step Kallisto

Analysis with Kallisto has two main steps:

Step Description Command Input Output
1 Genome indexing kallisto index <–index=> Reference transcriptome Kallisto index (no file extension)
2 Pseudoalignment kallisto quant sample.fa.gz FASTQ (one per run of kallisto), kallisto index (from indexing step) abundances.h5, abundances.tsv and run_info.json

Pseudoalignment and Genomics Word Search Explained

Alignment of reads is an expansive topic. Several reviews cover some of the important topics including Stark et. al. 2019. This blog post and the kallisto paper are further readings to get a deep understanding of the subject.

To try and reduce this problem to its most basic, the Leo Pachter Lab used an analogy.

In the traditional case, when software does alignment, it tries to match a read to the genome.

  Genome: ACTACGTAGCCGTCAAATATCCCGGGTATCGTACGATCGACGT

  Read:   AAATTTCCCGGG

If we move things around we can find the match:

  Genome: ACTACGTAGCCGTCAAATATCCCGGGTATCGTACGATCGACGT

                  |||| |||||||

  Read:           AAATTTCCCGGG

In this example, we have one mismatch. Although finding this match was simple, the genome is far more complex.

In the word search below is the word “DNA”. Can you find it? It may take you a while. Searching for words is a process similar to taking sequencing reads and trying to match them to the genome. Computers are fast, but just as matching a small word (3 letters) in a large (1639 letter) word puzzle is time-intensive, it take a long time to match millions of short reads against genomes of billions of nucleotides.

CUSVFVMAASJFHUTMNCCQMBVXOLBEETYHSRBWOSEY MOBJEYXAZMPMFENZHQKMHHSCZUXUQYEBQONJVYWH LCMIFVSPNMAGIJAOOFCWNYYDETTLMGCDOBSLOPXO ZAUSKOGLCYOIKIXZSHOXHLYGZJLRLZMRGHRFRJWN HBEOJQLIXUYAIAWMJASRBOVSBHMAHJFPUOXTQIYZ LSSCVOGPJCIIMUILZCFKLASLFLQFIZVSYXWJYNMJ QPWLYGLRTSTQHYUVFVGGPDMICLGDUMZOVPHFPLOD YFZAYPVONRRCPHJPNJVROZEZPBLZNYBQHOVUPOGJ SAVCNZWIYQKHJOLKQBHZYGJZKUEHKJLCLXOLMUJF YJAGKRFQOQRNAXXXZAZUMMWMLKWNREHDFOLVVYWK ATOXWLBGQECNAMWJIMONGSAVKGSHUVUOROVJRJGT AJLVEYWLSFPUIYABELQZLYLUGZJGPGWEVPXYCHGO IPXOCIPREEDCYTFZHBDTTOIJRWLEYEMMTRIDSMGJ BNEZKUNVOXZFIXAAGGYUWAYZSNVUHXJZMMPXBPTE QMVXMUJAFUGBBLAVVCGVBDIGDBIDBSHTEVPCJTUB ELJIPXDVTMPNCROAPIHZMROJXZFCDHRYWIEUSZST ERRCPCLAPIJROAAUKZNDYYKNWKNUNFDKSTLYTXKA UAMNKGZVKWMOYGLLWUJOECUNBRUGRPWWUNNVEXYF MQYPHTEPNKPLWECXVORINUEHHVLVBNFUJJXNEZYT MJDQTZMIRJXMLYUCXYCVFQHHQBERSMGQLRWNYAJP IELYQQDAIXJDTUONASHBTLCZISHQWCMJZFAFJOXE UGZWMJGDLLRKIZVVLPHYBCULVCAIRMMEQPRRHFHQ FFCWSFSQZJTBCVFHUAEECJPXKHMKRKJFBZVQNOMZ ZZZEMQQNVWUXDOYBDHSDFLHMAQEWFKWPMIGJVZUT ZNEQDNAHXMPBLRVDEBJRJFTDOIWUPFLSIYOOHNQH TTMIXMVUMETGMJWFEWCRITKZWZGVMCNQRAPDCJZD AXPTCGYNKVFDPALHRQNPKQUBSEGOOQCVUDDHTIWR GEMJXKJGSSRAROGDWETVVTZMWUSCYXIKTUFLUIXO BDZFSFRPFXHALCKUGTOHGMEHYPRPTFXUSWSKDBWC UGGJXCDKYYZZFLDAWLDYJLCRSTDSMLOGPCVAUDZZ AHGKHDAGOHYQDKGWDTOGFVOSRKJCPRRENRQRBBPQ ABHWPOFZYTGHXPREABCTWABQLZLYBOQVHOMUHGRS FPNWCEQGTYJEDRWAVDYKUAZMLVVNTYNRZHPFHYGR UAJNXZWZSBLEVYIGLRPFNZLOGRULAHIAHDLXTJXF DQEKRQNVJACCBVGABSKTWRPYEFXNEHSICVLHWOIV GVOGZWUSLGRGHMVENEYNSBCZRLOZYCMXCYGWUMSG LFXDWDWHJYXAXDVWVMTPKQVTAYHNCADYJCNKNDZM ASJBGHYSPRUIIVQDAVUQBEEDNQKBKQPMKYQKRIKV MWFHPISMEIUIVZVBEUKTFOUADMUDXAJSGYHLXUSP OSIVVLZMDTLDMCLGZLEGWLPVPWOLNERAINTAESFR

Pseudoalignment is one approach to this computational “word search”. It takes advantage of a trick to speed up performance without loosing accuracy. Take the second line in our “DNA” word search puzzle:

MOBJEYXAZMPMFENZHQKMHHSCZUXUQYEBQONJVYWH

There is no “D” in this line. True, We don’t know that until we read the entire line, but once we realize this line can’t possible be a match without a “D” We can ignore this line. Word search puzzles don’t have to be read in just one direction (words might be on diagonals, backwards, etc.), but now consider what happens in the pseudoalignment “word search”. In this case we are searching not the entire genome, but linear transcripts:

Transcript 1: CUSVFVMAASJFHUTMNCCQMBVXOLBEETYHSRBWOSEY

Transcript 2: MOBJEYXAZMPMFENZHQKMHHSCZUXUQYEBQONJVYWH

Transcript 3: LCMIFVSPNMAGIJAOOFCWNYYDETTLMGCDOBSLOPXO

Transcript 4: ZAUSKOGLCYOIKIXZSHOXHLYGZJLRLZMRGHRFRJWN

Transcript 5: ZNEQDNAHXMPBLRVDEBJRJFTDOIWUPFLSIYOOHNQH

We can immediately eliminate transcripts that don’t contain the letter D:

Transcript 3: LCMIFVSPNMAGIJAOOFCWNYYDETTLMGCDOBSLOPXO

Transcript 5: ZNEQDNAHXMPBLRVDEBJRJFTDOIWUPFLSIYOOHNQH

Immediately, the problem is made easier by throwing away transcripts that could not contain the answer. This is not an exact analogy but basically, rather than trying to match every read to every position in the genome, Kallisto is faster because 1) we are only matching to the transcriptome (a subset of the genome) and 2) we focus only on transcripts that could have generated a particular read.

Note Pseudoalignment is just one approach to aligning RNA-Seq reads. Other software will do full alignments of the read to a transcriptome or genome. These methods have different advantages and requirements.

Step 1 Genome indexing for Kallisto

There are a few files we need to perform the first step of Kallisto

  • Reference transcriptome: A file of all the known trasncripts of the human genome
  • Reference annotations: A file with information on the location and structure of the genes in the human genome and a file with chromosome details.

We will now use Kallisto’s indexing function to prepare the transcriptome for analysis. The “Index” is a lookup table for the transcriptome that allows it to be more easily searched by Kallisto. First let’s organize our files by creating a new directory to hold our kallisto work.

$ mkdir -p /srv/scratch/[your_zID]/kallisto_human_ref/

First, we must download the reference files from (https://asia.ensembl.org/info/data/ftp/index.html) using wget

  1. We will need the FASTA file of the cDNA sequence (https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz)

    $ wget https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

OR, if the expected download time is ages please copy from the communal folder.

  $ scp /srv/scratch/babs3291/references/Homo_sapiens.GRCh38.cdna.all.fa.gz /srv/scratch/[your_zID]/kallisto_human_ref/
  1. We also will need the human GTF file, a file containing coordinates and descriptions for all gene names and locations - we will also download this from Ensembl. (https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz) not needed for index command

     $ wget https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz 
    

OR, if the expected download time is large, please copy from the communal folder.

    $ scp  /srv/scratch/babs3291/references/Homo_sapiens.GRCh38.109.gtf.gz /srv/scratch/[your_zID]/kallisto_human_ref/

Also, must unzip the gtf above. This will take the gtf from being compressed to human readable.

    $ gunzip Homo_sapiens.GRCh38.109.gtf.gz 

First we must load kallisto to our session using module load as this is not installed.

    $ module load kallisto

Next run the indexing command. This prepares the transcriptome so that we can pseudoalign reads to it.

$ kallisto index --index=transcriptome_Homo_sapiens_GRCh38 kallisto_human_ref/Homo_sapiens.GRCh38.cdna.all.fa.gz

Step 2 Pseudoalignment of reads with Kallisto

In this final step, we will run Kallisto on all of our files to quantify the reads. We will write a for loop to do this. Let’s see once again our trimmed reads

Using your trimmed reads

$ cd /srv/scratch/[your_zID]/trimmed_fastq/

All instructions for the commands we are using are in the Kallisto manual: https://pachterlab.github.io/kallisto/manual. Since we are using single read data, we need to provide information on the fragment length used for the library (200) and an estimate of the standard deviation for this value - here we will have to guess (20).

We need to run Kallisto on all of your files. Run the command below on one of your files.

Single-end:

$ INPUT_FASTA="[yourscratch]/data/SRR306844chr1_chr3.trimmed.fastq.gz"
 
$ kallisto quant \
 --single\
 --threads=8\
 --index=[insert_location_your transcriptome]\
 --bootstrap-samples=25\
 --fragment-length=200\
 --sd=20\
 --output-dir=output\
 --genomebam\
 --gtf=Homo_sapiens.GRCh38.109.gtf ${INPUT_FASTA]

For paired-end reads, you need two files as input.

$ INPUT_FASTA="[yourscratch]/data/SRR306844*.trimmed.fastq.gz"
 
$ kallisto quant \
 --threads=8\
 --index=[insert_location_your_transcriptome] \
 --bootstrap-samples=25 \
 --output-dir=output\
 --genomebam\
 --gtf=Homo_sapiens.GRCh38.109.gtf ${INPUT_FASTA}

kallisto quant produces three output files by default:

  • abundance.h5 is a HDF5 binary file containing run info, abundance esimates, bootstrap estimates, and transcript length information length. This file can be read in by sleuth
  • abundance.tsv is a plaintext file of the abundance estimates. It does not contains bootstrap estimates. Please use the –plaintext mode to output plaintext abundance estimates. Alternatively, kallisto h5dump can be used to output an HDF5 file to plaintext. The first line contains a header for each column, including estimated counts, TPM, effective length.
  • run_info.json is a json file containing information about the run

Lets have a look at what the file contains a list of abundances (counts) shows.

` head -n 100 abundance.tsv`

Next you will have to calculate an abundance.tsv file for every sample.

Step 3 For Loop to perform pseudoalignment of reads for every sample

Like when you performed trimming, you will need to first request enough computational resources through qsub -I. Then you will have to run a for loop to loop through every sample fastq file. The loop you use is dependent on whether you are single end or paired end.

If you have single-end reads.

$ for infile in *.trimmed.fastq.gz
      do
      base=$(basename ${infile} trimmed.fastq.gz)
      outdir="${base}"
      kallisto quant \
       --single\
       --threads=8\
       --index=[insert_location_your transcriptome]\
       --bootstrap-samples=25\
       --fragment-length=200\
       --sd=20\
       --output-dir=${outdir}\
       --gtf=Homo_sapiens.GRCh38.109.gtf ${infile}

  done

If you have paired-end reads. Hint: check the string provided as the second part of the basename command matches the prefix of your infile

 $ for infile in *_1.trimmed.fastq.gz
      do
      base=$(basename ${infile} _1.trimmed.fastq.gz)
      outdir="${base}"
      infiles="${base}*trimmed.fastq.gz"
      kallisto quant \
       --threads=8 \
       --index=/srv/scratch/z5342988/transcriptome_Homo_sapiens_GRCh38 \
       --bootstrap-samples=25 \
       --output-dir=${outdir} \
       --gtf=Homo_sapiens.GRCh38.109.gtf ${infiles}

  done

Exercise

Use kallisto quant to quantify all the transcriptome of your files. Hint Be polite and request your turn in the (HPC) queue!

Bonus

Using the referenced papers and other research materials you can find, what are some other software tools that could be used as an alternative to Kallisto in RNA-Seq? Do they require the same datasets as Kallisto does in order to run?

Adapted from https://cyverse-leptin-rna-seq-lesson-dev.readthedocs-hosted.com and https://pachterlab.github.io/kallisto/manual