A Bioinformatics Paper Review

By Craig Sketchley

Motivation

The Problem: RNA-Seq

RNA-seq: sequencing the mRNA within a cell at a given point in time.

mRNA is not transcribed from a single contiguous section in DNA.

How do you accurately identify splice sites?

High-throughput sequencing also makes it challenging to detect & characterise splice sites.

Two key tasks:

  • Accurate alignment
  • Mapping sequences from non-contiguous regions

Background

Current Algorithmic Solutions

Mostly extensions of existing DNA mapping solutions.

Offer compromises in either accuracy or resources required.

Computational component becoming bottleneck.

Mostly designed for short reads (≤ 200 bases).

Not great for "Third Generation Sequencing" (potentially full length reads).

What is STAR?

A tool specifically designed to align non-contiguous sequences to a reference genome.

"Spliced Transcripts Alignment to a Reference" (STAR)

Method

Overview

The STAR algorithm consists of 2 main steps:

  • Seed Search
  • Clustering, stitching & scoring

Seed Search Algorithm

Searching for seeds involves a sequential search for the Maximum Mappable Prefix ($MMP$).

The $MMP$ is calculated as follows:

Given a read sequence $R$, read location $i$ and a reference genome sequence $G$, the $MMP(R,i,G)$ is defined as the longest substring $(R_i,R_{i+1},\dots,R_{i+MML-1})$ that matches exactly one or more substrings of $G$, where $MML$ is the Maximum Mappable Length.
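The definition can be made concrete with a minimal brute-force sketch. The function names `mmp_naive` and `seed_search` and the toy sequences are assumptions made for illustration; the code uses plain substring search rather than the suffix array STAR relies on, so it shows the definition only, not STAR's implementation.

```python
def mmp_naive(read: str, i: int, genome: str) -> tuple[int, list[int]]:
    """Return (MML, genome positions) for the longest prefix of read[i:]
    that occurs exactly in genome. Brute force, for illustration only."""
    best_len = 0
    for length in range(1, len(read) - i + 1):
        if genome.find(read[i:i + length]) == -1:
            break  # a longer prefix cannot match if this one does not
        best_len = length
    if best_len == 0:
        return 0, []
    prefix = read[i:i + best_len]
    # Collect every (possibly overlapping) occurrence of the prefix.
    positions = [p for p in range(len(genome) - best_len + 1)
                 if genome.startswith(prefix, p)]
    return best_len, positions


def seed_search(read: str, genome: str) -> list[tuple[int, int, list[int]]]:
    """Sequential MMP search: each search restarts at the first unmapped base."""
    seeds, i = [], 0
    while i < len(read):
        mml, positions = mmp_naive(read, i, genome)
        if mml == 0:
            i += 1  # skip a base that does not occur in the genome at all
            continue
        seeds.append((i, mml, positions))
        i += mml
    return seeds


# Toy example: the read joins "ACGT" and "TCAG", which lie apart in the genome.
genome = "AAACGTGGGGGGTTCAGCCC"
read = "ACGTTCAG"
for read_start, mml, positions in seed_search(read, genome):
    print(read_start, mml, positions)
```

Each tuple holds the read offset $i$, the $MML$ and the genomic positions of the match; a seed boundary inside the read is a candidate splice junction.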

Seed Search Example


The search is implemented using a suffix array of the reference genome, against which each read sequence is searched.
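A suffix-array search for the MMP might look like the sketch below. The helper names (`build_suffix_array`, `_bound`, `mmp_suffix_array`) are hypothetical, the construction is deliberately naive and only suitable for toy inputs, and nothing here should be read as STAR's actual code. The idea is to keep an interval of suffixes that match the prefix so far and to narrow it by binary search, one read base at a time.

```python
def build_suffix_array(genome: str) -> list[int]:
    """Naive construction by sorting all suffixes; fine for toy inputs only."""
    return sorted(range(len(genome)), key=lambda k: genome[k:])


def _bound(genome: str, sa: list[int], lo: int, hi: int,
           offset: int, c: str, strict: bool) -> int:
    """First index in [lo, hi) whose suffix has a character >= c
    (or > c when strict) at the given offset."""
    while lo < hi:
        mid = (lo + hi) // 2
        p = sa[mid] + offset
        ch = genome[p] if p < len(genome) else ""  # "" sorts before any base
        if ch < c or (strict and ch == c):
            lo = mid + 1
        else:
            hi = mid
    return lo


def mmp_suffix_array(read: str, i: int, genome: str,
                     sa: list[int]) -> tuple[int, list[int]]:
    """Return (MML, sorted genome positions) for the MMP of read starting at i."""
    lo, hi, length = 0, len(sa), 0  # [lo, hi): suffixes matching read[i:i+length]
    while i + length < len(read):
        c = read[i + length]
        new_lo = _bound(genome, sa, lo, hi, length, c, strict=False)
        new_hi = _bound(genome, sa, new_lo, hi, length, c, strict=True)
        if new_lo == new_hi:
            break  # no suffix extends the match by this base
        lo, hi, length = new_lo, new_hi, length + 1
    positions = sorted(sa[lo:hi]) if length else []
    return length, positions


genome = "AAACGTGGGGGGTTCAGCCC"
sa = build_suffix_array(genome)
print(mmp_suffix_array("ACGTTCAG", 0, genome, sa))
print(mmp_suffix_array("ACGTTCAG", 4, genome, sa))
```

Because the interval is narrowed rather than recomputed, all genomic positions of the MMP come out of the same search at no extra cost.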

Clustering, Stitching and Scoring

Cluster seeds around a selected anchor seed.

Anchor seeds are selected by minimising the number of genomic mappings.

Seeds are then stitched together using a local-linear transcription model.
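The anchor selection and clustering steps can be sketched as follows. The `Seed` type, the `window` parameter and the tie-break on seed length are assumptions introduced for illustration; the stitching step itself is not implemented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seed:
    read_start: int   # offset of the seed in the read
    length: int       # seed length (the MML)
    positions: tuple  # genomic start positions where the seed matches exactly


def pick_anchor(seeds):
    """Choose the seed with the fewest genomic mappings.
    Ties go to the longest seed (an assumption, not from the paper)."""
    return min(seeds, key=lambda s: (len(s.positions), -s.length))


def cluster_around_anchor(seeds, window):
    """For each anchor locus, collect (seed, position) pairs that fall
    within `window` bases of it, ordered by their offset in the read."""
    anchor = pick_anchor(seeds)
    clusters = []
    for anchor_pos in anchor.positions:
        members = [(s, p) for s in seeds for p in s.positions
                   if abs(p - anchor_pos) <= window]
        members.sort(key=lambda sp: sp[0].read_start)
        clusters.append(members)
    return anchor, clusters


# Toy example: the first seed maps to two loci, the second to only one.
seeds = [Seed(0, 4, (2, 500)), Seed(4, 4, (13,))]
anchor, clusters = cluster_around_anchor(seeds, window=100)
print(anchor)
print(clusters)
```

In this toy input the uniquely mapping seed becomes the anchor, and the distant copy of the other seed at position 500 falls outside the window and is dropped.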

Clustering, Stitching and Scoring cont.

Stitching is guided by a local alignment score.

$S = \sum_{m}P_m - \sum_{mm}P_{mm} - \sum_{ins}P_{ins} - \sum_{del}P_{del} - \sum_{gap}P_{gap}$

$P_{ins/del} = P_{ins/del}^{open} + P_{ins/del}^{extend} \cdot L_{ins/del}$
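As a small worked example of the score, the sketch below evaluates $S$ for an alignment summarised by its event counts. The `Penalties` values are placeholders chosen for easy arithmetic, not STAR's defaults, and the function and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Penalties:
    # Placeholder values for illustration only, not STAR's defaults.
    match: float = 1.0
    mismatch: float = 1.0
    ins_open: float = 2.0
    ins_extend: float = 1.0
    del_open: float = 2.0
    del_extend: float = 1.0
    gap: float = 1.0


def alignment_score(n_match, n_mismatch, insertion_lengths,
                    deletion_lengths, n_gaps, p=Penalties()):
    """Matches add to the score; mismatches, indels and gaps subtract.
    Each indel costs an opening penalty plus an extension penalty per base."""
    ins = sum(p.ins_open + p.ins_extend * length for length in insertion_lengths)
    dele = sum(p.del_open + p.del_extend * length for length in deletion_lengths)
    return (n_match * p.match - n_mismatch * p.mismatch
            - ins - dele - n_gaps * p.gap)


# 70 matches, 2 mismatches, one 1-base insertion, one 3-base deletion, 1 gap:
# 70 - 2 - (2 + 1) - (2 + 3) - 1 = 59
print(alignment_score(70, 2, [1], [3], 1))
```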

If one genomic window is not enough to map the entire read sequence, another anchor is chosen and clustering applied again.

This results in a chimeric read, where the mRNA is spliced from 2 distal parts of the genome.

Results

The paper compared STAR 2.1.3 results with 4 other popular RNA-seq mappers:

  • TopHat 2.0.0
  • GSNAP 2012-07-03
  • RUM 1.11
  • MapSplice

Performance on Simulated Data

Simulated data allows for accurate expected results.

All aligners were run in de novo mode with default parameters.

ROC curves plot the true positive rate against the false positive rate.

Each curve is obtained by varying $N$, the number of reads required across a splice junction for it to be recognised as a splice, from 1 to 100.

All ROC curves exhibit desirable results.
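How a curve is traced out by varying $N$ can be sketched as below. The function name `roc_points` and the toy junction data are assumptions; the sketch reports raw true and false positive counts, since the normalisation used for the rates is not given here.

```python
def roc_points(junction_reads: dict, true_junctions: set,
               thresholds=range(1, 101)):
    """For each threshold N, call a junction if at least N reads span it,
    then count true and false positives against the simulated truth."""
    points = []
    for n in thresholds:
        called = {j for j, reads in junction_reads.items() if reads >= n}
        tp = len(called & true_junctions)
        fp = len(called - true_junctions)
        points.append((n, tp, fp))
    return points


# Toy data: junction id -> number of reads spanning it.
junction_reads = {"a": 50, "b": 3, "c": 1, "d": 2}
true_junctions = {"a", "b"}
for n, tp, fp in roc_points(junction_reads, true_junctions, thresholds=[1, 2, 4]):
    print(n, tp, fp)
```

Raising $N$ removes weakly supported junctions, which lowers false positives at the cost of some true positives.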

Performance on Experimental Data

All mappers were run on an ENCODE (Encyclopaedia of DNA Elements) long RNA-seq dataset.

Percentage of reads aligned:

  • STAR - 94%
  • GSNAP - 94%
  • RUM - 86%
  • MapSplice - 85%
  • TopHat2 - 71%

Performance on Experimental Data cont.

To measure accuracy, the plots included a pseudo-ROC.

It plotted the following against each other:

  • the number of junctions detected by at least two mappers (pseudo-true positive)
  • the number of junctions detected exclusively by each mapper (pseudo-false positive)

The Idea:

  • If another mapper detected the junction, then it's probably correct.
  • If no other mapper detected a junction, then it's probably wrong.
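The pseudo-ROC bookkeeping can be sketched with plain sets. The function name `pseudo_roc_counts` and the junction identifiers are made up for illustration; only the mapper names come from the comparison above.

```python
from collections import Counter


def pseudo_roc_counts(calls: dict[str, set]) -> dict[str, tuple[int, int]]:
    """For each mapper, return (pseudo-true positives, pseudo-false positives):
    junctions also found by at least one other mapper, and junctions found
    by this mapper alone."""
    support = Counter(j for junctions in calls.values() for j in junctions)
    result = {}
    for mapper, junctions in calls.items():
        shared = sum(1 for j in junctions if support[j] >= 2)
        result[mapper] = (shared, len(junctions) - shared)
    return result


# Toy junction calls, identified by arbitrary integers.
calls = {
    "STAR": {1, 2, 3},
    "TopHat2": {2, 3, 4},
    "RUM": {3, 5},
}
print(pseudo_roc_counts(calls))
```

Agreement between mappers is only a proxy for correctness: a real junction found by a single mapper is counted against it, and an error shared by two mappers is counted in their favour.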

Performance on Experimental Data cont.

Figure: pseudo-ROC plots for the mappers compared.

Speed Comparison

All mappers were run with default parameters on a dataset of ~40 million 2 x 76 Illumina human RNA-seq reads.

Close to linear scaling of the throughput rate with the number of threads.

STAR with 12 threads ~= 45 million reads per thread per hour.
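
Taking these figures at face value, and assuming the per-thread rate holds across all 12 threads, the aggregate throughput works out to roughly:

$45 \times 10^{6} \times 12 = 5.4 \times 10^{8}$ reads per hour

At that rate, and assuming the ~40 million reads of the test dataset are counted in the same units, the whole dataset would take about $(40 \times 10^{6}) / (5.4 \times 10^{8}) \approx 0.074$ hours, or around 4.4 minutes. This is back-of-envelope arithmetic rather than a reported timing.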

RAM usage is higher than for most other mappers: ~27 GB for the human genome.

STAR has a sparse suffix array option that reduces RAM usage at the cost of speed.

Experimental Validation

STAR was validated on data as part of the ENCODE project and compared against BLAT (a popular mRNA aligner).

Accuracy similar to or higher than BLAT.

2 x faster than BLAT, important for high-throughput sequencing.

Discussion

Aligning non-contiguous RNA-seq data to a reference genome is hard. It remains unsolved.

STAR is:

  • a stand-alone C++ RNA-seq mapper.
  • multi-threaded, fast and scalable.
  • accurate.
  • extensible to longer reads.
  • able to align reads from a continuous stream (high-throughput).

The End