A Bioinformatics Paper Review

By Craig Sketchley

Motivation

The Problem: RNA-Seq

RNA-seq: sequencing the mRNA within a cell at a given point in time.

mRNA is not transcribed from a single contiguous section in DNA.

How do you accurately identify splice sites?

High-throughput sequencing also makes it challenging to detect & characterise splice sites.

Two key tasks:

Accurate alignment
Mapping sequences from non-contiguous regions

Background

Current Algorithmic Solutions

Extended DNA mapping solutions.

Offer compromises in either accuracy or resources required.

Computational component becoming bottleneck.

Mostly designed for short reads (≤ 200 bases).

Not great for "Third Generation Sequencing" (potentially full length reads).

What is STAR?

A tool specifically designed to align non-contiguous sequences to a reference genome.

"Spliced Transcripts Alignment to a Reference" (STAR)

Method

Overview

The STAR algorithm consists of 2 main steps:

Seed Search
Clustering, stitching & scoring

Method

Seed Search Algorithm

Searching for seeds involves a sequential search for the Maximum Mappable Prefix ($MMP$).

The $MMP$ is calculated as follows:

Given a read sequence $R$, read location $i$ and a reference genome sequence $G$, the $MMP(R,i,G)$ is defined as the longest substring $(R_i,R_{i+1},\dots,R_{i+MML-1})$ that matches exactly one or more substrings of $G$, where $MML$ is the Maximum Mappable Length.
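The definition above can be illustrated with a deliberately naive sketch. It uses plain string search rather than a suffix array, and the names `find_all`, `mmp` and `seed_search`, the toy genome and the toy read are all assumptions made for illustration, not taken from the paper or from STAR's source code.

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Return every start position of pattern in text (overlaps allowed)."""
    positions, start = [], text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def mmp(read: str, i: int, genome: str) -> tuple[int, list[int]]:
    """Return (MML, genome positions) for the longest prefix of read[i:]
    that matches the genome exactly at one or more places."""
    best_len, best_pos = 0, []
    for length in range(1, len(read) - i + 1):
        positions = find_all(genome, read[i:i + length])
        if not positions:
            break  # the prefix can no longer be extended
        best_len, best_pos = length, positions
    return best_len, best_pos


def seed_search(read: str, genome: str) -> list[tuple[int, int, list[int]]]:
    """Sequential search: find an MMP, then restart on the unmapped remainder."""
    seeds, i = [], 0
    while i < len(read):
        length, positions = mmp(read, i, genome)
        if length == 0:
            i += 1  # simplification: skip a base that maps nowhere
            continue
        seeds.append((i, length, positions))
        i += length
    return seeds


# Toy genome: exon "ACGTAC", intron "GGGGGG", exon "TTCATG".
genome = "ACGTACGGGGGGTTCATG"
read = "ACGTACTTCATG"  # the two exons spliced together
seeds = seed_search(read, genome)
```

Each seed is a tuple of read position, mappable length and the list of matching genome positions. In this toy case the read splits into two seeds, one per exon.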

Method

Seed Search Example

The search is implemented using a suffix array for the reference genome; the read sequence is then threaded through.

## Basic Idea

- Once the MMP is found, its end marks a splice location; rerun the search on the remainder of the read sequence that is not mapped [show diagram].
- The search is performed forwards and backwards on the read, and can also be started from a user-defined position to help find anchors for reads with errors at the ends.

## Figure

- This approach can also help find mismatches/indels. Run a sequence; if it doesn't reach the end, extend the "anchor", allowing for mismatches (what about indels?).
- Poor alignment from this procedure can help identify sequence traits such as poly-A tails (+AAAAAAA), library adapters (???) or just poor sequencing.

## SA

- The uncompressed SA offers a speed advantage with a trade-off in space, which we'll come to in the results section.
- Build a suffix array of the reference genome and thread the read sequence through the array for a given read position.
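A minimal sketch of threading a read through a suffix array, assuming a toy genome small enough for a naive construction. The function names and the binary-search helper are hypothetical; STAR's own implementation is in C++ and is not reproduced here.

```python
def build_suffix_array(genome: str) -> list[int]:
    """Naive construction: sort suffix start positions by suffix text.
    Only suitable for toy inputs."""
    return sorted(range(len(genome)), key=lambda k: genome[k:])


def _bound(genome, sa, lo, hi, offset, c, upper):
    """Binary search in sa[lo:hi] on the character at `offset` of each suffix.
    Returns the first index whose character is >= c (or > c if upper)."""
    while lo < hi:
        mid = (lo + hi) // 2
        pos = sa[mid] + offset
        ch = genome[pos] if pos < len(genome) else ""  # "" sorts first
        if ch < c or (upper and ch == c):
            lo = mid + 1
        else:
            hi = mid
    return lo


def mmp_suffix_array(read: str, i: int, genome: str, sa: list[int]):
    """Extend the match one base at a time, narrowing the suffix array
    interval until no suffix matches the next base of the read."""
    lo, hi, length = 0, len(sa), 0
    while i + length < len(read):
        c = read[i + length]
        new_lo = _bound(genome, sa, lo, hi, length, c, upper=False)
        new_hi = _bound(genome, sa, new_lo, hi, length, c, upper=True)
        if new_lo == new_hi:
            break  # no suffix continues with this base
        lo, hi = new_lo, new_hi
        length += 1
    positions = sorted(sa[lo:hi]) if length else []
    return length, positions


genome = "ACGTACGGGGGGTTCATG"
sa = build_suffix_array(genome)
first = mmp_suffix_array("ACGTACTTCATG", 0, genome, sa)
second = mmp_suffix_array("ACGTACTTCATG", 6, genome, sa)
```

Because every suffix in the current interval shares the bases matched so far, each additional base costs two binary searches. The interval that remains when extension stops lists every genomic location of the MMP at once.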

Method

Clustering, Stitching and Scoring

Cluster seeds around a selected anchor seed.

Anchor seeds are selected by minimising the number of genomic mappings.

Seeds are then stitched together using a local-linear transcription model.
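Anchor selection and clustering can be sketched roughly as follows. The seed tuple layout, the `window` parameter and its value are assumptions made for illustration; the stitching step itself is not shown.

```python
def cluster_seeds(seeds, window=100):
    """seeds: list of (read_start, length, genome_positions).
    Pick the seed with the fewest genomic mappings as the anchor, then
    gather every seed alignment that falls within `window` bases of
    each anchor position."""
    anchor = min(seeds, key=lambda s: len(s[2]))
    clusters = []
    for anchor_pos in anchor[2]:
        members = sorted(
            (read_start, length, g_pos)
            for read_start, length, positions in seeds
            for g_pos in positions
            if abs(g_pos - anchor_pos) <= window
        )
        clusters.append(members)
    return anchor, clusters


# Hypothetical seeds: the second one also maps far away at position 500.
seeds = [(0, 6, [0]), (6, 6, [12, 500])]
anchor, clusters = cluster_seeds(seeds, window=100)
```

Here the uniquely mapping seed becomes the anchor, and the distant copy of the second seed at position 500 is left out of the cluster.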

Method

Clustering, Stitching and Scoring cont.

Stitching is guided by a local alignment score.

$S = +\sum_{m}P_m - \sum_{mm}P_{mm} - \sum_{ins}P_{ins} - \sum_{del}P_{del} - \sum_{gap}P_{gap}$

$P_{ins/del} = P_{ins/del}^{open} + P_{ins/del}^{extend} \cdot L_{ins/del}$
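The two formulas above translate directly into a small scoring function. All penalty values below are illustrative placeholders, not STAR's defaults, and the argument names are assumptions.

```python
def indel_penalty(length, p_open, p_extend):
    """P_ins/del = P_open + P_extend * L."""
    return p_open + p_extend * length


def alignment_score(n_match, n_mismatch, ins_lengths, del_lengths,
                    gap_penalties, p_match=1, p_mismatch=1,
                    p_open=2, p_extend=1):
    """S = sum(P_m) - sum(P_mm) - sum(P_ins) - sum(P_del) - sum(P_gap).
    Penalty values are placeholders chosen for the example."""
    score = n_match * p_match
    score -= n_mismatch * p_mismatch
    score -= sum(indel_penalty(n, p_open, p_extend) for n in ins_lengths)
    score -= sum(indel_penalty(n, p_open, p_extend) for n in del_lengths)
    score -= sum(gap_penalties)
    return score


# 70 matches, 2 mismatches, one 1-base insertion, one 3-base deletion,
# and one splice gap with zero penalty:
# 70 - 2 - (2 + 1) - (2 + 3) - 0 = 60
example = alignment_score(70, 2, [1], [3], [0])
```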

If one genomic window is not enough to map the entire read sequence, another anchor is chosen and clustering applied again.

This results in a chimeric read, where the mRNA is spliced from 2 distal parts of the genome.

Results

The paper compared STAR 2.1.3 results with 4 other popular RNA-seq mappers:

TopHat 2.0.0
GSNAP 2012-07-03
RUM 1.11
MapSplice

Results

Performance on Simulated Data

Simulated data provides a known ground truth, so the expected mappings are known exactly.

All aligners were run in de novo mode with default parameters.

ROC curves plot the true-positive rate against the false-positive rate.

The detection threshold $N$, the number of reads required across a splice junction for it to be recognised as a splice, is varied from 1 to 100.

All aligners exhibit desirably steep ROC curves at high detection thresholds.

README!!!

Initially, they used simulated data to accurately benchmark all the mappers; they know exactly what mapping to expect. The data included genomic variations and sequencing errors.

__de novo__ - from the beginning, without any knowledge of gene/transcript annotations. String comparison.

__default parameters__ - commonly accepted practice, since all aligners should have been optimised for mammalian genomes and RNA-seq data.

### ROC curve:

- Plots the true positive (a classification hit) vs. the false positive (a misclassification, or false alarm).
- Varies with a detection/discrimination threshold.
- In this case, the plot varies the number of reads required across a splice junction for it to be recognised as a splice, from 1 to 100. HOW MUCH EVIDENCE IS REQUIRED FOR A SPLICE.

"All aligners exhibit desirable steep ROC curves at high values of detection threshold."

"At the lowest detection threshold of 1 read per junction, STAR exhibits the lowest false-positive rate while achieving high sensitivity."
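The threshold sweep behind the ROC curves can be sketched as below. The junction names and read counts are made up for illustration, and the sketch reports raw true-positive and false-positive counts, since how the paper normalises them into rates is not covered in these slides.

```python
def roc_points(junction_reads, true_junctions, thresholds=range(1, 101)):
    """For each threshold N, call a junction if at least N reads cross it,
    then count calls that are in the simulated truth (TP) or not (FP)."""
    points = []
    for n in thresholds:
        called = {j for j, reads in junction_reads.items() if reads >= n}
        tp = len(called & true_junctions)
        fp = len(called - true_junctions)
        points.append((n, tp, fp))
    return points


# Hypothetical data: junction -> number of reads spanning it.
junction_reads = {"j1": 50, "j2": 3, "j3": 1, "j4": 2}
true_junctions = {"j1", "j2"}
points = roc_points(junction_reads, true_junctions, thresholds=range(1, 5))
```

Raising the threshold removes the weakly supported false junctions first, and only later starts to cost true junctions, which is what a steep ROC curve reflects.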

Results

Performance on Experimental Data

All mappers were run on an ENCODE (Encyclopaedia of DNA Elements) long RNA-seq dataset.

Percentage of reads aligned:

STAR - 94%
GSNAP - 94%
RUM - 86%
MapSplice - 85%
TopHat2 - 71%

Results

Performance on Experimental Data cont.

To measure accuracy, the plots included a pseudo-ROC.

It plotted the following against each other:

the number of junctions detected by at least two mappers (pseudo-true positive)
the number of junctions detected exclusively by each mapper (pseudo-false positive)

The Idea:

If another mapper detected the junction, then it's probably correct.
If no other mapper detected a junction, then it's probably wrong.
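The pseudo-ROC idea above can be sketched with set counting. The mapper names "A", "B", "C" and their junction sets are hypothetical, not results from the paper.

```python
from collections import Counter


def pseudo_roc(calls):
    """calls: mapper name -> set of junctions it detected.
    Returns mapper -> (pseudo_true_positives, pseudo_false_positives):
    junctions also found by at least one other mapper, and junctions
    found by this mapper alone."""
    support = Counter(j for junctions in calls.values() for j in junctions)
    return {
        mapper: (
            sum(1 for j in junctions if support[j] >= 2),
            sum(1 for j in junctions if support[j] == 1),
        )
        for mapper, junctions in calls.items()
    }


calls = {
    "A": {"j1", "j2", "j3"},
    "B": {"j1", "j2"},
    "C": {"j2", "j4", "j5"},
}
result = pseudo_roc(calls)
```

Agreement between mappers is only a proxy for correctness: a junction found by a single mapper may still be real, and a shared junction may still be wrong.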

Results

Performance on Experimental Data cont.

Results

Speed Comparison

All mappers were run with default parameters on the ~40 million 2 x 76 Illumina human RNA-seq dataset.

Close to linear scaling of the throughput rate with the number of threads.

STAR with 12 threads: ~45 million reads per thread per hour.

RAM usage is higher than most: ~27 GB for the human genome.

STAR has a sparse suffix array option that reduces RAM usage at the cost of speed.
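As a rough worked example of the throughput figure above, assuming perfectly linear scaling across the 12 threads and that "reads" means the same unit in both numbers:

$12 \times 45 \approx 540$ million reads per hour

$40 / 540 \approx 0.074$ hours, or roughly 4.4 minutes for a ~40 million read dataset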

Results

Experimental Validation

STAR was validated on data from the ENCODE project and compared against BLAT (a popular mRNA aligner).

Accuracy similar to or higher than BLAT.

2 x faster than BLAT, important for high-throughput sequencing.

Discussion

Aligning non-contiguous RNA-seq data to a reference genome is hard. It remains unsolved.

STAR is:

a stand-alone C++ RNA-seq mapper.
multi-threaded, fast and scalable.
accurate.
extensible to longer reads.
able to align reads from a continuous stream (high-throughput).

The End