De novo genome assembly evaluation

Probabilistic measure of assembly quality

LAP is a de novo probabilistic measure of assembly quality which allows an objective comparison of multiple assemblies generated from a same set of reads. Specifically, we define the quality of an assembly as the conditional probability of the sequenced reads given the sequence output by the assembler, value maximized by the true genome sequence.

Site Map

Latest Release

LAP 1.1 9/3/13 

Please cite: Ghodsi, M., Hill, C. M., Astrovskaya, I., Lin, H., Sommer, D. D., Koren, S., & Pop, M. De novo likelihood-based measures for comparing genome assemblies." BMC research notes 6 (2013): 334.

Related Tools

Coming soon...

Related publications

  • Coming soon...

Authors

Other Documentation

  • OMICS Day, 5/12 (.pdf)

Links

Table of Contents

 Version 1.1

LAP Manual -

Introduction

What is LAP?

LAP (log average probability) is a de novo probabilistic measure of assembly quality which allows an objective comparison of multiple assemblies generated from a same set of reads. Specifically, we define the quality of an assembly as the conditional probability of the sequenced reads given the sequence output by the assembler, value maximized by the true genome sequence.

Methods for calculating LAP

Our framework is divided into the two categories, which contains the executables and scripts for computing the LAP using either alignment tools (aligner) or the dynamic programing methods (dynamic).

Aligner

When it is impractical to calculate exact probabilities for large sets of reads, we can approximate these probabilities using fast and memory-efficient alignment search programs, which internally model the sequencing process. We use Bowtie 2 (Langmead and Salzberg 2012) to align the reads to the assembly. However, it is easy to adapt our programs to use any read alignment tool that stores the alignment results in SAM(Li et al. 2009) format.

calc_prob.py

calc_prob.py is a python script that calculates the read probabilities using Bowtie 2 (version 2.0.2) with the –very-sensitive preset. The location of this version of Bowtie2 (or possibly a more recent version) must be stored in the shell variable $BT2_HOME. The output is a list of tab-delimited read_id, probability tuples.

Usage: single reads

$ALIGNER_HOME/calc_prob.py -a <assembly.fasta> -i <r.fasta>

Usage: mate pairs

$ALIGNER_HOME/calc_prob.py -a <assembly.fasta> -1 <m1.fasta> -2 <m2.fasta> −X <max_ins> −I <min_ins> −o <orient> −m <ins_size> −t <std_dev>

Options

-a <assembly.fasta>

The file of the assembly you wish to evaluate.

-1 <m1>

Comma-separated list of files containing mate 1s (filename usually includes _1), e.g. -1 flyA_1.fq,flyB_1.fq. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in <m2>.

-2 <m2>

Comma-separated list of files containing mate 2s (filename usually includes _2), e.g. -2 flyA_2.fq,flyB_2.fq. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in <m1>.

-i <r>

Comma-separated list of files containing unpaired reads to be aligned, e.g. lane1.fq,lane2.fq,lane3.fq,lane4.fq. Reads may be a mix of different lengths. If - is specified, bowtie2 gets the reads from the "standard in" or "stdin" filehandle.

-X <r>

The maximum fragment length for valid paired-end alignments. Please see Bowtie 2 manual.

-I <r>

The minimum fragment length for valid paired-end alignments. Please see Bowtie 2 manual.

-o <r>

Comma-separated list of the upstream/downstream mate orientations for a valid paired-end alignment against the forward reference strand. E.g., if fr is specified and there is a candidate paired-end alignment where mate 1 appears upstream of the reverse complement of mate 2 and the fragment length constraints Please see Bowtie 2 manual for more information on the orientation.

-m <r>

Comma-separated list of insert sizes.

-t <r>

Comma-separated list of insert size standard deviation. If standard deviation is not known, use 10% of the average insert size.

-p <r>

Number of threads to use.

-n <file.abun>

Contig abundance file in the format of: contig_id \t cvg, used for comparing metagenomic assemblies.

A more detailed list of additional parameters coming soon...

sum_prob.py

The output read probabilities are piped into the python script sum_prob.py, which takes in a list of read probabilities, to calculate the LAP using a threshold probability for unaligned reads:

cat output.prob | $ALIGNER_HOME/sum_prob.py -t "1e-30"

Dynamic

For large, highly repetitive genomes, Bowtie2 may be unable to align a few reads to all locations in the assemblies. Thus, our framework includes a dynamic programming method for computing the LAP.

mprobability

Usage: single reads

cat input.fast[aq] | $DYNAMIC_HOME/mprobability −a reference.fasta -f fast[aq]

Usage: mate pairs

cat stitched_reads.fast[aq] | $DYNAMIC_HOME/mprobability −a reference.fasta -f fast[aq] --use-mates --separation-mean=<sm> --separation-std-dev=<st>

Options

-a <assembly.fasta>

The file of the assembly you wish to evaluate.

-f <fasta>, --reads-format=<m1>

The format ('fasta' or 'fastq') of the file containing the reads (default=fastq).

--use-mates

Calculate mate-pair probability.

--separation-mean=<LENGTH>

The mean length of the separation between the paired reads.

--separation-std-dev=<LENGTH>

The standard deviation of the mean length of the separation.

--print-headers

Print the header of each read before its probability

-p <THREADS>, --num-threads=<THREADS>

Number of threads.

-accuracy=<PROBABILITY>

The probability that a base is read accurately (default=0.98).

mean

The output read probabilities can either be piped into the python script $ALIGNER/sum_prob.py, or into mean which takes in a list of read probabilities, to calculate the LAP using a threshold probability for unaligned reads:

cat output.prob | $DYNAMIC_HOME/mean -t "1e-30"

Getting started with LAP

Evaluating E. Coli MG1655 assemblies

Here we evaluate E. Coli MG1655 assemblies of Illumina MiSeq sequences using our LAP framework. Download and extract the sequences: Read 1 (FASTQ) and Read 2 (FASTQ). In addition, download our sample assemblies, which contains the reference genome and a SOAPdenovo assembly.

First, let's calculate the probability of the reads given the reference genome. (Note: this takes approximately four hours on a single core).

$ALIGNER_HOME/calc_prob.py -p 8 -a tutorial/e_coli_MG1655.fasta -q -1 MiSeq_Ecoli_MG1655_110721_PF_R1.fastq -2 MiSeq_Ecoli_MG1655_110721_PF_R2.fastq -X 500 -I 0 -o fr -m 300 -t 30 > e_coli_reference.prob

Once we have calculated the read probabilities, we calculate the LAP:

$ALIGNER_HOME/sum_prob.py -i e_coli_reference.prob

-14.363667413 0.00375997783309

Now, lets compare the LAP score of our SOAPdenovo assembly. Perform the same steps as above, but replace it with the SOAPdenovo assembly:

$ALIGNER_HOME/calc_prob.py -p 8 -a tutorial/e_coli_SOAPdenovo.fasta -q -1 MiSeq_Ecoli_MG1655_110721_PF_R1.fastq -2 MiSeq_Ecoli_MG1655_110721_PF_R2.fastq -X 500 -I 0 -o fr -m 300 -t 30 > e_coli_SOAPdenovo.prob

$ALIGNER_HOME/sum_prob.py -i e_coli_SOAPdenovo.prob

-14.4367803065 0.00375997783309

As expected, you can see that the LAP score of the reference (-14.364) is larger than the LAP score of our SOAPdenovo assembly (-14.437).

Instead of using the complete read set to determine the LAP, we can use a subset of the reads. Let's try generating a 100,000 read sample:

$ALIGNER_HOME/gen_rand_sample.py -1 MiSeq_Ecoli_MG1655_110721_PF_R1.fastq -2 MiSeq_Ecoli_MG1655_110721_PF_R2.fastq -k 100000

The reads will be written by default to ./[SAMPLE_SIZE]/0_[12].fastq. Now, lets re-compute the LAP scores of both assemblies. Perform the same steps as above, but replace the reads with the newly created sample reads:

$ALIGNER_HOME/calc_prob.py -p 8 -a tutorial/e_coli_MG1655.fasta -q -1 100000/0_1.fastq -2 100000/0_2.fastq -X 500 -I 0 -o fr -m 300 -t 30 > e_coli_reference.prob

$ALIGNER_HOME/calc_prob.py -p 8 -a tutorial/e_coli_SOAPdenovo.fasta -q -1 100000/0_1.fastq -2 100000/0_2.fastq -X 500 -I 0 -o fr -m 300 -t 30 > e_coli_SOAPdenovo.prob

$ALIGNER_HOME/sum_prob.py -i e_coli_reference.prob

-14.3700277661 0.0284604989415

$ALIGNER_HOME/sum_prob.py -i e_coli_SOAPdenovo.prob

-14.4457980897 0.0284604989415

Using less than 2% of the original reads, we are able to show the reference is the more likely assembly.

GAGE analysis

Here we provide the steps to re-create our GAGE assembly evaluation. Download gage_scripts.zip, which contains the necessary scripts to download the GAGE data and run LAP.

download_gage_data.sh downloads and extracts the Assembly.tgz and Data.original.tgz for each organism: Staphylococcus auereus, Rhodobacter spaeroides, and Human chromosome 14.

Next, the users can simply run ./run_calc_gage_probs.sh which contains the command line to call a helper script to calculate the LAP for each of the assembly (IMPORTANT: calc_prob.py and sum_prob.py must be in the same directory as the scripts):

# Run staph for mates, union AND contigs, scaffolds.
./calc_asm_probs.sh s m ctg Allpaths-LG,SOAPdenovo,Velvet,ABySS,SGA,Bambus2,MSR-CA,close,close2,truth 10
./calc_asm_probs.sh s u ctg Allpaths-LG,SOAPdenovo,Velvet,ABySS,SGA,Bambus2,MSR-CA,close,close2,truth 10
...
./sum_asm_probs.sh s m ctg "Allpaths-LG,SOAPdenovo,Velvet,ABySS,SGA,Bambus2,MSR-CA,close,close2,truth" "1e-30" >> bact_results
./sum_asm_probs.sh s u ctg "Allpaths-LG,SOAPdenovo,Velvet,ABySS,SGA,Bambus2,MSR-CA,close,close2,truth" "1e-30" >> bact_results

The results will be available in a file called bact_results:

s_aureus ctg mates Allpaths-LG -23.9740731239 0.00969440802919
s_aureus ctg mates SOAPdenovo -23.8918983037 0.00969440802919
s_aureus ctg mates Velvet -24.2581637228 0.00969440802919
s_aureus ctg mates ABySS -24.6920227231 0.00969440802919

Assemblathon1 analysis

Here we provide the steps to re-create the Assemblathon1 assembly evaluation. Download asm1_scripts.zip, which contains the necessary scripts to download the Assemblathon1 data and run LAP.

First, run ./download_asm1_data.sh to get the sequences and Assemblathon1 assemblies (which have had their E. coli contaminated regions removed). Next, simply run the ./run_calc_asm1_prob.sh which will calculate the LAP on a subset of the 10k insert library.

Best practices

Bacterial genomes

For bacterial genomes that are a few megabases in size, the ideal method for calculating LAP is using the aligner method (with Bowtie2). If there are only a few million reads, then it is safe to calculate the LAP using the complete read set.

Eukaryotic genomes

There are a few things to take into consideration when evaluating eukaryotic genomes. If Bowtie2 is able to successfully find all alignments, then the aligner method should be the used over the dynamic programming method. However, in our Assemblathon1 evaluation, we were unable to use Bowtie2 successfully. If Bowtie2 can't be used, then users should give the aligner a SAM file (-s option) produced from another read alignment tool more suited for finding all alignments (e.g., mrsFAST). If all else fails, then use our dynamic programming method.

Depending how different the assemblies are, users can use a subset of the reads to calculate the LAP. In our experience, most eukaroytic assemblies are different enough that the LAP can be calculated on a small fraction of the data set.