Site Map
Latest Release
LAP 1.1 | 9/3/13 |
Please cite: Ghodsi, M., Hill, C. M., Astrovskaya, I., Lin, H., Sommer, D. D., Koren, S., & Pop, M. De novo likelihood-based measures for comparing genome assemblies." BMC research notes 6 (2013): 334. |
Related Tools
Coming soon... |
Related publications
- Coming soon...
Authors
- Mohammadreza Ghodsi
- Christopher Hill
- Irina Astrovskaya
- Henry Lin
- Dan Sommer
- Sergey Koren
- Mihai Pop
Other Documentation
- OMICS Day, 5/12 (.pdf)
Links
Table of Contents
Version 1.1
- Introduction
- Methods for calculating LAP
- Getting started with LAP
- GAGE analysis
- Assemblathon1 analysis
Introduction
What is LAP?
LAP (log average probability) is a de novo probabilistic measure of assembly quality which allows an objective comparison of multiple assemblies generated from a same set of reads. Specifically, we define the quality of an assembly as the conditional probability of the sequenced reads given the sequence output by the assembler, value maximized by the true genome sequence.
Methods for calculating LAP
Our framework is divided into the two categories, which contains the executables and scripts for computing the LAP using either alignment tools (aligner) or the dynamic programing methods (dynamic).
Aligner
When it is impractical to calculate exact probabilities for large sets of reads, we can approximate these probabilities using fast and memory-efficient alignment search programs, which internally model the sequencing process. We use Bowtie 2 (Langmead and Salzberg 2012) to align the reads to the assembly. However, it is easy to adapt our programs to use any read alignment tool that stores the alignment results in SAM(Li et al. 2009) format.
calc_prob.py
calc_prob.py
is a python script that calculates the read probabilities using Bowtie 2 (version 2.0.2) with the –very-sensitive preset. The location of this version of Bowtie2 (or possibly a more recent version) must be stored in the shell variable $BT2_HOME
. The output is a list of tab-delimited read_id, probability tuples.
Usage: single reads
$ALIGNER_HOME/calc_prob.py -a <assembly.fasta> -i <r.fasta>
Usage: mate pairs
$ALIGNER_HOME/calc_prob.py -a <assembly.fasta> -1 <m1.fasta> -2 <m2.fasta> −X <max_ins> −I <min_ins> −o <orient> −m <ins_size> −t <std_dev>
Options
|
The file of the assembly you wish to evaluate. |
|
Comma-separated list of files containing mate 1s (filename usually includes |
|
Comma-separated list of files containing mate 2s (filename usually includes |
|
Comma-separated list of files containing unpaired reads to be aligned, e.g. |
|
The maximum fragment length for valid paired-end alignments. Please see Bowtie 2 manual. |
|
The minimum fragment length for valid paired-end alignments. Please see Bowtie 2 manual. |
|
Comma-separated list of the upstream/downstream mate orientations for a valid paired-end alignment against the forward reference strand. E.g., if |
|
Comma-separated list of insert sizes. |
|
Comma-separated list of insert size standard deviation. If standard deviation is not known, use 10% of the average insert size. |
|
Number of threads to use. |
|
Contig abundance file in the format of: contig_id \t cvg, used for comparing metagenomic assemblies. |
A more detailed list of additional parameters coming soon...
sum_prob.py
The output read probabilities are piped into the python script sum_prob.py
, which takes in a list of read probabilities, to calculate the LAP using a threshold probability for unaligned reads:
cat output.prob | $ALIGNER_HOME/sum_prob.py -t "1e-30"
Dynamic
For large, highly repetitive genomes, Bowtie2 may be unable to align a few reads to all locations in the assemblies. Thus, our framework includes a dynamic programming method for computing the LAP.
mprobability
Usage: single reads
cat input.fast[aq] | $DYNAMIC_HOME/mprobability −a reference.fasta -f fast[aq]
Usage: mate pairs
cat stitched_reads.fast[aq] | $DYNAMIC_HOME/mprobability −a reference.fasta -f fast[aq] --use-mates --separation-mean=<sm> --separation-std-dev=<st>
Options
|
The file of the assembly you wish to evaluate. |
|
The format ('fasta' or 'fastq') of the file containing the reads (default=fastq). |
|
Calculate mate-pair probability. |
|
The mean length of the separation between the paired reads. |
|
The standard deviation of the mean length of the separation. |
|
Print the header of each read before its probability |
|
Number of threads. |
|
The probability that a base is read accurately (default=0.98). |
mean
The output read probabilities can either be piped into the python script $ALIGNER/sum_prob.py
, or into mean
which takes in a list of read probabilities, to calculate the LAP using a threshold probability for unaligned reads:
cat output.prob | $DYNAMIC_HOME/mean -t "1e-30"
Getting started with LAP
Evaluating E. Coli MG1655 assemblies
Here we evaluate E. Coli MG1655 assemblies of Illumina MiSeq sequences using our LAP framework. Download and extract the sequences: Read 1 (FASTQ) and Read 2 (FASTQ). In addition, download our sample assemblies, which contains the reference genome and a SOAPdenovo assembly.
First, let's calculate the probability of the reads given the reference genome. (Note: this takes approximately four hours on a single core).
$ALIGNER_HOME/calc_prob.py -p 8 -a tutorial/e_coli_MG1655.fasta -q -1 MiSeq_Ecoli_MG1655_110721_PF_R1.fastq -2 MiSeq_Ecoli_MG1655_110721_PF_R2.fastq -X 500 -I 0 -o fr -m 300 -t 30 > e_coli_reference.prob
Once we have calculated the read probabilities, we calculate the LAP:
$ALIGNER_HOME/sum_prob.py -i e_coli_reference.prob
-14.363667413 0.00375997783309
Now, lets compare the LAP score of our SOAPdenovo assembly. Perform the same steps as above, but replace it with the SOAPdenovo assembly:
$ALIGNER_HOME/calc_prob.py -p 8 -a tutorial/e_coli_SOAPdenovo.fasta -q -1 MiSeq_Ecoli_MG1655_110721_PF_R1.fastq -2 MiSeq_Ecoli_MG1655_110721_PF_R2.fastq -X 500 -I 0 -o fr -m 300 -t 30 > e_coli_SOAPdenovo.prob
$ALIGNER_HOME/sum_prob.py -i e_coli_SOAPdenovo.prob
-14.4367803065 0.00375997783309
As expected, you can see that the LAP score of the reference (-14.364) is larger than the LAP score of our SOAPdenovo assembly (-14.437).
Instead of using the complete read set to determine the LAP, we can use a subset of the reads. Let's try generating a 100,000 read sample:
$ALIGNER_HOME/gen_rand_sample.py -1 MiSeq_Ecoli_MG1655_110721_PF_R1.fastq -2 MiSeq_Ecoli_MG1655_110721_PF_R2.fastq -k 100000
The reads will be written by default to ./[SAMPLE_SIZE]/0_[12].fastq
. Now, lets re-compute the LAP scores of both assemblies. Perform the same steps as above, but replace the reads with the newly created sample reads:
$ALIGNER_HOME/calc_prob.py -p 8 -a tutorial/e_coli_MG1655.fasta -q -1 100000/0_1.fastq -2 100000/0_2.fastq -X 500 -I 0 -o fr -m 300 -t 30 > e_coli_reference.prob $ALIGNER_HOME/calc_prob.py -p 8 -a tutorial/e_coli_SOAPdenovo.fasta -q -1 100000/0_1.fastq -2 100000/0_2.fastq -X 500 -I 0 -o fr -m 300 -t 30 > e_coli_SOAPdenovo.prob
$ALIGNER_HOME/sum_prob.py -i e_coli_reference.prob
-14.3700277661 0.0284604989415
$ALIGNER_HOME/sum_prob.py -i e_coli_SOAPdenovo.prob
-14.4457980897 0.0284604989415
Using less than 2% of the original reads, we are able to show the reference is the more likely assembly.
GAGE analysis
Here we provide the steps to re-create our GAGE assembly evaluation. Download gage_scripts.zip, which contains the necessary scripts to download the GAGE data and run LAP.
download_gage_data.sh downloads and extracts the Assembly.tgz and Data.original.tgz for each organism: Staphylococcus auereus, Rhodobacter spaeroides, and Human chromosome 14.
Next, the users can simply run ./run_calc_gage_probs.sh which contains the command line to call a helper script to calculate the LAP for each of the assembly (IMPORTANT: calc_prob.py and sum_prob.py must be in the same directory as the scripts):
# Run staph for mates, union AND contigs, scaffolds. ./calc_asm_probs.sh s m ctg Allpaths-LG,SOAPdenovo,Velvet,ABySS,SGA,Bambus2,MSR-CA,close,close2,truth 10 ./calc_asm_probs.sh s u ctg Allpaths-LG,SOAPdenovo,Velvet,ABySS,SGA,Bambus2,MSR-CA,close,close2,truth 10 ... ./sum_asm_probs.sh s m ctg "Allpaths-LG,SOAPdenovo,Velvet,ABySS,SGA,Bambus2,MSR-CA,close,close2,truth" "1e-30" >> bact_results ./sum_asm_probs.sh s u ctg "Allpaths-LG,SOAPdenovo,Velvet,ABySS,SGA,Bambus2,MSR-CA,close,close2,truth" "1e-30" >> bact_results
The results will be available in a file called bact_results:
s_aureus ctg mates Allpaths-LG -23.9740731239 0.00969440802919 s_aureus ctg mates SOAPdenovo -23.8918983037 0.00969440802919 s_aureus ctg mates Velvet -24.2581637228 0.00969440802919 s_aureus ctg mates ABySS -24.6920227231 0.00969440802919
Assemblathon1 analysis
Here we provide the steps to re-create the Assemblathon1 assembly evaluation. Download asm1_scripts.zip, which contains the necessary scripts to download the Assemblathon1 data and run LAP.
First, run ./download_asm1_data.sh to get the sequences and Assemblathon1 assemblies (which have had their E. coli contaminated regions removed). Next, simply run the ./run_calc_asm1_prob.sh which will calculate the LAP on a subset of the 10k insert library.
Best practices
Bacterial genomes
For bacterial genomes that are a few megabases in size, the ideal method for calculating LAP is using the aligner method (with Bowtie2). If there are only a few million reads, then it is safe to calculate the LAP using the complete read set.
Eukaryotic genomes
There are a few things to take into consideration when evaluating eukaryotic genomes. If Bowtie2 is able to successfully find all alignments, then the aligner method should be the used over the dynamic programming method. However, in our Assemblathon1 evaluation, we were unable to use Bowtie2 successfully. If Bowtie2 can't be used, then users should give the aligner a SAM file (-s option) produced from another read alignment tool more suited for finding all alignments (e.g., mrsFAST). If all else fails, then use our dynamic programming method.
Depending how different the assemblies are, users can use a subset of the reads to calculate the LAP. In our experience, most eukaroytic assemblies are different enough that the LAP can be calculated on a small fraction of the data set.