Canu: Genome Assembly for PacBio and Nanopore Long-Reads (Detailed Guide)
What is Canu?
Long-read sequencing with Pacific Biosciences (PacBio) or Oxford Nanopore technologies has revolutionized the generation of reference-quality genomes, especially for large and repetitive genomes.
Canu (the successor of Celera Assembler) is a hierarchical de novo assembler for single-molecule sequences that produces more contiguous assemblies of large genomes. It is well suited to PacBio and Oxford Nanopore long reads because it is designed to handle their relatively high error rate.
Compared with Celera Assembler, Canu runs faster, requires lower sequencing coverage, and works better for genomes with large repeats.
A Canu assembly pipeline has three main stages: correcting errors in the raw reads (correction), trimming the corrected reads (trimming), and assembling the trimmed, corrected reads into contigs (assembly).
A minimum coverage of 30x to 60x is recommended for eukaryotic genomes. Assemblies generally improve with higher coverage.
You can also combine a Canu assembly with short reads to generate a high-quality finished assembly.
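To check whether a read set meets the coverage recommendation above, you can divide the total number of sequenced bases by the haploid genome size. The following is a minimal sketch, not part of Canu: the function name estimate_coverage and the file name in the usage comment are hypothetical, and it assumes an uncompressed FASTQ file with exactly four lines per record.

```shell
#!/usr/bin/env bash
# estimate_coverage FASTQ GENOME_SIZE_BP
# Prints (total bases / genome size), rounded to one decimal place.
# Assumption: plain FASTQ with four lines per record
# (the sequence is line 2 of each record, never wrapped).
estimate_coverage() {
    local fastq="$1" genome_size_bp="$2"
    awk -v g="$genome_size_bp" '
        NR % 4 == 2 { bases += length($0) }   # sum the sequence lines only
        END { printf "%.1f\n", bases / g }
    ' "$fastq"
}

# Example (hypothetical file name; 523 Mbp genome):
#   estimate_coverage pacbio.fastq 523000000
```

A result well below 30 for a eukaryotic genome suggests the data set is below the recommended minimum coverage.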
Getting started with Canu
This tutorial explains the computational requirements for Canu, how to download and install it, and how to assemble long reads (PacBio and Nanopore) with it.
Computational requirements for Canu
Canu automatically detects the available resources (memory and cores) on your computer for starting the assembly process. If you have insufficient resources, you may get memory errors.
You can assemble a bacterial genome with 8 GB of memory and 8 cores. To assemble larger eukaryotic genomes, such as human or other mammalian genomes, you may need at least 64 GB of memory and sufficient disk space (3 TB). If the genome is highly repetitive, you may need more disk space.
Tip: Consider using an HPC cluster for assembling large eukaryotic genomes. Bacterial genomes can be assembled on desktop or laptop computers.
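Before launching a large assembly, it can help to confirm that the machine meets the thresholds mentioned above. The sketch below assumes a Linux system (it reads /proc/meminfo and calls nproc); check_resources is a hypothetical helper, not a Canu command. The closing comment shows Canu's maxMemory and maxThreads options, which cap the resources Canu will use.

```shell
#!/usr/bin/env bash
# check_resources MIN_MEM_GB MIN_CORES
# Returns 0 if this machine has at least the requested memory and cores.
# Assumption: Linux with /proc/meminfo and the coreutils nproc command.
check_resources() {
    local min_mem_gb="$1" min_cores="$2"
    local mem_kb mem_gb cores
    mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    mem_gb=$(( mem_kb / 1024 / 1024 ))   # integer GB, rounded down
    cores=$(nproc)
    echo "Detected ${mem_gb} GB memory and ${cores} cores" >&2
    [ "$mem_gb" -ge "$min_mem_gb" ] && [ "$cores" -ge "$min_cores" ]
}

# Example: check the bacterial-genome thresholds, then cap Canu's usage:
#   check_resources 8 8 && \
#     canu -p banana -d banana_pacbio_out genomeSize=523m maxMemory=8 maxThreads=8 -pacbio pacbio.fastq
```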
How to download and install Canu
The easiest way to install Canu is to download the pre-compiled binaries. You can download and set them up as follows:
# download for Linux
curl -L https://github.com/marbl/canu/releases/download/v2.2/canu-2.2.Linux-amd64.tar.xz --output canu-2.2.Linux.tar.xz
# extract
tar -xJf canu-2.2.Linux.tar.xz
# add binaries to PATH (adjust the path to where you extracted Canu)
export PATH=$PATH:/home/renesh/software/canu-2.2/bin
# check canu version
canu -version
canu 2.2
Once the binaries are added to the PATH, you should be able to see the complete usage with the canu -h command.
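If the canu command is not found, the PATH entry is usually the cause. A minimal sketch for checking this from a script is shown below; require_tool is a hypothetical helper name, and it relies only on the shell's built-in command -v.

```shell
#!/usr/bin/env bash
# require_tool NAME
# Succeeds and prints the resolved location if NAME is on PATH;
# otherwise prints an error to stderr and returns 1.
require_tool() {
    local tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found $tool at $(command -v "$tool")"
    else
        echo "error: $tool not found on PATH" >&2
        return 1
    fi
}

# Example:
#   require_tool canu && canu -version
```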
Assemble PacBio reads
We will use banana PacBio reads for the assembly. PacBio input reads should be in FASTQ or FASTA format. I have not shared the FASTQ file due to its large size.
By default, Canu corrects, trims, and assembles the reads into contigs.
You can use the following command to generate the Canu assembly:
canu -p banana -d banana_pacbio_out genomeSize=523m -pacbio pacbio.fastq
Where,
Parameter | Description
---|---
-p | Assembly prefix. The output files will have this prefix.
-d | Output directory in which to save the assembly files.
genomeSize | Haploid genome size. 523m means 523 Mbp; use g for Gbp and k for Kbp. If you do not know the exact genome size, use the best approximate value. Canu needs this to estimate the coverage of the input sequence data.
-pacbio | Long-read sequencing technology of the input reads (here, PacBio), followed by the read file.
You can also pass other Canu parameters to adjust memory, coverage, and error rates. See the Canu documentation for the full list of parameters.
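The three stages can also be run one at a time, which is useful when you want to inspect the corrected or trimmed reads before assembling. The sketch below uses the -correct, -trim, and -assemble options and the -corrected and -trimmed read modifiers listed in the canu -h usage for Canu 2.x; treat the exact invocations as an assumption and check them against your installed version. The run wrapper, the DRY_RUN variable, the canu_staged function, and the separate assembly directory name are hypothetical choices made for illustration.

```shell
#!/usr/bin/env bash
# run CMD...: execute the command, or only print it when DRY_RUN=1
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# canu_staged PREFIX OUTDIR GENOME_SIZE READS
# Runs correction, trimming, and assembly as three separate Canu invocations.
canu_staged() {
    local prefix="$1" outdir="$2" gsize="$3" reads="$4"

    # Stage 1: correct the raw PacBio reads
    run canu -correct -p "$prefix" -d "$outdir" genomeSize="$gsize" \
        -pacbio "$reads" || return 1

    # Stage 2: trim the corrected reads
    run canu -trim -p "$prefix" -d "$outdir" genomeSize="$gsize" \
        -corrected -pacbio "${outdir}/${prefix}.correctedReads.fasta.gz" || return 1

    # Stage 3: assemble the trimmed, corrected reads in a separate directory
    run canu -assemble -p "$prefix" -d "${outdir}-assembly" genomeSize="$gsize" \
        -trimmed -corrected -pacbio "${outdir}/${prefix}.trimmedReads.fasta.gz"
}

# Example: print the three commands without running them
#   DRY_RUN=1 canu_staged banana banana_pacbio_out 523m pacbio.fastq
```

Printing the commands first (DRY_RUN=1) is a cheap way to review the file paths before committing to a long-running job.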
Once Canu has finished assembling the PacBio reads, you should find the following files in the output directory.
Files | Description
---|---
banana.report | Detailed analysis report. It includes histograms of read lengths and k-mers, and summaries of the corrected data, the overlaps, and the contig lengths.
banana.correctedReads.fasta.gz | Reads after correction
banana.trimmedReads.fasta.gz | Corrected reads after overlap-based trimming
banana.contigs.fasta | Full assembly of contigs
banana.unassembled.fasta | Unassembled reads and low-coverage contigs
See the Canu documentation for detailed information about the output files.
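A quick way to judge the contigs file is to count the contigs and compute the total length and N50 (the contig length at which the length-sorted contigs reach half of the total assembly size). The following is a minimal sketch using awk and sort; assembly_stats is a hypothetical helper, not a Canu tool, and it assumes an uncompressed FASTA file.

```shell
#!/usr/bin/env bash
# assembly_stats FASTA
# Prints the number of contigs, the total length in bp, and the N50.
# Assumption: uncompressed FASTA; sequences may span several lines.
assembly_stats() {
    local fasta="$1"
    # Step 1: print one length per contig
    awk '
        /^>/ { if (len) print len; len = 0; next }
        { len += length($0) }
        END { if (len) print len }
    ' "$fasta" |
    # Step 2: sort lengths from longest to shortest
    sort -rn |
    # Step 3: walk down the sorted lengths until half the total is covered
    awk '
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2; running = 0; n50 = 0
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running >= half) { n50 = lens[i]; break }
            }
            printf "contigs=%d total_bp=%d n50=%d\n", NR, total, n50
        }
    '
}

# Example:
#   assembly_stats banana_pacbio_out/banana.contigs.fasta
```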
Assemble Nanopore reads
We will use banana Nanopore reads as the example for this assembly. The input Nanopore reads should be in either FASTQ or FASTA format.
You can use the following command to generate the Canu assembly from Nanopore reads:
canu -p banana -d banana_nanopore_out genomeSize=523m -nanopore nanopore.fastq
Summary
In this article, you have learned how to use Canu for de novo genome assembly of PacBio and Nanopore long reads.
This work is licensed under a Creative Commons Attribution 4.0 International License