New Quality Control Program Optimized for Long Read Nucleotide Sequencing

Contributed by Cory Schlesener, B.S.

High-throughput DNA sequencing generates errors in its reads, such as low-confidence nucleotide base calls. Quality control (QC) is needed to evaluate the quality of the sequencing output and to identify undesired features. The program FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc) has become one of the most popular tools for assessing the quality of short-read sequencing output (e.g., Illumina sequencing). However, it is not optimized for newer long-read sequencing technologies (e.g., PacBio and Nanopore sequencing), which can have higher error rates and use non-standard formats for recording raw sequence data. Long-read output needs metrics tailored to it, because both the structure of individual reads and the makeup of the read population differ substantially from the massive populations of short reads produced by older technologies. A useful new program, LongQC (https://github.com/yfukasawa/LongQC), provides an alternative QC analysis optimized for long reads. Its assessment covers general statistics, read length, read coverage, GC content, sequence complexity, and sequence error estimation.
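
To make a few of these metrics concrete, here is a minimal Python sketch that computes read count, mean read length, N50, mean GC fraction, and a quality-score-based error estimate from FASTQ records. It is only an illustration and is not how LongQC itself works. The function names (`parse_fastq`, `gc_fraction`, `mean_error_probability`, `n50`, `summarize`) are made up for this example, and the code assumes plain four-line FASTQ records with Phred+33 quality encoding.

```python
from statistics import mean


def parse_fastq(lines):
    """Yield (sequence, quality) pairs, assuming strict 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield seq, qual


def gc_fraction(seq):
    """Fraction of bases in the read that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def mean_error_probability(qual, offset=33):
    """Mean per-base error probability; Phred Q = -10*log10(p) (Phred+33 assumed)."""
    return mean(10 ** (-(ord(c) - offset) / 10) for c in qual)


def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(lines):
    """Collect simple per-run statistics from an iterable of FASTQ lines."""
    reads = [(s, q) for s, q in parse_fastq(lines) if s]
    if not reads:
        return {"reads": 0}
    lengths = [len(s) for s, _ in reads]
    return {
        "reads": len(reads),
        "mean_length": mean(lengths),
        "n50": n50(lengths),
        "mean_gc": mean(gc_fraction(s) for s, _ in reads),
        "mean_error": mean(mean_error_probability(q) for _, q in reads),
    }


if __name__ == "__main__":
    # Two toy reads, invented for illustration only.
    example = ["@r1", "GGCC", "+", "IIII", "@r2", "ATATATAT", "+", "55555555"]
    print(summarize(example))
```

For a real file, the same function can be fed an open file handle, for example `summarize(open("reads.fastq"))`, where the file name is a placeholder. Quality strings from long-read platforms do not always reflect the true error rate, so the error estimate here should be read as a rough indicator at best.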

Fukasawa, Y., Ermini, L., Wang, H., Carty, K. and Cheung, M.S., 2020. LongQC: A Quality Control Tool for Third Generation Sequencing Long Read Data. G3: Genes, Genomes, Genetics, 10(4), pp.1193-1196. https://doi.org/10.1534/g3.119.400864
