Sequence comparative analysis using networks: Software for evaluating de novo transcript assembly from next-generation sequencing

DNA sequencing technology is becoming accessible to a wider range of researchers as costs continue to decline. Many researchers are now sequencing novel transcriptomes, and because most of these data sets lack a reference genome, they must rely on de novo assemblers. Comparing assemblies can be difficult: each program has its own strengths and weaknesses, and no tool exists to evaluate these data sets comparatively. We developed software in R, called Sequence Comparative Analysis using Networks (SCAN), to perform statistical comparisons between distinct assemblies. SCAN uses a reference data set to identify the most accurate de novo assembly and the "good" transcripts in the user's data. We tested SCAN on three publicly available transcriptomes, each assembled using three assembly programs. In addition, we sequenced the transcriptome of the oomycete Achlya hypogyna and compared de novo assemblies from the Velvet, ABySS, and CLC Genomics Workbench assembly algorithms. Of the CLC transcripts, 1,128 were statistically similar to the reference, compared with 49 of the Velvet transcripts and 937 of the ABySS transcripts. SCAN's strength is providing statistical support for transcript assemblies in a biological context. However, because SCAN is designed to compare distinct node sets in networks, it can also be extended easily to perform statistical comparisons on any network graph, regardless of what the nodes represent. © The Author 2013.
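
To make the idea of a statistical comparison between distinct node sets in a network concrete, the following is a minimal sketch in Python. It is not SCAN's implementation: SCAN is written in R, and the abstract does not state which test statistic or network construction it uses. The sketch assumes a toy undirected graph in which an edge links a transcript to a reference sequence it resembles, and it assumes a permutation test on mean node degree as the comparison. The function names `degrees` and `permutation_test`, and all node labels, are hypothetical.

```python
import random
from statistics import mean


def degrees(edges):
    """Count how many edges touch each node in an undirected edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def permutation_test(set_a, set_b, deg, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in mean degree
    between two disjoint node sets. Returns (observed_diff, p_value)."""
    a = [deg.get(n, 0) for n in set_a]
    b = [deg.get(n, 0) for n in set_b]
    observed = mean(a) - mean(b)
    pooled = a + b
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)


# Toy data (illustrative only): transcripts from two hypothetical
# assemblies, linked to the reference sequences they resemble.
refs = ["r1", "r2", "r3"]
assembly_a = ["a1", "a2", "a3", "a4"]
assembly_b = ["b1", "b2", "b3", "b4"]
edges = [(t, r) for t in assembly_a for r in refs]
edges += [(t, "r1") for t in assembly_b]

deg = degrees(edges)
observed, p_value = permutation_test(assembly_a, assembly_b, deg)
print(observed, p_value)
```

Here a small p-value would suggest that the two node sets differ in how densely they connect to the reference. Any other per-node or per-set statistic could be substituted for degree without changing the structure of the test.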

Publication Title, e.g., Journal

Molecular Biology and Evolution