:::: MENU ::::
Browsing posts in: Little scripts for formatting

Wrapped fastq problem

Recently I started working on some denovo assemblies of Illumina sequences. My approach starts with mapping to a known contaminant genome and storing reads that didn’t map to the contaminant in a fastq file. I do this to reduce memory and time requirements for the assembly. Unfortunatelly, for some reason, the mapper produced a fastq file that had sequences and qualities wrapped on a second line after the 79th character. This threw of the assemblers I tried to use – they inferred that there are more sequences in the file than there actually were.

I spent a better part of the morning trying to come up with a way to unwrap the fastq file (not very programming smart). The file looked like this:

@1101_1131_2245
CCTCATCAGTGCGAGTGTAGTAGTCAAGGTTTAGAACATGGGATAATCCATGTGTCCTAGCAGTTGCTTTGAGCTGACG
ATGCCATTGTTTCCATTTGTTG
+
@@@FFFFFHFHFH@G@EGIJGGEIEDEHIEEHDCFHGIIGIJGHIJJGIIIIIJGHIBHGIIGAFHIGGIJ=HIJEHHH
;BDCECCDEAACC@>CCC;AA#

And here is the solution I came up with:

awk ‘{if ( $1 ~ /.*_.*_.*/ || $1 ~ /^+$/) print “\n”$1; else printf”%s”, $0}’ | awk ‘NF’

The first condition has a regular expression that recognizes the fastq tag, and I will need to change it for fastq files with differently formatted tags. The last bit after the pipe is to remove any blank lines in the output.

Unfortunatelly this oneliner is not that fast. It took 23 minutes to convert a file of 18 milion reads.

Not perfect but it did the job…


Pages:123