The short awk script we wrote to extract the radIIB tags from longer sequences (previous post) has one major shortcoming. Namely, it works only on fastA files that lack information on the quality of the base calls. This is suboptimal for two reasons. First, we have to filter the fastQ file output by the sequencer to remove any low quality reads. Then, in a another unnecessary step, convert the fastQ to a fastA file so the awk script can read it. The second, and more important shortcoming, has to do with the downstream application of the produced file. Most short read mappers (e.g. bowtie, bwa, shrimp) use the information stored in the quality string to calculate mapping scores. Ideally therefore, we would want to use a fastQ input file for mapping to the reference genome. So our script had to be modified to read and output a fastQ file.
It turned out that this wasn’t such a daunting task. All we needed to do was to print the second next line in the file after the nucleotide sequence. Making sure of course to output only those characters from the quality string that corresponded to the start and end of the matched regular expression. The modified code looks like this and is similarly fast:
if ($1~ “@HWI”)
Of course, in both scripts, the regular expression we want to match (and other parts of the code) are hard-coded and would have to be manually adjusted to work with of fastA/Q files with differently formatted parts (e.g. read names).