:::: MENU ::::
Posts tagged with: планета

Modifying the ‘extract radIIB tags’ script to work with fastQ files

The short awk script we wrote to extract the radIIB tags from longer sequences (previous post) has one major shortcoming. Namely, it works only on fastA files that lack information on the quality of the base calls. This is suboptimal for two reasons. First, we have to filter the fastQ file output by the sequencer to remove any low quality reads. Then, in a another unnecessary step, convert the fastQ to a fastA file so the awk script can read it. The second, and more important shortcoming, has to do with the downstream application of the produced file. Most short read mappers (e.g. bowtie, bwa, shrimp) use the information stored in the quality string to calculate mapping scores. Ideally therefore, we would want to use a fastQ input file for mapping to the reference genome. So our script had to be modified to read and output a fastQ file.

It turned out that this wasn’t such a daunting task. All we needed to do was to print the second next line in the file after the nucleotide sequence. Making sure of course to output only those characters from the quality string that corresponded to the start and end of the matched regular expression. The modified code looks like this and is similarly fast:


  if ($1~ “@HWI”)
  else {
    where=match($0, /.{10}GCA.{6}TGC.{12}/)
    if (where)
      print “@”name”\n”substr($0,RSTART,RLENGTH)”\n+”
      print substr($0,RSTART,RLENGTH)

Of course, in both scripts, the regular expression we want to match (and other parts of the code) are hard-coded and would have to be manually adjusted to work with of fastA/Q files with differently formatted parts (e.g. read names).

Extracting radIIB tags from a fasta file

With my lab mate, Mariska Brady, we started working on Restriction Site associated DNA (RAD) sequence data. The data was produced with a type IIB restriction enzyme that cuts several nucleotides up and down stream from the recognition site. The total size of the produced fragment is 34 nucleotides. The reads produced by the sequencer however are 101 bases and continue well inside the 3′ adapter sequence. This extra sequence needed to be removed. Also, due to some problems with the library construction (unknown reasons for now) not all reads had the required fragment in the same position. i.e. in most reads, the 34 meaningful bases started at the 3 nucleotide on the 5′ side, but not in all. Therefore, simply cutting 34 bases (characters) based on position was a bad option because of loss of data. To match only those 34 bases that represent the meaningful fragment, we turned to awk and regular expressions. The answer we came up with is pretty much straight from the awk manual:


{if($1~ “>”) name=substr($1,27); else {x=match($0, /.{10}GCA.{6}TGC.{12}/); if (x) print “>”name “\n” substr($0,RSTART,RLENGTH) } }

In this little script, we first find the lines of the fasta file that represent read names (first if statement) and store the informative part in a variable called ‘name’. Then we try to match the regular expression representing our wanted fragment to the following line (the nucleotide sequence). When a line matches the regex we store it in a variable ‘x’. In the second if statement we print the name and the subset of nucleotides corresponding to the restriction fragment. Only those reads that matched the regex will be reported.

We run this awk script like so:

awk –posix -f extract_rad2b_fromFasta.awk input > output

and it processed about 9 million reads in little over 3 minutes.