:::: MENU ::::
Posts tagged with: RAD Sequencing

Finding the expected radIIB fragments from a reference genome

The next step in our analyses of RAD IIB data was to determine the total number of sites in the reference genome that would be recognized by the restriction enzyme. Moreover, we wanted to extact the sequences and positions of the expected fragments along the chromosome and store this information in a fastA file. This because mapping against the expected sites only would be faster and it makes sense since those are the only sites that we can compare. Following through with awk and examples of loops found on the web we came up with this short code:

#!/usr/bin/awk

{
  if ($1~”>”)
    name=$1
  else
    sum=0;
    while (match($0, /.{10}GCA.{6}TGC.{12}/))
    {
      printf name”_”sum+RSTART”\n”substr($0, RSTART, RLENGTH)”\n”;
      $0=substr($0, RSTART+RLENGTH);
      sum+=RSTART+RLENGTH-1
    }
}

In the while loop we go through the line representing a chromosome, searching for matches to our regular expression. When a match is found, we extract its sequence and record its position along the chromosome. Then, we update the chromosome string to start at the end of the previously matched site. And also, update the variable sum that keeps track how far down the chromosome we are. This last bit was important because we wanted to print the start position of the extracted fragments in the names of the newly generated fastA file. Note that for this script to work, each chromosome needs to be represented by a single line in the input fasta file. We converted the wrapped fasta files to one line per chromosome using (awk again):

awk ‘{if ($1 !~ “>”) printf”%s”, $0; else print $0}’


Modifying the ‘extract radIIB tags’ script to work with fastQ files

The short awk script we wrote to extract the radIIB tags from longer sequences (previous post) has one major shortcoming. Namely, it works only on fastA files that lack information on the quality of the base calls. This is suboptimal for two reasons. First, we have to filter the fastQ file output by the sequencer to remove any low quality reads. Then, in a another unnecessary step, convert the fastQ to a fastA file so the awk script can read it. The second, and more important shortcoming, has to do with the downstream application of the produced file. Most short read mappers (e.g. bowtie, bwa, shrimp) use the information stored in the quality string to calculate mapping scores. Ideally therefore, we would want to use a fastQ input file for mapping to the reference genome. So our script had to be modified to read and output a fastQ file.

It turned out that this wasn’t such a daunting task. All we needed to do was to print the second next line in the file after the nucleotide sequence. Making sure of course to output only those characters from the quality string that corresponded to the start and end of the matched regular expression. The modified code looks like this and is similarly fast:

#!/usr/bin/awk

{
  if ($1~ “@HWI”)
    name=substr($1,27)
  else {
    where=match($0, /.{10}GCA.{6}TGC.{12}/)
    if (where)
      print “@”name”\n”substr($0,RSTART,RLENGTH)”\n+”
      getline
      getline
      print substr($0,RSTART,RLENGTH)
  }
}

Of course, in both scripts, the regular expression we want to match (and other parts of the code) are hard-coded and would have to be manually adjusted to work with of fastA/Q files with differently formatted parts (e.g. read names).


Pages:12