:::: MENU ::::

Finding the expected radIIB fragments from a reference genome

The next step in our analyses of RAD IIB data was to determine the total number of sites in the reference genome that would be recognized by the restriction enzyme. Moreover, we wanted to extact the sequences and positions of the expected fragments along the chromosome and store this information in a fastA file. This because mapping against the expected sites only would be faster and it makes sense since those are the only sites that we can compare. Following through with awk and examples of loops found on the web we came up with this short code:

#!/usr/bin/awk

{
  if ($1~”>”)
    name=$1
  else
    sum=0;
    while (match($0, /.{10}GCA.{6}TGC.{12}/))
    {
      printf name”_”sum+RSTART”\n”substr($0, RSTART, RLENGTH)”\n”;
      $0=substr($0, RSTART+RLENGTH);
      sum+=RSTART+RLENGTH-1
    }
}

In the while loop we go through the line representing a chromosome, searching for matches to our regular expression. When a match is found, we extract its sequence and record its position along the chromosome. Then, we update the chromosome string to start at the end of the previously matched site. And also, update the variable sum that keeps track how far down the chromosome we are. This last bit was important because we wanted to print the start position of the extracted fragments in the names of the newly generated fastA file. Note that for this script to work, each chromosome needs to be represented by a single line in the input fasta file. We converted the wrapped fasta files to one line per chromosome using (awk again):

awk ‘{if ($1 !~ “>”) printf”%s”, $0; else print $0}’


Comments are closed.