:::: MENU ::::
Posts tagged with: Secondary structure

Creating a partition file based on secondary structure alignment in WUSS format

Recently I’ve been using Erik Nawrocki’s nice program SSU-ALIGN to line up a bunch of ribosomal sequences to covariance models of secondary structure of ribosomal RNA. The program is really well documented, fast, and creates very informative output. The consensus secondary structure of the aligned rRNA sequences is captured in WUSS format which looks something like this:

{{{{{-{{{{{{{–{{,,{{{{{{{{{{{{{{{{{,,,,<<<<________>>>>,,,,,,,,{{{{{{{,,,,<<<<<—->>>>>,,,[[-[[[[–[[[[[[[[[,,,((((((<<<____>>>,,<<<<______>>>>,,)))))),,,,,((((,<<<-<<<<–<<<__________>>>–>>>>>>>,,<<<<<_______>>>>>,,,,,)))),,,,]]]]-]]]—]-]]]]]]],,,,,}}}}}}},,,}}-}}}}}}}}}}

The various brackets represent paired sites (in stems), whereas dashes, dots, underscores unpaired sites (in loops) in the molecule. Different types of brackets denote different types of intramolecular pairing between nucleotides (more here). Since I wanted to partition the rRNA data based on secondary structure (stems vs. loops) I needed to parse this output. My really simplistic way of doing this in an interactive terminal session follows.

cat SS_cons | tr “[]{}()<>” “1” | tr “:.,~_” “2” | sed ‘s/-/2/g’ | sed ‘s/\(.\)/\1 \n/g’ | awk ‘{print NR “\t” “c”$0}’ > out

In this rather long oneliner I am replacing all the characters denoting paired sites with ”1” (first tr command) and all characters denoting unpaired sites with ”2” (second tr command). The first sed command is for the dashes (which for some reason are not replaced by tr). The second sed command I ripped off the internet and is to print a newline character after each character in the string. The awk command is to prepend a line number to each character and a ”c” before the number for partition assignment (I needed this for easier grep in the next step). Finally, I am redirecting output to a file. The first six lines of output from this messy oneliner looks like this:

1 c1

2 c1

3 c1

4 c1

5 c1

6 c2

 

So for each character in the alignment we have the position (first column) and the partition prefaced with a ”c” (second column). Now to find all the character positions that need to go in the paired partition I did:

grep “c1” out | cut -f 1 | tr “\n” “,” > paired_sites

In this command I grep all lines that have c1 (meaning first partition), then cut only the first column (discarding c1) and finally replacing the newline characters with a comma (tr command). The output is redirected to a file and ready paste in a RAxML-type partition file. Change the comma to a space for a Nexus-type partition. Rinse and repeat the last command for the characters in the second partition (unpaired sites).

And that’s it. I am sure there is a much simpler way of doing this. But it worked for me and I easily moved on to the more interesting part of my work. Building the phylogeny.