Posts tagged with: awk

Convert a sequential phylip to fasta (with awk)

Of course, it's not that difficult to open Mesquite or Seaview and do this there. But I don't want to do that for more than two files. And I'm sure there are plenty of other, more robust (perl/python) ways to do this. But this is pretty simple:

awk --posix '{if ($1 ~ /[[:alpha:]]/) print ">"$1"\n"$2}' input_seqPhy > output_fasta
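
Here is a minimal sketch of the same idea wrapped in a shell function, so it can be looped over several files. The function name phy2fasta and the sample data are made up for illustration. It assumes each taxon sits on a single line with the name in the first column, and that every name contains at least one letter (that is how the numeric header line gets skipped). Unlike the one-liner, it also rejoins sequences that are split into space-separated blocks.

```shell
# Hypothetical helper: sequential phylip in, fasta out.
phy2fasta() {
    # The header line holds only numbers (taxa and sites), so it fails the
    # letter test and is skipped. [A-Za-z] is used instead of [[:alpha:]]
    # so the pattern also works in awks without POSIX character classes.
    awk '$1 ~ /[A-Za-z]/ {
        seq = ""
        for (i = 2; i <= NF; i++) seq = seq $i   # rejoin blocks of sequence
        print ">" $1 "\n" seq
    }' "$@"
}

# Made-up example: two taxa, eight sites, first sequence split in two blocks.
printf '2 8\ntaxonA ACGT ACGT\ntaxonB ACGTTCGT\n' | phy2fasta
```

To convert a whole folder, something like this should do, assuming the input files end in .phy:
for f in *.phy; do phy2fasta "$f" > "${f%.phy}.fasta"; done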


A quick and easy way to fix the names of sequences downloaded from GenBank

I have been working on newly generated sequence data for some unknown parasitic Oomycetes. We are trying to assess, molecularly at least, what these strains are, and to do so we downloaded a number of sequences from GenBank. When producing the fasta file, GenBank provides plenty of useful information in the name of each sequence. For example, the names usually look something like this:

>gi|403226878|gb|JQ340014.1| Saprolegnia ferax strain GH1 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial
>gi|320337811|gb|HQ709057.1| Saprolegnia unispora voucher CBS110066 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial
>gi|320337807|gb|HQ709055.1| Saprolegnia turfosa voucher CBS110065 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial
>gi|320337803|gb|HQ709053.1| Saprolegnia turfosa voucher CBS32735 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial
>gi|320337799|gb|HQ709051.1| Saprolegnia terrestris voucher CBS53367 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial

Although having all this information is great, sequences with such long names are cumbersome to edit in most alignment viewers like Mesquite and Seaview. Plus it's a horrible waste of time. Additionally, phylogenetic trees with names as long as that are never pretty. Here is a quick and easy way to modify the names in this fasta file while retaining the most important information: genus, species and GenBank accession number.

awk '{if ($1 ~ ">") print ">"$2"_"$3"_"substr($1,18,8); else print $0}' input.fasta > output.fasta

The produced fasta file will have names looking like this:

>Saprolegnia_ferax_JQ340014
>Saprolegnia_unispora_HQ709057
>Saprolegnia_turfosa_HQ709055
>Saprolegnia_turfosa_HQ709053
>Saprolegnia_terrestris_HQ709051
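
One caveat: substr($1,18,8) counts characters, so it only works when the gi number has nine digits and the accession has eight characters, as in the examples above. Here is a sketch of a variant that splits the first field on "|" instead, so the position no longer matters. The function name shorten_names is made up, and it assumes the header layout shown above (gi|number|gb|accession.version| followed by genus and species).

```shell
# Hypothetical helper: shorten GenBank headers to >Genus_species_Accession.
shorten_names() {
    awk '/^>/ {
        split($1, id, "|")            # id[4] is e.g. JQ340014.1
        sub(/\.[0-9]+$/, "", id[4])   # drop the version suffix (.1)
        print ">" $2 "_" $3 "_" id[4]
        next
    }
    { print }                         # sequence lines pass through unchanged
    ' "$@"
}

# Example with a header from above and a made-up sequence line.
printf '%s\n' '>gi|403226878|gb|JQ340014.1| Saprolegnia ferax strain GH1 cytochrome oxidase subunit 1 (COI) gene, partial cds; mitochondrial' 'ACGTACGT' | shorten_names
```

It is used the same way as the one-liner: shorten_names input.fasta > output.fasta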

