Posts tagged with: Illumina

What do novel genomic sequences of diatoms BLAST to?

I blasted some 113K contigs against the Whole Genome Shotgun database from NCBI. These contigs originate from a diatom (a heterokont), phylogenetically very distant from any metazoan. Below are the only hits that I got. It's funny and frustrating at the same time. Funny because I get all sorts of weird and interesting creatures in my results (star-nosed mole; armadillo; aye-aye lemur). Frustrating because these results don't really help me beyond saying that my sequences are probably not bacterial.

>gb|AFFD01005562.1| Drosophila biarmipes ctg7180000294536, whole genome shotgun sequence == INSECT
>emb|CAKG01019517.1| Drosophila suzukii WGS project CAGK00000000 data, contig drspszk_2104792, == INSECT
>gb|AEQU011255777.1| Python molurus contig26097431, whole genome shotgun sequence == SNAKE
>gb|ACUP01002672.1| Glycine max chromosome 4 GLYMAchr_04_Cont2672, whole genome shotgun == SOY BEAN
>gb|AAFI02000003.1| Dictyostelium discoideum AX4 chromosome 1 DDB0232442.02, whole == SLIME MOLD
>gb|AEAQ01006275.1| Solenopsis invicta Si_gnG.contig09098, whole genome shotgun sequence == ANT
>gb|AKZM01041212.1| Ceratotherium simum simum contig041212, whole genome shotgun == WHITE RHINOCEROS
>gb|AGTM011602453.1| Daubentonia madagascariensis contig_1605346, whole genome shotgun == AYE-AYE LEMUR
>gb|AAPN01433181.1| Ornithorhynchus anatinus Cont293.145, whole genome shotgun sequence == PLATYPUS
>gb|AFNC01022486.1| Arabidopsis thaliana Contig_3958_18, whole genome shotgun sequence == FLOWER
>gb|AEFG01022091.1| Petromyzon marinus contig_22090, whole genome shotgun sequence == LAMPREY
>gb|AAGV03065875.1| Dasypus novemcinctus Contig65953, whole genome shotgun sequence == ARMADILLO
>gb|ABRT010150056.1| Tarsius syrichta cont1.150055, whole genome shotgun sequence == PHILIPPINE TARSIER
>gb|AHZZ01102666.1| Papio anubis, whole genome shotgun sequence == OLIVE BABOON
>gb|AANI01015498.1| Drosophila virilis strain TSC#15010-1051.87 Ctg01_15536, whole == INSECT
>emb|CABZ01111607.1| Danio rerio strain Tuebingen, whole genome shotgun sequencing, == ZEBRA FISH
>gb|AFSA01033859.1| Lactuca sativa, whole genome shotgun sequence == LETTUCE
>gb|AGCE01114062.1| Saimiri boliviensis boliviensis contig114062, whole genome shotgun == SQUIRREL MONKEY
>gb|AJVK01038101.1| Phlebotomus papatasi Contig6480.1, whole genome shotgun sequence == DIPTERAN
>gb|ADFV01027118.1| Nomascus leucogenys contig_72117, whole genome shotgun sequence == WHITE-CHEEKED GIBBON
>gb|AJFE01086779.1| Pan paniscus cntg87012, whole genome shotgun sequence == BONOBO
>gb|AABR06063353.1| Rattus norvegicus strain BN/SsNHsdMCW Contig63439, whole genome == RAT
>gb|AGDA01037861.1| Mengenilla moldrzyki contig16548, whole genome shotgun sequence == PARASITIC INSECT
>gb|AEQM02009745.1| Bombus impatiens ctg_1080001, whole genome shotgun sequence == BUMBLEBEE
>gb|AFNY01011920.1| Neolamprologus brichardi contig011920, whole genome shotgun sequence == TANGANYIKAN CICHLID
>gb|ABQD01000052.1| Phaeodactylum tricornutum CCAP 1055/1 chromosome 13 PHATRchr_13_Cont52, == DIATOM
>gb|AEAC01002359.1| Harpegnathos saltator strain R22 G/1 HarSal_1.0_1.contig2359, == JUMPING ANT
>gb|ABWE02003637.1| Hyaloperonospora arabidopsidis Emoy2 Contig474.0, whole genome == OOMYCETE
>gb|AFNH02001894.1| Gregarina niphandrodes contig01894, whole genome shotgun sequence == APICOMPLEXAN
>ref|NZ_AEWG01000018.1| Rubrivivax benzoatilyticus JA2 contig_18, whole genome shotgun == BETAPROTEOBACTERIUM
>gb|AAPU01010980.1| Drosophila mojavensis strain TSC#15081-1352.22 Ctg01_10981, whole == INSECT
>gb|AJFV01050767.1| Condylura cristata contig050767, whole genome shotgun sequence == STAR-NOSED MOLE
>emb|CAIC01021708.1| Musa acuminata subsp. malaccensis WGS project CAIC00000000 data, == BANANA
>gb|AGTM010534631.1| Daubentonia madagascariensis contig_535897, whole genome shotgun == AYE-AYE LEMUR
>gb|AASC02043444.1| Aplysia californica cont2.43443, whole genome shotgun sequence == SEA SLUG
>gb|AHZO01011761.1| Alatina moseri contig11761, whole genome shotgun sequence == BOX JELLYFISH
>gb|AEFK01032559.1| Sarcophilus harrisii Chr1_contig_000032558, whole genome shotgun == TASMANIAN DEVIL
>gb|AGXQ01000024.1| Bacteroides fragilis HMW 610 cont1.24, whole genome shotgun sequence == OBLIGATE GUT ANAEROBE
>gb|AABR06098521.1| Rattus norvegicus strain BN/SsNHsdMCW Contig98680, whole genome == RAT
>dbj|BAAF04004842.1| Oryzias latipes DNA, contig4842 in scaffold5, strain: Hd-rR, == JAPANESE KILLFISH

Finding the expected radIIB fragments from a reference genome

The next step in our analyses of RAD IIB data was to determine the total number of sites in the reference genome that would be recognized by the restriction enzyme. Moreover, we wanted to extract the sequences and positions of the expected fragments along each chromosome and store this information in a FASTA file. We did this because mapping against only the expected sites would be faster, and it makes sense because those are the only sites that we can compare. Working with awk and examples of loops found on the web, we came up with this short code:


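awk '{
  if ($1 ~ ">") { name = $1; sum = 0 }  # header: keep name, reset offset
  else while (match($0, /.{10}GCA.{6}TGC.{12}/)) {
    printf "%s_%d\n%s\n", name, sum + RSTART, substr($0, RSTART, RLENGTH)
    sum += RSTART + RLENGTH - 1           # bases consumed so far
    $0 = substr($0, RSTART + RLENGTH)     # continue after the matched site
  }
}'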

In the while loop we go through the line representing a chromosome, searching for matches to our regular expression. When a match is found, we extract its sequence and record its position along the chromosome. Then we update the chromosome string to start just after the matched site, and update the variable sum, which keeps track of how far down the chromosome we are. This last bit was important because we wanted to print the start position of each extracted fragment in the names of the newly generated FASTA file. Note that for this script to work, each chromosome needs to be represented by a single line in the input FASTA file. We converted the wrapped FASTA files to one line per chromosome using (awk again):

awk '{if ($1 !~ ">") printf "%s", $0; else printf "%s%s\n", (NR > 1 ? "\n" : ""), $0} END {print ""}'
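
To check the position bookkeeping, the two steps can be chained on a toy input. What follows is only a minimal sketch, not our actual pipeline: the function names unwrap_fasta and extract_sites and the toy sequence are made up for illustration. The pattern is also built from plain dots instead of {n} intervals, on the assumption that the awk in use may not have interval expressions enabled (some builds do not by default).

```shell
# Hypothetical helper: join wrapped sequence lines so that each record is
# one header line followed by one sequence line.
unwrap_fasta() {
  awk '{
    if ($1 !~ ">") printf "%s", $0
    else printf "%s%s\n", (NR > 1 ? "\n" : ""), $0
  } END { print "" }' "$@"
}

# Hypothetical helper: print every match of N10-GCA-N6-TGC-N12 as a FASTA
# record named <header>_<1-based start position>.
extract_sites() {
  awk '
    # Build a string of n dots, a portable stand-in for the interval .{n}
    function dots(n,   s) { while (n-- > 0) s = s "."; return s }
    BEGIN { pat = dots(10) "GCA" dots(6) "TGC" dots(12) }
    {
      if ($1 ~ ">") { name = $1; sum = 0 }   # new chromosome: reset offset
      else while (match($0, pat)) {
        printf "%s_%d\n%s\n", name, sum + RSTART, substr($0, RSTART, RLENGTH)
        sum += RSTART + RLENGTH - 1          # bases consumed so far
        $0 = substr($0, RSTART + RLENGTH)    # keep searching after the site
      }
    }' "$@"
}

# Toy input: one chromosome, wrapped over three lines, with the same
# 34-base site placed at positions 3 and 40.
site="AAAAAAAAAAGCACCCCCCTGCGGGGGGGGGGGG"
printf '>chr1\nTT%s\nTTT%s\nT\n' "$site" "$site" | unwrap_fasta | extract_sites
# prints:
# >chr1_3
# AAAAAAAAAAGCACCCCCCTGCGGGGGGGGGGGG
# >chr1_40
# AAAAAAAAAAGCACCCCCCTGCGGGGGGGGGGGG

# Total number of expected sites = number of header lines in the output.
printf '>chr1\nTT%s\nTTT%s\nT\n' "$site" "$site" | unwrap_fasta | extract_sites | grep -c '^>'
# prints: 2
```

Counting the header lines of the output with grep -c is one simple way to get the total number of sites we were after in the first place.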