Posts tagged with: denovo assembly

What do novel genomic sequences of diatoms BLAST to?

I BLASTed some 113K contigs against NCBI’s Whole Genome Shotgun (WGS) database. These contigs originate from a diatom (a heterokont), phylogenetically very distant from any metazoan. Below are the only hits that I got. It’s funny and frustrating at the same time. Funny because I get all sorts of weird and interesting creatures in my results (star-nosed mole, armadillo, aye-aye lemur). Frustrating because these results don’t really help me beyond saying that my sequences are probably not bacterial.

>gb|AFFD01005562.1| Drosophila biarmipes ctg7180000294536, whole genome shotgun sequence == INSECT
>emb|CAKG01019517.1| Drosophila suzukii WGS project CAGK00000000 data, contig drspszk_2104792, == INSECT
>gb|AEQU011255777.1| Python molurus contig26097431, whole genome shotgun sequence == SNAKE
>gb|ACUP01002672.1| Glycine max chromosome 4 GLYMAchr_04_Cont2672, whole genome shotgun == SOYBEAN
>gb|AAFI02000003.1| Dictyostelium discoideum AX4 chromosome 1 DDB0232442.02, whole == SLIME MOLD
>gb|AEAQ01006275.1| Solenopsis invicta Si_gnG.contig09098, whole genome shotgun sequence == ANT
>gb|AKZM01041212.1| Ceratotherium simum simum contig041212, whole genome shotgun == WHITE RHINOCEROS
>gb|AGTM011602453.1| Daubentonia madagascariensis contig_1605346, whole genome shotgun == AYE-AYE LEMUR
>gb|AAPN01433181.1| Ornithorhynchus anatinus Cont293.145, whole genome shotgun sequence == PLATYPUS
>gb|AFNC01022486.1| Arabidopsis thaliana Contig_3958_18, whole genome shotgun sequence == FLOWER
>gb|AEFG01022091.1| Petromyzon marinus contig_22090, whole genome shotgun sequence == LAMPREY
>gb|AAGV03065875.1| Dasypus novemcinctus Contig65953, whole genome shotgun sequence == ARMADILLO
>gb|ABRT010150056.1| Tarsius syrichta cont1.150055, whole genome shotgun sequence == PHILIPPINE TARSIER
>gb|AHZZ01102666.1| Papio anubis, whole genome shotgun sequence == OLIVE BABOON
>gb|AANI01015498.1| Drosophila virilis strain TSC#15010-1051.87 Ctg01_15536, whole == INSECT
>emb|CABZ01111607.1| Danio rerio strain Tuebingen, whole genome shotgun sequencing, == ZEBRAFISH
>gb|AFSA01033859.1| Lactuca sativa, whole genome shotgun sequence == LETTUCE
>gb|AGCE01114062.1| Saimiri boliviensis boliviensis contig114062, whole genome shotgun == SQUIRREL MONKEY
>gb|AJVK01038101.1| Phlebotomus papatasi Contig6480.1, whole genome shotgun sequence == DIPTERAN
>gb|ADFV01027118.1| Nomascus leucogenys contig_72117, whole genome shotgun sequence == WHITE-CHEEKED GIBBON
>gb|AJFE01086779.1| Pan paniscus cntg87012, whole genome shotgun sequence == BONOBO
>gb|AABR06063353.1| Rattus norvegicus strain BN/SsNHsdMCW Contig63439, whole genome == RAT
>gb|AGDA01037861.1| Mengenilla moldrzyki contig16548, whole genome shotgun sequence == PARASITIC INSECT
>gb|AEQM02009745.1| Bombus impatiens ctg_1080001, whole genome shotgun sequence == BUMBLEBEE
>gb|AFNY01011920.1| Neolamprologus brichardi contig011920, whole genome shotgun sequence == TANGANYIKAN CICHLID
>gb|ABQD01000052.1| Phaeodactylum tricornutum CCAP 1055/1 chromosome 13 PHATRchr_13_Cont52, == DIATOM
>gb|AEAC01002359.1| Harpegnathos saltator strain R22 G/1 HarSal_1.0_1.contig2359, == JUMPING ANT
>gb|ABWE02003637.1| Hyaloperonospora arabidopsidis Emoy2 Contig474.0, whole genome == OOMYCETE
>gb|AFNH02001894.1| Gregarina niphandrodes contig01894, whole genome shotgun sequence == APICOMPLEXAN
>ref|NZ_AEWG01000018.1| Rubrivivax benzoatilyticus JA2 contig_18, whole genome shotgun == BETAPROTEOBACTERIUM
>gb|AAPU01010980.1| Drosophila mojavensis strain TSC#15081-1352.22 Ctg01_10981, whole == INSECT
>gb|AJFV01050767.1| Condylura cristata contig050767, whole genome shotgun sequence == STAR-NOSED MOLE
>emb|CAIC01021708.1| Musa acuminata subsp. malaccensis WGS project CAIC00000000 data, == BANANA
>gb|AGTM010534631.1| Daubentonia madagascariensis contig_535897, whole genome shotgun == AYE-AYE LEMUR
>gb|AASC02043444.1| Aplysia californica cont2.43443, whole genome shotgun sequence == SEA SLUG
>gb|AHZO01011761.1| Alatina moseri contig11761, whole genome shotgun sequence == BOX JELLYFISH
>gb|AEFK01032559.1| Sarcophilus harrisii Chr1_contig_000032558, whole genome shotgun == TASMANIAN DEVIL
>gb|AGXQ01000024.1| Bacteroides fragilis HMW 610 cont1.24, whole genome shotgun sequence == OBLIGATE GUT ANAEROBE
>gb|AABR06098521.1| Rattus norvegicus strain BN/SsNHsdMCW Contig98680, whole genome == RAT
>dbj|BAAF04004842.1| Oryzias latipes DNA, contig4842 in scaffold5, strain: Hd-rR, == JAPANESE KILLIFISH


Wrapped fastq problem

Recently I started working on some de novo assemblies of Illumina sequences. My approach starts with mapping the reads to a known contaminant genome and storing the reads that didn’t map to the contaminant in a fastq file. I do this to reduce the memory and time requirements of the assembly. Unfortunately, for some reason, the mapper produced a fastq file in which sequences and qualities were wrapped onto a second line after the 79th character. This threw off the assemblers I tried to use: they inferred that there were more sequences in the file than there actually were.

I spent the better part of the morning trying to come up with a way to unwrap the fastq file (I’m not very programming smart). The file looked like this:

@1101_1131_2245
CCTCATCAGTGCGAGTGTAGTAGTCAAGGTTTAGAACATGGGATAATCCATGTGTCCTAGCAGTTGCTTTGAGCTGACG
ATGCCATTGTTTCCATTTGTTG
+
@@@FFFFFHFHFH@G@EGIJGGEIEDEHIEEHDCFHGIIGIJGHIJJGIIIIIJGHIBHGIIGAFHIGGIJ=HIJEHHH
;BDCECCDEAACC@>CCC;AA#

And here is the solution I came up with:

awk '{if ($1 ~ /.*_.*_.*/ || $1 ~ /^\+$/) print "\n"$1; else printf "%s", $0}' | awk 'NF'

The first condition is a regular expression that recognizes the fastq tag (here, anything containing two underscores), so I will need to change it for fastq files with differently formatted tags. The second condition matches the lone “+” separator line. The last bit after the pipe removes any blank lines from the output.
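One way to avoid depending on the tag format is to use lengths instead: join sequence lines until the “+” separator shows up, then join quality lines until they are as long as the sequence. Below is a minimal sketch of that idea. It assumes every record has a separator line and a quality string exactly as long as its sequence. The function name unwrap_fastq is made up for illustration.

```shell
# Hypothetical helper (not from the original post): unwrap a FASTQ file whose
# sequence and quality lines were hard-wrapped by an upstream tool.
# Assumes each record has a '+' separator line and len(quality) == len(sequence).
unwrap_fastq() {
  awk '
    BEGIN { state = 0 }                      # 0 = header, 1 = sequence, 2 = quality
    state == 0 {                             # header line starts a new record
      if ($0 == "") next                     # skip stray blank lines between records
      print; seq = ""; qual = ""; state = 1; next
    }
    state == 1 && /^\+/ {                    # separator: the sequence is complete
      print seq; print; state = 2; next
    }
    state == 1 { seq = seq $0; next }        # another piece of the sequence
    state == 2 {                             # another piece of the quality string
      qual = qual $0
      if (length(qual) >= length(seq)) { print qual; state = 0 }
    }
  ' "$@"
}
```

It reads files given as arguments or standard input, for example: unwrap_fastq wrapped.fastq > unwrapped.fastq (both file names are placeholders). Because the quality lines are counted rather than pattern-matched, a quality line that happens to start with “@” or “+”, or that contains underscores, is not mistaken for a tag.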

Unfortunately, this one-liner is not that fast. It took 23 minutes to convert a file of 18 million reads.

Not perfect but it did the job…
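To confirm that a conversion like this did the job before handing the file to an assembler, a quick structural check helps. The sketch below assumes strict four-line fastq records (tag, sequence, “+” line, qualities) and stops at the first line that breaks the pattern. The name check_fastq is made up for illustration.

```shell
# Hypothetical helper (not from the original post): exit status 0 if the input
# is strict 4-line FASTQ with equal sequence and quality lengths, 1 otherwise.
check_fastq() {
  awk '
    NR % 4 == 1 && !/^@/                { msg = "expected @header" }
    NR % 4 == 2                         { seqlen = length($0) }
    NR % 4 == 3 && !/^\+/               { msg = "expected + separator" }
    NR % 4 == 0 && length($0) != seqlen { msg = "quality and sequence lengths differ" }
    msg != ""                           { print "line " NR ": " msg; bad = 1; exit 1 }
    END {
      if (bad) exit 1                   # a problem was already reported above
      if (NR % 4 != 0) { print "file ends in the middle of a record"; exit 1 }
    }
  ' "$@"
}
```

On a wrapped file like the one above, it reports the third line, where the second half of the sequence sits in the place of the “+” separator.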