Comparative Genomics: Evolution of orthologs
- open Mozilla
- open two terminal windows
- Get E. coli genome by typing:
ftp http://www.bx.psu.edu/~anton/share/bmmb598f/hp.gbk ftp http://www.bx.psu.edu/~anton/share/bmmb598f/wol.gbk ftp http://www.bx.psu.edu/~anton/share/bmmb598f/hp_genome.fna ftp http://www.bx.psu.edu/~anton/share/bmmb598f/wol_genome.fna ftp http://www.bx.psu.edu/~anton/share/bmmb598f/bbh.sql ftp http://www.bx.psu.edu/~anton/share/bmmb598f/hProt.sql ftp http://www.bx.psu.edu/~anton/share/bmmb598f/wProt.sql ftp http://www.bx.psu.edu/~anton/share/bmmb598f/mysql ftp ftp://ftp.ebi.ac.uk/pub/software/unix/clustalw/clustalw1.82.mac-osx.tar.gz chmod 755 mysql
Open mysql by typing
./mysql -u XXX -h ZZZZ XXX
drop all tables and type quit Now load tables you just downloaded using these:
mysql -u XXX -h ZZZZ XXX < bbh.sql mysql -u XXX -h ZZZZ XXX < hProt.sql mysql -u XXX -h ZZZZ XXX < wProt.sql
To make sense of orthologous relationship we established last time we need to generate alignments. There several layers of complications:
- We need to get nucleotide sequences encoding proteins we used to find BBHs
- We need to align these nucleotide sequences codon by codon (not nucleotide by nucleotide)
Getting coding sequences
To get coding sequences we need to parse Genbank *.gbk files, which is a brutal thing to do. In your directory you have two files: hp.gbk and wol.gbk (Both Bioperl and Biopython offer useful functions for solving this problem, but we'll use bare hands here).
more hp.gbk egrep ' CDS | \/db_xref=\"GI' hp.gbk | more egrep ' \/db_xref=\"GI' hp.gbk > gi.txt egrep ' CDS ' hp.gbk > cds.txt paste cds.txt gi.txt | sed s/\ *//g | sed s/^CDS// | sed s/\\/db_xref=\"GI:// | sed s/\"$// > hp_dna2prot.txt
Now let's do the same thing for Wolinella:
egrep ' CDS ' wol.gbk > cds.txt egrep ' \/db_xref=\"GI' wol.gbk > gi.txt paste cds.txt gi.txt | sed s/\ *//g | sed s/^CDS// | sed s/\\/db_xref=\"GI:// | sed s/\"$// > wol_dna2prot.txt
To get CDS we can now run this program:
#! /usr/bin/perl -w
# getCDS.pl [coord file] [seq file]
open (COORD, "<$ARGV[0]") or die "Cannot open $ARGV[0]:$!\n";
@coord = <COORD>;
close COORD;
chop @coord;
open (SEQ, "<$ARGV[1]") or die "Cannot open $ARGV[1]:$!\n";
while (<SEQ>) {
if (!m/^>/) {
chop;
$seq .= $_;
}
}
close SEQ;
foreach (@coord) {
$comp = 0;
if (!m/join/) {
$comp = 1 if m/complement/;
m/\({0,1}(\d+)\.\.(\d+)\){0,1}\t(\d+)$/;
print "$3\t";
if ($comp == 1) {
print reverse_complement(substr($seq, $1-1, ($2-$1+1)));
} else {
print substr($seq, $1-1, ($2-$1+1));
}
print "\n";
}
}
sub reverse_complement
{ my @temp = split (//, $_[0]);
my @result;
foreach (@temp)
{ if (m/a/i)
{ push(@result, "T");
} elsif (m/t/i)
{ push(@result, "A");
} elsif (m/c/i)
{ push(@result, "G");
} elsif (m/g/i)
{ push(@result, "C");
} elsif (m/u/i)
{ push(@result, "C");
} elsif (m/r/i)
{ push(@result, "Y");
} elsif (m/y/i)
{ push(@result, "R");
} elsif (m/m/i)
{ push(@result, "K");
} elsif (m/k/i)
{ push(@result, "M");
} else
{ push(@result, $_);
}
}
return uc(join("", reverse(@result)));
}
As the output we will get a tab delimited files containing GIs and DNA sequences. We will create a table in MySQL and finally generate a summary file that contains both Protein and DNA sequences.
Thu Nov 20 19:55:58 2008