Comparative Genomics: Evolution of orthologs


  • open Mozilla
  • open two terminal windows
  • Get E. coli genome by typing:
    ftp http://www.bx.psu.edu/~anton/share/bmmb598f/hp.gbk
    ftp http://www.bx.psu.edu/~anton/share/bmmb598f/wol.gbk
    ftp http://www.bx.psu.edu/~anton/share/bmmb598f/hp_genome.fna
    ftp http://www.bx.psu.edu/~anton/share/bmmb598f/wol_genome.fna
    ftp http://www.bx.psu.edu/~anton/share/bmmb598f/bbh.sql
    ftp http://www.bx.psu.edu/~anton/share/bmmb598f/hProt.sql
    ftp http://www.bx.psu.edu/~anton/share/bmmb598f/wProt.sql
    ftp http://www.bx.psu.edu/~anton/share/bmmb598f/mysql
    ftp ftp://ftp.ebi.ac.uk/pub/software/unix/clustalw/clustalw1.82.mac-osx.tar.gz
    chmod 755 mysql
    

Open mysql by typing

./mysql -u XXX -h ZZZZ XXX

drop all tables and type quit Now load tables you just downloaded using these:

mysql -u XXX -h ZZZZ XXX < bbh.sql
mysql -u XXX -h ZZZZ XXX < hProt.sql
mysql -u XXX -h ZZZZ XXX < wProt.sql

To make sense of orthologous relationship we established last time we need to generate alignments. There several layers of complications:

  1. We need to get nucleotide sequences encoding proteins we used to find BBHs
  2. We need to align these nucleotide sequences codon by codon (not nucleotide by nucleotide)

Getting coding sequences

To get coding sequences we need to parse Genbank *.gbk files, which is a brutal thing to do. In your directory you have two files: hp.gbk and wol.gbk (Both Bioperl and Biopython offer useful functions for solving this problem, but we'll use bare hands here).

more hp.gbk
egrep ' CDS | \/db_xref=\"GI' hp.gbk | more
egrep ' \/db_xref=\"GI' hp.gbk > gi.txt
egrep ' CDS ' hp.gbk > cds.txt
paste cds.txt gi.txt | sed s/\ *//g | sed s/^CDS// | sed s/\\/db_xref=\"GI:// | sed s/\"$// > hp_dna2prot.txt

Now let's do the same thing for Wolinella:

egrep ' CDS ' wol.gbk > cds.txt
egrep ' \/db_xref=\"GI' wol.gbk > gi.txt
paste cds.txt gi.txt | sed s/\ *//g | sed s/^CDS// | sed s/\\/db_xref=\"GI:// | sed s/\"$// > wol_dna2prot.txt 

To get CDS we can now run this program:

#! /usr/bin/perl -w

# getCDS.pl [coord file] [seq file]

open (COORD, "<$ARGV[0]") or die "Cannot open $ARGV[0]:$!\n";
@coord = <COORD>;
close COORD;
chop @coord;

open (SEQ, "<$ARGV[1]") or die "Cannot open $ARGV[1]:$!\n";
while (<SEQ>) {
  if (!m/^>/) {
    chop;
    $seq .= $_;
  }
}

close SEQ;

foreach (@coord) {
  $comp = 0;
  if (!m/join/) {
    $comp = 1 if m/complement/;
    m/\({0,1}(\d+)\.\.(\d+)\){0,1}\t(\d+)$/;
    print "$3\t";
    if ($comp == 1) {
      print reverse_complement(substr($seq, $1-1, ($2-$1+1)));
    } else {
      print substr($seq, $1-1, ($2-$1+1));
    }
    print "\n";
  }
}


sub reverse_complement
{ my @temp = split (//, $_[0]);
  my @result;
  foreach (@temp)
  { if (m/a/i)
    { push(@result, "T");
    } elsif (m/t/i)
    { push(@result, "A");
    } elsif (m/c/i)
    { push(@result, "G");
    } elsif (m/g/i)
    { push(@result, "C");
    } elsif (m/u/i)
    { push(@result, "C");
    } elsif (m/r/i)
    { push(@result, "Y");
    } elsif (m/y/i)
    { push(@result, "R");
    } elsif (m/m/i)
    { push(@result, "K");
    } elsif (m/k/i)
    { push(@result, "M");
    } else
    { push(@result, $_);
    }
  }
  return uc(join("", reverse(@result)));
}

As the output we will get a tab delimited files containing GIs and DNA sequences. We will create a table in MySQL and finally generate a summary file that contains both Protein and DNA sequences.


Thu Nov 20 19:55:58 2008