Viterbiアルゴリズム

自主セミナーで以下のような問題をといて見ました。

以下のような2つの状態からなるHMMを考える。
State1:酵母のCodon Usage
State2:ATGC頻度から求めたトリヌクレオチドの確率
このモデルとViterbiアルゴリズムを用いて与えられた
DNA配列からcoding regionを同定せよ。

ちなみにHMMのtransition probabilityは

state0 state1
state0 0.9 0.1
state1 0.1 0.9

またemission probabilityは

state0 state1
TTT 0.0261 0.027
TCT 0.0235 0.018
TAT 0.0188 0.027
TGT 0.0081 0.018
TTC 0.0184 0.018
TCC 0.0142 0.012
TAC 0.0148 0.018
TGC 0.0048 0.012
TTA 0.0262 0.027
TCA 0.0187 0.018
TAA 0.0011 0.027
TGA 0.0007 0.018
TTG 0.0272 0.018
TCG 0.0086 0.012
TAG 0.0005 0.018
TGG 0.0104 0.012
CTT 0.0123 0.018
CCT 0.0135 0.012
CAT 0.0136 0.018
CGT 0.0064 0.012
CTC 0.0054 0.012
CCC 0.0068 0.008
CAC 0.0078 0.012
CGC 0.0026 0.008
CTA 0.0134 0.018
CCA 0.0183 0.012
CAA 0.0273 0.018
CGA 0.0030 0.012
CTG 0.0105 0.012
CCG 0.0053 0.008
CAG 0.0121 0.012
CGG 0.0017 0.008
ATT 0.0301 0.027
ACT 0.0203 0.018
AAT 0.0357 0.027
AGT 0.0142 0.018
ATC 0.0172 0.018
ACC 0.0127 0.012
AAC 0.0248 0.018
AGC 0.0098 0.012
ATA 0.0178 0.027
ACA 0.0178 0.018
AAA 0.0419 0.027
AGA 0.0213 0.018
ATG 0.0209 0.018
ACG 0.0080 0.012
AAG 0.0308 0.018
AGG 0.0092 0.012
GTT 0.0221 0.018
GCT 0.0212 0.012
GAT 0.0376 0.018
GGT 0.0239 0.012
GTC 0.0118 0.012
GCC 0.0126 0.008
GAC 0.0202 0.012
GGC 0.0098 0.008
GTA 0.0118 0.018
GCA 0.0162 0.012
GAA 0.0456 0.018
GGA 0.0109 0.012
GTG 0.0108 0.012
GCG 0.0062 0.008
GAG 0.0192 0.012
GGG 0.0060 0.008

とする。
特にスタートコドンや終止コドンの制約を設けずに
以下のようなプログラムを書いてみた。

#!/usr/local/bin/perl

#Viterbi algorithm
#observed symbols : trinucleotide
#hidden state : coding region, normal

#update log
#2008/05/31 1st version coded by Tanakky

use strict;

############################### main ##################################

#initialization
my ($seq_file, $trans_file, $emit_file) = &init;
my %trans = (); #transition probability
my %emit = (); #emittion probability
my $nrow = 0; #the number of row in matrix
my $ncol = 0; #the number of column in matrix
my $nuc = ""; #trinucleotide
my $seq = ""; #input sequence
my $seq2 = ""; #input sequence
my @init = (); #initialization probability
my %prob = (); #probability of optimal path
my %path = (); #optimal path
my $num = 0; #position of sequence.
my $max = 0; #max of probability
my $candi = 0; #candidate of probability
my $max_state = ""; #the hidden state of max probability

#get transition probability data
print STDERR "Get transition probability data\n";
open(TRANS, $trans_file) || die "cannot open $emit_file, $!\n";
while(){
chomp($_);
$ncol = 0;
while($_ =~ /^([^\t]+)\t/){
$trans{"$nrow:$ncol"} = log($1);
++$ncol;
$_ =~ s/^([^\t]+)\t//;
}
$trans{"$nrow:$ncol"} = log($_);
++$ncol;
++$nrow;
}
close TRANS;

#check transition data
if($nrow != $ncol){
print STDERR "The transition probability file is fault.\n";
exit;
}

#get initialization probability
print STDERR "Set initialization probability as uniform distribution.\n";
for(my $i=0; $i<$nrow; ++$i){
push(@init, log(1/$nrow));
}

#get emission probability data
print STDERR "Get emittion probability data\n";
open(EMIT, $emit_file) || die "cannot open $emit_file, $!\n";
while(){
#get trinucleotide
chomp($_);
$_ =~ s/^([^\t]+)\t//;
$nuc = $1;
$ncol = 0;
while($_ =~ /^([^\t]+)\t/){
$emit{"$ncol:$nuc"} = log($1);
++$ncol;
$_ =~ s/^[^\t]+\t//;
}
$emit{"$ncol:$nuc"} = log($_);
++$ncol;
if($nrow != $ncol){
print STDERR "The number of hidden state doesn't correspond to emittion data.\n";
exit;
}
}
close EMIT;

#get sequence file
print STDERR "Get sequence data\n";
open(IN, $seq_file) || die "Cannot open $seq_file,$!\n";
while(){
chomp($_);
$seq .= uc($_);
}
close IN;

#apply viterbi algorithm to 3-frame sequence
for(my $i=0; $i<3; ++$i){
$seq2 = $seq;
print STDERR "Apply viterbi algorithm ";
if($i == 0){
print STDERR "to first frame sequence.\n";
}elsif($i == 1){
$seq2 =~ s/^\w//;
print STDERR "to second frame sequence.\n";
}else{
$seq2 =~ s/^\w{2}//;
print STDERR "to third frame sequence.\n";
}

#initialize
%prob = ();
%path = ();
$num = 0;
$seq2 =~ s/^(\w{3})//;
$nuc = $1;
for(my $j=0; $j<$nrow; ++$j){
$prob{"$num:$j"} = $init[$j] + $emit{"$j:$nuc"};
$path{"$num:$j"} = $j;
}
++$num;

#Main loop
while($seq2 =~ /^(\w{3})/){
$nuc = $1;
for(my $j=0; $j<$nrow; ++$j){
$max = -100000000000000;
$max_state = "";
--$num;
for(my $k=0; $k<$nrow; ++$k){
$candi = $prob{"$num:$k"} + $trans{"$k:$j"} + $emit{"$nuc:$j"};
if($candi > $max){
$max = $candi;
$max_state = $path{"$num:$k"};
}
}
++$num;

#add maximum probability and path
#print STDERR "$max_state\t$max\n";
$prob{"$num:$j"} = $max;
$path{"$num:$j"} = $max_state . $j;
}
++$num;
$seq2 =~ s/^\w{3}//;
}

#Finale
$max = -1000000000000000;
$max_state = "";
--$num;
for(my $k=0; $k<$nrow; ++$k){
$candi = $prob{"$num:$k"};
if($candi > $max){
$max = $candi;
$max_state = $k;
}
}
if($i == 0){
print STDOUT "#First frame\n";
}elsif($i == 1){
print STDOUT "#Second frame\n";
}else{
print STDOUT "#Third frame\n";
}
print STDOUT "#Optimal probability (log): $max\n";
print STDOUT "#Optimal path : ";
print STDOUT $path{"$num:$max_state"};
print STDOUT "\n";
}

exit;

############################### init ##################################

sub init{
my $seq_file = "";
my $trans_file = "";
my $emit_file = "";
if(defined $ARGV[0] and $#ARGV == 2){
$seq_file = shift(@ARGV);
$trans_file = shift(@ARGV);
$emit_file = shift(@ARGV);
}else{
print STDERR "Usage : $0 [sequence file] [transition probability file] [emittion probability file], $!\n";
exit;
}
return($seq_file, $trans_file, $emit_file);
}

その結果は

[user-mac:~/viterbi] user% perl viterbi.pl sequence.txt transition_matrix emission_matrix
Get transition probability data
Set initialization probability as uniform distribution.
Get emittion probability data
Get sequence data
Apply viterbi algorithm to first frame sequence.
#First frame
#Optimal probability (log): -66.8781620581004
#Optimal path : 0000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000
Apply viterbi algorithm to second frame sequence.
#Second frame
#Optimal probability (log): -66.3444024177054
#Optimal path : 0000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000
Apply viterbi algorithm to third frame sequence.
#Third frame
#Optimal probability (log): -66.8688445914625
#Optimal path : 0000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000

全部0って。結局失敗(ToT)
結構モデルやパラメータを簡単に作っちゃったからな〜。
改良したらうまくいくかな(?_?)