MOTIF

MOTIF ENUMERATION is unfortunately rather slow for large values of k and d,
1. Implanted Motif: Find all (k, d)-motifs in a collection of strings.
Input: A collection of strings Dna, and integers k and d.
3 1
ATTTGGC
TGCCTTA
CGGTATC
GAAAATT

Output: All (k, d)-motifs in Dna. ATA ATT GTT TTT

Any (k, d)-motif must be at most d mismatches apart from some k-mer appearing in one of the strings of Dna, generate all such k-mers and then check which of them are (k, d)-motifs.

import collections   
import itertools

# read input
in_file = open('w_3_1_data_set1.txt', 'r')
ou_file = open('data_set_out.txt', 'w')

line = 1
in_kmer = 0
in_dmut=0
in_dna=[]
in_result=''
ldata = ''
for in_data in in_file:
    ldata = in_data.strip(' \t\n\r')
    if (line==1):
        ldatav = ldata.split()
        in_kmer=int(ldatav[0])
        in_dmut=int(ldatav[1])
    elif (line>1 and len(ldata)!=0):
        in_dna.append(ldata)
    elif (line==0 or len(ldata)==0):
        line = -1
        in_result = ldata
    line+=1
    
dist_dna ="ACTG"    

#generate ACTG, 3 kmer =>  AAA,AAT,AAC, AAG,...
motifs =  map("".join, itertools.product(dist_dna, repeat=in_kmer))

#convert dna to dna kmer
dnas={}
for d in in_dna:
    sd= [d[i:i+in_kmer] for i in range(0, len(d)-in_kmer+1, 1)]
    dnas[d] = sd

# find mistake
ou_motif=[]
def FindMistake(iv,kv):
    mistake = 0
    for i,v in enumerate(iv):
        if (v!=kv[i]):
            mistake+=1
            if(mistake>in_dmut):  return False
    return True

#check motifs of dnas
for iv in motifs:
    k=0
    for j in dnas:
        find=False
        for jv in dnas[j]:
            find=FindMistake(iv,jv)
            if (find):
                k+=1
                break
        if(not find): break
    if(k==len(dnas)): ou_motif.append(iv)                
print ou_motif

Score(Motifs).

The number of lower case letters in the motif matrix, column-by-column (3 + 4 + 0 + 0 + 1 + 1 + 1 + 5 + 2 + 3 + 6 + 4 = 30).

The number of lower case letters in the motif matrix, computed row-by-row (3 + 4 + 2 + 4 + 3 + 2 + 3 + 2 + 4 + 3 = 30).

Each element in the sum 3 + 4 + 2 + 4 + 3 + 2 + 3 + 2 + 4 + 3 = 30 represents the number of mismatches between the consensus string TCGGGGATTTCC and a motif represented by a row in the motif matrix.

The number of mismatches between two k-mers v and w is called the Hamming distance between v and w and is denoted by d(v, w). For example, d(TCGGGGATTTCC, TCGGGGgTTTtt) = 3 for the 1st row of the motif matrix below.

Motifs             T C    G   G   G   G    g    T   T    T    t    t  3      
                        c   C    G   G   t    G   A    c    T    T    a   C  4
                        a   C    G   G   G   G   A    T   T    T    t    C  2
                        T   t    G   G   G   G   A    c    T    T    t    t    4
                        a   a    G   G   G   G   A    c    T    T    C  C  3
                        T   t    G   G   G   G   A    c    T    T    C  C  2
                        T   C   G   G   G   G   A    T   T    c    a    t    3
                        T   C   G   G   G   G   A    T   T    c    C   t  2
                        T   a    G   G   G   G   A    a    c    T    a   C  4
                        T   C   G   G   G   t    A    T   a    a    C   C  3

Score               3 + 4 + 0 + 0 + 1 + 1 + 1 + 5 + 2 + 3 + 6 + 4 = 30

Count        A: 2   2      0   0   0   0   9     1   1   1     3   0
C: 1   6      0   0   0   0   0     4   1   2     4   6
G: 0   0   10 10  9   9   1     0   0   0     0   0
T : 7   2     0   0   1   1     0    5   8    7    3   4

Profile       A: .2   .2   0   0   0   0     .9   1   .1   .1     .3   0
                   C: .1   .6   0   0   0    0     0 . 4   .1   .2   .4 .6
                   G: 0   0     1   1   .9 .9     .1   0     0   0     0   0
                   T: .7   .2   0     0   .1 .1     0 .5   .8   .7   .3   .4

Consensus  T  C  G    G  G G    A  T    T  T  C  C

consesus

String            A   C   G   G   G   G   A   T   T   A   C C

Probability    .2 ·.6 · 1 · 1 ·.9 ·.9 ·.9 ·.5 ·.8 ·.1   ·.4 ·.6 = 0.000839808 

The probability Pr(ACGGGGATTACC | Profile) = 0.000839808 

Entropy is a measure of the uncertainty of a probability distribution (p1, …, pN), and is defined as

H(p1,…,pN)=−∑i=1 toN pi⋅log2pi

For example, the entropy of the probability distribution (0.2, 0.6, 0.0, 0.2) corresponding to the 2nd column of the NF-κB profile matrix is:

−(0.2log20.2+0.6log20.6+0.0log20.0+0.2log20.2)≈1.371

whereas the entropy of the more conserved final column (0.0, 0.6, 0.0, 0.4) is:

−(0.0log20.0+0.6log20.6+0.0log20.0+0.4log20.4)≈0.971

and the entropy of the very conserved 5th column (0.0, 0.0, 0.9, 0.1) is:

−(0.0log20.0+0.0log20.0+0.9log20.9+0.1log20.1)≈0.467

2. Median String: Find a median string.
Input: A collection of strings Dna and an integer k.
Output: A k-mer Pattern that minimizes d(Pattern, Dna) among all k-mers Pattern.

Given a k-mer Pattern and a set of strings
Dna = {Dna1, … , Dnat}, we define
td(Pattern,Dna)=∑d(Pattern,Dnai)
i=1
as the sum of distances between Pattern and all strings in Dna. For example, for the following strings Dna,
d(AAA, Dna) = 1 + 1 + 2 + 0 + 1 = 5
:ttaccttAAc 1
gAtAtctgtc 1
Acggcgttcg 2
ccctAAAgag 0
cgtcAgAggt 1
Find a k-mer Pattern that minimizes d(Pattern, Dna) over all k-mers Pattern, call such a Pattern a median string for Dna.
in_file = open('w_3_2_data_set0.txt', 'r')

line = 1
in_kmer = 0
in_dna=[]
in_result=''
res=False
ldata = ''
dist_dna=''
for in_data in in_file:
    ldata = in_data.strip(' \t\n\r')
    if (line==1):
        in_kmer=int(ldata)
    elif (line>1 and len(ldata)!=0):
        in_dna.append(ldata)
    line+=1
    
dist_dna ="ACTG"
motifs =  map("".join, itertools.product(dist_dna, repeat=in_kmer))

#print motifs
ou_motif=[]
def FindMistake(iv,kv):
    mistake = 0
    for i,v in enumerate(iv):
        if (v!=kv[i]):
            mistake+=1
    return mistake

#convert dna to dna kmer
dnas={}
for d in in_dna:
    sd= [d[i:i+in_kmer] for i in range(0, len(d)-in_kmer+1, 1)]
    dnas[d] = sd

'''
3
AAATTGACGCAT
GACGACCACGTT
CGTCAGCGCCTG
GCTGAGCACCGG
AGTACGGGACAG

GAC

md_sum = 5*3= 15
AAA => AAATTGACGCAT => AGT 2(mismatch) 3(c =least mismatch), GTA 2 2, TAC 2 2,
ACG 2 2, CGG 3 2, GGG 3 2, GGA 2 2,GAC 2 2, ACA 1 2, CAG 2 1, sum_c = 1
AAA => GACGACCACGTT => ... 2(c=least mismatch), sum_c= 3
AAA => GCTGAGCACCGG => ... 2(c=least mismatch), sum_c= 5
AAA => CGTCAGCGCCTG => ... 2(c=least mismatch), sum_c= 7
(sum_c<=md_sum) => 7<15 =>  md_sum = 7, md_va = AAA
AAC => sum_c = 5 =>  => 5<7 =>  md_sum = 5, md_va = AAC
AAG => sum_c = 5 =>  => 5<7 =>  md_sum = 5, md_va = AAG
ACA => sum_c = 4 =>  => 5<7 =>  md_sum = 4, md_va = ACA
ACG => sum_c = 2 =>  => 2<5 =>  md_sum = 2, md_va = ACG
...
GAC => sum_c = 2 =>  => 2<5 =>  md_sum = 2, md_va = GAC

2 GAC
'''
md_sum= len(in_dna) * in_kmer
md_va=''
for iv in motifs:
    sum_c=0
    for k in dnas:
        c=in_kmer
        for kv in dnas[k]:
            nm = FindMistake(iv,kv)
            if(nm==0):
                c=0
                break
            elif(nm<c):
                c=nm
        sum_c+=c
        #print sum_c, md_sum
        if (md_sum>0 and sum_c>md_sum): break
    if(sum_c<=md_sum):
        md_sum = sum_c
        md_va= iv
    print sum_c, iv, md_sum,md_va

print md_sum,md_va

3.Profile-most Probable k-mer.
Input:
A string Text – ACCTGTTTATTGCCTAAGTTCCGAACAAACCCAATATAGCCCGAGGGCCT
an integer k – 5
a k × 4 matrix Profile
A C G T
0.2 0.4 0.3 0.1
0.2 0.3 0.3 0.2
0.3 0.1 0.5 0.1
0.2 0.5 0.2 0.1
0.3 0.1 0.4 0.2

A Profile-most probable k-mer in Text Output: CCGAG

in_file = open('w_3_3_data_set0.txt', 'r')
line = 1
in_kmer = 0
in_dna=''
in_profile=[]
in_matrix=[]
ldata = ''
m=0
for in_data in in_file:
    ldata = in_data.strip(' \t\n\r')
    if (line==1):
        in_dna = ldata
    elif (line==2):
        in_kmer=int(ldata)
    elif (line==3):
        in_profile=ldata.split()    
    elif (line>1 and len(ldata)!=0):
        m+=1
        in_matrix.append(ldata)
    line+=1
    
d_pr_matrix = {}
for i,p in enumerate(in_profile):
    pm=[]
    for j in range(0,m):
        pm.append((in_matrix[j].split())[i])
    d_pr_matrix[p]=pm
    
max_mp=0
max_dna=''
for i in range(0, len(in_dna)-in_kmer+1):
    p = in_dna[i:i+in_kmer]
    sum_mp=sum(float(d_pr_matrix[pl][j]) for j,pl in enumerate(p))
    if(sum_mp> max_mp):
        max_mp = sum_mp
        max_dna = p

print    max_mp,   max_dna

A DNA array (also known as a microarray) is a collection of DNA molecules attached to a solid surface. Each spot on the microarray is assigned a unique DNA sequence called a probe that measures the expression level of a specific gene, known as a target.

Dormancy survival regulator (DosR), a transcription factor that regulates many genes whose expression dramatically changes under hypoxic conditions.

4.Greedy MOTIF Search.
Input:
3 5
GGCGTTCAGGCA
AAGAATCAGTCA
CAAGGAGTTCGC
CACGTCAATCAC
CAATAATATTCG

Output:
CAG
CAG
CAA
CAA
CAA

The greedy motif search algorithm iteratively finds k-mers in the 1st, string from Dna, 2nd string from Dna, 3rd string from Dna, etc. After finding i – 1 k-mers Motifs in the first i – 1 strings of Dna, this algorithm constructs Profile(Motifs) and selects the Profile-most probable k-mer from the i-th string based on this profile matix (ties are broken arbitrarily).
To form the initial motif matrix BestMotifs, the algorithm starts by selecting arbitrary k-mers in each string from Dna; the following pseudocode selects the first k-mer in each string.
Greedy MOTIF Search(Dna, k,t)
form a set of k-mers BestMotifs by selecting 1st k-mers in each string from Dna
for each k-mer Motif in the 1st string from Dna
Motif1 ← Motif
for i = 2 to t
form Profile from motifs Motif1, …, Motifi – 1
Motifi ← Profile-most probable k-mer in the i-th string in Dna
Motifs ← (Motif1, …, Motift)
if Score(Motifs) < Score(BestMotifs)
BestMotifs ← Motifs
output BestMotifs

import collections   
import itertools
import datetime

in_file = open('w_3_4_data_set1.txt', 'r')

line = 1
in_kmer = 0
in_dna=[]
ldata = ''
in_m=0
for in_data in in_file:
    ldata = in_data.strip(' \t\n\r')
    if (line==1):
        in_kmer = int(ldata.split()[0])
        in_m = int(ldata.split()[1])
    elif (line>=2):
        in_dna.append(ldata)    
    line+=1
    
motifs=[]
profile_motif_matrix={}

'''
GGCGTTCAGGCA
GGC A[0.0, 0.0, 0.0] C[0.0, 0.0, 1.0] T[0.0, 0.0, 0.0] G[1.0, 1.0, 0.0]
    => ['AAGAATCAGTCA', 'CAAGGAGTTCGC', 'CACGTCAATCAC', 'CAATAATATTCG']
    => ['GGC', 'AAG', 'CAA', 'CAC', 'CAA'] (max_probability_dna_kmer)
    => A[1.0, 4.0, 2.0], C[3.0, 0.0, 2.0], T[0.0, 0.0, 0.0], G[1.0, 1.0, 1.0]/5(in_m)
        => max(Col1 1,3,0,1) => 3 - 5(in_m) = 2 (score_motif)
        => max(Col2 4,0,0,1) => 4 - 5(in_m) = 1 (score_motif)
        => max(Col3 2,2,0,1) => 2 - 5(in_m) = 3 (score_motif) => 6.0
    => A[0.2, 0.8, 0.4], C[0.6, 0.0, 0.4], T[0.0, 0.0, 0.0], G[0.2, 0.2, 0.2]
GCG A[0.0, 0.0, 0.0] C[0.0, 1.0, 0.0] T[0.0, 0.0, 0.0] G[1.0, 0.0, 1.0]
    => ['AAGAATCAGTCA', 'CAAGGAGTTCGC', 'CACGTCAATCAC', 'CAATAATATTCG']
    => ['GCG', 'AAG', 'CAA', 'CAC', 'CAA'] (max_probability_dna_kmer)
    => A[0.2, 0.8, 0.4], C[0.6, 0.2, 0.2], T[0.0, 0.0, 0.0], G[0.2, 0.0, 0.4] (/5)
...
GCA A[0.0, 0.0, 1.0] C[0.0, 1.0, 0.0] T[0.0, 0.0, 0.0] G[0.0, 0.0, 1.0]
    => ['AAGAATCAGTCA', 'CAAGGAGTTCGC', 'CACGTCAATCAC', 'CAATAATATTCG']
    => ['GCA', 'AAG', 'CAA', 'CAC', 'CAA'] (max_probability_dna_kmer)
    => A[0.2, 0.8, 0.6], C[0.6, 0.2, 0.2], T[0.0, 0.0, 0.0], G[0.2, 0.0, 0.2] (/5)
'''
def ScoreMotif():
    alist=[]
    clist=[]
    tlist=[]
    glist=[]
    for i in range(0,in_kmer):
        alist.append(0.00)
        clist.append(0.00)
        tlist.append(0.00)
        glist.append(0.00)
    for j, motif in enumerate(motifs):
        for i in range(0,in_kmer):
            single =motif[i]
            ac = single.count('A')
            cc = single.count('C')
            tc = single.count('T')
            gc = single.count('G')
            alist[i]=(alist[i]+float(ac))
            clist[i]=(clist[i]+float(cc))
            tlist[i]=(tlist[i]+float(tc))
            glist[i]=(glist[i]+float(gc))
        profile_motif_matrix['A'] =alist
        profile_motif_matrix['C'] =clist
        profile_motif_matrix['G'] =glist
        profile_motif_matrix['T'] =tlist

    score_motif=0
    if(len(motifs)==in_m):
        for i in range(0,in_kmer):
            lv=0
            a = profile_motif_matrix['A'][i]
            c = profile_motif_matrix['C'][i]
            g = profile_motif_matrix['G'][i]
            t = profile_motif_matrix['T'][i]
            if(a>lv and a!=0):
                lv=a
            if(c>lv and c!=0):
                lv=c
            if(g>lv and g!=0):
                lv=g
            if(t>lv and t!=0):
                lv=t
            lv=in_m-lv   
            score_motif+=lv

    for i in range(0,in_kmer):
        profile_motif_matrix['A'][i]/=(j+1.0)
        profile_motif_matrix['C'][i]/=(j+1.0)
        profile_motif_matrix['G'][i]/=(j+1.0)
        profile_motif_matrix['T'][i]/=(j+1.0)
    return score_motif

'''
['GGCGTTCAGGCA', 'AAGAATCAGTCA', 'CAAGGAGTTCGC', 'CACGTCAATCAC', 'CAATAATATTCG']
GGC
    => motifs[GGC]
    =>'AAGAATCAGTCA' => AAG 0 (probability_dna_kmer), AGA 0, GAA 0, ...
        => AAG 0 (max_probability_dna_kmer)
        => motifs[GGC, AAG]
    =>'CAAGGAGTTCGC' => CAA 0 (probability_dna_kmer), AAG 0, AGG 0, ...
        => CAA 0 (max_probability_dna_kmer)
        => motifs[GGC, AAG, CAA]
    ...
    motifs['GGC', 'AAG', 'CAA', 'CAC', 'CAA']
        => motif_score = 6.0
        => best_motif ['GGC', 'AAG', 'CAA', 'CAC', 'CAA'], best_motif_score 6.0
    
GCG
    => motifs[GCG]
    =>'AAGAATCAGTCA' => AAG 0 (probability_dna_kmer), AGA 0, GAA 0, ...
        => AAG 0 (max_probability_dna_kmer)
        => motifs[GCG, AAG]
    =>'CAAGGAGTTCGC' => CAA 0 (probability_dna_kmer), AAG 0, AGG 0, ...
        => CAA 0 (max_probability_dna_kmer)
        => motifs[GCG, AAG, CAA]
    ...
    motifs['GGC', 'AAG', 'CAA', 'CAC', 'CAA']
        => motif_score = 6.0
        => best_motif ['GGC', 'AAG', 'CAA', 'CAC', 'CAA'], best_motif_score 6.0
...
CAG
    => motifs[CAG]
    =>'AAGAATCAGTCA' => AAG 1.x (probability_dna_kmer), AGA 1.x, ... CAG 1...
        => CAG 1 (max_probability_dna_kmer)
        => motifs[CAG, CAG]
    =>'CAAGGAGTTCGC' => CAA 0 (probability_dna_kmer), AAG 0, AGG 0, ...
        => CAA 0 (max_probability_dna_kmer)
        => motifs[CAG, CAG, CAA]
    ...
    motifs['CAG', 'CAG', 'CAA', 'CAC', 'CAA']
        => motif_score = 3.0
        => best_motif ['CAG', 'CAG', 'CAA', 'CAC', 'CAA'], best_motif_score 3.0

ans ['CAG', 'CAG', 'CAA', 'CAC', 'CAA']
'''
best_motif_score =in_m*in_kmer
best_motif= []
for k in range(0,len(in_dna[0])-in_kmer+1):
    dna_kmer = in_dna[0][k:k+in_kmer]
    motifs.append(dna_kmer)
    ScoreMotif()
    for i in range(1,len(in_dna)):
        max_probability_dna_kmer = -1
        dna_kmer=''
        for j in range(0,len(in_dna[i])-in_kmer+1):
            p = in_dna[i][j:j+in_kmer]
            probability_dna_kmer =1
            for l in range(0,len(p)):
                probability_dna_kmer*= profile_motif_matrix[p[l]][l]
            if(probability_dna_kmer>max_probability_dna_kmer):
                max_probability_dna_kmer = probability_dna_kmer
                dna_kmer = p
        motifs.append(dna_kmer)
        motif_score = ScoreMotif()
        if(len(motifs)==in_m):
            if (motif_score<best_motif_score):
                best_motif=motifs
                best_motif_score = motif_score
    motifs=[]
    profile_motif_matrix={}

for i, f in enumerate(best_motif):
    print f

Cromwell’s rule is relevant to the calculation of the probability of a string based on a profile matrix. For example, consider the following Profile.
Profile A: .2 .2 0 0 0 0 .9 .1 .1 .1 .3 0
C: .1 .6 0 0 0 0 0 .4 .1 .2 .4 .6
G: 0 0 1 0 .9 .9 .1 0 0 0 0 0
T: .7 .2 0 0 .1 .1 0 .5 .8 .7 .3 .4
String T C G T G G A T T T C C
Probability .7 •.6 • 1 • 0 •.9 •.9 •.9 •.5 •.8 •.7 •.4 •.6 = 0
Pr(TCGTGGATTTCC | Profile) = 0

Laplace’s Rule of Succession – adding 1 to each element of Count(Motifs)
Motifs T A A C
G T C T
A C T A
A G G T
Count A: 2 1 1 1
C: 0 1 1 1
G: 1 1 1 0
T: 1 1 1 2
Profile A: 2/4 1/4 1/4 1/4
C: 0 1/4 1/4 1/4
G: 1/4 1/4 1/4 0
T: 1/4 1/4 1/4 2/4

Laplace’s Rule of Sucession updates these count and profile matrices to the following:
Count A: 2+1 1+1 1+1 1+1
C: 0+1 1+1 1+1 1+1
G: 1+1 1+1 1+1 0+1
T: 1+1 1+1 1+1 2+1
Profile A: 3/8 2/8 2/8 2/8
C: 1/8 2/8 2/8 2/8
G: 2/8 2/8 2/8 1/8
T: 2/8 2/8 2/8 3/8

Randomized Motif Search may change all t motifs in Motifs = (Motif1, …, Motift) in a single iteration. This strategy may prove reckless, since some correct motifs may potentially be discarded at the next iteration.
Randomized Motif Search(Dna, k, t)
randomly select k-mers Motifs = (Motif1, …, Motift) in each string from Dna
BestMotifs ← Motifs
while forever
Profile ← Profile(Motifs)
Motifs ← Motifs(Profile, Dna)
if Score(Motifs) < Score(BestMotifs)
BestMotifs ← Motifs
else
output BestMotifs
return

Gibbs Sampler is a more cautious iterative algorithm that discards a single k-mer from the current set of motifs at each iteration and replaces it with a new one (or keeps it). It thus moves with more caution in the space of all motifs, as illustrated below.
It starts with randomly chosen k-mers in each of t DNA sequences, but it makes a random rather than a deterministic choice at each iteration. It uses randomly selected k-mers (Motif1, …, Motift) to come up with another (hopefully higher scoring) set of k-mers.
It randomly selects an integer i between 1 and t and then randomly changes only one Motifi.
Gibbs Sampler(Dna, k, t, N)
randomly select k-mers Motifs = (Motif1, …, Motift) in each string from Dna
BestMotifs ← Motifs
for i from 1 to N
i ← Random(t)
construct profile matrix Profile from all strings in Motifs except for Motifi
Motifi ← Profile-randomly generated k-mer in the i-th sequence
if Score(Motifs) < Score(BestMotifs)
BestMotifs ← Motifs
output BestMotifs

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: