DNA Replication – Frequent Words, Reverse Complement, Pattern Matching, Clump Finding, Skewi, Mismatches

Genome replication is one of the most important tasks carried out in the cell. Before a cell can divide, it must first replicate its genome so that each of the two daughter cells inherits its own copy.

Replication begins in a genomic region called the replication origin (denoted oriC) and is carried out by molecular copy machines called DNA polymerases.

Locating oriC presents an important task not only for understanding how cells replicate but also for various biomedical problems. For example, some gene therapy methods use genetically engineered mini-genomes, which are called viral vectors because they are able to penetrate cell walls (just like real viruses). Viral vectors carrying artificial genes have been widely used in agriculture.
In the following problem, we assume that a genome has a single oriC.

1.Frequent Words

:  Count(Text, Pattern) = k
The Pattern = k-mer to refer to a string of length k
k = The number of times that a k-mer Pattern appears as a substring of Text
For example,  Count(ACAACTATGCATACTATCGGGAACTATCCT, ACTAT) = 3
Count(CGATATATCCATAG, ATA) =  3  (not 2) , count for overlapping occurrences of Pattern in Text.

|Text| – k + 1 A straightforward algorithm for finding the most frequent words in a string Text
The overall number of steps of this algorithm is at most (|Text| – k + 1) • (|Text| – k + 1) • k.
The runtime of this algorithm has an upper bound of |Text|2 • k steps and refer to the complexity of this algorithm as O(|Text|2 • k).

Example   input ACGTTGCATGTCGCATGATGCATGAGAGCT, 4 = > output    CATG GCAT

import collections   

file = open('w1_1_data_set3.txt', 'r')
line = 1
kstr = "";
kmer = 12
for ldate in file:
    if (line==1):
        kstr = ldate
    elif (line==2):    
        kmer = int(ldate)
    line+=1        

    count = 0
    words = collections.defaultdict(list)
    mwords=[]
    for i in range(0, len(kstr) -kmer+1):
        word =  kstr[i:i+kmer]
        if len(word)==kmer:
            if word in words:
                words[word] = words[word]+1
                if words[word] > count :
                    count = words[word]
                    mwords=[]
                    mwords.append(word)
                elif words[word] == count:
                    mwords.append(word)
            else:
                words[word] = 1
  
  for w in mwords:
        print w

2. Reverse Complement:

The nucleotides A and T are complements of each other, as are G and C

example:  AAAACCCGGT = > ACCGGGTTTT

in_file = open('w1_2_data_set0.txt', 'r')

line = 1
in_str = ''
for in_data in in_file:
in_str = in_data.strip(' \t\n\r')

ou_str = in_str
ou_str = ou_str.replace("A","1")
ou_str = ou_str.replace("G","2")
ou_str = ou_str.replace("T","A")
ou_str = ou_str.replace("1","T")
ou_str = ou_str.replace("C","G")
ou_str = ou_str.replace("2","C")
print  ou_str[::-1])

Vibrio cholerae – which consists of 1,108,250 nucleotides.

The initiation of replication is mediated by DnaA, a protein that binds to a short segment within the oriC known as a DnaA box.

3. Pattern Matching

: Find all occurrences of a pattern in a string.
Sample Input:
ATAT
GATATATGCATATACTT

Sample Output:
1 3 9

1    3              9
GATATATGCATATACTT

for n in xrange(len(in_Genome)):
if (in_Genome.find(in_Pattern, n) == n):
print n

Thermotoga petrophila, a bacterium that strives in extremely hot environments; its name derives from its discovery in the water beneath oil reservoirs, where temperatures can exceed 80° Celsius (176° Fahrenheit), which consists of  1,823,512  (2 millions) nucleotides.

Ori-Finder, (http://tubic.tju.edu.cn/Ori-Finder/) a software tool for finding replication origins in DNA sequences.

4.Clump Finding

Find patterns forming clumps in a string.
Input: A string Genome, and integers k, L, and t.
Output: All distinct k-mers forming (L, t)-clumps in Genome.
L = fixed length (short interval)  the genome.
k = k-mer as a “clump” if it appears many times within a short interval of the genome.

Given integers L and t, a k-mer Pattern forms an (L, t)-clump inside a (larger) string Genome if there is an interval of Genome of length L in which this k-mer appears at least t times.

Sample Input:
CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA
5 50 4

Sample Output:
CGACA GAAGA

CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGA [(‘CGACA’, 4)]
GGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAG [(‘CGACA’, 4)]


GACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGA []

ATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGA [(‘GAAGA’, 4)]
TGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAG

CAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA []

import collections

in_file = open('w1_4_data_set0.txt','r')
ou_file = open('data_set_out.txt', 'w')

line = 1
genome = "";
k=0
l=0
t=0

for in_data in in_file:
if (line==1):
genome = in_data.strip(' \t\n\r')
elif (line==2):
in_line2 = (in_data.strip(' \t\n\r')).split()
k = int(in_line2[0])
l = int(in_line2[1])
t = int(in_line2[2])
line+=1

#print in_genome
print k, l, t

clumps=[]
for j in xrange(len(genome)):
kstr = genome[j:j+l]
if (len(kstr)==l):

patterns = collections.defaultdict(list)
for i in range(0, len(kstr)-k+1):
c =  kstr[i:i+k]
if c  not in clumps:
if c in patterns:
patterns[c] += 1
if patterns[c] == t:
clumps.append(c)
else:
patterns[c] = 1
print clumps

Solve the Clump Finding Problem by applying  Frequent Words Problem algorithm. However, if the FWP is not very efficient, then such an approach may be impractical.
For example, the FWP took roughly L2 • k steps. Applying this algorithm to each of length L in Genome will take approximately L2 • k • |Genome| steps. Since the length of most bacterial genomes varies from 1 million to 10 million nucleotides, such an approach may take your computer too long to solve the Clump Finding Problem.
E. coli genome of  4,639,676  (4.6 millions) nucleotides

http://education-portal.com/academy/lesson/how-okazaki-fragments-of-the-lagging-strand-dna-are-replicated.html#lesson

The replication terminus (denoted terC) is located roughly opposite to oriC in the chromosome

5. Skewi(Genome)

as the difference between the total number of occurrences of G and C in the first i nucleotides of Genome. A skew diagram is defined by plotting each value of i (as i ranges from 0 to |Genome|) against Skewi (Genome).

skew

Sample Input:
CATGGGCATCGGCCATACGCC
Sample Output:
0 -1 -1 -1 0 1 2 1 1 1 0 1 2 1 0 0 0 0 -1 0 -1 -2

for i,v in enumerate(in_genome):
if (v=="C"):
count_c+=1
elif(v=="G"):
count_g+=1
print count_g-count_c

6.Minimum Skew:

Find a position in a genome minimizing the skew.

Input: A DNA string Genome.
Output: All integer(s) i minimizing Skewi (Genome) among all values of i (from 0 to |Genome|).
Sample Input:
TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT
Sample Output:
11 24

in_file = open('w1_5_data_set0.txt', 'r')

line = 1
in_genome = "";
for in_data in in_file:
if (line==1):
in_genome = in_data.strip(' \t\n\r')
line+=1

count_c =0
count_g =0
max_diff = 0
pos_diff = ""
for i,v in enumerate(in_genome):
if (v=="C"):
count_c+=1
elif(v=="G"):
count_g+=1
diff_cg = count_c - count_g
#skew diagram
#print v,count_c, count_g, (count_g-count_c)
if (diff_cg>0 and diff_cg > max_diff):
pos_diff = str(i+1) + " "
max_diff = diff_cg
elif (diff_cg == max_diff):
pos_diff += str(i+1) + " "
#print v,"\t",count_c,"\t", count_g,"\t", diff_cg,"\t", max_diff,"\t", pos_diff

print pos_diff

7. Approximate Pattern Matching:

The position i in k-mers p1 … pk and q1 … qk is a mismatch if pi ? qi. For example, CGAAT and CGGAC have two mismatches.
Sample Input:
ATTCTGGA
CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT
3
Sample Output:
6 7 26 27

in_file = open('w1_6_data_set0.txt', 'r')

line = 1
in_pattern = "";
in_genome = "";
in_mistake = 0
ou_result =""
for in_data in in_file:
if (line==1):
in_pattern = in_data.strip(' \t\n\r')
elif (line==2):
in_genome = in_data.strip(' \t\n\r')
elif (line==3):
in_mistake = int(in_data.strip(' \t\n\r'))
line+=1

kmer = len(in_pattern)

def FindMistake(v):
mistake = 0
for i in range(0, kmer, 1):
if (v[i]!=in_pattern[i]):
mistake+=1

if(mistake>in_mistake):
return False
return True

for i in xrange(len(in_genome)-kmer+1):
v = in_genome[i:i+kmer]
if FindMistake(v):
ou_result+= str(i) + " "

print ou_result

8. Frequent Words with Mismatches

Sample Input:
ACGTTGCATGTCGCATGATGCATGAGAGCT 4 1
Sample Output:

GATG ATGC ATGT

import collections
import itertools

in_file = open('w1_7_data_set0.txt', 'r')

line = 1
in_genome = "";
in_kmer =0
in_mistake = 0
in_result=""
for in_data in in_file:
if (line==1):
line1 = (in_data.strip(' \t\n\r')).split()
in_genome = line1[0]
in_kmer = int(line1[1])
in_mistake = int(line1[2])
line+=1

distinct_letters = "".join(set(in_genome))

def diff_letters(a,b):
count = 0
for i in range(len(a)):
if(a[i] != b[i]):
count +=1
if(count>in_mistake):
return False
return True

genomes = [in_genome[i:i+in_kmer] for i in range(0, len(in_genome), 1) if len(in_genome[i:i+in_kmer]) ==in_kmer]
patterns = map("".join, itertools.product(distinct_letters, repeat=in_kmer))

mis_patterns=collections.defaultdict(list)
max_count =0
for i,iv in enumerate(patterns):
count = 0
for j,jv in enumerate(genomes):
if diff_letters(iv,jv):
count+=1
mis_patterns[iv]=count
if count>0 and count>max_count:
max_count = count

for i,p in enumerate(mis_patterns):
if mis_patterns[p]==max_count:
print i,p, mis_patterns[p]

9. Frequent Words with Mismatches and Reverse Complements:

Find the most frequent k-mers (with mismatches and reverse complements) in a DNA string.  All k-mers Pattern maximizing the sum Countd(Text, Pattern) + Countd(Text, Pattern)   over all possible k-mers.

Sample Input:
ACGTTGCATGTCGCATGATGCATGAGAGCT
4 1

Sample Output:
ATGT ACAT

import collections
import itertools

in_file = open('w1_8_data_set0.txt', 'r')

line = 1
in_genome = "";
in_kmer =0
in_mistake = 0
in_result=""
for in_data in in_file:
if (line==1):
in_genome = in_data.strip(' \t\n\r')
elif (line==2):
line2 = (in_data.strip(' \t\n\r')).split()
in_kmer = int(line2[0])
in_mistake = int(line2[1])
line+=1

distinct_letters = "".join(set(in_genome))
pos_genomes = [in_genome[i:i+in_kmer] for i in range(0, len(in_genome), 1) if len(in_genome[i:i+in_kmer]) == in_kmer]

patterns = map("".join, itertools.product(distinct_letters, repeat=in_kmer))

def diff_letters(a,b):
count = 0
for i in range(len(a)):
if(a[i] != b[i]):
count +=1
if(count>in_mistake):
return False
return True

def Reverse(in_str):
ou_str = in_str
ou_str = ou_str.replace("A","1")
ou_str = ou_str.replace("G","2")
ou_str = ou_str.replace("T","A")
ou_str = ou_str.replace("1","T")
ou_str = ou_str.replace("C","G")
ou_str = ou_str.replace("2","C")
return ou_str[::-1]

reverse_pos_genomes = []
for j,g in enumerate(pos_genomes):
rev = Reverse(g)
reverse_pos_genomes.append(rev)

genomes = pos_genomes + reverse_pos_genomes

mis_patterns=collections.defaultdict(list)
max_count = 0
for i,pp in enumerate(patterns):
count = 0
for j,g in enumerate(genomes):
if(diff_letters(pp,g)):
count+=1
mis_patterns[pp]=count
if count>0 and count>max_count:
max_count = count

for i,p in enumerate(mis_patterns):
if mis_patterns[p]==max_count:
print i,p, mis_patterns[p]
Advertisement

2 responses to “DNA Replication – Frequent Words, Reverse Complement, Pattern Matching, Clump Finding, Skewi, Mismatches

  1. Sean February 17, 2015 at 7:05 pm

    Great post! However when I run #9 I get “0 6” as output, not the actual stings. Any thoughts? Also do you have this information on a gist of some sort. It’s a bit hard to read without indentation.

  2. Anonymous September 12, 2015 at 3:30 pm

    What is the runtime of applying the FrequentWords algorithm to every wndowof length Lin a
    DNA string Genome in order to find k-mers forming (L, f)-clumps?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: