Graph – Overlap, De Bruijn

1. String Composition: Form the k-mer composition of a string.
Input: An integer k and a string Text.
Output: Compositionk(Text), where the k-mers are written in lexicographic order.
Composition3(TATGGGGTGC) = ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG

dnas = [in_dna[i:i+k] for i in range(0, len(in_dna)) if len(in_dna[i:i+k])==k]
dnas.sort()

For example TAATGCCATGGGATGTT are linked together to form the genome path.

s1
For example, each of the three occurrences of ATG should be connected to TGC, TGG, and TGT; these connections are overlapping 3-mers. The graph showing all overlap connections.

s2
The structure in the figure above is an example of a graph, or a network of nodes connected by edges. This graph is an example of a directed graph, whose edges have a direction and are represented by arrows (as opposed to undirected graphs whose edges do not have directions).

Sequence much shorter DNA fragments called reads.

For example, arrange the 3-mers lexicographically, which produces the overlap graph shown in the figure below. The genome path has disappeared to the naked eye, but it must be still there.

s3

The original genome path, highlighted in the overlap graph. Solve the String Reconstruction Problem, we are looking for a path in the overlap graph that visits every node exactly once. In general, a path visiting every node once is called a Hamiltonian path, as shown in the figure below (a Hamiltonian path in the overlap graph).

s4

2. Overlap Graph: Construct the overlap graph of a collection of k-mers.
Input: A collection Patterns of k-mers.
ATGCG
GCATG
CATGC
AGGCA
GGCAT

Output: The overlap graph Overlap
(Patterns).
AGGCA -> GGCAT
CATGC -> ATGCG
GCATG -> CATGC
GGCAT -> GCATG

in_patterns.sort()
used = [True] * len(in_patterns)
for i,pat in enumerate(in_patterns):
   k=1
   pr = pat[k:len(pat)]
   for j,mpat in enumerate(in_patterns):
       if (used[j] and pr == mpat[0:len(mpat)-k]):
           used[j] = False
           ou_file.write(pat + " -> " +mpat +"\n")

How to represent graphs in your programs. For example the two graphs in the figure below with five nodes and nine edges. (look different, but both are the same).graph.

s5

Two different renderings of the same graph on five nodes and nine edges.

There are two standard ways of representing a graph. For a directed graph with n nodes, the n × n adjacency matrix (Ai,j) is defined by the following rule: Ai,j = 1 if a directed edge connects node i to node j, and Ai,j = 0 otherwise. Another (more memory-efficient) way of representing a graph is to use an adjacency list, for which we simply list all nodes connected to each node.

adjacency matrix adjacency list
     a b c d e
a   0 1 0 0 1 a adjacent to b and e
b   0 0 0 1 0 b adjacent to d
c   1 0 0 0 0 c adjacent to a
d   1 0 0 0 0 d adjacent to a
e   0 1 1 1 0 e adjacent to b, c, and d

Each pair of consecutive edges represents consecutive 3-mers that overlap in 2 nucleotides. For example, the node with incoming edge CAT and outgoing edge ATG is labeled AT.

s7

Gluing identically labeled nodes. For example, AT, TG and GG are repeated nodes, glue all nodes labeled into a single node. In the figure below, the path with 16 nodes transformed into a graph with 11 nodes. This graph is called the de Bruijn graph of TAATGCCATGGGATGTT, denoted DeBruijn3(TAATGCCATGGGATGTT).

s8

Solving the String Reconstruction Problem reduces to finding a path in the de Bruijn graph that visits every edge exactly once. Such a path is called an Eulerian Path.

3. De Bruijn Graph: Construct the de Bruijn graph of a string.
Input: An integer k and a string Text.
4
AAGATTCTCTAC
Output: DeBruijnk(Text).
AAG -> AGA
AGA -> GAT
ATT -> TTC
CTA -> TAC
CTC -> TCT
GAT -> ATT
TCT -> CTA,CTC
TTC -> TCT

in_kmer = 4
in_dna= AAGATTCTCTAC

'''
AAGATTCTCTAC => [AAGA (AAG, AGA)] [AGAT (AGA, GAT)] ...
kdna => ['AAGA', 'AGAT', 'GATT', 'ATTC', 'TTCT', 'TCTC', 'CTCT', 'TCTA', 'CTAC', 'TAC']
ldna => ['AAG', 'AGA', 'GAT', 'ATT', 'TTC', 'TCT', 'CTC', 'TCT', 'CTA', 'TAC']
rdna => ['AGA', 'GAT', 'ATT', 'TTC', 'TCT', 'CTC', 'TCT', 'CTA', 'TAC', 'TAC']
sort
['AAGA', 'AGAT', 'ATTC', 'CTAC', 'CTCT', 'GATT', 'TAC', 'TCTA', 'TCTC', 'TTCT']
['AAG', 'AGA', 'ATT', 'CTA', 'CTC', 'GAT', 'TAC', 'TCT', 'TCT', 'TTC']
['AGA', 'ATT', 'CTA', 'CTC', 'GAT', 'TAC', 'TAC', 'TCT', 'TCT', 'TTC']
'''
rdna = []
ldna = []
kdna = []
for i in range(0,len(in_dna)):
    dna = in_dna[i:i+in_kmer]
    if (len(dna)>=in_kmer-1):
        kdna.append(dna)
        tdna = dna[0:in_kmer-1]
        ldna.append(tdna)
        if (len(dna)==in_kmer):
            mdna = dna[1:in_kmer] 
        else:
            mdna = dna
        rdna.append(mdna)

kdna.sort()
ldna.sort()
rdna.sort()

'''
i lpat return mdnas
0 AAG  -1     []
1 AGA  -1     [AAG AGA]
...
3 CTA  -1     [AAG AGA, AGA GAT, ATT TTC]
        3     [AAG AGA, AGA GAT, ATT TTC, CTA TAC]
....
'''
def CheckValue(lpat):
   for i,md in enumerate(mdnas):
       md0 = md.split()[0]
       if(md0==lpat):
           return i
   return -1     

'''
(lp=rp && kp=rpat) = C
i lpat lp kp  j rpat rp  C mdnas       pos
0 AAG  AG AGA 0 AGA  AG  T AAG AGA     -1
              ...
              9 TTC  TT  F
1 AGA  GA GAT 4 GAT  GA  T AGA GAT     -1                   
2 ATT  TT TTC 9 TTC  TT  T ATT TTC     -1
3 CTA  TA TAC 5 TAC  TA  T CTA TAC     -1
              6 TAC  TA  T              3,mdnas[3].find(rpat)=F   
...
7 TCT  CT CTA 2 CTA  CT  T TCT CTA     -1
8 TCT  CT CTC 3 CTC  CT  T TCT CTA,CTC  6,mdnas[6].find(rpat)=T   
'''
k= 1
mdnas=[]
for i,lpat in enumerate(ldna):
    lp = lpat[k:len(lpat)]
    kp = kdna[i][1:in_kmer]
    match = ''
    for j,rpat in enumerate(rdna):
        rp =rpat[0:len(rpat)-k]
        if (lp == rp and kp == rpat):
            pos = CheckValue(lpat)
            if (pos==-1):
                mdnas.append(lpat + " " + rpat)
            else:
                if(mdnas[pos].find(rpat)<0):
                    mdnas[pos]+= "," + rpat
                    
print mdnas
'''
['AAG AGA', 'AGA GAT', 'ATT TTC', 'CTA TAC', 'CTC TCT', 'GAT ATT', 'TCT CTA,CTC', 'TTC TCT']
'''

The figure below represents the 3-mer composition of TAATGCCATGGGATGTT as a composition graph CompositionGraph3(TAATGCCATGGGATGTT).

s9

The glue the 15 isolated edges in CompositionGraph3(TAATGCCATGGGATGTT) into a path PathGraph3(TAATGCCATGGGATGTT), which results in DeBruijn3(TAATGCCATGGGATGTT).

s10

Construct de Bruijn graphs without gluing.

Given a collection of k-mers Patterns, the nodes of DeBruijnk(Patterns) are simply all unique (k − 1)-mers occurring as a prefix or suffix in Patterns. For example, say we are given the following Patterns:
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT

Then the set of 11 unique 2-mers occurring as a prefix or suffix in this composition is as follows:
AA AT CA CC GA GC GG GT TA TG TT

For every k-mer in Patterns, we connect its prefix to its suffix by a directed edge to produce DeBruijn(Patterns). You can verify that for the above example, this process produces the same de Bruijn graph that we have been working with (shown below).

s11

4. DeBruijn Graph: Construct the de Bruijn graph from a set of k-mers.
Input: A collection of k-mers Patterns.
GAGG
GGGG
GGGA
CAGG
AGGG
GGAG
Output: The adjacency list of the de Bruijn graph DeBruijn(Patterns).
AGG -> GGG
CAG -> AGG
GAG -> AGG
GGA -> GAG
GGG -> GGA,GGG

in_dna = ['GAGG', 'GGGG', 'GGGA', 'CAGG', 'AGGG', 'GGAG']
'''
dna  = ['GAG', 'GGG', 'GGG', 'CAG', 'AGG', 'GGA']
unqiue dna =  ['AGG', 'CAG', 'GAG', 'GGA', 'GGG']
'''
kmer = len(in_dna[0])-1
dna = [in_dna[i][:kmer] for i in range(0,len(in_dna))]
dna = list(set(dna))
dna.sort()

'''
C = mv.find(dr)
D =d[:kmer]
M = d[:kmer]==dna and mv.find(dr)<0)
dna i d    dr  D   C M mv 
AGG 0 GAGG AGG GAG -1 F
    1 GGGG GGG GGG -1 F
    2 GGGA GGA GGG -1 F 
    3 CAGG AGG CAG -1 F
    4 AGGG GGG AGG -1 T GGG
    5 GGAG GAG GGA -1 F
CAG 3 CAGG AGG CAG -1 T AGG
..
GGG 1 GGGG GGG GGG -1 T GGG
    2 GGGA GGA GGG -1 T GGG, GGA
'''
def FindMatch(dna):
    mv=''
    for i,d in enumerate(in_dna):
        dr = d[1:kmer+1]
        if (d[:kmer]==dna and mv.find(dr)<0):
            mv+=dr +","
    return mv

'''
i d   mv
0 AGG GGG
1 CAG AGG
2 GAG AGG
3 GGA GAG
4 GGG GGG, GGA
'''
for i,d in enumerate(dna):
    mv = FindMatch(d)
    print d + " -> " + mv[:len(mv)-1]
Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: