1. String Composition: Form the k-mer composition of a string.
Input: An integer k and a string Text.
Output: Compositionk(Text), where the k-mers are written in lexicographic order.
Composition3(TATGGGGTGC) = ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG
dnas = [in_dna[i:i+k] for i in range(0, len(in_dna)) if len(in_dna[i:i+k])==k]
For example TAATGCCATGGGATGTT are linked together to form the genome path.
For example, each of the three occurrences of ATG should be connected to TGC, TGG, and TGT; these connections are overlapping 3-mers. The graph showing all overlap connections.
The structure in the figure above is an example of a graph, or a network of nodes connected by edges. This graph is an example of a directed graph, whose edges have a direction and are represented by arrows (as opposed to undirected graphs whose edges do not have directions).
Sequence much shorter DNA fragments called reads.
For example, arrange the 3-mers lexicographically, which produces the overlap graph shown in the figure below. The genome path has disappeared to the naked eye, but it must be still there.
The original genome path, highlighted in the overlap graph. Solve the String Reconstruction Problem, we are looking for a path in the overlap graph that visits every node exactly once. In general, a path visiting every node once is called a Hamiltonian path, as shown in the figure below (a Hamiltonian path in the overlap graph).
2. Overlap Graph: Construct the overlap graph of a collection of k-mers.
Input: A collection Patterns of k-mers.
Output: The overlap graph Overlap
AGGCA -> GGCAT
CATGC -> ATGCG
GCATG -> CATGC
GGCAT -> GCATG
used = [True] * len(in_patterns)
for i,pat in enumerate(in_patterns):
pr = pat[k:len(pat)]
for j,mpat in enumerate(in_patterns):
if (used[j] and pr == mpat[0:len(mpat)-k]):
used[j] = False
ou_file.write(pat + " -> " +mpat +"\n")
How to represent graphs in your programs. For example the two graphs in the figure below with five nodes and nine edges. (look different, but both are the same).graph.
Two different renderings of the same graph on five nodes and nine edges.
There are two standard ways of representing a graph. For a directed graph with n nodes, the n × n adjacency matrix (Ai,j) is defined by the following rule: Ai,j = 1 if a directed edge connects node i to node j, and Ai,j = 0 otherwise. Another (more memory-efficient) way of representing a graph is to use an adjacency list, for which we simply list all nodes connected to each node.
| a b c d e
|a 0 1 0 0 1
||a adjacent to b and e
|b 0 0 0 1 0
||b adjacent to d
|c 1 0 0 0 0
||c adjacent to a
|d 1 0 0 0 0
||d adjacent to a
|e 0 1 1 1 0
||e adjacent to b, c, and d
Each pair of consecutive edges represents consecutive 3-mers that overlap in 2 nucleotides. For example, the node with incoming edge CAT and outgoing edge ATG is labeled AT.
Gluing identically labeled nodes. For example, AT, TG and GG are repeated nodes, glue all nodes labeled into a single node. In the figure below, the path with 16 nodes transformed into a graph with 11 nodes. This graph is called the de Bruijn graph of TAATGCCATGGGATGTT, denoted DeBruijn3(TAATGCCATGGGATGTT).
Solving the String Reconstruction Problem reduces to finding a path in the de Bruijn graph that visits every edge exactly once. Such a path is called an Eulerian Path.
3. De Bruijn Graph: Construct the de Bruijn graph of a string.
Input: An integer k and a string Text.
AAG -> AGA
AGA -> GAT
ATT -> TTC
CTA -> TAC
CTC -> TCT
GAT -> ATT
TCT -> CTA,CTC
TTC -> TCT
in_kmer = 4
AAGATTCTCTAC => [AAGA (AAG, AGA)] [AGAT (AGA, GAT)] ...
kdna => ['AAGA', 'AGAT', 'GATT', 'ATTC', 'TTCT', 'TCTC', 'CTCT', 'TCTA', 'CTAC', 'TAC']
ldna => ['AAG', 'AGA', 'GAT', 'ATT', 'TTC', 'TCT', 'CTC', 'TCT', 'CTA', 'TAC']
rdna => ['AGA', 'GAT', 'ATT', 'TTC', 'TCT', 'CTC', 'TCT', 'CTA', 'TAC', 'TAC']
['AAGA', 'AGAT', 'ATTC', 'CTAC', 'CTCT', 'GATT', 'TAC', 'TCTA', 'TCTC', 'TTCT']
['AAG', 'AGA', 'ATT', 'CTA', 'CTC', 'GAT', 'TAC', 'TCT', 'TCT', 'TTC']
['AGA', 'ATT', 'CTA', 'CTC', 'GAT', 'TAC', 'TAC', 'TCT', 'TCT', 'TTC']
rdna = 
ldna = 
kdna = 
for i in range(0,len(in_dna)):
dna = in_dna[i:i+in_kmer]
tdna = dna[0:in_kmer-1]
mdna = dna[1:in_kmer]
mdna = dna
i lpat return mdnas
0 AAG -1 
1 AGA -1 [AAG AGA]
3 CTA -1 [AAG AGA, AGA GAT, ATT TTC]
3 [AAG AGA, AGA GAT, ATT TTC, CTA TAC]
for i,md in enumerate(mdnas):
md0 = md.split()
(lp=rp && kp=rpat) = C
i lpat lp kp j rpat rp C mdnas pos
0 AAG AG AGA 0 AGA AG T AAG AGA -1
9 TTC TT F
1 AGA GA GAT 4 GAT GA T AGA GAT -1
2 ATT TT TTC 9 TTC TT T ATT TTC -1
3 CTA TA TAC 5 TAC TA T CTA TAC -1
6 TAC TA T 3,mdnas.find(rpat)=F
7 TCT CT CTA 2 CTA CT T TCT CTA -1
8 TCT CT CTC 3 CTC CT T TCT CTA,CTC 6,mdnas.find(rpat)=T
for i,lpat in enumerate(ldna):
lp = lpat[k:len(lpat)]
kp = kdna[i][1:in_kmer]
match = ''
for j,rpat in enumerate(rdna):
if (lp == rp and kp == rpat):
pos = CheckValue(lpat)
mdnas.append(lpat + " " + rpat)
mdnas[pos]+= "," + rpat
['AAG AGA', 'AGA GAT', 'ATT TTC', 'CTA TAC', 'CTC TCT', 'GAT ATT', 'TCT CTA,CTC', 'TTC TCT']
The figure below represents the 3-mer composition of TAATGCCATGGGATGTT as a composition graph CompositionGraph3(TAATGCCATGGGATGTT).
The glue the 15 isolated edges in CompositionGraph3(TAATGCCATGGGATGTT) into a path PathGraph3(TAATGCCATGGGATGTT), which results in DeBruijn3(TAATGCCATGGGATGTT).
Construct de Bruijn graphs without gluing.
Given a collection of k-mers Patterns, the nodes of DeBruijnk(Patterns) are simply all unique (k − 1)-mers occurring as a prefix or suffix in Patterns. For example, say we are given the following Patterns:
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
Then the set of 11 unique 2-mers occurring as a prefix or suffix in this composition is as follows:
AA AT CA CC GA GC GG GT TA TG TT
For every k-mer in Patterns, we connect its prefix to its suffix by a directed edge to produce DeBruijn(Patterns). You can verify that for the above example, this process produces the same de Bruijn graph that we have been working with (shown below).
4. DeBruijn Graph: Construct the de Bruijn graph from a set of k-mers.
Input: A collection of k-mers Patterns.
Output: The adjacency list of the de Bruijn graph DeBruijn(Patterns).
AGG -> GGG
CAG -> AGG
GAG -> AGG
GGA -> GAG
GGG -> GGA,GGG
in_dna = ['GAGG', 'GGGG', 'GGGA', 'CAGG', 'AGGG', 'GGAG']
dna = ['GAG', 'GGG', 'GGG', 'CAG', 'AGG', 'GGA']
unqiue dna = ['AGG', 'CAG', 'GAG', 'GGA', 'GGG']
kmer = len(in_dna)-1
dna = [in_dna[i][:kmer] for i in range(0,len(in_dna))]
dna = list(set(dna))
C = mv.find(dr)
M = d[:kmer]==dna and mv.find(dr)<0)
dna i d dr D C M mv
AGG 0 GAGG AGG GAG -1 F
1 GGGG GGG GGG -1 F
2 GGGA GGA GGG -1 F
3 CAGG AGG CAG -1 F
4 AGGG GGG AGG -1 T GGG
5 GGAG GAG GGA -1 F
CAG 3 CAGG AGG CAG -1 T AGG
GGG 1 GGGG GGG GGG -1 T GGG
2 GGGA GGA GGG -1 T GGG, GGA
for i,d in enumerate(in_dna):
dr = d[1:kmer+1]
if (d[:kmer]==dna and mv.find(dr)<0):
i d mv
0 AGG GGG
1 CAG AGG
2 GAG AGG
3 GGA GAG
4 GGG GGG, GGA
for i,d in enumerate(dna):
mv = FindMatch(d)
print d + " -> " + mv[:len(mv)-1]