String Reconstruction

String Reconstruction Problem: Reconstruct a string from its k-mer composition
Input: An integer k and a collection Patterns of k-mers. (AAT ATG GTT TAA TGT)
Output: A string Text with k-mer composition equal to Patterns (if such a string exists).

TAA
   AAT
    ATG
     TGT
      GTT
 TAATGTT

1.The String Reconstruction Problem reduces to finding an Eulerian path in the de Bruijn graph generated from reads.  Solve the String Reconstruction Problem.
Input: The adjacency list of a directed graph that has an Eulerian path.
CTT -> TTA
ACC -> CCA
TAC -> ACC
GGC -> GCT
GCT -> CTT
TTA -> TAC
Output: An Eulerian path in this graph. GGCTTACCA

import collections   
import itertools
in_file = open('w_5_2_data_set0.txt', 'r')
line = 1
in_adjacency=[]
lpos = 0
for in_data in in_file:
    ldata = in_data.strip(' \t\n\r')
    if (len(ldata)==0):
        break
    else:
        ls = ldata.split()
        in_adjacency.append(ls[0] +  ls[2][-1])

# in_adjacency => ['CTTA', 'ACCA', 'TACC', 'GGCT', 'GCTT', 'TTAC']
lpos = len(in_adjacency[0])-2     
i=0            
tot_iadj = len(in_adjacency)
pos =1
'''
I ia         end_ia2  j3  ja4      sta_ja5 2=5   i!j
0 CTTA TTA        0   CTTA CTT     False False True
…
0 CTTA TTA         5   TTAC TTA    True True True
   Remove ‘CTTA’, ‘ TTAC’ and merge CTTAC
   ['ACCA', 'TACC', 'GGCT', 'GCTT', 'CTTAC']

0 ACCA CCA ACCA  ACC False False True
0 ACCA CA  ACCA  ACC False False True
0 ACCA A    ACCA  ACC False False True
0 ACCA        ACCA  ACC False False True

1 TACC ACC ACCA ACC True True True =>   ['GGCT', 'GCTT', 'CTTAC', 'TACCA']

0 GGCT GCT  GGCT GGC False False True
0 GGCT GCT  GCTT GCT True True True =>   ['CTTAC', 'TACCA', 'GGCTT']
…
0 GGCTT CTT GGCTT     GGC False False True
0 GGCTT CTT CTTACCA CTT True True True =>  ['GGCTTACCA']
'''
while (1<tot_iadj):
    ia = in_adjacency[i]
    end_ia = ia[pos:len(ia)]
    match=False
    for j,ja in enumerate(in_adjacency):
        sta_ja = ja[0:len(ia)-pos]
        if (end_ia==sta_ja and ja != ia and len(end_ia)>lpos):
            in_adjacency.remove(ja)
            in_adjacency.remove(ia)
            in_adjacency.append(ia +  ja[len(sta_ja):len(ja)])
            i=-1
            tot_iadj = len(in_adjacency)
            match=True
            pos = 1
            break
    if(not match):
        pos+=1
        if (len(end_ia)==0):
            i+=1
            pos=1
    else:
        i+=1
for i,d in enumerate(in_adjacency):
   print d

2.k-universal circular strings. – For example, 00011101 is a 3-universal circular string, as it contains each of the eight binary 3-mers (000, 001, 011, 111, 110, 101,010, and 100) exactly.
kcircle
Input: An integer k. 4,  Output: A k-universal circular string. 0000111101100101

from collections import defaultdict
import collections   
import itertools

in_binary= 4
lpos = in_binary-1

last_binary ='1'*in_binary
bin_int = int(last_binary, 2)

last_before = "1"+('0'*lpos)
first = '0'*in_binary
nodes = defaultdict(list)
for i in range(0,bin_int+1):
       a = (bin(i)[2:].zfill(in_binary))
        if (a!=last_before and a!=first):
            #print sys.getsizeof(a), a
            s = a[0:lpos]
            e = a[1:in_binary]
            #print a, s,e
            nodes[s].append(e)
            nodes[e].append(s)
tour = []
start = '0'*(in_binary-1)
tour.append(start)
while(len(nodes[start])>0):
    estart = start[1:len(start)]
    if estart+"1" in  nodes[start]:
        tour.append(estart+"1")
        nodes[start].remove(estart+"1")
        nodes[estart+"1"].remove(start)
        start = estart+"1"
    else:
        tour.append(estart+"0")
        nodes[start].remove(estart+"0")
        nodes[estart+"0"].remove(start)
        start = estart+"0"      

ou_res ='0'
for i,d in enumerate(tour):
   ou_res+=d[0]
   
print ou_res

'''
nodes '010': ['001', '100', '101', '101'], '011': ['001', '110', '111', '101'], '001': ['000', '010', '011', '100'], '000': ['001'], '111': ['011', '110', '111', '111'], '110': ['011', '100', '101', '111'], '100': ['010', '001', '110'], '101': ['010', '010', '011', '110']

000 => 00=> 001 | 000, 001 find, remove ‘000’: 001 & ‘001’ :‘000’
nodes '010': ['001', '100', '101', '101'], '011': ['001', '110', '111', '101'], '001': [ '010', '011', '100'], '000': [], '111': ['011', '110', '111', '111'], '110': ['011', '100', '101', '111'], '100': ['010', '001', '110'], '101': ['010', '010', '011', '110']
tour ['000', '001']

 001 => 01 => 011 | 010 find, remove ‘001’: 011 & ‘011’ :‘001’
nodes '010': ['001', '100', '101', '101'], '011': ['110', '111', '101'], '001': [ '010', '100'], '000': [], '111': ['011', '110', '111', '111'], '110': ['011', '100', '101', '111'], '100': ['010', '001', '110'], '101': ['010', '010', '011', '110']
tour ['000', '001', ‘011’]
…
defaultdict(<type 'list'>, {'010': [], '011': [], '001': [], '000': [], '111': [], '110': [], '100': [], '101': []})
['000', '001', '011', '111', '111', '110', '101', '011', '110', '100', '001', '010', '101', '010', '100']

Take first digit of all values => 0+000111101100101
'''

PairedCompositionk,d(Text) – The (k,d)-mer composition of Text, is the collection of all (k,d)- mers in Text (including repeated (k,d)-mers).

Given a string Text, a (k,d)-mer is a pair of k-mers in Text separated by distance d. For example, PairedComposition3,1(TAATGCCATGGGATGTT):

TAA GCC 
 AAT CCA 
  ATG CAT 
   TGC ATG 
    GCC TGG 
     CCA GGG 
      CAT GGA 
       ATG GAT 
        TGG ATG 
         GGG TGT 
          GGA GTT
TAATGCCATGGGATGTT

List the lexicographic order of the 6-mers formed by their concatenated 3-mers: (AAT|CCA), (ATG|CAT), (ATG|GAT), (CAT|GGA),(CCA|GGG), (GCC|TGG), (GGA|GTT), (GGG|TGT), (TAA|GCC), (TGC|ATG), (TGG|ATG)
Consecutive (k, d)-mers appearing in Text, the suffix of the first (k, d)-mer = the prefix of the second (k, d)-mer. For example, for the consecutive (k, d)-mers Suffix((TAA | GCC))=Prefix((AAT | CCA))=(AA | CC) in TAATGCCATGGGATGTT

PathGraphk,d(Text) – A graph represents a path formed by |Text| – ( k + d + k) + 1 edges corresponding to all (k, d)-mers in Text. PathGraph3,1(TAATGCCATGGGATGTT).
 pr1
DeBruijnk,d(Text), called the paired de Bruijn graph, is formed by gluing identically labeled nodes in PathGraphk,d(Text);  the construction of the paired de Bruijn graph DeBruijn3,1(TAATGCCATGGGATGTT).
 pr2
PathGraph3,1(TAATGCCATGGGATGTT)  (fig 1) is formed by 11 edges and 12 nodes. Only two of these nodes have the same label (TG|AT). (fig 2) Bringing the two identically labeled nodes closer to each other in preparation for gluing. (fig 3) The paired de Bruijn graph DeBruijn3,1(TAATGCCATGGGATGTT) is obtained from PathGraph3,1(TAATGCCATGGGATGTT) by gluing the nodes sharing the label (TG|AT). This paired de Bruijn graph has a unique Eulerian path, which spells out TAATGCCATGGGATGTT.

3.String Reconstruction from Read-Pairs Problem: Reconstruct a string from its (k,d)-mer composition.
Input: A collection of paired k-mers PairedReads and an integer d.
Output: A string Text with (k,d)-mer composition equal to PairedReads (if such a string exists).

import collections   
import itertools

in_file = open('w_5_4_data_set0.txt', 'r')

line = 0
in_pkmer=[]
in_skmer=[]
in_kmer={}
bkmer=[]
in_distance = 0
okmer=[]
i=0
for in_data in in_file:
    ldata = in_data.strip(' \t\n\r')
    if (len(ldata)==0):
        break
    elif (line==0):
        in_distance =int(ldata)
    else:
        ls = ldata.split("|")
        in_kmer[i]=[ls[0],ls[1]]
        i+=1
    line+=1

bkmer=[False]*len(in_kmer)
print in_kmer
print bkmer

i=0
pos = len(in_kmer[0][0])-1
print pos
while(bkmer.count(False)>1):
    if(not bkmer[i]):
        ikmer = in_kmer[i][0]
        ikmer1 = in_kmer[i][1]
        ek = ikmer[len(ikmer)-pos:len(ikmer)]
        ek1 = ikmer1[len(ikmer1)-pos:len(ikmer1)]
        for j in range(0,len(in_kmer)):
            if(not bkmer[j]):
                jkmer = in_kmer[j][0]
                jkmer1 = in_kmer[j][1]
                sk = jkmer[0:pos]
                sk1 = jkmer1[0:pos]
                if(ek==sk and ek1==sk1):
                    npref =ikmer+jkmer[pos:len(jkmer)]
                    nsuff =in_kmer[i][1]+in_kmer[j][1][pos:len(jkmer)]
                    in_kmer[i][0]=npref
                    in_kmer[i][1]=nsuff
                    bkmer[j]=True
                    i=-1
                    break
    i+=1

bpos = bkmer.index(False)
pre = in_kmer[bpos][0]
suf = in_kmer[bpos][1]
res = pre+suf[len(suf)-(pos+1+in_distance):len(suf)]
print res

{0: ['GAGA', 'TTGA'], 1: ['TCGT', 'GATG'], 2: ['CGTG', 'ATGT'], 3: ['TGGT', 'TGAG'], 4: ['GTGA', 'TGTT'], 5: ['GTGG', 'GTGA'], 6: ['TGAG', 'GTTG'], 7: ['GGTC', 'GAGA'], 8: ['GTCG', 'AGAT'

1 TCGT CGT GATG ATG =  2 CGTG CGT ATGT ATG

1 TCGTG GTG GATGT TGT = 4 GTGA GTG TGTT TGT

1 TCGTGA TGA GATGTT GTT =   6 TGAG TGA GTTG GTT

1 TCGTGAG GAG GATGTTG TTG  =   0 GAGA GAG TTGA TTG

3 TGGT GGT TGAG GAG =   7 GGTC GGT GAGA GAG

3 TGGTC GTC TGAGA AGA =  8 GTCG GTC AGAT AGA

3 TGGTCG TCG TGAGAT GAT =  1 TCGTGAGA TCG GATGTTGA GAT

5 GTGG TGG GTGA TGA =  3 TGGTCGTGAGA TGG TGAGATGTTGA TGA
{0: true, 1: true, 2: true, 3: true, 4: true, 5: ['GTGGTCGTGAGA', 'GTGAGATGTTGA'], 6: true, 7: true, 8: true

GTGGTCGTGAGA 
               GTGAGATGTTGA

GTGGTCGTGAGATGTTGA

Contigs – A path in a graph is called non-branching if in(v) = out(v) = 1 for each intermediate node v of this path, i.e., for each node except the starting and ending node of a path. A maximal non-branching path is a non-branching path that cannot be extended into a longer non-branching path.For example, the de Bruijn graph below, constructed for the 3-mer composition of TAATGCCATGGGATGTT , has 9 maximal non-branching paths that spell out the contigs TAAT, TGTT, TGCCAT, ATG, ATG, ATG, TGG, GGG, and GGAT.
contig
4.Contig Generation: Generate the contigs from a collection of reads (with imperfect coverage).
Input: A collection of k-mers Patterns.
Output: All contigs in DeBruijn(Patterns).
CODE CHALLENGE: Solve the Contig Generation Problem.
Sample Input:
ATG
ATG
TGT
TGG
CAT
GGA
GAT
AGA
Sample Output:
AGA ATG ATG CAT GAT TGGA TGT

import collections   
import itertools

in_file = open('w_5_5_data_set0.txt', 'r')

line = 0
in_kmer=[]
bkmer=[]
for in_data in in_file:
    ldata = in_data.strip(' \t\n\r')
    if (len(ldata)==0):
        break
    else:
        in_kmer.append(ldata)
    line+=1

bkmer=[False]*len(in_kmer)
nkmer=[]
pos = len(in_kmer[0])-1
inout=[]
for i,im in enumerate(in_kmer):
    sim = im[0:pos]
    eim = im[len(im)-pos:len(im)]
    ii = 0
    io = 0
    for j,jm in enumerate(in_kmer):
        sjm = jm[0:pos]
        ejm = jm[len(jm)-pos:len(jm)]
        if(sim==sjm):
            ii+=1
        if(sim==ejm):
            io+=1
    if(ii==1 and io==1):
        inout.append(im)
    else:
        if(not bkmer[i]):
            bkmer[i]=True
            nkmer.append(im)

i=0
while(len(inout)>0):
    iv=inout[i]
    liv = iv[0:pos]
    print iv, liv
    for x in nkmer:
        if liv in x:
            nkmer.append(x+iv[pos:len(iv)])
            nkmer.remove(x)
            inout.remove(iv)
            i=-1
            break
    i+=1
        
    
for i,im in enumerate(sorted(nkmer)):
    print(im);
  
'''
['ATG', 'ATG', 'TGT', 'TGG', 'CAT', 'GGA', 'GAT', 'AGA']
0 ATG AT TG
  0 ATG AT TG 1 0 ...
inout []
nkmer ['ATG']
bkmer [True, False, False, False, False, False, False, False]

1 ATG AT TG
[]
['ATG', 'ATG']

..
5 GGA GG GA
['GGA']
['ATG', 'ATG', 'TGT', 'TGG', 'CAT']

7 AGA AG GA
['GGA']
['ATG', 'ATG', 'TGT', 'TGG', 'CAT', 'GAT', 'AGA']

while loop
GGA GG
  ATG False ..
  TGG True
[''] removed
['ATG', 'ATG', 'TGT', 'TGGA', 'CAT', 'GAT', 'AGA'] merged

sorted result 
AGA ATG ATG CAT GAT TGGA TGT
'''

a

Advertisement

6 responses to “String Reconstruction

  1. Yifang Tan December 13, 2017 at 6:29 pm

    Could you please elaborate how you create the adjacency list from paried-end files (normally two separate files in fasta/fastq format), in your example being the file “w_5_4_data_set0.txt”.
    Thanks a lot! –Yifang

    • dnsmak December 14, 2017 at 6:13 pm

      i will look, it’s almost 3 years, give me couple of days. thanks

    • dnsmak December 15, 2017 at 4:12 am

      w_5_4_data_set0.txt [Sample input]
      2
      GAGA|TTGA
      TCGT|GATG
      CGTG|ATGT
      TGGT|TGAG
      GTGA|TGTT
      GTGG|GTGA
      TGAG|GTTG
      GGTC|GAGA
      GTCG|AGAT

      Sample Output:
      GTGGTCGTGAGATGTTGA

      My Algorithm code process is
      ”’
      {0: [‘GAGA’, ‘TTGA’], 1: [‘TCGT’, ‘GATG’], 2: [‘CGTG’, ‘ATGT’],
      3: [‘TGGT’, ‘TGAG’], 4: [‘GTGA’, ‘TGTT’], 5: [‘GTGG’, ‘GTGA’],
      6: [‘TGAG’, ‘GTTG’], 7: [‘GGTC’, ‘GAGA’], 8: [‘GTCG’, ‘AGAT’]}
      [False, False, False, False, False, False, False, False, False]
      3
      0 GAGA AGA TTGA TGA
      0 GAGA GAG TTGA TTG 4
      1 TCGT TCG GATG GAT 4
      2 CGTG CGT ATGT ATG 4
      3 TGGT TGG TGAG TGA 4
      4 GTGA GTG TGTT TGT 4
      5 GTGG GTG GTGA GTG 4
      6 TGAG TGA GTTG GTT 4
      7 GGTC GGT GAGA GAG 4
      8 GTCG GTC AGAT AGA 4
      1 TCGT CGT GATG ATG
      0 GAGA GAG TTGA TTG 4
      1 TCGT TCG GATG GAT 4
      2 CGTG CGT ATGT ATG 4
      TCGTG GATGT {0: [‘GAGA’, ‘TTGA’], 1: [‘TCGTG’, ‘GATGT’], 2: [‘CGTG’, ‘ATGT’], 3: [‘TGGT’, ‘TGAG’], 4: [‘GTGA’, ‘TGTT’], 5: [‘GTGG’, ‘GTGA’], 6: [‘TGAG’, ‘GTTG’], 7: [‘GGTC’, ‘GAGA’], 8: [‘GTCG’, ‘AGAT’]}
      [False, False, True, False, False, False, False, False, False]
      0 GAGA AGA TTGA TGA
      0 GAGA GAG TTGA TTG 4
      1 TCGTG TCG GATGT GAT 5
      3 TGGT TGG TGAG TGA 4
      4 GTGA GTG TGTT TGT 4
      5 GTGG GTG GTGA GTG 4
      6 TGAG TGA GTTG GTT 4
      7 GGTC GGT GAGA GAG 4
      8 GTCG GTC AGAT AGA 4
      1 TCGTG GTG GATGT TGT
      0 GAGA GAG TTGA TTG 4
      1 TCGTG TCG GATGT GAT 5
      3 TGGT TGG TGAG TGA 4
      4 GTGA GTG TGTT TGT 4
      TCGTGA GATGTT {0: [‘GAGA’, ‘TTGA’], 1: [‘TCGTGA’, ‘GATGTT’], 2: [‘CGTG’, ‘ATGT’], 3: [‘TGGT’, ‘TGAG’], 4: [‘GTGA’, ‘TGTT’], 5: [‘GTGG’, ‘GTGA’], 6: [‘TGAG’, ‘GTTG’], 7: [‘GGTC’, ‘GAGA’], 8: [‘GTCG’, ‘AGAT’]}
      [False, False, True, False, True, False, False, False, False]
      0 GAGA AGA TTGA TGA
      0 GAGA GAG TTGA TTG 4
      1 TCGTGA TCG GATGTT GAT 6
      3 TGGT TGG TGAG TGA 4
      5 GTGG GTG GTGA GTG 4
      6 TGAG TGA GTTG GTT 4
      7 GGTC GGT GAGA GAG 4
      8 GTCG GTC AGAT AGA 4
      1 TCGTGA TGA GATGTT GTT
      0 GAGA GAG TTGA TTG 4
      1 TCGTGA TCG GATGTT GAT 6
      3 TGGT TGG TGAG TGA 4
      5 GTGG GTG GTGA GTG 4
      6 TGAG TGA GTTG GTT 4
      TCGTGAG GATGTTG {0: [‘GAGA’, ‘TTGA’], 1: [‘TCGTGAG’, ‘GATGTTG’], 2: [‘CGTG’, ‘ATGT’], 3: [‘TGGT’, ‘TGAG’], 4: [‘GTGA’, ‘TGTT’], 5: [‘GTGG’, ‘GTGA’], 6: [‘TGAG’, ‘GTTG’], 7: [‘GGTC’, ‘GAGA’], 8: [‘GTCG’, ‘AGAT’]}
      [False, False, True, False, True, False, True, False, False]
      0 GAGA AGA TTGA TGA
      0 GAGA GAG TTGA TTG 4
      1 TCGTGAG TCG GATGTTG GAT 7
      3 TGGT TGG TGAG TGA 4
      5 GTGG GTG GTGA GTG 4
      7 GGTC GGT GAGA GAG 4
      8 GTCG GTC AGAT AGA 4
      1 TCGTGAG GAG GATGTTG TTG
      0 GAGA GAG TTGA TTG 4
      TCGTGAGA GATGTTGA {0: [‘GAGA’, ‘TTGA’], 1: [‘TCGTGAGA’, ‘GATGTTGA’], 2: [‘CGTG’, ‘ATGT’], 3: [‘TGGT’, ‘TGAG’], 4: [‘GTGA’, ‘TGTT’], 5: [‘GTGG’, ‘GTGA’], 6: [‘TGAG’, ‘GTTG’], 7: [‘GGTC’, ‘GAGA’], 8: [‘GTCG’, ‘AGAT’]}
      [True, False, True, False, True, False, True, False, False]
      1 TCGTGAGA AGA GATGTTGA TGA
      1 TCGTGAGA TCG GATGTTGA GAT 8
      3 TGGT TGG TGAG TGA 4
      5 GTGG GTG GTGA GTG 4
      7 GGTC GGT GAGA GAG 4
      8 GTCG GTC AGAT AGA 4
      3 TGGT GGT TGAG GAG
      1 TCGTGAGA TCG GATGTTGA GAT 8
      3 TGGT TGG TGAG TGA 4
      5 GTGG GTG GTGA GTG 4
      7 GGTC GGT GAGA GAG 4
      TGGTC TGAGA {0: [‘GAGA’, ‘TTGA’], 1: [‘TCGTGAGA’, ‘GATGTTGA’], 2: [‘CGTG’, ‘ATGT’], 3: [‘TGGTC’, ‘TGAGA’], 4: [‘GTGA’, ‘TGTT’], 5: [‘GTGG’, ‘GTGA’], 6: [‘TGAG’, ‘GTTG’], 7: [‘GGTC’, ‘GAGA’], 8: [‘GTCG’, ‘AGAT’]}
      [True, False, True, False, True, False, True, True, False]
      1 TCGTGAGA AGA GATGTTGA TGA
      1 TCGTGAGA TCG GATGTTGA GAT 8
      3 TGGTC TGG TGAGA TGA 5
      5 GTGG GTG GTGA GTG 4
      8 GTCG GTC AGAT AGA 4
      3 TGGTC GTC TGAGA AGA
      1 TCGTGAGA TCG GATGTTGA GAT 8
      3 TGGTC TGG TGAGA TGA 5
      5 GTGG GTG GTGA GTG 4
      8 GTCG GTC AGAT AGA 4
      TGGTCG TGAGAT {0: [‘GAGA’, ‘TTGA’], 1: [‘TCGTGAGA’, ‘GATGTTGA’], 2: [‘CGTG’, ‘ATGT’], 3: [‘TGGTCG’, ‘TGAGAT’], 4: [‘GTGA’, ‘TGTT’], 5: [‘GTGG’, ‘GTGA’], 6: [‘TGAG’, ‘GTTG’], 7: [‘GGTC’, ‘GAGA’], 8: [‘GTCG’, ‘AGAT’]}
      [True, False, True, False, True, False, True, True, True]
      1 TCGTGAGA AGA GATGTTGA TGA
      1 TCGTGAGA TCG GATGTTGA GAT 8
      3 TGGTCG TGG TGAGAT TGA 6
      5 GTGG GTG GTGA GTG 4
      3 TGGTCG TCG TGAGAT GAT
      1 TCGTGAGA TCG GATGTTGA GAT 8
      TGGTCGTGAGA TGAGATGTTGA {0: [‘GAGA’, ‘TTGA’], 1: [‘TCGTGAGA’, ‘GATGTTGA’], 2: [‘CGTG’, ‘ATGT’], 3: [‘TGGTCGTGAGA’, ‘TGAGATGTTGA’], 4: [‘GTGA’, ‘TGTT’], 5: [‘GTGG’, ‘GTGA’], 6: [‘TGAG’, ‘GTTG’], 7: [‘GGTC’, ‘GAGA’], 8: [‘GTCG’, ‘AGAT’]}
      [True, True, True, False, True, False, True, True, True]
      3 TGGTCGTGAGA AGA TGAGATGTTGA TGA
      3 TGGTCGTGAGA TGG TGAGATGTTGA TGA 11
      5 GTGG GTG GTGA GTG 4
      5 GTGG TGG GTGA TGA
      3 TGGTCGTGAGA TGG TGAGATGTTGA TGA 11
      GTGGTCGTGAGA GTGAGATGTTGA {0: [‘GAGA’, ‘TTGA’], 1: [‘TCGTGAGA’, ‘GATGTTGA’], 2: [‘CGTG’, ‘ATGT’], 3: [‘TGGTCGTGAGA’, ‘TGAGATGTTGA’], 4: [‘GTGA’, ‘TGTT’], 5: [‘GTGGTCGTGAGA’, ‘GTGAGATGTTGA’], 6: [‘TGAG’, ‘GTTG’], 7: [‘GGTC’, ‘GAGA’], 8: [‘GTCG’, ‘AGAT’]}
      [True, True, True, True, True, False, True, True, True]
      1 3 2 12 9 12
      2014-11-15 13:17:19.634000 – 2014-11-15 13:17:22.503000 – 0:00:02.869000
      Done
      >>>
      ”’

      • Yifang Tan December 15, 2017 at 5:11 pm

        After I sent my reply, I felt a need to add more background of my question. I have been trying to figure out how genome assembly is implemented especially from the algorithm part with working code, simply for study purpose. I found your this blog is the best to explain every details except those example files some of which could be figured out from the context, but an explicit explanation would make your serial posts even better.
        My question is related to paired-end files which is the most common files in actual genome assembly. I did figure out the format of your example input file “w_5_4_data_set0.txt”, but how it is created from two input files of paired-end reads is not clear to me. So that I post a couple of rows for each file in my first reply when I assumed you are not working in the sequence lab, but the fact may be the opposite.
        I am also looking forward your more posts with improved efficiency of the algorithms and memory handling. I gave it a shot of my real data with your code (not this paired end based algorithm), it took ~ 20 hours to finish to get nowhere. But, that’s another story.
        Thank you for your great posts!

  2. yifangt December 15, 2017 at 2:50 pm

    Thanks!
    My question was: how did you create the sample input from paired end reads files?
    2
    GAGA|TTGA
    TCGT|GATG
    CGTG|ATGT
    TGGT|TGAG
    GTGA|TGTT
    For example, read1.fasta
    >seq1R1
    ATCGTTGATCGGCA
    >seq2R1
    AGGGCTCATTTAGG
    read2.fasta
    >seq1R1
    ATCGTTGATCCCCC
    >seq2R1
    GCCCAAGGTTAACC

    • Yifang Tan December 15, 2017 at 5:14 pm

      Some typos in my example files.
      For example, read1.fasta
      >seq1_R1
      ATCGTTGATCGGCA
      >seq2_R1
      AGGGCTCATTTAGG
      ……
      read2.fasta
      >seq1_R2
      ATCGTTGATCCCCC
      >seq2_R2
      GCCCAAGGTTAACC
      ……
      Wish I know how to edit the post when I found errors. Sorry about this.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: