{Algorithm;}

Sequence Antibiotics

Posted by dnsmak on September 30, 2014

Tyrocidine B1 is one of many antibiotics produced by Bacillus brevis. It is defined by the 10 amino acid-long sequence shown below.

Val-Lys-Leu-Phe-Pro-Trp-Phe-Asn-Gln-Tyr

The Central Dogma of Molecular Biology states that “DNA makes RNA makes protein.”

Transcription simply transforms a DNA string into an RNA string by replacing all occurrences of T with U. The resulting strand of RNA is translated into an amino acid sequence via the genetic code; this process converts each 3-mer of RNA, called a codon, into one of 20 amino acids. As the RNA codon table below shows, each of the 64 RNA codons encodes a single amino acid (several codons may encode the same amino acid), with the exception of three stop codons that do not translate into amino acids and serve to halt translation. For example, the DNA string TATACGAAA transcribes into the RNA string UAUACGAAA, which in turn translates into the amino acid string Tyr-Thr-Lys.
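The two steps in this example can be sketched in a few lines of Python 3. This is only an illustration: the function names and the three-entry codon dictionary are assumptions made for the example, and the dictionary covers just the codons that appear in TATACGAAA, not the full genetic code.

```python
# Only the three codons needed for the example above; NOT the full genetic code.
EXAMPLE_CODONS = {"UAU": "Tyr", "ACG": "Thr", "AAA": "Lys"}

def transcribe(dna):
    """Transcription as described above: replace every T with U."""
    return dna.replace("T", "U")

def translate(rna, table):
    """Translate consecutive, non-overlapping 3-mers (codons) using the table."""
    codons = [rna[i:i + 3] for i in range(0, len(rna) - 2, 3)]
    return "-".join(table[codon] for codon in codons)

rna = transcribe("TATACGAAA")
peptide = translate(rna, EXAMPLE_CODONS)
print(rna, peptide)  # UAUACGAAA Tyr-Thr-Lys
```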

1. Protein Translation: Translate an RNA string into an amino acid string.

Notes:

The “Stop” codon should not be translated, as shown in the sample below.
The RNA codon table below indicates which codons encode which amino acids.

Input: An RNA string Pattern – AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA
Output: The translation of Pattern into an amino acid string Peptide – MAMAPRTEINSTRING

RNA codon table
AAA K     AAC N     AAG K     AAU N
ACA T     ACC T     ACG T     ACU T
AGA R     AGC S     AGG R     AGU S
AUA I     AUC I     AUG M     AUU I
CAA Q     CAC H     CAG Q     CAU H
CCA P     CCC P     CCG P     CCU P
CGA R     CGC R     CGG R     CGU R
CUA L     CUC L     CUG L     CUU L
GAA E     GAC D     GAG E     GAU D
GCA A     GCC A     GCG A     GCU A
GGA G     GGC G     GGG G     GGU G
GUA V     GUC V     GUG V     GUU V
UAA Stop  UAC Y     UAG Stop  UAU Y
UCA S     UCC S     UCG S     UCU S
UGA Stop  UGC C     UGG W     UGU C
UUA L     UUC F     UUG L     UUU F

# Read input: the RNA string is on the first line of the data set
with open('w_2_1_data_set1.txt', 'r') as in_file:
    in_rna = in_file.readline().strip()

# Read the RNA codon table: one "codon amino_acid" pair per line;
# a stop codon appears alone on its line and maps to ''
codons = {}
with open('w_2_1_RNA_codon_table_1.txt', 'r') as in_codon:
    for in_data in in_codon:
        line = in_data.split()
        if len(line) == 2:
            codons[line[0]] = line[1]
        elif len(line) == 1:
            codons[line[0]] = ''

# Translate codon by codon and halt at the first stop codon
def ProteinValue(rna):
    protein = ''
    for i in range(0, len(rna) - 2, 3):
        aminoacid = codons[rna[i:i+3]]
        if aminoacid == '':
            break
        protein += aminoacid
    return protein

print(ProteinValue(in_rna))

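2. Peptide Encoding: Find substrings of a genome encoding a given amino acid sequence.

Input: A DNA string Text – ATGGCCATGGCCCCCAGAACTGAGATCAATAGTACCCGTATTAACGGGTGA
An amino acid string Peptide – MA

Output: All substrings of Text encoding Peptide – ATGGCC, GGCCAT, ATGGCC

A substring encodes Peptide if either the substring itself or its reverse complement transcribes and translates into Peptide; this is why GGCCAT, the reverse complement of ATGGCC, appears in the output. The solution may contain repeated strings if the same string occurs more than once as a substring of Text and encodes Peptide.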

import collections
import re

# Read input: line 1 is the DNA string, line 2 is the peptide
with open('w_2_2_data_set1.txt', 'r') as in_file:
    in_dna = in_file.readline().strip()
    in_peptide = in_file.readline().strip()

# Map each amino acid to the list of RNA codons that encode it
aminoacids = collections.defaultdict(list)
with open('w_2_1_RNA_codon_table_1.txt', 'r') as in_codon:
    for in_data in in_codon:
        line = in_data.split()
        if len(line) == 2:
            codon, aminoacid = line
            aminoacids[aminoacid].append(codon)

# DNA -> RNA
def RDNA(dna):
    return dna.replace('T', 'U')

# RNA -> DNA
def DNA(rdna):
    return rdna.replace('U', 'T')

# Reverse complement of a DNA string
def Reverse(dna):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    return ''.join(complement[c] for c in reversed(dna))

rdna = RDNA(in_dna)
rev_rdna = RDNA(Reverse(in_dna))

# Build every RNA string that translates into the peptide, e.g. for MA:
# M = ['AUG'], A = ['GCA', 'GCC', 'GCG', 'GCU']
# => ['AUGGCA', 'AUGGCC', 'AUGGCG', 'AUGGCU']
pattern = ['']
for p in in_peptide:
    pattern = [pa + aa for pa in pattern for aa in aminoacids[p]]

# Report one DNA substring per occurrence on either strand;
# the lookahead (?=...) also counts overlapping occurrences
ou_result = []
for p in pattern:
    ou_result += [DNA(p) for a in re.finditer('(?=' + p + ')', rdna)]
    ou_result += [Reverse(DNA(p)) for a in re.finditer('(?=' + p + ')', rev_rdna)]

for pa in ou_result:
    print(pa)

Tyrocidines and gramicidins are actually cyclic peptides; the cyclic representation for Tyrocidine B1 is shown in the figure below.

Because the peptide is cyclic, Tyrocidine B1 has ten different linear representations, and we must search for potential 30-mers encoding each of them.

Val-Lys-Leu-Phe-Pro-Trp-Phe-Asn-Gln-Tyr
Lys-Leu-Phe-Pro-Trp-Phe-Asn-Gln-Tyr-Val
⋮
Tyr-Val-Lys-Leu-Phe-Pro-Trp-Phe-Asn-Gln
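The ten linear representations listed above are simply the rotations of the peptide string. A minimal Python 3 sketch, using the single-letter form VKLFPWFNQY (the function name is an assumption made for illustration):

```python
def linear_representations(cyclic_peptide):
    """Return every rotation of the peptide, one per possible starting position."""
    n = len(cyclic_peptide)
    return [cyclic_peptide[i:] + cyclic_peptide[:i] for i in range(n)]

rotations = linear_representations("VKLFPWFNQY")
for rotation in rotations:
    print(rotation)
```

Each of these rotations would then be passed to the Peptide Encoding search to look for a matching 30-mer.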

The workhorse of peptide sequencing is the mass spectrometer, an expensive molecular scale that shatters molecules into pieces and then weighs the resulting fragments.

The mass spectrometer measures the mass of a molecule in daltons (Da). 1 Da is approximately equal to the mass of a single nuclear particle (i.e., a proton or neutron).

Adding up the number of protons and neutrons in a molecule yields the molecule’s integer mass. For example, the amino acid Gly, which has chemical formula C₂H₃ON, has an integer mass of 57, since 2·12 + 3·1 + 1·16 + 1·14 = 57.

Yet 1 Da is not exactly equal to the mass of a proton/neutron, and we may need to account for different naturally occurring isotopes of each atom when weighing a molecule. As a result, amino acids typically have non-integer masses (e.g., Gly has an actual mass of approximately 57.02 Da, compared with its integer mass of 57).

Integer mass table (two pairs of amino acids have equal mass: I/L and K/Q)

G	A	S	P	V	T	C	I	L	N	D	K	Q	E	M	H	F	R	Y	W
57	71	87	97	99	101	103	113	113	114	115	128	128	129	131	137	147	156	163	186


Merging the two equal-mass pairs reduces the 20 amino acids to 18 integer masses:
G	A	S	P	V	T	C	I/L	N	D	K/Q	E	M	H	F	R	Y	W
57	71	87	97	99	101	103	113	114	115	128	129	131	137	147	156	163	186

Tyrocidine B1, which is represented by VKLFPWFNQY in the single-letter amino acid alphabet, has total mass 1322 Da (99 + 128 + 113 + 147 + 97 + 186 + 147 + 114 + 128 + 163 = 1322).
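The sum above can be checked with a short Python 3 sketch. The dictionary is copied from the integer mass table earlier in this post; the names INTEGER_MASS and peptide_mass are assumptions made for the example.

```python
# Integer masses copied from the table above.
INTEGER_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def peptide_mass(peptide):
    """Total integer mass of a peptide given in single-letter notation."""
    return sum(INTEGER_MASS[amino_acid] for amino_acid in peptide)

tyrocidine_mass = peptide_mass("VKLFPWFNQY")
print(tyrocidine_mass)  # 1322
```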

The collection of all the fragment masses generated by the mass spectrometer is called an experimental spectrum. For example, the mass spectrometer may break one copy of the cyclic peptide Tyrocidine B1 into the two linear fragments LFP and WFNQYVK (with respective masses 357 and 965), and another copy into PWFN and QYVKLF.
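To make the fragmentation example concrete, the following Python 3 sketch cuts the cyclic peptide at two places and weighs both pieces. It is an illustration only: break_cycle and its start/length parameters are hypothetical names, and the masses come from the integer mass table above.

```python
INTEGER_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def mass(peptide):
    return sum(INTEGER_MASS[a] for a in peptide)

def break_cycle(peptide, start, length):
    """Cut a cyclic peptide into the fragment of the given length that
    begins at index start, and the complementary fragment."""
    n = len(peptide)
    doubled = peptide + peptide  # lets slices wrap around the end
    fragment = doubled[start:start + length]
    rest = doubled[start + length:start + n]
    return fragment, rest

first = break_cycle("VKLFPWFNQY", 2, 3)   # ('LFP', 'WFNQYVK')
second = break_cycle("VKLFPWFNQY", 4, 4)  # ('PWFN', 'QYVKLF')
for fragment, rest in (first, second):
    print(fragment, mass(fragment), rest, mass(rest))
```

In both cases the two fragment masses add up to 1322, the mass of the whole peptide.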

The resulting experimental spectrum contains the masses of all possible linear fragments of the peptide, which are called subpeptides. For example, the cyclic peptide NQEL has 12 subpeptides: N, Q, E, L, NQ, QE, EL, LN, NQE, QEL, ELN, and LNQ. A subpeptide may occur more than once if an amino acid occurs multiple times in the peptide; for example, ELEL also has 12 subpeptides: E, L, E, L, EL, LE, EL, LE, ELE, LEL, ELE, and LEL.
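The enumeration above can be reproduced with a short Python 3 sketch (cyclic_subpeptides is a name chosen for this example). A cyclic peptide of length n has n subpeptides of each length from 1 to n − 1, which gives n(n − 1) in total, or 12 for n = 4.

```python
def cyclic_subpeptides(peptide):
    """All subpeptides of a cyclic peptide, excluding the whole peptide."""
    n = len(peptide)
    doubled = peptide + peptide  # lets slices wrap around the end
    return [doubled[start:start + length]
            for length in range(1, n)
            for start in range(n)]

nqel = cyclic_subpeptides("NQEL")
elel = cyclic_subpeptides("ELEL")
print(nqel)
print(elel)
```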

The theoretical spectrum of a cyclic peptide Peptide, denoted Cyclospectrum(Peptide), is the collection of all of the masses of its subpeptides, in addition to the mass 0 and the mass of the entire peptide.

The theoretical spectrum can contain duplicate elements, as is the case for NQEL, where NQ and EL have the same mass.


0	113	114	128	129	227	242	242	257	355	356	370	371	484
	L	N	Q	E	LN	NQ	EL	QE	LNQ	ELN	QEL	NQE	NQEL

3. Theoretical Spectrum: Generate the theoretical spectrum of a cyclic peptide.
Input: An amino acid string Peptide – LEQN
Output: Cyclospectrum(Peptide) – 0 113 114 128 129 227 242 242 257 355 356 370 371 484

# Read input: the peptide is on the first line of the data set
with open('w_2_3_data_set1.txt', 'r') as in_file:
    in_peptide = in_file.readline().strip()

# Parse the integer mass table: one "amino_acid mass" pair per line
masss = {}
with open('w_2_3_integer_mass_table.txt', 'r') as in_mass:
    for in_data in in_mass:
        line = in_data.split()
        if len(line) == 2:
            masss[line[0]] = int(line[1])

# Total integer mass of a (sub)peptide
def Total(mc):
    return sum(masss[p] for p in mc)

lp = len(in_peptide)
# Append the first lp-2 letters so that subpeptides wrapping around the end
# can be read off as plain slices (LEQN -> LEQNLE)
pep_cycli = in_peptide + in_peptide[:lp-2]
# All subpeptides of length 1..lp-1, starting at each of the lp positions
lin_peptide = [pep_cycli[j:j+i+1] for i in range(0, lp-1) for j in range(0, lp)]
lin_peptide_n = [Total(i) for i in lin_peptide]
# Add the whole peptide and the mass 0
lin_peptide.append(in_peptide)
lin_peptide_n.append(Total(in_peptide))
lin_peptide_n.append(0)

print(lin_peptide)
print(lin_peptide_n)
print(sorted(lin_peptide_n))

#LEQN LEQNLE
#['L', 'E', 'Q', 'N', 'LE', 'EQ', 'QN', 'NL', 'LEQ', 'EQN', 'QNL', 'NLE', 'LEQN']
#[113, 129, 128, 114, 242, 257, 242, 227, 370, 371, 355, 356, 484, 0]
#[0, 113, 114, 128, 129, 227, 242, 242, 257, 355, 356, 370, 371, 484]

The Beltway Problem – find a set of points on a circle such that the distances between all pairs of points (where distance is measured around the circle) match a given collection of integers.

The Beltway Problem’s analogue in the case when the points lie along a line segment instead of on a circle is called the Turnpike Problem. For n points, the inputs for the Beltway and Turnpike Problems consist of n(n − 1) + 2 and n(n − 1)/2 + 2 distances, respectively. There is a pseudo-polynomial algorithm for the Turnpike Problem.

If A = (a₁ = 0, a₂, …, a_n) is a set of n points on a line segment in increasing order (a₁ < a₂ < · · · < a_n), then ∆A denotes the collection of all pairwise differences between points in A. For example, if A = (0, 2, 4, 7, 10), then

ΔA=(−10,−8,−7,−6,−5,−4,−3,−3,−2,−2,0,0,0,0,0,2,2,3,3,4,5,6,7,8,10)

The turnpike problem asks us to reconstruct A from ∆A.
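The forward direction, computing ∆A from A, is straightforward and is useful for checking a candidate answer. A minimal Python 3 sketch (the function name delta is an assumption made for this example):

```python
def delta(points):
    """All pairwise differences a - b, including the zero self-differences."""
    return sorted(a - b for a in points for b in points)

differences = delta([0, 2, 4, 7, 10])
print(differences)
```

For n points this yields n² values (25 for the five points above), of which n are zeros and the rest come in positive/negative pairs.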

4. Turnpike Problem: Given all pairwise distances between points on a line segment, reconstruct the positions of those points.
Input: A collection of integers L. -10 -8 -7 -6 -5 -4 -3 -3 -2 -2 0 0 0 0 0 2 2 3 3 4 5 6 7 8 10
Output: A set A such that ∆A = L. 0 2 4 7 10

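import collections

pairs = [-10,-8,-7,-6,-5,-4,-3,-3,-2,-2,0,0,0,0,0,2,2,3,3,4,5,6,7,8,10]
# Keep the positive differences as a multiset; the largest one is the last point
dists = collections.Counter(d for d in pairs if d > 0)
width = max(dists)
dists[width] -= 1

# Backtracking: the largest remaining distance must be measured from one of
# the two end points, so try both placements and undo a choice that fails
def Place(points):
    if sum(dists.values()) == 0:
        return sorted(points)
    top = max(d for d in dists if dists[d] > 0)
    for cand in (width - top, top):
        need = collections.Counter(abs(cand - p) for p in points)
        if all(dists[d] >= c for d, c in need.items()):
            dists.subtract(need)
            result = Place(points + [cand])
            if result:
                return result
            dists.update(need)
    return None

# Assumes the input is a valid collection of pairwise differences
print(' '.join(str(p) for p in Place([0, width])))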
The traditional view that bacteria act as loners and have few interactions with the rest of their colony has been challenged by the discovery of a communication method called quorum sensing.

Selenocysteine is a proteinogenic amino acid that exists in all kingdoms of life as a building block of a special class of proteins called selenoproteins.

Pyrrolysine is a proteinogenic amino acid that exists in some archaea and methane-producing bacteria.

The dalton (abbreviated Da) is the unit used for measuring atomic masses on a molecular scale. One dalton is equivalent to one twelfth of the mass of carbon-12 and has a value of approximately 1.66 · 10^-27 kg. The monoisotopic mass of a molecule is equal to the sum of the masses of the atoms in that molecule, using the mass of the most abundant isotope for each element. See the table below.

Amino acid	3-letter code	Molecular formula	Mass (Da)
Alanine	Ala	C₃H₅NO	71.03711
Cysteine	Cys	C₃H₅NOS	103.00919
Aspartic acid	Asp	C₄H₅NO₃	115.02694
Glutamic acid	Glu	C₅H₇NO₃	129.04259
Phenylalanine	Phe	C₉H₉NO	147.06841
Glycine	Gly	C₂H₃NO	57.02146
Histidine	His	C₆H₇N₃O	137.05891
Isoleucine	Ile	C₆H₁₁NO	113.08406
Lysine	Lys	C₆H₁₂N₂O	128.09496
Leucine	Leu	C₆H₁₁NO	113.08406
Methionine	Met	C₅H₉NOS	131.04049
Asparagine	Asn	C₄H₆N₂O₂	114.04293
Proline	Pro	C₅H₇NO	97.05276
Glutamine	Gln	C₅H₈N₂O₂	128.05858
Arginine	Arg	C₆H₁₂N₄O	156.10111
Serine	Ser	C₃H₅NO₂	87.03203
Threonine	Thr	C₄H₇NO₂	101.04768
Valine	Val	C₅H₉NO	99.06841
Tryptophan	Trp	C₁₁H₁₀N₂O	186.07931
Tyrosine	Tyr	C₉H₉NO₂	163.06333
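The masses in this table can be recomputed from the molecular formulas. The Python 3 sketch below is an illustration under stated assumptions: the atomic masses are the commonly tabulated monoisotopic values for ¹H, ¹²C, ¹⁴N, ¹⁶O and ³²S, rounded to seven decimals, and the formula is passed as a dictionary of atom counts rather than parsed from text.

```python
# Monoisotopic atomic masses in Da (most abundant isotope of each element);
# these values are an assumption added for this example, not from the table.
ATOM_MASS = {
    "H": 1.0078250,
    "C": 12.0000000,
    "N": 14.0030740,
    "O": 15.9949146,
    "S": 31.9720707,
}

def monoisotopic_mass(formula):
    """Sum the atomic masses for a formula given as {element: count}."""
    return sum(ATOM_MASS[element] * count for element, count in formula.items())

glycine = monoisotopic_mass({"C": 2, "H": 3, "N": 1, "O": 1})
cysteine = monoisotopic_mass({"C": 3, "H": 5, "N": 1, "O": 1, "S": 1})
print(round(glycine, 5), round(cysteine, 5))
```

Both results agree with the Gly and Cys rows of the table to within rounding.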

Branch-and-bound algorithms alternate between a branching step, which increases the number of candidate solutions, and a bounding step, which removes hopeless candidates.

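5. Cyclopeptide Sequencing: Given an ideal experimental spectrum, find every cyclic peptide whose theoretical spectrum matches it.
Input: A collection of (possibly repeated) integers Spectrum – 0 113 128 186 241 299 314 427
Output: Every peptide, written as a sequence of integer masses, whose Cyclospectrum equals Spectrum – 186-128-113 186-113-128 128-186-113 128-113-186 113-186-128 113-128-186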

import collections

# Read input: the spectrum is on the first line of the data set
with open('w_2_4_data_set1.txt', 'r') as in_file:
    in_spectrum_a = [int(m) for m in in_file.readline().split()]

# Parse the mass table; I/L and K/Q share a mass, so keep each mass only once
masss = set()
with open('w_2_3_integer_mass_table.txt', 'r') as in_mass:
    for in_data in in_mass:
        line = in_data.split()
        if len(line) == 2:
            masss.add(int(line[1]))

spectrum = collections.Counter(in_spectrum_a)
parent_mass = max(in_spectrum_a)

# Masses of all linear subpeptides of a peptide given as a list of masses
def LinearSpectrum(peptide):
    result = [0]
    for i in range(len(peptide)):
        for j in range(i + 1, len(peptide) + 1):
            result.append(sum(peptide[i:j]))
    return result

# Masses of all cyclic subpeptides, plus 0 and the mass of the whole peptide
def CycloSpectrum(peptide):
    n = len(peptide)
    double = peptide + peptide
    result = [0, sum(peptide)]
    for length in range(1, n):
        for start in range(n):
            result.append(sum(double[start:start + length]))
    return result

# Bounding step: every mass in the linear spectrum of a candidate must occur
# in the input spectrum at least as many times
def Consistent(peptide):
    linear = collections.Counter(LinearSpectrum(peptide))
    return all(spectrum[m] >= c for m, c in linear.items())

# Branching step: extend each candidate by every amino acid mass in the spectrum
# 113 I/L, 128 K/Q, 186 W
building_blocks = sorted(m for m in masss if spectrum[m] > 0)
candidates = [[]]
ou_result = []
while candidates:
    expanded = [p + [m] for p in candidates for m in building_blocks]
    candidates = []
    for p in expanded:
        if sum(p) == parent_mass:
            # a full-length candidate is kept only if its cyclospectrum matches
            if collections.Counter(CycloSpectrum(p)) == spectrum:
                ou_result.append(p)
        elif Consistent(p):
            candidates.append(p)

# Print each peptide as masses joined by "-", e.g. 186-128-113
for p in sorted(ou_result, reverse=True):
    print('-'.join(str(m) for m in p))

