Sequence Antibiotics

Tyrocidine B1 is one of many antibiotics produced by Bacillus brevis. It is defined by the 10-amino-acid sequence shown below: Val-Lys-Leu-Phe-Pro-Trp-Phe-Asn-Gln-Tyr

The Central Dogma of Molecular Biology states that “DNA makes RNA makes protein.”

Transcription simply transforms a DNA string into an RNA string by replacing all occurrences of T with U. The resulting strand of RNA is translated into an amino acid sequence via the genetic code; this process converts each 3-mer of RNA, called a codon, into one of 20 amino acids. As illustrated in the figure below, each of the 64 RNA codons encodes its own amino acid (some codons encode the same amino acid), with the exception of three stop codons that do not translate into amino acids and serve to halt translation. For example, the DNA string TATACGAAA transcribes into the RNA string UAUACGAAA, which in turn translates into the amino acid string Tyr-Thr-Lys.
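
The TATACGAAA example above can be checked with a few lines of Python. This is a minimal sketch, not the course's reference code: the names `CODON_TABLE`, `transcribe` and `translate` are hypothetical, the table is built in memory from the standard genetic code rather than read from the downloadable file, and stop codons are marked with `*`.

```python
from itertools import product

# Standard genetic code in U, C, A, G order (third base varies fastest);
# '*' marks the three stop codons UAA, UAG and UGA.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def transcribe(dna):
    """Replace every T with U to obtain the RNA string."""
    return dna.replace("T", "U")

def translate(rna):
    """Translate codon by codon, halting at the first stop codon."""
    peptide = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)

rna = transcribe("TATACGAAA")
print(rna)             # UAUACGAAA
print(translate(rna))  # YTK, i.e. Tyr-Thr-Lys
```

The single-letter result YTK corresponds to Tyr-Thr-Lys in the three-letter alphabet.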

1. Protein Translation: Translate an RNA string into an amino acid string.
Notes:

  1. The “Stop” codon should not be translated, as shown in the sample below.
  2. A downloadable RNA codon table indicates which codons encode which amino acids.

Input: An RNA string Pattern – AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA
Output: The translation of Pattern into an amino acid string Peptide – MAMAPRTEINSTRING

RNA codon table

AAA K
AAC N
AAG K
AAU N
ACA T
ACC T
ACG T
ACU T
AGA R
AGC S
AGG R
AGU S
AUA I
AUC I
AUG M
AUU I
CAA Q
CAC H
CAG Q
CAU H
CCA P
CCC P
CCG P
CCU P
CGA R
CGC R
CGG R
CGU R
CUA L
CUC L
CUG L
CUU L
GAA E
GAC D
GAG E
GAU D
GCA A
GCC A
GCG A
GCU A
GGA G
GGC G
GGG G
GGU G
GUA V
GUC V
GUG V
GUU V
UAA
UAC Y
UAG
UAU Y
UCA S
UCC S
UCG S
UCU S
UGA
UGC C
UGG W
UGU C
UUA L
UUC F
UUG L
UUU F
#read input: the RNA string is on the first line
with open('w_2_1_data_set1.txt', 'r') as in_file:
    in_rna = in_file.readline().strip(' \t\n\r')

#read rna codon table; stop codons have no amino acid letter
codons = {}
with open('w_2_1_RNA_codon_table_1.txt', 'r') as in_codon:
    for in_data in in_codon:
        line = (in_data.strip(' \t\n\r')).split()
        if len(line)==2:
           codons[line[0]]=line[1]
        elif len(line)==1:
           codons[line[0]]=''

def ProteinValue(rna):
    #translate one 3-mer at a time and halt at the first stop codon
    protein = ''
    for i in range(0, len(rna)-2, 3):
        aminoacid = codons[rna[i:i+3]]
        if aminoacid=='':
           break
        protein += aminoacid
    return protein

protein = ProteinValue(in_rna)
print(protein)

2. Peptide Encoding: Find substrings of a genome encoding a given amino acid sequence.

Input: A DNA string Text – ATGGCCATGGCCCCCAGAACTGAGATCAATAGTACCCGTATTAACGGGTGA
An amino acid string Peptide – MA

Output: All substrings of Text encoding Peptide –
ATGGCC, GGCCAT, ATGGCC

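A substring encodes Peptide if either the substring itself or its reverse complement transcribes and translates into Peptide, which is why GGCCAT (the reverse complement of ATGGCC) appears in the output. The solution may contain repeated strings if the same string occurs more than once as a substring of Text and encodes Peptide.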

import collections
import itertools
import re
#read input
in_file = open('w_2_2_data_set1.txt', 'r')
in_codon = open('w_2_1_RNA_codon_table_1.txt', 'r')
line = 1
in_dna = "";
in_kmer = 0
in_peptide = ""
in_result=""
for in_data in in_file:
    if (line==1):
       in_dna = in_data.strip(' \t\n\r')
    elif (line==2):
       in_peptide = in_data.strip(' \t\n\r')
    line+=1

aminoacids = collections.defaultdict(list)
for in_data in in_codon:
    line = (in_data.strip(' \t\n\r')).split()
    if len(line)==2:
       condo,aminoacid = line
       aminoacids[aminoacid].append(condo)

def RDNA(dna):
    return dna.replace('T','U')

def DNA(rdna):
    return rdna.replace('U','T')

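#reverse complement of a DNA string
def Reverse(dna):
    complement = {'A':'T', 'C':'G', 'G':'C', 'T':'A'}
    return ''.join(complement[base] for base in reversed(dna))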

rdna = RDNA(in_dna)
rev_rdna = RDNA(Reverse(in_dna))
p=in_peptide[:1]
pattern = aminoacids[p]
for p in in_peptide[1:]:
    patterno = [pa+aa for pa in pattern for aa in aminoacids[p]]
    pattern=patterno

'''
M = AUG = >
A = ['GCA', 'GCC', 'GCG', 'GCU'] = > ['AUGGCA', 'AUGGCC', 'AUGGCG', 'AUGGCU']
'''
ou_result =[]
for p in pattern:
    #lookahead so that overlapping occurrences are also counted
    regex = '(?=' + p + ')'
    #occurrences on the given strand
    ou_result += [DNA(p) for a in re.finditer(regex, rdna)]
    #occurrences on the other strand, reported as substrings of the input DNA
    ou_result += [Reverse(DNA(p)) for a in re.finditer(regex, rev_rdna)]

for pa in ou_result:
    print(pa)

Tyrocidines and gramicidins are actually cyclic peptides; the cyclic representation for Tyrocidine B1 is shown in the figure below.

Because Tyrocidine B1 is cyclic, it has ten different linear representations, and each of them must be considered when searching for potential 30-mers that encode it.

Val-Lys-Leu-Phe-Pro-Trp-Phe-Asn-Gln-Tyr
Lys-Leu-Phe-Pro-Trp-Phe-Asn-Gln-Tyr-Val

Tyr-Val-Lys-Leu-Phe-Pro-Trp-Phe-Asn-Gln
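
The ten linear representations are simply the rotations of the peptide string. A minimal sketch, using the single-letter form VKLFPWFNQY and a hypothetical helper name `rotations`:

```python
def rotations(peptide):
    """Return every linear reading of a cyclic peptide, one per starting position."""
    return [peptide[i:] + peptide[:i] for i in range(len(peptide))]

linear_forms = rotations("VKLFPWFNQY")
for form in linear_forms:
    print(form)
```

The first, second and last strings printed correspond to the three sequences listed above.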

The workhorse of peptide sequencing is the mass spectrometer, an expensive molecular scale that shatters molecules into pieces and then weighs the resulting fragments.

The mass spectrometer measures the mass of a molecule in daltons (Da). 1 Da is approximately equal to the mass of a single nuclear particle (i.e., a proton or neutron).

Adding the number of protons and neutrons in a molecule yields its integer mass. For example, the amino acid Gly, which has chemical formula C2H3ON, has an integer mass of 57, since 2·12 + 3·1 + 1·16 + 1·14 = 57. Actual mass: 57.02 Da; integer mass: 57.

Yet 1 Da is not exactly equal to the mass of a proton or neutron, and we may need to account for the different naturally occurring isotopes of each atom when weighing a molecule. As a result, amino acids typically have non-integer masses (e.g., Gly has a total mass of approximately 57.02 Da).

Integer mass table (two pairs of amino acids, I/L and K/Q, have equal integer masses):

G  A  S  P  V  T   C   I   L   N   D   K   Q   E   M   H   F   R   Y   W
57 71 87 97 99 101 103 113 113 114 115 128 128 129 131 137 147 156 163 186

Merging the equal-mass pairs reduces the 20 amino acids to 18 distinct integer masses:

G  A  S  P  V  T   C   I/L N   D   K/Q E   M   H   F   R   Y   W
57 71 87 97 99 101 103 113 114 115 128 129 131 137 147 156 163 186

Tyrocidine B1, which is represented by VKLFPWFNQY in the single-letter amino acid alphabet, has total mass 1322 Da (99 + 128 + 113 + 147 + 97 + 186 + 147 + 114 + 128 + 163 = 1322).
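
The sum can be reproduced from the integer mass table. A small sketch, with `INTEGER_MASS` and `peptide_mass` as hypothetical names and the table typed in from the values listed above:

```python
# Integer masses of the 20 amino acids, copied from the table above.
INTEGER_MASS = dict(zip(
    "GASPVTCILNDKQEMHFRYW",
    [57, 71, 87, 97, 99, 101, 103, 113, 113, 114,
     115, 128, 128, 129, 131, 137, 147, 156, 163, 186],
))

def peptide_mass(peptide):
    """Sum the integer masses of the amino acids in a peptide string."""
    return sum(INTEGER_MASS[amino_acid] for amino_acid in peptide)

print(peptide_mass("VKLFPWFNQY"))  # 1322
```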

The collection of all the fragment masses generated by the mass spectrometer is called an experimental spectrum. The spectrometer breaks different copies of the peptide in different places. For example, one copy of Tyrocidine B1 may break into the linear fragments LFP and WFNQYVK (with respective masses 357 and 965), and another copy into PWFN and QYVKLF.

The resulting experimental spectrum contains the masses of all possible linear fragments of the peptide, which are called subpeptides. For example, the cyclic peptide NQEL has 12 subpeptides: N, Q, E, L, NQ, QE, EL, LN, NQE, QEL, ELN, and LNQ. The subpeptides may occur more than once if an amino acid occurs multiple times in the peptide, for example ELEL also has 12 subpeptides: E, L, E, L, EL, LE, EL, LE, ELE, LEL, ELE, and LEL.
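
The two lists of subpeptides can be generated by reading windows of each length from a doubled copy of the peptide. This is an illustrative sketch; `cyclic_subpeptides` is a hypothetical name, and the full peptide itself is deliberately left out, as in the text.

```python
def cyclic_subpeptides(peptide):
    """List all subpeptides of a cyclic peptide, shortest first.

    Every length from 1 to n-1 is read starting at each of the n positions,
    wrapping around the end, which gives n(n-1) subpeptides in total.
    """
    n = len(peptide)
    doubled = peptide + peptide  # makes wrap-around windows contiguous
    return [doubled[start:start + length]
            for length in range(1, n)
            for start in range(n)]

print(cyclic_subpeptides("NQEL"))
print(cyclic_subpeptides("ELEL"))
```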

The theoretical spectrum of a cyclic peptide Peptide, denoted Cyclospectrum(Peptide), is the collection of all of the masses of its subpeptides, in addition to the mass 0 and the mass of the entire peptide.

The theoretical spectrum can contain duplicate elements, as is the case for NQEL, where NQ and EL have the same mass.

Mass:        0   113  114  128  129  227  242  242  257  355  356  370  371  484
Subpeptide:  ""  L    N    Q    E    LN   NQ   EL   QE   LNQ  ELN  QEL  NQE  NQEL

3. Theoretical Spectrum: Generate the theoretical spectrum of a cyclic peptide.
Input: An amino acid string Peptide – LEQN
Output: Cyclospectrum(Peptide) – 0 113 114 128 129 227 242 242 257 355 356 370 371 484

in_file = open('w_2_3_data_set1.txt', 'r')
in_mass = open('w_2_3_integer_mass_table.txt', 'r')

#read input
line = 1
in_peptide = "";
in_result=""
for in_data in in_file:
    if (line==1):
       in_peptide = in_data.strip(' \t\n\r')
    line+=1
masss ={}
for in_data in in_mass:
    line = (in_data.strip(' \t\n\r')).split()
    if len(line)==2:
       masss[line[0]]=line[1]

def Total(mc):
    tot = 0
    for j,p in enumerate(mc):
        tot += int(masss[p])
    return tot

lin_peptide_n = []
lin_peptide = []
lp = len(in_peptide)
pep_cycli = in_peptide + in_peptide[:lp-2]
lin_peptide = [pep_cycli[j:j+i+1] for i in range(0,lp-1,1) for j in range(0, lp, 1)]
lin_peptide_n = [Total(i) for i in lin_peptide]
lin_peptide.append(in_peptide)
lin_peptide_n.append(Total(in_peptide))
lin_peptide_n.append(0)

print(lin_peptide)
print(lin_peptide_n)
print(sorted(lin_peptide_n))

#LEQN LEQNLE
#['L', 'E', 'Q', 'N', 'LE', 'EQ', 'QN', 'NL', 'LEQ', 'EQN', 'QNL', 'NLE', 'LEQN']
#[113, 129, 128, 114, 242, 257, 242, 227, 370, 371, 355, 356, 484, 0]
#[0, 113, 114, 128, 129, 227, 242, 242, 257, 355, 356, 370, 371, 484]

The Beltway Problem asks us to find a set of points on a circle such that the distances between all pairs of points (where distance is measured around the circle) match a given collection of integers.

The Beltway Problem’s analogue in the case when the points lie along a line segment instead of on a circle is called the Turnpike Problem. In the case of n points on a circle and line, the inputs for the Beltway and Turnpike Problems consist of n(n − 1) + 2 and n(n − 1)/2 + 2 distances, respectively.  There is a pseudo-polynomial algorithm for the Turnpike Problem.

If A = (a1 = 0, a2, …, an) is a set of n points on a line segment in increasing order (a1 < a2 < · · · < an), then ∆A denotes the collection of all pairwise differences between points in A. For example, if A = (0, 2, 4, 7, 10), then

ΔA=(−10,−8,−7,−6,−5,−4,−3,−3,−2,−2,0,0,0,0,0,2,2,3,3,4,5,6,7,8,10)

The turnpike problem asks us to reconstruct A from ∆A.
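
The forward direction, computing ∆A from A, is straightforward and is handy for checking any proposed reconstruction. A minimal sketch with a hypothetical function name `delta`:

```python
def delta(points):
    """Return the sorted collection of all pairwise differences a - b."""
    return sorted(a - b for a in points for b in points)

print(delta([0, 2, 4, 7, 10]))
```

For n points the result always has n·n entries, including n zeros from pairing each point with itself.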

4. Turnpike Problem: Given all pairwise distances between points on a line segment, reconstruct the positions of those points.
Input: A collection of integers L. -10 -8 -7 -6 -5 -4 -3 -3 -2 -2 0 0 0 0 0 2 2 3 3 4 5 6 7 8 10
Output: A set A such that ∆A = L. 0 2 4 7 10

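import collections

pairs = [-10,-8,-7,-6,-5,-4,-3,-3,-2,-2,0,0,0,0,0,2,2,3,3,4,5,6,7,8,10]

#backtracking: the largest unexplained distance must be measured
#from one of the two end points, so try both placements
def Place(points, remaining, width):
    if not remaining:
        return sorted(points)
    y = max(remaining)
    for cand in (width - y, y):
        dists = collections.Counter(abs(cand - p) for p in points)
        #cand fits only if all of its distances are still unexplained
        if not (dists - remaining):
            result = Place(points + [cand], remaining - dists, width)
            if result:
                return result
    return None

positive = collections.Counter(d for d in pairs if d > 0)
width = max(positive)
remaining = positive - collections.Counter([width])
print(' '.join(str(a) for a in Place([0, width], remaining, width)))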

The traditional view that bacteria act as loners and have few interactions with the rest of their colony has been challenged by the discovery of a communication method called quorum sensing. Selenocysteine is a proteinogenic amino acid that exists in all kingdoms of life as a building block of a special class of proteins called selenoproteins.  Pyrrolysine is a proteinogenic amino acid that exists in some archaea and methane-producing bacteria.

The dalton (abbreviated Da) is the unit used for measuring atomic masses on a molecular scale. One dalton is equivalent to one twelfth of the mass of carbon-12 and has a value of approximately 1.66 · 10^-27 kg. The monoisotopic mass of a molecule is equal to the sum of the masses of the atoms in that molecule, using the mass of the most abundant isotope for each element. See the table below.

Amino acid      3-letter code   Molecular formula   Mass (Da)
Alanine         Ala             C3H5NO              71.03711
Cysteine        Cys             C3H5NOS             103.00919
Aspartic acid   Asp             C4H5NO3             115.02694
Glutamic acid   Glu             C5H7NO3             129.04259
Phenylalanine   Phe             C9H9NO              147.06841
Glycine         Gly             C2H3NO              57.02146
Histidine       His             C6H7N3O             137.05891
Isoleucine      Ile             C6H11NO             113.08406
Lysine          Lys             C6H12N2O            128.09496
Leucine         Leu             C6H11NO             113.08406
Methionine      Met             C5H9NOS             131.04049
Asparagine      Asn             C4H6N2O2            114.04293
Proline         Pro             C5H7NO              97.05276
Glutamine       Gln             C5H8N2O2            128.05858
Arginine        Arg             C6H12N4O            156.10111
Serine          Ser             C3H5NO2             87.03203
Threonine       Thr             C4H7NO2             101.04768
Valine          Val             C5H9NO              99.06841
Tryptophan      Trp             C11H10N2O           186.07931
Tyrosine        Tyr             C9H9NO2             163.06333
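
The masses in the table can be recomputed from the molecular formulas. The sketch below makes some assumptions: the element masses are standard monoisotopic reference values supplied here rather than taken from the post, the parser only handles flat formulas such as C2H3NO, and all names are hypothetical.

```python
import re

# Assumed reference values: mass (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {"H": 1.007825, "C": 12.0, "N": 14.003074, "O": 15.994915, "S": 31.972071}
# Protons plus neutrons in those same isotopes, used for the integer mass.
NUCLEONS = {"H": 1, "C": 12, "N": 14, "O": 16, "S": 32}

def parse_formula(formula):
    """Return {element: count} for a flat formula such as 'C2H3NO'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def monoisotopic_mass(formula):
    """Sum the monoisotopic masses of all atoms in the formula."""
    return sum(MONOISOTOPIC[e] * n for e, n in parse_formula(formula).items())

def integer_mass(formula):
    """Sum the nucleon counts of all atoms in the formula."""
    return sum(NUCLEONS[e] * n for e, n in parse_formula(formula).items())

print(round(monoisotopic_mass("C2H3NO"), 5))  # Gly, approximately 57.02146
print(integer_mass("C2H3NO"))                 # 57
```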

Branch-and-bound algorithms alternate two steps: a branching step that expands the set of candidate solutions, followed by a bounding step that removes hopeless candidates.

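5. Cyclopeptide Sequencing: Given an ideal experimental spectrum, find every cyclic peptide whose theoretical spectrum matches it.
Input: A collection of integers Spectrum – 0 113 128 186 241 299 314 427
Output: Every amino acid mass string whose cyclospectrum equals Spectrum – 186-128-113 186-113-128 128-186-113 128-113-186 113-186-128 113-128-186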

import collections
import itertools
import re
in_file = open('w_2_4_data_set1.txt', 'r')
in_mass = open('w_2_3_integer_mass_table.txt', 'r')
line = 1
in_spectrum = "";
in_result=""
for in_data in in_file:
    if (line==1):
       in_spectrum = in_data.strip(' \t\n\r')
    line+=1

#parse mass table
masss ={}
for in_data in in_mass:
    line = (in_data.strip(' \t\n\r')).split()
    if len(line)==2:
       masss[line[0]]=line[1]
#materialize as a list so that .count() is available
masss_value = list(masss.values())

#convert mass letter to number and "-" each letter between
#IKW 113-128-186
def Total(mc):
    tot = ''
    for j,p in enumerate(mc):
        tot += masss[p]+ '-'
    return tot[:len(tot)-1]

#check the value and return mass key
def MassKey(p):
    for k, v in masss.items():
        if (v==p):
           return k

#sum of mass number
def Sum(mc):
    tot = 0
    for j,p in enumerate(mc):
        tot += int(masss[p])
    return tot

in_spectrum_a = (in_spectrum.strip(' \t\n\r')).split()
in_spectrum_s = ''

#Get all the >0 value of spectrum
#113 2 I, 128 2 K, 186 1 W = >IKW
for j,p in enumerate(in_spectrum_a):
    if (masss_value.count(p)>0):
       in_spectrum_s+=MassKey(p)

#same mass key may be repeated, so assign each "in_spectrum_s" one letter
#{'a': 'I', 'c': 'W', 'b': 'K'}
dic_spe={}
for i,p in enumerate(list(map(chr, range(97, 97+len(in_spectrum_s))))):
    dic_spe[p]=in_spectrum_s[i]

#convert assigned value to mass value
#IKW abc, IWK acb
def DictValue(pq):
    dpq = ''
    for j,p in enumerate(pq):
        dpq += dic_spe[p]
    return dpq

#Check sum of "I","IW", "IKW" mass in input spectrum
def Split(p):
    pp = p
    for i in range(0, len(pp)-1):
        mc = pp[i:i+len(p)-1]
        if (not str(Sum(mc)) in in_spectrum_a):
           return False
    return True

#generate ['a', 'b', 'c'] => ['ac', 'ab', 'ca', 'cb', 'ba', 'bc']
# => ['acb', 'abc', 'cab', 'cba', 'bac', 'bca']
#Check sum of "I","IW", "IKW" mass in input spectrum
ou_spectrum_s = list(dic_spe.keys())
in_spectrum_d = list(dic_spe.keys())
for k in range(0,len(in_spectrum_s)-1):
    tmp =[]
    for i,p in enumerate(ou_spectrum_s):
        for j,q in enumerate(in_spectrum_d):
            if (p.find(q)<0):
               dp = DictValue(p)
               dq = DictValue(q)
               if (Split(dp+dq) and str(Sum(dp+dq)) in in_spectrum_a ):
                  tmp.append(p+q)
    ou_spectrum_s =list(tmp)

#convert assinged letter to mass letter
uou_spectrum_s = []
for j, p in enumerate(sorted(ou_spectrum_s)):
    pd =DictValue(p)
    if (not pd in uou_spectrum_s):
       uou_spectrum_s.append(pd)

#convert mass letter to number and "-" each letter between
#IKW 113-128-186, IWK 113-186-128, KIW 128-113-186, KWI 128-186-113
for pd in sorted(uou_spectrum_s):
    print(Total(pd))
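
For comparison, the branch-and-bound idea can also be sketched directly on integer masses. This is an illustrative alternative under stated assumptions, not the script above: peptides are tuples of masses, the 18 distinct masses are typed in from the integer mass table, a candidate is kept only while its linear spectrum is contained in the input spectrum, and all function names are hypothetical.

```python
from collections import Counter

# The 18 distinct integer masses of the 20 amino acids.
MASSES = [57, 71, 87, 97, 99, 101, 103, 113, 114, 115,
          128, 129, 131, 137, 147, 156, 163, 186]

def prefix_masses(peptide):
    """Running totals: prefix[k] is the mass of the first k amino acids."""
    prefix = [0]
    for mass in peptide:
        prefix.append(prefix[-1] + mass)
    return prefix

def linear_spectrum(peptide):
    """Masses of all contiguous fragments of a linear peptide, plus 0."""
    prefix = prefix_masses(peptide)
    n = len(peptide)
    return sorted([0] + [prefix[j] - prefix[i]
                         for i in range(n) for j in range(i + 1, n + 1)])

def cyclic_spectrum(peptide):
    """Linear fragments plus the fragments that wrap around the end."""
    prefix = prefix_masses(peptide)
    total, n = prefix[-1], len(peptide)
    spectrum = [0]
    for i in range(n):
        for j in range(i + 1, n + 1):
            fragment = prefix[j] - prefix[i]
            spectrum.append(fragment)
            if i > 0 and j < n:
                spectrum.append(total - fragment)  # complementary wrap-around piece
    return sorted(spectrum)

def cyclopeptide_sequencing(spectrum):
    target = Counter(spectrum)
    parent_mass = max(spectrum)
    candidates, results = [()], []
    while candidates:
        # Branch: extend every candidate by every amino acid mass.
        expanded = [p + (m,) for p in candidates for m in MASSES]
        candidates = []
        for p in expanded:
            if sum(p) == parent_mass:
                if cyclic_spectrum(p) == sorted(spectrum):
                    results.append(p)
            # Bound: keep p only if its linear spectrum fits inside the target.
            elif sum(p) < parent_mass and not (Counter(linear_spectrum(p)) - target):
                candidates.append(p)
    return results

spectrum = [0, 113, 128, 186, 241, 299, 314, 427]
found = ["-".join(map(str, p)) for p in cyclopeptide_sequencing(spectrum)]
print(" ".join(found))
```

The loop ends because every surviving candidate grows by at least 57 Da per round and is dropped once it reaches or exceeds the parent mass.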