Category Archives: Bio Information

String Reconstruction

String Reconstruction Problem: Reconstruct a string from its k-mer composition
Input: An integer k and a collection Patterns of k-mers. (AAT ATG GTT TAA TGT)
Output: A string Text with k-mer composition equal to Patterns (if such a string exists).

TAA
   AAT
    ATG
     TGT
      GTT
 TAATGTT

1.The String Reconstruction Problem reduces to finding an Eulerian path in the de Bruijn graph generated from reads.  Solve the String Reconstruction Problem.
Input: The adjacency list of a directed graph that has an Eulerian path.
CTT -> TTA
ACC -> CCA
TAC -> ACC
GGC -> GCT
GCT -> CTT
TTA -> TAC
Output: An Eulerian path in this graph. GGCTTACCA
Read more of this post

Advertisement

Graph – Overlap, De Bruijn

1. String Composition: Form the k-mer composition of a string.
Input: An integer k and a string Text.
Output: Compositionk(Text), where the k-mers are written in lexicographic order.
Composition3(TATGGGGTGC) = ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG

dnas = [in_dna[i:i+k] for i in range(0, len(in_dna)) if len(in_dna[i:i+k])==k]
dnas.sort()

For example TAATGCCATGGGATGTT are linked together to form the genome path.

s1
For example, each of the three occurrences of ATG should be connected to TGC, TGG, and TGT; these connections are overlapping 3-mers. The graph showing all overlap connections.

s2
The structure in the figure above is an example of a graph, or a network of nodes connected by edges. This graph is an example of a directed graph, whose edges have a direction and are represented by arrows (as opposed to undirected graphs whose edges do not have directions).

Sequence much shorter DNA fragments called reads.

Read more of this post

MOTIF

MOTIF ENUMERATION is unfortunately rather slow for large values of k and d,
1. Implanted Motif: Find all (k, d)-motifs in a collection of strings.
Input: A collection of strings Dna, and integers k and d.
3 1
ATTTGGC
TGCCTTA
CGGTATC
GAAAATT

Output: All (k, d)-motifs in Dna. ATA ATT GTT TTT

Any (k, d)-motif must be at most d mismatches apart from some k-mer appearing in one of the strings of Dna, generate all such k-mers and then check which of them are (k, d)-motifs. Read more of this post

Sequence Antibiotics

Tyrocidine B1, one of many antibiotics produced by Bacillus brevis. Tyrocidine B1 is defined by the 10 amino acid-long sequence shown below. Val-Lys-Leu-Phe-Pro-Trp-Phe-Asn-Gln-Tyr

proteinThe Central Dogma of Molecular Biology states that “DNA makes RNA makes protein.”amnioacid

Transcription simply transforms a DNA string into an RNA string by replacing all occurrences of T with U. The resulting strand of RNA is translated into an amino acid sequence via the genetic code; this process converts each 3-mer of RNA, called a codon, into one of 20 amino acids. As illustrated in the figure below, each of the 64 RNA codons encodes its own amino acid (some codons encode the same amino acid), with the exception of three stop codons that do not translate into amino acids and serve to halt translation. For example, the DNA string TATACGAAA transcribes into the RNA string UAUACGAAA, which in turn translates into the amino acid string Tyr-Thr-Lys. Read more of this post

DNA Replication – Frequent Words, Reverse Complement, Pattern Matching, Clump Finding, Skewi, Mismatches

Genome replication is one of the most important tasks carried out in the cell. Before a cell can divide, it must first replicate its genome so that each of the two daughter cells inherits its own copy.

Replication begins in a genomic region called the replication origin (denoted oriC) and is carried out by molecular copy machines called DNA polymerases.

Locating oriC presents an important task not only for understanding how cells replicate but also for various biomedical problems. For example, some gene therapy methods use genetically engineered mini-genomes, which are called viral vectors because they are able to penetrate cell walls (just like real viruses). Viral vectors carrying artificial genes have been widely used in agriculture.
In the following problem, we assume that a genome has a single oriC. Read more of this post