Jupyter Notebook

1) Install Anaconda Python Distribution, most widely used for data science. It contains the most popular libraries and tools (like NumPy, Pandas, Matplotlib …).

2) Once installation done, run Jupyter Notebook (anaconda3) from cmd. You can run the Jupyter Notebook by pip and Docker.

3) A Jupyter server is now running in your terminal, listening to port 8888, and it will automatically launch the application http://localhost:8888 or you can browse if it doesn’t launch.

jupyter

4) You should see your workspace directory

jupyter1

5) Click New -> Select Python 3 – > It opens the newly created notebook in a new tab -> Enter Print(“Hello World”) -> Run -> you can see the output.

jupyter2

6)Python Libraries: NumpyPandasScikit LearnMatplotlibSeaborn

 

Digital sum/root

Digital sum – the sum of all its digits. (12345 => 1+2+3+4+5=>15).

Digital root (also repeated digital sum) – the sum of all its digits untill a single-digit number is reached.
12345 => 1+2+3+4+5 => 15 => 1+5 => 6

In other words, it is the remainder when the number is divided by 9.

12345/9 => 12339 with a remainder of 6 => 1371.66 (1371*9=>12339 +6 => 12345).

Casting Out 9’s (a fast to find digital root).
Ignore the digit 9 in the number or a 9 anywhere in the number.

129 => 12 (remove 9) => 3 (1+2+9 => 12 => 3).

12999999993 => 1+2+3 => 6 (1+2+9+9+9+9+9+9+9+9+3 => 78 => 15 => 6).

1234125678 => (cross it out 1+8,2+7,3+6,4+5), so remaining 1,2 => 1+2 => 3
=> 1+2+3+4+1+2+5+6+7+8 => 39 => 12=> 3.

What is the purpose of the Digital Sum?
when the numbers are added, subtracted, multiplied or divided; their digital sums will also be respectively added, subtracted, multiplied or divided. So, digital sums to check the accuracy of the answer (mostly for big numbers).

123456789 * 123456789 = 15241578750190521
0* 0 = 115515 => 1+1+5+5+1+5 => 18 => 9 => 0 (so it correct)

123456789 * 123456789 = 15241578750190522 (change last digit 1 to 2).
0* 0 = 115552 => 1+1+5+5+5+2 => 19 => 10 => 1 (so it is not correct).

But it won’t find all the mistakes, let us the above exmaple, added 0 end of the result

123456789 * 123456789 = 152415787501905210(added 0, answer is wrong, but digit sum is correct)
0* 0 = 115515 => 1+1+5+5+1+5 => 18 => 9 => 0 (so it correct).

Complex Number

Real numbers – 0,0.3,1,2,-5, symbols (pi,e)

let us assume sqrt(i) = -1, this imaginary number
Imaginary numbers – i,-i, pi i, e i

Combine real and imaginary number = Complex number

z = 5 + 3i , you can’t add these 2 numbers, so this is called complex number

|Ima
|
3|                   z (it is called complex plane)
2|
a   1|
—————————- Re
-2 -1 |0 1 2 3 4 5
-1|
-2|
-3|                b (complex plan)
-4|

a = -2 +i
b= 4- 3i

classify the complex number

2+ 3i + 7i**2 + 5i**3 + 9i**4

let us assume i**2 = -1, i**3 => i**2*i => -i, i**4 => i**2 * i**2 => -1* -1 => 1

2+ 3i + 7i**2 + 5i**3 + 9i**4 => 2+ 3i -7 – 5i + 9 =>  4 – 2i

Data Science Tools – Pandas

Pandas – Handle data in a way suited to analyst(similar to R). Pandas = R + Python, Docs , Tutorial. Pandas allows us to structure and manipulate data in ways that are particularly well-suited for data analysis.  Pandas takes a lot of the best elements from R and implements them in Python.

Install
C:\Python27\Scripts>easy_install pandas
go to python ide, type
import pandas as pd
print pd.__version__

matplotlibhttp://sourceforge.net/projects/matplotlib/files/matplotlib/matplotlib-1.4.0/ OR easy_install matplotlib
import matplotlib.pyplot as plt
print plt.__version__

pyparsing –  http://sourceforge.net/projects/pyparsing/files/pyparsing/pyparsing-2.0.2/ OR easy_install pyparsing

openpyxl – easy_install openpyxl
ValueError: Installed openpyxl is not supported at this time. Use >=1.6.1 and <2.0.0.
easy_install openpyxl==1.8.6
easy_install xlrd
easy_install pyodbc
easy_install sqlalchemy
easy_install xlwt

Install Pandas another way
https://store.continuum.io/cshop/anaconda/
Click “Download Anaconda”, enter email address
Click “Windows 64-bit Python 2.7 Graphical Installer”
Install “Anaconda-2.0.1-Windows-x86_64.exe” and go with the default settings.
Go to cmd prompt, type “conda update conda”
c:\user\mder> conda update conda
Proceed “y” to install lastest version
next type “conda update anaconda”
Proceed “y” to install lastest version
next type “conda update pandas”
Proceed “y” to install lastest version
next type “conda update ipython”
Proceed “y” to install lastest version
next type “conda update tornado”
Proceed “y” to install lastest version

Testing pandas
type ipython notebook
it will open browser, http://localhost:8888/tree/
click “New Notebook”
type
import pandas as pd
print “version : ” + pd.__version__
press “>” play symbol

DataFramesData in Pandas is often contained in a structure called a DataFrame. A DataFrame is a two-dimensional labeled data structure with columns which can
be different types if necessary.

Create a DataFrame, pass a dictionary of lists to the DataFrame constructor: 1) The key of the dictionary will be the column name. 2) The associating list will be the values within that column.

import numpy as np
import pandas as pd
#from pandas import DataFrame, Series
teams = ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
'Lions', 'Lions']
loss = [5, 8, 6, 1, 5, 10, 6, 12];
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
'team': teams, 'wins': [11, 8, 10, 15, 11, 6, 10, 4],
'losses': Series(loss )}
df = pd.DataFrame(data)
print df
'''
losses     team  wins  year
0       5    Bears    11  2010
1       8    Bears     8  2011
2       6    Bears    10  2012
...
'''
print df.dtypes
'''
losses     int64
team      object
wins       int64
year       int64
dtype: object
'''
#basic statistics of the dataframe's numerical columns
print df.describe()
'''
losses       wins         year
count   8.000000   8.000000     8.000000
mean    6.625000   9.375000  2011.125000
std     3.377975   3.377975     0.834523
min     1.000000   4.000000  2010.000000
25%     5.000000   7.500000  2010.750000
50%     6.000000  10.000000  2011.000000
75%     8.500000  11.000000  2012.000000
max    12.000000  15.000000  2012.000000
'''
#displays the first five rows of the dataset
print df.head()
#displays the last five rows of the dataset
print df.tail()

Series in Pandas – Series as an one-dimensional object that is similar to an array, list, or column in a database. By default, it will assign an index label to each item in the Series ranging from 0 to N, where N is the number of items in the Series minus one.

import pandas as pd
series = pd.Series(['Python', 42, -1789710578, 12.35])
'''
0         Python
1             42
2    -1789710578
3          12.35
dtype: object
'''
#manually assign indices to the items in the Series when creating the series
series = pd.Series(['Python', 'Mak', 359, 250.55],
index=['Langugae', 'Manager', 'ID', 'Price'])
'''
Langugae    Python
Manager        Mak
ID             359
Price       250.55
dtype: object
'''
#use index to select specific items from the Series
print series['Langugae'] #Python
print series[['Manager', 'Price']] #Manager Mak Price 250.55 dype: object
#use boolean operators to select specific items from the Series
skills = pd.Series([1, 2, 3, 4, 5], index=['C#', 'Java', 'Python',
'Hadoop', 'CRM'])
print skills > 3
'''
C#        False
Java      False
Python    False
Hadoop     True
CRM        True
dtype: bool
'''
print skills[skills > 3]
'''
Hadoop    4
CRM       5
dtype: int64
'''

Notes
1) Selecting a single column from the DataFrame will return a Series.
2) Selecting multiple columns from the DataFrame will return a DataFrame.

print df['year'] #or df.year #shorthand
'''
0    2010
1    2011
2    2012
...
Name: year, dtype: int64
'''
print df[['year','wins']]
'''
   year  wins
0  2010    11
1  2011     8
...
'''

3)Row selection can be done through multiple ways.
1) Slicing
2) An individual index (through the functions iloc or loc)
3) Boolean indexing

print df.iloc[[0]] #or print df.loc[[0]]
'''
   losses   team  wins  year
0       5  Bears    11  2010
'''
print df[3:5]
'''
   losses     team  wins  year
3       1  Packers    15  2011
4       5  Packers    11  2012
'''
print df[df.wins > 10]
'''
   losses     team  wins  year
0       5    Bears    11  2010
3       1  Packers    15  2011
4       5  Packers    11  2012
'''
print df[(df.wins > 10) & (df.team == "Packers")]
'''
   losses     team  wins  year
3       1  Packers    15  2011
4       5  Packers    11  2012
'''
print df['wins'][df['losses']>=6]
print df[(df.wins>=2) | (df.losses>=1)]

Pandas Vectorized Methods -df.apply(numpy.mean)
map() (particular column) and applymap() (entire DataFrame) – create a new Series or DataFrame by applying the lambda function to each element.

Lambda Function are small inline functions that are defined on-the-fly in Python. lambda x: x>= 1 will take an input x and return x>=1, or a boolean that equals True or False.

df['one'].map(lambda x:x>=1)
df.applymap(lambda x:x>=1)

PandaSql – C:\Python27\Scripts>easy_install pandasql
The pandasql package allows us to perform queries on dataframes using the SQLite syntax.

Data Science Tools – NumPy

NumPy
Multidimensional arrays + matrix, mathematical functions useful for statistical analysis (mean,median, standard deviation)

Installation
C:\Python27\Scripts>easy_install numpy
throws error error: Setup script exited with error: Unable to find vcvarsall.bat
Download and install manually http://sourceforge.net/projects/numpy/files/

import numpy
numbers = [1,2,3,4,5]
print numpy.mean(numbers) #3.0
print numpy.median(numbers) #3.0
print numpy.std(numbers) # 1.41

NumPy arrays are different from python lists. Numpy much faster than Python lists directly.

import numpy as np
array = np.array([1, 4, 5, 8], float)  #[ 1.  4.  5.  8.]
#array indexing and slicing
print array[1] #4.0, array[:2] #[ 1.  4.]
array[1] = 5.0 #array[1] = 5.0

# a 2D array/Matrix , Matrix indexing and slicing

array  = np.array([[1, 2, 3], [4, 5, 6]], float)  # [[ 1.  2.  3.]  [ 4.  5.  6.]]
array [1][1] #5.0
array[1, :] #[ 4.  5.  6.]
array[:, 2] #[ 3.  6.]

#Arithmetic operations

array1 = np.array([1, 2, 3], float)
array2 = np.array([5, 2, 6], float)
array1 + array2 #[ 6.  4.  9.]
array1 - array2 #[-4.  0. -3.]
array1 * array2 #[  5.   4.  18.]

#Matrix arithmetic

array = np.array([[1, 2], [3, 4]], float)
array2 = np.array([[5, 6], [7, 8]], float)
array1 + array2 #[[  6.   8.] [ 10.  12.]]
array1 - array2 #[[-4. -4.] [-4. -4.]]
array1 * array2 #[[  5.  12.] [ 21.  32.]]

#Standard arithmetic operations

array1 = np.array([1, 2, 3], float)
array2 = np.array([[6], [7], [8]], float)
np.mean(array1) #2.0
np.dot(array1, array2) #[ 44.]

NumPy Dot

Matrix

A Matrix is an array of numbers, 2 rows and 3 columns 2 X 3

6  4 24
1 -9 8

Adding (same as subtracting) – A matrix with 3 X 5 +  3 X 5, but 3 X 5 # 3 X 4 (the columns don’t match in size).

 3 8 + 4  0 => 3+4 8+0 => 7  8
 4 6   1 -9    4+1 6-9    5 -3

Negative

- 6  4 24 = -6 -4 -24
  1 -9 8    -1  9 -8

Read more of this post

Statistics

Type of statistics
Max – Largest value in L – sorting L(0)
Min – Smallest value in L – L[n-1]
Midpoint – average of max & min – (L[0]+L(n-1))/2
Mean -average of values in L
Mode – most common value in L
Median – “middle” value in L (half bigger, half smaller) – L(n/2)

Statistics by sorting list, all running time 0(1) (constant time)
Statistics on Unsorted list,
1.Sort first, running time 0(n log n) under the comparison model, like sorter
2.Extract the statistic in effectively one scan through the data, running time 0(n), like max, min, median
Read more of this post

Graphs

graph
Clustering Coefficient
clusting coefficient

The clustering coefficient for graph is just the average of the clustering coefficients of the nodes in the graph.
Read more of this post

String Reconstruction

String Reconstruction Problem: Reconstruct a string from its k-mer composition
Input: An integer k and a collection Patterns of k-mers. (AAT ATG GTT TAA TGT)
Output: A string Text with k-mer composition equal to Patterns (if such a string exists).

TAA
   AAT
    ATG
     TGT
      GTT
 TAATGTT

1.The String Reconstruction Problem reduces to finding an Eulerian path in the de Bruijn graph generated from reads.  Solve the String Reconstruction Problem.
Input: The adjacency list of a directed graph that has an Eulerian path.
CTT -> TTA
ACC -> CCA
TAC -> ACC
GGC -> GCT
GCT -> CTT
TTA -> TAC
Output: An Eulerian path in this graph. GGCTTACCA
Read more of this post

Networks

Chain networks consist of a set of nodes and a sequence of edges between them connect them into a single chain.

chain ={}
for i in range(n-1):
    make_link(chain, i, i+1)
0——–0——–0——–0——–0
n1 e1 n2 e2 n3 e3 n4 e4 n5   – nodes (n), edges (e) = n-1
Read more of this post

Eulerian Tour & Path

# For example, if the input graph was
# [(1, 2), (2, 3), (3, 1)]
# A possible Eulerian tour would be [1, 2, 3, 1]

from collections import defaultdict
from random import choice
import itertools
def eulerian_tour(graph):
    tour=[]
    nodes=defaultdict(list)
    for s, e in graph:
        nodes[s].append(e)
        nodes[e].append(s)
    start = choice(nodes.keys())
    tour.append(start)
    while(len(nodes[start])>0):
        last = nodes[start].pop()
        tour.append(last)
        nodes[last].remove(start)
        start = last
    if list(itertools.chain(*nodes.values())):
        #print nodes
        for start in nodes:
            if start in tour and len(nodes[start])>0:
                index = tour.index(start)
                #print start, index, tour    
                while (len(nodes[start])>0):
                    last = nodes[start].pop()
                    nodes[last].remove(start)
                    index+=1
                    tour.insert(index, last)
                    start = last
                    #print tour
                    #print nodes
    return tour

graph = [(0, 1), (1, 5), (1, 7), (4, 5), (4, 8), (1, 6), (3, 7), (5, 9), (2, 4), (0, 4), (2, 5), (3, 6), (8, 9)]
eulerian_tour(graph)

Read more of this post

Analysis of algorithms

Mathematics of algorithms
1)Formalize what you’re doing
2)analyzing correctness
2)analyzing the efficiency

Measuring Times
Simplifying assumptions
1. a simple statement takes one unit of time (e.g. x = x + 1 => 1 unit)
2. a sequence of simple statement is the sum of the individual times (e.g. if y==4: z = z/2 => 2 units)
3. A Loop takes time equal to the body x iterations (e.g. for i in range(4): print i => 4 units)
Read more of this post

Graph, Eulerian Path, Eulerian Tour

Algorithms areGraph just how we organize computations to solve a particular problem. Some algorithms are really straightforward, but other algorithms take advantage of subtle mathematical properties to quickly and accurately provide an answer.

A set of circles and lines connecting between them is graph.
Read more of this post

Graph – Overlap, De Bruijn

1. String Composition: Form the k-mer composition of a string.
Input: An integer k and a string Text.
Output: Compositionk(Text), where the k-mers are written in lexicographic order.
Composition3(TATGGGGTGC) = ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG

dnas = [in_dna[i:i+k] for i in range(0, len(in_dna)) if len(in_dna[i:i+k])==k]
dnas.sort()

For example TAATGCCATGGGATGTT are linked together to form the genome path.

s1
For example, each of the three occurrences of ATG should be connected to TGC, TGG, and TGT; these connections are overlapping 3-mers. The graph showing all overlap connections.

s2
The structure in the figure above is an example of a graph, or a network of nodes connected by edges. This graph is an example of a directed graph, whose edges have a direction and are represented by arrows (as opposed to undirected graphs whose edges do not have directions).

Sequence much shorter DNA fragments called reads.

Read more of this post

MOTIF

MOTIF ENUMERATION is unfortunately rather slow for large values of k and d,
1. Implanted Motif: Find all (k, d)-motifs in a collection of strings.
Input: A collection of strings Dna, and integers k and d.
3 1
ATTTGGC
TGCCTTA
CGGTATC
GAAAATT

Output: All (k, d)-motifs in Dna. ATA ATT GTT TTT

Any (k, d)-motif must be at most d mismatches apart from some k-mer appearing in one of the strings of Dna, generate all such k-mers and then check which of them are (k, d)-motifs. Read more of this post

Sequence Antibiotics

Tyrocidine B1, one of many antibiotics produced by Bacillus brevis. Tyrocidine B1 is defined by the 10 amino acid-long sequence shown below. Val-Lys-Leu-Phe-Pro-Trp-Phe-Asn-Gln-Tyr

proteinThe Central Dogma of Molecular Biology states that “DNA makes RNA makes protein.”amnioacid

Transcription simply transforms a DNA string into an RNA string by replacing all occurrences of T with U. The resulting strand of RNA is translated into an amino acid sequence via the genetic code; this process converts each 3-mer of RNA, called a codon, into one of 20 amino acids. As illustrated in the figure below, each of the 64 RNA codons encodes its own amino acid (some codons encode the same amino acid), with the exception of three stop codons that do not translate into amino acids and serve to halt translation. For example, the DNA string TATACGAAA transcribes into the RNA string UAUACGAAA, which in turn translates into the amino acid string Tyr-Thr-Lys. Read more of this post

Big-O notation

Big-O notation compactly describes the running time of an algorithm. An algorithm’s efficiency in terms of its worst-case running time, which is the largest amount of time an algorithm can take given the most difficult input of a fixed size

For example, if your algorithm for sorting an array of n numbers takes roughly n2 operations for the most difficult dataset, we say that the running time of your algorithm is O(n2).

In reality, any number of operations, such as 1.5n2, n2 + n + 2, or 0.5n2 + 1; all these algorithms are O(n2) because big-O notation only cares about the term that grows the fastest with respect to the size of the input.

A function f(n) is Big-O of function g(n), or O(g(n)), when f(n) ≤ c · g(n) for some constant c and sufficiently large n.

Big O notation is to describe the performance or complexity of an algorithm. Big O specifically describes the worst-case scenario, and can be used to describe the execution time required or the space used (e.g. in memory or on disk) by an algorithm. Read more of this post

DNA Replication – Frequent Words, Reverse Complement, Pattern Matching, Clump Finding, Skewi, Mismatches

Genome replication is one of the most important tasks carried out in the cell. Before a cell can divide, it must first replicate its genome so that each of the two daughter cells inherits its own copy.

Replication begins in a genomic region called the replication origin (denoted oriC) and is carried out by molecular copy machines called DNA polymerases.

Locating oriC presents an important task not only for understanding how cells replicate but also for various biomedical problems. For example, some gene therapy methods use genetically engineered mini-genomes, which are called viral vectors because they are able to penetrate cell walls (just like real viruses). Viral vectors carrying artificial genes have been widely used in agriculture.
In the following problem, we assume that a genome has a single oriC. Read more of this post

Length of a Post/Answer Map Reducer Code

Correlation between the length of a post and the length of answers.
Output the length of the post and the average answer length for each post.

Mapper result Reducet result
111 35 111
15084 237 15084
2 145 2
3778 164 3848
3778 69 3778
66193 60 66193
66193 34 66199
66193 302 66196
66193 288 66195
7185 86 7185
10000001 140 323.940451745
10000002 625 465.578947368
10000005 0 35.0
10000006 836 99.6666666667
10000007 4224 580.428571429
66193 59 154.666666667

Read more of this post

Top tag Map Reducer Code

The top 10 tags used in posts, ordered by the number of questions they appear in.
forum csv file contains
“id”    “title”    “tagnames”    “author_id”    “body”    “node_type”    “parent_id”    “abs_parent_id”    “added_at”    “score”    “state_string”    “last_edited_id”    “last_activity_by_id”    “last_activity_at”    “active_revision_id”    “extra”    “extra_ref_id”    “extra_count”    “marked”
Read more of this post