Tag Archives: python

Jupyter Notebook

1) Install the Anaconda Python distribution, which is widely used for data science. It contains the most popular libraries and tools (like NumPy, Pandas, Matplotlib …).

2) Once installation is done, run Jupyter Notebook (anaconda3) from cmd. You can also run Jupyter Notebook via pip or Docker.

3) A Jupyter server is now running in your terminal, listening on port 8888. It will automatically launch the application at http://localhost:8888; you can browse to that address yourself if it doesn’t launch.

4) You should see your workspace directory.

5) Click New -> select Python 3 -> the newly created notebook opens in a new tab -> enter print("Hello World") -> Run -> you can see the output.

6) Python libraries: NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn



Data Science Tools – Pandas

Pandas handles data in a way suited to analysts (similar to R): Pandas = R + Python. Pandas allows us to structure and manipulate data in ways that are particularly well-suited for data analysis. It takes a lot of the best elements from R and implements them in Python.

C:\Python27\Scripts>easy_install pandas
Go to the Python IDE and type:
import pandas as pd
print pd.__version__

matplotlib – http://sourceforge.net/projects/matplotlib/files/matplotlib/matplotlib-1.4.0/ OR easy_install matplotlib
import matplotlib
import matplotlib.pyplot as plt
# the version string lives on the top-level package, not on pyplot
print matplotlib.__version__

pyparsing –  http://sourceforge.net/projects/pyparsing/files/pyparsing/pyparsing-2.0.2/ OR easy_install pyparsing

openpyxl – easy_install openpyxl
If pandas then reports “ValueError: Installed openpyxl is not supported at this time. Use >=1.6.1 and <2.0.0.”, install a supported version instead:
easy_install openpyxl==1.8.6
easy_install xlrd
easy_install pyodbc
easy_install sqlalchemy
easy_install xlwt

Install Pandas another way
Click “Download Anaconda”, enter email address
Click “Windows 64-bit Python 2.7 Graphical Installer”
Install “Anaconda-2.0.1-Windows-x86_64.exe” and go with the default settings.
Go to the cmd prompt and type “conda update conda”
c:\user\mder> conda update conda
Enter “y” to proceed and install the latest version
Next, type “conda update anaconda”
Enter “y” to proceed and install the latest version
Next, type “conda update pandas”
Enter “y” to proceed and install the latest version
Next, type “conda update ipython”
Enter “y” to proceed and install the latest version
Next, type “conda update tornado”
Enter “y” to proceed and install the latest version

Testing pandas
Type “ipython notebook” at the cmd prompt
It will open the browser at http://localhost:8888/tree/
Click “New Notebook” and enter:
import pandas as pd
print "version : " + pd.__version__
Press the “>” (play) button to run the cell

DataFrames
Data in Pandas is often contained in a structure called a DataFrame. A DataFrame is a two-dimensional labeled data structure with columns which can be different types if necessary.

To create a DataFrame, pass a dictionary of lists to the DataFrame constructor: 1) Each key of the dictionary becomes a column name. 2) The associated list provides the values within that column.

import numpy as np
import pandas as pd

teams = ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
         'Lions', 'Lions']
loss = [5, 8, 6, 1, 5, 10, 6, 12]
# each dictionary key becomes a column; a Series works as well as a list
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': teams, 'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': pd.Series(loss)}
df = pd.DataFrame(data)
print df
losses     team  wins  year
0       5    Bears    11  2010
1       8    Bears     8  2011
2       6    Bears    10  2012
print df.dtypes
losses     int64
team      object
wins       int64
year       int64
dtype: object
#basic statistics of the dataframe's numerical columns
print df.describe()
losses       wins         year
count   8.000000   8.000000     8.000000
mean    6.625000   9.375000  2011.125000
std     3.377975   3.377975     0.834523
min     1.000000   4.000000  2010.000000
25%     5.000000   7.500000  2010.750000
50%     6.000000  10.000000  2011.000000
75%     8.500000  11.000000  2012.000000
max    12.000000  15.000000  2012.000000
#displays the first five rows of the dataset
print df.head()
#displays the last five rows of the dataset
print df.tail()

Series in Pandas – A Series is a one-dimensional object that is similar to an array, list, or column in a database. By default, it assigns an index label to each item in the Series, ranging from 0 to N, where N is the number of items in the Series minus one.

import pandas as pd
series = pd.Series(['Python', 42, -1789710578, 12.35])
print series
0         Python
1             42
2    -1789710578
3          12.35
dtype: object
#manually assign indices to the items in the Series when creating the series
series = pd.Series(['Python', 'Mak', 359, 250.55],
                   index=['Language', 'Manager', 'ID', 'Price'])
print series
Language    Python
Manager        Mak
ID             359
Price       250.55
dtype: object
#use index to select specific items from the Series
print series['Language'] #Python
print series[['Manager', 'Price']] #Manager Mak Price 250.55 dtype: object
#use boolean operators to select specific items from the Series
skills = pd.Series([1, 2, 3, 4, 5], index=['C#', 'Java', 'Python',
'Hadoop', 'CRM'])
print skills > 3
C#        False
Java      False
Python    False
Hadoop     True
CRM        True
dtype: bool
print skills[skills > 3]
Hadoop    4
CRM       5
dtype: int64

1) Selecting a single column from the DataFrame will return a Series.
2) Selecting multiple columns from the DataFrame will return a DataFrame.

print df['year'] #or df.year #shorthand
0    2010
1    2011
2    2012
Name: year, dtype: int64
print df[['year','wins']]
   year  wins
0  2010    11
1  2011     8

3) Row selection can be done in multiple ways:
   1) Slicing
   2) An individual index (through the functions iloc or loc)
   3) Boolean indexing

print df.iloc[[0]] #or print df.loc[[0]]
   losses   team  wins  year
0       5  Bears    11  2010
print df[3:5]
   losses     team  wins  year
3       1  Packers    15  2011
4       5  Packers    11  2012
print df[df.wins > 10]
   losses     team  wins  year
0       5    Bears    11  2010
3       1  Packers    15  2011
4       5  Packers    11  2012
print df[(df.wins > 10) & (df.team == "Packers")]
   losses     team  wins  year
3       1  Packers    15  2011
4       5  Packers    11  2012
print df['wins'][df['losses']>=6]
print df[(df.wins>=2) | (df.losses>=1)]

Pandas Vectorized Methods – df.apply(numpy.mean) applies a function to every column of the DataFrame.
map() (on a particular column) and applymap() (on an entire DataFrame) create a new Series or DataFrame by applying the lambda function to each element.

Lambda functions are small inline functions that are defined on-the-fly in Python. lambda x: x>= 1 takes an input x and returns x>=1, a boolean that equals True or False.

df['one'].map(lambda x:x>=1)
df.applymap(lambda x:x>=1)

PandaSql – C:\Python27\Scripts>easy_install pandasql
The pandasql package allows us to perform queries on dataframes using the SQLite syntax.


Types of statistics (L is a list of n values, sorted from largest to smallest)
Max – largest value in L – L[0]
Min – smallest value in L – L[n-1]
Midpoint – average of max and min – (L[0] + L[n-1])/2
Mean – average of values in L
Mode – most common value in L
Median – “middle” value in L (half bigger, half smaller) – L[n/2]

Statistics on a sorted list: the positional ones (max, min, midpoint, median) all run in O(1) (constant time)
Statistics on an unsorted list:
1. Sort first, running time O(n log n) under the comparison model (e.g. with sorted)
2. Extract the statistic in effectively one scan through the data, running time O(n), like max, min, median
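
The definitions above can be sketched in a few lines of Python. This is a minimal illustration that takes the "sort first" route; the function name list_statistics and the sample values are assumptions made for the example, not part of the original post.

```python
from collections import Counter

def list_statistics(values):
    """Return the basic statistics of a non-empty list of numbers."""
    # Sort from largest to smallest, so L[0] is the max and L[n-1] the min.
    L = sorted(values, reverse=True)
    n = len(L)
    return {
        "max": L[0],
        "min": L[n - 1],
        "midpoint": (L[0] + L[n - 1]) / 2.0,
        "mean": sum(L) / float(n),
        # most_common(1) returns a list holding one (value, count) pair
        "mode": Counter(L).most_common(1)[0][0],
        "median": L[n // 2],
    }

stats = list_statistics([3, 1, 4, 1, 5])
```

Once the list is sorted, max, min, midpoint and median are simple index lookups; mean and mode still need a pass over the data.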


Clustering Coefficient

The clustering coefficient for a graph is just the average of the clustering coefficients of the nodes in the graph.
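
A minimal sketch of that average follows. It assumes the usual definition of a node's clustering coefficient (the fraction of pairs of its neighbors that are themselves linked) and a graph stored as a dictionary mapping each node to the set of its neighbors; both the representation and the function names are assumptions for illustration.

```python
def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbors of `node` that are linked to each other."""
    neighbors = list(graph[node])
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to check
    links = 0
    for i in range(k):
        for j in range(i + 1, k):
            if neighbors[j] in graph[neighbors[i]]:
                links += 1
    return 2.0 * links / (k * (k - 1))

def average_clustering(graph):
    """Clustering coefficient of the graph: the mean over all nodes."""
    total = sum(clustering_coefficient(graph, v) for v in graph)
    return total / float(len(graph))

# A triangle a-b-c with an extra node d hanging off a.
example = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
result = average_clustering(example)
```

In this example b and c each score 1, a scores 1/3 and d scores 0, so the graph-level value is 7/12.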

String Reconstruction

String Reconstruction Problem: Reconstruct a string from its k-mer composition.
Input: An integer k and a collection Patterns of k-mers (e.g. AAT ATG GTT TAA TGT).
Output: A string Text with k-mer composition equal to Patterns (if such a string exists).


1. The String Reconstruction Problem reduces to finding an Eulerian path in the de Bruijn graph generated from reads. Solve the String Reconstruction Problem.
Input: The adjacency list of a directed graph that has an Eulerian path.
Output: An Eulerian path in this graph. GGCTTACCA
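
The reduction can be sketched as follows: build the de Bruijn graph whose edges are the k-mers (prefix -> suffix), walk an Eulerian path, and spell the string along it. This is a minimal sketch that assumes the input really does admit an Eulerian path; the function name reconstruct_string is an assumption, and the path is found with the standard stack-based Hierholzer approach, which may differ from the code in the full post.

```python
from collections import defaultdict

def reconstruct_string(kmers):
    """Spell a string whose k-mer composition equals `kmers`."""
    graph = defaultdict(list)    # (k-1)-mer -> list of successor (k-1)-mers
    indegree = defaultdict(int)
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        indegree[suffix] += 1

    # An Eulerian path starts at the node with one more outgoing than
    # incoming edge; if there is none (a cycle), any node will do.
    start = next(iter(graph))
    for node in list(graph):
        if len(graph[node]) - indegree[node] == 1:
            start = node
            break

    # Follow unused edges; emit nodes as we back out of dead ends.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    # Consecutive nodes overlap in all but their last character.
    return path[0] + "".join(node[-1] for node in path[1:])

text = reconstruct_string(["AAT", "ATG", "GTT", "TAA", "TGT"])
```

For the five 3-mers listed in the problem statement the path is TA -> AA -> AT -> TG -> GT -> TT, which spells TAATGTT.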


Chain networks consist of a set of nodes and a sequence of edges that connect them into a single chain.

chain = {}
for i in range(n-1):
    make_link(chain, i, i+1)
n1 e1 n2 e2 n3 e3 n4 e4 n5   – a chain of n nodes has n-1 edges (here 5 nodes and 4 edges)
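
The snippet above relies on a make_link helper that is not shown in this excerpt. A runnable sketch is given below; it assumes an undirected graph stored as a dictionary of dictionaries, and both that representation and the wrapper make_chain are assumptions for illustration.

```python
def make_link(G, node1, node2):
    """Add an undirected edge between node1 and node2 to graph G."""
    G.setdefault(node1, {})[node2] = 1
    G.setdefault(node2, {})[node1] = 1
    return G

def make_chain(n):
    """Build a chain of n nodes: 0 - 1 - 2 - ... - (n-1)."""
    chain = {}
    for i in range(n - 1):
        make_link(chain, i, i + 1)
    return chain

chain = make_chain(5)
# Every edge is stored twice (once per endpoint), so halve the total.
edge_count = sum(len(neighbors) for neighbors in chain.values()) // 2
```

With n = 5 the chain has five nodes and four edges, matching the n-1 rule.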

Eulerian Tour & Path

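# For example, if the input graph was
# [(1, 2), (2, 3), (3, 1)]
# a possible Eulerian tour would be [1, 2, 3, 1]
# Assumes a connected, undirected graph in which every node has even degree.

from collections import defaultdict
from random import choice

def eulerian_tour(graph):
    # build an adjacency list holding the unused edges of every node
    nodes = defaultdict(list)
    for s, e in graph:
        nodes[s].append(e)
        nodes[e].append(s)
    start = choice(list(nodes))
    tour = [start]
    # walk along unused edges until we are stuck back at the first node
    while nodes[start]:
        last = nodes[start].pop()
        nodes[last].remove(start)
        tour.append(last)
        start = last
    # splice extra cycles into the tour while unused edges remain
    while any(nodes.values()):
        for start in list(nodes):
            if start in tour and nodes[start]:
                index = tour.index(start)
                cycle = []
                while nodes[start]:
                    last = nodes[start].pop()
                    nodes[last].remove(start)
                    cycle.append(last)
                    start = last
                tour[index + 1:index + 1] = cycle
    return tour

graph = [(0, 1), (1, 5), (1, 7), (4, 5), (4, 8), (1, 6), (3, 7), (5, 9), (2, 4), (0, 4), (2, 5), (3, 6), (8, 9)]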


Analysis of algorithms

Mathematics of algorithms
1) Formalizing what you’re doing
2) Analyzing correctness
3) Analyzing efficiency

Measuring Times
Simplifying assumptions
1. A simple statement takes one unit of time (e.g. x = x + 1 => 1 unit)
2. A sequence of simple statements takes the sum of the individual times (e.g. if y==4: z = z/2 => 2 units)
3. A loop takes time equal to the time of its body multiplied by the number of iterations (e.g. for i in range(4): print i => 4 units)

Graph – Overlap, De Bruijn

1. String Composition: Form the k-mer composition of a string.
Input: An integer k and a string Text.
Output: Composition_k(Text), where the k-mers are written in lexicographic order.

# every window of length k, sorted lexicographically
dnas = sorted(in_dna[i:i+k] for i in range(len(in_dna) - k + 1))

For example, the 3-mers of TAATGCCATGGGATGTT are linked together, each overlapping the next, to form the genome path.

For example, each of the three occurrences of ATG should be connected to TGC, TGG, and TGT; these connections are between overlapping 3-mers. The overlap graph shows all such connections.

The structure in the figure above is an example of a graph, or a network of nodes connected by edges. This graph is an example of a directed graph, whose edges have a direction and are represented by arrows (as opposed to undirected graphs whose edges do not have directions).
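
A minimal sketch of the overlap connections described above: a k-mer is linked to every k-mer whose prefix equals its suffix. For brevity this version collapses repeated k-mers into a single node, whereas the overlap graph in the text keeps one node per occurrence; the function names are assumptions for illustration.

```python
def composition(text, k):
    """All k-mers of text, in lexicographic order (repeats included)."""
    return sorted(text[i:i + k] for i in range(len(text) - k + 1))

def overlap_graph(kmers):
    """Adjacency list: a -> every distinct b with suffix(a) == prefix(b)."""
    distinct = sorted(set(kmers))
    return {a: [b for b in distinct if a[1:] == b[:-1]] for a in distinct}

kmers = composition("TAATGCCATGGGATGTT", 3)
graph = overlap_graph(kmers)
atg_targets = graph["ATG"]
```

For the example string, ATG points at TGC, TGG and TGT, the three connections named in the text.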

Instead of reading a whole genome at once, researchers sequence much shorter DNA fragments called reads.



Motif enumeration is unfortunately rather slow for large values of k and d.
1. Implanted Motif: Find all (k, d)-motifs in a collection of strings.
Input: A collection of strings Dna, and integers k and d (e.g. k = 3 and d = 1).

Output: All (k, d)-motifs in Dna (e.g. ATA ATT GTT TTT).

Any (k, d)-motif must be at most d mismatches apart from some k-mer appearing in one of the strings of Dna. So we can generate all such k-mers and then check which of them are (k, d)-motifs.
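
The generate-then-check idea can be sketched like this. The helper names and the four Dna strings in the usage example are assumptions chosen for illustration (the post's own sample strings are not shown in this excerpt); the sketch favors clarity over speed.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def neighbors(pattern, d):
    """All strings within d mismatches of pattern."""
    if d == 0:
        return {pattern}
    if len(pattern) == 1:
        return set("ACGT")
    result = set()
    for suffix in neighbors(pattern[1:], d):
        if hamming(pattern[1:], suffix) < d:
            # a mismatch is still available for the first character
            for base in "ACGT":
                result.add(base + suffix)
        else:
            result.add(pattern[0] + suffix)
    return result

def appears(pattern, text, d):
    """True if pattern occurs in text with at most d mismatches."""
    k = len(pattern)
    return any(hamming(pattern, text[i:i + k]) <= d
               for i in range(len(text) - k + 1))

def motif_enumeration(dna, k, d):
    """All (k, d)-motifs: k-mers appearing in every string with <= d mismatches."""
    candidates = set()
    for text in dna:
        for i in range(len(text) - k + 1):
            candidates |= neighbors(text[i:i + k], d)
    return sorted(p for p in candidates if all(appears(p, t, d) for t in dna))

# Illustrative input only (assumed, not taken from the post).
motifs = motif_enumeration(["ATTTGGC", "TGCCTTA", "CGGTATC", "GAAAATT"], 3, 1)
```

The candidate set grows quickly with k and d, which is why this approach becomes slow for large values.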

Sequence Antibiotics

Tyrocidine B1, one of many antibiotics produced by Bacillus brevis. Tyrocidine B1 is defined by the 10 amino acid-long sequence shown below. Val-Lys-Leu-Phe-Pro-Trp-Phe-Asn-Gln-Tyr

The Central Dogma of Molecular Biology states that “DNA makes RNA makes protein.”

Transcription simply transforms a DNA string into an RNA string by replacing all occurrences of T with U. The resulting strand of RNA is translated into an amino acid sequence via the genetic code; this process converts each 3-mer of RNA, called a codon, into one of 20 amino acids. As illustrated in the figure below, each of the 64 RNA codons encodes its own amino acid (some codons encode the same amino acid), with the exception of three stop codons that do not translate into amino acids and serve to halt translation. For example, the DNA string TATACGAAA transcribes into the RNA string UAUACGAAA, which in turn translates into the amino acid string Tyr-Thr-Lys.
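
The two steps can be sketched as below. The codon table here is deliberately partial: it holds only the three codons from the example plus the three stop codons, so a real implementation would need all 64 entries. The names CODON_TABLE, transcribe and translate are assumptions for illustration.

```python
# Partial genetic code: just enough for the example. None marks a stop codon.
CODON_TABLE = {
    "UAU": "Tyr",
    "ACG": "Thr",
    "AAA": "Lys",
    "UAA": None,
    "UAG": None,
    "UGA": None,
}

def transcribe(dna):
    """DNA -> RNA: replace every T with U."""
    return dna.replace("T", "U")

def translate(rna):
    """RNA -> amino acid string, reading codon by codon until a stop codon."""
    peptide = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid is None:
            break  # stop codon halts translation
        peptide.append(amino_acid)
    return "-".join(peptide)

rna = transcribe("TATACGAAA")
peptide = translate(rna)
```

Running it on TATACGAAA reproduces the example from the text: UAUACGAAA, then Tyr-Thr-Lys.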

Big-O notation

Big-O notation compactly describes the running time of an algorithm. It expresses an algorithm’s efficiency in terms of its worst-case running time, which is the largest amount of time the algorithm can take given the most difficult input of a fixed size.

For example, if your algorithm for sorting an array of n numbers takes roughly n^2 operations for the most difficult dataset, we say that the running time of your algorithm is O(n^2).

In reality, the algorithm could take any number of operations, such as 1.5n^2, n^2 + n + 2, or 0.5n^2 + 1; all these algorithms are O(n^2) because big-O notation only cares about the term that grows the fastest with respect to the size of the input.

A function f(n) is Big-O of function g(n), or O(g(n)), when f(n) ≤ c · g(n) for some constant c and sufficiently large n.
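
As a small illustration of the definition, the sketch below counts the operations of a nested loop that visits every pair of positions in an input of size n. The count is f(n) = n(n-1)/2, and f(n) ≤ 1 · n^2 for every n, so f(n) is O(n^2) with c = 1. The function name is an assumption for illustration.

```python
def count_pair_operations(n):
    """Count one operation per pair (i, j) with i < j < n."""
    operations = 0
    for i in range(n):
        for j in range(i + 1, n):
            operations += 1
    return operations

# Compare the measured count f(n) with the bound c * g(n) = 1 * n^2.
measurements = [(n, count_pair_operations(n), n * n) for n in (1, 10, 100)]
```

For n = 10 the loop performs 45 operations against a bound of 100; the gap is the constant factor and lower-order term that big-O notation ignores.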

Big O notation is used to describe the performance or complexity of an algorithm. Big O specifically describes the worst-case scenario, and can be used to describe the execution time required or the space used (e.g. in memory or on disk) by an algorithm.

DNA Replication – Frequent Words, Reverse Complement, Pattern Matching, Clump Finding, Skew, Mismatches

Genome replication is one of the most important tasks carried out in the cell. Before a cell can divide, it must first replicate its genome so that each of the two daughter cells inherits its own copy.

Replication begins in a genomic region called the replication origin (denoted oriC) and is carried out by molecular copy machines called DNA polymerases.

Locating oriC presents an important task not only for understanding how cells replicate but also for various biomedical problems. For example, some gene therapy methods use genetically engineered mini-genomes, which are called viral vectors because they are able to penetrate cell walls (just like real viruses). Viral vectors carrying artificial genes have been widely used in agriculture.
In the following problem, we assume that a genome has a single oriC.

Length of a Post/Answer Map Reducer Code

Correlation between the length of a post and the length of answers.
Output the length of the post and the average answer length for each post.

Mapper result
111 35 111
15084 237 15084
2 145 2
3778 164 3848
3778 69 3778
66193 60 66193
66193 34 66199
66193 302 66196
66193 288 66195
7185 86 7185

Reducer result
10000001 140 323.940451745
10000002 625 465.578947368
10000005 0 35.0
10000006 836 99.6666666667
10000007 4224 580.428571429
66193 59 154.666666667


Top tag Map Reducer Code

The top 10 tags used in posts, ordered by the number of questions they appear in.
forum csv file contains
“id”    “title”    “tagnames”    “author_id”    “body”    “node_type”    “parent_id”    “abs_parent_id”    “added_at”    “score”    “state_string”    “last_edited_id”    “last_activity_by_id”    “last_activity_at”    “active_revision_id”    “extra”    “extra_ref_id”    “extra_count”    “marked”

Study Group Map Reducer Code

Analyzing each forum thread gives us a list of the students who have posted there, whether they asked the question, answered a question or added a comment.
forum csv file contains
“id”    “title”    “tagnames”    “author_id”    “body”    “node_type”    “parent_id”    “abs_parent_id”    “added_at”    “score”    “state_string”    “last_edited_id”    “last_activity_by_id”    “last_activity_at”    “active_revision_id”    “extra”    “extra_ref_id”    “extra_count”    “marked”

Hadoop Combiners

Consider a Mapper/Reducer that computes a mean like this:
1.    Your mapper may have gone through the records and output a key; value pair that looked like: day of week; value.
2.    For each day of the week, your reducer kept a running total of the value as well as a count of the number of records.
3.    You divided the total value by the number of records to get the mean.

But there’s a problem here. That second step involves moving a lot of data around your network. What if we could do some of the reduction locally before sending the data to the reducers?
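
The idea of reducing locally can be sketched in plain Python. This is a simulation of the data flow, not Hadoop API code: each "node" collapses its own (day, value) pairs into one (total, count) pair per day, and the reducer merges those partial results. The function names and sample values are assumptions for illustration.

```python
from collections import defaultdict

def combine(pairs):
    """Run on each mapper node: collapse (day, value) pairs locally."""
    partial = defaultdict(lambda: [0.0, 0])
    for day, value in pairs:
        partial[day][0] += value   # running total
        partial[day][1] += 1       # record count
    return [(day, (total, count)) for day, (total, count) in partial.items()]

def reduce_means(partials):
    """Run on the reducer: merge partial totals and counts, then divide."""
    merged = defaultdict(lambda: [0.0, 0])
    for day, (total, count) in partials:
        merged[day][0] += total
        merged[day][1] += count
    return {day: total / count for day, (total, count) in merged.items()}

# Output of two mapper nodes (made-up values).
node1 = [("Mon", 10.0), ("Mon", 20.0), ("Tue", 5.0)]
node2 = [("Mon", 30.0), ("Tue", 15.0)]
means = reduce_means(combine(node1) + combine(node2))
```

Sending (total, count) rather than a locally computed mean matters: averaging the per-node means would give the wrong answer whenever nodes hold different numbers of records.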

Hadoop Designing With Patterns

Filtering Patterns: Don’t change records; only get a part of the data. Examples: Simple Filter, Bloom Filter (more efficient), Sampling, Random Sampling, Top K

Top K

In an RDBMS you would normally first sort the data, then take the top K records. In MapReduce this kind of approach will not work, because the data is not sorted and is processed on several machines. Thus, the mappers first have to find their own top K lists, without sorting the data, and then send the local lists to the reducers, which can then find the global top K list.

Let’s find the top 5 longest posts in our forum!
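
The two-stage pattern can be sketched in plain Python, outside Hadoop. Each mapper keeps only its K longest posts using a heap instead of a full sort, and the reducer picks the global top K from the merged local lists. The (id, body) tuple layout, function names and sample posts are assumptions for illustration.

```python
import heapq

def local_top_k(posts, k):
    """Mapper side: the k longest posts seen on this machine."""
    return heapq.nlargest(k, posts, key=lambda post: len(post[1]))

def global_top_k(local_lists, k):
    """Reducer side: the k longest posts across all local lists."""
    merged = [post for local in local_lists for post in local]
    return heapq.nlargest(k, merged, key=lambda post: len(post[1]))

# Posts as (id, body) pairs, split across two mappers (made-up data).
mapper_a = [(1, "short"), (2, "a much longer post body"), (3, "mid-size body")]
mapper_b = [(4, "tiny"), (5, "the longest post body of them all, by far")]
top = global_top_k([local_top_k(mapper_a, 2), local_top_k(mapper_b, 2)], 2)
top_ids = [post[0] for post in top]
```

Each mapper sends at most K records over the network, so the reducer handles at most K times the number of mappers.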

MapReduce Code

Input Data

2012-01-01    2:01    Omaha    Book    10.51    Visa

It’s tab delimited; the values are the date, the time, the store name, a description of the item, the cost, and the method of payment.
Mapper Code (mapper.py)

    import sys

    for line in sys.stdin:
        data = line.strip().split("\t")
        # skip malformed lines that do not have exactly six fields
        if len(data) == 6:
            date, time, storename, productname, cost, paymethod = data
            print "{0}\t{1}".format(storename, cost)

Reducer Code (reducer.py)
In my case, I have a single Reducer, because that’s the Hadoop default, so it will get all the keys. If I had specified more than one Reducer, each would receive some of the keys, along with all the values from all the Mappers for those keys.
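
The reducer itself is not shown in this excerpt. A minimal sketch of what it might look like follows, assuming the goal is the total sales per store and that the lines arrive sorted by key, as Hadoop delivers them. It is written as a function over an iterable of lines so it can be tried without Hadoop; in a streaming job the same loop would read sys.stdin and print each total. The name reduce_sales is an assumption.

```python
def reduce_sales(lines):
    """Sum the cost per store from key-sorted 'store<TAB>cost' lines."""
    totals = []
    old_key, total = None, 0.0
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # skip malformed lines
        key, value = parts
        # A change of key means the previous store is finished.
        if old_key is not None and key != old_key:
            totals.append((old_key, total))
            total = 0.0
        old_key = key
        total += float(value)
    if old_key is not None:
        totals.append((old_key, total))  # flush the last store
    return totals

sample = ["Omaha\t10.51\n", "Omaha\t2.00\n", "Reno\t5.25\n"]
totals = reduce_sales(sample)
```

Because the input is sorted by key, the reducer only needs to remember one store at a time rather than a dictionary of all stores.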