Data Science Tools – Pandas

Pandas – handles data in a way suited to analysts (similar to R); roughly, Pandas = R + Python. Pandas lets us structure and manipulate data in ways that are particularly well suited to data analysis, and it takes many of the best elements from R and implements them in Python.

Install
C:\Python27\Scripts>easy_install pandas
Open a Python IDE and type:
import pandas as pd
print pd.__version__

matplotlib – http://sourceforge.net/projects/matplotlib/files/matplotlib/matplotlib-1.4.0/ OR easy_install matplotlib
import matplotlib
import matplotlib.pyplot as plt
#the version string lives on the top-level package, not on pyplot
print matplotlib.__version__

pyparsing –  http://sourceforge.net/projects/pyparsing/files/pyparsing/pyparsing-2.0.2/ OR easy_install pyparsing

openpyxl – easy_install openpyxl
If pandas then reports the following error, install a supported version instead:
ValueError: Installed openpyxl is not supported at this time. Use >=1.6.1 and <2.0.0.
easy_install openpyxl==1.8.6
easy_install xlrd
easy_install pyodbc
easy_install sqlalchemy
easy_install xlwt

Install Pandas another way
https://store.continuum.io/cshop/anaconda/
Click “Download Anaconda”, enter email address
Click “Windows 64-bit Python 2.7 Graphical Installer”
Install “Anaconda-2.0.1-Windows-x86_64.exe” and go with the default settings.
Go to the cmd prompt and type “conda update conda”
c:\user\mder> conda update conda
Enter “y” to proceed and install the latest version
Next type “conda update anaconda”
Enter “y” to proceed and install the latest version
Next type “conda update pandas”
Enter “y” to proceed and install the latest version
Next type “conda update ipython”
Enter “y” to proceed and install the latest version
Next type “conda update tornado”
Enter “y” to proceed and install the latest version

Testing pandas
At the cmd prompt, type ipython notebook
It will open the browser at http://localhost:8888/tree/
Click “New Notebook”
Type
import pandas as pd
print "version : " + pd.__version__
Press the “>” (play) button to run the cell

DataFrames – Data in Pandas is often contained in a structure called a DataFrame. A DataFrame is a two-dimensional labeled data structure whose columns can be of different types.

To create a DataFrame, pass a dictionary of lists to the DataFrame constructor: 1) Each key of the dictionary becomes a column name. 2) The associated list supplies the values in that column.

import numpy as np
import pandas as pd
#from pandas import DataFrame, Series
teams = ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
'Lions', 'Lions']
loss = [5, 8, 6, 1, 5, 10, 6, 12]
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
'team': teams, 'wins': [11, 8, 10, 15, 11, 6, 10, 4],
#Series is reached through the pd alias, since the direct import above is commented out
'losses': pd.Series(loss)}
df = pd.DataFrame(data)
print df
'''
   losses     team  wins  year
0       5    Bears    11  2010
1       8    Bears     8  2011
2       6    Bears    10  2012
...
'''
print df.dtypes
'''
losses     int64
team      object
wins       int64
year       int64
dtype: object
'''
#basic statistics of the dataframe's numerical columns
print df.describe()
'''
          losses       wins         year
count   8.000000   8.000000     8.000000
mean    6.625000   9.375000  2011.125000
std     3.377975   3.377975     0.834523
min     1.000000   4.000000  2010.000000
25%     5.000000   7.500000  2010.750000
50%     6.000000  10.000000  2011.000000
75%     8.500000  11.000000  2012.000000
max    12.000000  15.000000  2012.000000
'''
#displays the first five rows of the dataset
print df.head()
#displays the last five rows of the dataset
print df.tail()
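
The printed output above lists the columns alphabetically, which appears to reflect how the pandas version used here ordered dictionary keys. As a minimal sketch, assuming a shortened three-row copy of the data (the name ordered is only for illustration), the columns argument of the constructor can pin the order explicitly. The print() call form is used so the snippet also runs on Python 3.

```python
import pandas as pd

# Same dictionary-of-lists idea as above, shortened to three rows
data = {'year': [2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears'],
        'wins': [11, 8, 10],
        'losses': [5, 8, 6]}

# columns= fixes the column order instead of leaving it to the pandas version
ordered = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])

print(list(ordered.columns))
print(ordered.head(2))   # head() and tail() accept an optional row count
```

Passing a row count to head() or tail() is handy when five rows is more (or less) than needed for a quick look.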

Series in Pandas – A Series is a one-dimensional object similar to an array, list, or column in a database. By default, each item receives an index label from 0 to N, where N is the number of items in the Series minus one.

import pandas as pd
series = pd.Series(['Python', 42, -1789710578, 12.35])
print series
'''
0         Python
1             42
2    -1789710578
3          12.35
dtype: object
'''
#manually assign indices to the items in the Series when creating the series
series = pd.Series(['Python', 'Mak', 359, 250.55],
index=['Language', 'Manager', 'ID', 'Price'])
print series
'''
Language    Python
Manager        Mak
ID             359
Price       250.55
dtype: object
'''
#use index to select specific items from the Series
print series['Language'] #Python
print series[['Manager', 'Price']] #Manager Mak Price 250.55 dtype: object
#use boolean operators to select specific items from the Series
skills = pd.Series([1, 2, 3, 4, 5], index=['C#', 'Java', 'Python',
'Hadoop', 'CRM'])
print skills > 3
'''
C#        False
Java      False
Python    False
Hadoop     True
CRM        True
dtype: bool
'''
print skills[skills > 3]
'''
Hadoop    4
CRM       5
dtype: int64
'''
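
To make the default labelling described for Series concrete, here is a small sketch. The names plain and labelled are hypothetical, and the print() call form is used so it also runs on Python 3.

```python
import pandas as pd

# No index given: pandas assigns labels 0 .. N-1
plain = pd.Series(['Python', 42, 12.35])
print(list(plain.index))

# Explicit labels: the Series then behaves a bit like a dictionary
labelled = pd.Series([359, 250.55], index=['ID', 'Price'])
print('ID' in labelled)       # membership tests look at the index labels
print(labelled['Price'])
```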

Notes
1) Selecting a single column from the DataFrame will return a Series.
2) Selecting multiple columns from the DataFrame will return a DataFrame.

print df['year'] #or df.year #shorthand
'''
0    2010
1    2011
2    2012
...
Name: year, dtype: int64
'''
print df[['year','wins']]
'''
   year  wins
0  2010    11
1  2011     8
...
'''
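
Notes 1 and 2 can be checked directly with isinstance. This is a minimal sketch on a three-row stand-in for df; the names single, multiple and also_frame are only for illustration.

```python
import pandas as pd

# A small stand-in for the df built earlier in this post
df = pd.DataFrame({'year': [2010, 2011, 2012],
                   'team': ['Bears', 'Bears', 'Bears'],
                   'wins': [11, 8, 10]})

single = df['year']               # one column label -> Series
multiple = df[['year', 'wins']]   # list of labels -> DataFrame
also_frame = df[['year']]         # a one-item list still gives a DataFrame

print(isinstance(single, pd.Series))
print(isinstance(multiple, pd.DataFrame))
print(isinstance(also_frame, pd.DataFrame))
```

The one-item list case is worth remembering when a later step expects a DataFrame rather than a Series.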

3) Row selection can be done in several ways:
a) Slicing
b) An individual index (through the functions iloc or loc)
c) Boolean indexing

print df.iloc[[0]] #or print df.loc[[0]]
'''
   losses   team  wins  year
0       5  Bears    11  2010
'''
print df[3:5]
'''
   losses     team  wins  year
3       1  Packers    15  2011
4       5  Packers    11  2012
'''
print df[df.wins > 10]
'''
   losses     team  wins  year
0       5    Bears    11  2010
3       1  Packers    15  2011
4       5  Packers    11  2012
'''
print df[(df.wins > 10) & (df.team == "Packers")]
'''
   losses     team  wins  year
3       1  Packers    15  2011
4       5  Packers    11  2012
'''
print df['wins'][df['losses']>=6]
print df[(df.wins>=2) | (df.losses>=1)]
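
In the examples above iloc and loc return the same row only because the default index labels happen to equal the row positions. A rough sketch of the difference, assuming a hypothetical frame whose index labels (10, 20, 30) were picked purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({'team': ['Bears', 'Packers', 'Lions'],
                   'wins': [11, 15, 6]},
                  index=[10, 20, 30])   # labels no longer match positions

by_position = df.iloc[[0]]   # first row, whatever its label is
by_label = df.loc[[10]]      # row whose index label is 10

print(by_position)
print(by_label)
print(0 in df.index)         # 0 is a valid position but not a label here
```

With this index, iloc is the choice for "the n-th row" and loc for "the row called X".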

Pandas Vectorized Methods – df.apply(numpy.mean) applies a function to each column of the DataFrame (numpy.mean is intended for numeric columns).
map() (for a particular column) and applymap() (for an entire DataFrame) create a new Series or DataFrame by applying a function, such as a lambda, to each element.

Lambda functions are small inline functions that are defined on the fly in Python. lambda x: x >= 1 takes an input x and returns x >= 1, a boolean that is either True or False.

df['wins'].map(lambda x: x >= 1)
df[['wins', 'losses']].applymap(lambda x: x >= 1)
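
The three methods can be compared side by side on a small numeric frame. This is a sketch under stated assumptions: the three-row data and the names col_means, flags, elementwise and all_flags are illustrative. Recent pandas releases (2.1 and later) deprecate applymap in favour of DataFrame.map, so the sketch picks whichever is available.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'wins': [11, 8, 10], 'losses': [5, 8, 6]})

# apply: the function receives each column as a whole
col_means = df.apply(np.mean)

# map: the lambda is called once per element of a single column
flags = df['wins'].map(lambda x: x >= 10)

# Element-wise over the whole frame: use DataFrame.map where it exists,
# otherwise fall back to applymap (the name used in this post)
elementwise = getattr(df, 'map', None) or df.applymap
all_flags = elementwise(lambda x: x >= 6)

print(col_means)
print(flags)
print(all_flags)
```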

PandaSql – C:\Python27\Scripts>easy_install pandasql
The pandasql package allows us to perform queries on dataframes using the SQLite syntax.
