Pandas – Handles data in a way suited to analysts (similar to R). Pandas = R + Python. Pandas lets us structure and manipulate data in ways that are particularly well suited to data analysis; it takes many of the best elements from R and implements them in Python.
Install
C:\Python27\Scripts>easy_install pandas
Open a Python IDE and type:
import pandas as pd
print pd.__version__
matplotlib – http://sourceforge.net/projects/matplotlib/files/matplotlib/matplotlib-1.4.0/ OR easy_install matplotlib
import matplotlib
import matplotlib.pyplot as plt
print matplotlib.__version__ #the version string lives on the top-level package
pyparsing – http://sourceforge.net/projects/pyparsing/files/pyparsing/pyparsing-2.0.2/ OR easy_install pyparsing
openpyxl – easy_install openpyxl
If pandas then raises "ValueError: Installed openpyxl is not supported at this time. Use >=1.6.1 and <2.0.0.", install an older release instead:
easy_install openpyxl==1.8.6
easy_install xlrd
easy_install pyodbc
easy_install sqlalchemy
easy_install xlwt
Install Pandas another way
https://store.continuum.io/cshop/anaconda/
Click “Download Anaconda”, enter email address
Click “Windows 64-bit Python 2.7 Graphical Installer”
Install “Anaconda-2.0.1-Windows-x86_64.exe” and go with the default settings.
Go to the cmd prompt and type “conda update conda”
c:\user\mder> conda update conda
Enter “y” to proceed and install the latest version
Next type “conda update anaconda”
Enter “y” to proceed and install the latest version
Next type “conda update pandas”
Enter “y” to proceed and install the latest version
Next type “conda update ipython”
Enter “y” to proceed and install the latest version
Next type “conda update tornado”
Enter “y” to proceed and install the latest version
Testing pandas
Type “ipython notebook” at the cmd prompt
It will open a browser at http://localhost:8888/tree/
Click “New Notebook”
Type
import pandas as pd
print "version : " + pd.__version__
Press the “>” play symbol to run the cell
DataFrames – Data in Pandas is often contained in a structure called a DataFrame. A DataFrame is a two-dimensional labeled data structure whose columns can be of different types.
To create a DataFrame, pass a dictionary of lists to the DataFrame constructor: 1) Each key of the dictionary becomes a column name. 2) The associated list supplies the values within that column.
import numpy as np
import pandas as pd
teams = ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
         'Lions', 'Lions']
loss = [5, 8, 6, 1, 5, 10, 6, 12]
#each key becomes a column name; the values can be lists or a Series
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': teams, 'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': pd.Series(loss)}
df = pd.DataFrame(data)
print df
'''
losses team wins year
0 5 Bears 11 2010
1 8 Bears 8 2011
2 6 Bears 10 2012
...
'''
print df.dtypes
'''
losses int64
team object
wins int64
year int64
dtype: object
'''
#basic statistics of the dataframe's numerical columns
print df.describe()
'''
losses wins year
count 8.000000 8.000000 8.000000
mean 6.625000 9.375000 2011.125000
std 3.377975 3.377975 0.834523
min 1.000000 4.000000 2010.000000
25% 5.000000 7.500000 2010.750000
50% 6.000000 10.000000 2011.000000
75% 8.500000 11.000000 2012.000000
max 12.000000 15.000000 2012.000000
'''
#displays the first five rows of the dataset
print df.head()
#displays the last five rows of the dataset
print df.tail()
Series in Pandas – A Series is a one-dimensional object similar to an array, a list, or a column in a database. By default, each item receives an index label from 0 to N, where N is the number of items in the Series minus one.
import pandas as pd
series = pd.Series(['Python', 42, -1789710578, 12.35])
print series
'''
0 Python
1 42
2 -1789710578
3 12.35
dtype: object
'''
#manually assign indices to the items in the Series when creating the series
series = pd.Series(['Python', 'Mak', 359, 250.55],
                   index=['Language', 'Manager', 'ID', 'Price'])
print series
'''
Language Python
Manager Mak
ID 359
Price 250.55
dtype: object
'''
#use index to select specific items from the Series
print series['Language'] #Python
print series[['Manager', 'Price']] #Manager Mak, Price 250.55, dtype: object
#use boolean operators to select specific items from the Series
skills = pd.Series([1, 2, 3, 4, 5], index=['C#', 'Java', 'Python',
'Hadoop', 'CRM'])
print skills > 3
'''
C# False
Java False
Python False
Hadoop True
CRM True
dtype: bool
'''
print skills[skills > 3]
'''
Hadoop 4
CRM 5
dtype: int64
'''
Notes
1) Selecting a single column from the DataFrame will return a Series.
2) Selecting multiple columns from the DataFrame will return a DataFrame.
print df['year'] #or df.year #shorthand
'''
0 2010
1 2011
2 2012
...
Name: year, dtype: int64
'''
print df[['year','wins']]
'''
year wins
0 2010 11
1 2011 8
...
'''
3) Rows can be selected in several ways:
a) Slicing
b) An individual index (through the indexers iloc or loc)
c) Boolean indexing
print df.iloc[[0]] #or print df.loc[[0]]
'''
losses team wins year
0 5 Bears 11 2010
'''
print df[3:5]
'''
losses team wins year
3 1 Packers 15 2011
4 5 Packers 11 2012
'''
print df[df.wins > 10]
'''
losses team wins year
0 5 Bears 11 2010
3 1 Packers 15 2011
4 5 Packers 11 2012
'''
print df[(df.wins > 10) & (df.team == "Packers")]
'''
losses team wins year
3 1 Packers 15 2011
4 5 Packers 11 2012
'''
print df['wins'][df['losses']>=6] #wins column, only for rows with at least 6 losses
print df[(df.wins>=2) | (df.losses>=1)] #rows matching either condition
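With the default 0 to N index used above, iloc and loc return the same row, which hides the difference between them. The sketch below uses a small made-up frame (the name records and its team-name index are assumptions for illustration, not part of the data above) to show that iloc selects by position while loc selects by label. It is written with print() so it also runs on Python 3.

```python
import pandas as pd

# Hypothetical frame whose index holds team names instead of 0..N.
records = pd.DataFrame(
    {"wins": [11, 15, 6], "losses": [5, 1, 10]},
    index=["Bears", "Packers", "Lions"],
)

by_position = records.iloc[1]      # second row, whatever its label is
by_label = records.loc["Packers"]  # row whose index label is "Packers"

# Both lookups reach the same row here; each comes back as a Series.
print(by_position["wins"])  # 15
print(by_label["wins"])     # 15

# Label slices include the end label; positional slices exclude the end position.
label_slice = records.loc["Bears":"Packers"]
position_slice = records.iloc[0:1]
print(len(label_slice))     # 2
print(len(position_slice))  # 1
```

Passing a list, as in df.iloc[[0]] above, keeps the result as a one-row DataFrame rather than a Series.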
Pandas Vectorized Methods – df.apply(np.mean) applies a function (here the mean) to every column of the DataFrame in a single call.
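A minimal sketch of apply, assuming a purely numeric frame: the name scores and its values are made up for illustration (they reuse the first three wins and losses from the teams data). A string column such as team would probably need to be left out first, for example with df[['wins', 'losses']], because a mean of strings is not defined.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric-only frame.
scores = pd.DataFrame({"wins": [11, 8, 10], "losses": [5, 8, 6]})

# By default apply() calls the function once per column and returns a Series
# indexed by column name.
column_means = scores.apply(np.mean)
print(column_means)

# axis=1 calls the function once per row instead.
games_per_row = scores.apply(np.sum, axis=1)
print(games_per_row.tolist())  # [16, 16, 16]
```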
map() (a particular column) and applymap() (the entire DataFrame) – create a new Series or DataFrame by applying a function, here a lambda, to each element.
Lambda functions are small inline functions defined on the fly in Python. lambda x: x >= 1 takes an input x and returns x >= 1, a boolean that is either True or False.
df['wins'].map(lambda x: x >= 1)
df[['wins', 'losses']].applymap(lambda x: x >= 1)
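The two calls above can be sketched on a small made-up frame (scores and its values are assumptions for illustration). Newer pandas releases expose the element-wise DataFrame operation as DataFrame.map and deprecate applymap, so the sketch picks whichever method the installed version provides.

```python
import pandas as pd

# Hypothetical frame with a few zeros so the lambda returns both True and False.
scores = pd.DataFrame({"wins": [11, 0, 10], "losses": [5, 8, 0]})

# Series.map: one column in, a boolean Series out.
wins_flag = scores["wins"].map(lambda x: x >= 1)
print(wins_flag.tolist())  # [True, False, True]

# Element-wise over the whole frame: use DataFrame.map when it exists,
# otherwise fall back to the older DataFrame.applymap.
elementwise = scores.map if hasattr(scores, "map") else scores.applymap
flags = elementwise(lambda x: x >= 1)
print(flags)
```

For a simple comparison like this one, scores >= 1 gives the same boolean frame without a lambda; map and applymap are more useful when the per-element logic cannot be written as a vectorized expression.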
pandasql – C:\Python27\Scripts>easy_install pandasql
The pandasql package lets us query DataFrames using SQLite syntax.
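To give a feel for the kind of query this means, here is a sketch that does not use pandasql at all. It copies a made-up frame into an in-memory SQLite database with the standard-library sqlite3 module and reads the query result back with pandas. The table name teams and the sample rows are assumptions for illustration. With pandasql installed, the equivalent is, as far as I know, a single sqldf(query, locals()) call that refers to the DataFrame by its variable name.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "team": ["Bears", "Packers", "Lions"],
    "wins": [11, 15, 6],
})

# Copy the frame into an in-memory SQLite database as a table named "teams".
conn = sqlite3.connect(":memory:")
df.to_sql("teams", conn, index=False)

# An ordinary SQLite SELECT; the result comes back as a DataFrame.
query = "SELECT team, wins FROM teams WHERE wins > 10 ORDER BY wins DESC"
winners = pd.read_sql_query(query, conn)
conn.close()

print(winners)
```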