Technology for Nuts & Nerds: data analysis

Monday, April 16, 2018

Introduction to Pandas Library

The pandas library provides fast data cleaning, preparation, and analysis.

It is built on top of NumPy, so it's easy to work with one-dimensional arrays (Series) and two-dimensional tables (DataFrames).

DataFrames are indexable and are made up of a number of Series objects. Think of a DataFrame as similar to a database table or a spreadsheet, with rows and columns.

A Series can also be thought of as a one-dimensional DataFrame, that is, a DataFrame with only one column.

Code Examples:

import numpy as np
import pandas as pd
s = pd.Series([1,2,4, 5, 6, 8])

s is a Series. Printing s gives:

0    1
1    2
2    4
3    5
4    6
5    8
dtype: int64

A DataFrame can be defined as:

df = pd.DataFrame({'date': ['2018-04-01', '2018-04-02', '2018-04-03'],
                   'price': [200, 380, 405]})

This is essentially a table with two columns, date and price, and three rows of values. We can also load
a CSV file directly as a DataFrame using pd.read_csv.
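As a minimal sketch of pd.read_csv, the example below reads the same table from CSV text. The CSV contents and the names csv_text and df_csv are made up for illustration; io.StringIO stands in for a real file so the snippet runs without anything on disk.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the contents mirror the table above.
csv_text = """date,price
2018-04-01,200
2018-04-02,380
2018-04-03,405
"""

# read_csv accepts a file path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.shape)             # (3, 2)
print(df_csv['price'].tolist()) # [200, 380, 405]
```

With a real file, the call would simply take the file's path as its first argument.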

Once the data is loaded, we can leverage the power of pandas. For the table above, let's try these:

1. Find entries where price is greater than 200

df[df['price'] > 200]

This returns the rows with prices 380 and 405.
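It may help to see what happens inside the brackets. The comparison produces a Series of True/False values (a boolean mask), and indexing with that mask keeps only the True rows. The sketch below rebuilds the same table; the names mask, filtered, and mid_range are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2018-04-01', '2018-04-02', '2018-04-03'],
                   'price': [200, 380, 405]})

# The comparison is evaluated row by row and yields a boolean Series.
mask = df['price'] > 200
print(mask.tolist())                 # [False, True, True]

# Indexing with the mask keeps only the rows where it is True.
filtered = df[mask]
print(filtered['price'].tolist())    # [380, 405]

# Conditions can be combined with & (and) or | (or);
# each condition needs its own parentheses.
mid_range = df[(df['price'] > 200) & (df['price'] < 400)]
print(mid_range['price'].tolist())   # [380]
```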

2. Find the total sale value

df['price'].sum()
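Selecting a single column returns a Series, so sum is just one of several aggregations available on it. A small sketch on the same table follows; the variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2018-04-01', '2018-04-02', '2018-04-03'],
                   'price': [200, 380, 405]})

prices = df['price']       # a single column is a Series

total = prices.sum()       # 200 + 380 + 405 = 985
highest = prices.max()     # 405
average = prices.mean()    # total divided by the number of rows

print(total, highest, average)
```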

By default, DataFrame rows are numbered 0 to n-1. We can also give them names. To fetch a row by name, we first need to set an index. Let's say we want to name each row by its date:

df = df.set_index(df['date'])

set_index does not modify the DataFrame in place; it returns a new one, which we assign back to the old variable. We can now query using the newly named index:

df.loc['2018-04-02'] corresponds to the second row.
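The sketch below puts label-based lookup (loc) next to position-based lookup (iloc) on the same table. It also shows set_index('date'), a variant not used above, which takes the column name and moves the column into the index instead of keeping a copy of it. The names row_by_label, row_by_position, and by_date are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2018-04-01', '2018-04-02', '2018-04-03'],
                   'price': [200, 380, 405]})

# Same call as in the post: the date values become the row labels,
# and the original date column is kept as well.
df = df.set_index(df['date'])

row_by_label = df.loc['2018-04-02']   # look up by row name
row_by_position = df.iloc[1]          # look up by 0-based position

print(row_by_label['price'])          # 380
print(row_by_position['price'])       # 380

# Variant: passing the column name moves 'date' into the index,
# so only 'price' remains as a column.
by_date = df.reset_index(drop=True).set_index('date')
print(list(by_date.columns))          # ['price']
print(by_date.loc['2018-04-02', 'price'])  # 380
```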