
Monday, April 16, 2018

Introduction to Pandas Library

The pandas library provides fast data cleaning, preparation, and analysis.

It is built on top of NumPy, so it's easy to work with arrays (Series) and matrices (DataFrames).

DataFrames are indexable and are made up of a number of Series objects. Think of a DataFrame as similar to a database table or spreadsheet, with rows and columns.

A Series can also be thought of as a one-dimensional DataFrame, that is, a DataFrame with only one column.
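To make the relationship between the two concrete, here is a minimal sketch. The column names and values are made up purely for illustration. It shows that picking one column out of a DataFrame gives a Series, and that a Series can be turned back into a one-column DataFrame.

```python
import pandas as pd

# Hypothetical two-column table, used only for illustration
table = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Selecting a single column by name yields a Series
col_a = table['a']
print(type(col_a).__name__)   # Series

# A Series can be promoted to a DataFrame with exactly one column
col_a_frame = col_a.to_frame()
print(col_a_frame.shape)      # (3, 1)
```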

Code Examples:


import numpy as np
import pandas as pd
s = pd.Series([1,2,4, 5, 6, 8])

s is a Series. Printing s gives us:

0    1
1    2
2    4
3    5
4    6
5    8
dtype: int64

A DataFrame can be defined as:

df = pd.DataFrame({'date': ['2018-04-01', '2018-04-02', '2018-04-03'],
                   'price': [200, 380, 405]})

This is essentially a table with two columns, date and price, and three rows of values. We can also load
a CSV file directly into a DataFrame using pd.read_csv.
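As a rough sketch of what loading from CSV looks like, the snippet below reads the same three rows from an in-memory string instead of a file on disk. The variable names csv_text and df_from_csv are illustrative assumptions. In practice you would pass a file path to pd.read_csv.

```python
import io
import pandas as pd

# CSV content held in a string so the example needs no file on disk
csv_text = """date,price
2018-04-01,200
2018-04-02,380
2018-04-03,405
"""

# read_csv accepts a path or any file-like object; the first line becomes the header
df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv.shape)           # (3, 2)
print(list(df_from_csv.columns))   # ['date', 'price']
```

Note that the date column is read as plain text here. Passing parse_dates=['date'] to pd.read_csv would convert it to datetime values.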

Once the data is loaded, we can leverage the power of pandas. For the table above, let's try these:

1. Find entries where price is > 200

df[df['price'] > 200]

This returns the rows whose prices are 380 and 405.
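It can help to see the filter in two steps. In this small sketch (the names mask and expensive are just for illustration), the comparison first produces a Series of True/False values, and indexing the DataFrame with that Series keeps only the rows marked True.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2018-04-01', '2018-04-02', '2018-04-03'],
                   'price': [200, 380, 405]})

# Step 1: the comparison is applied element-wise and yields a boolean Series
mask = df['price'] > 200
print(mask.tolist())                 # [False, True, True]

# Step 2: indexing with the boolean Series keeps only the True rows
expensive = df[mask]
print(expensive['price'].tolist())   # [380, 405]
```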

2. Find the total sale value

df['price'].sum()
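For the sample table, the sum works out to 200 + 380 + 405 = 985. The sketch below confirms this and shows two other column aggregates that work the same way.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2018-04-01', '2018-04-02', '2018-04-03'],
                   'price': [200, 380, 405]})

# Aggregates reduce a whole column to a single value
total = df['price'].sum()
highest = df['price'].max()
average = df['price'].mean()

print(total)     # 985
print(highest)   # 405
print(average)   # roughly 328.33
```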

By default, DataFrame rows are numbered 0 to n-1. We can also give them names. To fetch a row by name, we first need to set an index. Let's say we want to label each row by its date:

df = df.set_index(df['date'])

set_index returns a new DataFrame rather than modifying the original, so we assign the result back to the same variable. We can now query rows using the new index labels.

df.loc['2018-04-02'] returns the second row.
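Here is a minimal end-to-end sketch of the label-based lookup. It uses a slight variation, passing the column name 'date' to set_index instead of the column itself, which moves date into the index and drops it from the regular columns. The variable names indexed and row are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2018-04-01', '2018-04-02', '2018-04-03'],
                   'price': [200, 380, 405]})

# set_index returns a new DataFrame; df itself is left untouched
indexed = df.set_index('date')
print(list(df.columns))        # ['date', 'price']
print(list(indexed.columns))   # ['price']

# Label-based lookup with .loc returns the matching row as a Series
row = indexed.loc['2018-04-02']
print(row['price'])            # 380

# Position-based lookup with .iloc still works after setting an index
print(indexed.iloc[1]['price'])   # 380
```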