The pandas library provides fast data cleaning, preparation, and analysis.
It is built on top of NumPy, so it's easy to work with arrays (Series) and matrices (DataFrames).
DataFrames are indexable and are made up of a number of Series objects. Think of a DataFrame as similar to a database table or spreadsheet, with rows and columns.
A Series can also be thought of as a one-dimensional DataFrame: a DataFrame with only one column.
Code Examples:
import numpy as np
import pandas as pd
s = pd.Series([1,2,4, 5, 6, 8])
s is a Series. Printing s gives:
0 1
1 2
2 4
3 5
4 6
5 8
dtype: int64
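Since pandas is built on NumPy, a numeric Series like this one can be converted to a NumPy array and supports element-wise arithmetic. The following is a minimal sketch reusing the same values; the names arr and doubled are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 4, 5, 6, 8])

# The values are available as a NumPy array.
arr = s.to_numpy()
print(type(arr))

# Arithmetic is applied element-wise, as with NumPy arrays.
doubled = s * 2
print(doubled.tolist())  # [2, 4, 8, 10, 12, 16]

# Common aggregations are available as methods.
print(s.sum())  # 26
```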
A DataFrame can be defined as:
df = pd.DataFrame({'date': ['2018-04-01', '2018-04-02', '2018-04-03'],
                   'price': [200, 380, 405]})
This is essentially a table with two columns, date and price, and three rows of values. We can also load a CSV file directly as a DataFrame using pd.read_csv.
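As a rough illustration of pd.read_csv, the sketch below reads the same table from CSV text. The csv_text string is a made-up stand-in for a file on disk; in practice you would usually pass a file path instead.

```python
import io
import pandas as pd

# Hypothetical CSV contents mirroring the table above.
csv_text = "date,price\n2018-04-01,200\n2018-04-02,380\n2018-04-03,405\n"

# read_csv accepts a file path or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)              # (3, 2)
print(df['price'].tolist())  # [200, 380, 405]
```

Note that the date column is read as plain strings here; passing parse_dates=['date'] asks pandas to convert it to datetimes.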
Once the data is loaded, we can leverage the power of pandas. For the table above, let's try the following:
1. Find entries where price is greater than 200:
df[df['price'] > 200]
This returns the rows with prices 380 and 405.
2. Find the total sale value:
df['price'].sum()
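To see what happens in the two steps above, it can help to look at the intermediate boolean mask. This is a minimal sketch using the same table; the names mask, expensive, and mid_range are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2018-04-01', '2018-04-02', '2018-04-03'],
                   'price': [200, 380, 405]})

# The comparison yields a boolean Series, one value per row.
mask = df['price'] > 200
print(mask.tolist())  # [False, True, True]

# Indexing with the mask keeps only the rows marked True.
expensive = df[mask]
print(expensive['price'].tolist())  # [380, 405]

# Conditions combine with & (and) and | (or); each needs parentheses.
mid_range = df[(df['price'] > 200) & (df['price'] < 400)]
print(mid_range['price'].tolist())  # [380]

# Aggregations work on a whole column or on a filtered subset.
total = df['price'].sum()
print(total)                     # 985
print(expensive['price'].sum())  # 785
```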
By default, DataFrame rows are numbered 0 to n-1 for n rows. We can also give them names. To fetch a row by name, we first need to set the index. Let's say we want to label each row with its date:
df = df.set_index(df['date'])
set_index returns a new DataFrame, which we assign back to the same variable. We can now query using the new index labels.
df.loc['2018-04-02'] returns the second row.
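The sketch below puts the indexing steps together and contrasts label-based lookup (loc) with position-based lookup (iloc). The names row and same_row are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2018-04-01', '2018-04-02', '2018-04-03'],
                   'price': [200, 380, 405]})
df = df.set_index(df['date'])

# Label-based lookup: the row comes back as a Series.
row = df.loc['2018-04-02']
print(row['price'])  # 380

# Position-based lookup still works after the index is renamed.
same_row = df.iloc[1]
print(same_row['price'])  # 380

# A single cell: row label first, then column name.
print(df.loc['2018-04-03', 'price'])  # 405
```

Because a Series is passed to set_index here rather than a column name, the date column is kept alongside the new index.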