Tuesday, April 17, 2018

Introduction to NumPy & Scikit Learn

NumPy adds an array data structure (the n-dimensional ndarray) to Python.

import numpy as np
a = np.array([1,2,3])

We can also create matrices and perform matrix operations using NumPy:

m1 = np.matrix('1 2; 3 4')
m2 = np.matrix('9 10; 11 12')
m1 * m2

We can also import CSV files as matrices.
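As a minimal sketch of both ideas, the snippet below multiplies two matrices and loads CSV data into an array. It assumes plain 2-D arrays with the @ operator in place of np.matrix, and uses np.loadtxt on made-up in-memory CSV text; csv_text and its contents are hypothetical, not from the original post.

```python
import io

import numpy as np

# Plain 2-D arrays with the @ operator perform matrix multiplication
m1 = np.array([[1, 2], [3, 4]])
m2 = np.array([[9, 10], [11, 12]])
product = m1 @ m2
print(product)

# Hypothetical CSV content held in memory; a file path works the same way
csv_text = io.StringIO("1,2,3\n4,5,6")
data = np.loadtxt(csv_text, delimiter=",")
print(data.shape)
```

np.loadtxt expects purely numeric data; for CSV files with headers or mixed types, pandas is usually the more convenient choice.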



Scikit-learn is a machine learning library. With it we can preprocess data, remove redundant variables (dimensionality reduction), implement classification and regression models, and fine-tune those models.

Creating ML Model:

1. To create a feature vector, we need numeric values for all features, so non-numeric data has to be factorized. This can be done using the factorize function in the pandas library.

2. The next step is to scale the feature vector. This can be done using sklearn's built-in scalers, such as StandardScaler.

3. Once we have a scaled feature vector, we can apply dimensionality reduction, for example Principal Component Analysis (PCA).

4. Once we have the final set of features, we can divide the data into training data and test data. For this we can use the train_test_split function from sklearn.

5a) Let's say we now want to apply logistic regression to classify the data. We can use sklearn's built-in LogisticRegression and fit it on the training data. We can then verify the model's performance on the test data by comparing the model's predictions with the actual test labels.

5b) If we are instead interested in finding natural groupings in the data, we can use k-means clustering.

We need to specify the number of clusters (n_clusters), the number of times to run the k-means algorithm with different seed values (n_init), the maximum number of iterations for a single run (max_iter), and the tolerance (tol): the relative tolerance with regard to inertia used to declare convergence.
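The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the toy data, the column names (colour, height, weight), the label rule, and all parameter values are made up, and both 5a and 5b are run on the same reduced features for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up toy data: one categorical feature, two numeric features, a binary label
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame({
    "colour": rng.choice(["red", "green", "blue"], size=n),
    "height": rng.normal(170, 10, size=n),
    "weight": rng.normal(70, 15, size=n),
})
labels = (raw["height"] + rng.normal(0, 5, size=n) > 170).astype(int)

# 1. Factorize the categorical column into integer codes
raw["colour"], colour_names = pd.factorize(raw["colour"])

# 2. Scale every feature to zero mean and unit variance
scaled = StandardScaler().fit_transform(raw)

# 3. Reduce the three features to two principal components
reduced = PCA(n_components=2).fit_transform(scaled)

# 4. Hold out 25% of the rows as test data
X_train, X_test, y_train, y_test = train_test_split(
    reduced, labels, test_size=0.25, random_state=0
)

# 5a. Fit logistic regression on the training data, score it on the test data
clf = LogisticRegression().fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")

# 5b. k-means: cluster count, number of restarts, iteration cap, tolerance
km = KMeans(n_clusters=3, n_init=10, max_iter=300, tol=1e-4, random_state=0)
cluster_ids = km.fit_predict(reduced)
print(np.bincount(cluster_ids))
```

The sketch follows the order of the steps as listed. In practice the scaler and PCA are often fitted on the training split only, so that no information from the test data reaches the model.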





Monday, April 16, 2018

Introduction to Pandas Library

The pandas library provides fast data cleaning, preparation, and analysis.

It is built on top of NumPy, so it's easy to work with arrays (Series) and matrices (DataFrames).

DataFrames are indexable and are made up of a number of Series objects. Think of a DataFrame as similar to a database table or spreadsheet, with rows and columns.

A Series can also be thought of as a 1-d DataFrame: a DataFrame with only one column.

Code Examples:


import numpy as np
import pandas as pd
s = pd.Series([1,2,4, 5, 6, 8])

s is a Series. Printing s gives us:

0    1
1    2
2    4
3    5
4    6
5    8
dtype: int64

A DataFrame can be defined as:

df = pd.DataFrame({'date' : ['2018-04-01', '2018-04-02', '2018-04-03'],
'price': [200, 380, 405]})

This is essentially a table with 2 columns (date and price) and 3 rows of values. We can also load
a CSV file directly as a DataFrame using pd.read_csv.
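A minimal sketch of loading CSV data with pd.read_csv. To keep it self-contained, it assumes the CSV text lives in an in-memory buffer (csv_text is hypothetical); passing a file path instead works the same way.

```python
import io

import pandas as pd

# Hypothetical CSV content held in memory; pd.read_csv also accepts a file path
csv_text = io.StringIO(
    "date,price\n"
    "2018-04-01,200\n"
    "2018-04-02,380\n"
    "2018-04-03,405\n"
)
df = pd.read_csv(csv_text)

print(df.shape)   # (rows, columns)
print(df.dtypes)  # column types inferred by pandas
```

The first line of the CSV is used as the column names by default.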

Once the data is loaded, we can leverage the power of pandas. For the table above, let's try these:

1. Find entries where price is > 200

df[df['price'] > 200]

This returns the rows with prices 380 and 405.

2. Find the total sale value:

df['price'].sum()

By default, DataFrame rows are numbered 0 to n-1. We can also give them names. To fetch a row by name, we first need to set an index. Let's say we want to name each row by its date:

df = df.set_index(df['date'])

set_index returns a new DataFrame, which we assign back to the old variable. We can now query using the newly named index:

df.loc['2018-04-02'] returns the second row.
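A short sketch of label-based lookup, using the same table. The second half is a variation that is not code from the post: it assumes we want a combined row label built from date and price joined with a # separator.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2018-04-01", "2018-04-02", "2018-04-03"],
                   "price": [200, 380, 405]})

# Index by date; passing a Series keeps date as a regular column too
by_date = df.set_index(df["date"])
print(by_date.loc["2018-04-02", "price"])

# Variation (an assumption): a combined "date#price" label for each row
combined = df.set_index(df["date"] + "#" + df["price"].astype(str))
print(combined.loc["2018-04-02#380", "price"])
```

Calling df.set_index('date') with the column name instead would move the column into the index rather than keeping a copy of it.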