Tuesday, April 17, 2018

Introduction to NumPy & Scikit Learn

NumPy adds an n-dimensional array data structure (ndarray) to Python.

import numpy as np
a = np.array([1,2,3])

We can also create matrices and perform matrix operations using NumPy.

m1 = np.matrix('1 2; 3 4')
m2 = np.matrix('9 10; 11 12')
m1 * m2

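For comparison, the same product can be computed with plain arrays and the `@` operator, which the NumPy documentation now recommends over `np.matrix`. This is only a minimal sketch; the names `a1`, `a2` and `product` are chosen here for illustration.

```python
import numpy as np

# Same values as m1 and m2 above, stored as ordinary 2-D arrays
a1 = np.array([[1, 2], [3, 4]])
a2 = np.array([[9, 10], [11, 12]])

# '@' is matrix multiplication; '*' on plain arrays multiplies element-wise
product = a1 @ a2
print(product)
```
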
We can also import CSV files as matrices.
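One way to do this is `np.loadtxt`. The sketch below reads from an in-memory string so that it runs on its own; the CSV content is made up for illustration, and in practice we would pass a file path instead.

```python
import io
import numpy as np

# Hypothetical CSV content standing in for a real file on disk
csv_text = io.StringIO("1,2,3\n4,5,6\n")

# loadtxt parses delimited numeric text into a 2-D array of floats
data = np.loadtxt(csv_text, delimiter=",")
print(data.shape)
```
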



Scikit-learn is a machine learning library. We can use it to preprocess data, remove redundant variables (dimensionality reduction), implement classification and regression models, and fine-tune those models.

Creating an ML Model:

1. To create a feature vector, we need numeric values for all features, so any non-numeric data has to be factorized. This can be done using the factorize function in the pandas library.

2. The next step is to scale the feature vector. This can be done using the built-in scalers in sklearn, such as StandardScaler.

3. Once we have a scaled feature vector, we can apply dimensionality reduction, for example Principal Component Analysis (PCA).

4. Once we have the final set of features, we can divide the data into training data and test data. For this we can use the train_test_split function from sklearn.

5a) Let's say we now want to apply logistic regression to classify the data. We can use the built-in LogisticRegression class from sklearn and fit it on the training data. We can then verify the model's performance on the test data by comparing the model's predictions with the actual test labels.

5b) Let's say we are instead interested in finding natural groupings in the data. In that case we can use k-means clustering.

We need to define the number of clusters, the number of times to run the k-means algorithm with different seed values, the maximum number of iterations for a single set of seed values, and a relative tolerance used to declare convergence.
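The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the toy data, the column names (`height`, `weight`, `city`), the label rule, and all parameter values are made up here and are not from any real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric columns, one text column, a binary label
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "height": rng.normal(170, 10, n),
    "weight": rng.normal(70, 8, n),
    "city": rng.choice(["Pune", "Delhi", "Mumbai"], n),
})
y = (df["height"] + rng.normal(0, 5, n) > 170).astype(int)

# Step 1: factorize the text column into integer codes
df["city"], city_labels = pd.factorize(df["city"])

# Step 2: scale every feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(df)

# Step 3: reduce the three scaled features to two principal components
X_reduced = PCA(n_components=2).fit_transform(X_scaled)

# Step 4: hold out 25% of the rows as test data
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.25, random_state=0
)

# Step 5a: fit logistic regression on training data, score on test data
clf = LogisticRegression()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of correct test predictions
print("test accuracy:", accuracy)

# Step 5b: k-means with the four settings described above
km = KMeans(
    n_clusters=3,   # number of clusters
    n_init=10,      # runs with different seed values
    max_iter=300,   # iteration cap for a single run
    tol=1e-4,       # relative tolerance to declare convergence
    random_state=0,
)
cluster_ids = km.fit_predict(X_reduced)
print("cluster sizes:", np.bincount(cluster_ids))
```

The scaler and PCA are fitted on the full data here only because that follows the order of the steps as listed. In practice they are often fitted on the training split alone, so that the test data stays unseen.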




