Sunday, 12 July 2020

Hyperparameter Tuning


In machine learning, different parameters need to be passed while training an algorithm to get the best fit and accuracy from the model. Selecting the values of these parameters, such as n_estimators and max_depth, that are involved in training the model is called hyperparameter tuning.

For example, when tuning a decision tree, multiple parameters are available, as shown below:

tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None,
                            min_samples_split=2, min_samples_leaf=1,
                            min_weight_fraction_leaf=0.0, max_features=None,
                            random_state=None, max_leaf_nodes=None,
                            min_impurity_decrease=0.0, min_impurity_split=None,
                            class_weight=None, presort=False)

We need to choose the combination of the above parameters that gives the best accuracy for the model. There are two common methods of hyperparameter tuning:

1) Grid Search

This is one of the most basic methods for hyperparameter tuning: every possible combination of the given parameter values is evaluated, and the best combination is chosen for model building.
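
Below is a minimal sketch using scikit-learn's GridSearchCV on a decision tree; the parameter grid values are illustrative choices, not recommendations.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative grid: every combination below is evaluated
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 4, 6, None],
    'min_samples_split': [2, 5, 10],
}

# Each combination is scored with 5-fold cross-validation
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)  # combination with the highest cross-validation score
print(grid.best_score_)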

2) Random Search

In this method, randomly chosen values are passed for the different parameters and the best combination found is selected.
This method consumes less time than grid search because instead of trying every possible combination, only a fixed number of randomly sampled combinations are evaluated.
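
Below is a comparable sketch using scikit-learn's RandomizedSearchCV; only n_iter randomly sampled combinations are tried, and the distributions used here are illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import randint

X, y = load_iris(return_X_y=True)

# Distributions to sample from (illustrative values)
param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': randint(2, 10),
    'min_samples_split': randint(2, 20),
}

# Only 10 randomly sampled combinations are tried, each with 5-fold CV
search = RandomizedSearchCV(DecisionTreeClassifier(random_state=0), param_dist,
                            n_iter=10, cv=5, random_state=0)
search.fit(X, y)

print(search.best_params_)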

Saturday, 11 July 2020

Data Science Use Cases in Retail Industry

There are multiple ways in which data science can be applied to the retail industry. Following are the most prominent use cases:

  • Recommendation Engine
A recommendation engine helps to narrow down the options for a customer and showcase the products most relevant to them, which ultimately leads to an increase in sales.

  • Inventory Management
Based on the sales trends of different products, data science can be used for inventory management by predicting the future sales of a given product and stocking inventory accordingly. This approach helps retailers to provide products to customers at the right time while keeping inventory lean.

  • Sentiment Analysis
By reviewing data from social media and other online channels with Natural Language Processing, customer sentiment can be classified into neutral, positive and negative reviews. This shows where retailers are failing to meet customer demand and need to improve, and also where customers are happy with the retailer's service.

Monday, 6 July 2020

K-Means Clustering


K-Means clustering is an unsupervised learning technique. It is used to group data points that show similar characteristics into clusters that are dissimilar from one another.

In K-Means, K represents the number of clusters.

Advantages
  • Scales well
  • Efficient
Disadvantages
  • Choosing K
When to Use? 
  • Normally distributed data
  • Large number of samples
  • Limited number of clusters

Use Cases 
  • Document clustering
  • Customer segmentation

Python Code

from sklearn.cluster import KMeans
import numpy as np

# Six points forming two obvious groups (around x = 1 and x = 10)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Fit K-Means with K = 2 clusters
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

print(kmeans.labels_)                     # array([1, 1, 1, 0, 0, 0], dtype=int32)
print(kmeans.predict([[0, 0], [12, 3]]))  # array([1, 0], dtype=int32)
print(kmeans.cluster_centers_)            # [[10.  2.]
                                          #  [ 1.  2.]]
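
Since choosing K is the main difficulty, one common heuristic is the elbow method: fit K-Means for several values of K and look for the point where the inertia (within-cluster sum of squares) stops dropping sharply. A minimal sketch on the same data:

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Inertia drops sharply up to the "elbow", then flattens; here the elbow is at K = 2
for k in range(1, 5):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    print(k, km.inertia_)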


Saturday, 4 July 2020

How Naive Bayes Works ?

Naive Bayes is a supervised machine learning algorithm used for classification problems. It applies Bayes' theorem to calculate the probability of each class for a given data point.

The "naive" part is the assumption that all features are independent of one another given the class. The algorithm works well even on large data sets.

Advantages
  1. Naive Bayes performs better than many other models when the assumption of independent features holds true.
  2. Naive Bayes requires less data for training and therefore needs less training time.
  3. Simple, fast and easy to implement.
  4. It can be used for both binary and multi-class predictions
Disadvantages
  1. In practice it is hard to find data sets where all features are truly independent of one another
Use Cases
  1. Text Classification
  2. Spam Filtering
  3. Sentiment Analysis
  4. A Naive Bayes classifier combined with collaborative filtering can be used to build a recommendation engine.

Python Code

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the iris data set and hold out 70% of it for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=32)

# Fit a Gaussian Naive Bayes model and predict on the held-out set
model_nb = GaussianNB()
y_pred = model_nb.fit(X_train, y_train).predict(X_test)
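
To check how well the model performs, the predictions can be scored against the held-out labels, for example:

from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_pred))  # fraction of test samples classified correctly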




Friday, 3 July 2020

How Random Forest Works ?


Random forest is a supervised learning algorithm and works for both classification and regression problems.

Random forest is an ensemble classifier built from multiple decision trees. Ensemble models combine the results from different models.

Application
  • Credit card fraud detection
  • Consumer finance survey
  • Identification of disease in patients using classification 
  • Identify customer churn
How Random Forest works?
  1. Randomly select n features out of the total N features, where n << N
  2. For node d, calculate the best split point among the n features
  3. Split the node into two daughter nodes using the best split
  4. Repeat the first three steps until the desired number of nodes has been reached
  5. Build the forest by repeating steps 1 to 4 D times, where D is the number of trees to be constructed (a minimal code sketch follows below)
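
Below is a minimal sketch of the above in scikit-learn; n_estimators corresponds to D (the number of trees) and max_features to n (the features considered per split). The data set and parameter values are illustrative.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 trees, each split considering a random subset of sqrt(N) features
model_rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
model_rf.fit(X_train, y_train)
print(model_rf.score(X_test, y_test))  # mean accuracy on the test set
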
Advantages
  • Reduces overfitting compared to a single decision tree, which helps to improve accuracy
  • Works on both classification and regression problems
  • Works for both continuous and categorical data
  • Can handle missing values in the data
  • No need to normalize the data 
Disadvantages
  • Requires high computational power, as multiple trees are built during training
  • Training time is high compared to a single decision tree

Thursday, 2 July 2020

Types of Charts in Data Science

There are multiple chart types used in data science; following is a list of the most frequently used charts (a small plotting sketch follows the list):

1) Line Chart

2) Bar Chart

3) Histogram

4) Pie Chart

5) Doughnut Chart

6) Heat Map

7) Area Chart

8) Bubble Chart

9) Waterfall Chart
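
As an illustration, below is a minimal matplotlib sketch of the first two chart types; the data is made up.

import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr']
sales = [120, 135, 150, 142]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(months, sales)   # 1) Line chart
ax1.set_title('Line Chart')
ax2.bar(months, sales)    # 2) Bar chart
ax2.set_title('Bar Chart')
plt.show()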

Wednesday, 1 July 2020

Support Vector Machine

Support Vector Machine (SVM) is a supervised learning algorithm used for both classification and regression problems. It separates the data using a hyperplane.


The objective is to maximize the margin between the support vectors of the two classes: the larger the margin, the better the class separation.
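
Below is a minimal sketch of an SVM classifier using scikit-learn's SVC; the kernel and C values are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A linear kernel separates the classes with a hyperplane; C controls the margin trade-off
model_svm = SVC(kernel='linear', C=1.0)
model_svm.fit(X_train, y_train)
print(model_svm.score(X_test, y_test))  # mean accuracy on the test set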

Tuesday, 30 June 2020

Classification trees


Classification trees are a form of decision tree, a supervised machine learning algorithm used for both classification and regression problems.

Below are the key terms associated with classification trees

Root Node - It constitutes the entire population or sample, and this further gets divided into two or more homogeneous sets 

Splitting - It is a procedure of dividing a node into two or more sub-nodes

Decision Node - When a sub-node splits into further sub-nodes, it is called a decision node

Leaf/Terminal Node - Nodes that cannot be split further are called leaf or terminal nodes

Pruning - Removing the sub-nodes of a decision node is called pruning; it can be thought of as the opposite of splitting

Branch/Sub-Tree - A subsection of the entire tree is called a branch or sub-tree

Parent and Child Node - A node that is divided into sub-nodes is called the parent node, and the sub-nodes are its children





Advantages
  • Easy to understand
  • Useful in data exploration
  • Less data cleaning required
  • Data type is not a constraint
Challenges
  • Over-fitting
  • Not fit for continuous variables
Techniques for Division/Splitting
  • Gini Index
  • Chi square
  • Information Gain
  • Reduction of variance
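
Below is a minimal sketch of a classification tree in scikit-learn; the criterion parameter selects the splitting technique ('gini' for Gini index, 'entropy' for information gain), and the values used here are illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# max_depth caps tree growth, a simple guard against over-fitting
model_dt = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
model_dt.fit(X_train, y_train)
print(model_dt.score(X_test, y_test))  # mean accuracy on the test set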

Confusion Matrix


A confusion matrix is used to evaluate the accuracy of a classification model. It can also be defined as a table in which the actual and predicted results are laid out against each other.

Below is the table (1 = Positive, 0 = Negative):

              Predicted: 1    Predicted: 0
Actual: 1     TP              FN
Actual: 0     FP              TN

TP - True Positive
TN - True Negative
FN - False Negative (Type 2 Error)
FP - False Positive (Type 1 Error)

Accuracy = (TP + TN)/(TP + TN + FP + FN)

Recall = TP/(TP + FN) - Out of all actual positive classes, how many we predicted as positive

Precision = TP/(TP + FP) - Out of all predicted positive classes, how many are actually positive

F-Measure = (2 * Recall * Precision)/(Recall + Precision)

F-Measure is the harmonic mean of recall and precision. It measures recall and precision at the same time, which is useful because it is hard to compare models when one has high recall and low precision and the other the reverse.

Sensitivity - Same as recall, TP/(TP + FN): the proportion of actual positives that are correctly identified

Specificity - TN/(TN + FP): the proportion of actual negatives that are correctly identified

ROC Curve - A plot of the true positive rate (sensitivity) against the false positive rate (1 - specificity) at different classification thresholds

AUC - The area under the ROC curve; the closer it is to 1, the better the classifier separates the classes
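
Below is a minimal sketch computing the above metrics with scikit-learn; y_true and y_pred are made-up labels for illustration.

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels 0 and 1
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                   # 3 3 1 1
print(accuracy_score(y_true, y_pred))   # (TP + TN)/(TP + TN + FP + FN) = 0.75
print(recall_score(y_true, y_pred))     # TP/(TP + FN) = 0.75
print(precision_score(y_true, y_pred))  # TP/(TP + FP) = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of recall and precision = 0.75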

How Linear Regression works ?

Linear regression aims to model the relationship between two variables by fitting a straight line to the data.

It is assumed that the dependent variable can be predicted from the independent variables by fitting a best-fit line. When one independent variable is involved, this is known as simple linear regression; when multiple independent variables are involved, it is known as multiple linear regression.


Linear regression is a supervised learning algorithm

The regression line is the best-fit line drawn through the plotted data points.

Equation of linear regression line

y = mx + c

y = Dependent variable (labels to data)
x = Independent variable (input data)
m = Slope of line (coefficient of x)

c = Intercept
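
Below is a minimal sketch of simple linear regression in scikit-learn; coef_ corresponds to m (the slope) and intercept_ to c. The data is made up so that y = 2x + 1.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])  # independent variable x
y = np.array([3, 5, 7, 9, 11])           # dependent variable y

model_lr = LinearRegression().fit(X, y)
print(model_lr.coef_[0])        # m, the slope: 2.0
print(model_lr.intercept_)      # c, the intercept: 1.0
print(model_lr.predict([[6]]))  # predicted y for x = 6: [13.]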

Life Cycle of Data Science Project


To successfully execute a data science project, one needs to follow a concrete plan to achieve the goal:

1) Business understanding
Ask relevant questions and define objectives for the problem that needs to be solved

2) Data Mining
Gather and scrape the data necessary for the project

3) Data Cleaning
Fix the inconsistencies in the data and handle the missing values

4) Data Exploration
Form hypotheses about the defined problem by analyzing the data

5) Feature Engineering
Select the features that are important and remove unwanted features 

6) Predictive Modelling 
Train machine learning models on the data, evaluate their performance, and use them to make predictions

7) Data Visualization
Communicate the findings to key stakeholders using plots and interactive visualizations