Sunday, 12 July 2020

Hyperparameter Tuning


In machine learning, different parameters need to be passed while training an algorithm to get the best fit and accuracy from the model. Selecting the values of these parameters, such as n_estimators and max_depth, that are involved in training the model is called hyperparameter tuning.

For example, when tuning a decision tree, multiple parameters are available, as shown below:

tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None,
                            min_samples_split=2, min_samples_leaf=1,
                            min_weight_fraction_leaf=0.0, max_features=None,
                            random_state=None, max_leaf_nodes=None,
                            min_impurity_decrease=0.0, min_impurity_split=None,
                            class_weight=None, presort=False)

We need to choose the combination of the above parameters that gives the best accuracy for the model. There are two common methods of hyperparameter tuning:

1) Grid Search

This is one of the most basic methods for hyperparameter tuning: every possible combination of the given parameter values is evaluated, and the best combination is chosen for model building.
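
Below is a minimal sketch using scikit-learn's GridSearchCV on a decision tree; the parameter grid values are illustrative choices, not recommendations.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative grid: every combination below is evaluated
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 4, 6, None],
    'min_samples_split': [2, 5, 10],
}

# Each combination is scored with 5-fold cross-validation
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)  # combination with the highest cross-validation score
print(grid.best_score_)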

2) Random Search

In this method, randomly chosen values are passed for the different parameters and the best combination found is selected.
This method consumes less time than grid search because instead of trying every possible combination, only a fixed number of randomly sampled combinations are evaluated.
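
Below is a comparable sketch using scikit-learn's RandomizedSearchCV; only n_iter randomly sampled combinations are tried, and the distributions used here are illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import randint

X, y = load_iris(return_X_y=True)

# Distributions to sample from (illustrative values)
param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': randint(2, 10),
    'min_samples_split': randint(2, 20),
}

# Only 10 randomly sampled combinations are tried, each with 5-fold CV
search = RandomizedSearchCV(DecisionTreeClassifier(random_state=0), param_dist,
                            n_iter=10, cv=5, random_state=0)
search.fit(X, y)

print(search.best_params_)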

Saturday, 11 July 2020

Data Science Use Cases in Retail Industry

There are multiple ways in which data science can be applied to the retail industry. Following are the most prominent use cases:

  • Recommendation Engine
A recommendation engine helps to narrow down the options for a customer and showcase the products most relevant to them, which ultimately leads to an increase in sales.

  • Inventory Management
Based on the sales trends of different products, data science can be used for inventory management by predicting the future sales of a given product and stocking inventory accordingly. This approach helps retailers to provide products to customers at the right time while keeping inventory lean.

  • Sentiment Analysis
By reviewing data from social media and other online channels with Natural Language Processing, customer sentiment can be classified into neutral, positive and negative reviews. This shows where retailers are failing to meet customer demand and need to improve, and also where customers are happy with the retailer's service.

Monday, 6 July 2020

K-Means Clustering


K-Means clustering is an unsupervised learning technique. It is used to group data points that show similar characteristics into clusters that are dissimilar from one another.

In K-Means, K represents the number of clusters.

Advantages
  • Scales well
  • Efficient
Disadvantages
  • Choosing K
When to Use? 
  • Normally distributed data
  • Large number of samples
  • Limited number of clusters

Use Cases 
  • Document clustering
  • Customer segmentation

Python Code

from sklearn.cluster import KMeans
import numpy as np

# Six points forming two obvious groups (around x = 1 and x = 10)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Fit K-Means with K = 2 clusters
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

print(kmeans.labels_)                     # array([1, 1, 1, 0, 0, 0], dtype=int32)
print(kmeans.predict([[0, 0], [12, 3]]))  # array([1, 0], dtype=int32)
print(kmeans.cluster_centers_)            # [[10.  2.]
                                          #  [ 1.  2.]]
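
Since choosing K is the main difficulty, one common heuristic is the elbow method: fit K-Means for several values of K and look for the point where the inertia (within-cluster sum of squares) stops dropping sharply. A minimal sketch on the same data:

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Inertia drops sharply up to the "elbow", then flattens; here the elbow is at K = 2
for k in range(1, 5):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    print(k, km.inertia_)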


Saturday, 4 July 2020

How Naive Bayes Works ?

Naive Bayes is a supervised machine learning algorithm used for classification problems. It applies Bayes' theorem to calculate the probability of each class for a given data point.

The "naive" part is the assumption that all features are independent of one another given the class. The algorithm works well even on large data sets.

Advantages
  1. Naive Bayes performs better than many other models when the assumption of independent features holds true.
  2. Naive Bayes requires less data for training and therefore needs less training time.
  3. Simple, fast and easy to implement.
  4. It can be used for both binary and multi-class predictions
Disadvantages
  1. In practice it is hard to find data sets where all features are truly independent of one another
Use Cases
  1. Text Classification
  2. Spam Filtering
  3. Sentiment Analysis
  4. A Naive Bayes classifier combined with collaborative filtering can be used to build a recommendation engine.

Python Code

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the iris data set and hold out 70% of it for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=32)

# Fit a Gaussian Naive Bayes model and predict on the held-out set
model_nb = GaussianNB()
y_pred = model_nb.fit(X_train, y_train).predict(X_test)
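
To check how well the model performs, the predictions can be scored against the held-out labels, for example:

from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_pred))  # fraction of test samples classified correctly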




Friday, 3 July 2020

How Random Forest Works ?


Random forest is a supervised learning algorithm and works for both classification and regression problems.

Random forest is an ensemble classifier built from multiple decision trees. Ensemble models combine the results from different models.

Application
  • Credit card fraud detection
  • Consumer finance survey
  • Identification of disease in patients using classification 
  • Identify customer churn
How Random Forest works?
  1. Randomly select n features out of the total N features, where n << N
  2. For node d, calculate the best split point among the n features
  3. Split the node into two daughter nodes using the best split
  4. Repeat the first three steps until the desired number of nodes has been reached
  5. Build the forest by repeating steps 1 to 4 D times, where D is the number of trees to be constructed (a minimal code sketch follows below)
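
Below is a minimal sketch of the above in scikit-learn; n_estimators corresponds to D (the number of trees) and max_features to n (the features considered per split). The data set and parameter values are illustrative.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 trees, each split considering a random subset of sqrt(N) features
model_rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
model_rf.fit(X_train, y_train)
print(model_rf.score(X_test, y_test))  # mean accuracy on the test set
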
Advantages
  • Reduces overfitting compared to a single decision tree, which helps to improve accuracy
  • Works on both classification and regression problems
  • Works for both continuous and categorical data
  • Can handle missing values in the data
  • No need to normalize the data 
Disadvantages
  • Requires high computational power, as multiple trees are built during training
  • Training time is high compared to a single decision tree

Thursday, 2 July 2020

Types of Charts in Data Science

There are multiple chart types used in data science; following is a list of the most frequently used charts (a small plotting sketch follows the list):

1) Line Chart

2) Bar Chart

3) Histogram

4) Pie Chart

5) Doughnut Chart

6) Heat Map

7) Area Chart

8) Bubble Chart

9) Waterfall Chart
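
As an illustration, below is a minimal matplotlib sketch of the first two chart types; the data is made up.

import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr']
sales = [120, 135, 150, 142]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(months, sales)   # 1) Line chart
ax1.set_title('Line Chart')
ax2.bar(months, sales)    # 2) Bar chart
ax2.set_title('Bar Chart')
plt.show()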

Wednesday, 1 July 2020

Support Vector Machine

Support Vector Machine (SVM) is a supervised learning algorithm used for both classification and regression problems. It separates the data using a hyperplane.


The objective is to maximize the margin between the support vectors of the two classes: the larger the margin, the better the class separation.
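
Below is a minimal sketch of an SVM classifier using scikit-learn's SVC; the kernel and C values are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A linear kernel separates the classes with a hyperplane; C controls the margin trade-off
model_svm = SVC(kernel='linear', C=1.0)
model_svm.fit(X_train, y_train)
print(model_svm.score(X_test, y_test))  # mean accuracy on the test set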

Tuesday, 30 June 2020

Classification trees


Classification trees are a form of decision tree, a supervised machine learning algorithm used for both classification and regression problems.

Below are the key terms associated with classification trees

Root Node - It constitutes the entire population or sample, and this further gets divided into two or more homogeneous sets 

Splitting - It is a procedure of dividing a node into two or more sub-nodes

Decision Node - When a sub-node splits into further sub-nodes, it is called a decision node

Leaf/Terminal Node - Nodes that cannot be split further are called leaf or terminal nodes

Pruning - Removing the sub-nodes of a decision node is called pruning; it can be thought of as the opposite of splitting

Branch/Sub-Tree - A subsection of the entire tree is called a branch or sub-tree

Parent and Child Node - A node that is divided into sub-nodes is called the parent node, and the sub-nodes are its children





Advantages
  • Easy to understand
  • Useful in data exploration
  • Less data cleaning required
  • Data type is not a constraint
Challenges
  • Over-fitting
  • Not fit for continuous variables
Techniques for Division/Splitting
  • Gini Index
  • Chi square
  • Information Gain
  • Reduction of variance
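
Below is a minimal sketch of a classification tree in scikit-learn; the criterion parameter selects the splitting technique ('gini' for Gini index, 'entropy' for information gain), and the values used here are illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# max_depth caps tree growth, a simple guard against over-fitting
model_dt = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
model_dt.fit(X_train, y_train)
print(model_dt.score(X_test, y_test))  # mean accuracy on the test set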

Confusion Matrix


A confusion matrix is used to evaluate the accuracy of a classification model. It can also be defined as a table in which the actual and predicted results are laid out against each other.

Below is the table (1 = Positive, 0 = Negative):

              Predicted: 1    Predicted: 0
Actual: 1     TP              FN
Actual: 0     FP              TN

TP - True Positive
TN - True Negative
FN - False Negative (Type 2 Error)
FP - False Positive (Type 1 Error)

Accuracy = (TP + TN)/(TP + TN + FP + FN)

Recall = TP/(TP + FN) - Out of all actual positive classes, how many we predicted as positive

Precision = TP/(TP + FP) - Out of all predicted positive classes, how many are actually positive

F-Measure = (2 * Recall * Precision)/(Recall + Precision)

F-Measure is the harmonic mean of recall and precision. It measures recall and precision at the same time, which is useful because it is hard to compare models when one has high recall and low precision and the other the reverse.

Sensitivity - Same as recall, TP/(TP + FN): the proportion of actual positives that are correctly identified

Specificity - TN/(TN + FP): the proportion of actual negatives that are correctly identified

ROC Curve - A plot of the true positive rate (sensitivity) against the false positive rate (1 - specificity) at different classification thresholds

AUC - The area under the ROC curve; the closer it is to 1, the better the classifier separates the classes
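
Below is a minimal sketch computing the above metrics with scikit-learn; y_true and y_pred are made-up labels for illustration.

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels 0 and 1
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                   # 3 3 1 1
print(accuracy_score(y_true, y_pred))   # (TP + TN)/(TP + TN + FP + FN) = 0.75
print(recall_score(y_true, y_pred))     # TP/(TP + FN) = 0.75
print(precision_score(y_true, y_pred))  # TP/(TP + FP) = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of recall and precision = 0.75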

How Linear Regression works ?

Linear regression aims to model the relationship between two variables by fitting a straight line to the data.

It is assumed that the dependent variable can be predicted from the independent variables by fitting a best-fit line. When one independent variable is involved, this is known as simple linear regression; when multiple independent variables are involved, it is known as multiple linear regression.


Linear regression is a supervised learning algorithm

The regression line is the best-fit line drawn through the plotted data points.

Equation of linear regression line

y = mx + c

y = Dependent variable (labels to data)
x = Independent variable (input data)
m = Slope of line (coefficient of x)

c = Intercept
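
Below is a minimal sketch of simple linear regression in scikit-learn; coef_ corresponds to m (the slope) and intercept_ to c. The data is made up so that y = 2x + 1.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])  # independent variable x
y = np.array([3, 5, 7, 9, 11])           # dependent variable y

model_lr = LinearRegression().fit(X, y)
print(model_lr.coef_[0])        # m, the slope: 2.0
print(model_lr.intercept_)      # c, the intercept: 1.0
print(model_lr.predict([[6]]))  # predicted y for x = 6: [13.]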

Life Cycle of Data Science Project


To successfully execute a data science project, one needs to follow a concrete plan to achieve the goal:

1) Business understanding
Ask relevant questions and define objectives for the problem that needs to be solved

2) Data Mining
Gather and scrape the data necessary for the project

3) Data Cleaning
Fix the inconsistencies in the data and handle the missing values

4) Data Exploration
Form hypotheses about the defined problem by analyzing the data

5) Feature Engineering
Select the features that are important and remove unwanted features 

6) Predictive Modelling 
Train machine learning models on the data, evaluate their performance, and use them to make predictions

7) Data Visualization
Communicate the findings to key stakeholders using plots and interactive visualizations