Sunday, 12 July 2020

Hyperparameter Tuning


In machine learning, while training an algorithm, different parameters need to be passed to get the best fit and accuracy from the model. Selecting the values of the parameters involved in training the model, such as n_estimators and max_depth, is called hyperparameter tuning.

For example, tuning a decision tree involves multiple parameters, as shown below:

tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None,
    min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
    max_features=None, random_state=None, max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None,
    presort=False)

We need to choose the combination of these parameters that gives the best accuracy for the model. There are two common ways of hyperparameter tuning:

1) Grid Search

This is one of the basic methods of hyperparameter tuning. In this method, all possible combinations of the given parameter values are evaluated, and the best combination is chosen for model building.
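A minimal grid search sketch using sklearn's GridSearchCV; the parameter values in the grid below are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Every combination of these values is tried with 5-fold cross-validation
param_grid = {'max_depth': [2, 3, 4], 'min_samples_split': [2, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)  # the best combination found
print(grid.best_score_)   # its mean cross-validated accuracy
```

Note that the number of fits grows multiplicatively with the size of the grid, which is why grid search gets expensive quickly.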

2) Random Search

In this method, randomly sampled values are passed for the different parameters, and the best combination found is chosen.
This method consumes less time compared to grid search because, instead of trying all possible combinations, only a fixed number of randomly sampled combinations are evaluated.
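A minimal random search sketch using sklearn's RandomizedSearchCV; the distributions and n_iter value below are illustrative assumptions:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Only n_iter random combinations are sampled, not the full grid
param_dist = {'n_estimators': randint(10, 100), 'max_depth': randint(2, 8)}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=10, cv=5, random_state=0)
search.fit(X, y)

print(search.best_params_)  # best of the 10 sampled combinations
```

Because the budget is fixed at n_iter fits, random search stays cheap even when the parameter space is large.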

Saturday, 11 July 2020

Data Science Use Cases in Retail Industry

There are multiple ways in which data science can be applied to the retail industry. Following are the most prominent use cases:

  • Recommendation Engine
A recommendation engine helps to narrow down the options for a customer and showcase the products relevant to that customer, which ultimately leads to an increase in sales.

  • Inventory Management
Based on the sales trend for different products, data science can be used for inventory management by predicting future sales for a given product and stocking inventory accordingly. This approach helps retailers provide products to customers at the right time while keeping inventory under control.

  • Sentiment Analysis
By reviewing data from social media and other online channels, analysis can be done using Natural Language Processing, and customer sentiment can be classified as neutral, positive or negative. This helps retailers understand where they are failing to meet customer demand and where improvement is needed, and also where customers are happy with their service.

Monday, 6 July 2020

K-Means Clustering


K-Means clustering is an unsupervised learning technique. It is used to group data points that show similar characteristics to each other and dissimilar characteristics from points in other groups.

In K-Means, K represents the number of clusters.

Advantages
  • Scales well
  • Efficient
Disadvantages
  • Choosing K
When to Use? 
  • Normally distributed data
  • Large number of samples
  • Limited number of clusters

Use Cases 
  • Document classification
  • Customer segmentation

Python Code

from sklearn.cluster import KMeans
import numpy as np

# Two obvious groups: points near x=1 and points near x=10
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

print(kmeans.labels_)                     # e.g. [1 1 1 0 0 0]
print(kmeans.predict([[0, 0], [12, 3]]))  # e.g. [1 0]
print(kmeans.cluster_centers_)            # e.g. [[10. 2.] [1. 2.]]
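The "Choosing K" disadvantage above is often handled with the elbow method: run K-Means for several values of K, record the inertia (within-cluster sum of squares), and pick the K where the curve bends. A minimal sketch on the same toy data:

```python
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Inertia drops sharply up to the "right" K, then flattens
inertias = []
for k in range(1, 5):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias.append(km.inertia_)
    print(k, km.inertia_)
```

For this data the drop from K=1 to K=2 is large and further increases give only small improvements, suggesting K=2.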


Saturday, 4 July 2020

How Naive Bayes Works?

Naive Bayes is a supervised machine learning algorithm for classification problems. It uses Bayes' theorem to calculate the probability of each class for a given data point.

The "naive" part is the assumption that all features are independent of one another given the class. The algorithm works well even on large data sets.
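The underlying rule is Bayes' theorem: P(class | feature) is proportional to P(feature | class) times P(class). A tiny hand-computed spam example; all the probabilities below are made up purely for illustration:

```python
# Invented probabilities for a toy spam filter
p_spam = 0.4             # prior P(spam)
p_ham = 0.6              # prior P(ham)
p_word_given_spam = 0.8  # P(word "offer" appears | spam)
p_word_given_ham = 0.1   # P(word "offer" appears | ham)

# Unnormalized posteriors for a message containing "offer"
spam_score = p_word_given_spam * p_spam  # 0.32
ham_score = p_word_given_ham * p_ham     # 0.06

# Normalize so the two posteriors sum to 1
p_spam_given_word = spam_score / (spam_score + ham_score)
print(round(p_spam_given_word, 3))  # 0.842
```

With multiple words, the independence assumption lets Naive Bayes simply multiply the per-word likelihoods together.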

Advantages
  1. Naive Bayes performs better than many other models when the assumption of independent features holds true.
  2. Naive Bayes requires less data for training, and therefore needs less training time.
  3. Simple, fast and easy to implement.
  4. It can be used for both binary and multi-class predictions.
Disadvantages
  1. In practice it is hard to find data sets where all features are truly independent of one another.
Use Cases
  1. Text Classification
  2. Spam Filtering
  3. Sentiment Analysis
  4. A Naive Bayes classifier combined with collaborative filtering can be used to build a recommendation engine.

Python Code

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=32)
model_nb = GaussianNB()
y_pred = model_nb.fit(X_train, y_train).predict(X_test)




Friday, 3 July 2020

How Random Forest Works?


Random forest is a supervised learning algorithm and works for both classification and regression problems.

Random forest is an ensemble classifier built from multiple decision trees. Ensemble models combine the results of several individual models.

Application
  • Credit card fraud detection
  • Consumer finance surveys
  • Identification of disease in patients using classification
  • Identifying customer churn
How Random Forest works?
  1. Randomly select n features out of the N total features, where n << N
  2. For node d, calculate the best split point among the n features
  3. Split the node into two daughter nodes using the best split
  4. Repeat the first 3 steps until the desired number of nodes has been reached
  5. Build the forest by repeating steps 1 to 4 D times, where D is the number of trees to be constructed
Advantages
  • Reduces overfitting compared to a single decision tree, which helps to improve accuracy
  • Works on both classification and regression problems
  • Works for both continuous and categorical data
  • Some implementations can handle missing values in the data automatically
  • No need to normalize the data
Disadvantages
  • Requires high computation power, as multiple trees are built during the process
  • Training time is high compared to a single decision tree
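The steps above are handled internally by library implementations; a minimal sklearn sketch, with illustrative parameter choices:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is D (the number of trees); max_features controls n,
# the random feature subset considered at each split
model = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                               random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # test-set accuracy
```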

Thursday, 2 July 2020

Types of Charts in Data Science

There are multiple charts available in data science; following is a list of the most frequently used ones:

1) Line Chart

2) Bar Chart

3) Histogram

4) Pie Chart

5) Doughnut Chart

6) Heat Map

7) Area Chart

8) Bubble Chart

9) Waterfall Chart
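Several of these can be produced with matplotlib; a minimal sketch of the first three, using made-up data purely for illustration:

```python
import matplotlib
matplotlib.use('Agg')  # render to file without a display
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(9, 3))

axes[0].plot([1, 2, 3, 4], [10, 20, 15, 25])       # line chart
axes[0].set_title('Line')

axes[1].bar(['A', 'B', 'C'], [5, 7, 3])            # bar chart
axes[1].set_title('Bar')

axes[2].hist([1, 1, 2, 2, 2, 3, 3, 4, 5], bins=5)  # histogram
axes[2].set_title('Histogram')

fig.savefig('charts.png')
```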

Wednesday, 1 July 2020

Support Vector Machine

Support Vector Machine (SVM) is a supervised learning algorithm which is used for both classification and regression problems, and it separates the data using a hyperplane.


The objective is to maximize the margin between the hyperplane and the nearest data points of each class (the support vectors); the larger the margin, the better the separation between the classes.
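A minimal sklearn sketch of a linear SVM on two made-up, well-separated groups:

```python
from sklearn import svm
import numpy as np

# Two linearly separable groups of points
X = np.array([[1, 1], [1, 2], [2, 1],
              [6, 6], [6, 7], [7, 6]])
y = [0, 0, 0, 1, 1, 1]

# A linear kernel finds the maximum-margin separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)            # the points that define the margin
print(clf.predict([[2, 2], [6, 5]]))   # assign new points to a class
```

Only the support vectors matter for the final decision boundary; moving any other training point (without crossing the margin) leaves the hyperplane unchanged.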