Sunday, 12 July 2020

Hyperparameter Tuning


In machine learning, while training an algorithm, different parameters need to be passed to get the best fit and accuracy from the model. Selecting the values of the parameters involved in training the model, such as n_estimators and max_depth, is called hyperparameter tuning.

For example, tuning a decision tree involves multiple parameters, as shown below:

tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None,
    min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
    max_features=None, random_state=None, max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None,
    presort=False)

We need to choose the combination of these parameters that gives the best accuracy for the model. There are two common ways of hyperparameter tuning:

1) Grid Search

This is one of the basic methods of hyperparameter tuning. In this method, all possible combinations of the given parameter values are evaluated, and the best combination is chosen for model building.
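A minimal grid search sketch using sklearn's GridSearchCV; the parameter values in the grid below are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Every combination of these values is tried with 5-fold cross-validation
param_grid = {'max_depth': [2, 3, 4], 'min_samples_split': [2, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)  # the best combination found
print(grid.best_score_)   # its mean cross-validated accuracy
```

Note that the number of fits grows multiplicatively with the size of the grid, which is why grid search gets expensive quickly.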

2) Random Search

In this method, randomly sampled values are passed for the different parameters, and the best combination found is chosen.
This method consumes less time compared to grid search because, instead of trying all possible combinations, only a fixed number of randomly sampled combinations are evaluated.
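A minimal random search sketch using sklearn's RandomizedSearchCV; the distributions and n_iter value below are illustrative assumptions:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Only n_iter random combinations are sampled, not the full grid
param_dist = {'n_estimators': randint(10, 100), 'max_depth': randint(2, 8)}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=10, cv=5, random_state=0)
search.fit(X, y)

print(search.best_params_)  # best of the 10 sampled combinations
```

Because the budget is fixed at n_iter fits, random search stays cheap even when the parameter space is large.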

Saturday, 11 July 2020

Data Science Use Cases in Retail Industry

There are multiple ways in which data science can be applied to the retail industry. Following are the most prominent use cases:

  • Recommendation Engine
A recommendation engine helps to narrow down the options for a customer and showcase the products relevant to that customer, which ultimately leads to an increase in sales.

  • Inventory Management
Based on the sales trend for different products, data science can be used for inventory management by predicting future sales for a given product and stocking inventory accordingly. This approach helps retailers provide products to customers at the right time while keeping inventory under control.

  • Sentiment Analysis
By reviewing data from social media and other online channels, analysis can be done using Natural Language Processing, and customer sentiment can be classified as neutral, positive or negative. This helps retailers understand where they are failing to meet customer demand and where improvement is needed, and also where customers are happy with their service.

Monday, 6 July 2020

K-Means Clustering


K-Means clustering is an unsupervised learning technique. It is used to group data points that show similar characteristics to each other and dissimilar characteristics from points in other groups.

In K-Means, K represents the number of clusters.

Advantages
  • Scales well
  • Efficient
Disadvantages
  • Choosing K
When to Use? 
  • Normally distributed data
  • Large number of samples
  • Limited number of clusters

Use Cases 
  • Document classification
  • Customer segmentation

Python Code

from sklearn.cluster import KMeans
import numpy as np

# Two obvious groups: points near x=1 and points near x=10
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

print(kmeans.labels_)                     # e.g. [1 1 1 0 0 0]
print(kmeans.predict([[0, 0], [12, 3]]))  # e.g. [1 0]
print(kmeans.cluster_centers_)            # e.g. [[10. 2.] [1. 2.]]
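The "Choosing K" disadvantage above is often handled with the elbow method: run K-Means for several values of K, record the inertia (within-cluster sum of squares), and pick the K where the curve bends. A minimal sketch on the same toy data:

```python
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Inertia drops sharply up to the "right" K, then flattens
inertias = []
for k in range(1, 5):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias.append(km.inertia_)
    print(k, km.inertia_)
```

For this data the drop from K=1 to K=2 is large and further increases give only small improvements, suggesting K=2.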


Saturday, 4 July 2020

How Naive Bayes Works?

Naive Bayes is a supervised machine learning algorithm for classification problems. It uses Bayes' theorem to calculate the probability of each class for a given data point.

The "naive" part is the assumption that all features are independent of one another given the class. The algorithm works well even on large data sets.
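The underlying rule is Bayes' theorem: P(class | feature) is proportional to P(feature | class) times P(class). A tiny hand-computed spam example; all the probabilities below are made up purely for illustration:

```python
# Invented probabilities for a toy spam filter
p_spam = 0.4             # prior P(spam)
p_ham = 0.6              # prior P(ham)
p_word_given_spam = 0.8  # P(word "offer" appears | spam)
p_word_given_ham = 0.1   # P(word "offer" appears | ham)

# Unnormalized posteriors for a message containing "offer"
spam_score = p_word_given_spam * p_spam  # 0.32
ham_score = p_word_given_ham * p_ham     # 0.06

# Normalize so the two posteriors sum to 1
p_spam_given_word = spam_score / (spam_score + ham_score)
print(round(p_spam_given_word, 3))  # 0.842
```

With multiple words, the independence assumption lets Naive Bayes simply multiply the per-word likelihoods together.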

Advantages
  1. Naive Bayes performs better than many other models when the assumption of independent features holds true.
  2. Naive Bayes requires less data for training, and therefore needs less training time.
  3. Simple, fast and easy to implement.
  4. It can be used for both binary and multi-class predictions.
Disadvantages
  1. In practice it is hard to find data sets where all features are truly independent of one another.
Use Cases
  1. Text Classification
  2. Spam Filtering
  3. Sentiment Analysis
  4. A Naive Bayes classifier combined with collaborative filtering can be used to build a recommendation engine.

Python Code

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=32)
model_nb = GaussianNB()
y_pred = model_nb.fit(X_train, y_train).predict(X_test)




Friday, 3 July 2020

How Random Forest Works?


Random forest is a supervised learning algorithm and works for both classification and regression problems.

Random forest is an ensemble classifier built from multiple decision trees. Ensemble models combine the results of several individual models.

Application
  • Credit card fraud detection
  • Consumer finance surveys
  • Identification of disease in patients using classification
  • Identifying customer churn
How Random Forest works?
  1. Randomly select n features out of the N total features, where n << N
  2. For node d, calculate the best split point among the n features
  3. Split the node into two daughter nodes using the best split
  4. Repeat the first 3 steps until the desired number of nodes has been reached
  5. Build the forest by repeating steps 1 to 4 D times, where D is the number of trees to be constructed
Advantages
  • Reduces overfitting compared to a single decision tree, which helps to improve accuracy
  • Works on both classification and regression problems
  • Works for both continuous and categorical data
  • Some implementations can handle missing values in the data automatically
  • No need to normalize the data
Disadvantages
  • Requires high computation power, as multiple trees are built during the process
  • Training time is high compared to a single decision tree
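The steps above are handled internally by library implementations; a minimal sklearn sketch, with illustrative parameter choices:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is D (the number of trees); max_features controls n,
# the random feature subset considered at each split
model = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                               random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # test-set accuracy
```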

Thursday, 2 July 2020

Types of Charts in Data Science

There are multiple charts available in data science; following is a list of the most frequently used ones:

1) Line Chart

2) Bar Chart

3) Histogram

4) Pie Chart

5) Doughnut Chart

6) Heat Map

7) Area Chart

8) Bubble Chart

9) Waterfall Chart
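Several of these can be produced with matplotlib; a minimal sketch of the first three, using made-up data purely for illustration:

```python
import matplotlib
matplotlib.use('Agg')  # render to file without a display
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(9, 3))

axes[0].plot([1, 2, 3, 4], [10, 20, 15, 25])       # line chart
axes[0].set_title('Line')

axes[1].bar(['A', 'B', 'C'], [5, 7, 3])            # bar chart
axes[1].set_title('Bar')

axes[2].hist([1, 1, 2, 2, 2, 3, 3, 4, 5], bins=5)  # histogram
axes[2].set_title('Histogram')

fig.savefig('charts.png')
```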

Wednesday, 1 July 2020

Support Vector Machine

Support Vector Machine (SVM) is a supervised learning algorithm which is used for both classification and regression problems, and it separates the data using a hyperplane.


The objective is to maximize the margin between the hyperplane and the nearest data points of each class (the support vectors); the larger the margin, the better the separation between the classes.
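A minimal sklearn sketch of a linear SVM on two made-up, well-separated groups:

```python
from sklearn import svm
import numpy as np

# Two linearly separable groups of points
X = np.array([[1, 1], [1, 2], [2, 1],
              [6, 6], [6, 7], [7, 6]])
y = [0, 0, 0, 1, 1, 1]

# A linear kernel finds the maximum-margin separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)            # the points that define the margin
print(clf.predict([[2, 2], [6, 5]]))   # assign new points to a class
```

Only the support vectors matter for the final decision boundary; moving any other training point (without crossing the margin) leaves the hyperplane unchanged.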