Tuesday, 30 June 2020

Classification trees


Classification trees are a supervised machine learning algorithm; decision trees in general can be used for both classification and regression problems

Below are the key terms associated with classification trees

Root Node - It represents the entire population or sample, which further gets divided into two or more homogeneous sets

Splitting - The process of dividing a node into two or more sub-nodes

Decision Node - When a sub-node splits into further sub-nodes, it is called a decision node

Leaf/Terminal Node - Nodes that cannot be split further are called leaf or terminal nodes

Pruning - The process of removing sub-nodes of a decision node; it is the opposite of splitting

Branch/Sub-Tree - A subsection of the entire tree is called a branch or sub-tree

Parent and Child Node - A node that is divided into sub-nodes is called the parent node, and the sub-nodes are its children





Advantages
  • Easy to understand
  • Useful in data exploration
  • Less data cleaning required
  • Data type is not a constraint
Challenges
  • Over-fitting
  • Not well suited for continuous variables
Techniques for Division/Splitting
  • Gini Index
  • Chi-square
  • Information Gain
  • Reduction of variance
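As a sketch of how one of these splitting criteria works, the Gini index of a node can be computed from its class proportions. The function below is illustrative, not from any library:

```python
# Gini index: 1 - sum(p_i^2) over the class proportions p_i.
# A pure node (one class only) has Gini 0; a 50/50 binary node has Gini 0.5.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure = gini([1, 1, 1, 1])    # 0.0 - splitting cannot improve this node
mixed = gini([1, 1, 0, 0])   # 0.5 - the most impure a binary node can be
```

A split is chosen to minimize the weighted Gini of the resulting sub-nodes.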

Confusion Matrix


A confusion matrix is a table used to evaluate the accuracy of a classification model. It can also be described as a table in which the actual and predicted results are shown side by side.

Below is the table

            Predicted: 1   Predicted: 0
Actual: 1   TP             FN
Actual: 0   FP             TN

1 - Positive
0 - Negative

TP - True Positive
TN - True Negative
FN - False Negative (Type 2 Error)
FP - False Positive (Type 1 Error)

Accuracy = (TP + TN)/(TP + TN + FP + FN)

Recall = TP/(TP + FN) - Out of all actual positive classes, how many we predicted as positive

Precision = TP/(TP + FP) - Out of all predicted positive classes, how many are actually positive

F-Measure = (2 * Recall * Precision)/(Recall + Precision)

F-Measure is the harmonic mean of recall and precision. It measures both at the same time: when one model has high recall and low precision and another the reverse, it is hard to say which is better, so the F-Measure is used to compare them with a single number.
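The formulas above can be computed directly from the four confusion-matrix counts; the counts below are made up for illustration:

```python
# Metrics from confusion-matrix counts (the values are made up).
TP, TN, FP, FN = 40, 45, 5, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)      # 0.85
recall    = TP / (TP + FN)                       # 0.80
precision = TP / (TP + FP)                       # ~0.89
f_measure = 2 * recall * precision / (recall + precision)
```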

Sensitivity

Specificity

ROC Curve

AUC

How Linear Regression works ?

Linear regression aims to model the relationship between two variables by fitting a straight line.

It is assumed that the dependent variable can be predicted from the independent variable(s) by fitting a best-fit line. When one independent variable is involved, this is known as simple linear regression; when multiple independent variables are involved, it is known as multiple linear regression.


Linear regression is a supervised learning algorithm

How linear regression works?
The regression line (shown in yellow in the original figure) is the best-fit line for the model, and the red points are the data coordinates

Equation of linear regression line

y = mx + c

y = Dependent variable (labels to data)
x = Independent variable (input data)
m = Slope of line (coefficient of x)

c = Intercept
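The slope m and intercept c of the best-fit line are found by minimizing the squared error. A minimal sketch of the ordinary least-squares formulas, on made-up data:

```python
# Fitting y = m*x + c by ordinary least squares, without any library.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]   # made-up data, exactly y = 2x + 1

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# m = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
m = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
    sum((x - x_mean) ** 2 for x in xs)
c = y_mean - m * x_mean   # the line always passes through the means
```

Here m comes out as 2 and c as 1, recovering the line the data was built from.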

Life Cycle of Data Science Project


To successfully execute a data science project, one needs to follow a concrete plan to achieve the goal

1) Business understanding
Ask relevant questions and define objectives for the problem that needs to be solved

2) Data Mining
Gather and scrape the data necessary for the project

3) Data Cleaning
Fix the inconsistencies in the data and handle the missing values

4) Data Exploration
Form hypotheses about the defined problem by analyzing the data

5) Feature Engineering
Select the features that are important and remove unwanted features 

6) Predictive Modelling 
Train machine learning models on the data, evaluate their performance, and use them to make predictions

7) Data Visualization
Communicate the findings to key stakeholders using plots and interactive visualizations

Imputation of missing values


Why we need imputation?

Whenever we start working on a dataset, some cleaning activities are always required, such as treating missing values.

Methods to treat missing values
1) Replacing the missing value by mean, mode, median depending on the data
2) Imputing the missing value by some constant that aligns with data or business sense
3) Use predictive techniques such as KNN (K-Nearest Neighbors) imputation
4) Remove the specific rows and columns

5) Do nothing
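Methods 1 and 2 can be sketched with pandas; the column names and values below are made up:

```python
import pandas as pd

# Mean imputation for a numeric column and mode imputation for a
# categorical column (data is made up for illustration).
df = pd.DataFrame({
    "age": [25, None, 30, 35],
    "city": ["Pune", "Delhi", None, "Pune"],
})

df["age"] = df["age"].fillna(df["age"].mean())        # mean of 25, 30, 35 is 30
df["city"] = df["city"].fillna(df["city"].mode()[0])  # most frequent value: "Pune"
```

Dropping rows instead (method 4) would be `df.dropna()`.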

Encoding techniques


Why we need encoding ?

Machine learning and deep learning models require input and output data in numeric format. The objective of encoding is to transform data into numeric form so that the machine can read it and make predictions

Types of Encoding techniques

1) One hot encoding

In this technique each category of a categorical variable becomes its own binary (0/1) column, which converts it to numeric values.
For e.g. Gender: Male and Female - we create one column for male and one for female, and mark 1 in the column that matches the observation

2)  Label encoding
3) Ordinal encoding
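One-hot encoding can be sketched with pandas' `get_dummies`; the column name and values are made up:

```python
import pandas as pd

# One-hot encoding: each category of "gender" becomes its own 0/1 column.
df = pd.DataFrame({"gender": ["male", "female", "female"]})
encoded = pd.get_dummies(df, columns=["gender"], dtype=int)
# encoded now has columns gender_female and gender_male
```

Label encoding, by contrast, would map each category to a single integer in one column.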

Monday, 29 June 2020

Correlation


Correlation measures the strength and direction of the relationship between two quantitative variables.

It is assumed that correlation follows a linear pattern. It ranges from -1 to 1, where -1 means a completely opposite relationship between the variables (if one variable goes up, the other goes down), while a correlation of 1 means both variables move together in the same direction.
Zero correlation means there is no linear relationship between the two variables, and their movements are independent of one another.

Generally, a correlation (positive or negative) of 0.7 and above is considered a strong relationship, 0.5 to 0.7 moderate, and below 0.5 weak.
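The Pearson correlation can be computed from its definition (covariance divided by the product of the standard deviations); the data below is made up:

```python
# Pearson correlation from its definition, on made-up data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]        # moves perfectly with xs

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n
x_std = (sum((x - x_mean) ** 2 for x in xs) / n) ** 0.5
y_std = (sum((y - y_mean) ** 2 for y in ys) / n) ** 0.5
r = cov / (x_std * y_std)    # 1.0 here, since ys is an exact multiple of xs
```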

Types of data in Statistics


In statistics, there are two types of variables

1) Categorical variable

These variables can take only a few specific values and are limited in number.
For e.g.- Gender

They are further divided into two sub-categories

a) Nominal
Nominal variables do not follow any order and can take only a few values. For e.g.-
Marital Status - Yes or No

b) Ordinal
Ordinal variables have a specific order and can also take only a few values. For e.g.-
Risk level - Low, Medium, High

2) Continuous variable

These are variables whose values cannot be counted but can be measured. For e.g.- the height of a person is a continuous variable, as it can be plotted on a number line.

Covariance

Covariance is defined as the relationship between two random variables, or how one variable behaves with the movement of another. Covariance can be positive or negative.

If one variable moves in the same direction as the other, the variables have positive covariance, while if one moves in one direction and the other moves in the opposite direction, they have negative covariance.

Example: Size and price of a house. In general it is observed that as the area of a house increases, the price also increases; therefore area and price have positive covariance. Another example: as temperature increases, heater usage decreases, which shows negative covariance.

Equation of Covariance


Cov(X,Y) = (∑(Xi - Xmean)(Yi - Ymean)) / N
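The equation above can be translated directly into code; the datasets below are made up to mirror the house-size/price and temperature/heater examples:

```python
# Covariance from the equation Cov(X,Y) = sum((Xi - Xmean)(Yi - Ymean)) / N.

def covariance(xs, ys):
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n

area  = [500, 750, 1000, 1250]
price = [50, 70, 95, 120]      # rises with area -> positive covariance
temp   = [5, 15, 25, 35]
heater = [40, 25, 10, 2]       # falls as temp rises -> negative covariance

pos = covariance(area, price)
neg = covariance(temp, heater)
```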

  

Log Normal Distribution


A log transformation is used when a variable is not normally distributed. Suppose we have a positive-valued variable whose distribution is right skewed; taking the log of that variable pulls in the long tail so that it becomes closer to normally distributed.

The transformation is useful because left- or right-skewed data can hurt model accuracy, and accuracy often improves once the transformation is done.
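A quick sketch of how the log transformation compresses skewed, positive-valued data (the values are made up):

```python
import math

# One extreme value skews this made-up data; the log pulls large values
# in much more than small ones, reducing the skew.
incomes = [20, 25, 30, 40, 1000]
logged = [math.log(v) for v in incomes]

spread_before = max(incomes) - min(incomes)   # 980 on the original scale
spread_after = max(logged) - min(logged)      # about 3.9 on the log scale
```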

Variable transformation


Gaussian distribution


Gaussian distribution, also known as normal distribution, is a distribution in which the observed values are symmetrically distributed around the mean. The normal distribution is also known as the bell curve.




·       The center of the bell curve is the mean
·       68.2% of the total data points lie in the range (Mean +/- Standard deviation)
·       95.5% of the total data points lie in the range (Mean +/- 2 * Standard deviation)
·       99.7% of the total data points lie in the range (Mean +/- 3 * Standard deviation)
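These percentages can be checked empirically by drawing a large sample from a standard normal distribution:

```python
import random

# Empirical check of the 68 / 95 / 99.7 rule on a normal distribution
# with mean 0 and standard deviation 1.
random.seed(42)
sample = [random.gauss(0, 1) for _ in range(100_000)]

within_1sd = sum(1 for v in sample if abs(v) <= 1) / len(sample)
within_2sd = sum(1 for v in sample if abs(v) <= 2) / len(sample)
within_3sd = sum(1 for v in sample if abs(v) <= 3) / len(sample)
# within_1sd ~ 0.68, within_2sd ~ 0.95, within_3sd ~ 0.997
```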






Population vs Sample


Population vs Sample

A population is defined as an entire group of items or people, while a sample is a smaller set of items or people collected from that larger group; the sample is derived from the population.
The need for sampling arises because it is impossible to measure every feature of the population every time; therefore the study is done on a sample and inferences about the population are drawn.
For e.g.- Exit polls: a survey is done on a limited number of people as soon as they cast their vote, and predictions about the results are made on that basis. Some polls are also done before the voters cast their vote.

Sampling techniques

1) Simple random sampling
2) Systematic sampling
3) Stratified sampling
4) Clustered sampling

Non-probability sampling methods

1) Convenience sampling
2) Quota sampling
3) Judgement or purposive sampling
4) Snowball sampling
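The first two probability sampling techniques can be sketched with Python's `random` module on a made-up population of IDs:

```python
import random

random.seed(0)
population = list(range(1, 101))   # made-up population: IDs 1..100

# 1) Simple random sampling: every item has an equal chance of selection.
simple = random.sample(population, 10)

# 2) Systematic sampling: pick every k-th item after a random start.
k = 10
start = random.randrange(k)
systematic = population[start::k]  # 10 items, evenly spaced through the list
```

Stratified sampling would first split the population into groups (strata) and sample within each; clustered sampling would randomly pick whole groups.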

Saturday, 27 June 2020

Slicing and dicing in Python


Slicing and Dicing in Python

Slicing and dicing is used for data exploration. In this activity data is divided into small parts and each part is explored.

Deleting columns - This removes unwanted columns from the data. It is one of the steps required to drop columns that are not needed for modelling

del data['column']

Selecting a specific column
To select the individual column from the dataset below syntax is used

data['column']
data.column

Afterwards one can do further exploration by using
data['column'].value_counts()

Selecting multiple columns
If one wants to explore data in two columns below syntax can be used and exploration can be done
data[['column1','column2']] # Multiple columns are passed in a list

Subsetting rows in Python
Python works on 0 based indexing

data[0:3]              # Selecting the first three rows (index 0, 1, 2)
data[-1:]              # Selecting the last row

data_copy = data.copy()    # Copy data frame

Tuesday, 23 June 2020

Pandas for data exploration


Using Pandas for data exploration

Pandas is a Python library that is used for data manipulation and analysis. In data science the key applications of pandas are


  • Reading and writing data 
  • Handling of missing data
  • Reshaping and pivoting of data
  • Label based slicing, indexing and sub-setting of large data sets
  • Column insertion, deletion and filtering
  • Merging and joining
Below are some important commands that are used in analysis

#Syntax for importing pandas
import pandas as pd

#Reading a csv file, after writing pd.read_csv, one can use Shift + Tab in jupyter notebook to explore the individual component in  the function
data = pd.read_csv("data.csv")

type(data) # To check the data type 

len(data) #To check the length of data 
data.shape # To check the dimension (rows, column)
data.head() # To check the top 5 rows
pd.set_option("display.max.columns", None) # To see all columns
data.tail() # To check the bottom 5 rows
data.info() # To provide information on non null count and data type for a column
data.describe() # To describe the data in terms of count, mean , std,min, 25%,50%,75%, max
data.describe(include="object")   # To explore object (categorical) variables
data["column"].value_counts() # To explore categorical variables
data.loc[data["column1"] == "ABC", "column2"].value_counts() # To explore a column with some condition
data.loc[data["column"] == "A", "date_column"].min()
data.loc[data["column"] == "A", "date_column"].max()
data.loc[data["column"] == "A", "date_column"].agg(("min", "max"))

Tuesday, 16 June 2020

Python for Data Science


Python for Data Science

Python data types

·       Float- real numbers
·       Integer - integers
·       String- string, text
·       Boolean – True, False

Lists

Collection of values which are ordered and changeable and can be of any type
List = [1, "Sam", 2, True]

Indexing in Python

Python works on zero-based indexing, where the 1st element of a sequence has index zero
States = ['Maharashtra', 'Rajasthan', 'MP', 'Gujarat', 'UP']
Index          0              1          2         3       4
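Zero-based indexing can be checked directly (quotes made straight so the snippet runs):

```python
# Zero-based indexing: the first of five elements has index 0,
# so the last one has index 4 (also reachable as index -1).
States = ['Maharashtra', 'Rajasthan', 'MP', 'Gujarat', 'UP']

first = States[0]    # 'Maharashtra'
last = States[4]     # 'UP'
```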



Dictionary

Tuples

Sets

Tuesday, 9 June 2020

Important R packages - Part-1: Dplyr


Important R packages - Part-1

Data Manipulation

dplyr: dplyr is a data manipulation package that helps solve the most common data manipulation challenges

a) Mutate - It is used to create a new variable from the data

mutate(data,new_variable)
mutate(mtcars,new_var=mpg/cyl)

b) Filter- It is used to select rows based on filter applied

filter(data,variable)
filter(mtcars,gear == 4)

c) Select - It is used to pick columns as per requirement

select(data,variables)
select(mtcars,mpg,cyl,disp)

d) Summarise() - It is used to reduce multiple values to a single summary

summarise(data,Name-value pairs)
summarise(mtcars, cyl_mean=mean(cyl),cyl_median=median(cyl))


e) Arrange() - It is used to sort the data

arrange(data,variable)
arrange(mtcars,mpg)