Missing Values: Types and Treatment in Data Science

Date: 27-04-2023

Missing values are a common problem in data science. They occur when no data is available for a particular variable in a data set. There can be various reasons behind missing values, such as data entry errors, equipment malfunction, or a subject’s refusal to provide the information. In this blog, we will discuss the types of missing values and how to handle them using Python.

Types of Missing Values

There are three types of missing values.

1. Missing Completely at Random (MCAR)

This type of missing value occurs when the missingness is independent of both observed and unobserved data. In other words, the probability that a value is missing is the same for all observations, regardless of any other variables. This type of missingness is the least problematic: it does not bias the analysis, although it still reduces the amount of usable data.

2. Missing at Random (MAR)

This type of missing value occurs when the probability of a value being missing depends on other observed variables in the dataset, but not on the missing value itself. In this case, the missingness is systematic, but it can still be handled through statistical methods such as imputation.

3. Missing Not at Random (MNAR)

This type of missing value occurs when the probability of a value being missing is dependent on unobserved data or factors that are not captured in the dataset. This type of missingness is the most problematic, as it can lead to biased analysis.
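To make the three mechanisms concrete, here is a minimal simulation sketch. The variables (age, income), the sample size, and the missingness probabilities are all hypothetical and chosen only for illustration. It hides values of income under each mechanism and compares the mean of the values that remain with the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data: income rises with age.
age = rng.uniform(20, 70, n)
income = 1_000 * age + rng.normal(0, 5_000, n)

def mask_income(prob_missing):
    """Return income with entries set to NaN using a per-row probability."""
    missing = rng.random(n) < prob_missing
    return np.where(missing, np.nan, income)

# MCAR: every row has the same 30% chance of being missing.
mcar = mask_income(np.full(n, 0.3))
# MAR: missingness depends only on the observed variable age.
mar = mask_income(np.where(age > 45, 0.5, 0.1))
# MNAR: missingness depends on the income value itself.
mnar = mask_income(np.where(income > np.median(income), 0.5, 0.1))

summary = pd.Series({
    "true mean": income.mean(),
    "MCAR observed mean": np.nanmean(mcar),
    "MAR observed mean": np.nanmean(mar),
    "MNAR observed mean": np.nanmean(mnar),
})
print(summary.round(0))
```

In this setup the MCAR mean should land close to the true mean, while the MAR and MNAR means come out too low. The difference is that the MAR bias can be corrected by conditioning on age, which is observed, whereas the MNAR bias cannot be corrected from the observed data alone.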

Handling Missing Values Using Python

Let’s take an example of a data set containing missing values to understand how to handle them using Python. We will use the Titanic dataset, which contains information about the passengers who were aboard the Titanic when it sank.

First, we will import the necessary libraries and load the data set.

import pandas as pd
import numpy as np
titanic = pd.read_csv('titanic.csv')

Next, we will check for missing values in the data set.

print(titanic.isnull().sum())

Output:

PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
dtype: int64

We can see that there are missing values in the Survived, Age, Fare, Cabin, and Embarked columns. We will handle these missing values one by one.
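Raw counts are easier to judge as a share of rows. The sketch below uses a small hypothetical frame (not the real Titanic file) so that it runs on its own; the same two lines can be applied to the titanic DataFrame.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the Titanic data.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_share = df.isnull().mean().sort_values(ascending=False)
print((missing_share * 100).round(1))
```

Columns with a very high share of missing values are often better dropped or turned into a "was recorded" indicator than imputed.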

1. Missing Completely at Random (MCAR)

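When the missingness is completely random, dropping the rows that contain missing values does not bias the analysis, although it does shrink the data set. We will use the dropna() method for this. Note that dropna() with no arguments removes every row that has a missing value in any column.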

titanic_mcar = titanic.dropna()
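Because a plain dropna() looks at every column, one sparsely filled column can remove most of the data. A minimal sketch on a hypothetical four-row frame shows the difference between dropping on all columns and dropping only on the columns the analysis needs, using the subset parameter of dropna().

```python
import numpy as np
import pandas as pd

# Hypothetical frame: Cabin is mostly missing, Age has one gap.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Listwise deletion over all columns: any NaN removes the row.
dropped_all = df.dropna()
# Restrict the check to the columns the analysis actually uses.
dropped_subset = df.dropna(subset=["Age", "Fare"])

print(len(df), len(dropped_all), len(dropped_subset))  # prints: 4 0 3
```

Here every row has a gap somewhere, so the unrestricted call leaves nothing, while the restricted call keeps three of the four rows.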

2. Missing at Random (MAR)

As the missingness is related to another variable in the data set, we can use imputation techniques to fill in the missing values. Here we use the simplest option and fill in the missing Age values with the mean Age.

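titanic_mar = titanic.copy()
# Assign the result back instead of calling fillna(inplace=True) on the column,
# which newer pandas versions warn about and may not apply to the DataFrame.
titanic_mar['Age'] = titanic_mar['Age'].fillna(titanic_mar['Age'].mean())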
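A single overall mean ignores the observed variable that the missingness depends on. One common refinement is to impute within groups of that variable. The sketch below assumes, purely for illustration, that Age is related to Pclass, and uses a hypothetical six-row frame; it also records which rows were imputed.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one missing Age in each passenger class.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, 24.0, np.nan],
})

# Record which rows were imputed before filling them.
df["Age_was_missing"] = df["Age"].isnull()

# Fill each missing Age with the median Age of the passenger's class.
class_median = df.groupby("Pclass")["Age"].transform("median")
df["Age_filled"] = df["Age"].fillna(class_median)
print(df)
```

The first-class gap is filled with 45.0 and the third-class gap with 22.0, rather than both receiving the same overall value.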

3. Missing Not at Random (MNAR)

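When the missingness is related to the missing value itself, simple fixes such as dropping rows or filling in the mean will bias the results, so we need more sophisticated imputation techniques. We will use the MICE (Multiple Imputation by Chained Equations) method, through the IterativeImputer class exposed by the fancyimpute library. Keep in mind that MICE formally assumes the data are MAR, so for truly MNAR data it is an approximation, and its results should be checked against what is known about why the values are missing.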

Multiple Imputation by Chained Equations (MICE) is a method for imputing missing values that considers the correlations between variables in a data set. MICE is an iterative method that involves creating multiple imputations for each missing value based on the values of the other variables in the data set.

MICE works by filling in each missing value with a value predicted from a regression model based on the observed values of the other variables. A separate regression model is fitted for each variable that has missing values, and the cycle over the variables is repeated several times so that later models can use the values imputed by earlier ones.
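The chained-equations loop described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the library's implementation: it uses ordinary least squares for every column and produces a single deterministic imputation, whereas full MICE draws imputed values from a predictive distribution and repeats the process to obtain several completed data sets. The function name chained_imputation and the synthetic data are hypothetical.

```python
import numpy as np

def chained_imputation(X, n_iter=10):
    """Single-imputation sketch of chained equations with linear regression."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    # Step 1: start from the column means of the observed values.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            # Step 2: regress column j on all other columns (plus an
            # intercept), fitting only on rows where column j was observed.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            coef, *_ = np.linalg.lstsq(design[~rows], X[~rows, j], rcond=None)
            # Step 3: overwrite the missing entries with the predictions.
            filled[rows, j] = design[rows] @ coef
    return filled

# Synthetic data: the second column is roughly twice the first.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])
truth = data.copy()
data[:40, 1] = np.nan  # hide 40 values of the second column

imputed = chained_imputation(data)
print(np.abs(imputed[:40, 1] - truth[:40, 1]).mean())
```

Because the second column is strongly related to the first, the imputed values should sit close to the hidden ones, which a fill with the column mean could not achieve.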

Let’s take an example of the Titanic dataset to understand how to use MICE to impute missing values using Python.

We will first install the fancyimpute library, which provides the MICE algorithm.

!pip install fancyimpute

Next, we will import the necessary libraries and load the data set.

import pandas as pd
import numpy as np
from fancyimpute import IterativeImputer
titanic = pd.read_csv('titanic.csv')

We will create a copy of the data set and drop the columns that we do not need for imputation.

titanic_mnar = titanic[['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked']].copy()

Next, we will convert the categorical variables in the data set to dummy variables.

titanic_mnar = pd.get_dummies(titanic_mnar, columns=['Sex', 'Embarked'], drop_first=True)

We will then use MICE to impute the missing values in the Age column.

mice_imputer = IterativeImputer()
titanic_mnar_imputed = mice_imputer.fit_transform(titanic_mnar)

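IterativeImputer() creates an imputer object that follows the chained-equations approach, and the fit_transform() method fits it and returns a NumPy array with the missing values filled in. With its default settings it returns a single completed data set rather than several imputations.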
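To get several imputations, as the "multiple" in MICE implies, one option is to run the imputer repeatedly with sample_posterior=True and different seeds. The sketch below imports IterativeImputer directly from scikit-learn (where it is still marked experimental) and uses synthetic data, so the data, the number of imputations, and the pooled statistic are assumptions for illustration only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: the second column depends on the first, with 60 gaps.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
X = np.column_stack([x, 3 * x + rng.normal(scale=0.5, size=300)])
X[:60, 1] = np.nan

# Each seed yields one completed data set; sample_posterior=True draws
# imputations from the predictive distribution instead of using its mean.
completed = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]

# Pool an estimate (here the column mean) across the five data sets.
estimates = np.array([c[:, 1].mean() for c in completed])
print(estimates.mean(), estimates.std())
```

The spread of the estimates across the completed data sets gives a rough sense of how much uncertainty the imputation itself adds, which a single imputation hides.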

Finally, we will convert the imputed data set back to a pandas dataframe and check for any remaining missing values.

titanic_mnar_imputed = pd.DataFrame(titanic_mnar_imputed, columns=titanic_mnar.columns)
print(titanic_mnar_imputed.isnull().sum())

Output:

Survived      0
Pclass        0
Age           0
Fare          0
Sex_male      0
Embarked_Q    0
Embarked_S    0
dtype: int64

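We can see that there are no missing values in the imputed data set. The imputer fills every column that has gaps, so along with the missing Age values, the missing Fare and Survived values were imputed as well. Imputed Survived values are model predictions, not real outcomes, and should not be treated as labels.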

Conclusion

Missing data is a common issue in data analysis and can lead to biased or inaccurate results. There are different types of missing data, such as Missing at Random (MAR), Missing Completely at Random (MCAR), and Missing Not at Random (MNAR). Understanding the type of missing data is important to choose the appropriate imputation method.

There are various methods to deal with missing data, such as listwise deletion, pairwise deletion, mean imputation, regression imputation, and multiple imputation. However, each method has its own strengths and weaknesses, and it’s essential to choose the appropriate method based on the type of missing data and the characteristics of the data set.

Multiple Imputation by Chained Equations (MICE) is an imputation method that considers the correlations between variables in a data set. MICE involves creating multiple imputations for each missing value based on the values of the other variables in the data set. MICE is a powerful and flexible method that can handle various types of missing data.

In summary, dealing with missing data is a crucial aspect of data analysis, and choosing the appropriate imputation method can significantly impact the results. MICE is one such method: when the data are MAR it can give more accurate and less biased results than dropping rows or filling in the mean, while MNAR data still call for care and domain knowledge.
