
Data preprocessing:

Data preprocessing in Machine Learning is a data mining technique that transforms raw data into a clean, consistent format that a learning algorithm can use.

Steps in Data Preprocessing in Machine Learning:

1. Acquire the dataset

2. Import all the required libraries

3. Import the dataset

4. Identifying and handling the missing values

5. Encoding the categorical data

6. Splitting the dataset

7. Feature scaling

1. Acquire the dataset:

Several online sources offer datasets for download, such as https://www.kaggle.com/uciml/datasets and https://archive.ics.uci.edu/ml/index.php.

You can also build your own dataset by collecting data through APIs called from Python.

2. Import all the required libraries:

The three core Python libraries used for data preprocessing in Machine Learning are:

  • NumPy – NumPy is the fundamental package for scientific computing in Python. It provides multi-dimensional arrays and the mathematical operations that work on them.

  • Pandas – Pandas is an open-source Python library for data manipulation and analysis. It is extensively used for importing and managing datasets.

  • Matplotlib – Matplotlib is a Python 2D plotting library that is used to draw charts of many kinds.

3. Import the dataset:

You can import the dataset using the “read_csv()” function of the Pandas library.
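As a minimal sketch of this step in Python, the snippet below loads a small CSV with `read_csv()`. The inline text is a hypothetical stand-in for a file such as 'Data.csv', and its values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk; the values are invented.
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,,No
Spain,,61000,Yes
"""

# read_csv accepts a file path or any file-like object and returns a DataFrame.
dataset = pd.read_csv(io.StringIO(CSV_TEXT))

print(dataset.shape)         # number of rows and columns
print(dataset.isna().sum())  # count of missing values per column
```

With a real file, the call would simply take the path, for example `pd.read_csv('Data.csv')`.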

4. Identifying and handling the missing values:

There are two basic ways to handle missing data:

  • Deleting a row or column – In this method, you remove a row that has a null value for a feature, or you remove an entire column in which more than 75% of the values are missing.

  • Imputing a replacement value – This method is useful for features with numeric data such as age, salary, or year. You calculate the mean, median, or mode of the feature (column) that contains the missing value and substitute the result for the missing value.
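Both approaches can be sketched with Pandas as follows. The column names and values are invented for illustration, and the 75% threshold is the one quoted above.

```python
import numpy as np
import pandas as pd

# Invented example data: "Notes" is missing in 4 of 5 rows (80%).
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0, 35.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0, 58000.0],
    "Notes": [np.nan, np.nan, np.nan, np.nan, "ok"],
})

# Deletion, columns: keep only columns with at most 75% missing values.
missing_share = df.isna().mean()
kept = df.loc[:, missing_share <= 0.75]

# Deletion, rows: drop every remaining row that still has a null value.
rows_dropped = kept.dropna()

# Imputation: replace each missing value with the mean of its column.
imputed = kept.fillna(kept.mean())

print(rows_dropped)
print(imputed)
```

Swapping `kept.mean()` for `kept.median()` gives median imputation instead.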

5. Encoding the categorical data:

Categorical data is information that takes one of a limited set of labels rather than a numeric value, for example a country name or a yes/no answer.

Machine Learning models are primarily based on mathematical equations. Because those equations operate only on numbers, categorical values left as text cannot be used directly and must first be encoded as numbers.
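One possible way to do the encoding in Pandas is sketched below. It maps a two-valued label to 0 and 1 and one-hot encodes a multi-valued feature. One-hot encoding is a choice made for this sketch, not the only option, and the data is invented.

```python
import pandas as pd

# Invented example data with two categorical columns.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Purchased": ["No", "Yes", "No", "Yes"],
})

# Two-valued label: map each category to 0 or 1.
df["Purchased"] = df["Purchased"].map({"No": 0, "Yes": 1})

# Multi-valued feature: create one 0/1 indicator column per category,
# so the encoding does not imply an order among the countries.
encoded = pd.get_dummies(df, columns=["Country"], dtype=int)

print(encoded)
```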

6. Splitting the dataset:

Usually, the dataset is split in a 70:30 or 80:20 ratio. This means that you take 70% or 80% of the data as the training set for fitting the model and hold back the remaining 30% or 20% as the test set for evaluating it. The right split varies with the shape and size of the dataset in question.
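An 80:20 split can be sketched with NumPy and Pandas as shown here. This is a plain random split on invented data, with no attempt to keep class proportions equal between the two sets. The seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Invented example data: ten rows with one feature and one label.
df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

# A fixed seed makes the shuffle reproducible.
rng = np.random.default_rng(123)
shuffled = rng.permutation(len(df))

# The first 80% of the shuffled positions form the training set.
n_train = int(0.8 * len(df))
training_set = df.iloc[shuffled[:n_train]]
test_set = df.iloc[shuffled[n_train:]]

print(len(training_set), len(test_set))
```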

7. Feature scaling:

Feature scaling marks the end of the data preprocessing in Machine Learning. It is a method to standardize the independent variables of a dataset within a specific range. In other words, feature scaling limits the range of variables so that you can compare them on common grounds.
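Two common forms of scaling, standardization and min-max scaling, can be sketched in NumPy as below. The numbers are invented. The sketch assumes the scaling statistics are computed on the training data only and then reused for the test data.

```python
import numpy as np

# Invented example data: columns are age and salary.
train = np.array([[25.0, 40000.0], [35.0, 60000.0], [45.0, 80000.0]])
test = np.array([[30.0, 50000.0]])

# Standardization: subtract the column mean and divide by the column
# standard deviation (NumPy's default is the population form, ddof=0).
mean = train.mean(axis=0)
std = train.std(axis=0)
train_std = (train - mean) / std
test_std = (test - mean) / std  # reuse the training statistics

# Min-max scaling: map each training column onto the range [0, 1].
lo = train.min(axis=0)
hi = train.max(axis=0)
train_minmax = (train - lo) / (hi - lo)

print(train_std)
print(train_minmax)
```

After either transformation, age and salary sit on comparable scales even though their raw values differ by several orders of magnitude.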

Project | 01 Data Pre-processing Template

# Importing the dataset
dataset = read.csv('Data.csv')

# Taking care of missing data
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)

 

# Encoding categorical data
dataset$Country = factor(dataset$Country,
                         levels = c('France', 'Spain', 'Germany'),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
# Purchased is the dependent variable in this dataset
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

 

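# Feature Scaling
# Scale only the numeric columns; scale() fails on the factor columns.
# The test set reuses the training set's mean and standard deviation.
num_cols = c('Age', 'Salary')
scaled_train = scale(training_set[, num_cols])
training_set[, num_cols] = scaled_train
test_set[, num_cols] = scale(test_set[, num_cols],
                             center = attr(scaled_train, 'scaled:center'),
                             scale = attr(scaled_train, 'scaled:scale'))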
