Basic guide to Data Cleaning for training machine learning models

Created 2 years ago

@Sahil0eFnNU

Data cleaning is the process of removing or fixing unwanted or faulty data in the dataset. It is an essential step in building any machine learning model. Before the model can be trained, it is important to get the right dataset, as poor datasets can lead to unreliable machine learning models.

Let's understand data cleaning using the Titanic dataset.

There are several ways to clean a dataset, and you can apply some or all of them as needed. These include:

1. Handling Missing Values

At times, your dataset will contain columns with missing values. Most ML algorithms cannot train on data that contains missing values, which is why you must deal with them first, either by removing the affected rows completely or by replacing the missing values with suitable ones.

There are several ways to handle missing data, for example:

a) Drop all rows with NaN values in a column:
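A minimal sketch of this option with pandas. The small DataFrame below is a hypothetical stand-in for the Titanic data (the values are made up for illustration); only the column name Age is taken from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the Titanic data (illustrative values).
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Survived": [0, 1, 1, 0],
})

# Drop every row whose Age is NaN; other columns are not checked.
# dropna returns a new DataFrame and leaves df unchanged.
df_dropped = df.dropna(subset=["Age"])

print(df_dropped)
```

Calling dropna() without subset would instead drop any row that has a NaN in any column, which can discard far more data than intended.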

b) Replace NaN values with the mean age:
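One way this might look with pandas, again on a hypothetical sample (the values are illustrative, not taken from the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical sample; only the Age column matters here.
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 30.0]})

# Series.mean() skips NaN by default, so the mean is taken over 22, 26 and 30.
mean_age = df["Age"].mean()

# Fill the missing entries with that mean and assign the result back.
df["Age"] = df["Age"].fillna(mean_age)

print(df)
```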

c) Using SimpleImputer:
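A short sketch using scikit-learn's SimpleImputer. The median strategy and the Fare column are assumptions chosen for illustration; the text does not say which strategy or columns to use, and "mean", "most_frequent" and "constant" are also available.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sample with gaps in two numeric columns (illustrative values).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 30.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
})

# Replace each NaN with the median of its own column.
imputer = SimpleImputer(strategy="median")

# fit_transform learns the per-column medians and returns a filled NumPy array,
# which is written back into the same columns.
df[["Age", "Fare"]] = imputer.fit_transform(df[["Age", "Fare"]])

print(df)
```

An advantage of an imputer object over a one-off fillna is that the fitted values can be reused: calling imputer.transform on test data fills it with the statistics learned from the training data.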

2. Removing Outliers

Outliers are data points that differ markedly from the rest of the dataset and may misrepresent it. Remove them if they would prevent your machine learning model from generalising. You can visualise outliers using a box plot or a scatter plot.

How to handle outliers in a DataFrame:

a) Using the Z-score:
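A minimal sketch of Z-score filtering. The Fare values are hypothetical, and the cutoff of 3 standard deviations is a common convention assumed here rather than something the text prescribes.

```python
import pandas as pd

# Hypothetical fares: twenty ordinary values plus one extreme value.
df = pd.DataFrame({"Fare": [7.25, 8.05, 7.9, 8.5, 7.75] * 4 + [512.33]})

# Z-score: how many standard deviations each value lies from the mean.
# pandas' std() uses the sample standard deviation (ddof=1).
z_scores = (df["Fare"] - df["Fare"].mean()) / df["Fare"].std()

# Keep only rows within 3 standard deviations of the mean (assumed cutoff).
df_no_outliers = df[z_scores.abs() < 3]

print(len(df), "rows before,", len(df_no_outliers), "rows after")
```

Note that on very small samples a single extreme value inflates the standard deviation enough that its own Z-score may never reach 3, so this method works best with a reasonable number of rows.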

b) Using the IQR (interquartile range):
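A sketch of IQR-based filtering on hypothetical Fare values. The 1.5 multiplier is the usual box-plot convention and is an assumption here; a larger multiplier makes the filter more permissive.

```python
import pandas as pd

# Hypothetical fares with one extreme value (illustrative only).
df = pd.DataFrame({"Fare": [7.25, 8.05, 7.9, 8.5, 7.75, 8.1, 7.6, 512.33]})

# The IQR is the spread of the middle 50% of the data.
q1 = df["Fare"].quantile(0.25)
q3 = df["Fare"].quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# between() is inclusive on both ends by default.
df_no_outliers = df[df["Fare"].between(lower, upper)]

print(df_no_outliers)
```

Because quartiles are barely affected by a few extreme values, this approach tends to be more robust than the Z-score on small or skewed data.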

3. Dropping Unnecessary Columns

Sometimes while analysing a dataset, you may want to focus on only a few columns and ignore the rest. In such cases you can drop the unwanted columns. For example, let's say you want to analyse the survival rate of passengers. Then it makes sense to drop the columns Ticket and Embarked, as they don't add much value to the analysis.
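Dropping the two columns named above might look like this. The DataFrame is a hypothetical two-row sample; the column names Ticket and Embarked come from the text, while Survived and Pclass are assumed for illustration.

```python
import pandas as pd

# Hypothetical sample with a few Titanic-style columns.
df = pd.DataFrame({
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Embarked": ["S", "C"],
})

# drop returns a new DataFrame without the listed columns.
df = df.drop(columns=["Ticket", "Embarked"])

print(df.columns.tolist())
```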

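4. Removing Duplicate Values

Duplicate rows can lead to a noisy machine learning model. To remove duplicate rows from a pandas DataFrame, use:

df = df.drop_duplicates()

which returns a copy of the DataFrame with duplicate rows removed, keeping the first occurrence of each. The result must be assigned back, because the original DataFrame is not modified in place.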
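A small runnable illustration on hypothetical data, showing that only fully identical rows are treated as duplicates by default:

```python
import pandas as pd

# Hypothetical sample: the third row repeats the first exactly.
df = pd.DataFrame({
    "Name": ["A", "B", "A", "A"],
    "Age": [22, 38, 22, 40],
})

# Rows count as duplicates only if every column matches;
# the first occurrence is kept by default.
df_unique = df.drop_duplicates().reset_index(drop=True)

# To compare on selected columns only, pass subset.
df_unique_names = df.drop_duplicates(subset=["Name"]).reset_index(drop=True)

print(df_unique)
print(df_unique_names)
```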

5. Fixing Structural Errors

Structural errors include inconsistent naming conventions, typos, and incorrect capitalization. These errors can produce mislabeled categories or classes, for example "male" and "Male" being counted as two different values. Python and its associated libraries offer various methods to handle such errors in a dataset.
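As a rough sketch, pandas string methods can normalise such values. The sample below is hypothetical, including the messy column name and the "femlae" typo, which are invented for illustration.

```python
import pandas as pd

# Hypothetical sample with stray whitespace, mixed case and a typo.
df = pd.DataFrame({" Sex ": [" male", "Female", "MALE", "femlae"]})

# Normalise column names: trim whitespace and lower-case them.
df.columns = df.columns.str.strip().str.lower()

# Normalise the values the same way, then map known typos to the right label.
df["sex"] = (
    df["sex"]
    .str.strip()
    .str.lower()
    .replace({"femlae": "female"})
)

print(df["sex"].value_counts())
```

Checking value_counts() or unique() before and after is a simple way to spot categories that should have been merged.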
