In this blog post, we are going to talk about preprocessing, the work we do to get a robust dataset for our model. When we are given a dataset, it is really important to understand it, and by understanding I mean finding the key features. That can look like a job for a domain expert, for example when the dataset is medical, but is it? Finding the key features in a dataset does not always mean that you have to be an expert in the field the dataset comes from; what we mostly need is statistical knowledge.
We want to understand whether a feature is relevant to the decision or not, regardless of the kind of model we are using (classification, clustering, etc.). That knowledge comes from preprocessing. For example, say we have a medical dataset with checkup results. This dataset may have high dimensionality (multiple test results for each individual).
First, we have to check whether we have string values (categorical, ordinal, etc.). All of these values must be replaced with numeric ones (e.g. male = 0, female = 1). This does not require any medical knowledge.
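To make the encoding step concrete, here is a minimal sketch with pandas. The checkup table, its column names, and its values are all made up for illustration; the only part taken from the post is the male = 0, female = 1 mapping. For a column with more than two unordered labels, one-hot encoding is one common option, though not the only one.

```python
import pandas as pd

# Hypothetical checkup records; names and values are invented for this example.
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "blood_type": ["A", "B", "O", "A"],
    "cholesterol": [180, 210, 195, 240],
})

# Binary category: map each label to an integer code (male -> 0, female -> 1).
df["sex"] = df["sex"].map({"male": 0, "female": 1})

# Unordered category with several labels: one indicator column per label,
# so the model does not read a fake ordering into the codes.
encoded = pd.get_dummies(df, columns=["blood_type"], dtype=int)

print(encoded)
```

After this, every column in `encoded` is numeric, and no medical knowledge was needed to get there.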
Next, we need to fill in missing values. Entire textbooks have been written on this topic, and it is considered a very hard task. Even though it is difficult, it is still a procedure that is handled with statistical methods and not medical knowledge. There are a ton of other actions we can take to clean up a dataset for better results.
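As a rough illustration of the simplest kind of imputation, the sketch below fills a numeric column with its median and a category-like column with its most frequent value. The data and column names are hypothetical, and this is only one basic strategy among many; real datasets often need something more careful.

```python
import numpy as np
import pandas as pd

# Hypothetical checkup data with gaps (NaN marks a missing value).
df = pd.DataFrame({
    "glucose": [90.0, np.nan, 110.0, 100.0],
    "smoker": [0.0, 1.0, np.nan, 0.0],
})

# First look at how much is missing in each column.
missing_per_column = df.isna().sum()

# Numeric measurement: fill gaps with the median, which is less
# sensitive to outliers than the mean.
df["glucose"] = df["glucose"].fillna(df["glucose"].median())

# Category-like column: fill gaps with the most frequent value (the mode).
df["smoker"] = df["smoker"].fillna(df["smoker"].mode()[0])

print(missing_per_column)
print(df)
```

Both fills use nothing but statistics of the column itself, which is the point made above.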
The final thing to do is dimensionality reduction. This is considered a hard task and I am not going to go into many details here; we are going to analyze it in a future blog post using pandas dataframes. In a few words, one of the DR (dimensionality reduction) methods is to find groups of highly correlated columns in your dataset and keep only one column from each group. For example, when column A has increasing values and column B mimics the behavior of column A, then we only need to keep one of these two columns. Another example is when a column has the same value across the whole dataset; that means the column carries no information and cannot have an impact on the output.
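The two examples above (a column that mimics another, and a column that never changes) can be sketched with pandas roughly like this. The toy data, the column names, and the 0.95 correlation threshold are my own assumptions for illustration, not fixed rules.

```python
import numpy as np
import pandas as pd

# Toy data: "b" mimics "a", and "d" has the same value everywhere.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
    "d": [7.0, 7.0, 7.0, 7.0, 7.0],
})

# 1) Drop columns with a single distinct value; they cannot affect the output.
constant_cols = [col for col in df.columns if df[col].nunique() <= 1]
df = df.drop(columns=constant_cols)

# 2) Absolute pairwise correlations, keeping only the upper triangle so that
#    each pair of columns is looked at once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# 3) Drop any column that is highly correlated with an earlier column.
THRESHOLD = 0.95  # assumed cut-off; tune it for your own data
correlated_cols = [col for col in upper.columns if (upper[col] > THRESHOLD).any()]
reduced = df.drop(columns=correlated_cols)

print(constant_cols, correlated_cols, list(reduced.columns))
```

Note that of each correlated pair only the later column is dropped, so the information is kept once instead of being thrown away entirely.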
The bottom line is that preprocessing, when it is done correctly, can raise your accuracy considerably, by as much as 50% in some cases. That’s why it is so important. For example, let’s say we have movie reviews and we want to perform a sentiment analysis in order to find which of them are positive and which of them are negative. If you perform the analysis on the raw text, the accuracy will suffer, because there are common words that appear in both categories, such as stop words, and they tell the model nothing about the sentiment.
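Here is a small sketch of stop-word removal for the movie review example. The stop-word list is a tiny hand-written one and the reviews are invented; in practice you would probably use a fuller list from an NLP library, and the tokenizer here is deliberately naive.

```python
import re

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "this", "it", "to"}


def remove_stop_words(text):
    """Lowercase the text, split it into words, and drop the stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


# Invented reviews, one positive and one negative.
reviews = [
    "This movie was a great surprise",
    "The plot is boring and the acting is terrible",
]

cleaned = [remove_stop_words(review) for review in reviews]
print(cleaned)
```

What is left after the filter are the words that actually differ between the positive and the negative review, which is what a sentiment model needs to see.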
That’s all for today’s post! Please let me know if you have any questions in the comments section below, or post on my Twitter @siaterliskonsta! Till next time, take care and bye bye!