Eduardo Silva

Week #5: We Have Data To Clean

Updated: Jan 19, 2022

1. Data pre-processing and analysis

In the week prior to the holidays, the team acknowledged that the provided dataset was far too large to work with directly (around 6 GB), which may be due to the complexity of the real-world scenario it captures.


As such, the team had to carry out a data pre-processing stage in order to simplify and reduce the number of products and their respective categories, as well as to clean outliers and entries that may not be relevant to the problem at hand.


Picking up where we left off in the week prior to the holidays, the team settled on the following 7 phases, which were considered necessary to pre-process the data to be used in the solution:


1. Clean missing data


Given the significant size of the dataset, and since there were relatively few entries with missing data, those entries were simply dropped.
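As a minimal sketch, the dropping could look like this in pandas (the file name is a placeholder, not the team's actual pipeline):

```python
import pandas as pd

# Placeholder path for the raw ~6 GB sales export.
df = pd.read_csv("sales_raw.csv")

# Few rows have missing values, so drop them outright.
before = len(df)
df = df.dropna()
print(f"Dropped {before - len(df)} incomplete rows")
```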


2. Filter necessary features


Despite the dataset having features such as the SKU number and 3 levels of product description (i.e. category, subcategory and product), among others, the team recognized that not all of these would be relevant to solving the problem at hand.


As such, the following features were selected for the final dataset (a small selection sketch in pandas follows the list):

  • Ticket ID - Receipt/Basket identifier;

  • Date - Date of the purchase;

  • Value - Total cost of the product in the receipt;

  • Quantity - Quantity/Units of the product that were bought;

  • Category - Category of the purchased product (e.g. Vegetables);

  • Product - Product designation (e.g. Tomato).
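The selection could look something like this, assuming illustrative English column names (the raw file's original Spanish headers are not shown in the post):

```python
import pandas as pd

df = pd.read_csv("sales_raw.csv")  # placeholder path

# Keep only the six features listed above (illustrative column names).
FEATURES = ["ticket_id", "date", "value", "quantity", "category", "product"]
df = df[FEATURES]
```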


3. Translate product and category labels


Given that the data comes from a South American supermarket chain, with this specific supermarket being situated in Uruguay, the raw labels are in Spanish.


Understandably, it is preferable for the team that they be presented in English, both for development and future presentation purposes.


As such, the team went through a category and product label translation process, which took some time and effort, but ultimately resulted in the translation of around 53 categories and 578 products.
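In code, this boils down to a label-to-label mapping. The dictionaries below are a tiny illustrative excerpt, not the team's full mapping of ~53 categories and ~578 products:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["Verduras", "Lácteos"],
    "product": ["Tomate", "Queso"],
})

# Illustrative excerpts of the Spanish -> English label maps.
CATEGORY_EN = {"Verduras": "Vegetables", "Lácteos": "Dairy"}
PRODUCT_EN = {"Tomate": "Tomato", "Queso": "Cheese"}

df["category"] = df["category"].map(CATEGORY_EN).fillna(df["category"])
df["product"] = df["product"].map(PRODUCT_EN).fillna(df["product"])
```

Using .map(...).fillna(...) leaves any label without a translation untouched, which makes it easy to spot entries still missing from the mapping.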


4. Convert currency from Uruguayan pesos to euros


As stated before, this dataset comes from a Uruguayan supermarket, which uses its local currency: Uruguayan pesos.


For the sake of familiarity, the team decided to convert the sales values to euros, which consisted of multiplying every value by 0.020, as 1 Uruguayan peso is worth 0.020 euros.
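The conversion itself is a one-liner; a sketch, reusing the value column from the earlier feature list:

```python
import pandas as pd

df = pd.DataFrame({"value": [120.0, 45.5]})  # prices in Uruguayan pesos

UYU_TO_EUR = 0.020  # fixed rate used by the team
df["value"] = df["value"] * UYU_TO_EUR
```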


5. Filter irrelevant products and categories


Given the size of the dataset, the team naturally identified products and categories that sold poorly or were irrelevant to the problem at hand, and decided to filter them out in order to obtain a more general, but still rich and representative, dataset.


This way, the dataset went from having 129 categories and 2069 products to 53 categories (about 60% reduction) and 578 products (about 72% reduction).
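The post does not detail the exact criterion, so the sketch below uses a hypothetical minimum-sales threshold to illustrate one plausible way to drop poorly selling products:

```python
import pandas as pd

df = pd.read_csv("sales_processed.csv")  # placeholder path

MIN_UNITS_SOLD = 100  # hypothetical cut-off for "sold poorly"
units = df.groupby("product")["quantity"].sum()
relevant = units[units >= MIN_UNITS_SOLD].index
df = df[df["product"].isin(relevant)]
```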


6. Split dates into separate day, month and year features


The dates in the raw dataset file are presented in a YYYY-MM-DD HH:MM:SS format, which takes up a lot of space, and the time of day is not even necessary for the problem at hand.


As such, the team decided to split the year, month and day into separate features, which helps reduce the storage taken up by the dataset and may also facilitate queries and other operations in future use cases (e.g. EDA, model training).
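With pandas datetime accessors, the split is straightforward; a self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2020-03-15 14:32:07", "2019-07-01 09:10:55"]})

df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df = df.drop(columns=["date"])  # the time of day is discarded entirely
```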


7. Aggregate product categories


In supermarkets, it is very common to find categories of products such as frozen meat, ice cream and desserts grouped in the same section (e.g. frozen meals).


As such, the team thought it might be interesting to check whether the genetic algorithm deals better with pre-aggregated product categories, such as the aforementioned example, or whether it "prefers" the flexibility of having separate product categories and aggregating them as it sees fit.
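The pre-aggregation can be expressed as a simple category-to-section map; the map below reuses the frozen-meals example and is illustrative only, not the team's actual grouping:

```python
import pandas as pd

df = pd.DataFrame({"category": ["Frozen Meat", "Ice Cream", "Vegetables"]})

# Illustrative aggregation map (hypothetical groupings).
AGGREGATE = {
    "Frozen Meat": "Frozen Meals",
    "Ice Cream": "Frozen Meals",
    "Desserts": "Frozen Meals",
}
df_aggregated = df.copy()
df_aggregated["category"] = df_aggregated["category"].replace(AGGREGATE)
```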


As a result, by the end of the pre-processing pipeline, two separate datasets are obtained (i.e. the original and the one with pre-aggregated categories), and the difference is clear: the amount of memory taken up by the datasets has been reduced by about 66%, from about 6 GB to roughly 2 GB.


In summary, the resulting dataset presents only the necessary features for the problem at hand, labels translated from Spanish to English, currency in euros, dates split into separate features, and irrelevant products and categories (outliers) filtered out.


Afterwards, the team decided to perform an exploratory data analysis on the processed dataset, obtaining some insights about the impact of COVID-19 on customers' behaviour and how the most profitable products and categories vary throughout the year. Graphs displaying these insights will be presented shortly.



2. Apriori algorithm

Another interesting insight on the matter is which products/categories are frequently bought together (e.g. people who buy beer usually also buy snacks), which is obtained by applying the Apriori algorithm to the processed dataset.


An obstacle the team ran into here is the amount of data still present in the dataset and how the algorithm handles such large volumes. As the Apriori algorithm works with a pivot table and executes some complex calculations on it, the team quickly found their computers running out of memory when running the algorithm.


As such, after some deliberation, the team concluded that, at most, the algorithm would be executed for 2019 and 2020 separately, as it may be interesting to analyse the impact of the pandemic on people's behaviours and shopping habits. If further optimization were deemed necessary, the algorithm could be run only on the product categories instead, as they are far fewer (53 categories versus 578 products).
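As a sketch of that setup, here is how a per-year, category-level run could look with mlxtend's Apriori implementation (the file path and the support/confidence thresholds are placeholders, not the team's actual values):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_csv("sales_processed.csv")  # placeholder path
df_2019 = df[df["year"] == 2019]  # one year at a time to save memory

# Boolean basket matrix: one row per ticket, one column per category.
basket = (
    df_2019.groupby(["ticket_id", "category"])["quantity"].sum()
    .unstack(fill_value=0)
    > 0
)

frequent = apriori(basket, min_support=0.05, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "confidence"]])
```

Restricting the basket matrix to a single year and to categories (rather than individual products) keeps the one-hot pivot table small enough to fit in memory.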


This way, some of the most commonly associated product categories found by the team are presented in the following table.

Antecedents | Consequents | Confidence
Fruit       | Vegetables  | 63%
Cheese      | Dairy       | 60%
Dairy       | Fruit       | 59%
Meat        | Vegetables  | 52%
Cheese      | Bakery      | 50%

3. State of the art

Picking up where the article left off, the team did some research on the techniques and algorithms that will be applied in the solution, in order to understand how they work, how they are used in retail solutions and, more specifically, how they apply to the problem at hand.


As discussed previously, the team has settled on three techniques: (i) a genetic algorithm, (ii) regression algorithms and (iii) the Apriori algorithm. All of these techniques are found in several solutions in the retail domain, with regression and Apriori algorithms commonly appearing in sales/revenue forecasting and product recommendation systems, respectively.


4. Week retrospective

In retrospect, this week was rather productive, with the team having made good progress on several fronts: data pre-processing and analysis, the application of the Apriori algorithm and the writing of the state of the art in the scientific article. As such, it is safe to state that this week's goals and milestones were achieved.


Next week, the team expects to finalize the application of the machine learning algorithms, in order to advance to the genetic algorithm's design and hopefully start its implementation, while also writing the solution proposal chapter of the scientific article.


It's worth noting that, by the end of next week, the team is expected to deliver the state of the art chapter of the scientific article, so that the professors can check on the team's progress in the article's writing.
