1. Initial data insights
With the holidays approaching, this week was shorter than usual, so the team only started the exploratory data analysis and data pre-processing, both of which will be continued and completed after the holidays.
Right away, the team acknowledges that the provided dataset is currently far too large, which may be due to the complexity of a real-world scenario. The team will therefore need to reduce the number of products and their respective categories, and to clean out outliers and entries that are not relevant to the problem at hand.
For reference, the current number of entries per feature is presented in the following table:
Feature | No. of entries |
Receipt ID | 5 141 676 |
Date | 874 |
Value | 107 456 |
Quantity | 2 325 |
Category | 136 |
Product | 2 234 |
Total | 42 244 350 |
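As a rough illustration of how such per-feature counts could be produced, here is a minimal sketch in plain Python. It assumes the figures in the table are numbers of distinct values per feature, which the post does not state explicitly. The sample rows and the field names (`receipt_id`, `date`, and so on) are made up for the example and are not taken from the real dataset.

```python
from collections import defaultdict

# Hypothetical sample rows; field names only mirror the features in the table.
rows = [
    {"receipt_id": 1, "date": "2021-01-04", "value": 1.20, "quantity": 2,
     "category": "fruits", "product": "apple"},
    {"receipt_id": 1, "date": "2021-01-04", "value": 0.80, "quantity": 1,
     "category": "bakery", "product": "bread roll"},
    {"receipt_id": 2, "date": "2021-01-05", "value": 1.20, "quantity": 3,
     "category": "fruits", "product": "apple"},
]


def distinct_counts(rows):
    """Count the distinct values seen for each feature across all rows."""
    seen = defaultdict(set)
    for row in rows:
        for feature, value in row.items():
            seen[feature].add(value)
    return {feature: len(values) for feature, values in seen.items()}


counts = distinct_counts(rows)
for feature, count in counts.items():
    print(f"{feature}: {count}")
```

On a dataset of the size reported above, the same idea would more likely be expressed with a dataframe library, but the logic stays the same.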
Through some exploratory data analysis, the team identified several product categories that are insignificant, either because they sold very poorly (e.g. only one product sold in a year) or because they are too specific to the country where this supermarket is located. It is worth noting that the team aims to keep this dataset as general as possible, working with categories such as fruits, vegetables, drinks and bakery.
Furthermore, the following bar plot shows the product categories that sold the most in 2021. Vegetables, fruits and bakery are among the best-selling categories, as expected, since they are present in most purchases made in a supermarket.
In the next post, a more in-depth exploratory data analysis will be performed, and more graphs and insights will be shared with the readers. Stay tuned!
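The filtering described above could be sketched as follows. This is only an illustration under stated assumptions, not the team's actual pre-processing code: the threshold `MIN_UNITS`, the exclusion set `LOCAL_ONLY`, the helper `significant_categories` and the sample sales are all hypothetical.

```python
from collections import Counter

# Hypothetical sample: (category, product, units sold in the year).
sales = [
    ("fruits", "apple", 120),
    ("fruits", "banana", 95),
    ("bakery", "bread roll", 300),
    ("fishing gear", "hook", 1),
    ("regional sweets", "local cake", 80),
]

MIN_UNITS = 10                    # assumed cut-off for "sold very poorly"
LOCAL_ONLY = {"regional sweets"}  # assumed hand-picked country-specific categories


def significant_categories(sales, min_units, excluded):
    """Return the categories that sold at least min_units and are not excluded."""
    units = Counter()
    for category, _product, quantity in sales:
        units[category] += quantity
    return {
        category
        for category, total in units.items()
        if total >= min_units and category not in excluded
    }


kept = significant_categories(sales, MIN_UNITS, LOCAL_ONLY)
filtered = [row for row in sales if row[0] in kept]
```

Keeping the poor-seller threshold and the country-specific list as separate inputs makes it easy to revisit either decision later without touching the other.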
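The ranking behind a plot like this can be computed in a few lines. The sketch below assumes ISO-formatted date strings and uses invented sample rows; the function name `top_categories` is hypothetical and the numbers are not the real sales figures.

```python
from collections import Counter

# Hypothetical sample: (ISO date, category, units sold).
rows = [
    ("2021-03-01", "vegetables", 50),
    ("2021-03-01", "fruits", 40),
    ("2021-06-10", "bakery", 30),
    ("2021-07-02", "drinks", 10),
    ("2020-12-30", "drinks", 500),  # outside 2021, so it is ignored
    ("2021-08-15", "vegetables", 25),
]


def top_categories(rows, year, n=3):
    """Return the n categories with the most units sold in the given year."""
    totals = Counter()
    for date, category, quantity in rows:
        if date.startswith(f"{year}-"):
            totals[category] += quantity
    return totals.most_common(n)


ranking = top_categories(rows, 2021)
for category, units in ranking:
    print(f"{category:<12} {units}")
```

The resulting list of (category, units) pairs is already in the order a bar plot would need.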
2. Scientific article
With the problem defined and a solution proposed, the team decided to start writing the scientific article that will report the state of the art of the techniques used in the proposed solution (i.e. genetic algorithms, the Apriori algorithm and regression) and how each of them contributes to solving the problem.
Furthermore, the article will present an overview of the solution itself, as well as its benefits and disadvantages when compared to other existing solutions.
Lastly, the team will follow the development of the solution according to the CRISP-DM methodology, all the way from the business and data understanding to the modeling and evaluation phases.
By the end of this week, the team had written the introduction, and it expects to have the state of the art written by the first week after the holidays. Naturally, the progress will be reported then.
3. Week retrospective
In retrospect, this week, although shorter than usual, was rather productive: the team started writing the scientific article and performed an initial pre-processing and exploratory data analysis of the provided dataset.
After the holidays, this work will be resumed, with the professors asking for the data pre-processing to be completed by the end of the 5th week and the state-of-the-art chapter of the scientific article to be written by the 6th week.
With that said, the team now leaves for a short vacation, which is very much appreciated, as there was no break between the first and second challenges.
This will help the team recharge its batteries for the rest of this challenge, which looks very promising but also very demanding. The team will need to be at full strength to tackle the many obstacles that will certainly come up.
To the reader, Merry Christmas and a Happy New Year!