Deduplicating 2 billion records for a supply chain platform
A supply chain management platform needed a specialized team to improve their data enrichment process. After seeing a talk by Flavio Juvenal, one of our partners, on record linkage for vast amounts of data, they realized it could be the piece they were missing to crack the puzzle. You can also check Flavio’s talk and the Jupyter Notebook code here.
Client Vision and Key Challenges
Our client needed help with the massive amount of data they had to process. Since part of their business is trade-related, they consistently sought improved ways to clean, group, and deduplicate their data. They contacted us because they needed a specialist team to improve these steps in the data enrichment process.
To do that, we began working closely with their infrastructure and data science teams on their ETL pipeline. On the data science side, we aimed to increase the accuracy of their deduplication efforts within that pipeline. Scale was also essential for them: the pipeline still had to process over 2 billion records in a couple of hours.
On the infrastructure side, we had to work alongside their team to implement good practices and ensure our work would be easy to replicate so that their platform would be more straightforward for other teams to consume.
Data science front
Increasing accuracy required us to delve deep into their ETL pipeline and work to improve all steps, from pre-processing and cleaning to clustering and community detection. As for tools, we worked on a workflow inside Apache Airflow, built on Kubernetes.
The beginning of the work presented a challenge: at their scale, we needed asyncio to parallelize our communication with the database and multiprocessing to keep processing and transforming our data across multiple CPU cores. Plugging both together with a minimum of bottlenecks is no easy feat.
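The shape of that pipeline can be sketched as below. This is a minimal illustration, not the client's actual code: the batch source stands in for an async database driver, and `normalize` stands in for the real cleaning logic. The key idea is handing CPU-bound batches to a process pool via `run_in_executor` so the event loop stays free to keep fetching.

```python
import asyncio
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def normalize(batch):
    # Stand-in for CPU-bound cleaning; runs inside a worker process.
    return [record.strip().lower() for record in batch]

async def fetch_batches(n_batches):
    # Stand-in for async database reads (e.g. via an async DB driver).
    for i in range(n_batches):
        await asyncio.sleep(0)  # yield control back to the event loop
        yield [f"  Record-{i}-{j}  " for j in range(3)]

async def main():
    results = []
    loop = asyncio.get_running_loop()
    # "fork" keeps worker start-up cheap (Unix only).
    ctx = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(mp_context=ctx) as pool:
        async for batch in fetch_batches(3):
            # CPU-bound work goes to worker processes without blocking I/O.
            results.append(await loop.run_in_executor(pool, normalize, batch))
    return results

if __name__ == "__main__":
    print(asyncio.run(main()))
```

In a real pipeline, the fetch and transform stages would also be decoupled by a queue so slow batches do not stall new reads.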
After that, we realized that the indexing in the dedupe library, a community standard for such efforts, would not reach the scale we needed: its indexes were not expressive enough for the type of records we had. To find candidate pairs without comparing every record to every other record (a cost that grows quadratically with record count), we plugged scikit-learn into dedupe’s indexing process, radically improving its speed and reducing comparison time.
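One common way to do this kind of scikit-learn-backed indexing (a sketch of the general technique, not the client's exact implementation) is to vectorize records with character n-grams and keep only each record's nearest neighbours as candidate pairs, instead of scoring all pairs:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy company names; duplicates differ by case and abbreviation.
names = [
    "Acme Logistics Ltd",
    "ACME Logistics Limited",
    "Globex Shipping Co",
    "Globex Shipping Company",
    "Initech Freight",
]

# Character n-grams tolerate typos and abbreviation differences.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectors = vectorizer.fit_transform(names)

# Compare each record only with its nearest neighbour (plus itself),
# rather than with every other record.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vectors)
_, indices = nn.kneighbors(vectors)

candidate_pairs = {
    tuple(sorted((i, int(j))))
    for i, row in enumerate(indices)
    for j in row
    if i != j
}
print(sorted(candidate_pairs))
```

With n records and k neighbours, this produces at most n·k candidate pairs for the expensive comparison stage, instead of n·(n−1)/2.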
Delivering the improvements they needed could not be restricted to Python scripts alone; we had to work side by side with their data team. So, after ensuring efficiency was no longer an issue, we shifted towards their train-validate-test cycle.
Their team didn’t know which classifiers would best compare their records. So, we prepared multiple proofs of concept for them to choose the best performer while instructing them on the technical tradeoffs of each choice.
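A proof of concept like that typically boils down to cross-validating several candidate classifiers on the same pair features and comparing a shared metric. The sketch below uses synthetic data and hypothetical feature names; the candidate models are illustrative, not the ones the client ultimately compared:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical pair features: name similarity, address similarity, ID overlap.
X = rng.random((200, 3))
y = (X.mean(axis=1) > 0.5).astype(int)  # toy label: similar pairs are duplicates

candidates = {
    "logistic_regression": LogisticRegression(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

for name, model in candidates.items():
    # Same folds and metric for every candidate keeps the comparison fair.
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```

Beyond the score, the tradeoffs discussed would include training cost, inference latency at billions of pairs, and how interpretable each model's decisions are.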
After comparing, we got a probability that each pair of records referred to the same company. Now, it was time to cluster the comparison pairs to understand which companies were, in fact, duplicates. Here, we used the community standards scikit-learn and pandas, but since they were not fast enough on their own, we invested some time optimizing how they communicated with multiprocessing.
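The simplest version of this clustering step (a sketch, not the client's chosen algorithm) thresholds the pair scores and takes connected components: any records linked by a high-probability chain end up in the same duplicate group. A small union-find makes that concrete:

```python
# Scored pairs: (record_a, record_b, probability of being the same company).
scored_pairs = [
    ("A", "B", 0.97),
    ("B", "C", 0.91),
    ("D", "E", 0.88),
    ("A", "E", 0.12),  # below threshold, so A/B/C and D/E stay separate
]
THRESHOLD = 0.8

parent = {}

def find(x):
    # Find the root of x's cluster, with path compression.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b, score in scored_pairs:
    if score >= THRESHOLD:
        union(a, b)

clusters = {}
for node in parent:
    clusters.setdefault(find(node), set()).add(node)

print(sorted(sorted(members) for members in clusters.values()))
# → [['A', 'B', 'C'], ['D', 'E']]
```

The threshold controls the precision/recall tradeoff: raising it splits clusters apart, lowering it merges them, which is exactly the kind of knob the proofs of concept let the data science team evaluate.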
Finally, we connected with their data science team to validate the efficiency and trade-offs of each clustering algorithm, creating more proofs of concept, so they could decide on the ideal one for their needs.
Infrastructure front
They had a large infrastructure team that had to attend to multiple other teams and their needs. One of those needs was that an infrastructure expert was required whenever a Python developer or data scientist wanted to run an experiment. Our work with them focused on the documentation and methods needed so developers would no longer require an expert for this specific job. The work took four steps:
- Understand the most recurrent complaints from other teams and the automation they already had in place;
- Focus on Developer UX, better defining how some methods should work, what kind of parameters to use, and how to fetch some automatically whenever possible;
- Then, we moved to automated tests to enforce this way of working and let the Python developers know what went wrong, so they could correct mistakes and try again without an infrastructure expert on their side;
- Finally, we helped them document everything and write how-tos, so other teams could understand the architecture, ideas, risks, and changes they had in mind for the future.
In the end, we reached their goal of processing 2 billion records in a couple of hours with improved accuracy, while deepening their understanding of how the classification and clustering algorithms perform.
Our work with the infrastructure team also freed them from mandatory participation in time-consuming experiments, giving them time to focus on other tasks on their roadmap. Other teams also reported increased satisfaction, as more of their problems could now be solved independently.