Customer database matching and consolidation for data enrichment and quality improvement

THE PROBLEM

The client, a leading FMCG manufacturer, sells its products through a vast network of retailers across the country. The client maintains a customer database that stores essential information about these retailers, such as name, address, geolocation, and retailer type. In addition, the client uses Nielsen surveys at regular intervals to keep this data up to date and to obtain other, more dynamic information such as inventory information. The mapping between the client’s internal customer database and the Nielsen survey data was done manually, making the process time-consuming, laborious, and prone to errors. Inherent data quality issues such as duplication, entry errors, and measurement errors made the matching and consolidation even more difficult. Hence, the client wanted to leverage data science to map the retailers across the two databases and consolidate them into a single data source.

INXITE OUT APPROACH

Data Understanding

The databases were explored, and a long list of attributes was prioritized in alignment with the business team. The quality of these attributes was assessed heuristically and with input from the experts. In addition, business rules were identified and codified to account for known data limitations (e.g., the client’s geolocation measurements were inherently less accurate than Nielsen’s, owing to the limitations of the measurement instruments used).
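As an illustration of how a geolocation business rule of this kind might be codified, the sketch below accepts a candidate pair only if the two coordinates lie within a distance tolerance. The tolerance value and the function names are assumptions made for the example; the case study does not state the actual rules or thresholds.

```python
import math

# Hypothetical tolerance in metres; the source does not give an actual value.
GEO_TOLERANCE_M = 250.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    r = 6_371_000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def within_geo_tolerance(client_pt, survey_pt, tolerance_m=GEO_TOLERANCE_M):
    """Treat two outlets as geographically compatible if they are close enough.

    The tolerance absorbs the lower accuracy of the client's measurements.
    """
    return haversine_m(*client_pt, *survey_pt) <= tolerance_m
```

A looser tolerance raises coverage at the cost of more false candidates, so in practice such a threshold would be tuned with the business team.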

Hierarchical Fuzzy Mapping

We devised a hierarchical, search-based mapping strategy: an ensemble of multiple search algorithms, each using a different subset of attributes, designed to maximize confidence and coverage simultaneously. At one end of the hierarchy, we applied attribute-level filters on the relatively higher-quality attributes. At the other end, we applied NLP-based fuzzy matching on combinations of the relatively lower-quality attributes.
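The idea can be sketched roughly as follows. This is a minimal illustration, not the production approach: it uses Python's standard-library difflib.SequenceMatcher as a stand-in for the NLP-based fuzzy matcher, and the field names (name, address, city, type), the filter levels, and the min_score threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case and collapse whitespace before comparing strings."""
    return " ".join(text.lower().split())


def fuzzy_score(a, b):
    """Similarity in [0, 1]; a simple stand-in for an NLP-based matcher."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def hierarchical_candidates(outlet, survey_rows, filter_levels, min_score=0.6):
    """Try the strictest filter set first and relax it until candidates are found.

    filter_levels is ordered from most to least restrictive; each entry is a
    tuple of attribute names that must match exactly at that level.
    Returns (level_index, [(score, row), ...]) or (None, []) if nothing matches.
    """
    outlet_text = f"{outlet['name']} {outlet['address']}"
    for level, attrs in enumerate(filter_levels):
        # Exact-match filters on the higher-quality attributes.
        pool = [r for r in survey_rows if all(r[a] == outlet[a] for a in attrs)]
        # Fuzzy match on the combined lower-quality attributes.
        scored = [(fuzzy_score(outlet_text, f"{r['name']} {r['address']}"), r) for r in pool]
        hits = [(s, r) for s, r in scored if s >= min_score]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        if hits:
            return level, hits
    return None, []


# Hypothetical usage with made-up records.
outlet = {"name": "Sri Ganesh Stores", "address": "12 MG Road", "city": "Pune", "type": "grocery"}
survey = [
    {"id": "N1", "name": "Shri Ganesh Store", "address": "12 M G Road", "city": "Pune", "type": "grocery"},
    {"id": "N2", "name": "City Pharmacy", "address": "45 Station Road", "city": "Pune", "type": "chemist"},
]
levels = [("city", "type"), ("city",), ()]
level, hits = hierarchical_candidates(outlet, survey, levels)
```

Recording the level at which a match was found is useful because matches found under stricter filters can later be given more confidence than matches found by fuzzy search alone.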

Final Output Generation

Finally, a confidence score was generated for each search algorithm, based on considerations such as the fuzzy match probability, the number of filters used, and their quality index. These confidence scores were used to ensemble the algorithms and select the best one for any given outlet record. For each outlet record in the client’s internal database, the top five recommendations from Nielsen’s database were provided, along with their confidence scores.
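One way such a confidence score and top-five selection could be put together is sketched below. The scoring formula, the BASE_WEIGHT constant, and the tuple layout of the results are assumptions made purely for illustration; the case study does not disclose the actual formula.

```python
import math

# Hypothetical weight given to the fuzzy probability when no filters were applied.
BASE_WEIGHT = 0.5


def confidence(fuzzy_prob, filter_qualities):
    """Combine the fuzzy match probability with the filters used.

    filter_qualities holds one quality index in [0, 1] per filter applied.
    The strength term grows with both the number of filters and their quality.
    """
    strength = 1.0 - math.prod(1.0 - q for q in filter_qualities)
    return fuzzy_prob * (BASE_WEIGHT + (1.0 - BASE_WEIGHT) * strength)


def top_recommendations(results, k=5):
    """Keep the best-scoring algorithm per candidate and return the top k.

    results is a list of (algorithm, candidate_id, fuzzy_prob, filter_qualities).
    Returns a list of (candidate_id, algorithm, confidence) sorted by confidence.
    """
    best = {}
    for algo, cand_id, fuzzy_prob, qualities in results:
        conf = confidence(fuzzy_prob, qualities)
        if cand_id not in best or conf > best[cand_id][0]:
            best[cand_id] = (conf, algo)
    ranked = sorted(best.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(cand_id, algo, round(conf, 3)) for cand_id, (conf, algo) in ranked[:k]]


# Hypothetical results from two search algorithms for one outlet record.
results = [
    ("strict_filters", "N1", 0.90, [0.9, 0.8]),
    ("fuzzy_only", "N1", 0.95, []),
    ("fuzzy_only", "N2", 0.60, []),
]
top = top_recommendations(results)
```

In this sketch a candidate found under high-quality filters outranks the same candidate found by fuzzy search alone, even when the raw fuzzy probability of the latter is slightly higher.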

RESULT

The model provided mapping recommendations for 95% of outlets, with an average confidence of ~70%. The client adopted the model in its business process.
