Innovative blocking technique for entity resolution

Entity Resolution (ER) identifies records that refer to the same real-world entity across multiple datasets, and blocking keeps this process efficient by restricting comparisons to records that share a block. Our study uses the fuzzy k-modes algorithm to address a limitation of conventional blocking approaches: binary assignment, in which each record belongs to exactly one block. Our method instead allows each object to have a degree of membership in multiple blocks, thereby enhancing matching precision.
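To make the idea of graded membership concrete, here is a minimal sketch of the standard fuzzy k-modes membership update for categorical records. It assumes a simple-matching dissimilarity (the count of differing attributes) and a fuzziness exponent alpha; the function names, the example records, and the 0.25 threshold are hypothetical and are not taken from the paper, whose adaptation to alphanumeric data may use a different distance.

```python
def simple_matching(a, b):
    """Count the attribute positions at which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def fuzzy_memberships(record, modes, alpha=2.0):
    """Return the degree of membership of `record` in each block.

    Standard fuzzy k-modes update: membership is proportional to
    (1 / distance) ** (1 / (alpha - 1)), normalized to sum to 1.
    alpha > 1 is the fuzziness exponent (an assumed default of 2.0).
    """
    dists = [simple_matching(record, mode) for mode in modes]
    exact = [i for i, d in enumerate(dists) if d == 0]
    if exact:
        # The record coincides with one or more modes: share full
        # membership among those modes and give 0 to the rest.
        return [1.0 / len(exact) if i in exact else 0.0
                for i in range(len(modes))]
    exponent = 1.0 / (alpha - 1.0)
    weights = [(1.0 / d) ** exponent for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


def assign_blocks(record, modes, threshold, alpha=2.0):
    """Place the record in every block whose membership meets the threshold."""
    memberships = fuzzy_memberships(record, modes, alpha)
    return [i for i, w in enumerate(memberships) if w >= threshold]


# Hypothetical example: three block modes over (first name, surname, city).
modes = [("john", "smith", "la"), ("jon", "smyth", "ny"), ("mary", "jones", "sf")]
record = ("john", "smith", "ny")
blocks = assign_blocks(record, modes, threshold=0.25)
```

With a hard assignment the record would land only in the first block; with graded membership and a threshold it can also be compared against records in the second block, which is the behavior the paragraph above describes.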
Adapting the fuzzy k-modes clustering algorithm to alphanumeric data and parallelizing it with Apache Spark significantly improves scalability. In addition, probabilistic and statistical methods, such as Modified Z-Scores and Maximum Entropy, are employed to select appropriate thresholds automatically. Although further work remains, this approach has the potential to greatly enhance data integration processes across various industries. Refer to the full paper here: https://link.springer.com/article/10.1007/s10115-025-02609-w
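As an illustration of the Modified Z-Score mentioned above, the following sketch computes the standard statistic, 0.6745 * (x - median) / MAD, and flags values that stand out from the rest. How the paper turns this statistic into a blocking threshold is not described here, so the `standout_indices` helper, the conventional 3.5 cutoff, and the sample values are assumptions for illustration only.

```python
from statistics import median


def modified_z_scores(values):
    """Modified Z-Score of each value, based on the median and the
    median absolute deviation (MAD) rather than the mean and std."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All deviations are zero at the median: no value stands out.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def standout_indices(values, cutoff=3.5):
    """Hypothetical helper: indices whose score exceeds the cutoff.
    3.5 is a commonly used convention, not a value from the paper."""
    return [i for i, z in enumerate(modified_z_scores(values)) if z > cutoff]


# Hypothetical membership degrees of one record across six blocks.
degrees = [0.10, 0.12, 0.11, 0.09, 0.13, 0.90]
strong = standout_indices(degrees)
```

Because the median and MAD are robust to extreme values, a single very high membership degree does not distort the scale against which the other degrees are judged.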