Adapting the fuzzy k-modes clustering algorithm to alphanumeric data and parallelizing it with Apache Spark significantly improves scalability. In addition, probabilistic and statistical methods, such as Modified Z-Scores and Maximum Entropy, are used to select appropriate thresholds automatically. Although further work remains, this approach has the potential to improve data integration processes across many industries. The full paper is available here: https://link.springer.com/article/10.1007/s10115-025-02609-w
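For readers unfamiliar with Modified Z-Scores, the sketch below shows the standard median-based definition only. It is not taken from the paper: how the study turns these scores into blocking thresholds is not described here, the sample values are made up, and the cutoff of 3.5 is a commonly cited convention rather than a value from the paper.

```python
import statistics


def modified_z_scores(values):
    """Return the modified z-score of each value.

    Uses the median and the median absolute deviation (MAD) instead of
    the mean and standard deviation, which makes the score robust to
    extreme values.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the values equal the median; the score is undefined.
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (v - med) / mad for v in values]


# Hypothetical scores for illustration only.
sample = [10, 12, 11, 13, 90]
CUTOFF = 3.5  # assumed conventional cutoff, not a value from the paper

scores = modified_z_scores(sample)
flags = [abs(s) > CUTOFF for s in scores]
```

Because the median and MAD barely move when a single extreme value is present, the last value in the sample stands out clearly while the others do not.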
Innovative blocking technique for entity resolution
Entity Resolution (ER) identifies records that refer to the same real-world entity across multiple datasets, and blocking is a key technique for making this process efficient. By using the fuzzy k-modes algorithm, which gives each record a degree of membership in every cluster rather than a yes-or-no assignment, our study addresses the limitations of binary assignment in blocking approaches.
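To make the contrast with binary assignment concrete, here is a minimal sketch of the membership computation commonly used in fuzzy k-modes, with simple attribute matching as the dissimilarity. It is an illustration under assumptions, not the paper's implementation: the modes are fixed by hand instead of being learned, the records are invented, and the function names, the fuzziness exponent `alpha=2.0`, and the membership threshold `0.3` are all hypothetical choices.

```python
def matching_dissimilarity(record, mode):
    """Count the attributes on which the record and the mode differ."""
    return sum(a != b for a, b in zip(record, mode))


def fuzzy_memberships(record, modes, alpha=2.0):
    """Return the record's degree of membership in each mode's cluster.

    The memberships sum to 1. alpha > 1 controls the fuzziness.
    """
    dists = [matching_dissimilarity(record, m) for m in modes]
    if 0 in dists:
        # The record coincides with at least one mode: share full
        # membership among the matching modes, zero for the rest.
        hits = dists.count(0)
        return [1.0 / hits if d == 0 else 0.0 for d in dists]
    exponent = 1.0 / (alpha - 1.0)
    return [1.0 / sum((d / other) ** exponent for other in dists) for d in dists]


def assign_blocks(record, modes, threshold=0.3, alpha=2.0):
    """Place the record in every block whose membership reaches the threshold."""
    memberships = fuzzy_memberships(record, modes, alpha)
    return [i for i, u in enumerate(memberships) if u >= threshold]


# Hypothetical cluster modes and record (surname, city, country).
modes = [
    ("smith", "london", "uk"),
    ("smith", "paris", "fr"),
    ("garcia", "madrid", "es"),
]
record = ("smith", "london", "fr")

memberships = fuzzy_memberships(record, modes)
blocks = assign_blocks(record, modes)
```

In this example the record is equally close to the first two modes, so it is placed in both blocks. A binary assignment would have to pick one of them and could miss matching records in the other.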