Adapting the fuzzy k-modes clustering algorithm to alphanumeric data and parallelizing it with Apache Spark significantly improves scalability. Probabilistic and statistical methods, such as Modified Z-Scores and Maximum Entropy, are used to select appropriate thresholds automatically. Further work remains, but the approach could substantially improve data integration processes across a range of industries. The full paper is available here: https://link.springer.com/article/10.1007/s10115-025-02609-w
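
To make the threshold-selection idea concrete, here is a minimal sketch of the Modified Z-Score in its standard textbook form (the Iglewicz-Hoaglin definition, with the constant 0.6745 and a commonly used cutoff of 3.5). It is not taken from the paper: the function names, the sample similarity scores, and the cutoff are assumptions chosen for illustration, and the paper may apply the statistic differently.

```python
from statistics import median


def modified_z_scores(values):
    """Return the Modified Z-Score of each value (standard textbook form)."""
    med = median(values)
    # Median absolute deviation (MAD) around the median.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Simplifying assumption for this sketch: with zero MAD,
        # no value is scored as deviating.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, cutoff=3.5):
    """Return the values whose absolute Modified Z-Score exceeds the cutoff.

    The default cutoff of 3.5 is the commonly cited rule of thumb,
    not a value taken from the paper.
    """
    scores = modified_z_scores(values)
    return [v for v, s in zip(values, scores) if abs(s) > cutoff]


# Hypothetical similarity scores between candidate record pairs.
similarities = [0.91, 0.88, 0.93, 0.90, 0.89, 0.35]
print(flag_outliers(similarities))
```

Because the score is built on the median and the median absolute deviation rather than the mean and standard deviation, a single extreme value does not distort the scale it is measured against, which is one reason the statistic suits automatic threshold selection.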