TY  - JOUR
T1  - On Robustness of Mutual Funds Categorization and Distance Metric Learning
JF  - The Journal of Financial Data Science
SP  - 130
LP  - 150
DO  - 10.3905/jfds.2021.3.4.130
VL  - 3
IS  - 4
AU  - Dhruv Desai
AU  - Dhagash Mehta
Y1  - 2021/10/31
UR  - https://pm-research.com/content/3/4/130.abstract
N2  - Identifying similar mutual funds among a given universe of funds has many applications, including competitor analysis, marketing and sales, tax loss harvesting, and so on. For a contemporary analyst, the most popular approach to finding similar funds is to look up a categorization system such as Morningstar categorization. Morningstar categorization has been heavily investigated by academic researchers from various angles, including using unsupervised clustering techniques in which clusters were found to be inconsistent with categorization. Recently, however, categorization has been studied using supervised classification techniques, with the categories being the target labels. Categorization was indeed learnable with very high accuracy using a purely data-driven approach, causing a paradox: Clustering was inconsistent with respect to categorization, whereas supervised classification was able to reproduce (near) complete categorization. Here, the authors resolve this apparent paradox by pointing out incorrect uses and interpretations of machine learning techniques in the previous academic literature. The authors demonstrate that by using an appropriate list of variables and metrics to identify the optimal number of clusters and preprocessing the data using distance metric learning, one can indeed reproduce the Morningstar categorization using a data-driven approach.
The present work puts an end to the debate on this issue and establishes that the Morningstar categorization is indeed intrinsically rigorous, consistent, rule-based, and reproducible using data-driven approaches, if machine learning techniques are correctly implemented. Key Findings ▪ Academic literature has time and again questioned the consistency and robustness of mutual fund categorization systems, such as Morningstar categorization, by contrasting them with unsupervised clustering of funds. ▪ Here, the authors settle the debate in favor of Morningstar categorization by pointing out the use of incorrect lists of variables and interpretation of machine learning algorithms in the previous literature, while emphasizing that the main missing piece from the machine learning side in previous research was the appropriate distance metric. ▪ The authors employ a machine learning technique called distance metric learning and reproduce the Morningstar categorization completely using a data-driven approach.
ER  - 