P. Fogel, C. Geissler, N. Morizet and G. Luta. Special Issue "Advances in Applied Probability and Statistical Inference", MDPI Mathematics, November 10, 2023.
Abstract: The choice of the factorization rank of a matrix is critical in applications such as dimensionality reduction, filtering, clustering, and deconvolution, because selecting a rank that is too high amounts to fitting the noise, while selecting a rank that is too low oversimplifies the signal. Numerous methods for selecting the factorization rank of a non-negative matrix have been proposed. One of them is the cophenetic correlation coefficient (𝑐𝑐𝑐), widely used in data science to evaluate the number of clusters in a hierarchical clustering. In previous work, it was shown that 𝑐𝑐𝑐 performs better than other methods for rank selection in non-negative matrix factorization (NMF) when the underlying structure of the matrix consists of orthogonal clusters. In this article, we show that using the ratio of 𝑐𝑐𝑐 to the approximation error significantly improves the accuracy of the rank selection. We also propose a new criterion, 𝑐𝑜𝑛𝑐𝑜𝑟𝑑𝑎𝑛𝑐𝑒, which, like 𝑐𝑐𝑐, benefits from the stochastic nature of NMF; its accuracy is also improved by using its ratio-to-error form. Using real and simulated data, we show that 𝑐𝑜𝑛𝑐𝑜𝑟𝑑𝑎𝑛𝑐𝑒, combined with a CUSUM-based automatic detection algorithm applied to either its original or its ratio-to-error form, significantly outperforms 𝑐𝑐𝑐. Importantly, the new criterion works for a broader class of matrices, where the underlying clusters are not assumed to be orthogonal.
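To make the ratio-to-error idea concrete, the following Python sketch computes a cophenetic correlation coefficient from the consensus matrix of repeated random-start NMF runs and divides it by the mean relative approximation error. It is only an illustration under stated assumptions: the abstract does not specify how clusters are assigned, which linkage is used, or how the error is normalized, so the argmax assignment, average linkage, relative Frobenius error, the basic multiplicative-update solver, and the names `nmf_mu`, `ccc_and_error`, and `rank_scores` are all choices made here, not the authors' implementation. The 𝑐𝑜𝑛𝑐𝑜𝑟𝑑𝑎𝑛𝑐𝑒 criterion and the CUSUM-based detection are not sketched because the abstract does not define them.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform


def nmf_mu(X, rank, rng, n_iter=200, eps=1e-9):
    """Basic NMF via multiplicative updates (Frobenius loss); a stand-in solver."""
    n, m = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def ccc_and_error(X, rank, n_runs=10, seed=0):
    """Return (ccc, mean relative error) over n_runs random-start NMF fits."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    consensus = np.zeros((n, n))
    errors = []
    for _ in range(n_runs):
        W, H = nmf_mu(X, rank, rng)
        # Assumption: each row is assigned to its dominant component.
        labels = W.argmax(axis=1)
        consensus += labels[:, None] == labels[None, :]
        # Assumption: error is the relative Frobenius reconstruction error.
        errors.append(np.linalg.norm(X - W @ H) / np.linalg.norm(X))
    consensus /= n_runs
    # Turn co-clustering frequencies into distances and cluster hierarchically.
    dist = 1.0 - consensus
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    Z = linkage(condensed, method="average")
    ccc, _ = cophenet(Z, condensed)
    return float(ccc), float(np.mean(errors))


def rank_scores(X, ranks, n_runs=10, seed=0):
    """Map each candidate rank (>= 2) to (ccc, error, ccc / error)."""
    scores = {}
    for r in ranks:
        ccc, err = ccc_and_error(X, r, n_runs=n_runs, seed=seed)
        scores[r] = (ccc, err, ccc / err)
    return scores


# Toy usage: a noisy non-negative matrix with three row blocks.
rng = np.random.default_rng(42)
blocks = np.kron(np.eye(3), np.ones((20, 10)))  # 60 x 30, three clusters
X_demo = blocks + 0.1 * rng.random(blocks.shape)
for r, (ccc, err, ratio) in rank_scores(X_demo, [2, 3, 4, 5]).items():
    print(f"rank={r}  ccc={ccc:.3f}  error={err:.3f}  ratio={ratio:.2f}")
```

Candidate ranks start at 2 because a rank-1 factorization puts every row in the same cluster, which makes the consensus distances constant and the correlation undefined. The ratio rewards ranks whose clustering is stable across runs and whose reconstruction error is small, which is the intuition behind the ratio-to-error form described above.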