U-statistical inference for hierarchical clustering

Citation:
Valk M, Cybis GB. 2021. U-statistical inference for hierarchical clustering. Journal of Computational and Graphical Statistics. 30(1)

Abstract:

Clustering methods are valuable tools for the identification of patterns in high-dimensional data, with applications in many scientific fields. However, quantifying uncertainty in clustering is a challenging problem, particularly when dealing with High Dimension Low Sample Size (HDLSS) data. We develop a U-statistics-based clustering approach that assesses statistical significance in clustering and is specifically tailored to HDLSS scenarios. These non-parametric methods rely on very few assumptions about the data, and thus can be applied to a wide range of datasets for which the Euclidean distance captures relevant features. Our main result is the development of a hierarchical significance clustering method. In order to do so, we first introduce an extension of a relevant U-statistic and develop its asymptotic theory. Additionally, as a preliminary step, we propose a binary non-nested significance clustering method and show its optimality in terms of expected values. Our approach is tested through multiple simulations and found to have more statistical power than competing alternatives in all scenarios considered. The methods are further showcased in three applications ranging from genetics to image recognition problems.
