Following charts demonstrate dependence of clustering speed (inchlib_clust) and rendering (InCHlib) on the data size. The test data are randomly generated integers between 0 and 1000. We used the Euclidean distance and Ward’s linkage for clustering.
Clustering time increases quadratically with the increasing number of data points. This corresponds to the O(N²) complexity of the Ward’s linkage implementation in the fastcluster library. Apart from the time increase, memory needs also grow significantly with the number of data points. Clustering of 20,000 data points requires around 2 GB RAM.
Clustering time increases with the increasing number of features only modestly. The dependence between the number of features and the clustering time is linear. If the total number of data is constant and data are clustered by rows, sets with less rows (e.g., 50 x 1000) are processed much faster than sets with higher number of rows (e.g., 1000 x 50). In the clustering, the time limited step is the formation of the distance matrix. Dimension of the distance matrix is given by the number of rows, and its size is constant with the increasing number of data features.