butina

nvmolkit.clustering.butina(
distance_matrix: AsyncGpuResult | Tensor,
cutoff: float,
neighborlist_max_size: int = 64,
return_centroids: bool = False,
stream: Stream | None = None,
) → AsyncGpuResult | tuple[AsyncGpuResult, AsyncGpuResult]

Perform Butina clustering on a distance matrix.

The Butina algorithm is a deterministic clustering method that groups items based on a distance threshold. It iterates as follows:

  1. Find the item with the most neighbors within the cutoff distance.
  2. Form a cluster from that item and all of its neighbors.
  3. Remove the clustered items from consideration.
  4. Repeat until all items are clustered.
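The four steps can be illustrated with a small CPU reference in NumPy. This is a minimal sketch for reasoning about the algorithm, not nvmolkit's GPU implementation: the name `butina_reference` is hypothetical, and the tie-breaking rule (lowest index wins) is an assumption that may differ from the library's behavior.

```python
import numpy as np

def butina_reference(dist, cutoff):
    """Illustrative CPU sketch of Butina clustering (hypothetical helper).

    Returns (labels, centroids): labels[i] is the cluster ID of item i,
    centroids[c] is the index of the item that seeded cluster c.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Items are neighbors when their distance is strictly below the cutoff.
    neighbors = dist < cutoff
    unassigned = np.ones(n, dtype=bool)
    labels = np.full(n, -1, dtype=int)
    centroids = []
    while unassigned.any():
        # Step 1: count each item's neighbors among the unassigned items.
        counts = (neighbors & unassigned[None, :]).sum(axis=1)
        counts[~unassigned] = -1  # already-clustered items cannot seed a cluster
        centre = int(np.argmax(counts))  # assumption: ties go to the lowest index
        # Step 2: the cluster is the seed plus its unassigned neighbors.
        members = neighbors[centre] & unassigned
        members[centre] = True
        labels[members] = len(centroids)
        centroids.append(centre)
        # Step 3: remove the clustered items; step 4 is the loop itself.
        unassigned &= ~members
    return labels, np.array(centroids, dtype=int)

# Five points on a line: three near 0 and two near 5.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1])
d = np.abs(x[:, None] - x[None, :])
labels, centroids = butina_reference(d, cutoff=0.15)
print(labels)     # [0 0 0 1 1]
print(centroids)  # [1 3]
```

Because each round picks the item with the most remaining neighbors, cluster sizes in this sketch never increase from one round to the next, so cluster 0 is the largest.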

Parameters:
  • distance_matrix – Square distance matrix of shape (N, N) where N is the number of items. Can be an AsyncGpuResult or torch.Tensor on GPU.

  • cutoff – Distance threshold for clustering. Items are neighbors if their distance is less than this cutoff.

  • neighborlist_max_size – Maximum size of the neighborlist used for small cluster optimization. Must be 8, 16, 24, 32, 64, or 128. Larger values allow parallel processing of larger clusters but use more shared memory.

  • return_centroids – Whether to return centroid indices for each cluster.

  • stream – CUDA stream to use. If None, uses the current stream.

Returns:

AsyncGpuResult of shape (N,) with cluster IDs (cluster 0 is the largest) when return_centroids is False. When return_centroids is True, returns a tuple (clusters, centroids) where centroids is an AsyncGpuResult of shape (num_clusters,) containing the centroid index for each cluster ID.
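To show how the two outputs relate, the sketch below groups item indices by cluster ID and pairs each group with its centroid. It assumes both results have already been copied to host arrays; the helper name `group_by_cluster` is hypothetical and not part of nvmolkit.

```python
import numpy as np

def group_by_cluster(cluster_ids, centroids):
    """Map each cluster ID to its centroid index and member indices.

    Hypothetical helper: cluster_ids has shape (N,), centroids has shape
    (num_clusters,), both as host-side arrays or lists.
    """
    cluster_ids = np.asarray(cluster_ids)
    centroids = np.asarray(centroids)
    groups = {}
    for cid, centre in enumerate(centroids):
        # Items whose cluster ID equals cid belong to this cluster.
        members = np.flatnonzero(cluster_ids == cid)
        groups[cid] = {"centroid": int(centre), "members": members.tolist()}
    return groups

groups = group_by_cluster([0, 0, 0, 1, 1], [1, 3])
print(groups[0])  # {'centroid': 1, 'members': [0, 1, 2]}
print(groups[1])  # {'centroid': 3, 'members': [3, 4]}
```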

Note

The distance matrix should be symmetric and have zeros on the diagonal.
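One way to check these preconditions before clustering is a host-side validation such as the sketch below. The helper `check_distance_matrix` and its tolerance default are assumptions for illustration; nvmolkit is not stated to perform or require this check.

```python
import numpy as np

def check_distance_matrix(dist, atol=1e-6):
    """Raise ValueError unless dist is square, symmetric, and zero on the diagonal.

    Hypothetical helper operating on a host-side array.
    """
    dist = np.asarray(dist)
    if dist.ndim != 2 or dist.shape[0] != dist.shape[1]:
        raise ValueError(f"expected a square (N, N) matrix, got shape {dist.shape}")
    if not np.allclose(dist, dist.T, atol=atol):
        raise ValueError("distance matrix is not symmetric")
    if not np.allclose(np.diag(dist), 0.0, atol=atol):
        raise ValueError("distance matrix has nonzero diagonal entries")
    return True

check_distance_matrix(np.array([[0.0, 0.3], [0.3, 0.0]]))  # passes
```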