_generate_sample_indices in xrspatial/classify.py has a branch for large arrays that is meant to keep host memory proportional to num_sample. It uses np.random.RandomState.choice(num_data, size=num_sample, replace=False), which builds a full arange(num_data) permutation internally. So peak driver-host memory actually scales with num_data, not num_sample, and on a large dask array the sample-index step OOMs the driver before a single chunk is read.
The docstring says this branch is "memory-efficient ... O(num_sample) rather than O(num_data)", which is the opposite of what it does.
The sampler backs the dask and dask+cupy paths of natural_breaks, maximum_breaks, quantile, percentiles, and box_plot. Only the num_data > 10_000_000 branch is affected. The small-array branch (<= 10M) builds a full linspace on purpose, to stay reproducible against the numpy backend, and is bounded.
Reproduction:
import numpy as np, tracemalloc
from xrspatial.classify import _generate_sample_indices
tracemalloc.start()
_generate_sample_indices(20_000_000, 20_000) # 20k sample from a 20M population
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(peak / 1024**2, "MB") # ~160 MB, grows linearly with the population
At 30 TB scale (~3.75e12 float64 elements) the same call tries to allocate a multi-terabyte index array on the driver host.
Fix: use np.random.default_rng(seed).choice(..., replace=False) on the large branch. NumPy's Generator.choice uses Floyd's algorithm and stays O(num_sample) when num_sample is much smaller than num_data. It is still seeded and deterministic, and the small-array reproducibility branch does not change. Measured peak for the reproduction above drops from ~160 MB to under 1 MB.
_generate_sample_indicesinxrspatial/classify.pyhas a branch for large arrays that is meant to keep host memory proportional tonum_sample. It usesnp.random.RandomState.choice(num_data, size=num_sample, replace=False), which builds a fullarange(num_data)permutation internally. So peak driver-host memory actually scales withnum_data, notnum_sample, and on a large dask array the sample-index step OOMs the driver before a single chunk is read.The docstring says this branch is "memory-efficient ... O(num_sample) rather than O(num_data)", which is the opposite of what it does.
The sampler backs the dask and dask+cupy paths of
natural_breaks,maximum_breaks,quantile,percentiles, andbox_plot. Only thenum_data > 10_000_000branch is affected. The small-array branch (<= 10M) builds a fulllinspaceon purpose, to stay reproducible against the numpy backend, and is bounded.Reproduction:
At 30 TB scale (~3.75e12 float64 elements) the same call tries to allocate a multi-terabyte index array on the driver host.
Fix: use
np.random.default_rng(seed).choice(..., replace=False)on the large branch. NumPy'sGenerator.choiceuses Floyd's algorithm and stays O(num_sample) whennum_sampleis much smaller thannum_data. It is still seeded and deterministic, and the small-array reproducibility branch does not change. Measured peak for the reproduction above drops from ~160 MB to under 1 MB.