classify: large-array sample-index path allocates O(num_data) on the driver and OOMs at scale

`_generate_sample_indices` in `xrspatial/classify.py` has a branch for large arrays that is meant to keep host memory proportional to `num_sample`. It uses `np.random.RandomState.choice(num_data, size=num_sample, replace=False)`, which builds a full `arange(num_data)` permutation internally. So peak driver-host memory actually scales with `num_data`, not `num_sample`, and on a large dask array the sample-index step OOMs the driver before a single chunk is read.

The docstring says this branch is "memory-efficient ... O(num_sample) rather than O(num_data)", which is the opposite of what it does.

The sampler backs the dask and dask+cupy paths of `natural_breaks`, `maximum_breaks`, `quantile`, `percentiles`, and `box_plot`. Only the `num_data > 10_000_000` branch is affected. The small-array branch (<= 10M) builds a full `linspace` on purpose, to stay reproducible against the numpy backend, and is bounded.

Reproduction:

```python
import tracemalloc

from xrspatial.classify import _generate_sample_indices

tracemalloc.start()
_generate_sample_indices(20_000_000, 20_000)  # 20k sample from a 20M population
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(peak / 1e6, "MB")  # ~160 MB (20M int64 indices), grows linearly with the population
```

At 30 TB scale (~3.75e12 float64 elements) the same call tries to allocate an int64 index array of roughly 30 TB on the driver host, as large as the dataset itself.
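The sizes above follow from simple arithmetic. A quick sketch, assuming the full index array is int64 (8 bytes per element); `index_array_bytes` is a hypothetical helper for illustration, not part of xrspatial:

```python
INT64_BYTES = 8  # assumption: indices are stored as int64


def index_array_bytes(num_elements):
    """Bytes needed for an int64 index array with num_elements entries."""
    return num_elements * INT64_BYTES


# Full-population index array for the reproduction (20M elements): 160 MB
print(index_array_bytes(20_000_000) / 1e6, "MB")

# What an O(num_sample) path would need for the same call (20k samples): 0.16 MB
print(index_array_bytes(20_000) / 1e6, "MB")

# Full-population index array at 30 TB scale (3.75e12 elements): 30 TB
print(index_array_bytes(3_750_000_000_000) / 1e12, "TB")
```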

Fix: use `np.random.default_rng(seed).choice(..., replace=False)` on the large branch. NumPy's `Generator.choice` uses Floyd's algorithm and stays O(num_sample) when `num_sample` is much smaller than `num_data`. It is still seeded and deterministic, and the small-array reproducibility branch does not change. Measured peak for the reproduction above drops from ~160 MB to under 1 MB.
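A minimal sketch of the comparison, not the actual patch. The names `sample_legacy`, `sample_generator`, and `peak_bytes` and the seed value are assumptions for illustration, and the real `_generate_sample_indices` may do more (for example, post-processing the indices). It uses a 2M population so it runs quickly. Note that for the same seed the two samplers return different indices, so results on the large branch would change once, then stay deterministic.

```python
import tracemalloc

import numpy as np

SEED = 0  # hypothetical seed; the library's own seed is not shown here


def sample_legacy(num_data, num_sample, seed=SEED):
    # Legacy path: RandomState.choice without replacement permutes the
    # whole population, so memory scales with num_data.
    rs = np.random.RandomState(seed)
    return rs.choice(num_data, size=num_sample, replace=False)


def sample_generator(num_data, num_sample, seed=SEED):
    # Proposed path: Generator.choice avoids materialising the population
    # when num_sample is much smaller than num_data.
    rng = np.random.default_rng(seed)
    return rng.choice(num_data, size=num_sample, replace=False)


def peak_bytes(fn, *args):
    """Return the peak traced allocation (bytes) while running fn(*args)."""
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


num_data, num_sample = 2_000_000, 2_000
print("legacy peak:   ", peak_bytes(sample_legacy, num_data, num_sample) / 1e6, "MB")
print("generator peak:", peak_bytes(sample_generator, num_data, num_sample) / 1e6, "MB")
```

Both functions return `num_sample` unique indices in `[0, num_data)`; only the peak allocation differs.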


classify: large-array sample-index path allocates O(num_data) on the driver and OOMs at scale #3412
