classify: large-array sample-index path allocates O(num_data) on the driver and OOMs at scale #3412

@brendancol

_generate_sample_indices in xrspatial/classify.py has a branch for large arrays that is meant to keep host memory proportional to num_sample. It uses np.random.RandomState.choice(num_data, size=num_sample, replace=False), which builds a full arange(num_data) permutation internally. So peak driver-host memory actually scales with num_data, not num_sample, and on a large dask array the sample-index step OOMs the driver before a single chunk is read.

The docstring says this branch is "memory-efficient ... O(num_sample) rather than O(num_data)", which is the opposite of what it does.
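The permutation behaviour can be checked directly. The sketch below assumes only NumPy and the standard library; the population and sample sizes are illustrative values, kept smaller than the 10M threshold so the check runs quickly. It compares the legacy `choice` call against an explicit `permutation(n)[:k]` with the same seed and records the peak traced allocation.

```python
import tracemalloc

import numpy as np

population = 2_000_000  # illustrative size, not the real threshold
sample = 1_000
seed = 42

# Legacy API: sample without replacement from an integer population.
tracemalloc.start()
legacy = np.random.RandomState(seed).choice(population, size=sample, replace=False)
_, legacy_peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Same seed, but with the full permutation written out and then truncated.
explicit = np.random.RandomState(seed).permutation(population)[:sample]

same_indices = bool(np.array_equal(legacy, explicit))
print("identical to permutation(n)[:k]:", same_indices)
print(f"peak traced memory: {legacy_peak / 1e6:.1f} MB for a {sample}-element sample")
```

If the two index arrays match and the peak is on the order of the population size rather than the sample size, the legacy call is paying for the whole permutation.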

The sampler backs the dask and dask+cupy paths of natural_breaks, maximum_breaks, quantile, percentiles, and box_plot. Only the num_data > 10_000_000 branch is affected. The small-array branch (<= 10M) builds a full linspace on purpose, to stay reproducible against the numpy backend, and is bounded.

Reproduction:

import tracemalloc
from xrspatial.classify import _generate_sample_indices

tracemalloc.start()
_generate_sample_indices(20_000_000, 20_000)  # 20k sample from a 20M population
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(peak / 1024**2, "MB")  # ~160 MB, grows linearly with the population

At 30 TB scale (~3.75e12 float64 elements) the same call tries to allocate one index per element on the driver host, which is about 30 TB if the indices are int64.
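The arithmetic behind that estimate, as a back-of-the-envelope sketch. It assumes decimal terabytes, 8-byte int64 indices, and the 20,000-element sample size from the reproduction; none of these are read from the xrspatial code.

```python
BYTES_PER_FLOAT64 = 8
BYTES_PER_INT64 = 8  # assumed index dtype

data_bytes = 30 * 10**12  # 30 TB of float64 data
num_data = data_bytes // BYTES_PER_FLOAT64

# A full permutation holds one index per element of the population.
index_bytes = num_data * BYTES_PER_INT64

# An O(num_sample) sampler only needs to hold the sample itself.
num_sample = 20_000
sample_bytes = num_sample * BYTES_PER_INT64

print(f"elements:         {num_data:.3e}")
print(f"full index array: {index_bytes / 1e12:.0f} TB")
print(f"sample only:      {sample_bytes / 1e3:.0f} kB")
```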

Fix: use np.random.default_rng(seed).choice(..., replace=False) on the large branch. NumPy's Generator.choice uses Floyd's algorithm and stays O(num_sample) when num_sample is much smaller than num_data. It is still seeded and deterministic, and the small-array reproducibility branch does not change. Measured peak for the reproduction above drops from ~160 MB to under 1 MB.
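A minimal sketch of why the proposed fix stays small. `floyd_sample` is an illustrative pure-Python version of Floyd's algorithm, not NumPy's implementation, and `sample_indices_large` is a hypothetical stand-in for the large branch of `_generate_sample_indices`, with an assumed signature and default seed.

```python
import numpy as np


def floyd_sample(num_data, num_sample, seed):
    """Illustrative Floyd sampling: only num_sample indices are ever stored."""
    rng = np.random.default_rng(seed)
    chosen = set()
    # Walk the last num_sample positions of the population.
    for j in range(num_data - num_sample, num_data):
        t = int(rng.integers(0, j + 1))  # uniform on [0, j]
        # If t is already taken, j cannot be (all earlier picks are < j).
        chosen.add(j if t in chosen else t)
    return np.fromiter(sorted(chosen), dtype=np.int64, count=num_sample)


def sample_indices_large(num_data, num_sample, seed=0):
    """Hypothetical large-branch helper built on Generator.choice."""
    rng = np.random.default_rng(seed)
    return rng.choice(num_data, size=num_sample, replace=False)


sketch = floyd_sample(20_000_000, 20_000, seed=0)
first = sample_indices_large(20_000_000, 20_000, seed=0)
second = sample_indices_large(20_000_000, 20_000, seed=0)

print(len(np.unique(sketch)), len(np.unique(first)))
print("deterministic for a fixed seed:", bool(np.array_equal(first, second)))
```

One thing to keep in mind: `default_rng` uses a different bit stream from `RandomState`, so for the same seed the large branch would return different indices than before the change, while remaining reproducible from run to run.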
