to_geotiff color_ramp computes a dask source twice (write pass + stats pass) #3597

Description

@brendancol

Describe the problem

to_geotiff(data, path, color_ramp='viridis') on a dask-backed DataArray executes the source graph twice. The streaming writer (_write_streaming) computes each chunk once to write pixels; _write_sidecars then calls _finite_stats (xrspatial/geotiff/_symbology.py), which runs dask.compute over the same source to get min/max/mean/stddev for the PAM stats and the QML ramp bounds.

Measured with a counting map_blocks layer on a 1024x1024 float64 source in 16 chunks:

| write | chunk executions |
| --- | --- |
| to_geotiff(data, path) | 16 |
| to_geotiff(data, path, color_ramp='viridis') | 32 |
| to_geotiff(data, path, color_ramp='viridis', color_ramp_range=(0, 2)) | 16 |
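For anyone who wants to reproduce the count, a counting layer along these lines should work. This is a minimal sketch, not the exact harness used for the numbers above: `count_block` and `executions` are hypothetical names, and two plain `.compute()` calls stand in for the write pass and the stats pass so the snippet runs without calling to_geotiff.

```python
import dask.array as da
import numpy as np

executions = []  # one entry per executed block (hypothetical helper state)


def count_block(block):
    """Identity function that records every block execution."""
    executions.append(block.shape)
    return block


# 1024x1024 float64 source in 16 chunks of 256x256, as in the measurement.
source = da.ones((1024, 1024), chunks=(256, 256), dtype="float64")

# Passing meta keeps dask from probing count_block while building the graph.
counted = source.map_blocks(
    count_block,
    dtype=source.dtype,
    meta=np.empty((0, 0), dtype=source.dtype),
)
executions.clear()  # defensive reset in case graph construction called it

# Stand-in for the write pass: every chunk is computed once.
counted.sum().compute(scheduler="synchronous")
single_pass = len(executions)

# Stand-in for the stats pass: a second compute re-executes the same graph.
counted.max().compute(scheduler="synchronous")
double_pass = len(executions)

print(single_pass, double_pass)
```

Swapping the two stand-in computes for a real to_geotiff call on `counted` is what distinguishes the three rows of the table.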

The docstring documents the extra pass and points at color_ramp_range as the escape hatch. But the streaming writer already materializes every pixel exactly once (row bands when a band fits the buffer budget, row x column segments on wide rasters, strip bands with tiled=False), so the stats pass can ride along with the write instead of re-reading the source.

Proposed fix

  • Give _write_streaming an optional chunk_observer callback, called with each materialized buffer right after compute. That's before the out_dtype cast and the NaN->sentinel restore, so the observer sees logical values.
  • Add a small accumulator in _symbology.py: count/min/max/mean/M2 per buffer, combined with Chan's parallel variance formula. Population stddev at the end, same as _finite_stats (ddof=0).
  • In to_geotiff, when color_ramp is set on the dask streaming path and no color_ramp_range was given, thread the accumulator through the write and hand its result to write_symbology_sidecars instead of re-running _finite_stats.
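A rough sketch of what the accumulator and the observer hook could look like. The names `RunningStats` and `write_in_bands` are hypothetical and only illustrate the shape of the change; the real `_write_streaming` signature is not reproduced here. The sketch also assumes that the stats cover finite values only, as the name `_finite_stats` suggests.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class RunningStats:
    """Hypothetical accumulator: count/min/max/mean/M2, merged per buffer."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def update(self, buffer):
        values = np.asarray(buffer, dtype="float64")
        finite = values[np.isfinite(values)]  # assumption: skip NaN and inf
        n_b = int(finite.size)
        if n_b == 0:
            return
        mean_b = float(finite.mean())
        m2_b = float(((finite - mean_b) ** 2).sum())

        # Chan et al. pairwise merge of (count, mean, M2) for two partitions.
        n_a = self.count
        n = n_a + n_b
        delta = mean_b - self.mean
        self.mean += delta * n_b / n
        self.m2 += m2_b + delta * delta * n_a * n_b / n
        self.count = n

        self.minimum = min(self.minimum, float(finite.min()))
        self.maximum = max(self.maximum, float(finite.max()))

    def result(self):
        """Return (min, max, mean, population stddev), or None if empty."""
        if self.count == 0:
            return None
        stddev = (self.m2 / self.count) ** 0.5  # ddof=0
        return self.minimum, self.maximum, self.mean, stddev


def write_in_bands(data, band_rows, chunk_observer=None):
    """Stand-in for a streaming writer: one materialized buffer per row band."""
    for start in range(0, data.shape[0], band_rows):
        buffer = data[start:start + band_rows]  # stands in for the compute
        if chunk_observer is not None:
            chunk_observer(buffer)  # sees logical values, before any cast
        # The dtype cast, sentinel restore, and pixel write would follow here.


rng = np.random.default_rng(0)
data = rng.normal(size=(64, 48))
data[3, 5] = np.nan  # a nodata pixel that the stats should skip

stats = RunningStats()
write_in_bands(data, band_rows=10, chunk_observer=stats.update)
print(stats.result())
```

Because the merge is done per buffer, the result should not depend on how the writer happens to slice the raster (row bands, row x column segments, or strips), up to floating-point rounding.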

The GPU writer and VRT tiled paths keep the current behavior: the GPU writer fully materializes anyway, and VRT writes per-tile through a different code path. The docstring should say so.

Found by /sweep-performance against the geotiff module (2026-07-01). This costs wall-clock time and IO (the source is read twice), not memory. Affects dask+numpy and dask+cupy on the streaming CPU write path.

Labels: dask (Dask backend / chunked arrays), geotiff (GeoTIFF module), performance (PR touches performance-sensitive code), severity:medium (Sweep finding: MEDIUM), sweep-performance (Found by /sweep-performance)