to_geotiff color_ramp computes a dask source twice (write pass + stats pass)

**Describe the problem**

`to_geotiff(data, path, color_ramp='viridis')` on a dask-backed DataArray executes the source graph twice. The streaming writer (`_write_streaming`) computes each chunk once to write pixels; `_write_sidecars` then calls `_finite_stats` (`xrspatial/geotiff/_symbology.py`), which runs `dask.compute` over the same source to get min/max/mean/stddev for the PAM stats and the QML ramp bounds.

Measured with a counting `map_blocks` layer on a 1024x1024 float64 source in 16 chunks:

| write | chunk executions |
|---|---|
| `to_geotiff(data, path)` | 16 |
| `to_geotiff(data, path, color_ramp='viridis')` | 32 |
| `to_geotiff(data, path, color_ramp='viridis', color_ramp_range=(0, 2))` | 16 |
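
For anyone wanting to reproduce the count, here is a minimal sketch of a counting `map_blocks` layer. It does not call `to_geotiff`. A plain `compute()` stands in for the write pass and a `sum()` reduction stands in for the stats pass, both of which are assumptions made so the snippet runs with only dask and numpy. The names `count_block` and `_executions` are hypothetical, not taken from the repro script.

```python
import threading

import dask.array as da
import numpy as np

# Hypothetical counter shared by every block execution. The lock keeps the
# increment safe under dask's threaded scheduler.
_lock = threading.Lock()
_executions = {"count": 0}


def count_block(block):
    """Pass the block through unchanged and record that it was computed."""
    with _lock:
        _executions["count"] += 1
    return block


# 1024x1024 float64 source split into 16 chunks of 256x256.
source = da.ones((1024, 1024), chunks=(256, 256), dtype="float64")

# Passing `meta` explicitly stops dask from calling count_block on a dummy
# array during graph construction, which would inflate the count.
counted = source.map_blocks(
    count_block,
    dtype=source.dtype,
    meta=np.empty((0, 0), dtype=source.dtype),
)

# Stand-in for the write pass: every chunk is materialized once.
counted.compute()
after_write = _executions["count"]

# Stand-in for the stats pass: a second, independent compute over the
# same source graph re-executes every chunk.
counted.sum().compute()
after_stats = _executions["count"]

print(after_write, after_stats)
```

Wrapping the real `data` in the same layer before calling `to_geotiff` should give the per-call numbers in the table above.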

The docstring documents the extra pass and points at `color_ramp_range` as the escape hatch. But the streaming writer already materializes every pixel exactly once (row bands when a band fits the buffer budget, row x column segments on wide rasters, strip bands with `tiled=False`), so the stats pass can ride along with the write instead of re-reading the source.

**Proposed fix**

- Give `_write_streaming` an optional `chunk_observer` callback, called with each materialized buffer right after compute. That's before the out_dtype cast and the NaN->sentinel restore, so the observer sees logical values.
- Add a small accumulator in `_symbology.py`: count/min/max/mean/M2 per buffer, combined with Chan's parallel variance formula. Population stddev at the end, same as `_finite_stats` (ddof=0).
- In `to_geotiff`, when `color_ramp` is set on the dask streaming path and no `color_ramp_range` was given, thread the accumulator through the write and hand its result to `write_symbology_sidecars` instead of re-running `_finite_stats`.
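
A rough sketch of what the accumulator and the observer hook could look like. `RunningStats`, `write_bands`, and the `result()` tuple layout are hypothetical names chosen for illustration, and filtering to finite values is an assumption about what `_finite_stats` does. The merge step is Chan's pairwise update for count/mean/M2.

```python
import numpy as np


class RunningStats:
    """Hypothetical streaming accumulator: count/min/max/mean/M2 per buffer."""

    def __init__(self):
        self.count = 0
        self.min = np.inf
        self.max = -np.inf
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, buf):
        """Fold one materialized buffer into the running statistics."""
        values = np.asarray(buf, dtype=np.float64)
        values = values[np.isfinite(values)]  # assumption: skip NaN/inf
        n_b = values.size
        if n_b == 0:
            return
        mean_b = float(values.mean())
        m2_b = float(((values - mean_b) ** 2).sum())

        # Chan's parallel combination of (n_a, mean_a, M2_a) with
        # (n_b, mean_b, M2_b).
        n_a = self.count
        n = n_a + n_b
        delta = mean_b - self.mean
        self.mean += delta * n_b / n
        self.m2 += m2_b + delta * delta * n_a * n_b / n
        self.count = n

        self.min = min(self.min, float(values.min()))
        self.max = max(self.max, float(values.max()))

    def result(self):
        """Return (min, max, mean, population stddev), or None if empty."""
        if self.count == 0:
            return None
        stddev = float(np.sqrt(self.m2 / self.count))  # ddof=0
        return self.min, self.max, self.mean, stddev


def write_bands(bands, chunk_observer=None):
    """Toy stand-in for the streaming writer: one pass over the buffers."""
    for band in bands:
        if chunk_observer is not None:
            chunk_observer(band)  # sees logical values, before any cast
        # encoding and writing the band would happen here


# Usage: stream 8 row bands of a 64x64 array through the observer.
rng = np.random.default_rng(0)
data = rng.normal(size=(64, 64))
data[3, 5] = np.nan
stats = RunningStats()
write_bands(np.array_split(data, 8), chunk_observer=stats.update)
print(stats.result())
```

Because the merge is order-independent up to floating-point rounding, the same accumulator would work whether the writer hands over row bands, row x column segments, or strips.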

The GPU writer and VRT tiled paths keep the current behavior: the GPU writer fully materializes anyway, and VRT writes per-tile through a different code path. The docstring should say so.

Found by /sweep-performance against the geotiff module (2026-07-01). This costs wall-clock time and IO (the source is read twice), not memory. Affects dask+numpy and dask+cupy on the streaming CPU write path.


to_geotiff color_ramp computes a dask source twice (write pass + stats pass) #3597
