Discuss: surfacing GDAL config options through the gdalxarray API #25

@mdsumner

GDAL behaviour for /vsis3/, /vsicurl/, Icechunk, and most cloud-native workflows is governed by config options such as AWS_NO_SIGN_REQUEST, AWS_REGION, GDAL_NUM_THREADS, and CPL_VSIL_CURL_USE_HEAD. Currently users set these via gdal.SetConfigOption(...) before calling gdalxarray, or via environment variables. Both work, but neither is ideal:

  • gdal.SetConfigOption has global, process-wide scope and persists across calls. Easy to forget; surprising when stale.
  • Environment variables are out-of-band and don't appear in code, making notebooks and scripts non-reproducible without external context.

Worked example of current usage:

from osgeo import gdal
gdal.SetConfigOption("AWS_NO_SIGN_REQUEST", "YES")
gdal.SetConfigOption("AWS_REGION", "us-west-2")

import gdalxarray
from gdalxarray import GDALBackendEntrypoint
backend = GDALBackendEntrypoint()
xds = backend.open_dataset(
    "/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk",
    multidim=True,
)

The auth/region setup is structurally separate from the open call, which makes the code less self-documenting than it could be.

Options to discuss

1. Pass-through kwarg on open_dataset

xds = backend.open_dataset(
    url,
    multidim=True,
    gdal_config={"AWS_NO_SIGN_REQUEST": "YES", "AWS_REGION": "us-west-2"},
)

Scoped via gdal.config_options(...) context manager (GDAL ≥ 3.5) so settings don't leak globally. Self-documenting in the call. The downside is API surface — adds another kwarg to maintain.

2. Document the env-var pattern as the recommended path

No code change in gdalxarray. README has a "common config" section showing env vars for AWS anonymous, NCI THREDDS, Pawsey, GCS, etc. Lowest-friction implementation but pushes the documentation burden onto users.

3. Module-level helper for common profiles

import gdalxarray
gdalxarray.config.aws_anonymous(region="us-west-2")
gdalxarray.config.nci_friendly()   # GDAL_NUM_THREADS=4, etc.
xds = backend.open_dataset(url, multidim=True)

Curated presets for the common cases observed during 0.2.0 development (NCI THREDDS rate-limit recipe, anonymous S3, anonymous GCS, etc.). Convenient, but arguably overreach: gdalxarray would be taking opinions on what defaults users want.
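A lighter variant of this option, sketched here purely as an assumption, is for presets to return plain dicts instead of mutating global state, so they could feed a per-call gdal_config argument as in option 1. The function names mirror the ones above but are hypothetical, as is `merge_profiles`; the only option values used are the ones already mentioned in this issue.

```python
def aws_anonymous(region):
    """Hypothetical preset: anonymous S3 access in a given region."""
    return {"AWS_NO_SIGN_REQUEST": "YES", "AWS_REGION": region}


def nci_friendly():
    """Hypothetical preset; only GDAL_NUM_THREADS=4 comes from the issue text."""
    return {"GDAL_NUM_THREADS": "4"}


def merge_profiles(*profiles):
    """Combine presets left to right; later entries win on key conflicts."""
    merged = {}
    for profile in profiles:
        merged.update(profile)
    return merged


config = merge_profiles(aws_anonymous("us-west-2"), nci_friendly())
```

Presets as data keep the package closer to thin wrappers: users can print, edit, or ignore the dict, and nothing changes globally until they pass it somewhere.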

4. No-op — users use gdal.SetConfigOption or env vars as they do today

Document the patterns in the README, point at the GDAL config docs, don't add API.

Considerations

  • The hypertidy philosophy leans toward primitives and thin wrappers — argues against (3).
  • The observation that remote ops are the primary use case argues for (1), which keeps configuration inline with the open call.
  • Context-managed scoping (whichever option) avoids the "stale global" pitfall.
  • Related: # covers which specific options matter for performance and reliability.

Questions

  • Is per-call gdal_config worth the API surface, or is gdal.SetConfigOption close enough?
  • If implementing, should it be opt-in scoped (context manager) or session-global?
  • Should default profiles (anonymous AWS, NCI-friendly) be shipped, or only documented?
