Bytes codec does not roundtrip cleanly #4073

@clbarnes

Zarr version

v3.1.5

Numcodecs version

0.16.5

Python Version

3.11

Operating System

Mac

Installation

uv add zarr

Description

I maintain zarr-python-n5, which implements the n5_default codec. This codec wraps a number of internal codecs (specifically a transpose codec, a bytes codec, and optionally a bytes-to-bytes codec).

The wrapped bytes codec MUST be big-endian (although None is accepted for single-byte types). My platform is little-endian.

As part of the n5_default codec's evolve_from_array_spec method, I call evolve_from_array_spec on each of its constituent codecs, because that seemed like the right thing to do. For single-byte types, this erases the endianness of the wrapped bytes codec, so it serialises as {"name": "bytes"}. When I deserialise that, BytesCodec.from_dict instantiates the codec with BytesCodec(**{}), and when endian is not given it defaults to the platform's endianness ("little", for me). So I can instantiate an explicitly big-endian codec and, once it has roundtripped, get a little-endian codec back, which I found surprising (and which also breaks my n5_default codec validation).
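The mechanism can be modelled without zarr at all. The sketch below is a stand-in, not zarr-python's implementation: ToyBytesCodec and its evolve method are hypothetical names that only mimic the behaviour described above (endian dropped for single-byte types, platform byte order used when endian is omitted).

```python
from __future__ import annotations

import sys
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyBytesCodec:
    """Hypothetical stand-in for a bytes codec; not zarr-python's class."""

    # Mimics the behaviour described above: omitted endian -> platform byte order.
    endian: str | None = sys.byteorder

    def evolve(self, itemsize: int) -> ToyBytesCodec:
        # Endianness is meaningless for single-byte types, so it is dropped.
        return replace(self, endian=None) if itemsize == 1 else self

    def to_dict(self) -> dict:
        if self.endian is None:
            return {"name": "bytes"}
        return {"name": "bytes", "configuration": {"endian": self.endian}}

    @classmethod
    def from_dict(cls, data: dict) -> ToyBytesCodec:
        # With no "configuration" key this is cls(**{}), so the default applies.
        return cls(**data.get("configuration", {}))


original = ToyBytesCodec(endian="big")
evolved = original.evolve(itemsize=1)
roundtripped = ToyBytesCodec.from_dict(evolved.to_dict())
```

In this model, evolved.endian is None but roundtripped.endian is the platform byte order, so the serialise/deserialise pair is not an identity on the evolved codec.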

I understand that I could just not evolve the wrapped codecs in my case, or not validate that the codec is big-or-none. However, IMO it is surprising and unnecessary for the bytes codec to default to the system endianness when the no-config form is passed to from_dict. Instead, the no-config form should deserialise to endian=None. If None is not valid in that case, that is an error on the part of the writer, and zarr-python shouldn't fabricate possibly-incorrect metadata to account for it.
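As a rough illustration of the behaviour I am asking for, here is a minimal sketch. Both function names are hypothetical and nothing here is zarr-python API; it only shows "no configuration means None" next to the big-or-none check that n5_default applies.

```python
from __future__ import annotations


def endian_from_dict(data: dict) -> str | None:
    """Hypothetical helper: read endian from codec metadata without inventing a value."""
    configuration = data.get("configuration") or {}
    return configuration.get("endian")


def is_big_or_none(endian: str | None) -> bool:
    """Hypothetical version of the check an n5_default-style wrapper would apply."""
    return endian in ("big", None)


no_config = endian_from_dict({"name": "bytes"})
explicit = endian_from_dict({"name": "bytes", "configuration": {"endian": "big"}})
```

With this reading, the no-config form yields None regardless of platform, so a wrapper that accepts big-or-none sees the same value before and after a roundtrip.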

Steps to reproduce

# /// script
# requires-python = ">=3.12"
# dependencies = [
#   "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# ]
# ///
import sys

from zarr.codecs import BytesCodec, Endian
from zarr.core.array_spec import ArraySpec, ArrayConfig
from zarr.buffer import default_buffer_prototype

assert sys.byteorder == "little"

original = BytesCodec(endian=Endian("big"))
evolved = original.evolve_from_array_spec(
    ArraySpec((2, 2), "uint8", 0, ArrayConfig("C", False), default_buffer_prototype())
)
serialised = evolved.to_dict()
assert serialised.get("configuration") is None
deserialised = BytesCodec.from_dict(serialised)

# System byteorder, not the byteorder I explicitly gave
assert deserialised.endian == Endian("little")

# I want this to fail, but it doesn't.
assert original.endian != deserialised.endian

Additional output

No response

Labels

bug (Potential issues with the zarr-python library)