Garbage collection deletes external-store files belonging to existing rows (custom codecs) #1469

Description

@esutlie

What we're seeing

A custom codec (a SchemaCodec subclass) was built to store xarray datasets as NetCDF files in an external store configured with protocol: file, one .nc file per row. Two problems show up around deletion and garbage collection:

  1. Deleting a row does not remove its external file. The .nc file is left orphaned on disk.
  2. Running dj.gc.collect() to clean up those orphans also removes files that are still referenced by existing rows. dj.gc.scan() reports files belonging to live rows as orphaned, so collect() deletes them along with the genuinely orphaned ones, leaving existing rows pointing at files that no longer exist.

The net effect is that the built-in dj.gc tooling offers no safe cleanup path for a custom codec's external store. Skipping GC leaves orphaned files behind on every delete, and running GC deletes files that are still in use.

Sequence that triggers it

  • Insert rows into a table with a custom <codec@store> column (each insert writes one file to the store).
  • Delete a subset of the rows.
  • Run dj.gc.collect(schema, store_name=..., dry_run=False) to reclaim the orphaned files.
  • The files for the rows that were not deleted are removed too.
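
For reference, the behavior expected from the sequence above can be sketched without DataJoint at all. This is a minimal stand-alone model, not dj.gc's actual implementation: the helper names (find_orphans, collect, demo), the row_N.nc file naming, and the plain dict standing in for the table are all hypothetical. It treats a file as orphaned only when no live row references it, so deleting two of four rows should reclaim exactly two files.

```python
import tempfile
from pathlib import Path


def find_orphans(store_root: Path, referenced: set[str]) -> set[str]:
    """Return store-relative paths of .nc files that no live row references."""
    on_disk = {
        p.relative_to(store_root).as_posix() for p in store_root.rglob("*.nc")
    }
    return on_disk - referenced


def collect(store_root: Path, referenced: set[str], dry_run: bool = True) -> set[str]:
    """Delete orphaned files unless dry_run is set; return what was (or would be) removed."""
    orphans = find_orphans(store_root, referenced)
    if not dry_run:
        for rel in orphans:
            (store_root / rel).unlink()
    return orphans


def demo() -> tuple[set[str], set[str]]:
    """Mimic insert -> partial delete -> collect on a temporary file store."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # "Insert" four rows; each row owns one file in the store (hypothetical naming).
        rows = {i: f"row_{i}.nc" for i in range(4)}
        for rel in rows.values():
            (root / rel).write_bytes(b"")
        # "Delete" rows 0 and 1; as in the report, their files stay on disk.
        del rows[0], rows[1]
        # Collect using only the paths that the remaining rows still reference.
        removed = collect(root, set(rows.values()), dry_run=False)
        remaining = {p.name for p in root.iterdir()}
        return removed, remaining
```

In this model demo() removes only the files of the two deleted rows and leaves the other two in place. The reported behavior differs in that the files of the surviving rows are also treated as orphaned.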

Environment
