[Architecture/Feature - Mid Level] Add progress callbacks for large cleaning jobs #22

@JohnnyWilson-Portfolio

The Problem: FreshData can clean large DataFrames, but users currently get no structured progress signal while the pipeline runs. This makes CLI tools, notebooks, and batch jobs feel opaque on bigger datasets.

The Solution: Add an optional progress callback that emits lightweight events before or after each major pipeline step without adding a required progress-bar dependency.

Acceptance Criteria:

  • Add a progress_callback option to CleanConfig.
  • fd.clean(df, progress_callback=callback) works through the normal options flow.
  • run_pipeline() emits events for major steps like column_names, strings, dtypes, duplicates, engine_missing, and engine_outliers.
  • Callback events include at least step, status, rows, and columns.
  • No progress output happens when no callback is provided.
  • Add tests confirming event order and that Cleaner(...).clean(...) also supports it.
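One way the criteria above could fit together is sketched below. It is not FreshData's actual code: `CleanConfigSketch`, `emit`, `run_pipeline_sketch`, and `TinyFrame` are hypothetical stand-ins, and the `"start"` / `"done"` status values are assumptions, since the issue only says that events carry a `status` field. The sketch reads nothing but `.shape`, so the same pattern should apply to a Pandas DataFrame.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

ProgressEvent = Dict[str, Any]
ProgressCallback = Callable[[ProgressEvent], None]


@dataclass
class CleanConfigSketch:
    """Hypothetical stand-in for CleanConfig; only the new option is shown."""

    progress_callback: Optional[ProgressCallback] = None

    def __post_init__(self) -> None:
        # Reject non-callables early so the failure is not deferred to mid-pipeline.
        cb = self.progress_callback
        if cb is not None and not callable(cb):
            raise TypeError("progress_callback must be callable or None")


def emit(config: CleanConfigSketch, step: str, status: str, df: Any) -> None:
    """Send one event, or do nothing at all when no callback is configured."""
    if config.progress_callback is None:
        return
    rows, columns = df.shape
    config.progress_callback(
        {"step": step, "status": status, "rows": rows, "columns": columns}
    )


def run_pipeline_sketch(
    df: Any,
    config: CleanConfigSketch,
    steps: List[Tuple[str, Callable[[Any], Any]]],
) -> Any:
    """Run each named step, emitting an event before and after it."""
    for name, func in steps:
        emit(config, name, "start", df)  # "start"/"done" are assumed status names
        df = func(df)
        emit(config, name, "done", df)
    return df


class TinyFrame:
    """Minimal object with a .shape attribute, used instead of a real DataFrame."""

    def __init__(self, rows: int, columns: int) -> None:
        self.shape = (rows, columns)


events: List[ProgressEvent] = []
config = CleanConfigSketch(progress_callback=events.append)
steps = [
    ("column_names", lambda f: f),  # leaves the shape unchanged
    ("duplicates", lambda f: TinyFrame(f.shape[0] - 1, f.shape[1])),  # drops one row
]
result = run_pipeline_sketch(TinyFrame(3, 2), config, steps)
```

Keeping the `None` check inside a single helper means the "no callback, no output" criterion is enforced in one place instead of at every step.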

Skills Required: Python, PyTest, API design, Pandas

How to Get Started:

  1. Open src/freshdata/config.py and study how options are validated.
  2. Open src/freshdata/cleaner.py and identify each step in run_pipeline().
  3. Add a tiny callback test that appends events to a list and asserts the expected steps appear.
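The test from step 3 might look roughly like the sketch below. Because `progress_callback` does not exist in FreshData yet, it runs against `fake_clean`, a hypothetical stand-in that emits one event per step named in the acceptance criteria; in the real test, `fake_clean` would presumably be replaced by `fd.clean` and the input by a DataFrame. The single `"done"` status per step is an assumption.

```python
EXPECTED_STEPS = [
    "column_names",
    "strings",
    "dtypes",
    "duplicates",
    "engine_missing",
    "engine_outliers",
]
REQUIRED_KEYS = {"step", "status", "rows", "columns"}


def fake_clean(records, progress_callback=None):
    """Hypothetical stand-in for fd.clean: reports each step, changes nothing."""
    for step in EXPECTED_STEPS:
        if progress_callback is not None:
            progress_callback(
                {
                    "step": step,
                    "status": "done",  # assumed status value
                    "rows": len(records),
                    "columns": len(records[0]) if records else 0,
                }
            )
    return records


def test_progress_callback_reports_steps_in_order():
    events = []
    fake_clean([{"a": 1, "b": 2}], progress_callback=events.append)
    # The steps appear once each, in pipeline order.
    assert [event["step"] for event in events] == EXPECTED_STEPS
    # Every event carries at least the four required fields.
    assert all(REQUIRED_KEYS <= event.keys() for event in events)


def test_clean_works_without_callback():
    records = [{"a": 1, "b": 2}]
    # Omitting the callback must not raise and must still return the data.
    assert fake_clean(records) == records
```

Passing `events.append` directly as the callback keeps the test free of helper classes, and the same two tests can be repeated against `Cleaner(...).clean(...)` to cover the last acceptance criterion.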

Labels: help wanted
