The Problem: FreshData can clean large DataFrames, but users currently get no structured progress signal while the pipeline runs. This makes CLI tools, notebooks, and batch jobs feel opaque on bigger datasets.
The Solution: Add an optional progress callback that emits lightweight events before or after each major pipeline step without adding a required progress-bar dependency.
Acceptance Criteria:
- Add a progress_callback option to CleanConfig.
- fd.clean(df, progress_callback=callback) works through the normal options flow.
- run_pipeline() emits events for major steps like column_names, strings, dtypes, duplicates, engine_missing, and engine_outliers.
- Callback events include at least step, status, rows, and columns.
- No progress output happens when no callback is provided.
- Add tests confirming event order and that Cleaner(...).clean(...) also supports it.
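
As a rough illustration of the criteria above, here is a minimal sketch of how an optional callback could be threaded through a sequence of steps. It is not FreshData's actual code: run_steps is a hypothetical stand-in for run_pipeline(), the "start" and "done" status values are assumptions, and the frame only needs a .shape attribute (as a pandas DataFrame has).

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical type aliases; the issue only requires that events carry
# at least step, status, rows, and columns.
ProgressEvent = Dict[str, Any]
ProgressCallback = Callable[[ProgressEvent], None]


def _emit(callback: Optional[ProgressCallback], step: str, status: str, df: Any) -> None:
    """Send one event if a callback was supplied; otherwise do nothing."""
    if callback is None:
        return  # no callback means no progress output at all
    rows, columns = df.shape
    callback({"step": step, "status": status, "rows": rows, "columns": columns})


def run_steps(
    df: Any,
    steps: List[Tuple[str, Callable[[Any], Any]]],
    progress_callback: Optional[ProgressCallback] = None,
) -> Any:
    """Hypothetical stand-in for run_pipeline(): run each named step in order."""
    for name, func in steps:
        _emit(progress_callback, name, "start", df)  # assumed status value
        df = func(df)
        _emit(progress_callback, name, "done", df)   # assumed status value
    return df
```

Passing a plain callable keeps progress bars optional: a caller who wants one can wrap it in their own callback, and the library itself imports nothing extra.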
Skills Required: Python, PyTest, API design, Pandas
How to Get Started:
- Open src/freshdata/config.py and study how options are validated.
- Open src/freshdata/cleaner.py and identify each step in run_pipeline().
- Add a tiny callback test that appends events to a list and asserts the expected steps appear.
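
The tiny callback test from the last step might start out like the sketch below. It takes the cleaning function as a parameter so the same check can cover both fd.clean and Cleaner(...).clean(...). The helper name assert_progress_events is hypothetical, and the relative order in EXPECTED_STEPS is an assumption taken from the order the steps are listed in the acceptance criteria.

```python
from typing import Any, Callable, Dict, List

# Step names come from the acceptance criteria; their relative order
# here is an assumption and should be checked against run_pipeline().
EXPECTED_STEPS = [
    "column_names",
    "strings",
    "dtypes",
    "duplicates",
    "engine_missing",
    "engine_outliers",
]
REQUIRED_KEYS = {"step", "status", "rows", "columns"}


def assert_progress_events(clean: Callable[..., Any], df: Any) -> List[Dict[str, Any]]:
    """Hypothetical helper: run clean() with a list-appending callback and check the events."""
    events: List[Dict[str, Any]] = []
    clean(df, progress_callback=events.append)

    seen: List[str] = []
    for event in events:
        # Every event must carry at least the required fields.
        assert REQUIRED_KEYS <= set(event), f"missing keys in {event!r}"
        if event["step"] not in seen:
            seen.append(event["step"])

    # The expected steps must all appear, in the expected relative order;
    # additional steps are tolerated.
    assert [s for s in seen if s in EXPECTED_STEPS] == EXPECTED_STEPS
    return events
```

Once the option exists, a test could call assert_progress_events(fd.clean, df) with a small DataFrame, and repeat the call with the clean method of a Cleaner instance.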