Why it matters
The Polars backend advertises LazyFrame streaming, and EngineConfig.streaming defaults to True. However, the default clean configuration also enables drop_duplicates, which makes the native path choose an order-preserving unique operation that Polars documents as incompatible with its streaming engine. Large native Polars cleans therefore do not receive the advertised streaming behavior by default.
Steps to reproduce
- Use a Parquet or LazyFrame source with the native-only configuration:
    from freshdata import clean
    from freshdata.config import CleanConfig
    from freshdata.execution import EngineConfig

    clean(
        "large.parquet",
        config=CleanConfig(strategy="conservative", fix_dtypes=False),
        engine="polars",
        engine_config=EngineConfig(engine="polars", streaming=True),
    )
- Follow PolarsEngine._stage_drop_duplicates.
- It calls lf.unique(keep=config.duplicate_keep, maintain_order=True) before _collect requests engine="streaming".
Expected behavior
A run configured for streaming should remain streamable, or clearly report that exact duplicate-order semantics require materialization.
Actual behavior
Polars documents that LazyFrame.unique(..., maintain_order=True) blocks streaming. Because drop_duplicates=True and duplicate_keep="first" are defaults, this applies to the normal native Polars execution path.
Likely root cause
Pandas parity requires the surviving first or last duplicate and original row order, while the Polars streaming-compatible keep="any" path intentionally makes no ordering guarantee.
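To make the semantic gap concrete, here is a minimal pure-Python sketch of the two contracts. It does not use Polars or freshdata. The function names dedup_ordered and dedup_unordered are hypothetical and exist only for illustration. The ordered variant needs to know where each row sits relative to every other row, which is the kind of global ordering information an unordered, chunk-at-a-time dedup does not have to track.

```python
from typing import Hashable, Iterable, List, Set


def dedup_ordered(rows: Iterable[Hashable], keep: str = "first") -> List[Hashable]:
    """Pandas-style contract: the first (or last) occurrence survives,
    and survivors stay in their original relative order."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    materialized = list(rows)
    if keep == "last":
        # Scan from the end so the last occurrence wins.
        materialized.reverse()
    seen: Set[Hashable] = set()
    survivors: List[Hashable] = []
    for row in materialized:
        if row not in seen:
            seen.add(row)
            survivors.append(row)
    if keep == "last":
        # Restore the original row order of the survivors.
        survivors.reverse()
    return survivors


def dedup_unordered(rows: Iterable[Hashable]) -> Set[Hashable]:
    """keep="any"-style contract: only the set of distinct rows is
    guaranteed; neither the survivor's position nor output order is."""
    return set(rows)
```

Callers that only need the distinct rows can accept the second contract. Callers that compare output row-for-row against pandas need the first.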
Suggested first investigation path
Make this tradeoff explicit in the plan. Consider an opt-in unordered duplicate mode that uses a streamable unique operation, or surface a clear warning and benchmark note whenever order-preserving dedup forces eager execution. Add tests that assert the planned backend remains streaming only for streamable stage combinations.
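One possible shape for the explicit tradeoff is sketched below in plain Python, without importing Polars or freshdata. Every name here (DedupPlan, plan_dedup, allow_unordered) is hypothetical and not part of the current codebase. The sketch assumes the planner decides up front whether the dedup stage can stream, and warns when order-preserving semantics rule it out.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class DedupPlan:
    """Hypothetical planning result for the drop_duplicates stage."""
    streaming: bool       # whether the stage can stay on the streaming path
    maintain_order: bool  # value that would be passed to unique()
    keep: str             # "first", "last", or "any"


def plan_dedup(
    streaming_requested: bool,
    duplicate_keep: str = "first",
    allow_unordered: bool = False,
) -> DedupPlan:
    """Decide dedup semantics and report when streaming is lost."""
    if not streaming_requested:
        # Nothing to trade off: keep pandas-compatible semantics.
        return DedupPlan(False, True, duplicate_keep)
    if allow_unordered:
        # Opt-in mode: give up survivor and order guarantees to stay streamable.
        return DedupPlan(True, False, "any")
    warnings.warn(
        "drop_duplicates with keep=%r preserves row order and cannot run on "
        "the streaming path; pass allow_unordered=True to stay streamable."
        % duplicate_keep,
        UserWarning,
        stacklevel=2,
    )
    return DedupPlan(False, True, duplicate_keep)
```

A test for the planned backend could then assert that plan.streaming is True only for stage combinations known to be streamable, and that the default configuration produces the warning.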