You can improve both speed and memory usage in a few ways:

Read in chunks instead of loading the whole file:
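    import pandas as pd

    out_path = f.replace(".csv", "_processed.csv")
    for i, chunk in enumerate(pd.read_csv(f, chunksize=100_000)):
        chunk["new_col"] = chunk["old_col"] * 2
        # Write the header once (first chunk), then append without repeating it.
        chunk.to_csv(out_path, mode="w" if i == 0 else "a", header=(i == 0), index=False)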
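Avoid .apply() for element-wise arithmetic. Vectorized operations such as df["old_col"] * 2 run in compiled code rather than a Python-level loop and are typically much faster.
If possible, pass dtype arguments to read_csv to reduce the memory footprint (for example, float32 instead of the default float64, or category for columns with few distinct strings).
For bigger gains, consider Polars or Dask for parallel, out-of-core processing.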
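
The first three suggestions can be combined into one pass over the file. The following is only a minimal sketch: the function name process_csv, the src/dst parameters, and the float32 dtype for old_col are assumptions made for illustration, so adjust them to your actual schema and value ranges.

```python
import os
import tempfile

import pandas as pd


def process_csv(src, dst, chunksize=100_000):
    """Stream src in chunks, add new_col, and write dst with a single header row."""
    # Assumption: old_col holds numbers that fit comfortably in float32.
    reader = pd.read_csv(src, chunksize=chunksize, dtype={"old_col": "float32"})
    for i, chunk in enumerate(reader):
        # Vectorized arithmetic on the whole column; no .apply() needed.
        chunk["new_col"] = chunk["old_col"] * 2
        first = i == 0
        # Overwrite on the first chunk, append afterwards; header only once.
        chunk.to_csv(dst, mode="w" if first else "a", header=first, index=False)


# Small demo on a temporary file (hypothetical data, 10 rows, 3 chunks).
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "data.csv")
dst_path = os.path.join(tmp_dir, "data_processed.csv")
pd.DataFrame({"old_col": range(10)}).to_csv(src_path, index=False)

process_csv(src_path, dst_path, chunksize=4)
result = pd.read_csv(dst_path)
print(result.shape)
```

Only one chunk is held in memory at a time, so peak memory depends on chunksize rather than on the file size. Opening the output in write mode for the first chunk also means a rerun replaces the old output instead of appending to it.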

Answer selected by meKryztal

This comment was marked as off-topic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Question Ask and answer questions about GitHub features and usage Programming Help Discussions around programming languages, open source and software development
3 participants