How to optimize a Python script that processes large CSV files? #169711
I have a Python script that reads and processes multiple CSV files (~500MB each). The current approach is slow and uses a lot of memory. I need help reducing both the runtime and the memory usage.
Any advice or examples would be appreciated!
You can improve both speed and memory usage by:
Read in chunks instead of loading the whole file:
```python
import os

import pandas as pd

# f is the path to one input CSV file
out = f.replace(".csv", "_processed.csv")
for chunk in pd.read_csv(f, chunksize=100_000):
    chunk["new_col"] = chunk["old_col"] * 2
    # mode="a" appends each chunk; write the header only for the first one
    chunk.to_csv(out, mode="a", index=False, header=not os.path.exists(out))
```

Avoid `.apply()`: use vectorized operations (`df["old_col"] * 2`), which are much faster.
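To make the `.apply()` point concrete, here is a minimal comparison (the column names are just the ones from the snippet above): both lines compute the same result, but the vectorized form runs as a single optimized operation over the whole column, while `.apply()` calls a Python function once per element.

```python
import pandas as pd

# Sample frame reusing the thread's hypothetical column name
df = pd.DataFrame({"old_col": range(1_000_000)})

# Vectorized: one C-level operation over the entire column
fast = df["old_col"] * 2

# .apply(): invokes the lambda per element, typically an order of
# magnitude slower for simple arithmetic like this
slow = df["old_col"].apply(lambda x: x * 2)

assert fast.equals(slow)  # identical results, very different speed
```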
If possible, pass the `dtype` argument to `read_csv` to reduce the memory footprint.
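For example (the column names here are made up for illustration), narrowing numeric columns to 32-bit types and mapping repetitive string columns to `category` can cut memory use substantially compared with pandas' default `int64`/`float64`/`object` dtypes:

```python
from io import StringIO

import pandas as pd

# Illustrative CSV with hypothetical columns, repeated to mimic bulk data
csv_data = "user_id,score,category\n" + "1,3.5,a\n2,4.0,b\n" * 500

default_df = pd.read_csv(StringIO(csv_data))
typed_df = pd.read_csv(
    StringIO(csv_data),
    dtype={"user_id": "int32", "score": "float32", "category": "category"},
)

# Narrower dtypes and categoricals shrink per-column memory
assert (
    typed_df.memory_usage(deep=True).sum()
    < default_df.memory_usage(deep=True).sum()
)
```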
For big gains, consider Polars or Dask for parallel, out-of-core processing.