How to optimize a Python script that processes large CSV files? #169711
I have a Python script that reads and processes multiple CSV files (~500MB each). The current approach is slow and uses a lot of memory. I need help reducing both the runtime and the memory usage.
Any advice or examples would be appreciated!
You can improve both speed and memory usage by:
Read in chunks instead of loading the whole file:
```python
import os

import pandas as pd

# f is the path to one input CSV file
out = f.replace(".csv", "_processed.csv")
for chunk in pd.read_csv(f, chunksize=100_000):
    chunk["new_col"] = chunk["old_col"] * 2
    # mode="a" appends each chunk; write the header only for the first one
    chunk.to_csv(out, mode="a", index=False, header=not os.path.exists(out))
```

Avoid `.apply()`: use vectorized operations (`df["old_col"] * 2`), which are much faster.
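To make the `.apply()` point concrete, here is a minimal comparison (the column names are just the ones from the snippet above): both lines compute the same result, but the vectorized form runs as a single optimized operation over the whole column, while `.apply()` calls a Python function once per element.

```python
import pandas as pd

# Sample frame reusing the thread's hypothetical column name
df = pd.DataFrame({"old_col": range(1_000_000)})

# Vectorized: one C-level operation over the entire column
fast = df["old_col"] * 2

# .apply(): invokes the lambda per element, typically an order of
# magnitude slower for simple arithmetic like this
slow = df["old_col"].apply(lambda x: x * 2)

assert fast.equals(slow)  # identical results, very different speed
```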
If possible, pass the `dtype` argument to `read_csv` to reduce the memory footprint.
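For example (the column names here are made up for illustration), narrowing numeric columns to 32-bit types and mapping repetitive string columns to `category` can cut memory use substantially compared with pandas' default `int64`/`float64`/`object` dtypes:

```python
from io import StringIO

import pandas as pd

# Illustrative CSV with hypothetical columns, repeated to mimic bulk data
csv_data = "user_id,score,category\n" + "1,3.5,a\n2,4.0,b\n" * 500

default_df = pd.read_csv(StringIO(csv_data))
typed_df = pd.read_csv(
    StringIO(csv_data),
    dtype={"user_id": "int32", "score": "float32", "category": "category"},
)

# Narrower dtypes and categoricals shrink per-column memory
assert (
    typed_df.memory_usage(deep=True).sum()
    < default_df.memory_usage(deep=True).sum()
)
```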
For big gains, consider Polars or Dask for parallel, out-of-core processing.