Towards Data Science · 9-minute read

What Can We Do When Memory Becomes the New Bottleneck in Data Engineering?

When a 6.2-million-row social media dataset with mixed-type columns exceeds available RAM, three approaches can keep an ETL pipeline running without a hardware upgrade. Pandas chunking processes the data in 250,000-row slices, which lowers peak memory at the cost of speed. Dask automates partitioning and uses multiple CPU cores for parallel execution, but requires explicit schema definitions for mixed-type columns. Polars, built on a Rust engine with the Apache Arrow columnar format, offers the best balance of speed and memory efficiency through lazy query planning and streaming mode, though it requires learning a new DataFrame API. The right choice depends on constraints: Pandas chunking for dynamic schemas with tight resources, Dask for multi-core workloads, and Polars for performance-critical pipelines.
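As a rough illustration of the first approach, the sketch below aggregates a CSV chunk by chunk with pandas, so only one slice is held in memory at a time. It is a minimal sketch, not the original article's code: the column names `platform` and `likes`, the helper `count_by_platform`, and the tiny in-memory sample are all hypothetical, and the mixed-type column is read as text and coerced to numbers inside each chunk.

```python
import io

import pandas as pd


def count_by_platform(csv_source, chunk_rows=250_000):
    """Sum `likes` per `platform`, reading the CSV in fixed-size chunks.

    Both column names are hypothetical stand-ins for a real schema.
    """
    totals = {}
    # Read the mixed-type column as text so every chunk parses the same way.
    reader = pd.read_csv(csv_source, chunksize=chunk_rows, dtype={"likes": str})
    for chunk in reader:
        # Non-numeric entries become NaN, then count as 0.
        likes = pd.to_numeric(chunk["likes"], errors="coerce").fillna(0)
        partial = likes.groupby(chunk["platform"]).sum()
        # Merge this chunk's partial sums into the running totals.
        for platform, value in partial.items():
            totals[platform] = totals.get(platform, 0) + int(value)
    return totals


# Tiny in-memory sample with one non-numeric value in the numeric column.
sample = io.StringIO("platform,likes\nx,10\ny,unknown\nx,5\ny,7\n")
result = count_by_platform(sample, chunk_rows=2)
```

The same pattern works for any aggregation that can be merged from partial results (sums, counts, minima and maxima). Operations that need the whole dataset at once, such as a global sort or an exact median, do not fit it as directly.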


#data-engineering #pandas #etl #polars

Source: https://towardsdatascience.com/when-memory-becomes-the-new-bottleneck-in-data-engineering-what-can-we-do. 8sync News only summarizes and links to the article; copyright in the content belongs to the author and the original source.

