databricks · 9 min read · 3 hours ago

How we keep GPUs reliable across Databricks AI

The Databricks AI team describes how it keeps GPUs reliable at scale across distributed training workloads. The post covers three main failure categories: crashed jobs (often surfacing as NCCL watchdog timeouts), silent slowdowns from thermal or interconnect degradation, and numerical corruption from ECC-uncorrectable faults. At large GPU counts, failures during a run are statistically expected: a 1,024-GPU job running for 30 days has a 57% chance of encountering at least one.

To address this, Databricks built a multi-stage health check service called gpu-monitor with three layers. Active bootstrap checks run at node provisioning. Passive continuous checks monitor nodes under live workloads. Periodic multi-node checks validate inter-node fabric health using NCCL collective bandwidth probes across payload sizes from 8 bytes to 2 GiB.

The post details a real incident in which a single InfiniBand port flap crashed a 7-hour training run because NCCL_IB_TIMEOUT fired before the PyTorch watchdog did. The lesson drawn is that cumulative downtime matters more than flap count. The post is the first in a series on GPU reliability engineering at Databricks scale.
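The 57% figure for a 1,024-GPU, 30-day job can be sanity-checked with a simple model. The summary does not say which failure model or per-GPU rate the original post uses, so the sketch below is an assumption: it treats failures as independent and constant-rate (a Poisson process) and backs out the per-GPU rate that would be consistent with 57%. The function names are hypothetical and chosen only for illustration.

```python
import math


def p_at_least_one_failure(num_gpus: int, days: float, rate_per_gpu_day: float) -> float:
    """P(at least one failure) over the run.

    ASSUMPTION: failures are independent across GPUs and arrive at a
    constant rate (Poisson process). Real fleets can show correlated
    failures, so treat this as a rough estimate only.
    """
    expected_failures = num_gpus * days * rate_per_gpu_day
    # 1 - exp(-x), computed via expm1 for accuracy when x is small.
    return -math.expm1(-expected_failures)


def implied_rate_per_gpu_day(num_gpus: int, days: float, p_failure: float) -> float:
    """Invert the model: the per-GPU daily rate that yields p_failure."""
    return -math.log(1.0 - p_failure) / (num_gpus * days)


# Back out the rate implied by the quoted figure (1,024 GPUs, 30 days, 57%).
rate = implied_rate_per_gpu_day(1024, 30, 0.57)   # about 2.7e-5 per GPU-day

# Under the same assumed rate, see how the risk grows with job size.
p_1024 = p_at_least_one_failure(1024, 30, rate)
p_4096 = p_at_least_one_failure(4096, 30, rate)
```

Under these assumptions, quadrupling the job to 4,096 GPUs at the same per-GPU rate pushes the chance of at least one failure in 30 days to roughly 97%, which illustrates why failures are treated as expected events rather than exceptions at this scale.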


#machine-learning #databricks

Source: https://www.databricks.com/blog/how-we-keep-gpus-reliable-across-databricks-ai. 8sync News only summarizes and links to the article; copyright of the content belongs to the author and the original source.

