The New Stack, 8 min read

“You Only Compute Once”: How Clockwork wants to put an end to AI training restarts

Clockwork has launched its YOCO (You Only Compute Once) Guarantee, backed by its TorchPass fault-tolerance product, which reached general availability in March. Instead of rolling back to a checkpoint when a GPU or node fails during AI training, TorchPass live-migrates the in-memory training state (model weights, gradients, and optimizer state) to a healthy spare GPU, typically recovering in seconds to minutes.

The guarantee promises that 90% of failures will be resolved with no lost progress; if Clockwork misses that target, customers receive a 25% credit. TorchPass offers two modes: a model-aware mode that requires a few lines of code and recovers in tens of seconds, and a model-transparent mode that requires no code changes and takes a few minutes.

Independent benchmarks by SemiAnalysis found that TorchPass outperformed both checkpoint-restart and Meta's open-source TorchFT on a GPT-OSS-120B run on a 64x H200 cluster. Clockwork estimates that failure-driven restarts cost over $6 million annually on a typical 2,048-GPU H200 deployment. The company is targeting AI-native startups and enterprises rather than hyperscalers, which have large internal engineering teams.
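To make the trade-off concrete, here is a minimal back-of-the-envelope sketch in Python. It is not Clockwork's model or any TorchPass API. The function names, the assumption that a failure lands uniformly within a checkpoint interval, and every input value in the usage example are hypothetical; only the 90% target and the 25% credit come from the summary above, and the article does not say what the credit is a percentage of.

```python
# Illustrative model only: none of these functions belong to TorchPass.
YOCO_TARGET = 0.90  # from the article: share of failures resolved with no lost progress
YOCO_CREDIT = 0.25  # from the article: credit owed if the target is missed


def checkpoint_restart_loss_gpu_hours(num_gpus, checkpoint_interval_min, restart_min):
    """Expected GPU-hours lost per failure with checkpoint-restart.

    Assumption: the failure lands uniformly inside a checkpoint interval,
    so on average half an interval of work is recomputed, and every GPU
    sits idle while the job restarts.
    """
    lost_minutes = checkpoint_interval_min / 2 + restart_min
    return num_gpus * lost_minutes / 60


def live_migration_loss_gpu_hours(num_gpus, recovery_sec):
    """GPU-hours stalled per failure if state is migrated and nothing is recomputed."""
    return num_gpus * recovery_sec / 3600


def yoco_credit_fraction(failures_total, failures_resolved_without_loss):
    """Return the credit fraction owed under a simplified reading of the guarantee."""
    if failures_total == 0:
        return 0.0
    rate = failures_resolved_without_loss / failures_total
    return 0.0 if rate >= YOCO_TARGET else YOCO_CREDIT


if __name__ == "__main__":
    # Hypothetical inputs: 30-minute checkpoints, 15-minute restart, 30-second migration.
    print(checkpoint_restart_loss_gpu_hours(2048, 30, 15))  # 1024.0
    print(round(live_migration_loss_gpu_hours(2048, 30), 1))
    print(yoco_credit_fraction(100, 89))  # 0.25
```

With these made-up inputs, the gap comes almost entirely from recomputed work and restart idle time, which is the cost the guarantee is aimed at. Real figures depend on checkpoint cadence, cluster size, and failure rates that the summary does not give.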


#machine-learning #pytorch

Source: https://thenewstack.io/clockwork-torchpass-gpu-migration. 8sync News only summarizes and links; copyright in the content belongs to the author and the original source.
