Red Hat Developer · 9 min read

Scale document ingestion with Docling and Ray on OpenShift AI

Scaling document ingestion for AI pipelines is a common bottleneck, especially with complex PDFs containing tables, multi-column layouts, and embedded figures. This post presents a production-ready architecture combining Docling (structure-aware PDF parsing) with Ray Data (distributed streaming execution) on Red Hat OpenShift AI. Docling loads ~1 GB of ML models and takes 5–20 seconds per PDF, making sequential processing of 10,000+ documents impractical. Ray Data's actor pool model amortizes model loading costs and overlaps read/process/write stages. KubeRay manages cluster lifecycle on Kubernetes, while the CodeFlare SDK simplifies cluster configuration from notebooks. Two deployment patterns are covered: ephemeral RayJob clusters for batch/CI-CD workloads and persistent RayClusters for interactive development. A configuration calculator script helps size actor pools, memory, and partitioning. Sample throughput with 8 workers × 8 CPUs reaches 4–8 files/second, processing 10,000 PDFs in 20–40 minutes. Extensions include S3 storage, OCR support, and additional document formats.
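The figures quoted above can be checked with some quick arithmetic. The sketch below is not the article's configuration calculator. It is a minimal illustration that uses only the numbers in the summary (5–20 seconds per PDF, 4–8 files/second, 8 workers × 8 CPUs, ~1 GB of models). The function names and the `cpus_per_actor` parameter are hypothetical, and the memory estimate assumes each actor process holds its own copy of the models.

```python
def sequential_hours(num_files: int, sec_per_file: float) -> float:
    """Wall-clock hours to process files one at a time on a single process."""
    return num_files * sec_per_file / 3600


def distributed_minutes(num_files: int, files_per_sec: float) -> float:
    """Wall-clock minutes at a sustained cluster-wide throughput."""
    return num_files / files_per_sec / 60


def actor_pool_size(workers: int, cpus_per_worker: int, cpus_per_actor: int) -> int:
    """Upper bound on actors if each one reserves `cpus_per_actor` CPUs.

    `cpus_per_actor` is a hypothetical tuning knob, not a value from the article.
    """
    return workers * (cpus_per_worker // cpus_per_actor)


def model_memory_gb(actors: int, model_gb: float = 1.0) -> float:
    """Memory spent on model weights alone, assuming one copy per actor."""
    return actors * model_gb


if __name__ == "__main__":
    n = 10_000
    print(f"sequential: {sequential_hours(n, 5):.1f} to {sequential_hours(n, 20):.1f} hours")
    print(f"distributed: {distributed_minutes(n, 8):.1f} to {distributed_minutes(n, 4):.1f} minutes")

    actors = actor_pool_size(workers=8, cpus_per_worker=8, cpus_per_actor=2)
    print(f"{actors} actors, about {model_memory_gb(actors):.0f} GB of model copies")
```

At 5–20 seconds per file, a single process would need roughly 14 to 56 hours for 10,000 PDFs, which is why the summary calls sequential processing impractical. At 4–8 files/second the same batch takes about 21 to 42 minutes, in line with the quoted 20–40 minutes.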


#rag #openshift

Source: https://developers.redhat.com/articles/2026/06/30/scale-document-ingestion-docling-and-ray-openshift-ai. 8sync News only summarizes and links to the original; copyright in the content belongs to the author and the original source.
