MIT Technology Review · 7 min read · 1 day ago

The emergence of the web data infrastructure layer for AI

As AI systems move beyond static training data, enterprises face a growing need for real-time web data infrastructure. Traditional model training on fixed snapshots is insufficient for use cases like dynamic pricing, market tracking, and reducing hallucinations. A new infrastructure layer — capable of emulating human browsing at massive scale, navigating anti-bot protections, and delivering structured data with low latency — is emerging to fill this gap. Bright Data's CEO argues that AI intelligence without a live knowledge layer is practically useless, and that 97% of AI organizations depend on real-time web data yet 90% feel constrained by access restrictions. Compliance with GDPR and CCPA is addressed through consent-based networks and public-data-only policies. This sponsored content promotes Bright Data's web data platform as a solution.


#big-data #crawling #rag

Source: https://www.technologyreview.com/2026/06/24/1139202/the-emergence-of-the-web-data-infrastructure-layer-for-ai. 8sync News only summarizes and links to articles; copyright in the content belongs to the author and the original source.

Recommended for you

Towards Data Science · 8 hours ago · AI

Letting an LLM Pick the Right RAG Page: The Arbiter Pattern at the End of Retrieval

The article introduces the "Arbiter Pattern" in RAG, in which an LLM acts as an arbiter by classifying and evaluating candidate source documents based on the structure of the input data, replacing traditional score-fusion methods. The author stresses that embeddings should be the method of last resort for enterprise documents because they are poor at establishing that information is absent, whereas keyword retrieval provides a reliable negative. The article also covers a retrieval-method selector keyed to question type, a unified JSON schema for retrieval results to ensure auditability, and the importance of trustworthy "not found" handling in regulatory-compliance contexts.

A developer should read this article to learn how to optimize a RAG system by applying the Arbiter Pattern, a more flexible approach than fusion scores that helps handle complex cases of selecting the right result from many different information sources.

