Training-serving skew — the divergence between features used during model training and those seen at inference time — silently degrades ML accuracy and doubles infrastructure costs. The solution is a unified kappa architecture: compute features once in Apache Flink, dual-write to an offline store (Apache Iceberg or Delta Lake) for training and an online key-value cache for serving. DoorDash measured a 35.7% feature-value mismatch in their dual-pipeline setup; Netflix replaced a $93M/year dual-pipeline backfill with a $2M/year kappa replay. The reference architecture covers Kafka ingestion via Confluent's Kora engine, serverless Flink with event-time watermarks and exactly-once semantics, Tableflow for automated Iceberg/Delta materialization, and Stream Governance for schema enforcement and lineage. A tooling comparison covers Databricks, SageMaker+Kinesis, Tecton, Feast, and Confluent, with a decision framework based on latency requirements, existing stack investment, and pipeline fragmentation. The post is authored by a Confluent employee and promotes the Confluent Data Streaming Platform throughout.
Nguồn: https://www.confluent.io/blog/eliminate-training-serving-skew-mlops. 8sync News chỉ tóm tắt và dẫn link; bản quyền nội dung thuộc tác giả và nguồn gốc.
Apache Kafka có lỗ hổng trong cơ chế log compaction khiến dữ liệu bị hỏng do xung đột giữa compaction và replication, gây ra bốn vấn đề: dữ liệu đã xóa tái xuất hiện, giao dịch bị hủy hiện dưới dạng đã commit, dữ liệu đã commit bị ẩn, và consumers read_committed bị đóng băng partition. Redpanda Streaming khắc phục bằng giao thức compaction phối hợp, sử dụng các cặp offset (MCCO/MTRO, MXFO/MXRO) để đảm bảo tombstones và transaction markers không bị xóa trước khi tất cả replicas xử lý xong. Lỗi này có thể tái hiện trên Kafka phiên bản 3.9 đến 4.2 bằng Docker Compose.
Lập trình viên cần đọc bài này để hiểu cách giải quyết vấn đề lỗi race condition trong log compaction của Kafka, giúp tránh mất dữ liệu và bảo đảm tính nhất quán khi xử lý các trường hợp đồng bộ hóa dữ liệu trên nhiều broker.
EDB bổ sung khả năng phân tích hội tụ cho dịch vụ cơ sở dữ liệu EDB Postgres AI, sử dụng Apache Iceberg làm lớp danh mục chia sẻ kết nối ClickHouse, WarehousePG và Spark, đồng thời cung cấp tính năng "agentic database" tự động hóa nhiệm vụ DBA định kỳ. Giải pháp này nhấn mạnh quyền kiểm soát dữ liệu tại chỗ cho doanh nghiệp, khác biệt với cách tiếp cận lakehouse của Databricks, và có mức giá theo lõi CPU ổn định hơn so với các nền tảng cloud theo mức tiêu thụ.
Lập trình viên cần đọc bài này để hiểu cách Postgres AI của EDB kết hợp với Iceberg và các công cụ phân tích khác để tạo ra một hệ sinh thái tích hợp, giúp tối ưu hóa quy trình phát triển ứng dụng AI với tính linh hoạt, kiểm soát dữ liệu và chi phí dự đoán hơn so với các giải pháp cloud tiêu thụ.

Apache Flink 2.3.0 is now available, implementing 15 FLIPs with major improvements across SQL, connectors, and runtime. New SQL operators FROM_CHANGELOG and TO_CHANGELOG bridge append-only and dynamic changelog tables. Materialized tables gain DDL parity with regular tables and fine-grained refresh control via a new START_MODE clause. The SinkUpsertMaterializer is reworked with an explicit ON CONFLICT clause and watermark-based compaction to reduce state size. A new native S3 filesystem plugin built on AWS SDK v2 replaces Hadoop/Presto-based connectors with non-blocking I/O and zero Hadoop dependencies. Runtime improvements include adaptive partition selection for backpressure handling, watermark alignment redesign for faster backlog processing, checkpointing during recovery from unaligned checkpoints, and application-level lifecycle management with a new Web UI Applications tab.
Running self-hosted AI models on Kubernetes introduces significant changes to how platform teams manage capacity, security, and operations. The post covers when self-hosting makes sense over API-based AI (cost predictability, data residency, vendor lock-in), what changes in cluster design (GPU node groups, autoscaling, scheduling patterns, observability), and how to split ownership between platform and ML teams. Key operational concerns include GPU utilization, new failure modes like queue depth and token latency, compliance mapping, and FinOps for GPU spend. The post also addresses when to keep Kubernetes management in-house versus using a managed service.
Redpanda SQL introduces bridge queries, a feature that lets you query live Redpanda topic data and historical Iceberg table data together through a single virtual SQL table. This eliminates the classic streaming-to-lakehouse tradeoff between data freshness and Parquet file quality. Previously, frequent flushes to Iceberg were needed for low-latency analytics, resulting in thousands of tiny Parquet files, poor compression, high S3 costs, and constant compaction overhead. With bridge queries, the topic itself serves the freshness gap at query time, so Iceberg flushes can happen on a longer cadence (hours instead of seconds/minutes), producing larger, analytics-optimized Parquet files. Configuration involves raising the iceberg lag target and optionally the flush size threshold. The result is lower query latency, reduced S3 costs, and no need for a compaction service.
Enterprise organizations are struggling to move AI projects beyond the proof-of-concept stage into scalable, production-ready systems. The core problem is that AI adoption is outpacing governance, with only 21% of companies having operationalized the foundations needed to scale safely. The proposed solution is building AI-driven applications with governance baked in by design rather than added as an afterthought. BARC research supports this: organizations with company-wide governed data products are far more likely to have three or more AI projects in production (85% vs. 25%). A practical framework is outlined covering seamless application development, integrated governance with automated monitoring, and a shift from experimental pilots to product-oriented AI development.
Batch customer data platforms can't capture user intent as it forms — by the time a nightly sync completes, the intent moment is gone. A streaming-native architecture built on Apache Kafka and Apache Flink handles the full spectrum of personalization latency windows, from sub-100ms real-time bidding to multi-day email campaigns, using the same four-job pipeline: connect, stream, process, and govern. An AI-native layer (Confluent Intelligence) sits on top, enabling streaming agents with MCP tool-calling, a real-time context engine for LLMs, and built-in ML functions (ML_PREDICT, AI_COMPLETE) for embedding, ranking, and generative copy — all running as Flink jobs with exactly-once semantics and full lineage. The guide covers three production patterns (retail product recommendations, media feed personalization, cross-channel cart abandonment orchestration), a five-capability vendor evaluation framework, and a three-phase rollout roadmap from streaming backbone to autonomous agentic personalization.