Towards Data Science · 0 comments · 17 min read · 1 hour ago

Prompt Engineering Fails Quietly — Prompt Regression Is Why

Prompt changes can silently break production behavior, a problem called prompt regression. When a RAG intent classifier's system prompt grew from 6 to 14 instructions, negation queries started to be misclassified with no obvious signal. The proposed solution is a regression test suite: 40 golden queries across 6 intent categories, each validated with 4 deterministic checks (schema, pattern, intent, guard). The suite detects the 'False Improvement' pattern, in which overall accuracy rises while a critical category collapses. In the article's example, v4, the 'best' prompt at 67.5% overall accuracy, triggered FALSE IMPROVEMENT DETECTED because of a 66.7% collapse in negation classification. The framework uses a deterministic mock simulator instead of live LLM calls; it runs in under 2 seconds, has zero external dependencies, and is fully reproducible. Practical guidance covers defining golden queries, setting critical categories, and building failure simulators from your own prompt changelog.
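
To make the 'False Improvement' idea concrete, here is a minimal sketch of how such a check might look. It is not the article's code: the function names (`mock_classify`, `run_checks`, `evaluate`, `false_improvement`), the 10-point tolerance, the eight sample queries, and the reading of the four checks are all assumptions made for illustration. The mock classifier stands in for a live LLM call, so the run is deterministic.

```python
import json
import re
from collections import defaultdict

# Everything below is an illustrative assumption, not the article's implementation.
INTENTS = {"search", "comparison", "negation"}
CRITICAL = {"negation"}      # categories that must not regress
MAX_CRITICAL_DROP = 0.10     # assumed tolerance for a critical category

GOLDEN = [  # (query, expected intent); invented sample data
    ("find papers on RAG", "search"),
    ("show docs about caching", "search"),
    ("compare BM25 and dense retrieval", "comparison"),
    ("compare Redis versus Memcached", "comparison"),
    ("compare v1 and v2 pricing", "comparison"),
    ("results not about finance", "negation"),
    ("exclude anything on crypto", "negation"),
    ("docs without code samples", "negation"),
]

def mock_classify(query: str, version: str) -> str:
    """Deterministic stand-in for an LLM call; returns a JSON string."""
    q = query.lower()
    if version == "v1":
        # v1 recognizes negation words but has no comparison rule.
        negated = re.search(r"\b(not|exclude|without)\b", q)
        intent = "negation" if negated else "search"
    else:
        # v2 adds a comparison rule but only catches one negation phrasing.
        if "compare" in q:
            intent = "comparison"
        elif q.startswith("exclude"):
            intent = "negation"
        else:
            intent = "search"
    return json.dumps({"intent": intent})

def run_checks(raw: str, expected: str) -> bool:
    """One plausible reading of the schema, pattern, guard, and intent checks."""
    try:
        data = json.loads(raw)                    # schema: output is valid JSON
    except json.JSONDecodeError:
        return False
    intent = data.get("intent") if isinstance(data, dict) else None
    if not isinstance(intent, str):               # schema: string "intent" field
        return False
    if not re.fullmatch(r"[a-z_]+", intent):      # pattern: label format
        return False
    if intent not in INTENTS:                     # guard: only known labels
        return False
    return intent == expected                     # intent: matches golden label

def evaluate(version: str):
    """Return (overall accuracy, per-category accuracy) for a prompt version."""
    hits = defaultdict(list)
    for query, expected in GOLDEN:
        hits[expected].append(run_checks(mock_classify(query, version), expected))
    per_category = {c: sum(v) / len(v) for c, v in hits.items()}
    overall = sum(sum(v) for v in hits.values()) / len(GOLDEN)
    return overall, per_category

def false_improvement(baseline: str, candidate: str):
    """Flag a candidate whose overall accuracy rises while a critical category drops."""
    base_overall, base_cat = evaluate(baseline)
    cand_overall, cand_cat = evaluate(candidate)
    collapsed = sorted(
        c for c in CRITICAL if base_cat[c] - cand_cat[c] > MAX_CRITICAL_DROP
    )
    return cand_overall > base_overall and bool(collapsed), collapsed

if __name__ == "__main__":
    flagged, collapsed = false_improvement("v1", "v2")
    if flagged:
        print("FALSE IMPROVEMENT DETECTED:", collapsed)
```

Comparing per-category accuracy against a baseline, rather than overall accuracy alone, is what lets a suite like this catch a candidate that looks better in aggregate while a critical category degrades.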


#python #testing #llm #rag #prompt-engineering

Source: https://towardsdatascience.com/prompt-engineering-fails-quietly-prompt-regression-is-why. 8sync News only summarizes and links to articles; copyright in the content belongs to the author and the original source.
