XDA Developers · 7 min read

I ran my local LLM for hours and watched it get dumber in real time

Running a local LLM (Qwen 3.6 27B) on an RTX 5090 for hours at a stretch makes its responses noticeably worse. The cause is neither model corruption nor uptime itself, but two interacting issues. First, the KV cache pre-allocated for a 262K context window consumes enough VRAM to push total usage past the card's 32GB, so data silently spills into system RAM and has to be reached over PCIe, which is far slower than on-card memory. Second, as a conversation grows, the context window fills up, and the model drops early instructions and loses conversational history. Qwen 3.6's hybrid architecture (only 16 of 64 layers use full attention) means the KV cache is smaller than expected, but combined with model weights and OS overhead, VRAM still overflows. The fix is simple: start a new chat or reload the model to clear the cache. The model weights themselves never change during inference.
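
To make the VRAM arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. Only the 262K context length, the 16-of-64 full-attention layer split, and the 32GB card size come from the summary. The KV head count, head dimension, fp16 cache precision, quantized weight size, and overhead figure are all hypothetical placeholders chosen for illustration, not published Qwen 3.6 specifications, and the helper names are made up for this example.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate the bytes needed to cache keys and values for n_ctx tokens."""
    # Factor 2: each full-attention layer stores one key and one value tensor.
    return 2 * n_attn_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


def overflows_vram(weights_bytes, kv_bytes, overhead_bytes, vram_bytes):
    """True when the combined footprint no longer fits on the card."""
    return weights_bytes + kv_bytes + overhead_bytes > vram_bytes


N_CTX = 262_144      # "262K" from the summary, taken as 256 * 1024 tokens
N_KV_HEADS = 4       # ASSUMPTION: hypothetical KV head count
HEAD_DIM = 256       # ASSUMPTION: hypothetical per-head dimension

# Source: only 16 of 64 layers use full attention and need a growing KV cache.
hybrid = kv_cache_bytes(N_CTX, 16, N_KV_HEADS, HEAD_DIM)
# For comparison: the same model if all 64 layers used full attention.
dense = kv_cache_bytes(N_CTX, 64, N_KV_HEADS, HEAD_DIM)

WEIGHTS = 15 * GIB   # ASSUMPTION: hypothetical size of quantized 27B weights
OVERHEAD = 2 * GIB   # ASSUMPTION: hypothetical OS / runtime overhead
VRAM = 32 * GIB      # RTX 5090 capacity, treated here as 32 GiB

print(f"hybrid KV cache: {hybrid / GIB:.1f} GiB")
print(f"dense KV cache:  {dense / GIB:.1f} GiB")
print("overflows VRAM:", overflows_vram(WEIGHTS, hybrid, OVERHEAD, VRAM))
```

Under these placeholder numbers the hybrid layout cuts the cache to a quarter of the all-full-attention figure, yet the total still lands just above the card's capacity, which is the shape of the problem the article describes. Real figures depend on the actual model configuration, the quantization used, and whether the runtime quantizes the KV cache.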


#llm #local-ai #lm-studio

Source: https://www.xda-developers.com/ran-my-local-llm-for-hours-and-watched-it-get-dumber-in-real-time. 8sync News only summarizes and links to articles; copyright in the content belongs to the author and the original source.
