Towards Trajectory-Level Alignment: Detecting Inte

Benjamin

Created

Project type

Large Language Model, Intent Drift, Intent Drift Score, Trajectory-level Sta, Long-horizon Tasks

Project description

Large Language Models (LLMs) are increasingly deployed as multi-turn, goal-directed agents in domains such as tutoring, planning, and financial decision-making. Yet, even when individual steps appear correct, their overall trajectories can gradually diverge from user intent, a phenomenon we call Intent Drift. Unlike hallucination or local error accumulation, intent drift is a trajectory-level instability that undermines reliability in long-horizon tasks. We introduce the Intent Drift Score (IDS), a unified and computable metric for detecting and mitigating this form of misalignment. IDS integrates semantic, structural, and temporal signals into a prefix-monotone score, enabling real-time monitoring of drift. It is computable in linear time and scales to million-token contexts, making it deployable in practical long-horizon applications. Grounded in stability and rate–distortion theory, IDS offers formal guarantees of prefix-monotonicity and stability bounds. Empirical evaluations across dialogue and planning benchmarks show that IDS correlates strongly with human ratings (above 0.82) and identifies drift significantly earlier than BLEU, ROUGE, or graph-based diagnostics. Our core message is straightforward: alignment must be assessed not only by accuracy and safety, but also by trajectory-level stability. IDS operationalizes this principle, providing a foundation for building LLM agents that remain trustworthy over extended interactions.
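
The abstract does not give the IDS formula, so the following is only a minimal sketch of what a prefix-monotone, linear-time drift score could look like, not the authors' definition. It assumes that each step of a trajectory has already been reduced to three signals in [0, 1] (semantic, structural, temporal) and combines them with fixed non-negative weights. The names `StepSignals`, `running_drift_score`, `first_alarm`, the weights, and the threshold are all hypothetical choices made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class StepSignals:
    """Hypothetical per-step drift signals, each assumed to lie in [0, 1]."""
    semantic: float
    structural: float
    temporal: float


def running_drift_score(
    steps: Iterable[StepSignals],
    weights: tuple[float, float, float] = (0.5, 0.3, 0.2),  # assumed weights
) -> Iterator[float]:
    """Yield a cumulative drift score after each step of a trajectory.

    Every step adds a non-negative weighted sum of its signals, so the score
    never decreases as the prefix grows (prefix-monotone), and the trajectory
    is processed in a single O(n) pass with O(1) extra memory.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    w_sem, w_str, w_tmp = weights
    total = 0.0
    for s in steps:
        for value in (s.semantic, s.structural, s.temporal):
            if not 0.0 <= value <= 1.0:
                raise ValueError("signals must lie in [0, 1]")
        total += w_sem * s.semantic + w_str * s.structural + w_tmp * s.temporal
        yield total


def first_alarm(scores: Iterable[float], threshold: float) -> int | None:
    """Return the 0-based index of the first score above threshold, if any."""
    for index, score in enumerate(scores):
        if score > threshold:
            return index
    return None


# Example: drift builds up over three steps of a hypothetical trajectory.
trajectory = [
    StepSignals(0.0, 0.0, 0.0),
    StepSignals(0.2, 0.0, 0.0),
    StepSignals(0.4, 1.0, 0.5),
]
scores = list(running_drift_score(trajectory))
alarm_step = first_alarm(scores, threshold=0.5)
```

Because the score is a running sum of non-negative terms, it can be updated as each new turn arrives, which is the property that makes real-time monitoring possible in this sketch. How the three signals are actually computed and combined in IDS is not stated in the abstract.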
