Paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 19h ago

Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring

As Large Language Models (LLMs) and agentic systems become integrated into real-world applications, ensuring their safety and security is critical. Guardrail systems that detect and block malicious instructions sent to and from an LLM are an essential component of AI security. However, researchers conducting black-box adversarial emulation against production AI systems often struggle to determine whether a guardrail block or an LLM rejection has occurred. This distinction is important because the techniques used to bypass guardrails can differ substantially from those used to bypass LLM safety

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

  • Linked via arXiv author William Hackett

    Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring

  • Linked via arXiv author Peter Garraghan

    Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring
