paper · arXiv · Trust 82 · Primary · Published yesterday · Live · 19h ago

Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring

As Large Language Models (LLMs) and agentic systems become integrated into real-world applications, ensuring their safety and security is critical. Guardrail systems that detect and block malicious instructions sent to and from an LLM are an essential component of AI security. However, researchers conducting black-box adversarial emulation against production AI systems often struggle to determine whether a guardrail block or an LLM rejection has occurred. This distinction is important because the techniques used to bypass guardrails can differ substantially from those used to bypass LLM safety

Lineage graph

Paper → model → repo connections mined from source citations (Tier-1 exact match).

Why these links exist

Linked via arXiv author William Hackett → Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring
Linked via arXiv author Peter Garraghan → Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring

Covers

news: Prompt injection is exploiting enterprise AI's biggest design flaws by targeting agents, RAG pipelines and model routers
news: Securing the future of AI agents
news: AI browsers can be lulled into a dream world where guardrails no longer apply
news: "Dangerous" AI models are coming no matter what
news: How to Secure AI Agents With Container Sandboxing - HackerNoon

authored (incoming)

person: William Hackett
person: Peter Garraghan

Covers (incoming)

news: Chain-of-Thought Spoofing Targets Reasoning AI Models - Hackaday

Related across the graph

news: Prompt injection is exploiting enterprise AI's biggest design flaws by targeting agents, RAG pipelines and model routers
person: William Hackett
news: Chain-of-Thought Spoofing Targets Reasoning AI Models - Hackaday
news: AI browsers can be lulled into a dream world where guardrails no longer apply
news: Securing the future of AI agents
person: Peter Garraghan
news: "Dangerous" AI models are coming no matter what
news: How to Secure AI Agents With Container Sandboxing - HackerNoon

Topics

cs.AI