Behind the Refusal: Determining Guardrail Activation via Behavioral Monitoring
As Large Language Models (LLMs) and agentic systems become integrated into real-world applications, ensuring their safety and security is critical. Guardrail systems that detect and block malicious instructions sent to and from an LLM are an essential component of AI security. However, researchers conducting black-box adversarial emulation against production AI systems often struggle to determine whether a guardrail block or an LLM rejection has occurred. This distinction is important because the techniques used to bypass guardrails can differ substantially from those used to bypass LLM safety
Authors (as linked via arXiv): William Hackett, Peter Garraghan
