Paper · arXiv · Trust 82 · Primary · Published 3d ago · Live · 2d ago

Moral Safety in LLMs: Exposing Performative Compliance with Puzzled Cues

As large language models take on morally consequential roles in healthcare, legal, and hiring contexts, we need to examine whether their ethical behaviors are genuine or superficial. We show that current fairness evaluations substantially overestimate moral safety. Models appear fair when demographic identity is stated as an explicit label, yet become measurably less fair when the same identity must be inferred. We term this failure "performative compliance", where a model is fair when the presentation resembles a fairness evaluation and less fair as that cue weakens. We introduce a cue-v


Topics