paper · arXiv · Trust 82 · Primary · Published 5d ago · Live · 3d ago

A Multi-Dataset Benchmark for Evaluating LLM Agents in Microservice Failure Diagnosis

LLM-based agents are reshaping microservice operations into AgentOps, where benchmarks are key to evaluating failure diagnosis over multimodal observability data. However, existing benchmarks remain largely outcome-oriented: they score only the final answer and fail to assess the systematic reasoning process in failure diagnosis. We address this gap by introducing two large-scale datasets (AIOps2025 and RCA100) under a reasoning-process evaluation paradigm that assesses agentic diagnostic capability along three dimensions: Localization (where the fault occurs), Identification (what type of fau
