Topic

Llm Evaluation

8 items across the graph — tagged with Llm Evaluation.

From the graph · 8

repo
langfuse/langfuse

🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, Op…

repo
mlflow/mlflow

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-…

repo
promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simpl…

repo
comet-ml/opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready d…

repo
jeinlee1991/chinese-llm-benchmark

非线智能 NoneLinear - ReLE评测:中文AI大模型能力评测(持续更新):目前已囊括374个大模型,覆盖chatgpt、gpt-5.4、谷歌gemini-3.1-pro、Claude-4.6、文心ERNIE-X1.1、ERNIE-5.0、qwen3.6-max、qwen3.6-plus、百川、讯飞星火、商汤…

repo
Giskard-AI/giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

repo
Tencent/AI-Infra-Guard

A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evalu…

repo
Emmimal/prompt-regression-suite

Detect prompt regressions before they reach production — per-category accuracy scoring, deterministic validation, and False Improvement detection. Pure Python,…

Related topics