Topic

Evaluation

31 items across the graph tagged with Evaluation.

From the graph · 31

repo
langfuse/langfuse

🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, Op…

repo
mlflow/mlflow

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-…

repo
promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simpl…

repo
comet-ml/opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready d…

repo
Tencent/WeKnora

Open-source LLM knowledge platform: turn raw documents into a queryable RAG, an autonomous reasoning agent, and a self-maintaining Wiki.

repo
open-compass/opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets…

repo
jeinlee1991/chinese-llm-benchmark

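NoneLinear (非线智能) - ReLE evaluation: capability evaluation of Chinese AI large models (continuously updated). Currently covers 374 large models, including chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, SenseTime…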

repo
Giskard-AI/giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

repo
Kiln-AI/Kiln

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

repo
open-compass/VLMEvalKit

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks

repo
Tencent/AI-Infra-Guard

A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evalu…

repo
langwatch/langwatch

The platform for LLM evaluations and AI agent testing

repo
modelscope/evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

repo
onestardao/WFGY

WFGY is heading toward WFGY 5.0 Polaris Protocol, a major open-source release for AI reasoning, RAG, agents, and real-world workflows. Includes Problem Map, Glo…

repo
trpc-group/trpc-agent-go

A Go framework for building production agent systems with graph workflows, tools, memory, A2A, AG-UI, MCP, evaluation, and observability.

repo
NVIDIA-NeMo/Gym

Evaluate and improve models and agents using environments

repo
chrisliu298/awesome-llm-unlearning

A resource repository for machine unlearning in large language models

repo
run-llama/ParseBench

ParseBench - A Document Parsing Benchmark for AI Agents

repo
TIGER-AI-Lab/ClawBench

Open-source benchmark for browser AI agents on daily tasks.

repo
SAILResearch/awesome-ai-leaderboard

A curated list of awesome leaderboard-oriented resources for the AI domain

repo
EuroEval/EuroEval

The robust European language model benchmark.

repo
saccofrancesco/deepshot

AI-powered NBA game outcome predictor that uses advanced team stats and trend-based features to forecast winners and track model performance

repo
OpenDCAI/One-Eval

Automated system for LLM evaluation via agents.

repo
hidai25/eval-view

Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.

repo
arthur-ai/arthur-engine

Make AI work for everyone - monitoring and governance for your AI/ML

repo
dustalov/evalica

Evalica, your favourite evaluation toolkit

repo
litefuse/litefuse

Litefuse - Agent Observability and Evaluation Platform

repo
ruimalheiro/gradient-garden

Research platform for model training, evaluation, and experimentation across architectures, benchmarks, and recipes.

repo
notwitcheer/llm-bench-rig

Dual-engine (llama.cpp + vLLM) LLM benchmarking pipeline for GGUF & safetensors on NVIDIA GPUs — speed, quality, live dashboard, publishable cards.

repo
douglasjordan2/c0

An external memory for LLMs: a bi-temporal knowledge graph with hybrid (keyword + vector) retrieval and a self-improving reflection loop. Benchmarked to beat fl…

repo
Emmimal/prompt-regression-suite

Detect prompt regressions before they reach production — per-category accuracy scoring, deterministic validation, and False Improvement detection. Pure Python,…
