Topic

Evaluation

31 items across the graph — tagged with Evaluation.

From the graph · 31

🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, Op…

→repo

mlflow/mlflow

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-…

→repo

promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simpl…

→repo

comet-ml/opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready d…

→repo

Tencent/WeKnora

Open-source LLM knowledge platform: turn raw documents into a queryable RAG, an autonomous reasoning agent, and a self-maintaining Wiki.

→repo

open-compass/opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets…

→repo

jeinlee1991/chinese-llm-benchmark

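NoneLinear (非线智能) - ReLE evaluation: a capability benchmark for Chinese AI large language models (continuously updated). Currently covers 374 models, including chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, SenseTime…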

→repo

Giskard-AI/giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

→repo

Kiln-AI/Kiln

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

→repo

open-compass/VLMEvalKit

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks

→repo

Tencent/AI-Infra-Guard

A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evalu…

→repo

langwatch/langwatch

The platform for LLM evaluations and AI agent testing

→repo

modelscope/evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

→repo

onestardao/WFGY

WFGY is heading toward WFGY 5.0 Polaris Protocol, a major open-source release for AI reasoning, RAG, agents, and real-world workflows. Includes Problem Map, Glo…

→repo

trpc-group/trpc-agent-go

A Go framework for building production agent systems with graph workflows, tools, memory, A2A, AG-UI, MCP, evaluation, and observability.

→repo

NVIDIA-NeMo/Gym

Evaluate and improve models and agents using environments

→repo

chrisliu298/awesome-llm-unlearning

A resource repository for machine unlearning in large language models

→repo

run-llama/ParseBench

ParseBench - A Document Parsing Benchmark for AI Agents

→repo

TIGER-AI-Lab/ClawBench

Open-source benchmark for browser AI agents on daily tasks.

→repo

SAILResearch/awesome-ai-leaderboard

A curated list of awesome leaderboard-oriented resources for the AI domain

→repo

EuroEval/EuroEval

The robust European language model benchmark.

→repo

saccofrancesco/deepshot

AI-powered NBA game outcome predictor that uses advanced team stats and trend-based features to forecast winners and track model performance

→repo

OpenDCAI/One-Eval

Automated system for LLM evaluation via agents.

→repo

hidai25/eval-view

Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.

→repo

arthur-ai/arthur-engine

Make AI work for Everyone - Monitoring and governance for your AI/ML

→repo

dustalov/evalica

Evalica, your favourite evaluation toolkit

→repo

litefuse/litefuse

Litefuse - Agent Observability and Evaluation Platform

→repo

ruimalheiro/gradient-garden

Research platform for model training, evaluation, and experimentation across architectures, benchmarks, and recipes.

→repo

notwitcheer/llm-bench-rig

Dual-engine (llama.cpp + vLLM) LLM benchmarking pipeline for GGUF & safetensors on NVIDIA GPUs — speed, quality, live dashboard, publishable cards.

→repo

douglasjordan2/c0

An external memory for LLMs: a bi-temporal knowledge graph with hybrid (keyword + vector) retrieval and a self-improving reflection loop. Benchmarked to beat fl…

→repo

Emmimal/prompt-regression-suite

Detect prompt regressions before they reach production — per-category accuracy scoring, deterministic validation, and False Improvement detection. Pure Python,…

→

