Topic

Eval

50 items across the graph, tagged with Eval.

From the graph · 50

repo
HKUDS/LightRAG

[EMNLP2025] "LightRAG: Simple and Fast Retrieval-Augmented Generation"

repo
VectifyAI/PageIndex

📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG

repo
onyx-dot-app/onyx

Open Source AI Platform - AI Chat with advanced features that works with every LLM

repo
langfuse/langfuse

🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, Op…

repo
mlflow/mlflow

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-…

repo
deepset-ai/haystack

Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with exp…

repo
mastra-ai/mastra

Mastra is the modern TypeScript framework for AI-powered applications and agents.

repo
promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simpl…

repo
comet-ml/opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready d…

repo
arc53/DocsGPT

Private AI platform for agents, assistants and enterprise search. Built-in Agent Builder, Deep research, Document analysis, Multi-model support, and API connect…

repo
Tencent/WeKnora

Open-source LLM knowledge platform: turn raw documents into a queryable RAG, an autonomous reasoning agent, and a self-maintaining Wiki.

model
deepseek-ai/DeepSeek-R1

Hugging Face model with 13433 likes. Tags: transformers, safetensors, deepseek_v3, text-generation, conversational, custom_code, arxiv:2501.12948, license:mit,…

repo
open-compass/opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets…

repo
jeinlee1991/chinese-llm-benchmark

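NoneLinear (非线智能) - ReLE evaluation: capability benchmark for Chinese AI large language models (continuously updated). Currently covers 374 models, including chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, SenseTime…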

repo
Giskard-AI/giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

model
deepseek-ai/DeepSeek-V4-Pro

Hugging Face model with 5141 likes. Tags: transformers, safetensors, deepseek_v4, text-generation, arxiv:2606.19348, license:mit, eval-results, endpoints_compat…

repo
Kiln-AI/Kiln

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

model
openai/gpt-oss-120b

Hugging Face model with 4943 likes. Tags: transformers, safetensors, gpt_oss, text-generation, vllm, conversational, arxiv:2508.10925, license:apache-2.0, eval-…

model
openai/gpt-oss-20b

Hugging Face model with 4759 likes. Tags: transformers, safetensors, gpt_oss, text-generation, vllm, conversational, arxiv:2508.10925, license:apache-2.0, eval-…

repo
open-compass/VLMEvalKit

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks

model
deepseek-ai/DeepSeek-V3

Hugging Face model with 4094 likes. Tags: transformers, safetensors, deepseek_v3, text-generation, conversational, custom_code, arxiv:2412.19437, eval-results,…

repo
Tencent/AI-Infra-Guard

A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evalu…

repo
langwatch/langwatch

The platform for LLM evaluations and AI agent testing

model
deepseek-ai/DeepSeek-V3-0324

Hugging Face model with 3133 likes. Tags: transformers, safetensors, deepseek_v3, text-generation, conversational, custom_code, arxiv:2412.19437, license:mit, e…

model
google/gemma-4-31B-it

Hugging Face model with 3119 likes. Tags: transformers, safetensors, gemma4, image-text-to-text, conversational, base_model:google/gemma-4-31B, base_model:finet…

repo
modelscope/evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

repo
VectifyAI/OpenKB

OpenKB: Open LLM Knowledge Base

repo
superlinked/sie

Open-source inference server and production cluster for all the models your agent needs.

repo
ombharatiya/ai-system-design-guide

AI system design guide for engineers building production AI systems and evals.

repo
Cloud-CV/EvalAI

☁️ 🚀 📊 📈 Evaluating state of the art in AI

repo
jgravelle/jcodemunch-mcp

Cut AI token costs 95%+ on code exploration. The leading MCP server for precise, symbol-level GitHub code retrieval via tree-sitter AST. Works with Claude Code,…

repo
onestardao/WFGY

WFGY is heading toward WFGY 5.0 Polaris Protocol, a major open-source release for AI reasoning, RAG, agents, and real-world workflows. Includes Problem Map, Glo…

repo
trpc-group/trpc-agent-go

A Go framework for building production agent systems with graph workflows, tools, memory, A2A, AG-UI, MCP, evaluation, and observability.

repo
future-agi/future-agi

Open-source, end-to-end platform for evaluating, observing, and improving LLM and AI agent applications. Tracing · Evals · Simulations · Datasets · Gateway · Gu…

repo
NVIDIA/raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form bui…

repo
NVIDIA-NeMo/Gym

Evaluate and improve models and agents using environments

repo
Natooz/MidiTok

MIDI / symbolic music tokenizers for Deep Learning models 🎶

repo
NVIDIA/cuvs

cuVS - a library for vector search and clustering on the GPU

repo
bosun-ai/swiftide

Fast, streaming indexing, query, and agentic LLM applications in Rust

repo
chrisliu298/awesome-llm-unlearning

A resource repository for machine unlearning in large language models

repo
felladrin/MiniSearch

Minimalist web-searching platform with an AI assistant that runs directly from your browser. Demo: https://felladrin-minisearch.hf.space

repo
run-llama/ParseBench

ParseBench - A Document Parsing Benchmark for AI Agents

repo
TIGER-AI-Lab/ClawBench

Open-source benchmark for browser AI agents on daily tasks.

repo
SAILResearch/awesome-ai-leaderboard

A curated list of awesome leaderboard-oriented resources for the AI domain

repo
jolibrain/colette

Multimodal RAG to search and interact locally with technical documents of any kind

repo
hud-evals/hud-python

RL environments + evals for AI agents. Define once, train anything.

repo
shenmintao/marginalia

A library-science-inspired personal knowledge management system with LLM agents

repo
KRLabsOrg/verbatim-rag

Hallucination-prevention RAG system with verbatim span extraction. Ensures all generated content is grounded in source documents with exact citations.

repo
EuroEval/EuroEval

The robust European language model benchmark.

repo
saccofrancesco/deepshot

AI-powered NBA game outcome predictor that uses advanced team stats and trend-based features to forecast winners and track model performance
