Topic

Eval

50 items across the graph tagged with Eval.

From the graph · 50

repo

HKUDS/LightRAG

[EMNLP2025] "LightRAG: Simple and Fast Retrieval-Augmented Generation"

→repo

VectifyAI/PageIndex

📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG

→repo

onyx-dot-app/onyx

Open Source AI Platform - AI Chat with advanced features that works with every LLM

→repo

langfuse/langfuse

🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, Op…

→repo

mlflow/mlflow

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-…

→repo

deepset-ai/haystack

Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with exp…

→repo

mastra-ai/mastra

Mastra is the modern TypeScript framework for AI-powered applications and agents.

→repo

promptfoo/promptfoo

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simpl…

→repo

comet-ml/opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready d…

→repo

arc53/DocsGPT

Private AI platform for agents, assistants and enterprise search. Built-in Agent Builder, Deep research, Document analysis, Multi-model support, and API connect…

→repo

Tencent/WeKnora

Open-source LLM knowledge platform: turn raw documents into a queryable RAG, an autonomous reasoning agent, and a self-maintaining Wiki.

→model

deepseek-ai/DeepSeek-R1

Hugging Face model with 13433 likes. Tags: transformers, safetensors, deepseek_v3, text-generation, conversational, custom_code, arxiv:2501.12948, license:mit,…

→repo

open-compass/opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets…

→repo

jeinlee1991/chinese-llm-benchmark

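NoneLinear (非线智能) - ReLE Evaluation: capability evaluation of Chinese AI large models (continuously updated). Currently covers 374 large models, including chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, SenseTime…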

→repo

Giskard-AI/giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

→model

deepseek-ai/DeepSeek-V4-Pro

Hugging Face model with 5141 likes. Tags: transformers, safetensors, deepseek_v4, text-generation, arxiv:2606.19348, license:mit, eval-results, endpoints_compat…

→repo

Kiln-AI/Kiln

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

→model

openai/gpt-oss-120b

Hugging Face model with 4943 likes. Tags: transformers, safetensors, gpt_oss, text-generation, vllm, conversational, arxiv:2508.10925, license:apache-2.0, eval-…

→model

openai/gpt-oss-20b

Hugging Face model with 4759 likes. Tags: transformers, safetensors, gpt_oss, text-generation, vllm, conversational, arxiv:2508.10925, license:apache-2.0, eval-…

→repo

open-compass/VLMEvalKit

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks

→model

deepseek-ai/DeepSeek-V3

Hugging Face model with 4094 likes. Tags: transformers, safetensors, deepseek_v3, text-generation, conversational, custom_code, arxiv:2412.19437, eval-results,…

→repo

Tencent/AI-Infra-Guard

A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evalu…

→repo

langwatch/langwatch

The platform for LLM evaluations and AI agent testing

→model

deepseek-ai/DeepSeek-V3-0324

Hugging Face model with 3133 likes. Tags: transformers, safetensors, deepseek_v3, text-generation, conversational, custom_code, arxiv:2412.19437, license:mit, e…

→model

google/gemma-4-31B-it

Hugging Face model with 3119 likes. Tags: transformers, safetensors, gemma4, image-text-to-text, conversational, base_model:google/gemma-4-31B, base_model:finet…

→repo

modelscope/evalscope

A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.

→repo

VectifyAI/OpenKB

OpenKB: Open LLM Knowledge Base

→repo

superlinked/sie

Open-source inference server and production cluster for all the models your agent needs.

→repo

ombharatiya/ai-system-design-guide

AI system design guide for engineers building production AI systems and evals.

→repo

Cloud-CV/EvalAI

☁️ 🚀 📊 📈 Evaluating state of the art in AI

→repo

jgravelle/jcodemunch-mcp

Cut AI token costs 95%+ on code exploration. The leading MCP server for precise, symbol-level GitHub code retrieval via tree-sitter AST. Works with Claude Code,…

→repo

onestardao/WFGY

WFGY is heading toward WFGY 5.0 Polaris Protocol, a major open-source release for AI reasoning, RAG, agents, and real-world workflows. Includes Problem Map, Glo…

→repo

trpc-group/trpc-agent-go

A Go framework for building production agent systems with graph workflows, tools, memory, A2A, AG-UI, MCP, evaluation, and observability.

→repo

future-agi/future-agi

Open-source, end-to-end platform for evaluating, observing, and improving LLM and AI agent applications. Tracing · Evals · Simulations · Datasets · Gateway · Gu…

→repo

NVIDIA/raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form bui…

→repo

NVIDIA-NeMo/Gym

Evaluate and improve models and agents using environments

→repo

Natooz/MidiTok

MIDI / symbolic music tokenizers for Deep Learning models 🎶

→repo

NVIDIA/cuvs

cuVS - a library for vector search and clustering on the GPU

→repo

bosun-ai/swiftide

Fast, streaming indexing, query, and agentic LLM applications in Rust

→repo

chrisliu298/awesome-llm-unlearning

A resource repository for machine unlearning in large language models

→repo

felladrin/MiniSearch

Minimalist web-searching platform with an AI assistant that runs directly from your browser. Demo: https://felladrin-minisearch.hf.space

→repo

run-llama/ParseBench

ParseBench - A Document Parsing Benchmark for AI Agents

→repo

TIGER-AI-Lab/ClawBench

Open-source benchmark for browser AI agents on daily tasks.

→repo

SAILResearch/awesome-ai-leaderboard

A curated list of awesome leaderboard-oriented resources for AI domain

→repo

jolibrain/colette

Multimodal RAG to search and interact locally with technical documents of any kind

→repo

hud-evals/hud-python

RL environments + evals for AI agents. Define once, train anything.

→repo

shenmintao/marginalia

A library-science-inspired personal knowledge management system with LLM agents

→repo

KRLabsOrg/verbatim-rag

Hallucination-prevention RAG system with verbatim span extraction. Ensures all generated content is grounded in source documents with exact citations.

→repo

EuroEval/EuroEval

The robust European language model benchmark.

→repo

saccofrancesco/deepshot

AI-powered NBA game outcome predictor that uses advanced team stats and trend-based features to forecast winners and track model performance
