Eval
50 items across the graph are tagged with Eval.
From the graph · 50
[EMNLP2025] "LightRAG: Simple and Fast Retrieval-Augmented Generation"
📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG
Open Source AI Platform - AI Chat with advanced features that works with every LLM
🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, Op…
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-…
Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with exp…
Mastra is the modern TypeScript framework for AI-powered applications and agents.
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simpl…
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready d…
Private AI platform for agents, assistants and enterprise search. Built-in Agent Builder, Deep research, Document analysis, Multi-model support, and API connect…
Open-source LLM knowledge platform: turn raw documents into a queryable RAG, an autonomous reasoning agent, and a self-maintaining Wiki.
Hugging Face model with 13433 likes. Tags: transformers, safetensors, deepseek_v3, text-generation, conversational, custom_code, arxiv:2501.12948, license:mit,…
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets…
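NoneLinear (非线智能) - ReLE Evaluation: Chinese AI large-model capability evaluation (continuously updated). Currently covers 374 large models, including chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, SenseTime…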
🐢 Open-Source Evaluation & Testing library for LLM Agents
Hugging Face model with 5141 likes. Tags: transformers, safetensors, deepseek_v4, text-generation, arxiv:2606.19348, license:mit, eval-results, endpoints_compat…
Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.
Hugging Face model with 4943 likes. Tags: transformers, safetensors, gpt_oss, text-generation, vllm, conversational, arxiv:2508.10925, license:apache-2.0, eval-…
Hugging Face model with 4759 likes. Tags: transformers, safetensors, gpt_oss, text-generation, vllm, conversational, arxiv:2508.10925, license:apache-2.0, eval-…
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Hugging Face model with 4094 likes. Tags: transformers, safetensors, deepseek_v3, text-generation, conversational, custom_code, arxiv:2412.19437, eval-results,…
A full-stack AI Red Teaming platform securing AI ecosystems via OpenClaw Security Scan, Agent Scan, Skills Scan, MCP scan, AI Infra scan and LLM jailbreak evalu…
The platform for LLM evaluations and AI agent testing
Hugging Face model with 3133 likes. Tags: transformers, safetensors, deepseek_v3, text-generation, conversational, custom_code, arxiv:2412.19437, license:mit, e…
Hugging Face model with 3119 likes. Tags: transformers, safetensors, gemma4, image-text-to-text, conversational, base_model:google/gemma-4-31B, base_model:finet…
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
OpenKB: Open LLM Knowledge Base
Open-source inference server and production cluster for all the models your agent needs.
AI system design guide for engineers building production AI systems and evals.
☁️ 🚀 📊 📈 Evaluating state of the art in AI
Cut AI token costs 95%+ on code exploration. The leading MCP server for precise, symbol-level GitHub code retrieval via tree-sitter AST. Works with Claude Code,…
WFGY is heading toward WFGY 5.0 Polaris Protocol, a major open-source release for AI reasoning, RAG, agents, and real-world workflows. Includes Problem Map, Glo…
A Go framework for building production agent systems with graph workflows, tools, memory, A2A, AG-UI, MCP, evaluation, and observability.
Open-source, end-to-end platform for evaluating, observing, and improving LLM and AI agent applications. Tracing · Evals · Simulations · Datasets · Gateway · Gu…
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form bui…
Evaluate and improve models and agents using environments
MIDI / symbolic music tokenizers for Deep Learning models 🎶
cuVS - a library for vector search and clustering on the GPU
Fast, streaming indexing, query, and agentic LLM applications in Rust
A resource repository for machine unlearning in large language models
Minimalist web-searching platform with an AI assistant that runs directly from your browser. Demo: https://felladrin-minisearch.hf.space
ParseBench - A Document Parsing Benchmark for AI Agents
Open-source benchmark for browser AI agents on daily tasks.
A curated list of awesome leaderboard-oriented resources for AI domain
Multimodal RAG to search and interact locally with technical documents of any kind
RL environments + evals for AI agents. Define once, train anything.
A library-science-inspired personal knowledge management system with LLM agents
Hallucination-prevention RAG system with verbatim span extraction. Ensures all generated content is grounded in source documents with exact citations.
The robust European language model benchmark.
AI-powered NBA game outcome predictor that uses advanced team stats and trend-based features to forecast winners and track model performance
