A fast, scalable, high-performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks fo…

→repo

InternLM/lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

→repo

rapidsai/cuml

cuML - RAPIDS Machine Learning Library

→repo

iree-org/iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

→repo

NVIDIA/TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwel…

→repo

BBuf/how-to-optim-algorithm-in-cuda

How to optimize some algorithms in CUDA.

→repo

pytorch/TensorRT

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT

→repo

dphnAI/aphrodite-engine

Large-scale LLM inference engine

→repo

beam-cloud/beta9

Ultrafast serverless GPU inference, sandboxes, and background jobs

→repo

tenstorrent/tt-metal

TT-NN operator library and TT-Metalium low-level kernel programming model.

→repo

uccl-project/uccl

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)

→repo

SemiAnalysisAI/InferenceX

Open Source Continuous Inference Benchmark Research Platform — Kimi K2.7-Code, MiniMax M3, DeepSeekv4, GLM5 - GB200 NVL72 vs MI355X vs B200 vs GB300 NVL72 & soo…

→repo

NVIDIA/raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form bui…

→repo

jmaczan/tiny-vllm

Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM

→repo

brucefan1983/GPUMD

Graphics Processing Units Molecular Dynamics

→repo

NVIDIA/cuvs

cuVS - a library for vector search and clustering on the GPU

→repo

openinfer-project/openinfer

Pure Rust + CUDA LLM inference engine — no PyTorch, OpenAI-compatible, serves Qwen3 to Kimi-K2

→repo

ModelEngine-Group/unified-cache-management

Persist and reuse the KV cache to speed up your LLM.

→repo

raketenkater/ggrun

Auto-tuned launcher for GGUF models on llama.cpp / ik_llama.cpp — OpenAI-compatible server with multi-GPU tensor-split, MoE expert placement, measured flag tuni…

→repo

kibae/onnxruntime-server

ONNX Runtime Server: a server that provides TCP and HTTP/HTTPS REST APIs for ONNX inference.

→repo

traceopt-ai/traceml

A lightweight runtime health check for PyTorch training runs.

→repo

gammahazard/locate-anything

Sleek, mobile-friendly web UI for NVIDIA LocateAnything-3B — open-vocabulary object detection & grounding on your own GPU, via one docker compose up.

→repo

b-data/jupyterlab-python-docker-stack

(GPU accelerated) Multi-arch (linux/amd64, linux/arm64/v8) JupyterLab Python docker images. Please submit Pull Requests to the GitLab repository. Mirror of

→repo

b-data/data-science-devcontainers

(GPU accelerated) Multi-arch (linux/amd64, linux/arm64/v8) Data Science dev containers for R, Python, Julia and MAX/Mojo

→repo

SemiAnalysisAI/InferenceX-app

Dashboard for InferenceX™, Open Source Continuous Inference

→repo

kekzl/imp

From-scratch C++/CUDA inference engine for the NVIDIA RTX 5090 (sm_120a) — the best single-GPU backend for agentic AI: tool calling, long-context loops, reasoni…

→repo

aivrar/vllm-windows-build

Native Windows build of vLLM 0.24.0 - no WSL, no Docker. Python 3.13 + CUDA 12.8 + PyTorch 2.11 cu128 for RTX 30/40/50-series, pre-built wheel, Windows patchset…

→repo

aivrar/multi-turboquant

Unified KV cache compression for LLM inference — TurboQuant, IsoQuant, PlanarQuant, TriAttention. 10 methods, GPU-validated, multi-GPU planner. Compress KV cach…

→repo

notwitcheer/llm-bench-rig

Dual-engine (llama.cpp + vLLM) LLM benchmarking pipeline for GGUF & safetensors on NVIDIA GPUs — speed, quality, live dashboard, publishable cards.

→