CUDA
34 items across the graph, tagged with CUDA.
A high-throughput and memory-efficient inference and serving engine for LLMs
SGLang is a high-performance serving framework for large language models and multimodal models.
Open3D: A Modern Library for 3D Data Processing
LMCache: Supercharge Your LLM with the Fastest KV Cache Layer
Containers for machine learning
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks fo…
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
cuML - RAPIDS Machine Learning Library
A retargetable MLIR-based machine learning compiler and runtime toolkit.
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwel…
How to optimize some algorithms in CUDA.
PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
Large-scale LLM inference engine
Ultrafast serverless GPU inference, sandboxes, and background jobs
TT-NN operator library and TT-Metalium low-level kernel programming model.
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)
Open Source Continuous Inference Benchmark Research Platform — Kimi K2.7-Code, MiniMax M3, DeepSeekv4, GLM5 - GB200 NVL72 vs MI355X vs B200 vs GB300 NVL72 & soo…
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form bui…
Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM
Graphics Processing Units Molecular Dynamics
cuVS - a library for vector search and clustering on the GPU
Pure Rust + CUDA LLM inference engine — no PyTorch, OpenAI-compatible, serves Qwen3 to Kimi-K2
Persist and reuse the KV cache to speed up your LLM.
Auto-tuned launcher for GGUF models on llama.cpp / ik_llama.cpp — OpenAI-compatible server with multi-GPU tensor-split, MoE expert placement, measured flag tuni…
ONNX Runtime Server: a server that provides TCP and HTTP/HTTPS REST APIs for ONNX inference.
A lightweight runtime health check for PyTorch training runs.
Sleek, mobile-friendly web UI for NVIDIA LocateAnything-3B — open-vocabulary object detection & grounding on your own GPU, via one docker compose up.
(GPU accelerated) Multi-arch (linux/amd64, linux/arm64/v8) JupyterLab Python docker images. Please submit Pull Requests to the GitLab repository. Mirror of
(GPU accelerated) Multi-arch (linux/amd64, linux/arm64/v8) Data Science dev containers for R, Python, Julia and MAX/Mojo
Dashboard for InferenceX™, Open Source Continuous Inference
From-scratch C++/CUDA inference engine for the NVIDIA RTX 5090 (sm_120a) — the best single-GPU backend for agentic AI: tool calling, long-context loops, reasoni…
Native Windows build of vLLM 0.24.0 - no WSL, no Docker. Python 3.13 + CUDA 12.8 + PyTorch 2.11 cu128 for RTX 30/40/50-series, pre-built wheel, Windows patchset…
Unified KV cache compression for LLM inference — TurboQuant, IsoQuant, PlanarQuant, TriAttention. 10 methods, GPU-validated, multi-GPU planner. Compress KV cach…
Dual-engine (llama.cpp + vLLM) LLM benchmarking pipeline for GGUF & safetensors on NVIDIA GPUs — speed, quality, live dashboard, publishable cards.
