Topic

CUDA

34 items across the graph, tagged with CUDA.

From the graph · 34

repo
vllm-project/vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

repo
sgl-project/sglang

SGLang is a high-performance serving framework for large language models and multimodal models.

repo
isl-org/Open3D

Open3D: A Modern Library for 3D Data Processing

repo
LMCache/LMCache

LMCache: Supercharge Your LLM with the Fastest KV Cache Layer

repo
replicate/cog

Containers for machine learning

repo
catboost/catboost

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks fo…

repo
InternLM/lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

repo
rapidsai/cuml

cuML - RAPIDS Machine Learning Library

repo
iree-org/iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.

repo
NVIDIA/TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwel…

repo
BBuf/how-to-optim-algorithm-in-cuda

How to optimize some algorithms in CUDA.

repo
pytorch/TensorRT

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT

repo
dphnAI/aphrodite-engine

Large-scale LLM inference engine

repo
beam-cloud/beta9

Ultrafast serverless GPU inference, sandboxes, and background jobs

repo
tenstorrent/tt-metal

TT-NN operator library, and TT-Metalium low-level kernel programming model.

repo
uccl-project/uccl

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)

repo
SemiAnalysisAI/InferenceX

Open Source Continuous Inference Benchmark Research Platform — Kimi K2.7-Code, MiniMax M3, DeepSeekv4, GLM5 - GB200 NVL72 vs MI355X vs B200 vs GB300 NVL72 & soo…

repo
NVIDIA/raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form bui…

repo
jmaczan/tiny-vllm

Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM

repo
brucefan1983/GPUMD

Graphics Processing Units Molecular Dynamics

repo
NVIDIA/cuvs

cuVS - a library for vector search and clustering on the GPU

repo
openinfer-project/openinfer

Pure Rust + CUDA LLM inference engine — no PyTorch, OpenAI-compatible, serves Qwen3 to Kimi-K2

repo
ModelEngine-Group/unified-cache-management

Persist and reuse KV Cache to speedup your LLM.

repo
raketenkater/ggrun

Auto-tuned launcher for GGUF models on llama.cpp / ik_llama.cpp — OpenAI-compatible server with multi-GPU tensor-split, MoE expert placement, measured flag tuni…

repo
kibae/onnxruntime-server

ONNX Runtime Server: a server that provides TCP and HTTP/HTTPS REST APIs for ONNX inference.

repo
traceopt-ai/traceml

A lightweight runtime health check for PyTorch training runs.

repo
gammahazard/locate-anything

Sleek, mobile-friendly web UI for NVIDIA LocateAnything-3B — open-vocabulary object detection & grounding on your own GPU, via one docker compose up.

repo
b-data/jupyterlab-python-docker-stack

(GPU accelerated) Multi-arch (linux/amd64, linux/arm64/v8) JupyterLab Python docker images. Please submit Pull Requests to the GitLab repository. Mirror of

repo
b-data/data-science-devcontainers

(GPU accelerated) Multi-arch (linux/amd64, linux/arm64/v8) Data Science dev containers for R, Python, Julia and MAX/Mojo

repo
SemiAnalysisAI/InferenceX-app

Dashboard for InferenceX™, Open Source Continuous Inference

repo
kekzl/imp

From-scratch C++/CUDA inference engine for the NVIDIA RTX 5090 (sm_120a) — the best single-GPU backend for agentic AI: tool calling, long-context loops, reasoni…

repo
aivrar/vllm-windows-build

Native Windows build of vLLM 0.24.0 - no WSL, no Docker. Python 3.13 + CUDA 12.8 + PyTorch 2.11 cu128 for RTX 30/40/50-series, pre-built wheel, Windows patchset…

repo
aivrar/multi-turboquant

Unified KV cache compression for LLM inference — TurboQuant, IsoQuant, PlanarQuant, TriAttention. 10 methods, GPU-validated, multi-GPU planner. Compress KV cach…

repo
notwitcheer/llm-bench-rig

Dual-engine (llama.cpp + vLLM) LLM benchmarking pipeline for GGUF & safetensors on NVIDIA GPUs — speed, quality, live dashboard, publishable cards.
