Topic cluster · 8 items

efficiency

model

Nano-Refuse-0.4B

A tiny safety classifier for fast content filtering.

paper

Speculative decoding with draft models

Accelerating generation by drafting tokens with a small model.

news

Breakthrough in long-context efficiency announced

A new attention scheme cuts memory use for very long inputs.

repo

quant-kit

Post-training quantization tools for transformers.

glossary_term

Quantization

Shrinking a model by storing its weights at lower precision.

paper

Quantization at 1.58 bits

Ternary-weight models that retain most of full-precision quality.

model

Whisper-Lite

A compact speech-to-text model for on-device use.

tool

QuantBench

A one-click quantization and benchmarking tool.
