News · Reddit r/LocalLLaMA · Trust 58 · Community · Published 4d ago

Apparently you can skip entire transformer blocks at load time with minimal performance impact

The benefit: this is another trick for fitting a model that otherwise wouldn't fit on your hardware. People currently rely on quantization for that, and skipping layers is one more tool for the same purpose (the two can be used together as well). Following recent (very cool) papers, I implemented this as a --skip-layers flag in a llama.cpp fork, so it just never instantiates the blocks you tell it to skip. Bake-time pruning already exists (--prune-layers, mer
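To make the "never instantiates the blocks" idea concrete, here is a toy sketch in plain Python. It is not the fork's code: the `"1,3-4"` spec format, `parse_skip_layers`, `load_model`, and the `Block` class are all made up for illustration, and the post does not say what syntax the real --skip-layers flag accepts. The one thing it relies on is that transformer blocks are residual, so dropping a block leaves the hidden state's shape unchanged and the remaining blocks still chain together.

```python
from typing import Dict, List, Set

Vector = List[float]


def parse_skip_layers(spec: str) -> Set[int]:
    """Parse a hypothetical spec such as "1,3-4" into {1, 3, 4}."""
    skipped: Set[int] = set()
    for part in (p.strip() for p in spec.split(",")):
        if not part:
            continue
        if "-" in part:
            lo, hi = (int(v) for v in part.split("-", 1))
            skipped.update(range(lo, hi + 1))  # ranges are inclusive
        else:
            skipped.add(int(part))
    return skipped


class Block:
    """Toy residual block: output = x + scale * x (stand-in for attention + MLP)."""

    def __init__(self, index: int, scale: float) -> None:
        self.index = index
        self.scale = scale  # stands in for the block's weights

    def __call__(self, x: Vector) -> Vector:
        return [v + self.scale * v for v in x]


def load_model(checkpoint: Dict[int, float], skip: Set[int]) -> List[Block]:
    """Build only the blocks that are not skipped.

    Skipped blocks are never constructed, so their weights never
    occupy memory. This is the load-time part of the idea.
    """
    return [Block(i, checkpoint[i]) for i in sorted(checkpoint) if i not in skip]


def forward(blocks: List[Block], x: Vector) -> Vector:
    """Run the hidden state through the surviving blocks in order."""
    for block in blocks:
        x = block(x)
    return x


# Toy "checkpoint" with 6 blocks; skip blocks 1, 3 and 4 at load time.
checkpoint = {i: 0.1 for i in range(6)}
blocks = load_model(checkpoint, parse_skip_layers("1,3-4"))
hidden = forward(blocks, [1.0, 2.0])
```

After this runs, `blocks` holds only the blocks with indices 0, 2 and 5, and `hidden` has the same length as the input. The difference from bake-time pruning is that the checkpoint itself is untouched, so the same file can be loaded with different skip sets.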
