Reddit r/LocalLLaMA · Published 4d ago
Apparently you can skip entire transformer blocks at load time with minimal performance impact
The benefit: it's another trick for fitting a model that otherwise wouldn’t fit on your hardware. People currently rely on quantization for this, and layer skipping is just another tool for the same purpose (the two can also be used together). Following recent (very cool) papers, I implemented this as a --skip-layers flag in a llama.cpp fork, so it just never instantiates the blocks you tell it to skip. Bake-time pruning already exists (--prune-layers, mer
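
To make the load-time idea concrete, here's a toy Python sketch. It is not the fork's code: `make_block`, `load_blocks`, and `forward` are made-up names, and the "blocks" are dummy scalings standing in for real attention/MLP layers. It only illustrates the mechanism described above: skipped indices are never allocated, and since transformer blocks sit on a residual stream, the remaining ones can simply run back to back.

```python
from typing import Callable, Dict, List, Set

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_block(index: int) -> Block:
    """Hypothetical stand-in for allocating one block's weights (the costly step)."""
    scale = 0.01 * (index + 1)

    def block(x: Vector) -> Vector:
        # Residual form: input plus a small update computed from the input.
        return [v + scale * v for v in x]

    return block


def load_blocks(n_layers: int, skip: Set[int]) -> Dict[int, Block]:
    """Instantiate only the blocks whose index is not in `skip`."""
    return {i: make_block(i) for i in range(n_layers) if i not in skip}


def forward(x: Vector, blocks: Dict[int, Block]) -> Vector:
    # Skipped indices are absent, so the hidden state flows past them unchanged.
    for i in sorted(blocks):
        x = blocks[i](x)
    return x


blocks = load_blocks(n_layers=8, skip={3, 4})
print(sorted(blocks))  # [0, 1, 2, 5, 6, 7]
print(forward([1.0, 2.0], blocks))
```

The memory saving in this sketch comes from `make_block` never being called for skipped indices. Whether output quality holds up depends on the model and on which blocks you drop, which the toy doesn't capture.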
