You may remember my earlier posts about DeepSeek V4 Pro at home. Today I checked the performance in my llama.cpp branch that contains various fixes and optimizations not yet included in mai

News · Reddit r/LocalLLaMA · Live · 4h ago

Micro-World - Action-controlled Interactive world model - AMD

News · Reddit r/LocalLLaMA · Live · 8h ago

Talking with Gemma 4 31B!

Hi! I'm Andi from Hugging Face. This is a fully open-source and free to test/pul

News · Reddit r/LocalLLaMA · Live · 8h ago

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

News · Reddit r/LocalLLaMA · Live · 8h ago

Pay attention: a few chats waiting in tray reserve 1GB VRAM for themselves.

If an application uses a Web-based interface and "hardware acceleration", it constructs its frame in VRAM and sometimes keeps it reserved even if the app is minimised. On my Linux machine, Discord is the worst offender, reserving 450 MB VRAM. Steam takes 200 MB, Telegram 150 MB,…

News · Reddit r/LocalLLaMA · Live · 20h ago

Software developers appreciation post

I'm on the bus to work and just felt like I don't see enough gratitude for the men, women, children, and people who contribute their time and effort on open projects. Just last night I saw I've been sleeping while vLLM developers are releasing 3 new major releases, and not only tha…

News · Reddit r/LocalLLaMA · Live · yesterday

Gemma 4 WebGPU Kernels 255 tok/s by x/@xenovacom

We need more of this, 100+ T/s

News · Reddit r/LocalLLaMA · Live · yesterday

They fit! Mostly.... 2x 3090, Thermaltake Core p3

Got another 3090, had to print a bracket to angle t

News · Reddit r/LocalLLaMA · Live · yesterday

SenseNova-U1-8b-MoT-Infographic-V2 (released yesterday) - An open source SOTA beast for infographic design and image editing.

News · Reddit r/LocalLLaMA · Live · yesterday

Making LLMs Better at Creative Writing using Entropy

News · Reddit r/LocalLLaMA · Live · 2d ago

ZCode: New Agentic Code Editor from the Makers of GLM

News · Reddit r/LocalLLaMA · Live · 2d ago

How to improve RAM offload?

I have only 12GB VRAM (RTX3060) but have enough RAM to run Qwen3.6 27B Q4 with offload. Something tells me that it won't achieve m

News · Reddit r/LocalLLaMA · Live · 2d ago

Deepseek V4 Flash 2, 3 and 4 bits GGUFs

submitted by /u/tarruda

News · Reddit r/LocalLLaMA · Live · 2d ago

Open Models - June 2026

After overwhelming April, OK

News · Reddit r/LocalLLaMA · Live · 2d ago

Why can i never stop the looping?

I constantly see people here saying Qwen3.6 35B is amazing, Ornith V1 is amazing, but I cannot use these models at all without severe looping problems. What the hell am I doing wrong? Temp 0.6, top_p 0.95, top_k 20, min_p 0.05, rep_penalty 1.1. Using Q6 of both models with K/V at Q8,…

News · Reddit r/LocalLLaMA · Live · 2d ago

README_EN.md · openpangu/openPangu-2.0-Flash at main

1. Introduction

News · Reddit r/LocalLLaMA · Live · 3d ago

HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization (from the Qwen team)

News · Reddit r/LocalLLaMA · Live · 3d ago

Well.. it's a step up from nonstop bot spam I guess

submitted by /u/ForsookC

News · Reddit r/LocalLLaMA · Live · 3d ago

Krea-2-Turbo Image Model - Easy to be fully uncensored, but it can also EDIT Images!

I've been super impressed with Krea-2-Turbo. It can generate high quality images in ~3 seconds. The quality is quite good compared to other local AI image gen models. Now, I don't want to make you watch or click a YouTube video, so I'll just give these clear instructions on how…

News · Reddit r/LocalLLaMA · Live · 3d ago

DeepSeek V4, PR merged into llama.cpp !

The PR: https://github.com/ggml-org/llama.cpp/pull/24162 Everyone go git pull, cmake, and download GGUFs! On your marks, get set, go! submitted by /u/Squik67

News · Reddit r/LocalLLaMA · Live · 3d ago

InternScience/Agents-A1 · Hugging Face

Unbelievable benchmarks for a 35B MoE, somebody verify.

News · Reddit r/LocalLLaMA · Live · 3d ago

ascend-tribe/openPangu-2.0-Flash (They haven't uploaded it to Huggingface yet)

https://ai.gitcode.com/ascend-tribe/openPangu-2.0-Flash openPangu-2.0-Flash is an MoE model trained on Ascend. The model has 92B total parameters and 6B activated parameters. Its context length is 512k. The total pretraining data contains 34T tokens. During Post-training, openPan…

News · Reddit r/LocalLLaMA · Live · 3d ago

Bartowski has delivered DS4 GGUF

Looking forward to comparing with Antirez's DS4 imatrix https://huggingface.co/bartowski/DeepSeek-V4-Flash-GGUF submitted by /u/challis88ocarina

News · Reddit r/LocalLLaMA · Live · 3d ago

It’s time, Sam, it’s time.

Mostly /s but, I mean….. I’m no CEO…. but it seems like this would be the absolute perfect time to drop a super powerful GPT-OSS

News · Reddit r/LocalLLaMA · Live · 3d ago

nvidia/Qwen3.6-27B-NVFP4 just dropped

https://huggingface.co/nvidia/Qwen3.6-27B-NVFP4 submitted by /u/vanbukin

News · Reddit r/LocalLLaMA · Live · 4d ago

CPU-only GLM 5.2: Epyc and 512GB RAM

This is just a preview of some content I'm putting together to share with you all. I have a server I've p

News · Reddit r/LocalLLaMA · Live · 4d ago

Apparently you can skip entire transformer blocks at load time with minimal performance impact

The benefit: it is another trick for fitting a model that wouldn't otherwise fit on your hardware. People currently rely on quantization, and this is just another tool that can be used for that purpose (and they can be used together as well). Following recent (very cool) papers,…

News · Reddit r/LocalLLaMA · Live · 4d ago

DeepSeek V4 official version will be launch on mid-July

News · Reddit r/LocalLLaMA · Live · 4d ago

GLM 5.2 Q1_S vs Qwen 27B Q8

TL;DR: GLM-5.2 Q1_S beats Qwen 3.6 27B Q8, both run at KV Q8. Edit: GLM run at K & V Q8,

News · Reddit r/LocalLLaMA · Live · 4d ago

Script to monitor llama cpp and analyze memory usage

My goal has always been to be productive with commodity hard

News · Reddit r/LocalLLaMA · Live · 4d ago

NPC Engine Using Local Models

I’ve been working on a game-agnostic NPC engine/backend based prett

News · Reddit r/LocalLLaMA · Live · 4d ago

The number 1 public enemy of open-source.

Dario's args: "Openso

News · Reddit r/LocalLLaMA · Live · 5d ago

How many of you do use Q1 or Q2 of Big models(100-250B)? How's it?

Sharing popular (also recent) models for reference:
151-250B: DeepSeek-V4-Flash, Step-3.X-Flash, Command-a-plus-05-2026, Laguna-M.1, MiniMax-M2.X, Qwen3-235B-A22B
100-150B: GLM-4.5-Air, Qwen3.5-122B-A10B, NVIDIA-Nemotron-3-Super-120B-A12B, Mistral-Small-4-119B-2603, Devstral-2-123B-Instr…

News · Reddit r/LocalLLaMA · Live · 5d ago

We built a calibration-aware Q4_K_M quant of Qwen3.5 0.8B that recovers 96.5% of the BF16 gap vs pure llama.cpp Q4_K_M (SpectralQuant)

News · Reddit r/LocalLLaMA · Live · 5d ago

Mythos was the first, now GPT-5.6

https://techcrunch.com/2026/06/26/openai-limits-gpt-5-6-rollout-after-government-request-says-restrictions-shouldnt-be-the-norm/ Either this is hype before the IPO, or they have just shot themselves in the foot. This is pretty much it for more advanced online models. Local LLM is one of the…

News · Reddit r/LocalLLaMA · Live · 5d ago

Finally.. my rig is maxed out

Got all the parts before the crazy price increase except for the rtx pro 5k! Was saving up to order rtx pro 6000 in US and i

News · Reddit r/LocalLLaMA · Live · 5d ago

If it doesn't make my PP better, I don't want it

Highlights: 4 x 48GB modded 4090s - 192GB VRAM

News · Reddit r/LocalLLaMA · Live · 5d ago

Running GLM5.2 on budget hardware < $2500.

Too many times I hear people whine about not being able to run SOTA models or claim it would require $50k, or $100k.
https://www.ebay.com/itm/398079051468 Epyc Motherboard & CPU - $460
https://www.ebay.com/itm/206374955959 P40 24GB - $230, get 2 - $460
https://www.ebay.com/itm/3184…

News · Reddit r/LocalLLaMA · Live · 5d ago

Koboldcpp v1.116 released

submitted by /u/Fcking_Chuck

News · Reddit r/LocalLLaMA · Live · 7d ago

[Research] JetSpec: Speculative Decoding with Parallel Tree Drafting Enables up to 9.64x Lossless LLM Inference Speedup with more than 1000TPS

News · Reddit r/LocalLLaMA · Live · 7d ago

Local LLM Peeps

I am 80% done with a harness that works for local and API but is local first. The harness has some interesting logic around multiple agents which I’m holding back on until it is open source on GitHub. I have been local for 6 months and built out EVERYTHING I could think of to mak…

News · Reddit r/LocalLLaMA · Live · 7d ago

audio.cpp: 12 audio models (Qwen3-TTS, PocketTTS, VeVo2 etc) in 1 C++/ggml runtime — TTS up to 5x faster than Python on CUDA

News · Reddit r/LocalLLaMA · Live · 7d ago

US Govt to individually approve who gets GPT 5.6.

submitted by /u/AtlanticHM

News · Reddit r/LocalLLaMA · Live · 7d ago

Why do people keep investing in Intel for AI?

If you get a good deal on some Xeons with a lot of memory bandwidth, or a ch

From the graph · 1

repo

raketenkater/ggrun

Auto-tuned launcher for GGUF models on llama.cpp / ik_llama.cpp — OpenAI-compatible server with multi-GPU tensor-split, MoE expert placement, measured flag tuni…

Latest news

Follow-up: DeepSeek V4 Flash on 2x RTX PRO 6000 finishes real coding tasks faster than Sonnet and Opus, at about Sonnet quality

More news · 48

Dario's favorite model

Portugal just released their own LLM Amalia (9B)!

Longcat 2 model weights have been published

Uh.. Honey, how do you feel about takeout?

My DeepSeek V4 Pro at home got faster again
