Localllama
50 items across the graph · 49 news stories — tagged with Localllama.
Latest news
Follow-up: DeepSeek V4 Flash on 2x RTX PRO 6000 finishes real coding tasks faster than Sonnet and Opus, at about Sonnet quality
More news · 48
Dario's favorite model
found this here
Portugal just released their own LLM Amalia (9B)!
I didn't see any mention here. Source:
Longcat 2 model weights have been published
https://huggingface.co/meituan-longcat/LongCat-2.0-INT8 https://huggingface.co/meituan-longcat/LongCat-2.0-FP8 submitted by /u/RhubarbSimilar1683
Uh.. Honey, how do you feel about takeout?
- 2x RTX Pro 6000 Max-Q (96GB)
- 8x RTX 3090 (24GB)
- 2x RTX 5090 (32GB)
My DeepSeek V4 Pro at home got faster again
You may remember my earlier posts about DeepSeek V4 Pro at home. Today I checked the performance in my llama.cpp branch that contains various fixes and optimizations not yet included in mai
Micro-World - Action-controlled Interactive world model - AMD
Talking with Gemma 4 31B!
Hi! I'm Andi from Hugging Face. This is a fully open-source and free to test/pul
ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving
In
Pay attention: a few chats waiting in tray reserve 1GB VRAM for themselves.
If an application uses a Web-based interface and "hardware acceleration", it constructs its frame in VRAM and sometimes keeps it reserved even if the app is minimised. On my Linux machine, Discord is the worst offender, reserving 450 MB VRAM. Steam takes 200 MB, Telegram 150 MB,…
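As a rough tally of the figures quoted above, here is a minimal sketch. The numbers are the ones reported in the post; the names `reserved_mb` and `total_reserved` are illustrative, and because the post's list is cut off, the result is only a lower bound on the real total.

```python
# Per-app VRAM reservations as reported in the post, in MB.
# The original list is truncated, so this undercounts the real total.
reserved_mb = {"Discord": 450, "Steam": 200, "Telegram": 150}


def total_reserved(apps):
    """Sum the VRAM (in MB) held by idle GUI apps."""
    return sum(apps.values())


total = total_reserved(reserved_mb)
share_of_1gb = total / 1024  # fraction of the 1GB mentioned in the headline
print(f"{total} MB reserved, {share_of_1gb:.0%} of 1 GB")
```

The three named apps alone account for most of the 1GB in the headline, which is consistent with the remaining apps in the truncated list making up the rest.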
Software developers appreciation post
I'm on the bus to work and just felt like I don't see enough gratitude for the men, women, children, and people who contribute their time and effort on open projects. Just last night I saw I've been sleeping while vLLM developers are releasing 3 new major releases, and not only tha…
Gemma 4 WebGPU Kernels 255 tok/s by x/@xenovacom
We need more of this, 100+ T/s
They fit! Mostly.... 2x 3090, Thermaltake Core p3
Got another 3090, had to print a bracket to angle t
SenseNova-U1-8b-MoT-Infographic-V2 (released yesterday) - An open source SOTA beast for infographic design and image editing.
Making LLMs Better at Creative Writing using Entropy
ZCode: New Agentic Code Editor from the Makers of GLM
How to improve RAM offload?
I have only 12GB VRAM (RTX 3060) but have enough RAM to run Qwen3.6 27B Q4 with offload. Something tells me that it won't achieve m
Deepseek V4 Flash 2, 3 and 4 bits GGUFs
submitted by /u/tarruda
Open Models - June 2026
After overwhelming April, OK
Why can I never stop the looping?
I constantly see people here saying Qwen3.6 35B is amazing, Ornith V1 is amazing, but I cannot use these models at all without severe looping problems. What the hell am I doing wrong?? Temp 0.6, top_p 0.95, top_k 20, min_p 0.05, rep_penalty 1.1. Using Q6 of both models with K/V at Q8,…
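For readers unfamiliar with what these sampler settings do, here is a minimal pure-Python sketch of temperature scaling followed by top_k, top_p, and min_p filtering. It is an illustration under assumptions, not any particular backend's implementation: the function names are hypothetical, real inference engines may apply the filters in a different order, and the repetition penalty is not modeled.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_tokens(probs, top_k=20, top_p=0.95, min_p=0.05):
    """Return {token_index: renormalized_prob} after top_k, top_p, min_p."""
    # top_k: keep only the k most likely tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    order = order[:top_k]
    # top_p: keep the shortest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # min_p: drop tokens below min_p times the top token's probability.
    threshold = min_p * probs[order[0]]
    kept = [i for i in kept if probs[i] >= threshold]
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Toy 4-token vocabulary, using the settings quoted in the post.
probs = softmax([2.0, 1.0, 0.0, -5.0], temperature=0.6)
candidates = filter_tokens(probs, top_k=20, top_p=0.95, min_p=0.05)
print(sorted(candidates))
```

With a low temperature and these cutoffs, only a couple of candidates survive on a peaked distribution like this toy one, which is one reason such settings can make repeated continuations more likely once a model starts to loop.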
README_EN.md · openpangu/openPangu-2.0-Flash at main
1. Introduction
HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization (from the Qwen team)
Well.. it's a step up from nonstop bot spam I guess
submitted by /u/ForsookC
Krea-2-Turbo Image Model - Easy to be fully uncensored, but it can also EDIT Images!
I've been super impressed with Krea-2-Turbo. It can generate high quality images in ~3 seconds. The quality is quite good compared to other local AI image gen models. Now, I don't want to make you watch or click a YouTube video, so I'll just give these clear instructions on how…
DeepSeek V4, PR merged into llama.cpp!
The PR: https://github.com/ggml-org/llama.cpp/pull/24162 Time to git pull, cmake, and download GGUFs! On your marks, get set, go! submitted by /u/Squik67
InternScience/Agents-A1 · Hugging Face
Unbelievable benchmarks for a 35B MoE, somebody verify.
ascend-tribe/openPangu-2.0-Flash (They haven't uploaded it to Hugging Face yet)
https://ai.gitcode.com/ascend-tribe/openPangu-2.0-Flash openPangu-2.0-Flash is an MoE model trained on Ascend. The model has 92B total parameters and 6B activated parameters. Its context length is 512k. The total pretraining data contains 34T tokens. During post-training, openPan…
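To put the 92B total / 6B activated figures in perspective, here is a back-of-the-envelope sketch of weight storage at a few bit widths. The formula (parameters × bits ÷ 8) ignores KV cache, activations, and quantization metadata, and the bit widths are illustrative assumptions, not formats the release is stated to ship in. The helper name `weight_gb` is hypothetical.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


TOTAL_B, ACTIVE_B = 92, 6  # parameter counts quoted in the excerpt above

for bits in (16, 8, 4):  # illustrative precisions, not confirmed releases
    print(f"{bits}-bit: ~{weight_gb(TOTAL_B, bits):.0f} GB of weights")

active_fraction = ACTIVE_B / TOTAL_B
print(f"activated per token: {active_fraction:.1%} of all parameters")
```

All 92B parameters still have to be stored somewhere, but only about 6.5% of them are used for each token, which is why MoE models of this shape are often considered candidates for RAM offload.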
Bartowski has delivered DS4 GGUF
Looking forward to comparing with Antirez's DS4 imatrix https://huggingface.co/bartowski/DeepSeek-V4-Flash-GGUF submitted by /u/challis88ocarina
It’s time, Sam, it’s time.
Mostly /s but, I mean….. I’m no CEO…. but it seems like this would be the absolute perfect time to drop a super powerful GPT-OSS
nvidia/Qwen3.6-27B-NVFP4 just dropped
https://huggingface.co/nvidia/Qwen3.6-27B-NVFP4 submitted by /u/vanbukin
CPU-only GLM 5.2: Epyc and 512GB RAM
This is just a preview of some content I'm putting together to share with you all. I have a server I've p
Apparently you can skip entire transformer blocks at load time with minimal performance impact
The benefit is another trick to allow fitting a model that wouldn’t fit on your hardware otherwise. People currently rely on quantization, and this is just another tool that can be used for that purpose (and they can be used together as well). Following recent (very cool) papers,…
DeepSeek V4 official version will launch in mid-July
GLM 5.2 Q1_S vs Qwen 27B Q8
TL;DR: GLM-5.2 Q1_S beats Qwen 3.6 27B Q8, both run at KV Q8. Edit: GLM run a K & V Q8,
Script to monitor llama.cpp and analyze memory usage
My goal has always been to be productive with commodity hard
NPC Engine Using Local Models
I’ve been working on a game-agnostic NPC engine/backend based prett
The number 1 public enemy of open-source.
Dario's args: "Openso
How many of you use Q1 or Q2 quants of big models (100-250B)? How is it?
Sharing popular (also recent) models for reference:
151-250B: DeepSeek-V4-Flash, Step-3.X-Flash, Command-a-plus-05-2026, Laguna-M.1, MiniMax-M2.X, Qwen3-235B-A22B
100-150B: GLM-4.5-Air, Qwen3.5-122B-A10B, NVIDIA-Nemotron-3-Super-120B-A12B, Mistral-Small-4-119B-2603, Devstral-2-123B-Instr…
We built a calibration-aware Q4_K_M quant of Qwen3.5 0.8B that recovers 96.5% of the BF16 gap vs pure llama.cpp Q4_K_M (SpectralQuant)
Mythos was the first, now GPT-5.6
https://techcrunch.com/2026/06/26/openai-limits-gpt-5-6-rollout-after-government-request-says-restrictions-shouldnt-be-the-norm/ Either it's hype before the IPO, or they have just shot themselves in the foot. This is pretty much it for more advanced online models. Local LLM is one of the…
Finally.. my rig is maxed out
Got all the parts before the crazy price increase except for the RTX Pro 5k! Was saving up to order an RTX Pro 6000 in the US and i
If it doesn't make my PP better, I don't want it
Highlights: 4 x 48GB modded 4090s - 192GB VRAM
Running GLM5.2 on budget hardware < $2500.
Too many times I hear people whine about not being able to run SOTA models or claim it would require $50k, or $100k.
https://www.ebay.com/itm/398079051468 Epyc Motherboard & CPU - $460
https://www.ebay.com/itm/206374955959 P40 24GB - $230, get 2 - $460
https://www.ebay.com/itm/3184…
Koboldcpp v1.116 released
submitted by /u/Fcking_Chuck
[Research] JetSpec: Speculative Decoding with Parallel Tree Drafting Enables up to 9.64x Lossless LLM Inference Speedup with more than 1000TPS
Local LLM Peeps
I am 80% done with a harness that works for local and API but is local first. The harness has some interesting logic around multiple agents which I’m holding back on until it is open source on GitHub. I have been local for 6 months and built out EVERYTHING I could think of to mak…
audio.cpp: 12 audio models (Qwen3-TTS, PocketTTS, VeVo2 etc) in 1 C++/ggml runtime - TTS up to 5x faster than Python on CUDA
US Govt to individually approve who gets GPT 5.6.
submitted by /u/AtlanticHM
Why do people keep investing in Intel for AI?
If you get a good deal on some Xeons with a lot of memory bandwidth, or a ch