It enables something similar to unified memory. I've got a 5060 (16GB) card and 96GB of DDR5.
I can run qwen3.5-122b int4 at 25 tok/sec. And now it even does image ingestion!
I've been bulk transliterating and translating foreign-language books into English. And all completely local.
The difference is that it uses safetensors, not GGUFs. But it does dynamically requant to int4, int8 or bf16.
But go try it out now with a 35B model on your current hardware.
Right now I have qwen3.6-35B-A3B loaded at int8: 128k context, 2.5GB KV cache, thinking enabled.
It's using 11.5GB of VRAM and 42GB of system RAM.
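
If you want to sanity-check whether a model is worth trying on your box, here is a rough back-of-envelope sketch in Python. It is not anything the tool itself does; the function names are made up for illustration, and it only counts the weights (parameters times bits per weight), so KV cache, activations and runtime overhead come on top.

```python
# Back-of-envelope weight size for a quantized model.
# Assumption: weights only; KV cache, activations and runtime overhead are ignored.
# weight_gb and fits are hypothetical helper names, not part of any real tool.

BITS_PER_WEIGHT = {"int4": 4, "int8": 8, "bf16": 16}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight size in GB (10^9 bytes) for a given precision."""
    return params_billion * BITS_PER_WEIGHT[dtype] / 8


def fits(params_billion: float, dtype: str, vram_gb: float, ram_gb: float):
    """Return (fits in VRAM alone, fits in VRAM plus system RAM)."""
    need = weight_gb(params_billion, dtype)
    return need <= vram_gb, need <= vram_gb + ram_gb


if __name__ == "__main__":
    for name, params, dtype in [("35B", 35, "int8"), ("122B", 122, "int4")]:
        gpu_only, combined = fits(params, dtype, vram_gb=16, ram_gb=96)
        print(f"{name} {dtype}: ~{weight_gb(params, dtype):.0f} GB weights, "
              f"GPU only: {gpu_only}, GPU + system RAM: {combined}")
```

By this estimate neither model's weights fit in 16GB of VRAM alone, but both fit once system RAM is counted, which is the whole point of the semi-unified setup.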
I don't want to oversell. All-GPU would be faster, but creating a semi-unified system is deffo a game changer for me.