
I’d like to note that this model’s per-token parameter usage is low enough (13B) to run smoothly at high quality on a 3090 while beating GPT-3.5 on HumanEval and sporting a 32k context.

3090s are consumer grade and common on gaming rigs. I’m hoping game devs start experimenting with locally deployed Mixtral in their games, e.g. something like Civ but with each leader powered by an LLM.



You can also run Mixtral, at a decent token rate, on a post-2020 Apple MacBook Pro (M1/M2/M3) with 32GB+ of RAM. 16GB of RAM also works, sort of OK, at what I suspect is the same quantization a 3090 would be using, but I do notice a difference in quality from the quantization. On my M2 Pro, the token rate and intelligence feel like GPT-3.5 Turbo. This is the first model I've started actually using (vs. playing around with for the love of the tech) instead of GPT-3.5.

An Apple M2 Pro with 32GB of RAM is in the same price range as a gaming PC with a 3090, but it's another example of normal people with moderately high performance systems "accidentally" being able to run a GPT-3.5 comparable model.

If you have an Apple meeting these specs and want to play around, LLM Studio is open source and has made it really easy to get started: https://lmstudio.ai/

I hope to see a LOT more hobby hacking as a result of Mixtral and successors.


I don't think it's true that LM Studio is open source. Maybe I'm missing something?


LM Studio (which they linked) is definitely not open source, and doesn't even offer a pricing model for business use.

LLMStudio is, but I suspect that was a typo in their comment. https://github.com/TensorOpsAI/LLMStudio


I have so far run it on my M1 MacBook using llamafile [1] and found it to be great.

Is there any speed/performance/quality/context size/etc. advantage to using LLM Studio or any of the other *llama tools that require more setup than downloading and running a single llamafile executable?

[1] https://github.com/Mozilla-Ocho/llamafile/


How did you get Mixtral to run on a 32GB M1?

I tried using Ollama on my machine (same specs as above) and it told me I needed 49GB of RAM minimum.


I'm using:

  ollama run dolphin-mixtral:8x7b-v2.5-q3_K_S


That runs on 32GB? The original Mixtral q3 wouldn’t run for me. Maybe the dolphin-tuned version is smaller?

EDIT: I just checked, it runs great, thanks.


LM Studio, sadly, is not open source.


Google tells me that the RTX 3090 is priced between US$1,480 and $1,680.

You can buy a whole PC for that. I refuse to believe that a GPU priced that highly is "consumer grade" and "common".

Are there any GPUs that are good for LLMs or other genAI that aren't absurdly priced? Or ones specifically designed for AI rather than gaming graphics?



I recently purchased a 3090 on Reddit’s hardwareswap community for $550. New GPUs are pricey right now because of shortages, but if you look around a bit it can be affordable.


Gamers and LLM/AI/ML GPU users do not find that absurdly priced. Absurdly priced in our world is $15,000, so your perceptions are off by about an order of magnitude.


I can assure you a $1,500 graphics card is a big luxury for most gamers.

https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...

3090 isn't even in the top 30.


Then again, the Apple II cost around $6.5k in today's dollars. [0] My hunch is that people caring less for tricking out their computers for gaming is about people not being all that interested in being able to have the top of the line graphics settings enabled when playing AAA games. But I think the history of PCs and gaming very much proves that even normal consumers are willing to spend the big bucks on technology when it enables something truly new.

[0]: https://en.wikipedia.org/wiki/Apple_II


I would go even further - anytime I look at the hardware survey, I am surprised by the anemic and dated hardware people are running. Most people who game are not building a custom box anymore.


I mean I was, but turns out a well built box lasts a while.


Actually it should be #23 on that list but is split up into two items with roughly 0.6% each. Seems to be a bug.

Search the page for 3090 and see for yourself, it's on the list twice.


To be honest I didn't look that hard


Aren’t there others in that list above the 3090 that are even more expensive?


To be fair, I got a card second hand to play with, and it was only £700 (~$900) and it came with manufacturer warranty. It was a bit of a gamble, but the 24GB of VRAM has been a godsend for experimenting with LLMs. And playing video games at 4K!


Nah. It's about half that. You can pick up a used 4090 for that much.


tl;dr, no, especially since AMD is lagging behind.

Apple is the one doing the best in terms of making consumer-friendly hardware that can perform AI/ML tasks...but that involves a different problem regarding video games.


It's worth noting that the 4-bit quants can run on CPU at roughly reading speed, which should unlock many use cases, especially if we can precompute some of the results asynchronously.


That assumes full CPU utilization, with no other tasks heavily using the CPU.

In the case of high-end video games, that's unlikely.


Resource constraints would be a concern, yeah, so if you were developing a game featuring LLMs (which would, at this point in their development and maturity, be a gimmick) you would keep that in mind and keep other demands on resources low.


True, yet many games are effectively single-threaded anyways.

The bigger problem is memory capacity and bandwidth, but I suspect folks will eventually figure out some sort of QoS setup to let the system crunch LLMs using otherwise unused/idle resources.


Although in my testing, the reasoning of the 4-bit quants was not nearly as good.


I've been working with local models as agents, and anyone interested in trying this needs to know about llama.cpp's "grammars" feature. You can force the model's output to conform to a specific structure, which is useful not only for ensuring that you receive e.g. valid JSON output but also for more specific rules like "if you choose to do x, you must also provide y", which can be great for influencing its thinking. (E.g. an actor who's planning ahead might be required to respond with three of any of the five W's (its choice which three), but then it gets to be free-form inside the JSON string values, which can be used as context for a following selection from a restricted set of actions; or a model might have the option of asking for more time to think at the end of its response, but if it doesn't then it needs to specify its next action.) This doesn't impact generation speed AFAICT and can be used in very creative ways, but results can still need to be regenerated if they're truncated, and I had to write a function to stop immediately when the valid JSON object is closed (i.e. at the end) or when more than about five newlines are generated in a row. This will vary by model.
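For anyone wanting to try the stop condition described above, here is a minimal Python sketch. It is not the commenter's actual function: the name `find_stop_index` and the `max_newlines` parameter are made up for illustration, and it assumes you accumulate the generated text yourself and re-check it after each new chunk.

```python
from typing import Optional


def find_stop_index(text: str, max_newlines: int = 5) -> Optional[int]:
    """Return the index just past the point where generation should stop.

    Stops when the first top-level JSON object closes, or when more than
    `max_newlines` newline characters appear in a row. Returns None if
    neither condition has been met yet. (Hypothetical helper, not part of
    llama.cpp.)
    """
    depth = 0            # current nesting depth of { }
    in_string = False    # inside a JSON string literal?
    escaped = False      # previous char was a backslash inside a string?
    newline_run = 0      # consecutive newlines seen so far

    for i, ch in enumerate(text):
        # Runaway-newline guard, checked regardless of JSON state.
        if ch == "\n":
            newline_run += 1
            if newline_run > max_newlines:
                return i + 1
        else:
            newline_run = 0

        if in_string:
            # Braces inside strings must not affect the depth counter.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue

        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                return i + 1  # the top-level object just closed

    return None


# Usage: trim whatever the model emitted after the object closed.
generated = '{"action": "wait", "why": "need {more} time"} and then some rambling'
cut = find_stop_index(generated)
clean = generated if cut is None else generated[:cut]
```

Tracking string state matters because free-form string values can contain braces, which a naive bracket counter would miscount.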


The VRAM usage is closer to that of a 47B model. Although only 2 experts are used at a time for inference, all of the experts need to be loaded in memory to run it.
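A rough back-of-the-envelope sketch of that point, in Python. The parameter counts (about 46.7B total, about 12.9B active per token) and the bits-per-weight figures are assumptions based on published numbers for Mixtral 8x7B and the GGUF block formats; the estimate covers weights only and ignores the KV cache and runtime buffers.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed figures, not taken from this thread.
TOTAL_PARAMS = 46.7e9    # every expert must be resident in memory
ACTIVE_PARAMS = 12.9e9   # parameters actually used per token (2 of 8 experts)

# Effective bits per weight, including per-block scale factors.
BITS_PER_WEIGHT = {"fp16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}

for name, bpw in BITS_PER_WEIGHT.items():
    resident = weight_memory_gb(TOTAL_PARAMS, bpw)
    touched = weight_memory_gb(ACTIVE_PARAMS, bpw)
    print(f"{name}: ~{resident:.1f} GB resident, ~{touched:.1f} GB touched per token")
```

Under these assumptions the memory footprint is set by the total parameter count, while the active count mostly affects speed. That would explain why it generates like a 13B model but does not fit where a 13B model fits.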


Confirmed. Currently running Mixtral 8x7B GGUF (Q8_0) on a MacBook Pro M1 Max with 64GB of RAM, and RAM usage is sitting at 48.8 GB.


How many t/s?


Around 15 - 20 t/s


Thank you, got the same build M1 Max during Christmas B&H sale and can confirm it's amazing for running local LLMs.


The average gamer doesn't have a 3090-equivalent, or even an Nvidia GPU.

Running LLMs locally to create custom dialogue for games is still years away.


>Running LLMs locally to create custom dialogue for games is still years away

I have thought about this. You need a small LLM for a game; you do not need it to know about movies, music bands, history, coding and all the text on the internet. I am thinking we need a small model similar to phi-2, trained only on basic stuff, which you would then train on the game world's lore. The game would also use some "simpler" graphics. (We had good-looking game graphics decades ago, so I do not think you would be limited to text adventures or 2D graphics; you would just need simpler and optimized graphics. It is always interesting when someone shows an Unreal demo that uses more RAM/VRAM for a simple demo level than a giant game like GTA5.)


This is wrong in many ways. First off you don't necessarily need a high-end consumer GPU, look at llama.cpp.

Or even better look at something like phi-2. It's likely to go even lower. I'm sure there are people here who can detail more specifics.

It's 1/24 and we're already there, more or less.


VR isn't pragmatically accessible to the average gamer due to hardware requirements and the necessity of setting up the right physical environment but there are still VR games.


There are a few, after almost a decade of VR being a thing.


How did you arrive at "almost a decade"? There's been VR stuff since at least the late 1980s. On the flip side you could say it wasn't "a thing" until just a few years ago. Or that it isn't "a thing" yet.


I think it's pretty obvious that I'm talking about post-Oculus Rift VR and the crop of games that use controllers similar to that device.


Why couldn't this be handled remotely, as part of a subscription to the game?


> or even an Nvidia GPU.

What do you mean? Most gamers do have an Nvidia GPU.

Edit: unless you're talking about mobile gamers, and not PC gamers?


To wit, the steam report says 74% of its users have Nvidia.

https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...


You cannot currently run Mixtral with a 32k context on a 3090, unless I am wrong? I think the largest context I was able to reproduce was around 1500 with 2 or 3 bit; I would have to look at my notes.
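To put a rough number on the context cost, here is a small Python sketch of the KV cache size. The architecture values (32 layers, 8 key/value heads, head dimension 128) are assumptions taken from the published Mixtral 8x7B configuration, and an unquantized fp16 cache is assumed; actual usage depends on the runtime and its extra buffers.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int = 32,        # assumed Mixtral 8x7B value
    n_kv_heads: int = 8,       # assumed (grouped-query attention)
    head_dim: int = 128,       # assumed
    bytes_per_value: int = 2,  # fp16 cache, no cache quantization
) -> int:
    """Bytes needed to cache keys and values for n_tokens of context."""
    # The leading factor of 2 accounts for storing both a key and a value.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


GIB = 1024 ** 3
for ctx in (1_500, 8_192, 32_768):
    print(f"{ctx:>6} tokens: {kv_cache_bytes(ctx) / GIB:.2f} GiB of KV cache")
```

Under these assumptions a full 32k context needs about 4 GiB on top of the weights, so on a 24GB card it would only fit alongside a fairly aggressive quantization of the weights.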



