I’d like to note that this model’s parameter usage is low enough (13b) to run smoothly at high quality on a 3090 while beating GPT-3.5 on humaneval and sporting 32k context.
3090s are consumer grade and common on gaming rigs. I’m hoping game devs start experimenting with locally deployed Mixtral in their games. e.g. something like CIV but with each leader powered via LLM
You can also run Mixtral, at a decent token rate, on a post-2020 Apple MacBook Pro M1/M2/M3 with 32GB+ of RAM. 16GB of RAM also works, sort of OK. I suspect that needs about the same quantization a 3090 would use, but I do notice a difference in quality at that quantization. On my M2 Pro, the token rate and intelligence feel like GPT-3.5 Turbo. This is the first model I've started actually using (vs playing around with for the love of the tech) instead of GPT-3.5.
An Apple M2 Pro with 32GB of RAM is in the same price range as a gaming PC with a 3090, but it's another example of normal people with moderately high performance systems "accidentally" being able to run a GPT-3.5 comparable model.
If you have an Apple machine meeting these specs and want to play around, LM Studio is free to download and has made it really easy to get started: https://lmstudio.ai/
I hope to see a LOT more hobby hacking as a result of Mixtral and successors.
I have so far run it on my M1 MacBook using llamafile [1] and found it to be great.
Is there any speed/performance/quality/context size/etc. advantage to using LM Studio or any of the other *llama tools that require more setup than downloading and running a single llamafile executable?
Google tells me that the RTX 3090 is priced between US$1,480 and $1,680.
You can buy a whole PC for that. I refuse to believe that a GPU priced that highly is "consumer grade" and "common".
Are there any GPUs that are good for LLMs or other genAI that aren't absurdly priced? Or ones specifically designed for AI rather than gaming graphics?
I recently purchased a 3090 on Reddit’s hardwareswap community for $550. New GPUs are pricey right now because of shortages, but if you look around a bit it can be affordable.
Gamers and LLM/AI/ML GPU users do not find that absurdly priced. Absurdly priced in our world is $15,000, so your perceptions are off by about an order of magnitude.
Then again, the Apple II cost around $6.5k in today's dollars. [0] My hunch is that people caring less for tricking out their computers for gaming is about people not being all that interested in being able to have the top of the line graphics settings enabled when playing AAA games. But I think the history of PCs and gaming very much proves that even normal consumers are willing to spend the big bucks on technology when it enables something truly new.
I would go even further - anytime I look at the hardware survey, I am surprised by the anemic and dated hardware people are running. Most people who game are not building a custom box anymore.
To be fair, I got a card second hand to play with, and it was only £700 (~$900) and it came with the manufacturer's warranty. It was a bit of a gamble, but the 24GB of VRAM has been a godsend for experimenting with LLMs. And playing video games at 4K!
tl;dr, no, especially since AMD is lagging behind.
Apple is the one doing the best in terms of making consumer-friendly hardware that can perform AI/ML tasks...but that involves a different problem regarding video games.
It's worth noting that the 4-bit quants can run on CPU at roughly reading speed, which should unlock many use cases, especially if we can precompute some of the results asynchronously.
Resource constraints would be a concern, yeah, so if you were developing a game featuring LLMs (which would, at this point in their development and maturity, be a gimmick) you would keep that in mind and keep other demands on resources low.
True, yet many games are effectively single-threaded anyways.
The bigger problem is memory capacity and bandwidth, but I suspect folks will eventually figure out some sort of QoS setup to let the system crunch LLMs using otherwise unused/idle resources.
I've been working with local models as agents, and anyone interested in trying this needs to know about llama.cpp's "grammars" feature. You can force the model's output to conform to a specific structure, which is useful not only for ensuring that you receive e.g. valid JSON output but also for more specific stuff like "if you choose to do x, you must also provide y", which can be great for influencing its thinking (e.g. an actor who's planning ahead might be required to respond with three of any of the five W's (its choice which three), but then it gets to be free-form inside of the JSON string values, which can be used as context for a following selection from a restricted set of actions; or a model might have the option of asking for more time to think at the end of its response, but if it doesn't then it needs to specify its next action). This doesn't impact generation speed AFAICT and can be used in very creative ways, but results can still need to be regenerated if they're truncated, and I had to write a function to stop immediately when the valid JSON object is closed (i.e. at the end) or when more than about five newlines are generated in a row. This will vary by model.
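A minimal sketch of a stop check like the one described above. This is not the commenter's actual function: the name `should_stop` and the `max_newlines` parameter are hypothetical, and it assumes you call it on the accumulated output text after each generated token.

```python
def should_stop(text: str, max_newlines: int = 5) -> bool:
    """Return True once the first top-level JSON object has closed,
    or once more than `max_newlines` newlines appear in a row.

    Hypothetical helper: the names and the default threshold are assumptions.
    """
    depth = 0          # current nesting level of { ... }
    in_string = False  # inside a JSON string literal?
    escaped = False    # previous character was a backslash inside a string?
    newline_run = 0    # length of the current run of consecutive newlines

    for ch in text:
        # Runaway-newline guard, applied to the raw output.
        if ch == "\n":
            newline_run += 1
            if newline_run > max_newlines:
                return True
        else:
            newline_run = 0

        if in_string:
            # Braces inside string values must not affect the depth count.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue

        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                return True  # the outermost object just closed

    return False
```

It rescans the whole text on every call, which is fine for short agent replies; for long outputs you would want to keep the state between tokens instead.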
>Running LLMs locally to create custom dialogue for games is still years away
I have thought about this: a game only needs a small LLM. It does not need to know about movies, bands, history, coding and all the text on the internet. I am thinking we need a small model similar to phi-2, trained only on basic stuff, which you would then train on the game world's lore. The game would also use some "simpler" graphics (we had good-looking game graphics decades ago, so I do not think you would be limited to text adventures or 2D graphics; you just need simpler, optimized graphics. It is always interesting when someone shows an Unreal demo that uses more RAM/VRAM for a simple demo level than a giant game like GTA5).
VR isn't pragmatically accessible to the average gamer due to hardware requirements and the necessity of setting up the right physical environment but there are still VR games.
How did you arrive at "almost a decade"? There's been VR stuff since at least the late 1980s. On the flip side you could say it wasn't "a thing" until just a few years ago. Or that it isn't "a thing" yet.
You cannot currently run Mixtral with a 32k context on a 3090, unless I am wrong? I think the largest context I was able to get working was around 1500 tokens with 2- or 3-bit quantization; I would have to look at my notes.
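Some back-of-the-envelope arithmetic for this question, as a rough sketch only. It assumes figures from Mixtral 8x7B's public release rather than from this thread (about 46.7B total parameters, 32 layers, 8 KV heads, head dimension 128) and an fp16 KV cache. Real quantized files and runtimes add overhead, so treat these as lower bounds.

```python
GIB = 1024 ** 3

# Assumed Mixtral 8x7B figures (from its public release, not from this thread).
TOTAL_PARAMS = 46_700_000_000  # every expert must sit in memory, not just the ~13B active per token
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128


def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Memory for the weights alone at a given quantization width."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_tokens: int, bytes_per_value: int = 2) -> int:
    """Memory for the KV cache: keys and values (factor 2) for every layer."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value * n_tokens


if __name__ == "__main__":
    context = 32 * 1024
    cache = kv_cache_bytes(context)
    for bits in (2, 3, 4):
        weights = weight_bytes(TOTAL_PARAMS, bits)
        total = weights + cache
        print(f"{bits}-bit: weights {weights / GIB:.1f} GiB, "
              f"32k cache {cache / GIB:.1f} GiB, total {total / GIB:.1f} GiB")
```

By this estimate, 4-bit weights plus a full 32k fp16 cache already exceed 24 GiB, while lower-bit quants leave more headroom on paper. What actually fits depends on the quantization format and runtime overhead.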