I’d like to note that this model’s parameter usage is low enough (13b) to run sm...

snickell · on Jan 9, 2024

You can also run Mixtral, at a decent token rate, on a post-2020 Apple MacBook Pro M1/M2/M3 with 32GB+ of RAM. 16GB of RAM also works, sort of OK, which I suspect is the same quantization a 3090 is using, but I do notice a difference in the quantization. On my M2 Pro, the token rate and intelligence feel like GPT-3.5-turbo. This is the first model I've started actually using (vs playing around with for the love of the tech) instead of GPT-3.5.

An Apple M2 Pro with 32GB of RAM is in the same price range as a gaming PC with a 3090, but it's another example of normal people with moderately high performance systems "accidentally" being able to run a GPT-3.5 comparable model.

If you have an Apple meeting these specs and want to play around, LLM Studio is open source and has made it really easy to get started: https://lmstudio.ai/

I hope to see a LOT more hobby hacking as a result of Mixtral and successors.

cjbprime · on Jan 9, 2024

I don't think it's true that LM Studio is open source. Maybe I'm missing something?

eyegor · on Jan 9, 2024

LM Studio (which they linked) is definitely not open source, and doesn't even offer a pricing model for business use.

LLMStudio is, but I suspect that was a typo in their comment. https://github.com/TensorOpsAI/LLMStudio

barnabee · on Jan 9, 2024

I have so far run it on my M1 MacBook using llamafile [1] and found it to be great.

Is there any speed/performance/quality/context size/etc. advantage to using LLM Studio or any of the other *llama tools that require more setup than downloading and running a single llamafile executable?

[1] https://github.com/Mozilla-Ocho/llamafile/

nraford · on Jan 9, 2024

How did you get Mixtral to run on a 32GB M1?

I tried using Ollama on my machine (same specs as above) and it told me I needed 49gb RAM minimum.

eurekin · on Jan 9, 2024

I'm using:

  ollama run dolphin-mixtral:8x7b-v2.5-q3_K_S

mark_l_watson · on Jan 9, 2024

That runs on 32G? The original mixtral q3 wouldn’t run for me. Maybe the dolphin tuned version is smaller?

EDIT: I just checked, it runs great, thanks.

Me1000 · on Jan 9, 2024

LM Studio, sadly, is not open source.

LeoPanthera · on Jan 9, 2024

Google tells me that the RTX 3090 is priced between US$1,480 and $1,680.

You can buy a whole PC for that. I refuse to believe that a GPU priced that highly is "consumer grade" and "common".

Are there any GPUs that are good for LLMs or other genAI that aren't absurdly priced? Or ones specifically designed for AI rather than gaming graphics?

PrayagBhakar · on Jan 9, 2024

[A used RTX 3090 goes for around $700.](https://prayag.bhakar.org/apollo-ai-compute-cluster-for-the-...)

rfw300 · on Jan 9, 2024

I recently purchased a 3090 on Reddit’s hardwareswap community for $550. New GPUs are pricey right now because of shortages, but if you look around a bit it can be affordable.

alchemist1e9 · on Jan 9, 2024

Gamers and LLM/AI/ML GPU users do not find that absurdly priced. Absurdly priced in our world is $15,000, so your perceptions are off by about an order of magnitude.

__loam · on Jan 9, 2024

I can assure you a $1,500 graphics card is a big luxury for most gamers.

https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...

3090 isn't even in the top 30.

helloplanets · on Jan 9, 2024

Then again, the Apple II cost around $6.5k in today's dollars. [0] My hunch is that people caring less for tricking out their computers for gaming is about people not being all that interested in being able to have the top of the line graphics settings enabled when playing AAA games. But I think the history of PCs and gaming very much proves that even normal consumers are willing to spend the big bucks on technology when it enables something truly new.

[0]: https://en.wikipedia.org/wiki/Apple_II

fbdab103 · on Jan 9, 2024

I would go even further - anytime I look at the hardware survey, I am surprised by the anemic and dated hardware people are running. Most people who game are not building a custom box anymore.

contravariant · on Jan 9, 2024

I mean I was, but turns out a well built box lasts a while.

MacsHeadroom · on Jan 9, 2024

Actually it should be #23 on that list but is split up into two items with roughly 0.6% each. Seems to be a bug.

Search the page for 3090 and see for yourself, it's on the list twice.

__loam · on Jan 9, 2024

To be honest I didn't look that hard

ryanwaggoner · on Jan 9, 2024

Aren’t there others in that list above the 3090 that are even more expensive?

Conasg · on Jan 9, 2024

To be fair, I got a card second hand to play with, and it was only £700 ($900~) and it came with manufacturer warranty. It was a bit of a gamble, but the 24GB VRAM has been a godsend for experimenting with LLMs. And playing video games at 4K!

renewiltord · on Jan 9, 2024

Nah. It's about half that. You can pick up a used 4090 for that much.

minimaxir · on Jan 9, 2024

tl;dr, no, especially since AMD is lagging behind.

Apple is the one doing the best in terms of making consumer-friendly hardware that can perform AI/ML tasks...but that involves a different problem regarding video games.

RandomBK · on Jan 9, 2024

It's worth noting that the 4bit quants can run on cpu at ~reading speed, which should unlock many use cases - especially if we can precompute some of the results async.

minimaxir · on Jan 9, 2024

That assumes full CPU utilization, with no other tasks heavily using the CPU.

In the case of high-end video games, that's unlikely.

somnic · on Jan 9, 2024

Resource constraints would be a concern, yeah, so if you were developing a game featuring LLMs (which would, at this point in their development and maturity, be a gimmick) you would keep that in mind and keep other demands on resources low.

RandomBK · on Jan 9, 2024

True, yet many games are effectively single-threaded anyways.

The bigger problem is memory capacity and bandwidth, but I suspect folks will eventually figure out some sort of QoS setup to let the system crunch LLMs using otherwise unused/idle resources.

ilaksh · on Jan 9, 2024

Although in my testing the 4bit reasoning was not nearly as good.

LanternLight83 · on Jan 9, 2024

I've been working with local models as agents, and anyone interested in trying this needs to know about llama.cpp's "grammars" feature. You can force the model's output to conform to a specific structure, which is not only useful for ensuring that you receive e.g. valid JSON output but also for more specific stuff like "if you choose to do x, you must also provide y", which can be great for influencing its thinking (e.g. an actor who's planning ahead might be required to respond with three of any of the five W's (its choice which three), but then it gets to be free-form inside of the JSON string values, which can be used as context for a following selection from a restricted set of actions; or a model might have the option of asking for more time to think at the end of its response, but if it doesn't then it needs to specify its next action). This doesn't impact generation speed AFAICT and can be used in very creative ways, but results can still need to be regenerated if they're truncated, and I had to write a function to stop immediately when the valid JSON object is closed (i.e. at the end) or when more than about five newlines are generated in a row. This will vary by model.
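For anyone wanting to try the stop condition described above, here is a minimal sketch in Python. It is not the function from the comment, just one way to do it: the name `should_stop` and the `max_newlines` default are hypothetical, and it assumes you call it on the accumulated text after each generated token.

```python
def should_stop(text: str, max_newlines: int = 5) -> bool:
    """Return True once generation should be cut off.

    Hypothetical helper: stops when the first top-level JSON object
    has been closed, or when more than `max_newlines` newlines
    appear in a row (a sign the model is rambling).
    """
    # Runaway-newline check.
    if "\n" * (max_newlines + 1) in text:
        return True

    depth = 0          # current nesting level of { }
    in_string = False  # inside a JSON string literal?
    escaped = False    # previous character was a backslash?
    started = False    # have we seen the opening brace yet?

    for ch in text:
        if in_string:
            # Braces inside strings do not count toward nesting.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
            started = True
        elif ch == "}":
            depth -= 1
            if started and depth == 0:
                return True
    return False
```

In a streaming loop you would append each new token to a buffer, call `should_stop(buffer)`, and break out when it returns `True`. It only tracks braces and string quoting, so it is a cheap check rather than a JSON validator; parsing the final buffer with `json.loads` afterwards is still a good idea.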

sanjiwatsuki · on Jan 9, 2024

The VRAM usage is closer to a 47B model - although only 2 experts are used at a time for inference, all experts are needed to complete it.

discordance · on Jan 9, 2024

Confirmed. Currently running Mixtral 8x7B gguf (Q8_0) on a Macbook Pro M1 Max w 64GB ram, and RAM usage is sitting at 48.8 GB.
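A rough back-of-the-envelope check of that number, as a sketch only. It assumes Mixtral 8x7B's published totals (about 46.7B parameters overall, about 12.9B active per token) and the nominal bits per weight of a few GGUF formats (Q8_0 stores 32 weights in 34 bytes, so 8.5 bits; Q4_0 uses 18 bytes, so 4.5 bits). Real files mix tensor types and add runtime buffers, so treat the output as an estimate.

```python
TOTAL_PARAMS = 46.7e9   # all experts; everything must be resident in memory
ACTIVE_PARAMS = 12.9e9  # parameters actually used per token (2 experts)

def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9

# Nominal bits per weight (assumed; real quantized files vary a little).
for name, bpw in [("fp16", 16.0), ("Q8_0", 8.5), ("Q4_0", 4.5)]:
    print(f"{name}: ~{weight_gb(TOTAL_PARAMS, bpw):.1f} GB of weights")
```

Under these assumptions Q8_0 comes out a little under 50 GB, the same ballpark as the usage reported above. The memory footprint follows the total parameter count; only the per-token compute follows the roughly 13B active parameters.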

karolist · on Jan 9, 2024

How many t/s?

discordance · on Jan 12, 2024

Around 15 - 20 t/s

karolist · on Jan 22, 2024

Thank you, got the same build M1 Max during Christmas B&H sale and can confirm it's amazing for running local LLMs.

minimaxir · on Jan 9, 2024

The average gamer doesn't have a 3090-equivalent, or even an Nvidia GPU.

Running LLMs locally to create custom dialogue for games is still years away.

simion314 · on Jan 9, 2024

>Running LLMs locally to create custom dialogue for games is still years away

I have thought about this: for a game you need a small LLM. You do not need it to know about movies, music bands, history, coding, and all the text on the internet. I am thinking we need a small model similar to phi-2, trained only on basic stuff, which you would then train on the game world's lore. The game would also use some "simpler" graphics (we had good-looking game graphics decades ago, so I do not think you would be limited to text adventures or 2D graphics; you just need simpler, optimized graphics. It is always interesting when someone shows an Unreal demo that uses more RAM/VRAM for a simple demo level than a giant game like GTA5).

kytazo · on Jan 9, 2024

This is wrong in many ways. First off you don't necessarily need a high-end consumer GPU, look at llama.cpp.

Or even better look at something like phi-2. It's likely to go even lower. I'm sure there are people here who can detail more specifics.

It's 1/24 and we're already there, more or less.

somnic · on Jan 9, 2024

VR isn't pragmatically accessible to the average gamer due to hardware requirements and the necessity of setting up the right physical environment but there are still VR games.

__loam · on Jan 9, 2024

There are a few, after almost a decade of VR being a thing.

Cyphase · on Jan 9, 2024

How did you arrive at "almost a decade"? There's been VR stuff since at least the late 1980s. On the flip side you could say it wasn't "a thing" until just a few years ago. Or that it isn't "a thing" yet.

__loam · on Jan 9, 2024

I think it's pretty obvious that I'm talking about post-Oculus Rift VR and the crop of games that use controllers similar to that device.

michaelmrose · on Jan 9, 2024

Why couldn't this be handled remotely and be handled as part of a subscription to the game?

Mashimo · on Jan 9, 2024

> or even an Nvidia GPU.

What do you mean? Most gamers do have an Nvidia GPU.

Edit: unless you talk about mobile gamers, and not PC gamers?

fragmede · on Jan 9, 2024

To wit, the steam report says 74% of its users have Nvidia.

https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...

fswd · on Jan 9, 2024

You cannot currently run Mixtral with a 32k context on a 3090, unless I am wrong? I think the largest context I was able to reproduce was around 1500 with 2 or 3 bit; I would have to look at my notes.
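To put rough numbers on the context question, here is a sketch of the arithmetic rather than a measurement. It assumes Mixtral 8x7B's published config (32 layers, 8 key/value heads, head dimension 128), an fp16 KV cache, about 46.7B total parameters, and a hypothetical average of 3.5 bits per weight for a 3-bit quant. Compute buffers and anything else using the card are ignored.

```python
N_LAYERS = 32        # transformer layers (from the published config)
N_KV_HEADS = 8       # grouped-query attention: 8 key/value heads
HEAD_DIM = 128       # dimension per head
BYTES_PER_VALUE = 2  # fp16 cache entries (assumption)

def kv_cache_bytes(context_tokens: int) -> int:
    """Bytes needed to cache keys and values for `context_tokens` tokens."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * context_tokens

def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Approximate bytes for the quantized weights."""
    return params * bits_per_weight / 8

GIB = 2**30
kv = kv_cache_bytes(32 * 1024) / GIB
weights = weight_bytes(46.7e9, 3.5) / GIB  # 3.5 bpw is an assumed average
print(f"KV cache at 32k: {kv:.1f} GiB")
print(f"Weights at ~3.5 bpw: {weights:.1f} GiB")
print(f"Total: {kv + weights:.1f} GiB vs 24 GiB on a 3090")
```

Under these assumptions the total lands just under 24 GiB before any scratch buffers, so a full 32k context on a single 3090 looks borderline at best. A lower-bit quant or a smaller context buys headroom.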