(Replying to PARENT post)
M3 Max is actually less than ideal because it peaks at 400 GB/s of memory bandwidth. What you really want is an M1 or M2 Ultra, which offers up to 800 GB/s (for comparison, an RTX 3090 runs at 936 GB/s). A Mac Studio suitable for running 70B models at speeds fast enough for realtime chat can be had for ~$3K.
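To see why bandwidth is the number that matters here, a rough back-of-envelope (the figures below are my own assumptions, not measurements): single-stream decoding is mostly memory-bandwidth-bound, so tokens/sec is roughly bandwidth divided by the bytes read per token, which is about the size of the quantized weights.

    # Sketch: approximate decode speed for a bandwidth-bound LLM.
    # Assumes 4-bit weights (0.5 bytes/param); real numbers vary with
    # quantization format, KV cache traffic, and achievable bandwidth.
    def tokens_per_second(bandwidth_gb_s, params_billions, bytes_per_param=0.5):
        model_gb = params_billions * bytes_per_param  # e.g. 70B @ 4-bit ~= 35 GB
        return bandwidth_gb_s / model_gb

    print(tokens_per_second(400, 70))  # M3 Max-class bandwidth:  ~11 tok/s
    print(tokens_per_second(800, 70))  # M1/M2 Ultra-class:       ~23 tok/s
    print(tokens_per_second(936, 70))  # RTX 3090-class:          ~27 tok/s (if it fit)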
The downside of Apple's hardware at the moment is that the training ecosystem is very much focused on CUDA; llama.cpp has an open issue about Metal-accelerated training: https://github.com/ggerganov/llama.cpp/issues/3799 - but no work on it so far. This is likely because training at any significant size requires enough juice that it's pretty much always better to do it in the cloud currently, where, again, CUDA is the well-established ecosystem, and it's cheaper and easier for datacenter operators to scale. But, in principle, much faster training on Apple hardware should be possible, and eventually someone will get it done.
👤int_19h🕑1y🔼0🗨️0
(Replying to PARENT post)
Microsoft accidentally leaked that GPT-3.5 Turbo is apparently only 20B parameters.
24GB of VRAM is enough to run quantized ~33B parameter models, and enough to run Mixtral (which is a MoE, which makes direct comparisons to “traditional” LLMs a little more confusing).
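Rough math on why that fits (my own assumptions, e.g. 4-bit quantization and a loose allowance for the KV cache and runtime buffers):

    # Sketch of a VRAM budget for a ~33B model at 4-bit quantization
    params = 33e9
    weights_gb = params * 0.5 / 1e9   # ~16.5 GB of quantized weights
    overhead_gb = 4.0                 # loose allowance for KV cache + buffers
    print(f"{weights_gb + overhead_gb:.1f} GB needed vs 24 GB available")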
I don’t think there’s a clear answer to what hardware someone should get. It depends. Should you give up performance on the models most people run locally in hopes of running very large models, or give up the ability to run very large models in favor of prioritizing performance on the models that are popular and proven today?