Ah I mean fair enough :) I don’t keep up much with car brands and ownerships, but still TIL haha
Huh, didn’t realize Volvo was primarily owned by a Chinese company, you got me there lol. Genuinely always thought they were a standalone Swedish company
If you’re using text-generation-webui, there’s a bug where if your max new tokens is equal to your prompt truncation length, it will remove all of the input and therefore just generate nonsense, since there’s no prompt left
Reduce your max new tokens and your prompt should actually get passed to the backend. This is more noticeable with models that only have 4k context (since a lot of people default max new tokens to 4k)
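The arithmetic behind that bug can be sketched like this (function name and logic are illustrative, not the actual text-generation-webui internals): the token budget left for the prompt is the truncation length minus max new tokens, so when the two are equal the whole prompt gets truncated away.

```python
# Illustrative sketch, assuming the backend reserves max_new_tokens out of
# the truncation window before fitting the prompt into what's left.
def tokens_available_for_prompt(truncation_length: int, max_new_tokens: int) -> int:
    # Whatever isn't reserved for generation is all the prompt can occupy.
    return max(truncation_length - max_new_tokens, 0)

# With a 4k-context model and max new tokens also set to 4096:
print(tokens_available_for_prompt(4096, 4096))  # 0 -> no prompt reaches the backend
print(tokens_available_for_prompt(4096, 512))   # 3584 tokens of prompt survive
```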
I don’t understand the title, twitch isn’t mentioned anywhere in the article is it??
Colour me intrigued. I want more manufacturers that go against the norm. If they put out a generic slab with normal specs at an expected price, I won’t be very interested, but if they do something cool I’m all for it
Except I just noticed the part where it’s developed by Meizu, so never mind, it’ll probably be a generic Chinese phone
You shouldn’t need NVLink. I’m wondering if it’s something to do with AWQ, since I know that exllamav2 and llama.cpp both support multi-GPU splitting in oobabooga
I live in Ontario where we go down to -30C in the harshest conditions.
We have a heat pump and a furnace and they alternate based on efficiency
Somewhere around -5 to +5 C it switches from the heat pump to the furnace
I think you could get by a bit colder but it really loses out on efficiency vs burning gas unless you invest in a geothermal heat pump
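The dual-source setup described above can be sketched as a simple balance-point rule (the threshold is illustrative, not from any real thermostat): above some outdoor temperature the heat pump is cheaper to run, below it the gas furnace wins.

```python
# Toy sketch of hybrid heat-pump/furnace switching, assuming a single
# fixed balance point; real systems weigh efficiency and fuel cost.
def choose_heat_source(outdoor_temp_c: float, balance_point_c: float = 0.0) -> str:
    # Warmer than the balance point: heat pump efficiency is still good.
    # Colder: the furnace takes over.
    return "heat pump" if outdoor_temp_c > balance_point_c else "furnace"

print(choose_heat_source(10))   # heat pump
print(choose_heat_source(-20))  # furnace
```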
I use text-generation-webui mostly. If you’re only using GGUF files (llama.cpp), koboldcpp is a really good option
A lot of it is the automatic prompt formatting; there are probably 5-10 specific formats in use, and using the right one for your model is very important for optimal output. TheBloke usually lists the prompt format in his model cards, which is handy
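To make the point concrete, here are two of the common formats as plain template functions (a hedged sketch: the exact strings vary by model, so always check the model card for the canonical template). Feeding a ChatML-trained model an Alpaca-style prompt, or vice versa, is exactly the mismatch that degrades output.

```python
# Alpaca-style instruction format, as used by many early fine-tunes.
def alpaca_prompt(instruction: str) -> str:
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )

# ChatML format, used by a different family of models.
def chatml_prompt(instruction: str) -> str:
    return (
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(alpaca_prompt("Summarize this article."))
print(chatml_prompt("Summarize this article."))
```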
RoPE and YaRN refer to extending the default context of a model through hacky (but functional) methods, and probably deserve their own write-up
Yeah, so those are mixed; they’re definitely not putting each individual weight at 2 bits because, as you said, that’s very small. I don’t even think it averages out to 2 bits, but more like 2.56
You can read some details here on bits per weight: https://huggingface.co/TheBloke/LLaMa-30B-GGML/blob/8c7fb5fb46c53d98ee377f841419f1033a32301d/README.md#explanation-of-the-new-k-quant-methods
Unfortunately this is not the whole story either, as they get further combined with other bits per weight; for example, Q2_K actually uses Q4_K for some of the weights and Q2_K for others, resulting in more like 2.8 bits per weight
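The averaging can be sketched with made-up numbers (the fractions and effective bit counts below are illustrative, not the real llama.cpp layer split): mixing a minority of higher-precision tensors into a mostly-Q2_K model pulls the overall bits per weight well above a flat 2.

```python
def average_bpw(splits):
    """Weighted-average bits per weight.

    splits: list of (fraction_of_weights, effective_bits_per_weight) pairs,
    with the fractions summing to 1. All numbers here are hypothetical.
    """
    assert abs(sum(f for f, _ in splits) - 1.0) < 1e-9
    return sum(f * b for f, b in splits)

# e.g. if 15% of weights were stored at ~4.5 effective bits (Q4_K-like)
# and 85% at ~2.6 bits (Q2_K-like), the model-wide average lands near 2.9:
print(average_bpw([(0.15, 4.5), (0.85, 2.6)]))
```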
Generally speaking you’ll want to use Q4_K_M unless going smaller really benefits you (like you can fit the full thing on GPU)
Also, the bigger the model you have (70B vs 7B) the lower you can go on quantization bits before it degrades to complete garbage
If you’re using llama.cpp, chances are you’re already using a quantized model; if not, then yes, you should be. Unfortunately, without crazy fast RAM you’re basically limited to 7B models if you want any amount of speed (5-10 tokens/s)
Wtf? This is a weird take lol
Hope it comes out soon, those are some nice QoL updates :)
Very interesting they wouldn’t let him film the camera bump… it must have some kind of branding on it like Hasselblad? Or maybe they’ve secretly found a way to have no bump! One can dream…
Yeah, you definitely still need to understand the limits of open source models. They’re getting pretty damn good at generating code, but their comprehension isn’t quite there. I think the ideal is eventually having 2 models: one that determines the problem and what the solution would be, and another that generates the code, so that requests like “fix this bug” or vaguer questions like “how do I start writing this app” would be more successful
I’ve had decent results with continue, it’s similar to copilot and actually works decently with local models lately:
By far the biggest pain point of Sony… their software is clean, stable, and fast, with an acceptable release cadence, but their promise of 2 years of updates is completely unacceptable in this day and age
Wish there was any way at all to influence them
Not a glowing review that this is accidentally not a reply to a comment. :p
Yeah I guess I meant more it just doesn’t get nearly as much attention, but you’re right there’s some starting and that’s quite nice
Interesting, hadn’t heard of it before today, but guess I don’t look at European car brands that often anyways