[HN Gopher] Guide to running Llama 2 locally
___________________________________________________________________
 
Guide to running Llama 2 locally
 
Author : bfirsh
Score  : 177 points
Date   : 2023-07-25 16:58 UTC (6 hours ago)
 
web link (replicate.com)
w3m dump (replicate.com)
 
| guy98238710 wrote:
| > curl -L "https://replicate.fyi/install-llama-cpp" | bash
| 
| Seriously? Pipe script from someone's website directly to bash?
 
  | gattilorenz wrote:
  | Yes. If you are worried, you can redirect it to a file and
  | then sh it. It doesn't get much easier to inspect than that...
 
  | cjbprime wrote:
  | Either you trust the TLS session to their website to deliver
  | you software you're going to run, or you don't.
 
  | madars wrote:
  | That's the recommended way to get Rust nightly too:
  | https://rustup.rs/ But don't look there, there is memory safety
  | somewhere!
 
    | raccolta wrote:
    | oh, this again.
 
| handelaar wrote:
| Idiot question: if I have access to sentence-by-sentence
| professionally-translated text of foreign-language-to-English in
| gigantic quantities, and I fed the originals as prompts and the
| translations as completions...
| 
| ... would I be likely to get anything useful if I then fed it new
| prompts in a similar style? Or would it just generate gibberish?
 
  | seanthemon wrote:
  | Indeed, it sounds like you have what's called fine-tuning data
  | (given an input, here's the expected output). There's loads of
  | info about fine-tuning, both here on HN and on Hugging Face's
  | YouTube channel.
  | 
  | Note that if you have sufficient data, it's worth looking into
  | existing models on Hugging Face; you may find a smaller, faster
  | and more open (licensing-wise) model that you can fine-tune to
  | get the results you want. Llama is hot, but not a catch-all for
  | all tasks (as no model should be).
  | 
  | Happy inferring!
 
  | nl wrote:
  | If you have that much data you can build your own model that
  | can be much smaller and faster.
  | 
  | A simple version is covered in a beginner tutorial:
  | https://pytorch.org/tutorials/beginner/translation_transform...
 
| maxlin wrote:
| I might be missing something. The article asks me to run a bash
| script on Windows.
| 
| I assume this would still need to be run manually to access GPU
| resources etc., so can someone illuminate what is actually
| expected of a Windows user to make this run?
| 
| I'm currently paying $15 a month for a personal
| translation/summarizer project's ChatGPT queries. I run Whisper
| (const.me's GPU fork) locally and would love to get the LLM part
| local eventually too! The system generates 30k queries a month
| but is not very latency-sensitive, so lower token rates might
| work too.
 
  | nomel wrote:
  | Windows has supported linux tools for some time now, using WSL:
  | https://learn.microsoft.com/en-us/windows/wsl/about
  | 
  | No idea if it will work in this case, but it does with
  | llama.cpp: https://github.com/ggerganov/llama.cpp/issues/103
 
    | maxlin wrote:
    | I know (I should have included that in my earlier response,
    | but editing would've felt weird), but I still assume one
    | should run the result natively, so I'm asking if/where
    | there's some jumping around required.
    | 
    | Last time I tried running an LLM I tried both WSL and native
    | on 2 machines and just got Lovecraftian-tier errors, so I'm
    | waiting to see if I'm missing something obvious before going
    | down that route again.
 
| nomand wrote:
| Is it possible for such a local install to retain conversation
| history, so that if, for example, you're working on a project
| and using it as your assistant across many days, you can
| continue conversations and the model keeps track of what you and
| it already know?
 
  | simonw wrote:
  | My LLM command line tool can do that - it logs everything to a
  | SQLite database and has an option to continue a conversation:
  | https://llm.datasette.io
 
  | knodi123 wrote:
  | llama is just an input/output engine. It takes a big string as
  | input, and gives a big string of output.
  | 
  | Save your outputs if you want; you can copy/paste them into any
  | editor. Or make a shell script that mirrors outputs to a file
  | and use _that_ as your main interface. It's up to the user.
 
  | jmiskovic wrote:
  | There is no fully built solution, only bits and pieces. I
  | noticed that llama outputs tend to degrade as the amount of
  | text grows: the text becomes too repetitive and focused, and
  | you have to raise the temperature to break the model out of
  | loops.
 
    | nomand wrote:
    | Does what you're saying mean you can only ask questions and
    | get answers in a single step, and that having a long
    | discussion where refinement of output is arrived at through
    | conversation isn't possible?
 
      | krisoft wrote:
      | My understanding is that at a high level you can look at
      | this model as a black box which accepts a string and
      | outputs a string.
      | 
      | If you want it to "remember" things you do that by
      | appending all the previous conversations together and
      | supply it in the input string.
      | 
      | In an ideal world this would work perfectly. It would read
      | through the whole conversation and would provide the right
      | output you expect, exactly as if it would "remember" the
      | conversation. In reality there are all kinds of issues which
      | can crop up as the input grows longer and longer. One is
      | that it takes more and more processing power and time for
      | it to "read through" everything previously said. And there
      | are things like what jmiskovic said that the output quality
      | can also degrade in perhaps unexpected ways.
      | 
      | But that also doesn't mean that "refinement of output is
      | arrived at through conversation isn't possible". It is not
      | that black and white, just that you can run into trouble
      | as the length of the discussion grows.
      | 
      | I don't have direct experience with long conversations so I
      | can't tell you how long is definitely too long, and how
      | long is still safe. There are probably some tricks one can
      | do to work around these, and more still if one unpacks that
      | "black box" understanding of the process. But even without
      | that you could imagine a "consolidation" process where the
      | AI is instructed to write short notes about a given stretch
      | of conversation, and those shorter notes would be copied
      | into the next input instead of the full previous
      | conversation. All of this is possible, but you won't have a
      | turn-key solution for it just yet.
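      | 
      | A minimal sketch of that append-the-history loop, assuming
      | the llama-cpp-python bindings and a local GGML model file
      | (the path and questions are just illustrative):
      | 
      |   from llama_cpp import Llama
      | 
      |   # illustrative path; any GGML chat model works
      |   llm = Llama(
      |       model_path="llama-2-7b-chat.ggmlv3.q4_0.bin",
      |       n_ctx=2048)
      | 
      |   history = ""
      |   for question in ["What is a context window?",
      |                    "Why do long chats get slow?"]:
      |       # append the whole conversation so far, every turn
      |       history += f"User: {question}\nAssistant:"
      |       out = llm(history, max_tokens=256, stop=["User:"])
      |       answer = out["choices"][0]["text"]
      |       history += answer + "\n"
      |       print(answer)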
 
        | cjbprime wrote:
        | The limit here is the "context window" length of the
        | model, measured in tokens. It will quickly become too
        | short to contain all of your previous conversations,
        | which means the model has to answer questions without
        | access to all of that text. And within a single
        | conversation, it will start forgetting the text from the
        | start of the conversation once the [conversation + new
        | prompt] reaches the context length.
        | 
        | The kind of hacks that work around this are to train the
        | model on the past conversations, or store them in a
        | separate database, and then rely on similarity in tensor
        | space to pull the right (lossy) data back out later,
        | based on its similarity to your question. That data (or a
        | summary of it, since summaries are smaller) gets included
        | in the context window for your new conversation, combined
        | with your prompt. This is what people are talking about
        | when they use the term "embeddings".
 
        | nomand wrote:
        | My benchmark is a pair programming session with ChatGPT
        | spanning days and dozens of queries, in which we
        | co-created a custom static site generator that works
        | really well for my requirements. It was able to hold
        | context for a while and not "forget" what code it had
        | provided me dozens of messages earlier; it was able to
        | "remember" corrections and refactors that I gave it, and
        | overall it was incredibly useful for working out things
        | like recursion over folder hierarchies and building data
        | trees. This kind of use-case and similar ones, where
        | memory is important, are what I mean by using the model
        | as a genuine assistant.
 
        | krisoft wrote:
        | Excellent! That sounds like a very useful personal
        | benchmark then. You could test Llama v2 by copying in
        | snippets of different lengths from that conversation and
        | checking how useful you find its outputs.
 
| RicoElectrico wrote:
| curl -L "https://replicate.fyi/windows-install-llama-cpp"
| 
| ... returns 404 Not Found
 
| thisisit wrote:
| The easiest way I found was to use GPT4All. Just download and
| install it, grab a GGML version of Llama 2, and copy it to the
| models directory in the installation folder. Fire up GPT4All and
| run.
 
| andreyk wrote:
| This covers three things: Llama.cpp (Mac/Windows/Linux), Ollama
| (Mac), MLC LLM (iOS/Android)
| 
| Which is not really comprehensive... If you have a Linux machine
| with GPUs, I'd just use Hugging Face's text-generation-inference
| (https://github.com/huggingface/text-generation-inference). And I
| am sure there are other things that could be covered.
 
  | krisoft wrote:
  | > If you have a linux machine with GPUs
  | 
  | Approximately how much VRAM does one need to run inference
  | with Llama 2 on a GPU?
 
    | lolinder wrote:
    | Depends on which model. I haven't bothered doing it on my 8GB
    | because the only model that would fit is the 7B model
    | quantized to 4 bits, and that model at that size is pretty
    | bad for most things. I think you could have fun with 13B with
    | 12GB VRAM. The full size model would require >35GB even
    | quantized.
 
    | novaRom wrote:
    | 16GB is the minimum to run the 7B model with float16 weights
    | out of the box, with no further effort.
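    | 
    | Back-of-the-envelope arithmetic behind numbers like that, as
    | a rough sketch (weights only; the KV cache and activations
    | need extra memory on top):
    | 
    |   # approximate weight memory at different precisions
    |   GiB = 2**30
    |   for params in (7e9, 13e9, 70e9):
    |       for name, nbytes in (("fp16", 2), ("int8", 1),
    |                            ("4-bit", 0.5)):
    |           print(f"{params / 1e9:.0f}B {name}: "
    |                 f"{params * nbytes / GiB:.1f} GiB")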
 
  | Patrick_Devine wrote:
  | Ollama works with Windows and Linux as well, but doesn't (yet)
  | have GPU support for those platforms. You have to compile it
  | yourself (it's a simple `go build .`), but it should work fine
  | (albeit slowly). The benefit is you can still pull the llama2
  | model really easily (with `ollama pull llama2`) and even use it
  | with other runners.
  | 
  | DISCLAIMER: I'm one of the developers behind Ollama.
 
    | mschuster91 wrote:
    | > DISCLAIMER: I'm one of the developers behind Ollama.
    | 
    | I've got a feature suggestion - would it be possible to have
    | the ollama CLI automatically start up the GUI/daemon if it's
    | not running? There's only so much stuff one can keep in a
    | MacBook Air's auto start.
 
      | jmorgan wrote:
      | Good suggestion! This is definitely on the radar, so that
      | running `ollama` will start the server when it's needed
      | (instead of erroring!):
      | https://github.com/jmorganca/ollama/issues/47
 
    | DennisP wrote:
    | I've been wondering, is the M2's neural engine usable for
    | this?
 
  | robotnikman wrote:
  | Llama.cpp has been fun to experiment with. I was surprised by
  | how easy it was to set up, much easier than when I tried to set
  | up a local LLM almost a year ago.
 
  | lolinder wrote:
  | Just a note that you have to have at least 12GB VRAM for it to
  | be worth even trying to use your GPU for LLaMA 2.
  | 
  | The 7B model quantized to 4 bits can fit in 8GB VRAM with room
  | for the context, but is pretty useless for getting good results
  | in my experience. 13B is better but still not anything near as
  | good as the 70B, which would require >35GB VRAM to use at 4 bit
  | quantization.
  | 
  | My solution for playing with this was just to upgrade my PC's
  | RAM to 64GB. It's slower than the GPU, but it was way cheaper
  | and I can run the 70B model easily.
 
    | dc443 wrote:
    | I have 2x 3090. Do you know if it's feasible to use that 48GB
    | total for running this?
 
      | eurekin wrote:
      | Yes, it runs totally fine. I ran it in
      | oobabooga/text-generation-webui. The nice thing about it is
      | that it auto-downloads all necessary GPU binaries on its
      | own and creates an isolated conda env. I asked the same
      | questions on the official 70B demo and got the same
      | answers. I even got better answers with ooba, since the
      | demo cuts text early.
      | 
      | Oobabooga:
      | https://github.com/oobabooga/text-generation-webui
      | 
      | Model: TheBloke_Llama-2-70B-chat-GPTQ from
      | https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ
      | 
      | ExLlama_HF loader, GPU split 20,22, context size 2048
      | 
      | On the Chat Settings tab, choose the Instruction template
      | tab and pick Llama-v2 from the instruction template
      | dropdown
      | 
      | Demo: https://huggingface.co/blog/llama2#demo
 
        | zakki wrote:
        | Are there any specific settings needed to make 2x3090
        | work together?
 
    | NoMoreNicksLeft wrote:
    | Trying to figure out what hardware to convince my boss to
    | spend on... if we were to get one of the A6000/48GB cards,
    | would we see significant performance improvements over just
    | a 4090/24GB? The primary limitation is VRAM, is it not?
 
      | cjbprime wrote:
      | You might consider getting a Mac Studio (with as much RAM
      | as you can afford up to 192GB) instead, since 192GB is more
      | (unified) memory than you're going to easily get to with
      | GPUs.
 
      | lolinder wrote:
      | VRAM is what gets you up to the larger model sizes, and
      | 24GB isn't enough to load the full 70B even at 4 bits; you
      | need at least 35GB plus some extra for the context. So it
      | depends a lot on what you want to do--fine-tuning will take
      | even more, as I understand it.
      | 
      | The card's speed will affect your performance, but I don't
      | know enough about different graphics cards to tell you
      | specifics.
 
    | ErneX wrote:
    | Apple Silicon Macs might not have great GPUs but they do have
    | unified memory. I need to try this on mine; I have 96GB of
    | RAM on my M2 Max.
 
| krychu wrote:
| Self-plug. Here's a fork of the original llama 2 code adapted to
| run on the CPU or MPS (M1/M2 GPU) if available:
| 
| https://github.com/krychu/llama
| 
| It runs with the original weights, and gets you to ~4 tokens/sec
| on a MacBook Pro M1 with the 7B model.
 
| rootusrootus wrote:
| For most people who just want to play around and are using macOS
| or Windows, I'd just recommend lmstudio.ai. Nice interface, with
| super easy searching and downloading of new models.
 
  | dividedbyzero wrote:
  | Does it make any sense to try this on a lower-end Mac (like a
  | M2 Air)?
 
    | mchiang wrote:
    | Yeah! How much memory do you have?
    | 
    | If by lower-end MacBook Air you mean one with 8GB of memory,
    | try the smaller models (such as Orca Mini 3B). You can do
    | this via LM Studio, oobabooga/text-generation-webui,
    | KoboldCPP, GPT4All, ctransformers, and more.
    | 
    | I'm biased since I work on Ollama, and if you want to try it
    | out:
    | 
    | 1. Download https://ollama.ai/download
    | 
    | 2. `ollama run orca`
    | 
    | 3. Enter your input at the prompt
    | 
    | Note Ollama is open source, and you can compile it too from
    | https://github.com/jmorganca/ollama
 
      | bdavbdav wrote:
      | I'm deliberating on how much RAM to get in my new MBP. Is
      | 32GB going to stand me in good stead?
 
        | mchiang wrote:
        | Local memory management will definitely get better in the
        | future.
        | 
        | For now:
        | 
        | You should have at least 8 GB of RAM to run the 3B
        | models, 16 GB to run the 7B models, and 32 GB to run the
        | 13B models.
        | 
        | My personal recommendation is to get as much memory as
        | you can if you want to work with local models [including
        | VRAM if you are planning to run on a GPU].
 
        | rootusrootus wrote:
        | 32GB should be fine. I went a little overboard and got a
        | new MBP with M2 Max and 96GB, but the hardware is really
        | best suited at this point to a 30B model. I can and do
        | play around with 65B models, but at that point you're
        | making a fairly big tradeoff in generation speed for an
        | incremental increase in quality.
        | 
        | As a datapoint, I have a 30B model [0] loaded right now
        | and it's using 23.44GB of RAM. Getting around 9
        | tokens/sec, which is very usable. I also have the 65B
        | version of the same model [1] and it's good for around
        | 3.6 tokens/second, but it uses 44GB of RAM. Not unusably
        | slow, but more often than not I opt for the 30B because
        | it's good enough and a lot faster.
        | 
        | Haven't tried the llama2 70B yet.
        | 
        | [0] https://huggingface.co/TheBloke/upstage-
        | llama-30b-instruct-2... [1]
        | https://huggingface.co/TheBloke/Upstage-
        | Llama1-65B-Instruct-...
 
        | swader999 wrote:
        | What's your use case for local if you don't mind?
 
      | dividedbyzero wrote:
      | By lower-end I meant that the Airs are quite low-end in
      | general (compared to Pro/Studio). I have the maxed-out
      | 24GB, but 16GB may be more common among people who might
      | use an Air for this kind of thing.
 
| Der_Einzige wrote:
| The correct answer, as always, is the oobabooga
| text-generation-webui, which supports all of the relevant
| backends: https://github.com/oobabooga/text-generation-webui
 
  | cypress66 wrote:
  | Yep. Use ooba. And people who like to RP often use ooba as a
  | backend, and SillyTavern as a frontend.
 
    | Roark66 wrote:
    | Can it run ONNX transformer models? I found optimised ONNX
    | models are at least twice the speed of vanilla PyTorch on the
    | CPU.
 
| TheAceOfHearts wrote:
| How do you decide which model variant to use? There's a bunch of
| quant-method variations of Llama-2-13B-chat-GGML [0]; how do you
| know which one to use? Reading the "Explanation of the new
| k-quant methods" is a bit opaque.
| 
| [0] https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML
 
| sva_ wrote:
| If you just want to do inference/mess around with the model and
| have a 16GB GPU, then this[0] is enough to paste into a notebook.
| You need to have access to the HF models though.
| 
| 0.
| https://github.com/huggingface/blog/blob/main/llama2.md#usin...
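| 
| A minimal sketch of that kind of setup with transformers and
| accelerate, assuming you've been granted access to the gated
| meta-llama weights (prompt and settings are just examples):
| 
|   import torch
|   from transformers import AutoModelForCausalLM, AutoTokenizer
| 
|   model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated repo
|   tok = AutoTokenizer.from_pretrained(model_id)
|   model = AutoModelForCausalLM.from_pretrained(
|       model_id, torch_dtype=torch.float16, device_map="auto")
| 
|   inputs = tok("Explain GGML quantization briefly.",
|                return_tensors="pt").to(model.device)
|   out = model.generate(**inputs, max_new_tokens=128)
|   print(tok.decode(out[0], skip_special_tokens=True))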
 
| oaththrowaway wrote:
| Off topic: is there a way to use one of the LLMs, have it ingest
| data from a SQLite database, and then ask it questions about
| that data?
 
  | politelemon wrote:
  | Have a look at this too; it's the kind of integration that
  | LangChain can be good at: https://walkingtree.tech/natural-
  | language-to-query-your-sql-...
 
  | seanthemon wrote:
  | You can, but as a crazy idea you can also ask ChatGPT to write
  | SELECT queries using the functions parameter they added
  | recently - you can also ask it to write JSONPath.
  | 
  | As long as it understands the schema and the general shape of
  | the data, it does a fairly good job. Just be careful not to do
  | too much with one prompt; you can easily cause hallucinations.
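  | 
  | Roughly what that looks like with the mid-2023 openai Python
  | client, as a sketch; the function name and schema here are
  | made up, and you'd still validate and run the SQL yourself:
  | 
  |   import openai  # pre-1.0 client style
  | 
  |   resp = openai.ChatCompletion.create(
  |       model="gpt-3.5-turbo-0613",
  |       messages=[
  |           {"role": "system",
  |            "content": "Schema: orders(id, customer, total)"},
  |           {"role": "user",
  |            "content": "Total spend per customer?"},
  |       ],
  |       functions=[{
  |           "name": "run_sql",  # hypothetical function
  |           "description": "Run a read-only SELECT query",
  |           "parameters": {
  |               "type": "object",
  |               "properties": {"sql": {"type": "string"}},
  |               "required": ["sql"],
  |           },
  |       }],
  |   )
  |   msg = resp["choices"][0]["message"]
  |   print(msg.get("function_call", msg))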
 
  | simonw wrote:
  | I've experimented with that a bit.
  | 
  | Currently the absolutely best way to do that is to upload a
  | SQLite database file to ChatGPT Code Interpreter.
  | 
  | I'm hoping that someone will fine-tune an openly licensed model
  | for this at some point that can give results as good as Code
  | Interpreter does.
 
  | siquick wrote:
  | You can migrate that data to a vector database (e.g. Pinecone
  | or pgvector) and then query it. I didn't write it, but this
  | guide has a good overview of the concepts and some code. In
  | your case you'd just replace the web crawler with database
  | queries. All the libraries used also exist in Python.
  | 
  | https://www.pinecone.io/learn/javascript-chatbot/
 
  | thisisit wrote:
| You can, but you'll end up trading the precise answers you get
| from querying for a chance of hallucinations.
 
| politelemon wrote:
| Llama.cpp can run on Android too.
 
| synaesthesisx wrote:
| This is usable, but hopefully folks manage to tweak it a bit
| further for even higher tokens/s. I'm running Llama.cpp locally
| on my M2 Max (32 GB) with decent performance but sticking to the
| 7B model for now.
 
___________________________________________________________________
(page generated 2023-07-25 23:00 UTC)