[HN Gopher] Guide to running Llama 2 locally
___________________________________________________________________
 
Guide to running Llama 2 locally
 
Author : bfirsh
Score  : 177 points
Date   : 2023-07-25 16:58 UTC (6 hours ago)
 
web link (replicate.com)
w3m dump (replicate.com)
 
| guy98238710 wrote:
| > curl -L "https://replicate.fyi/install-llama-cpp" | bash
| 
| Seriously? Pipe script from someone's website directly to bash?
 
  | gattilorenz wrote:
  | Yes. If you are worried, you can redirect it to a file and
  | then sh it. It doesn't get much easier to inspect than that...
 
  | cjbprime wrote:
  | Either you trust the TLS session to their website to deliver
  | you software you're going to run, or you don't.
 
  | madars wrote:
  | That's the recommended way to get Rust nightly too:
  | https://rustup.rs/ But don't look there, there is memory safety
  | somewhere!
 
    | raccolta wrote:
    | oh, this again.
 
| handelaar wrote:
| Idiot question: if I have access to sentence-by-sentence
| professionally-translated text of foreign-language-to-English in
| gigantic quantities, and I fed the originals as prompts and the
| translations as completions...
| 
| ... would I be likely to get anything useful if I then fed it new
| prompts in a similar style? Or would it just generate gibberish?
 
  | seanthemon wrote:
  | Indeed, it sounds like you have what's called fine-tuning data
  | (given an input, here's the expected output). There's loads of
  | info about fine-tuning, both here on HN and on Hugging Face's
  | YouTube channel.
  | 
  | Note that if you have sufficient data, it's worth looking into
  | existing models on Hugging Face; you may find a smaller, faster
  | and more open (licensing-wise) model that you can fine-tune to
  | get the results you want. Llama is hot, but not a catch-all for
  | all tasks (as no model should be).
  | 
  | Happy inferring!
 
  | nl wrote:
  | If you have that much data you can build your own model that
  | can be much smaller and faster.
  | 
  | A simple version is covered in a beginner tutorial:
  | https://pytorch.org/tutorials/beginner/translation_transform...
 
| maxlin wrote:
| I might be missing something. The article asks me to run a bash
| script on Windows.
| 
| I assume this would still need to be run manually to access GPU
| resources etc., so can someone illuminate what is actually
| expected of a Windows user to make this run?
| 
| I'm currently paying $15 a month for a personal
| translation/summarizer project's ChatGPT queries. I run Whisper
| (const.me's GPU fork) locally and would love to get the LLM part
| local eventually too! The system generates 30k queries a month
| but is not very latency-sensitive, so lower token rates might
| work too.
 
  | nomel wrote:
  | Windows has supported linux tools for some time now, using WSL:
  | https://learn.microsoft.com/en-us/windows/wsl/about
  | 
  | No idea if it will work in this case, but it does with
  | llama.cpp: https://github.com/ggerganov/llama.cpp/issues/103
 
    | maxlin wrote:
    | I know (I should have included that in my earlier response,
    | but editing would've felt weird), but I still assume one
    | should run the result natively, so I'm asking if/where
    | there's some jumping around required.
    | 
    | Last time I tried running an LLM I tried both WSL and native
    | on 2 machines and just got Lovecraftian-tier errors, so I'm
    | waiting to see if I'm missing something obvious before going
    | down that route again.
 
| nomand wrote:
| Is it possible for such a local install to retain conversation
| history, so that if, for example, you're working on a project
| and using it as your assistant across many days, you can
| continue conversations and the model keeps track of what you and
| it already know?
 
  | simonw wrote:
  | My LLM command line tool can do that - it logs everything to a
  | SQLite database and has an option to continue a conversation:
  | https://llm.datasette.io
 
  | knodi123 wrote:
  | llama is just an input/output engine. It takes a big string as
  | input, and gives a big string of output.
  | 
  | Save your outputs if you want; you can copy/paste them into any
  | editor. Or make a shell script that mirrors outputs to a file
  | and use _that_ as your main interface. It's up to the user.
 
  | jmiskovic wrote:
  | There is no fully built solution, only bits and pieces. I
  | noticed that llama outputs tend to degrade as the amount of
  | text grows: the text becomes too repetitive and focused, and
  | you have to raise the temperature to break the model out of
  | loops.
 
    | nomand wrote:
    | Does what you're saying mean you can only ask questions and
    | get answers in a single step, and that having a long
    | discussion where refinement of output is arrived at through
    | conversation isn't possible?
 
      | krisoft wrote:
      | My understanding is that at a high level you can look at
      | this model as a black box which accepts a string and
      | outputs a string.
      | 
      | If you want it to "remember" things you do that by
      | appending all the previous conversations together and
      | supply it in the input string.
      | 
      | In an ideal world this would work perfectly. It would read
      | through the whole conversation and would provide the right
      | output you expect, exactly as if it would "remember" the
      | conversation. In reality there are all kinds of issues which
      | can crop up as the input grows longer and longer. One is
      | that it takes more and more processing power and time for
      | it to "read through" everything previously said. And there
      | are things like what jmiskovic said that the output quality
      | can also degrade in perhaps unexpected ways.
      | 
      | But that also doesn't mean that "refinement of output is
      | arrived at through conversation isn't possible". It is not
      | that black and white, just that you can run into trouble
      | as the length of the discussion grows.
      | 
      | I don't have direct experience with long conversations so I
      | can't tell you how long is definitely too long, and how
      | long is still safe. There are probably some tricks one can
      | do to work around these, and more still if one unpacks that
      | "black box" understanding of the process. But even without
      | that you could imagine a "consolidation" process where the
      | AI is instructed to write short notes about a given stretch
      | of conversation, and those shorter notes would be copied
      | into the next input instead of the full previous
      | conversation. All of this is possible, but you won't have a
      | turn-key solution for it just yet.
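      | 
      | A minimal sketch of that append-the-history loop, assuming
      | the llama-cpp-python bindings and a local GGML model file
      | (the path and questions are just illustrative):
      | 
      |   from llama_cpp import Llama
      | 
      |   # illustrative path; any GGML chat model works
      |   llm = Llama(
      |       model_path="llama-2-7b-chat.ggmlv3.q4_0.bin",
      |       n_ctx=2048)
      | 
      |   history = ""
      |   for question in ["What is a context window?",
      |                    "Why do long chats get slow?"]:
      |       # append the whole conversation so far, every turn
      |       history += f"User: {question}\nAssistant:"
      |       out = llm(history, max_tokens=256, stop=["User:"])
      |       answer = out["choices"][0]["text"]
      |       history += answer + "\n"
      |       print(answer)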
 
        | cjbprime wrote:
        | The limit here is the "context window" length of the
        | model, measured in tokens. It will quickly become too
        | short to contain all of your previous conversations,
        | which means the model has to answer questions without
        | access to all of that text. And within a single
        | conversation, it will start forgetting the text from the
        | start of the conversation once the [conversation + new
        | prompt] reaches the context length.
        | 
        | The kind of hacks that work around this are to train the
        | model on the past conversations, or store them in a
        | separate database, and then rely on similarity in tensor
        | space to pull the right (lossy) data back out later,
        | based on its similarity to your question. That data (or a
        | summary of it, since summaries are smaller) gets included
        | in the context window for your new conversation, combined
        | with your prompt. This is what people are talking about
        | when they use the term "embeddings".
 
        | nomand wrote:
        | My benchmark is a pair programming session with ChatGPT
        | spanning days and dozens of queries, in which we
        | co-created a custom static site generator that works
        | really well for my requirements. It was able to hold
        | context for a while and not "forget" what code it had
        | provided me dozens of messages earlier; it was able to
        | "remember" corrections and refactors that I gave it, and
        | overall it was incredibly useful for working out things
        | like recursion over folder hierarchies and building data
        | trees. This kind of use-case and similar ones, where
        | memory is important, are what I mean by using the model
        | as a genuine assistant.
 
        | krisoft wrote:
        | Excellent! That sounds like a very useful personal
        | benchmark then. You could test Llama v2 by copying in
        | snippets of different lengths from that conversation and
        | checking how useful you find its outputs.
 
| RicoElectrico wrote:
| curl -L "https://replicate.fyi/windows-install-llama-cpp"
| 
| ... returns 404 Not Found
 
| thisisit wrote:
| The easiest way I found was to use GPT4All. Just download and
| install it, grab a GGML version of Llama 2, and copy it to the
| models directory in the installation folder. Fire up GPT4All and
| run.
 
| andreyk wrote:
| This covers three things: Llama.cpp (Mac/Windows/Linux), Ollama
| (Mac), MLC LLM (iOS/Android)
| 
| Which is not really comprehensive... If you have a Linux machine
| with GPUs, I'd just use Hugging Face's text-generation-inference
| (https://github.com/huggingface/text-generation-inference). And I
| am sure there are other things that could be covered.
 
  | krisoft wrote:
  | > If you have a linux machine with GPUs
  | 
  | Approximately how much VRAM does one need to run inference
  | with Llama 2 on a GPU?
 
    | lolinder wrote:
    | Depends on which model. I haven't bothered doing it on my 8GB
    | because the only model that would fit is the 7B model
    | quantized to 4 bits, and that model at that size is pretty
    | bad for most things. I think you could have fun with 13B with
    | 12GB VRAM. The full size model would require >35GB even
    | quantized.
 
    | novaRom wrote:
    | 16GB is the minimum to run the 7B model with float16 weights
    | out of the box, with no further effort.
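    | 
    | Back-of-the-envelope arithmetic behind numbers like that, as
    | a rough sketch (weights only; the KV cache and activations
    | need extra memory on top):
    | 
    |   # approximate weight memory at different precisions
    |   GiB = 2**30
    |   for params in (7e9, 13e9, 70e9):
    |       for name, nbytes in (("fp16", 2), ("int8", 1),
    |                            ("4-bit", 0.5)):
    |           print(f"{params / 1e9:.0f}B {name}: "
    |                 f"{params * nbytes / GiB:.1f} GiB")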
 
  | Patrick_Devine wrote:
  | Ollama works with Windows and Linux as well, but doesn't (yet)
  | have GPU support for those platforms. You have to compile it
  | yourself (it's a simple `go build .`), but it should work fine
  | (albeit slowly). The benefit is you can still pull the llama2
  | model really easily (with `ollama pull llama2`) and even use it
  | with other runners.
  | 
  | DISCLAIMER: I'm one of the developers behind Ollama.
 
    | mschuster91 wrote:
    | > DISCLAIMER: I'm one of the developers behind Ollama.
    | 
    | I've got a feature suggestion - would it be possible to have
    | the ollama CLI automatically start up the GUI/daemon if it's
    | not running? There's only so much stuff one can keep in a
    | MacBook Air's auto start.
 
      | jmorgan wrote:
      | Good suggestion! This is definitely on the radar, so that
      | running `ollama` will start the server when it's needed
      | (instead of erroring!):
      | https://github.com/jmorganca/ollama/issues/47
 
    | DennisP wrote:
    | I've been wondering, is the M2's neural engine usable for
    | this?
 
  | robotnikman wrote:
  | Llama.cpp has been fun to experiment with. I was surprised by
  | how easy it was to set up, much easier than when I tried to set
  | up a local LLM almost a year ago.
 
  | lolinder wrote:
  | Just a note that you have to have at least 12GB VRAM for it to
  | be worth even trying to use your GPU for LLaMA 2.
  | 
  | The 7B model quantized to 4 bits can fit in 8GB VRAM with room
  | for the context, but is pretty useless for getting good results
  | in my experience. 13B is better but still not anything near as
  | good as the 70B, which would require >35GB VRAM to use at 4 bit
  | quantization.
  | 
  | My solution for playing with this was just to upgrade my PC's
  | RAM to 64GB. It's slower than the GPU, but it was way cheaper
  | and I can run the 70B model easily.
 
    | dc443 wrote:
    | I have 2x 3090. Do you know if it's feasible to use that 48GB
    | total for running this?
 
      | eurekin wrote:
      | Yes, it runs totally fine. I ran it in
      | oobabooga/text-generation-webui. The nice thing about it is
      | that it auto-downloads all necessary GPU binaries on its
      | own and creates an isolated conda env. I asked the same
      | questions on the official 70B demo and got the same
      | answers. I even got better answers with ooba, since the
      | demo cuts text early.
      | 
      | Oobabooga:
      | https://github.com/oobabooga/text-generation-webui
      | 
      | Model: TheBloke_Llama-2-70B-chat-GPTQ from
      | https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ
      | 
      | ExLlama_HF loader, GPU split 20,22, context size 2048
      | 
      | On the Chat Settings tab, choose the Instruction template
      | tab and pick Llama-v2 from the instruction template
      | dropdown
      | 
      | Demo: https://huggingface.co/blog/llama2#demo
 
        | zakki wrote:
        | Are there any specific settings needed to make 2x3090
        | work together?
 
    | NoMoreNicksLeft wrote:
    | Trying to figure out what hardware to convince my boss to
    | spend on... if we were to get one of the A6000/48GB cards,
    | would we see significant performance improvements over just
    | a 4090/24GB? The primary limitation is VRAM, is it not?
 
      | cjbprime wrote:
      | You might consider getting a Mac Studio (with as much RAM
      | as you can afford up to 192GB) instead, since 192GB is more
      | (unified) memory than you're going to easily get to with
      | GPUs.
 
      | lolinder wrote:
      | VRAM is what gets you up to the larger model sizes, and
      | 24GB isn't enough to load the full 70B even at 4 bits; you
      | need at least 35GB plus some extra for the context. So it
      | depends a lot on what you want to do--fine-tuning will take
      | even more, as I understand it.
      | 
      | The card's speed will affect your performance, but I don't
      | know enough about different graphics cards to tell you
      | specifics.
 
    | ErneX wrote:
    | Apple Silicon Macs might not have great GPUs but they do have
    | unified memory. I need to try this on mine; I have 96GB of
    | RAM on my M2 Max.
 
| krychu wrote:
| Self-plug. Here's a fork of the original llama 2 code adapted to
| run on the CPU or MPS (M1/M2 GPU) if available:
| 
| https://github.com/krychu/llama
| 
| It runs with the original weights, and gets you to ~4 tokens/sec
| on a MacBook Pro M1 with the 7B model.
 
| rootusrootus wrote:
| For most people who just want to play around and are using macOS
| or Windows, I'd just recommend lmstudio.ai. Nice interface, with
| super easy searching and downloading of new models.
 
  | dividedbyzero wrote:
  | Does it make any sense to try this on a lower-end Mac (like a
  | M2 Air)?
 
    | mchiang wrote:
    | Yeah! How much memory do you have?
    | 
    | If by lower-end MacBook Air you mean one with 8GB of memory,
    | try the smaller models (such as Orca Mini 3B). You can do
    | this via LM Studio, oobabooga/text-generation-webui,
    | KoboldCPP, GPT4All, ctransformers, and more.
    | 
    | I'm biased since I work on Ollama, and if you want to try it
    | out:
    | 
    | 1. Download https://ollama.ai/download
    | 
    | 2. `ollama run orca`
    | 
    | 3. Enter your input at the prompt
    | 
    | Note Ollama is open source, and you can compile it too from
    | https://github.com/jmorganca/ollama
 
      | bdavbdav wrote:
      | I'm deliberating on how much RAM to get in my new MBP. Is
      | 32GB going to stand me in good stead?
 
        | mchiang wrote:
        | Local memory management will definitely get better in the
        | future.
        | 
        | For now:
        | 
        | You should have at least 8 GB of RAM to run the 3B
        | models, 16 GB to run the 7B models, and 32 GB to run the
        | 13B models.
        | 
        | My personal recommendation is to get as much memory as
        | you can if you want to work with local models [including
        | VRAM if you are planning to run on a GPU].
 
        | rootusrootus wrote:
        | 32GB should be fine. I went a little overboard and got a
        | new MBP with M2 Max and 96GB, but the hardware is really
        | best suited at this point to a 30B model. I can and do
        | play around with 65B models, but at that point you're
        | making a fairly big tradeoff in generation speed for an
        | incremental increase in quality.
        | 
        | As a datapoint, I have a 30B model [0] loaded right now
        | and it's using 23.44GB of RAM. Getting around 9
        | tokens/sec, which is very usable. I also have the 65B
        | version of the same model [1] and it's good for around
        | 3.6 tokens/second, but it uses 44GB of RAM. Not unusably
        | slow, but more often than not I opt for the 30B because
        | it's good enough and a lot faster.
        | 
        | Haven't tried the llama2 70B yet.
        | 
        | [0] https://huggingface.co/TheBloke/upstage-
        | llama-30b-instruct-2... [1]
        | https://huggingface.co/TheBloke/Upstage-
        | Llama1-65B-Instruct-...
 
        | swader999 wrote:
        | What's your use case for local if you don't mind?
 
      | dividedbyzero wrote:
      | By lower-end I meant that the Airs are quite low-end in
      | general (compared to Pro/Studio). I have the maxed-out
      | 24GB, but 16GB may be more common among people who might
      | use an Air for this kind of thing.
 
| Der_Einzige wrote:
| The correct answer, as always, is the oobabooga
| text-generation-webui, which supports all of the relevant
| backends: https://github.com/oobabooga/text-generation-webui
 
  | cypress66 wrote:
  | Yep. Use ooba. And people who like to RP often use ooba as a
  | backend, and SillyTavern as a frontend.
 
    | Roark66 wrote:
    | Can it run ONNX transformer models? I found optimised ONNX
    | models are at least twice the speed of vanilla PyTorch on the
    | CPU.
 
| TheAceOfHearts wrote:
| How do you decide which model variant to use? There's a bunch of
| quant-method variations of Llama-2-13B-chat-GGML [0]; how do you
| know which one to use? Reading the "Explanation of the new
| k-quant methods" is a bit opaque.
| 
| [0] https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML
 
| sva_ wrote:
| If you just want to do inference/mess around with the model and
| have a 16GB GPU, then this[0] is enough to paste into a notebook.
| You need to have access to the HF models though.
| 
| 0.
| https://github.com/huggingface/blog/blob/main/llama2.md#usin...
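| 
| A minimal sketch of that kind of setup with transformers and
| accelerate, assuming you've been granted access to the gated
| meta-llama weights (prompt and settings are just examples):
| 
|   import torch
|   from transformers import AutoModelForCausalLM, AutoTokenizer
| 
|   model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated repo
|   tok = AutoTokenizer.from_pretrained(model_id)
|   model = AutoModelForCausalLM.from_pretrained(
|       model_id, torch_dtype=torch.float16, device_map="auto")
| 
|   inputs = tok("Explain GGML quantization briefly.",
|                return_tensors="pt").to(model.device)
|   out = model.generate(**inputs, max_new_tokens=128)
|   print(tok.decode(out[0], skip_special_tokens=True))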
 
| oaththrowaway wrote:
| Off topic: is there a way to use one of the LLMs, have it ingest
| data from a SQLite database, and then ask it questions about
| that data?
 
  | politelemon wrote:
  | Have a look at this too; it's the kind of integration that
  | LangChain can be good at: https://walkingtree.tech/natural-
  | language-to-query-your-sql-...
 
  | seanthemon wrote:
  | You can, but as a crazy idea you can also ask ChatGPT to write
  | SELECT queries using the functions parameter they added
  | recently - you can also ask it to write JSONPath.
  | 
  | As long as it understands the schema and the general shape of
  | the data, it does a fairly good job. Just be careful not to do
  | too much with one prompt; you can easily cause hallucinations.
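  | 
  | Roughly what that looks like with the mid-2023 openai Python
  | client, as a sketch; the function name and schema here are
  | made up, and you'd still validate and run the SQL yourself:
  | 
  |   import openai  # pre-1.0 client style
  | 
  |   resp = openai.ChatCompletion.create(
  |       model="gpt-3.5-turbo-0613",
  |       messages=[
  |           {"role": "system",
  |            "content": "Schema: orders(id, customer, total)"},
  |           {"role": "user",
  |            "content": "Total spend per customer?"},
  |       ],
  |       functions=[{
  |           "name": "run_sql",  # hypothetical function
  |           "description": "Run a read-only SELECT query",
  |           "parameters": {
  |               "type": "object",
  |               "properties": {"sql": {"type": "string"}},
  |               "required": ["sql"],
  |           },
  |       }],
  |   )
  |   msg = resp["choices"][0]["message"]
  |   print(msg.get("function_call", msg))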
 
  | simonw wrote:
  | I've experimented with that a bit.
  | 
  | Currently the absolutely best way to do that is to upload a
  | SQLite database file to ChatGPT Code Interpreter.
  | 
  | I'm hoping that someone will fine-tune an openly licensed model
  | for this at some point that can give results as good as Code
  | Interpreter does.
 
  | siquick wrote:
  | You can migrate that data to a vector database (e.g. Pinecone
  | or pgvector) and then query it. I didn't write it, but this
  | guide has a good overview of the concepts and some code. In
  | your case you'd just replace the web crawler with database
  | queries. All the libraries used also exist in Python.
  | 
  | https://www.pinecone.io/learn/javascript-chatbot/
 
  | thisisit wrote:
| You can, but you'll end up trading the precise answers you get
| from querying for a chance of hallucinations.
 
| politelemon wrote:
| Llama.cpp can run on Android too.
 
| synaesthesisx wrote:
| This is usable, but hopefully folks manage to tweak it a bit
| further for even higher tokens/s. I'm running Llama.cpp locally
| on my M2 Max (32 GB) with decent performance but sticking to the
| 7B model for now.
 
___________________________________________________________________
(page generated 2023-07-25 23:00 UTC)