|
| guy98238710 wrote:
| > curl -L "https://replicate.fyi/install-llama-cpp" | bash
|
| Seriously? Pipe script from someone's website directly to bash?
| gattilorenz wrote:
| Yes. If you are worried, you can redirect it to a file and then
| sh it. It doesn't get much easier to inspect than that...
| cjbprime wrote:
| Either you trust the TLS session to their website to deliver
| you software you're going to run, or you don't.
| madars wrote:
| That's the recommended way to get Rust nightly too:
| https://rustup.rs/ But don't look there, there is memory safety
| somewhere!
| raccolta wrote:
| oh, this again.
| handelaar wrote:
| Idiot question: if I have access to sentence-by-sentence
| professionally-translated text of foreign-language-to-English in
| gigantic quantities, and I fed the originals as prompts and the
| translations as completions...
|
| ... would I be likely to get anything useful if I then fed it new
| prompts in a similar style? Or would it just generate gibberish?
| seanthemon wrote:
| Indeed, it sounds like you have what's called fine-tuning data
| (given an input, here's the output). There's loads of info on
| fine-tuning both here on HN and on Hugging Face's YouTube
| channel.
|
| Note: if you have sufficient data, look into existing models on
| Hugging Face; you may find a smaller, faster and more open
| (licensing-wise) model that you can fine-tune to get the
| results you want. Llama is hot, but not a catch-all for all
| tasks (as no model should be).
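|
| A minimal sketch of what that data prep often looks like - JSONL of
| prompt/completion pairs (the field names and translation examples here
| are placeholders; the exact format depends on the fine-tuning
| framework):
|
|     import json
|
|     # Hypothetical sentence pairs: source text plus its professional
|     # English translation.
|     pairs = [
|         ("Je pense, donc je suis.", "I think, therefore I am."),
|         ("L'habit ne fait pas le moine.", "Clothes do not make the man."),
|     ]
|
|     # Many fine-tuning pipelines accept JSONL with one record per line.
|     with open("train.jsonl", "w", encoding="utf-8") as f:
|         for src, tgt in pairs:
|             record = {"prompt": f"Translate to English: {src}",
|                       "completion": tgt}
|             f.write(json.dumps(record, ensure_ascii=False) + "\n")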
|
| Happy inferring!
| nl wrote:
| If you have that much data you can build your own model that
| can be much smaller and faster.
|
| A simple version is covered in a beginner tutorial:
| https://pytorch.org/tutorials/beginner/translation_transform...
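|
| The core of that tutorial is PyTorch's built-in Transformer module; a
| bare-bones forward pass looks roughly like this (toy dimensions and
| random tensors in place of real tokenized data):
|
|     import torch
|     import torch.nn as nn
|
|     # A real translation model adds token embeddings, positional
|     # encoding and a vocabulary projection around this core.
|     d_model = 512
|     model = nn.Transformer(d_model=d_model, nhead=8,
|                            num_encoder_layers=3, num_decoder_layers=3)
|
|     src = torch.rand(10, 32, d_model)  # (source length, batch, d_model)
|     tgt = torch.rand(20, 32, d_model)  # (target length, batch, d_model)
|     out = model(src, tgt)              # (target length, batch, d_model)
|     print(out.shape)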
| maxlin wrote:
| I might be missing something. The article asks me to run a bash
| script on Windows.
|
| I assume this would still need to be run natively to access GPU
| resources etc., so can someone illuminate what a Windows user is
| actually expected to do to make this run?
|
| I'm currently paying $15 a month in ChatGPT queries for a
| personal translation/summarizer project. I run Whisper
| (const.me's GPU fork) locally and would love to get the LLM part
| local eventually too! The system generates 30k queries a month
| but is not super-affected by delay, so lower token rates might
| work too.
| nomel wrote:
| Windows has supported Linux tools for some time now, using WSL:
| https://learn.microsoft.com/en-us/windows/wsl/about
|
| No idea if it will work in this case, but it does with
| llama.cpp: https://github.com/ggerganov/llama.cpp/issues/103
| maxlin wrote:
| I know (I should have included that in my earlier comment, but
| editing would've felt weird), but I still assume one should
| run the result natively, so I'm asking if/where there's some
| jumping around required.
|
| Last time I tried running an LLM I tried both WSL and native on 2
| machines and just got lovecraftian-tier errors, so I'm waiting to
| see if I'm missing something obvious before going down that route
| again.
| nomand wrote:
| Is it possible for such a local install to retain conversation
| history, so that if, for example, you're working on a project and
| use it as your assistant across many days, you can continue
| conversations and the model keeps track of what you and it
| already know?
| simonw wrote:
| My LLM command line tool can do that - it logs everything to a
| SQLite database and has an option to continue a conversation:
| https://llm.datasette.io
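|
| The general idea (a sketch only, not the actual schema the LLM tool
| uses) is just to log every exchange keyed by a conversation id and
| replay it later:
|
|     import sqlite3
|
|     db = sqlite3.connect("conversations.db")
|     db.execute("""CREATE TABLE IF NOT EXISTS log (
|         conversation_id TEXT, prompt TEXT, response TEXT,
|         ts DATETIME DEFAULT CURRENT_TIMESTAMP)""")
|
|     def log_exchange(conversation_id, prompt, response):
|         # Store each prompt/response pair so the thread can be resumed.
|         db.execute("INSERT INTO log (conversation_id, prompt, response) "
|                    "VALUES (?, ?, ?)", (conversation_id, prompt, response))
|         db.commit()
|
|     def history(conversation_id):
|         # Earlier exchanges, oldest first, ready to prepend to a new prompt.
|         return db.execute("SELECT prompt, response FROM log "
|                           "WHERE conversation_id = ? ORDER BY ts",
|                           (conversation_id,)).fetchall()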
| knodi123 wrote:
| llama is just an input/output engine. It takes a big string as
| input, and gives a big string of output.
|
| Save your outputs if you want; you can copy/paste them into any
| editor. Or make a shell script that mirrors outputs to a file
| and use _that_ as your main interface. It's up to the user.
| jmiskovic wrote:
| There is no fully built solution, only bits and pieces. I
| noticed that llama outputs tend to degrade as the amount of text
| grows; the text becomes too repetitive and focused, and you have
| to raise the temperature to break the model out of loops.
| nomand wrote:
| Does what you're saying mean you can only ask questions and
| get answers in a single step, and that having a long
| discussion where refinement of output is arrived at through
| conversation isn't possible?
| krisoft wrote:
| My understanding is that at a high level you can look at
| this model as a black box which accepts a string and
| outputs a string.
|
| If you want it to "remember" things, you do that by
| appending all the previous conversations together and
| supplying them in the input string.
|
| In an ideal world this would work perfectly. It would read
| through the whole conversation and provide the right
| output you expect, exactly as if it "remembered" the
| conversation. In reality there are all kinds of issues which
| can crop up as the input grows longer and longer. One is
| that it takes more and more processing power and time for
| it to "read through" everything previously said. And there
| are things like what jmiskovic said, where the output quality
| can also degrade in perhaps unexpected ways.
|
| But that also doesn't mean that "refinement of output is
| arrived at through conversation isn't possible". It is not
| that black and white, just that you can run into trouble
| as the length of the discussion grows.
|
| I don't have direct experience with long conversations so I
| can't tell you how long is definitely too long, and how
| long is still safe. There are probably some tricks one
| can do to work around these, and things one can do if one
| unpacks that "black box" understanding of the
| process. But even without that, you could imagine a
| "consolidation" process where the AI is instructed to write
| short notes about a given length of conversation, and then
| those shorter notes would be copied into the next input
| instead of the full previous conversation. All of these are
| possible, but you won't have a turn-key solution for it
| just yet.
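|
| A minimal sketch of that "append everything" loop, with generate()
| standing in for whatever actually produces the completion (llama.cpp
| bindings, an HTTP API, ...):
|
|     def generate(prompt: str) -> str:
|         # Placeholder for the real model call.
|         raise NotImplementedError
|
|     history = ""
|     while True:
|         user_input = input("You: ")
|         # "Memory" is nothing more than feeding the transcript back in.
|         prompt = history + f"User: {user_input}\nAssistant:"
|         reply = generate(prompt)
|         print("Assistant:", reply)
|         history = prompt + f" {reply}\n"
|         # Optional consolidation: once the transcript gets long, replace
|         # it with a model-written summary so the next prompt still fits
|         # in the context window.
|         if len(history) > 8000:
|             history = generate("Summarize this conversation briefly:\n"
|                                + history) + "\n"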
| cjbprime wrote:
| The limit here is the "context window" length of the
| model, measured in tokens, which will quickly become too
| short to contain all of your previous conversations,
| which will mean it has to answer questions without access
| to all of that text. And within a single conversation, it
| will mean that it starts forgetting the text from the
| start of the conversation, once the [conversation + new
| prompt] reaches the context length.
|
| The kind of hacks that work around this are to train the
| model on the past conversations, and then rely on
| similarity in tensor space to pull the right (lossy) data
| back out of the model (or a separate database) later,
| based on its similarity to your question, and include it
| (or a summary of it, since summaries are smaller) within
| the context window for your new conversation, combined
| with your prompt. This is what people are talking about
| when they use the term "embeddings".
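|
| A sketch of that retrieval step, with embed() standing in for
| whichever embedding model you use - the core is just nearest-neighbour
| search over stored vectors:
|
|     import numpy as np
|
|     def embed(text: str) -> np.ndarray:
|         # Placeholder: a sentence-embedding model or API call in practice.
|         raise NotImplementedError
|
|     def most_similar(query: str, snippets: list[str], top_k: int = 3):
|         q = embed(query)
|         scored = []
|         for snippet in snippets:
|             v = embed(snippet)
|             # Cosine similarity between the question and stored snippets.
|             score = float(np.dot(q, v) /
|                           (np.linalg.norm(q) * np.linalg.norm(v)))
|             scored.append((score, snippet))
|         scored.sort(reverse=True)
|         return [s for _, s in scored[:top_k]]
|
|     # The top snippets (or summaries of them) then get pasted into the
|     # new prompt alongside the user's question.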
| nomand wrote:
| My benchmark is having a pair programming session
| spanning days and dozens of queries with ChatGPT, where we
| co-created a custom static site generator that works
| really well for my requirements. It was able to hold
| context for a while and not "forget" what code it had
| provided me dozens of messages earlier, it was able to
| "remember" corrections and refactors that I gave it, and
| overall it was incredibly useful for working out things like
| recursion for folder hierarchies and building data
| trees. This and similar use-cases, where memory is
| important, are when the model is used as a genuine assistant.
| krisoft wrote:
| Excellent! That sounds like a very useful personal
| benchmark then. You could test Llama v2 by copying in
| different lengths of snippets from that conversation and
| checking how useful you find its outputs.
| RicoElectrico wrote:
| curl -L "https://replicate.fyi/windows-install-llama-cpp"
|
| ... returns 404 Not Found
| thisisit wrote:
| The easiest way I found was to use GPT4All. Just download and
| install, grab a GGML version of Llama 2, copy it to the models
| directory in the installation folder, then fire up GPT4All and run.
| andreyk wrote:
| This covers three things: Llama.cpp (Mac/Windows/Linux), Ollama
| (Mac), and MLC LLM (iOS/Android).
|
| Which is not really comprehensive... If you have a Linux machine
| with GPUs, I'd just use Hugging Face's text-generation-inference
| (https://github.com/huggingface/text-generation-inference). And I
| am sure there are other things that could be covered.
| krisoft wrote:
| > If you have a Linux machine with GPUs
|
| Approximately how much VRAM does one need to run inference with
| Llama 2 on a GPU?
| lolinder wrote:
| Depends on which model. I haven't bothered doing it on my 8GB
| because the only model that would fit is the 7B model
| quantized to 4 bits, and that model at that size is pretty
| bad for most things. I think you could have fun with 13B with
| 12GB VRAM. The full size model would require >35GB even
| quantized.
| novaRom wrote:
| 16GB is the minimum to run the 7B model with float16 weights
| out of the box, with no further effort.
| Patrick_Devine wrote:
| Ollama works on Windows and Linux as well, but doesn't
| (yet) have GPU support for those platforms. You have to compile
| it yourself (it's a simple `go build .`), but it should work fine
| (albeit slowly). The benefit is you can still pull the llama2
| model really easily (with `ollama pull llama2`) and even use it
| with other runners.
|
| DISCLAIMER: I'm one of the developers behind Ollama.
| mschuster91 wrote:
| > DISCLAIMER: I'm one of the developers behind Ollama.
|
| I've got a feature suggestion - would it be possible to have the
| ollama CLI automatically start up the GUI/daemon if it's not
| running? There's only so much stuff one can keep in a MacBook
| Air's auto-start.
| jmorgan wrote:
| Good suggestion! This is definitely on the radar, so that
| running `ollama` will start the server when it's needed
| (instead of erroring!):
| https://github.com/jmorganca/ollama/issues/47
| DennisP wrote:
| I've been wondering, is the M2's neural engine usable for
| this?
| robotnikman wrote:
| Llama.cpp has been fun to experiment with. I was
| surprised by how easy it was to set up, much easier than when
| I tried to set up a local LLM almost a year ago.
| lolinder wrote:
| Just a note that you have to have at least 12GB VRAM for it to
| be worth even trying to use your GPU for LLaMA 2.
|
| The 7B model quantized to 4 bits can fit in 8GB VRAM with room
| for the context, but is pretty useless for getting good results
| in my experience. 13B is better but still nowhere near as
| good as the 70B, which would require >35GB VRAM to use at 4-bit
| quantization.
|
| My solution for playing with this was just to upgrade my PC's
| RAM to 64GB. It's slower than the GPU, but it was way cheaper
| and I can run the 70B model easily.
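|
| The back-of-the-envelope math behind those numbers (weights only;
| it ignores the context/KV cache and runtime overhead):
|
|     def weight_gb(params_billion, bits_per_weight):
|         # Memory for the weights alone, in GiB.
|         return params_billion * 1e9 * bits_per_weight / 8 / 2**30
|
|     for params in (7, 13, 70):
|         for bits in (16, 4):
|             print(f"{params}B @ {bits}-bit: "
|                   f"~{weight_gb(params, bits):.1f} GB")
|     # 7B @ 16-bit ~ 13 GB, 7B @ 4-bit ~ 3.3 GB,
|     # 13B @ 4-bit ~ 6.1 GB, 70B @ 4-bit ~ 32.6 GB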
| dc443 wrote:
| I have 2x 3090s. Do you know if it's feasible to use that 48GB
| total for running this?
| eurekin wrote:
| Yes, it runs totally fine. I ran it in Oobabooga/text-
| generation-webui. A nice thing about it is that it
| autodownloads all necessary GPU binaries on its own and
| creates an isolated conda env. I asked the same questions on
| the official 70B demo and got the same answers. I even got
| better answers with ooba, since the demo cuts text off early.
|
| Oobabooga: https://github.com/oobabooga/text-generation-webui
|
| Model: TheBloke_Llama-2-70B-chat-GPTQ from
| https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ
|
| ExLlama_HF loader, GPU split 20,22, context size 2048.
|
| On the Chat Settings tab, choose the Instruction template tab
| and pick Llama-v2 from the instruction template dropdown.
|
| Demo: https://huggingface.co/blog/llama2#demo
| zakki wrote:
| Are there any specific settings needed to make 2x 3090s work
| together?
| NoMoreNicksLeft wrote:
| Trying to figure out what hardware to convince my boss to
| spend on... if we were to get one of the A6000/48GB cards,
| would that see significant performance improvements over just
| a 4090/24GB? The primary limitation is VRAM, is it not?
| cjbprime wrote:
| You might consider getting a Mac Studio (with as much RAM
| as you can afford up to 192GB) instead, since 192GB is more
| (unified) memory than you're going to easily get to with
| GPUs.
| lolinder wrote:
| VRAM is what gets you up to the larger model sizes, and
| 24GB isn't enough to load the full 70B even at 4 bits; you
| need at least 35GB plus some extra for the context. So it
| depends a lot on what you want to do--fine-tuning will take
| even more, as I understand it.
|
| The card's speed will affect your performance, but I don't
| know enough about different graphics cards to tell you
| specifics.
| ErneX wrote:
| Apple Silicon Macs might not have great GPUs, but they do have
| unified memory. I need to try this on mine; I have 96GB of RAM
| on my M2 Max.
| krychu wrote:
| Self-plug. Here's a fork of the original llama 2 code adapted to
| run on the CPU or MPS (M1/M2 GPU) if available:
|
| https://github.com/krychu/llama
|
| It runs with the original weights, and gets you to ~4 tokens/sec
| on a MacBook Pro M1 with the 7B model.
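|
| The device-selection part of that is standard PyTorch, roughly:
|
|     import torch
|
|     # Prefer the Apple-silicon GPU (MPS backend) when present,
|     # otherwise fall back to the CPU.
|     device = torch.device("mps" if torch.backends.mps.is_available()
|                           else "cpu")
|     print(f"running on {device}")
|     # model.to(device); input tensors must be moved to the same device.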
| rootusrootus wrote:
| For most people who just want to play around and are using macOS
| or Windows, I'd just recommend lmstudio.ai. Nice interface, with
| super easy searching and downloading of new models.
| dividedbyzero wrote:
| Does it make any sense to try this on a lower-end Mac (like an
| M2 Air)?
| mchiang wrote:
| Yeah! How much memory do you have?
|
| If by lower-end MacBook Air you mean one with 8GB of memory, try
| the smaller models (such as Orca Mini 3B). You can do this
| via LM Studio, Oobabooga/text-generation-webui, KoboldCPP,
| GPT4All, ctransformers, and more.
|
| I'm biased since I work on Ollama, and if you want to try it
| out:
|
| 1. Download https://ollama.ai/download
|
| 2. `ollama run orca`
|
| 3. Enter your input at the prompt
|
| Note: Ollama is open source, and you can also compile it yourself
| from https://github.com/jmorganca/ollama
| bdavbdav wrote:
| I'm deliberating on how much RAM to get in my new MBP. Is
| 32GB going to stand me in good stead?
| mchiang wrote:
| Local memory management will definitely get better in the
| future.
|
| For now:
|
| You should have at least 8 GB of RAM to run the 3B
| models, 16 GB to run the 7B models, and 32 GB to run the
| 13B models.
|
| My personal recommendation is to get as much memory as
| you can if you want to work with local models [including
| VRAM if you are planning to execute on the GPU].
| rootusrootus wrote:
| 32GB should be fine. I went a little overboard and got a
| new MBP with an M2 Max and 96GB, but the hardware is really
| best suited at this point to a 30B model. I can and do
| play around with 65B models, but at that point you're
| making a fairly big tradeoff in generation speed for an
| incremental increase in quality.
|
| As a datapoint, I have a 30B model [0] loaded right now
| and it's using 23.44GB of RAM. Getting around 9
| tokens/sec, which is very usable. I also have the 65B
| version of the same model [1] and it's good for around
| 3.6 tokens/second, but it uses 44GB of RAM. Not unusably
| slow, but more often than not I opt for the 30B because
| it's good enough and a lot faster.
|
| Haven't tried the llama2 70B yet.
|
| [0] https://huggingface.co/TheBloke/upstage-llama-30b-instruct-2...
| [1] https://huggingface.co/TheBloke/Upstage-Llama1-65B-Instruct-...
| swader999 wrote:
| What's your use case for local if you don't mind?
| dividedbyzero wrote:
| By lower-end I meant that the Airs are quite low-end in
| general (compared to the Pro/Studio). I have the maxed-out
| 24GB, but 16GB may be more common among people who might
| use an Air for this kind of thing.
| Der_Einzige wrote:
| The correct answer, as always, is the oobabooga text-generation-
| webui, which supports all of the relevant backends:
| https://github.com/oobabooga/text-generation-webui
| cypress66 wrote:
| Yep. Use ooba. People who like to RP often use ooba as a
| backend and SillyTavern as a frontend.
| Roark66 wrote:
| Can it run ONNX transformer models? I found optimized ONNX
| models are at least twice the speed of vanilla PyTorch on the
| CPU.
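|
| For reference, CPU inference with onnxruntime looks roughly like this
| (the model path, input shape and dtype here are placeholders):
|
|     import numpy as np
|     import onnxruntime as ort
|
|     # Hypothetical exported model; input names/shapes depend on the export.
|     session = ort.InferenceSession("model.onnx",
|                                    providers=["CPUExecutionProvider"])
|     input_name = session.get_inputs()[0].name
|     dummy = np.random.rand(1, 128).astype(np.float32)
|     outputs = session.run(None, {input_name: dummy})
|     print(outputs[0].shape)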
| TheAceOfHearts wrote:
| How do you decide which model variant to use? There's a bunch of
| quant-method variations of Llama-2-13B-chat-GGML [0]; how do you
| know which one to use? Reading the "Explanation of the new
| k-quant methods" is a bit opaque.
|
| [0] https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML
| sva_ wrote:
| If you just want to do inference/mess around with the model and
| have a 16GB GPU, then this[0] is enough to paste into a notebook.
| You need to have access to the HF models though.
|
| 0.
| https://github.com/huggingface/blog/blob/main/llama2.md#usin...
| oaththrowaway wrote:
| Off topic: is there a way to use one of the LLMs and have it
| ingest data from a SQLite database and ask it questions about it?
| politelemon wrote:
| Have a look at this too, it's just an integration which
| langchain can be good at : https://walkingtree.tech/natural-
| language-to-query-your-sql-...
| seanthemon wrote:
| You can, but as a crazy idea you can also ask ChatGPT to write
| SELECT queries using the functions parameter they added
| recently - you can also ask it to write JSONPath.
|
| As long as it understands the schema and the general idea of the
| data, it does a fairly good job. Just be careful not to do too
| much with one prompt; you can easily cause hallucinations.
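|
| A rough sketch of that flow, with ask_model() as a placeholder for the
| ChatGPT (or local LLM) call that returns a SQL string:
|
|     import sqlite3
|
|     def ask_model(prompt: str) -> str:
|         # Placeholder for the model call (function-calling details vary).
|         raise NotImplementedError
|
|     db = sqlite3.connect("data.db")
|
|     # Give the model the schema so it knows what it can query.
|     schema = "\n".join(row[0] for row in db.execute(
|         "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL"))
|     question = "How many orders were placed last month?"
|     sql = ask_model(f"Schema:\n{schema}\n\n"
|                     f"Write a single SQLite SELECT answering: {question}")
|
|     # Sanity-check generated SQL before running it against real data.
|     print(db.execute(sql).fetchall())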
| simonw wrote:
| I've experimented with that a bit.
|
| Currently the absolutely best way to do that is to upload a
| SQLite database file to ChatGPT Code Interpreter.
|
| I'm hoping that someone will fine-tune an openly licensed model
| for this at some point that can give results as good as Code
| Interpreter does.
| siquick wrote:
| You can migrate that data to a vector database (eg Pinecone or
| pgVector) and then query it. I didn't write it but this guide
| has a good overview of concepts and some code. In your case
| you just replace the web crawler with database queries. All
| the libraries used also exist in Python.
|
| https://www.pinecone.io/learn/javascript-chatbot/
| thisisit wrote:
| You can, but you'll end up trading the precise answers you get
| from querying for a chance of hallucinations.
| politelemon wrote:
| Llama.cpp can run on Android too.
| synaesthesisx wrote:
| This is usable, but hopefully folks manage to tweak it a bit
| further for even higher tokens/s. I'm running Llama.cpp locally
| on my M2 Max (32 GB) with decent performance but sticking to the
| 7B model for now.