|
| handfuloflight wrote:
| GitHub: https://github.com/arobey1/smooth-llm
| ipython wrote:
| It concerns me that these defensive techniques themselves often
| require even more LLM inference calls.
|
| Just skimmed the GitHub repo for this one, and the README
| mentions four additional LLM inferences for each incoming
| request - so now we've 5x'ed the (already expensive) compute
| required to answer a query?
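|
| (Roughly, as I read the README, the defense makes several
| randomly perturbed copies of each incoming prompt, runs each copy
| through the model, and majority-votes on whether the responses
| look like a jailbreak. A minimal Python sketch of that idea -- not
| the repo's actual code, and `query_llm` is a placeholder for
| whatever inference call you already make once per request:)
|
|   import random
|   import string
|
|   def perturb(prompt, q=0.1):
|       # Randomly replace a fraction q of the characters in the prompt.
|       chars = list(prompt)
|       n_swap = max(1, int(len(chars) * q))
|       for i in random.sample(range(len(chars)), n_swap):
|           chars[i] = random.choice(string.printable)
|       return "".join(chars)
|
|   def looks_like_refusal(response):
|       # Toy jailbreak check: did the model refuse?
|       return any(kw in response for kw in ("I'm sorry", "I cannot", "I can't"))
|
|   def smoothed_query(prompt, query_llm, n_copies=4):
|       # One inference per perturbed copy -- this is where the extra cost lives.
|       responses = [query_llm(perturb(prompt)) for _ in range(n_copies)]
|       refusals = sum(looks_like_refusal(r) for r in responses)
|       if refusals > n_copies / 2:
|           return "Request declined."
|       # Otherwise return a response consistent with the majority verdict.
|       return next(r for r in responses if not looks_like_refusal(r))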
| padolsey wrote:
| So basically this just adds random characters to input prompts to
| break jailbreaking attempts? IMHO, if you can't make a single-
| inference solution, you may as well just run a couple of output
| filters, no? That appeared to have reasonable results, and if you
| make such filtering more domain-specific, you'll probably make it
| even better. Intuition says there's no "general solution" to
| jailbreaking, so maybe it's a lost cause and we need to build up
| layers of obscurity, of which smooth-llm is just one part.
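|
| (A minimal sketch of the single-inference-plus-output-filter
| alternative described above; the patterns are purely illustrative,
| not a vetted blocklist, and `query_llm` is again a stand-in for
| the normal inference call:)
|
|   import re
|
|   BLOCKED_PATTERNS = [
|       r"step\s*\d+\s*:.*(synthesi[sz]e|detonat)",  # illustrative only
|       r"\bhow to (hack|phish)\b",
|   ]
|
|   def filter_output(response):
|       # One cheap regex pass over the reply instead of extra inferences.
|       for pattern in BLOCKED_PATTERNS:
|           if re.search(pattern, response, flags=re.IGNORECASE):
|               return "Response withheld by output filter."
|       return response
|
|   # Usage: reply = filter_output(query_llm(prompt))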
| ipython wrote:
| Right. This seems to be the latest in the "throw random stuff
| at the wall and see what sticks" series of generative AI
| papers.
|
| I don't know if I'm too stupid to understand, or if this truly
| is just "add random stuff to the prompt" dressed up in flowery
| academic language.
| pxmpxm wrote:
| Not surprising - from what I can tell, machine learning has
| been going down this route for a decade.
|
| Anything involving the higher-level abstractions (TensorFlow /
| Keras / whatever) is full of handwavy stuff about this or that
| activation function / number of layers / model architecture
| working best, followed by trial and error with a different
| component if it doesn't. Closer to kids playing with Legos than
| statistics.
| malwrar wrote:
| I've actually noticed this in other areas too. Tons of papers
| just swap parts out of existing work, maybe add a novel idea or
| two, and then boom: new proposed technique, new paper. I
| remember first noticing it after learning to parse the academic
| nomenclature for a particular subject I was into at the time
| (SLAM) and feeling ripped off, but hey, once you catch up with a
| subject it's a good reading shortcut that helps you zoom in on
| new ideas.
| mapmeld wrote:
| There are some authors in common with a more recent paper
| "Defending Large Language Models against Jailbreak Attacks via
| Semantic Smoothing" https://arxiv.org/abs/2402.16192
| freeone3000 wrote:
| I find it very interesting that "aligning with human desires"
| somehow includes preventing a human from bypassing the
| safeguards to generate "objectionable" content (whatever that
| is). I think the "safeguards" are a bigger problem for aligning
| with my desires.
| ipython wrote:
| We've seen where that ends up.
| https://en.m.wikipedia.org/wiki/Tay_(chatbot)
| wruza wrote:
| Another question is whether that initial unalignment comes from
| poor filtering of datasets, or whether it emerges from regular,
| pre-filtered cultural texts.
|
| In other words, was an "unaligned" LLM taught bad things by bad
| people, or does it simply _see it naturally_ and point it out
| with the purity of a child? The latter would say something about
| ourselves. Personally, I think people tend to selectively ignore
| things too much.
| GuB-42 wrote:
| We can't avoid teaching bad things to an LLM if we want it to
| have useful knowledge. For example, you may teach an LLM about
| Nazis; that's expected knowledge. But then, you can prompt the
| LLM to be a Nazi. You can teach it how to avoid poisoning
| yourself, but then you have taught it how to poison people. And
| the smarter the model is, the better it will be at extracting
| bad things from good things by negation.
|
| There are actually training datasets full of bad things by bad
| people; the intention is to use them negatively, so as to teach
| the LLM some morality.
| ujikoluk wrote:
| Maybe we should just avoid trying to classify things as
| good or bad.
| threeseed wrote:
| The safeguards stem from a desire to make tools like Claude
| accessible to a very wide audience, since use cases such as
| education are very important.
|
| And so it seems like people such as yourself, who do have an
| issue with safeguards, should seek out LLMs catered to adult
| audiences rather than trying to remove safeguards entirely.
| Zambyte wrote:
| How does making it harder for the user to extract the
| information they are after make the tool safer for a wider
| audience?
| dbspin wrote:
| Assuming this question is asked in good faith...
|
| There are numerous things that might be true but that could be
| damaging for a child's development to be exposed to: from
| overly punitive criticism, to graphic depictions of violence,
| to advocacy of and specific directions for self-harm. Countless
| examples are trivial to generate.
|
| Similarly, the use of these tools is already having dramatic
| effects on spearphishing, misinformation, etc. Guardrails on
| all the non-open-source models have an enormous impact on
| slowing / limiting the damage this does at scale. Even with
| retrained Llama-based models, it's more difficult than you
| might imagine to create a truly Machiavellian or uncensored LLM
| - which is entirely due to the work that's been done during and
| after training to constrain those behaviours. This is an
| unalloyed good in constraining the weaponisation of LLMs.
| Drakim wrote:
| That's like asking why we should have porn filters on school
| computers; after all, all they do is prevent the user from
| finding what they are looking for, which is bad.
| selfhoster11 wrote:
| Here is a revolutionary concept: give the users a toggle.
|
| Make it controllable by an IT department if logging in with
| an organisation-tied account, but give people a choice.
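|
| (A hypothetical sketch of that toggle; the field names are made
| up, but the idea is that an organisation-tied account can pin
| the setting while everyone else gets a choice:)
|
|   from dataclasses import dataclass
|   from typing import Optional
|
|   @dataclass
|   class SafetyPolicy:
|       org_enforced: Optional[bool]  # None = org leaves it to the user
|       user_opt_out: bool            # the user's own toggle
|
|   def safeguards_enabled(policy):
|       if policy.org_enforced is not None:
|           return policy.org_enforced   # IT department setting wins
|       return not policy.user_opt_out   # otherwise the user's choice
|
|   # e.g. safeguards_enabled(SafetyPolicy(None, True)) -> False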
| Zambyte wrote:
| What tools do we have to defend against LLM lockdown attacks?