[HN Gopher] SmoothLLM: Defending Large Language Models Against J...
___________________________________________________________________
 
SmoothLLM: Defending Large Language Models Against Jailbreaking
Attacks
 
Author : amai
Score  : 44 points
Date   : 2024-11-16 22:37 UTC (17 hours ago)
 
web link (arxiv.org)
w3m dump (arxiv.org)
 
| handfuloflight wrote:
| GitHub: https://github.com/arobey1/smooth-llm
 
| ipython wrote:
| It concerns me that these defensive techniques themselves often
| require even more LLM inference calls.
| 
| Just skimmed the GitHub repo for this one, and the README mentions
| four additional LLM inferences for each incoming request - so now
| we've 5x'ed the (already expensive) compute required to answer a
| query?
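| 
| For context, the scheme in the repo is roughly: make N randomly
| perturbed copies of the incoming prompt, run inference on each,
| then majority-vote on whether the outputs look jailbroken. A
| minimal sketch in Python - llm_generate and is_jailbroken are
| stand-ins for the model call and the refusal check, not the
| repo's actual API:
| 
|     import random
|     import string
| 
|     def perturb(prompt: str, q: float = 0.1) -> str:
|         # Randomly swap ~q of the prompt's characters.
|         chars = list(prompt)
|         if not chars:
|             return prompt
|         k = max(1, int(q * len(chars)))
|         for i in random.sample(range(len(chars)), k):
|             chars[i] = random.choice(string.printable)
|         return "".join(chars)
| 
|     def smoothllm(prompt: str, n: int = 4) -> str:
|         # n model calls per query, on top of whatever the
|         # undefended pipeline already costs.
|         responses = [llm_generate(perturb(prompt))
|                      for _ in range(n)]
|         flags = [is_jailbroken(r) for r in responses]
|         majority = sum(flags) * 2 > len(flags)
|         # Return a response consistent with the majority vote.
|         for r, f in zip(responses, flags):
|             if f == majority:
|                 return r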
 
| padolsey wrote:
| So basically this just adds random characters to input prompts to
| break jailbreaking attempts? IMHO, if you can't make a single-
| inference solution, you may as well just run a couple of output
| filters, no? That approach appears to get reasonable results, and
| if you make the filtering more domain-specific, you'll probably do
| even better. Intuition says there's no "general solution" to
| jailbreaking, so maybe that's a lost cause and we need to build up
| layers of obscurity, of which smooth-llm is just one part.
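| 
| What I mean by output filters, concretely: generate once, then
| run cheap checks on the output before returning it. A sketch -
| llm_generate is a stand-in for the model call, and the patterns
| are illustrative, not a real blocklist:
| 
|     import re
| 
|     DENY_PATTERNS = [
|         re.compile(p, re.IGNORECASE)
|         for p in [r"how to synthesi[sz]e",
|                   r"bypass the safeguards"]
|     ]
| 
|     def filtered_generate(prompt: str) -> str:
|         # One inference call plus a cheap regex pass; a
|         # domain-specific classifier could slot in here instead.
|         response = llm_generate(prompt)
|         if any(p.search(response) for p in DENY_PATTERNS):
|             return "Sorry, I can't help with that."
|         return response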
 
  | ipython wrote:
  | Right. This seems to be the latest in the "throw random stuff
  | at the wall and see what sticks" series of generative AI
  | papers.
  | 
  | I don't know if I'm too stupid to understand, or if this truly
  | is just "add random stuff to the prompt" dressed up in flowery
  | academic language.
 
    | pxmpxm wrote:
    | Not surprising - from what I can tell, machine learning has
    | been going down this route for a decade.
    | 
    | Anything involving the higher-level abstractions (TensorFlow
    | / Keras / whatever) is full of handwavy stuff about this or
    | that activation function / number of layers / model
    | architecture working best, and doing trial and error with a
    | different component if it doesn't. Closer to kids playing
    | with Legos than to statistics.
 
      | malwrar wrote:
      | I've actually noticed this in other areas too. Tons of
      | papers just swap parts out of existing work, maybe add a
      | novel idea or two, and boom: new proposed technique, new
      | paper. I remember first noticing it after learning to parse
      | the academic nomenclature of a subject I was into at the
      | time (SLAM), and feeling ripped off. But hey, once you've
      | caught up with a subject, it's a good reading shortcut and
      | helps you zoom in on the genuinely new ideas.
 
| mapmeld wrote:
| There are some authors in common with a more recent paper
| "Defending Large Language Models against Jailbreak Attacks via
| Semantic Smoothing" https://arxiv.org/abs/2402.16192
 
| freeone3000 wrote:
| I find it very interesting that "aligning with human desires"
| somehow includes preventing a human from bypassing the safeguards
| to generate "objectionable" content (whatever that is). I think
| the "safeguards" are a bigger obstacle to aligning with my
| desires.
 
  | ipython wrote:
  | We've seen where that ends up.
  | https://en.m.wikipedia.org/wiki/Tay_(chatbot)
 
  | wruza wrote:
  | Another question is whether that initial unalignment comes from
  | poor filtering of datasets, or whether it emerges from regular,
  | pre-filtered cultural texts.
  | 
  | In other words, was an "unaligned" LLM taught bad things by bad
  | people, or does it simply _see them naturally_ and point them
  | out with the purity of a child? The latter would say something
  | about ourselves. Personally, I think people tend to selectively
  | ignore things too much.
 
    | GuB-42 wrote:
    | We can't avoid teaching bad things to an LLM if we want it to
    | have useful knowledge. For example, you may teach an LLM
    | about Nazis; that's expected knowledge. But then you can
    | prompt the LLM to be a Nazi. You can teach it how to avoid
    | poisoning yourself, but then you've taught it how to poison
    | people. And the smarter the model is, the better it will be
    | at extracting bad things from good things by negation.
    | 
    | There are actually training datasets full of bad things by
    | bad people; the intention is to use them negatively, so as to
    | teach the LLM some morality.
 
      | ujikoluk wrote:
      | Maybe we should just avoid trying to classify things as
      | good or bad.
 
  | threeseed wrote:
  | The safeguards stem from a desire to make tools like Claude
  | accessible to a very wide audience, since use cases such as
  | education are very important.
  | 
  | So it seems like people such as yourself who have an issue with
  | safeguards should seek out LLMs catered to adult audiences,
  | rather than trying to remove safeguards entirely.
 
    | Zambyte wrote:
    | How does making it harder for users to extract the
    | information they're after make things safer for a wider
    | audience?
 
      | dbspin wrote:
      | Assuming this question is in good faith...
      | 
      | There are numerous things that might be true but damaging
      | for a child's development to be exposed to, from overly
      | punitive criticism, to graphic depictions of violence, to
      | advocacy of and specific directions for self-harm.
      | Countless examples are trivial to generate.
      | 
      | Similarly, the use of these tools is already having
      | dramatic effects on spearphishing, misinformation, etc.
      | Guardrails on all the non-open-source models have an
      | enormous impact on slowing / limiting the damage this does
      | at scale. Even with retrained Llama-based models, it's more
      | difficult than you might imagine to create a truly
      | Machiavellian or uncensored LLM - which is entirely due to
      | the work that's been done during and post-training to
      | constrain those behaviours. This is an unalloyed good in
      | constraining the weaponisation of LLMs.
 
      | Drakim wrote:
      | That's like asking why we should have porn filters on
      | school computers - after all, all they do is prevent the
      | user from finding what they're looking for, which is bad.
 
    | selfhoster11 wrote:
    | Here is a revolutionary concept: give the users a toggle.
    | 
    | Make it controllable by an IT department if logging in with
    | an organisation-tied account, but give people a choice.
 
  | Zambyte wrote:
  | What tools do we have to defend against LLM lockdown attacks?
 
___________________________________________________________________
(page generated 2024-11-17 16:01 UTC)