[HN Gopher] AI and Open Source in 2023
___________________________________________________________________
 
AI and Open Source in 2023
 
Author : belter
Score  : 42 points
Date   : 2023-11-04 18:50 UTC (4 hours ago)
 
web link (magazine.sebastianraschka.com)
w3m dump (magazine.sebastianraschka.com)
 
| gumballindie wrote:
| As soon as DRM for text and images is implemented, companies
| such as OpenAI will be in for a ride. Unfortunately, open source
| models will be sacrificed in the process, but we need a means of
| protecting against the rampant IP theft that AI companies engage
| in.
 
  | artninja1988 wrote:
  | No such thing as IP theft
 
    | gumballindie wrote:
    | Let me guess - you think IP and copyright are "rent seeking"?
    | What a weird age we live in, where people defend corporations
    | that steal our work. Quite a shift from the reverse.
 
    | minimaxir wrote:
    | It's entirely possible to steal IP, but the "AI art is theft"
    | part of it is still legally up in the air.
 
      | gumballindie wrote:
      | There are all sorts of things that are legal yet immoral or
      | disagreeable, so even if AI art theft is legalised, it's
      | still theft if the author doesn't want the work used that
      | way. It seems "AI" is quite reliant on ingesting and
      | storing massive amounts of proprietary data to emulate
      | "intelligence" - and that's equivalent to people downloading
      | and storing movies and music, something we are not permitted
      | to do by the same corporations that you wish to help.
 
      | jrm4 wrote:
      | I think what OP is referring to is the entirely reasonable
      | legal argument that IP infringement is not actually "theft."
      | 
      | The idea being: "Theft" isn't about "you get something you
      | don't own," it means "you deprive someone else of THEIR
      | property."
 
  | minimaxir wrote:
  | Which means that companies will just license the data used to
  | train models because they have the money to do so, or use their
  | own data instead. That's how Adobe's Firefly works right now,
  | and OpenAI just signed a licensing agreement with Shutterstock:
  | https://venturebeat.com/ai/shutterstock-signs-6-year-trainin...
  | 
  | Even if it became impossible to train AI on internet-accessible
  | data, that wouldn't slow the proliferation of generative AI; it
  | would only keep it entrenched and centralized in the hands of a
  | few players. And it would do nothing to stop AI from taking
  | jobs from artists, other than making it _harder_ for artists to
  | compete due to the lack of open-source alternatives.
 
    | gumballindie wrote:
    | No problem then: people willing to make their content
    | available to AI can do so by using such websites, and people
    | who value their work can use something else.
 
      | ben_w wrote:
      | That has the same vibe as responding to the invention of
      | the Jacquard loom by saying: "No problem then: people
      | willing to make their designs available to automation can
      | do so by using such punched cards, and people who value
      | their work can use something else."
      | 
      | Home weaving does still exist. Not a very big employer any
      | more, though.
 
        | LastTrain wrote:
        | All analogies are fraught but this one takes the cake. A
        | more apt one is not wanting the Jacquard loom people to
        | steal my designs.
 
  | jrm4 wrote:
  | You're probably getting downvoted because "DRM" was already
  | nearly a complete technical failure, and there's no reason to
  | believe it would be different for AI?
 
    | gumballindie wrote:
    | Normally I wouldn't advocate for DRM, but there needs to be a
    | way to protect our content from this madness. I understand
    | the backlash though, and I am not worried about downvotes.
 
      | Krasnol wrote:
      | Your content was never protected in the sense you want it
      | to be protected.
      | 
      | From the moment you put it up online for people to see and
      | hear, they were able to move on and create something else
      | based upon it, most of the time unconsciously. This is how
      | humanity works. This is the reason we're still on this
      | planet. AI accelerates the process, like any other tool
      | we've come up with since we climbed down from the trees.
      | 
      | You can complain and scream as much as you want, but it
      | won't change, even if you manage to regulate the whole
      | Western part of the internet. The rest of the world is
      | bigger and won't sleep.
 
    | ls612 wrote:
    | Unfortunately, I think you are wrong about this. DRM schemes
    | are evolving to be nearly unbreakable, given the widespread
    | adoption of security processors in everything.
    | 
    | As long as there is a massive fundamental asymmetry between
    | embedding a small amount of ROM in a chip and disassembling
    | and reading that ROM while still keeping the chip usable, DRM
    | schemes using PKI methods will become widespread and nigh
    | unbreakable.
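    | 
    | As a rough sketch of that asymmetry (illustrative only, not
    | any particular product's scheme): the content key is wrapped
    | to a device public key, so only the chip holding the matching
    | private key - burned into ROM at manufacture - can unwrap it.
    | Here Python's `cryptography` package stands in for the
    | hardware:
    | 
    |   # Sketch of a PKI envelope scheme. The private key would
    |   # live in tamper-resistant ROM; here it is just in memory.
    |   from cryptography.hazmat.primitives import hashes
    |   from cryptography.hazmat.primitives.asymmetric import (
    |       padding, rsa)
    |   
    |   oaep = padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
    |                       algorithm=hashes.SHA256(), label=None)
    |   
    |   # Keypair whose private half ships inside the chip.
    |   device_priv = rsa.generate_private_key(
    |       public_exponent=65537, key_size=2048)
    |   
    |   content_key = b"16-byte-AES-key!"  # would encrypt the media
    |   
    |   # Anyone can wrap the content key to the device...
    |   wrapped = device_priv.public_key().encrypt(
    |       content_key, oaep)
    |   
    |   # ...but only the device can unwrap it. Without physically
    |   # reading the ROM, `wrapped` reveals nothing about the key.
    |   assert device_priv.decrypt(wrapped, oaep) == content_key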
 
      | ben_w wrote:
      | Point [camera/microphone/eyeball] at [video/audio/text],
      | [press record/press record/start writing down what you
      | see].
 
  | candiddevmike wrote:
  | IMO, the entire "train on as much data as possible" approach is
  | nearing its end. There are diminishing returns, and it seems
  | like a dead-end strategy.
 
  | babyshake wrote:
  | Watermarking images, particularly very high resolution images,
  | I can understand, but I fail to see how you would watermark
  | text in a way that provides sufficient evidence it has been
  | used as training data, unless the model is just quoting it at
  | length.
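  | 
  | To illustrate the problem, here is a minimal sketch (purely
  | hypothetical, not a real scheme) of one naive text watermark:
  | hiding a bit pattern in zero-width characters. Any Unicode
  | normalisation pass - which training pipelines routinely apply -
  | strips the marks, so it is weak evidence of training use:
  | 
  |   # Naive text watermark: hide bits in zero-width characters.
  |   ZWS, ZWNJ = "\u200b", "\u200c"  # encode bit 0 / bit 1
  |   
  |   def embed(text: str, bits: str) -> str:
  |       # Tuck one invisible character per bit after word one.
  |       marks = "".join(ZWNJ if b == "1" else ZWS for b in bits)
  |       head, sep, tail = text.partition(" ")
  |       return head + marks + sep + tail
  |   
  |   def extract(text: str) -> str:
  |       return "".join("1" if c == ZWNJ else "0"
  |                      for c in text if c in (ZWS, ZWNJ))
  |   
  |   marked = embed("the quick brown fox", "1011")
  |   print(extract(marked))  # -> "1011"
  |   # A cleaning pass erases the watermark entirely:
  |   cleaned = marked.replace(ZWS, "").replace(ZWNJ, "")
  |   print(extract(cleaned))  # -> ""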
 
| andy99 wrote:
| Most importantly, 2023 was the year when "open source" got
| watered down to mean "you can look at the source code / weights,
| provided you agree to a bunch of conditions." Most of the models
| referenced, like Stable Diffusion (RAIL license) and Llama and
| its derivatives (a proprietary Facebook license with conditions
| on use and some commercial terms), are not open source in the
| sense that was understood a year ago. People protested a bit
| when the terminology started being abused, but that has mostly
| died down, and people now call these restrictive licenses open
| source. This (plus ongoing regulatory capture) is going to be
| the wedge that destroys software freedom and brings us back to a
| regime where a few companies dictate how computers can be used.
 
  | Der_Einzige wrote:
  | In practice this matters less than you think. You can't easily
  | prove that any outputs were generated by a particular model in
  | general, so any user can simply ignore your licenses and do as
  | they please.
  | 
  | I know it rustles purist feathers, but I don't understand why
  | we live in this pretend world that assumes folks particularly
  | care about respecting licenses. Consider how little success
  | the GNU folks have had using the courts to enforce their
  | licenses - and that's by Stallman's own admission.
  | 
  | AI is itself a subversive technology, whose current versions
  | rely on subversive training techniques. Why should we expect
  | everyone to suddenly want to follow the rules when they read a
  | poorly written restrictive "open source" license?
 
    | andy99 wrote:
    | For personal or noncommercial use I agree the restrictions
    | are meaningless, as they are for "bad actors" who would
    | potentially abuse the tools in contravention of the license.
    | But the license terms are a risk for commercial users,
    | especially when dealing with a big company like Meta. These
    | risks weren't previously there in, say, PyTorch, which is
    | BSD licensed. The ironic thing with these licenses is that
    | they are least enforceable on those who would be most likely
    | to abuse them: https://katedowninglaw.com/2023/07/13/ai-
    | licensing-cant-bala...
    | 
    | Re the success of free licenses, Linux (other than a few
    | arguable abuses) has remained free and unencumbered thanks to
    | GPL licensing.
 
    | nologic01 wrote:
    | Somehow the "AI defense" (namely, that it is not possible to
    | "prove" anything was used illegally) will open a Pandora's
    | box of viable channels for whitewashing outright theft. Steal
    | anything proprietary, run it through an AI filter that mixes
    | it with other stuff, and claim it as your own.
 
  | ebalit wrote:
  | Mistral 7B [1] and many models stemming from it are released
  | under the permissive Apache license.
  | 
  | Some might argue that a "pure" open-source release would
  | require the dataset and the training "recipe", since they would
  | be needed to reproduce the training, but reproducing it would
  | be so expensive that most people wouldn't be able to do much
  | with them.
  | 
  | IMO, a release with open weights but without the "source" is
  | much better than the opposite, a release with open source and
  | no trained weights.
  | 
  | And it's not like there was no progress on the open dataset
  | front:
  | 
  | - Together just released RedPajama V2 [2], with enough tokens
  | to train a very sizeable base model.
  | 
  | - Tsinghua released UltraFeedback, which allowed more people
  | to align models using RLHF-style methods (like the Zephyr
  | models from Hugging Face).
  | 
  | - And many, many others.
  | 
  | [1] https://mistral.ai/news/announcing-mistral-7b/
  | 
  | [2] https://github.com/togethercomputer/RedPajama-Data
 
  | seydor wrote:
  | Mistral appears to be quite open, and even better than Llama
  | imho.
 
  | emadm wrote:
  | Check out our recent fully open 3B model, which outperforms
  | most 7B models and runs on an iPhone/CPU - fully open,
  | including data and details.
  | 
  | Tuned versions outperform 13B Vicuna, WizardLM, etc.
  | 
  | https://stability.wandb.io/stability-llm/stable-lm/reports/S...
 
| nologic01 wrote:
| Is there a truly open source effort in the LLM space? Like a
| collaborative, crowd-sourced effort (possibly with academic
| institutions playing a major role) that relies on Creative
| Commons licensed or otherwise open data and produces a public
| good as the final outcome?
| 
| There is this ridiculous idea of AI moats and other machinations
| for the next big VC thing (god bless them, people have spent
| their energy on worse pursuits), but in a fundamental sense
| there is public-good-type infrastructure crying out to be
| developed for each major linguistic domain.
| 
| Maybe such an effort would not be cutting-edge enough to power
| the next corporate chatbot that will eliminate 99% of all jobs,
| but it would be a significant step up in our ability to process
| text.
 
  | vinni2 wrote:
  | I think OpenAssistant is the closest to what you are
  | describing, but their models are not yet that great.
  | https://open-assistant.io/
 
    | nulld3v wrote:
    | Open Assistant just shut down:
    | https://www.youtube.com/watch?v=gqtmUHhaplo
    | 
    | Cited reasons: Lack of resources, lack of maintainer time and
    | there being many new good alternatives.
 
  | dartos wrote:
  | RWKV is fully open source and is even part of the Linux
  | Foundation.
  | 
  | Idk why nobody ever talks about it.
 
  | TheCaptain4815 wrote:
  | EleutherAI fits that, I believe. In the olden days (1.5 years
  | ago) they probably had the best open source model with their
  | NeoX model, but it has since been eclipsed by Llama and other
  | "open source" models. They still have an active Discord with a
  | great community pushing things forward.
 
  | emadm wrote:
  | We back RWKV, EleutherAI and others at Stability AI.
  | 
  | We also have our carper.ai lab for the RL bits.
  | 
  | We are rolling out open language models and datasets for a
  | number of languages soon, too - see our recent Japanese
  | language models, for example.
  | 
  | Got some big plans coming soon; we have funded it all ourselves
  | but I'm sure others would like to help.
 
| seydor wrote:
| $NVDA went to the moon, and AI stocks skyrocketed, including any
| company with "AI" in its name. The rest of the story is typical
| by now: VC money flows, companies hide their trade secrets
| (prompts), and public research is derailed. It's all very
| premature; LLMs are not the end of the road.
 
  | brrrrrm wrote:
  | Why do you say "prompts" is the canonical trade secret?
 
| jimmySixDOF wrote:
| Looking back on the state of AI one year this month into the
| post-ChatGPT LLM era, I would like to single out Simon Willison
| as the MVP for contributions to open AI tooling. His Datasette
| projects are a great work in progress, and his prodigious blog
| posts and TIL snippets are state of the art - great onboarding
| to the whole ecosystem. I find myself using something he has
| produced in some way every day.
| 
| https://simonwillison.net/
 
| raincole wrote:
| I think open models are more like closed-source freemium
| applications. You get the weights, which are "compiled" from the
| source material. You're free to use them, but you can't, for
| example, remove one piece of source material from them.
 
___________________________________________________________________
(page generated 2023-11-04 23:00 UTC)