[HN Gopher] SoundStorm: Efficient Parallel Audio Generation
___________________________________________________________________
 
SoundStorm: Efficient Parallel Audio Generation
 
Author : sh_tomer
Score  : 194 points
Date   : 2023-07-16 16:53 UTC (6 hours ago)
 
web link (google-research.github.io)
w3m dump (google-research.github.io)
 
| binary132 wrote:
| When people wax eloquent about how the artisans will just find
| something new to do for work, what they fail to mention is that
| the new work is often a menial and lower-paid job. When Amazon
| puts mom and pop shops out of business, they don't go start new
| businesses, they go get jobs at Wal-Mart.
 
| qwertox wrote:
| In CGI there were always these milestones which I observed
| getting reached. Like trees with leaves finally looking close to
| realistic, wind blowing in grass looking almost realistic, hair,
| jelly, and it was usually Pixar shorts showing off what they had
| been focusing on, which we then saw applied to their movies.
| 
| Then mocap, mapping digital faces on real actors which was first
| mind-blowing to see in Pirates of the Caribbean, then the apes in
| one of the Planet of the Apes movies... So much in the CGI
| industry has already reached a point where the hardest problems
| seem to have been solved.
| 
| When I now clicked play on the first Synthesized Dialogue from
| Dialogue Synthesis "Where did you go last summer? | I went to
| Greece, it was amazing.", I was blown away. It's as if we've now
| reached one of those milestones where a problem appears to be
| fixed or cracked. Machines will be able to really sound like
| humans, indistinguishable from them.
| 
| 5-10 years ago, if you wanted to deal with TTS, the best option
| you had was to let your Android phone render a TTS into an audio
| file, because everything else sounded really bad. Especially open
| source stuff sounded absolutely horrible.
| 
| So how long will it be until we will be able to download
| something of this quality onto a future-gen Raspberry Pi which
| can do some AI processing, where we make an HTTP call and it
| starts speaking through the audio out in a perfect voice without
| relying on the cloud? 5 years?
 
  | bckr wrote:
  | I would bet 2 years tops
 
  | amelius wrote:
  | Another question, how long until we have systems that can sing
  | 10 octaves and we don't need/want any actual human singers
  | anymore?
 
    | jayd16 wrote:
    | People like to sing along though.
 
      | tialaramex wrote:
      | People like playing drums too, but a drum machine means
      | that if you're not any good at it, or too busy, but you need
      | drum sounds, you can still have drum sounds.
      | 
      | There are rights issues if the result is that it replaces a
      | particular singer: if you made it so that Sneaker Pimps can
      | fire Kelli but still have her voice on subsequent songs,
      | that's a problem. But suppose you're a bedroom musician,
      | and you realise you've got a piece that really wants
      | somebody with a different voice than yours to make it work
      | - you _can_ pay someone, but technology like this offers a
      | cheaper, easier option.
 
    | ttul wrote:
    | As a choral singer, if there's an app that one day allows me
    | to sing with a fake choir of extremely good singers, I would
    | enjoy doing that all day long. And it would allow my actual
    | choir to practice way more, making our performances far
    | better.
 
      | nraford wrote:
      | This exists right now!
      | 
      | Not as an app exactly, but you should check out Holly
      | Herndon and Mat Dryhurst's suite of tools called "Holly
      | Plus":
      | 
      | https://holly.plus/
      | 
      | I'm pretty sure you can access their model somehow and even
      | train your own voice using their "spawning" approach.
      | 
      | She did an awesome TED talk demonstrating this:
      | 
      | https://www.ted.com/talks/holly_herndon_what_if_you_could_s
      | i...
      | 
      | Here's a cool example, using Dolly Parton's song "Jolene":
      | 
      | https://www.youtube.com/watch?v=kPAEMUzDxuo
      | 
      | I don't think it's quite at the level of consumer use yet,
      | but I know they're working on it. Definitely check it out.
 
  | JonathanFly wrote:
  | >So how long will it be until we will be able to download
  | something of this quality onto a future-gen Raspberry Pi which
  | can do some AI processing, where we make an HTTP call and it
  | starts speaking through the audio out in a perfect voice
  | without relying on the cloud?
  | 
  | 5 years? It's probably possible roughly whenever the larger
  | Whisper models can run on it. Probably the next Raspberry Pi,
  | running quantized or optimized versions of some audio model.
  | 
  | It may be almost possible right now if you tried really, really
  | hard, and you used a small model fine-tuned on a single voice,
  | instead of something larger and more general purpose that can
  | do any voice. I think whisper-tiny works on a Pi in real time,
  | right? And that's not leveraging the GPU on the Pi.
  | (https://github.com/ggerganov/whisper.cpp/discussions/166)
  | 
  | Edit: looks like medium is 30x slower on the Pi than the tiny
  | model, so I may have been overly optimistic. I didn't realize
  | Whisper tiny was that much faster than medium.
  | 
  | This method works pretty well with Tortoise, letting you use
  | the super fast Tortoise quality settings but get quality
  | similar to the larger models. Fine-tuning the whole thing on
  | just one voice removes a lot of the cool capabilities of
  | course. With Tortoise, that would still be way too slow for a
  | Pi but potentially that same strategy could work with faster
  | models like SoundStorm.
  | 
  | In terms of quality there's still a lot of room to go with long
  | term coherence, like long audio segments. When a real person
  | reads an audiobook, the words at the top of the page have a
  | pretty big impact on how the words at the bottom of the page
  | are read.
  | And there can be some impact at any distance, page 10 to page
  | 300. When you try audiobooks on super high end TTS models and
  | listen carefully you really notice the mismatch. It's like the
  | reader recorded the paragraphs out of order, or video game
  | voice lines where you can tell the actors recorded all the
  | lines separately, and were not reacting to each other's
  | performance.
  | 
  | You can bump the context windows, a minute, two minutes. That's
  | gonna get you closer and probably good enough for some books.
  | In the short term a human could simply adjust all the
  | audio samples and manually tweak things to sound correct. So
  | this will enable fan-created audiobooks where they take the
  | time to get it right. But for fully automated books the
  | mismatch drives me nuts. The performance is just soooo close
  | for certain segments that when you get a tonal mismatch it
  | hurts.
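
A minimal sketch of the kind of real-time check described above, using
the openai/whisper Python package as a stand-in for whisper.cpp (the
package, the "tiny" model name, and the local file speech.wav are
assumptions for illustration):

    # Rough real-time-factor check for Whisper "tiny" on a small box.
    # Assumes `pip install openai-whisper` and a local speech.wav;
    # on a Pi you would more likely use whisper.cpp, but the tiny vs.
    # medium model sizes compare the same way.
    import time
    import whisper

    model = whisper.load_model("tiny")        # smallest Whisper model
    audio = whisper.load_audio("speech.wav")  # 16 kHz mono float32
    duration = len(audio) / 16000.0

    start = time.time()
    result = model.transcribe("speech.wav", fp16=False)
    elapsed = time.time() - start

    print(result["text"])
    print(f"audio: {duration:.1f}s, transcription: {elapsed:.1f}s, "
          f"real-time factor: {elapsed / duration:.2f}")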
 
  | nine_k wrote:
  | If you need a really compact form factor, you can buy a Jetson
  | right now and run more complex models on it. It's pricey
  | though.
 
| JonathanFly wrote:
| Interesting that SoundStorm was trained to produce dialog between
| two people using transcripts annotated with '|' marking changes
| in voice. But the exact same '|' characters seem to mostly work
| in the Bark model out of the box and also produce a dialog?
| 
| Maybe a third or a bit more of Bark outputs end up as one person
| talking to _themselves_ -- and it often misses a voice change.
| But the pipe characters do reliably produce audio that sounds
| like a _dialog_ in the performance style.
| 
| https://twitter.com/jonathanfly/status/1675987073893904386
| 
| Is there some text-audio data somewhere in the training data that
| uses | for voice changes?
| 
| Amusingly, Bark tends to render the SoundStorm prompts
| sarcastically. Not sure if that's a difference in style in the
| models, or just Google cherry picking the more straightforward
| line readings as the featured samples.
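
A minimal sketch of the kind of '|'-separated prompt described above,
assuming the suno-ai/bark Python package; the speaker preset and the
output filename are illustrative, and whether the voice actually
changes at the '|' is hit-or-miss, as noted:

    # Prompt Bark with a pipe-separated two-line dialogue.
    # Assumes `pip install git+https://github.com/suno-ai/bark.git`
    # plus scipy for writing the WAV file.
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()  # downloads/caches the text, coarse and fine models

    prompt = ("Where did you go last summer? | "
              "I went to Greece, it was amazing.")
    audio_array = generate_audio(prompt, history_prompt="v2/en_speaker_6")

    write_wav("dialogue.wav", SAMPLE_RATE, audio_array)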
 
  | og_kalu wrote:
  | The creators won't say, as far as I know, but Bark looks to be
  | trained on a lot of YouTube corpora (rather than typical ML audio
  | datasets) where audio may have transcripts like that, which would
  | also explain why stuff like [laughs] works.
 
    | neilv wrote:
    | In the future, will children think it's normal to talk like,
    | "Hey, what up, Youtube! ... Be sure to like and subscribe!
    | ... Smash that like button! ... Let me know in the comments
    | down below!"?
    | 
    | I wonder how ML trained on the tone transitions to a
    | sponsored segment dripping with secret shame... would infect
    | general speech.
 
    | JonathanFly wrote:
    | Yeah I often try to think about what might be in a YouTube
    | caption when finding prompts that work in Bark. But the pipe
    | character isn't one I remember seeing on YouTube. Maybe it's
    | part of some other audio dataset though. Or maybe it's on
    | YouTube but only in non English videos.
 
| butz wrote:
| With all recent advances, are there any decent TTS voices for
| Linux that are not complicated to set up for a regular user?
 
| elAhmo wrote:
| This is nothing short of amazing. It is exciting, and a bit scary
| as well, to think about what the future will bring.
| 
| It just makes me sad that I cannot open this page on Safari. It
| will not play a single audio clip, yet Chrome plays it fine. So here
| we are, able to generate audio, video, code, do amazing things
| with AI, but a simple website that has text and audio is not
| working on the most popular laptop out there.
 
| mg wrote:
| I wonder if work marketplaces like UpWork and Fiverr will adapt
| quickly enough to this new situation, where many of their
| services, which in the past were done by humans, can now be done
| by software.
| 
| Their current marketplace interface seems inadequate for this.
| Instead of contacting a human and then waiting for them to finish
| the work, buyers will want to get results right away.
| 
| Therefore they will have to change their platform to work like an
| app store, where the sellers connect their services and buyers
| can use these services.
 
  | seydor wrote:
  | > where many of their services, which in the past were done by
  | humans, can now be done by software.
  | 
  | Their users are already using AI to do the work that they are
  | supposed to do. I think that's fine.
 
  | throw47474777j wrote:
  | Why wouldn't people just use existing software markets?
 
    | mg wrote:
    | For example?
 
      | throw47474777j wrote:
      | App Stores, the web, etc. How else does software as a
      | service get sold? It's not a new thing. Probably a lot of
      | these things will just end up as features in existing
      | systems.
 
        | mg wrote:
        | Existing appstores like the ones on iOS and Android
        | mostly target casual use cases, mobile devices and on-
        | device software. Not "buy once" experiences for work via
        | software as a service. They also do not offer a unified
        | experience. Two "text-to-speech" apps could have
        | completely different user interfaces.
        | 
        | The web does not have good discovery and reputation
        | management and also does not provide a unified interface.
        | That is why marketplaces like Booking.com, Amazon,
        | Spotify etc have become so big.
 
  | Legend2440 wrote:
  | Why does everybody focus on "how will this replace humans?"
  | It's just really good text-to-speech.
 
    | pjmlp wrote:
    | Maybe because I no longer hear friendly human voices at train
    | stations, only computer-generated train announcements?
    | 
    | While those people are now looking for jobs elsewhere.
 
      | Legend2440 wrote:
      | Fantastic! That's a massive efficiency gain.
      | 
      | We will not run out of productive things to do with our
      | time. Labor force participation has stayed at 60-70%
      | despite centuries of automation.
 
        | pjmlp wrote:
        | Lovely capitalism.
 
      | relativ575 wrote:
      | Announcements often get played repeatedly -- "Train 101 to
      | Lisbon is now on track 5". Why do you want to torture the
      | station's workers with that?
      | 
      | Instead, make an effort to start a conversation with your
      | fellow travelers, or graciously respond to such effort from
      | them. Apologies if you already do.
 
        | pjmlp wrote:
        | Better a tortured job that puts food on the table than
        | none at all.
 
        | cpill wrote:
        | Tell that to the kids in Nike sweatshops.
 
    | ImHereToVote wrote:
    | Personally I can't wait for all the streets to be lined with
    | the homeless like in SF. So good.
 
      | akaij wrote:
      | It's kinda sad to see you believe that this is the
      | inevitable outcome.
 
        | pjmlp wrote:
        | Well, if we imagine that the only things that will be left
        | are physical jobs that can't be done by computers.
        | 
        | At least until they get clever enough to start a
        | transformers line factory.
 
        | Legend2440 wrote:
        | This is the lump of labor fallacy. It's not about "what
        | jobs will be left", it's about the new jobs we'll invent
        | with all the time we'll have on our hands.
        | 
        | There was never a fixed number of jobs, there's a fixed
        | number of workers.
 
        | pjmlp wrote:
        | Well, we can also return to feudalism.
 
    | PhasmaFelis wrote:
    | Because it _will_ replace humans, and that's worth thinking
    | about?
 
| nwoli wrote:
| Seems like we wouldn't be far at all from just correlating this
| to face movement (including subtle iris movement and blinks, not
| just the mouth). As long as you clearly label it as CGI it's
| harmless and I'm excited for the day to come. Might be quite fun
| to chat with a little buddy this way
 
| og_kalu wrote:
| It's good that Bing and Bard are using the latest Microsoft and
| Google Cloud offerings, but it would be nice to see these speech
| advances (along with AudioPaLM -
| https://google-research.github.io/seanet/audiopalm/examples/ etc.)
| hit public APIs and/or user interfaces.
| 
| Bard's TTS is alright but it's clearly behind.
| 
| On that note, Bing's English/Korean TTS is really good. I also
| didn't realize Microsoft uses its best offerings for the free TTS
| in Edge, so it blows Google's default TTS voices away.
 
  | jameszhao00 wrote:
  | Have you tried Google Cloud Studio voices?
  | 
  | https://cloud.google.com/text-to-speech/docs/wavenet#studio_...
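
A minimal sketch of trying one of those Studio voices via the
google-cloud-texttospeech Python client; the voice name
"en-US-Studio-O" and the output filename are illustrative, and
application default credentials are assumed to be configured:

    # Synthesize a short line with a Cloud TTS Studio voice.
    # Assumes `pip install google-cloud-texttospeech` and that the
    # chosen Studio voice is available to the project.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()

    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text="Where did you go last summer?"),
        voice=texttospeech.VoiceSelectionParams(
            language_code="en-US", name="en-US-Studio-O"
        ),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16
        ),
    )

    with open("studio_voice.wav", "wb") as f:
        f.write(response.audio_content)  # LINEAR16 output includes a WAV header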
 
    | og_kalu wrote:
    | Yes. I'm not saying Google's top Cloud offerings are bad,
    | although I still think Microsoft's stuff is better.
    | 
    | Just that
    | 
    | 1. It's behind their current SOTA research.
    | 
    | 2. You can only use those voices extensively by paying for
    | them. Microsoft offers their best stuff in Edge for free. So
    | for reading a PDF or web page aloud, Microsoft is far better.
 
      | jameszhao00 wrote:
      | By "SOTA" tts I think you mean LLM based TTS? With sound
      | and language tokens trained GPT style?
      | 
      | Without going into too much detail, IMO they're not really
      | usable right now for TTS use cases.
 
      | skybrian wrote:
      | It's disappointing, but I wouldn't expect research
      | algorithms to be available immediately unless they held them
      | back until the product is ready. I guess Apple would do
      | that?
 
  | GordonS wrote:
  | I used Azure TTS for a product demo voice-over recently, and
  | nobody I showed it to knew it wasn't a human doing it!
  | 
  | Some of Azure's voices are better than others, and the TTS web
  | app has a few minor bugs, but overall I was really pleased with
  | the whole experience.
 
  | refulgentis wrote:
  | > I also didn't realize Microsoft uses the best offerings for
  | free TTS on edge so it blows google's default tts voices away.
  | 
  | This sounds really interesting - can you share a bit more? I'm
  | behind in this space, my parser got all jammed up, something
  | like: "Microsoft uses [the best offerings for free TTS](as in
  | FOSS libraries, or free as in beer SaaS?) [on edge](Edge
  | browser, or on the edge as in client's computer?)(Is the
  | implication that all TTS on the client's computer blows
  | Google's default TTS voices away?)"
 
    | GranPC wrote:
    | I believe they mean that the free TTS feature in Microsoft
    | Edge uses their best technology, and that said tech is better
    | than Google's default offering.
 
    | og_kalu wrote:
    | The top voices you'd pay for on Azure's TTS services can be
    | used for free to read web page (and PDF) text in Microsoft
    | Edge. I don't mean open source.
    | 
    | This is not the case with Google.
 
      | wg0 wrote:
      | I didn't know that. Edge is too good. Just downloaded it, and
      | features like that are great.
 
  | ShamelessC wrote:
  | > public api's and/or user interfaces
  | 
  | sigh. Google used to release _some_ models. Guess the fun early
  | days are coming to an end.
 
    | Legend2440 wrote:
    | Google is a business and this is clearly a valuable product.
 
      | ShamelessC wrote:
      | Sure, but there was a time not too long ago when companies
      | were still in the "good will" phase of handing out even
      | highly valuable models like CLIP, guided-diffusion, etc.
      | Come to think of it, it was mostly OpenAI doing this. And they
      | kinda still do? But far more selectively. I'm just
      | preemptively romanticizing that.
 
      | rasz wrote:
      | A product is something you sell to make money. The only real
      | Google product is its users, sold to advertisers.
 
        | vore wrote:
        | Uh, what about all of their paid cloud offerings?
 
        | rasz wrote:
        | A distraction. It generated a whole 1% of overall profit
        | last quarter, and that was the first time it didn't lose
        | money.
        | https://www.cnbc.com/2023/04/25/googles-cloud-business-
        | turns...
 
        | jsnell wrote:
        | Google's non-advertising revenue in the latest quarter
        | was about $15 billion. Is that a significant amount of non-
        | ads product revenue? At least that is higher than the
        | revenue of any of IBM, HP, Oracle, Intel, Cisco, Netflix,
        | Broadcom, Qualcomm, or Salesforce in that same quarter.
        | 
        | I think their non-ads businesses alone would be the 6th
        | largest US tech company by revenue. (Amazon, Apple,
        | Microsoft, the ads business of Alphabet, Meta. Am I
        | forgetting something?)
 
        | rasz wrote:
        | Revenue is easy when you lose money on every dollar. Last
        | quarter Ads printed $21B of income; the rest was a loss,
        | except cloud, which for the very first time didn't lose
        | hundreds of millions of dollars.
        | 
        | https://abc.xyz/assets/investor/static/pdf/2023Q1_alphabe
        | t_e...
 
      | joezydeco wrote:
      | [flagged]
 
        | vore wrote:
        | I don't want to defend Google's business practices, but
        | this is such a trite comment someone always feels
        | compelled to post on anything about Google, including
        | even a research paper, apparently.
 
        | joezydeco wrote:
        | I'll argue it's not trite. It's a concise compilation of
        | the thousands of teeth-gnashing comments here on HN and
        | all over the internet whenever Google randomly drowns
        | another one of its children.
        | 
        | Just fucking stay away from Google products. Period.
 
        | relativ575 wrote:
        | First of all it isn't a product. It's a f*king research
        | paper. Like dozens of others showing up on HN every day.
        | Most of them never become a product.
        | 
        | Second of all, by whining ad nauseam you drown out
        | discussions on the merits of the technology, and chase
        | people away. I hardly read Google news on HN now
        | precisely because of that reason. Imagine if "Attention
        | Is All You Need" came out now? [0]
        | 
        | Save your complaint for when Google makes it a product.
        | 
        | [0] - https://news.ycombinator.com/item?id=15938082
 
        | serf wrote:
        | >Save your complaint for when Google makes it a product.
        | 
        | or save yourself the trouble and find alternatives to
        | big-G.
        | 
        | It's entirely their own fault that people now view all
        | Google news as temporary and fleeting. People don't want
        | to put time into things that'll get thrown away in a
        | year.
        | 
        | Reading G research papers seems like a shortcut to me: you
        | know what will be thrown away in 2 years before it's a
        | valid product in 1 year and someone gets huckleberry'd
        | into devoting time and effort to implementing the dead-
        | product-walking API.
 
        | signatoremo wrote:
        | > It's entirely their own fault that people now view all
        | Google news as temporary and fleeting. People don't want
        | to put time into things that'll get thrown away in a
        | year.
        | 
        | Most research doesn't become a product, from Google or
        | anyone else. As research projects they still have value,
        | unless you are saying Google research is garbage because
        | they have gotten into the habit of canceling their products.
        | 
        | > or save yourself the trouble and find alternatives to
        | big-G
        | 
        | Totally valid point. No need to complain about it in a
        | post about Google research though. It's tiresome.
 
        | glimshe wrote:
        | It's a very relevant comment. It tells you not to rely on,
        | or expect further development of, any new Google
        | technology, even seemingly good ones, as it can go to the
        | graveyard like many others.
 
        | georgemcbay wrote:
        | I don't bother to post the comment, but the high
        | likelihood of any Google project/product being killed
        | within a year or two is absolutely the first thought I
        | have whenever a new Google project/product is announced
        | (not because of HN posts, but because of their history),
        | so good job on that Google.
 
    | og_kalu wrote:
    | Ha, I'm not even asking for code/model releases. It's just a
    | bit funny that what you can *pay* Google to use is so far
    | behind what they have up and running, collecting dust.
 
      | ShamelessC wrote:
      | Also true.
 
      | Raed667 wrote:
      | I'm speculating here, but to me it looks like the product
      | (R&D) teams are not working closely with the research
      | teams.
      | 
      | Even the demo website is on GitHub Pages instead of a
      | Google domain/blog.
 
| asutekku wrote:
| The most impressive part of this is that they are seemingly able
| to produce 30 seconds of TTS with just 3 seconds of source
| material. That is super cool and honestly much further along
| the curve than I expected it to be.
 
| tagyro wrote:
| I've wasted (counting) about 300 seconds of my life listening to
| these audio files and they all sound and seem fake...
 
  | svantana wrote:
  | I found that on my (high quality) studio monitors, the audio
  | sounded fine and hard to distinguish from 24kHz wav. But in
  | headphones, the artifacts were pretty obvious. So probably some
  | reverberation will do a lot to cover up artifacts. In the
  | paper, they only do a subjective comparison between the
  | generated audio and the _soundstream-encoded_ original audio,
  | which seems a bit disingenuous. Listening to soundstream audio
  | in headphones, I can hear those same artifacts.
 
  | jeffbee wrote:
  | Did you read the paper? They intentionally steered the quality
  | to ensure they sound fake. Their generated speech is "very easy
  | to detect" according to the reference at the end of the paper.
 
  | tagyro wrote:
  | Just to be clear, one could mistake them for some (voice) actor
  | reading a book (maybe) but even to my untrained ear they sound
  | fake and artificial.
  | 
  | Am I missing something?
 
    | kvn8888 wrote:
    | It's meant to sound artificial. The focus is on speed and
    | consistency
 
| willemmerson wrote:
| I don't have anything intelligent to say about this, but it's A LOT
| of fun making all the samples play at the same time - sort of
| like the HTML version of Ableton Live.
 
| [deleted]
 
| anigbrowl wrote:
| Good for fraudsters and spammers, bad for anyone who ever hoped
| to make a living from voice acting. I'm perplexed by AI
| technologists' seemingly incessant drive to automate away the
| existence of artistic performers.
 
  | croes wrote:
  | Why spare artists if everyone else gets replaced by technology?
 
    | anigbrowl wrote:
    | They don't, otherwise there would be many former CEOs living
    | in tents. In reality, those who control large amounts of
    | capital are quite willing (and increasingly, say so in the
    | open) to deprive others of their livelihoods, homes, and
    | ability to feed themselves in order to realize a marginal
    | increase in their own wealth.
 
  | Legend2440 wrote:
  | You are being deliberately pessimistic. There are a million
  | fantastic, practical uses for text-to-speech.
 
    | anigbrowl wrote:
    | I am not. The use cases like interactive assistants for the
    | blind will generate very little commercial activity compared
    | to the uses (and abuses) for entertainment and marketing
    | purposes. A good example of this from the real world is the
    | absence of cheap/open ASL interpretation for deaf people.
 
      | Legend2440 wrote:
      | Imagine having an app on your phone that turns any ebook
      | into an audiobook.
      | 
      | Imagine replacing crappy phone menus with polite virtual
      | assistants that actually understand what you're saying.
      | 
      | Imagine an AI language tutor that speaks every language in
      | the world fluently. Or a universal speech-to-speech
      | translator.
      | 
      | And that's just off the top of my head. Clever people will
      | come up with a lot more uses, I'm sure.
 
        | anigbrowl wrote:
        | I don't need your help imagining use cases; I've been in
        | this field a lot longer than you, and have talked up the
        | technological possibilities of AI-powered TTS here for
        | *years*. I understand the technology very well and am
        | bullish on it. What I'm saying is that too much of the
        | effort is being spent in solving the wrong problems.
        | Please try reading what I wrote instead of your imaginary
        | subtext.
 
      | signatoremo wrote:
      | Ever notice the huge fonts on the phones of older people? So
      | big that a screen may only contain a few lines of text. Or
      | that people have to pull out their reading glasses every
      | time they check their phone? Text-to-speech is a godsend in
      | those cases. Enormous benefits to an increasingly older
      | population.
 
        | anigbrowl wrote:
        | 'helping blind people' was literally the first use case I
        | mentioned. Maybe you should have read the comment before
        | reacting to it.
 
        | signatoremo wrote:
        | Huh? How big is the blind group compared to the older
        | population?
        | 
        | You are saying it's not economical to use text to speech
        | to support blind people. I'm saying the benefits are huge
        | for the older population. It isn't just for fraudsters or
        | spammers as you claim.
 
        | anigbrowl wrote:
        | No, I'm not saying that at all. I'm saying the resources
        | invested in helping people will be dwarfed by those
        | invested in crap designed to exploit them economically or
        | criminally.
 
        | signatoremo wrote:
        | Setting aside the fact that you have absolutely no proof of
        | that claim, the criminal world is tiny compared to the
        | number of people who benefit from TTS (God forbid if that
        | isn't the case). Encryption, as an example, is hugely
        | beneficial to regular people despite being used or exploited
        | extensively in shady and questionable activities.
 
  | wg0 wrote:
  | LLMs aren't great and can't be relied upon in a business setting,
  | or at least I would not rely on them.
  | 
  | But think of open world games. GTA VII, for example, where all
  | NPCs have their dialog auto-generated in real time and also
  | converted to audio in real time.
  | 
  | That's going to be a world that is a lot more spontaneous,
  | with a lot less effort.
  | 
  | Right now, if memory serves me right, GTA V's dialog alone is
  | 5000 pages or more, hand-written.
 
    | anigbrowl wrote:
    | That's all true, but I think it's a pity that the jobs that
    | currently exist for voice artists will disappear. Gamers and
    | consumers will have somewhat better interactive experiences,
    | which is good. Indie game developers will also be able to put
    | out games with lower budgets, which is nice for them. But the
    | market for voice acting work is largely going to dry up and
    | blow away for people who are not already at the top of that
    | field. People who could previously have made a modest but
    | sufficient living as voice performers will be replaced by
    | computer-generated voices. It will be almost impossible to
    | make a living in that field within 5 years.
 
      | wg0 wrote:
      | Generative models around images are nothing new and have
      | been around for a while already. But even today, if you
      | really want creative control and expression, you need a
      | designer that's good with Photoshop or Illustrator etc.
      | 
      | This is applicable to LLMs as well. You can get it to write
      | plausible BS, but if you really want a rooted-in-reality,
      | well-articulated write-up about something, a human has to
      | be brought on board.
      | 
      | This equally extends to voice-over. If you really want
      | expressive and creative control to produce an outstanding
      | rendering of something, AI isn't going to cut it.
 
        | anigbrowl wrote:
        | This is only true if you assume AI isn't going to keep
        | improving. It gets significantly better on a quarterly
        | basis, far faster than the time it takes for an actor to
        | develop their craft and career. The output quality of
        | today's cutting edge models would have been science
        | fiction only 2-3 years ago.
 
        | wg0 wrote:
        | I'm not so sure about the future. Such models, all of these
        | models, don't have a well-understood input-output mapping,
        | and that's going to be a problem for a very long time.
 
___________________________________________________________________
(page generated 2023-07-16 23:00 UTC)