|
| binary132 wrote:
| When people wax eloquent about how the artisans will just find
| something new to do for work, what they fail to mention is that
| the new work is often a menial and lower-paid job. When Amazon
| puts mom and pop shops out of business, they don't go start new
| businesses, they go get jobs at Wal-Mart.
| qwertox wrote:
| In CGI there were always these milestones I observed getting
| reached: trees with leaves finally looking close to realistic,
| wind blowing in grass looking almost realistic, hair, jelly.
| It was usually Pixar shorts pointing out what they had been
| focusing on, and then you'd see it applied to their movies.
|
| Then mocap, mapping digital faces on real actors which was first
| mind-blowing to see in Pirates of the Caribbean, then the apes in
| one of the Planet of the Apes movies... So much in the CGI
| industry has already reached a point where the hardest problems
| seem to have been solved.
|
| When I just clicked play on the first synthesized dialogue
| from "Dialogue Synthesis" ("Where did you go last summer? | I
| went to Greece, it was amazing."), I was blown away. It's as
| if we've now reached one of those milestones where a problem
| appears to be cracked. Machines will be able to really sound
| like humans, indistinguishable from them.
|
| 5-10 years ago, if you wanted to deal with TTS, the best
| option you had was to let your Android phone render TTS into
| an audio file, because everything else sounded really bad.
| Especially open-source stuff sounded absolutely horrible.
|
| So how long will it be until we will be able to download
| something of this quality onto a future-gen Raspberry Pi which
| can do some AI processing, where we make an HTTP call and it
| starts speaking through the audio out in a perfect voice without
| relying on the cloud? 5 years?
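| The HTTP-call-to-speaker idea above can be sketched with
| nothing but the Python standard library. Everything here is my
| own illustration, not any particular model's API: the
| `synthesize()` function is a placeholder that emits a
| sine-wave WAV, standing in for whatever on-device model
| eventually runs on the Pi.

```python
import io
import math
import struct
import wave
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

SAMPLE_RATE = 24000  # 24 kHz, matching the output rate of models like SoundStorm


def synthesize(text: str) -> bytes:
    """Placeholder for a local TTS model: returns half a second of a
    440 Hz sine tone as a WAV file, regardless of the input text."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 16-bit samples
        w.setframerate(SAMPLE_RATE)
        n = SAMPLE_RATE // 2   # 0.5 seconds of audio
        frames = b"".join(
            struct.pack("<h", int(8000 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE)))
            for i in range(n)
        )
        w.writeframes(frames)
    return buf.getvalue()


class SpeakHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /speak?text=Hello -> audio/wav response
        q = parse_qs(urlparse(self.path).query)
        audio = synthesize(q.get("text", [""])[0])
        self.send_response(200)
        self.send_header("Content-Type", "audio/wav")
        self.send_header("Content-Length", str(len(audio)))
        self.end_headers()
        self.wfile.write(audio)


def main():
    # On the device, this blocks and serves requests on port 8080.
    HTTPServer(("0.0.0.0", 8080), SpeakHandler).serve_forever()
```

| Only `synthesize()` would need to change once a real local
| model is available; the HTTP plumbing stays the same.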
| bckr wrote:
| I would bet 2 years tops
| amelius wrote:
| Another question, how long until we have systems that can sing
| 10 octaves and we don't need/want any actual human singers
| anymore?
| jayd16 wrote:
| People like to sing along though.
| tialaramex wrote:
| People like playing drums too, but a drum machine means
| that if you're not any good at it or too busy but you need
| drum sounds you can have drum sounds.
|
| There are rights issues if the result is that it replaces a
| particular singer: if you made it so that Sneaker Pimps could
| fire Kelli but still have her voice on subsequent songs,
| that's a problem. But suppose you're a bedroom musician,
| and you realise you've got a piece that really wants
| somebody with a different voice than yours to make it work
| - you _can_ pay someone, but technology like this offers a
| cheaper, easier option.
| ttul wrote:
| As a choral singer, if there's an app that one day allows me
| to sing with a fake choir of extremely good singers, I would
| enjoy doing that all day long. And it would allow my actual
| choir to practice way more, making our performances far
| better.
| nraford wrote:
| This exists right now!
|
| Not as an app exactly, but you should check out Holly
| Herndon and Mat Dryhurst's suite of tools called "Holly
| Plus":
|
| https://holly.plus/
|
| I'm pretty sure you can access their model somehow and even
| train your own voice using their "spawning" approach.
|
| She did an awesome TED talk demonstrating this:
|
| https://www.ted.com/talks/holly_herndon_what_if_you_could_s
| i...
|
| Here's a cool example, using Dolly Parton's song "Jolene":
|
| https://www.youtube.com/watch?v=kPAEMUzDxuo
|
| I don't think it's quite at the level of consumer use yet,
| but I know they're working on it. Definitely check it out.
| JonathanFly wrote:
| >So how long will it be until we will be able to download
| something of this quality onto a future-gen Raspberry Pi which
| can do some AI processing, where we make an HTTP call and it
| starts speaking through the audio out in a perfect voice
| without relying on the cloud?
|
| 5 years? It's probably possible roughly whenever the larger
| Whisper models can run on it. Probably the next Raspberry Pi,
| running quantized or optimized versions of some audio model.
|
| It may be almost possible right now if you tried really,
| really hard, and you used a small model fine-tuned on a single
| voice instead of something larger and more general-purpose
| that can do any voice. I think whisper-tiny works on a Pi in
| real time, right? And that's not leveraging the GPU on the Pi.
| (https://github.com/ggerganov/whisper.cpp/discussions/166)
|
| Edit: looks like medium is 30x slower on the Pi than tiny
| model, so I may have been overly optimistic. I didn't realize
| Whisper tiny was that much faster than medium.
|
| This method works pretty well with Tortoise, letting you use
| the super fast Tortoise quality settings but get quality
| similar to the larger models. Fine-tuning the whole thing on
| just one voice removes a lot of the cool capabilities of
| course. With Tortoise, that would still be way too slow for a
| Pi but potentially that same strategy could work with faster
| models like SoundStorm.
|
| In terms of quality there's still a lot of room to go with
| long-term coherence, like long audio segments. When a real
| person reads an audiobook, the words at the top of the page
| have a pretty big impact on how the words at the bottom of
| the page are read. And there can be some impact at any
| distance, page 10 to page 300. When you try audiobooks on
| super-high-end TTS models and listen carefully you really
| notice the mismatch. It's like the reader recorded the
| paragraphs out of order, or video game voice lines where you
| can tell the actors recorded all the lines separately and
| were not reacting to each other's performance.
|
| You can bump the context window: a minute, two minutes. That's
| going to get you closer, and probably good enough for some
| books. In the short term a human could simply adjust all the
| audio samples and manually tweak things to sound correct. So
| this will enable fan-created audiobooks where they take the
| time to get it right. But for fully automated books the
| mismatch drives me nuts. The performance is just soooo close
| for certain segments that when you get a tonal mismatch it
| hurts.
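| One cheap version of that bumped-context-window idea is to
| synthesize overlapping chunks and keep only the new audio from
| each. The sketch below (names and defaults are my own, not
| from any particular model) just does the text bookkeeping:

```python
def windowed_chunks(sentences, window=5, overlap=2):
    """Split a sentence list into chunks where each chunk carries the
    `overlap` sentences preceding it as conditioning context.

    The idea: synthesize context + body together so the model's prosody
    carries across chunk boundaries, but only keep the audio for the
    body sentences, which never overlap between chunks."""
    assert 0 <= overlap < window
    step = window - overlap  # how many new sentences each chunk contributes
    chunks = []
    for start in range(0, len(sentences), step):
        body = sentences[start:start + step]
        if not body:
            break
        context = sentences[max(0, start - overlap):start]
        chunks.append({"context": context, "body": body})
    return chunks
```

| A driver loop would then condition each synthesis call on the
| chunk's "context" sentences and keep only the audio for its
| "body" sentences, so each boundary inherits the previous
| prosody.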
| nine_k wrote:
| If you need a really compact form factor, you can buy a Jetson
| right now and run more complex models on it. It's pricey
| though.
| JonathanFly wrote:
| Interesting that SoundStorm was trained to produce dialog between
| two people using transcripts annotated with '|' marking changes
| in voice. But the exact same '|' characters seem to mostly work
| in the Bark model out of the box and also produce a dialog?
|
| Maybe a third or a bit more of Bark outputs are one person
| talking to _themselves_ -- and it often misses a voice change.
| But the pipe characters do reliably produce audio that sounds
| like a _dialog_ in the performance style.
|
| https://twitter.com/jonathanfly/status/1675987073893904386
|
| Is there some text-audio data somewhere in the training data that
| uses | for voice changes?
|
| Amusingly, Bark tends to render the SoundStorm prompts
| sarcastically. Not sure if that's a difference in style in the
| models, or just Google cherry picking the more straightforward
| line readings as the featured samples.
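| For experimenting with the '|' convention, a tiny helper (my
| own naming, not Bark's or SoundStorm's API) turns an annotated
| prompt into alternating speaker turns:

```python
def split_turns(transcript, speakers=("A", "B")):
    """Split a '|'-annotated transcript into (speaker, utterance) pairs,
    alternating voices at each pipe, as in the SoundStorm prompt style."""
    turns = []
    for i, part in enumerate(transcript.split("|")):
        part = part.strip()
        if part:  # skip empty segments from leading/trailing pipes
            turns.append((speakers[i % len(speakers)], part))
    return turns
```

| Running it on the demo prompt gives [('A', 'Where did you go
| last summer?'), ('B', 'I went to Greece, it was amazing.')],
| which is the turn structure a two-speaker pipeline would need.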
| og_kalu wrote:
| The creators won't say, as far as I know, but Bark looks to be
| trained on a lot of YouTube corpora (rather than typical ML
| audio datasets), where audio may have transcripts like that,
| which is why stuff like [laughs] works.
| neilv wrote:
| In the future, will children think it's normal to talk like,
| "Hey, what up, Youtube! ... Be sure to like and subscribe!
| ... Smash that like button! ... Let me know in the comments
| down below!"?
|
| I wonder how ML trained on the tone transitions to a
| sponsored segment dripping with secret shame... would infect
| general speech.
| JonathanFly wrote:
| Yeah, I often try to think about what might be in a YouTube
| caption when finding prompts that work in Bark. But the pipe
| character isn't one I remember seeing on YouTube. Maybe it's
| part of some other audio dataset, though. Or maybe it's on
| YouTube but only in non-English videos.
| butz wrote:
| With all the recent advances, are there any decent TTS voices
| for Linux that are not complicated to set up for a regular
| user?
| elAhmo wrote:
| This is nothing short of amazing. It is exciting, and a bit
| scary as well, what the future will bring.
|
| It just makes me sad that I cannot open this page in Safari.
| It will not play a single audio clip, yet Chrome plays them
| fine. So here we are, able to generate audio, video, and code,
| and do amazing things with AI, but a simple website with text
| and audio does not work on the most popular laptop out there.
| mg wrote:
| I wonder if work marketplaces like UpWork and Fiverr will adapt
| quickly enough to this new situation, where many of their
| services, which in the past were done by humans, can now be done
| by software.
|
| Their current marketplace interface seems inadequate for this.
| Instead of contacting a human and then waiting for them to
| finish the work, buyers will want to get results right away.
|
| Therefore they will have to change their platform to work like
| an app store, where the sellers connect their services and
| buyers can use them.
| seydor wrote:
| > where many of their services, which in the past were done by
| humans, can now be done by software.
|
| Their users are already using AI to do the work that they are
| supposed to do. I think that's fine.
| throw47474777j wrote:
| Why wouldn't people just use existing software markets?
| mg wrote:
| For example?
| throw47474777j wrote:
| App Stores, the web, etc. How else does software as a
| service get sold? It's not a new thing. Probably a lot of
| these things will just end up as features in existing
| systems.
| mg wrote:
| Existing app stores like the ones on iOS and Android mostly
| target casual use cases, mobile devices, and on-device
| software, not "buy once" experiences for work via software
| as a service. They also do not offer a unified experience.
| Two "text-to-speech" apps could have completely different
| user interfaces.
|
| The web does not have good discovery and reputation
| management and also does not provide a unified interface.
| That is why market places like Booking.com, Amazon,
| Spotify etc have become so big.
| Legend2440 wrote:
| Why does everybody focus on "how will this replace humans?"
| It's just a really good text-to-speech.
| pjmlp wrote:
| Maybe because I no longer hear friendly human voices at train
| stations, but rather computer-generated train announcements?
|
| While those people are now looking for jobs elsewhere.
| Legend2440 wrote:
| Fantastic! That's a massive efficiency gain.
|
| We will not run out of productive things to do with our
| time. Labor force participation has stayed at 60-70%
| despite centuries of automation.
| pjmlp wrote:
| Lovely capitalism.
| relativ575 wrote:
| Announcements often get played repeatedly -- "Train 101 to
| Lisbon is now on track 5". Why would you want to torture
| station workers with that?
|
| Instead, make an effort to start a conversation with your
| fellow travelers, or graciously respond to such an effort
| from them. Apologies if you already do.
| pjmlp wrote:
| Better a tortured job that puts food on the table than
| none at all.
| cpill wrote:
| tell that to the kids in Nike sweat shops
| ImHereToVote wrote:
| Personally I can't wait for all the streets to be lined with
| the homeless like in SF. So good.
| akaij wrote:
| It's kinda sad to see you believe that this is the
| inevitable outcome.
| pjmlp wrote:
| Well, if we imagine that the only thing that will be left
| are physical jobs that can't be done by computers.
|
| At least until they get clever enough to start a
| transformers line factory.
| Legend2440 wrote:
| This is the lump of labor fallacy. It's not about "what
| jobs will be left", it's about the new jobs we'll invent
| with all the time we'll have on our hands.
|
| There was never a fixed number of jobs, there's a fixed
| number of workers.
| pjmlp wrote:
| Well, we can also return to feudalism.
| PhasmaFelis wrote:
| Because it _will_ replace humans, and that's worth thinking
| about?
| nwoli wrote:
| Seems like we wouldn't be far at all from just correlating this
| to face movement (including subtle iris movement and blinks, not
| just the mouth). As long as you clearly label it as CGI it's
| harmless and I'm excited for the day to come. Might be quite fun
| to chat with a little buddy this way
| og_kalu wrote:
| It's good that Bing and Bard are using the latest Microsoft
| and Google Cloud offerings, but it would be nice to see these
| speech advances
| (along with audio palm - https://google-
| research.github.io/seanet/audiopalm/examples/ etc) hit public
| api's and/or user interfaces.
|
| Bard's TTS is alright but it's clearly behind.
|
| On that note, Bing's English/Korean TTS is really good. I also
| didn't realize Microsoft uses the best offerings for free TTS on
| edge so it blows google's default tts voices away.
| jameszhao00 wrote:
| Have you tried Google Cloud Studio voices?
|
| https://cloud.google.com/text-to-speech/docs/wavenet#studio_...
| og_kalu wrote:
| Yes. I'm not saying Google's top Cloud offerings are bad,
| although I still think Microsoft's stuff is better.
|
| Just that
|
| 1. It's behind their current sota research
|
| 2. You can only use those voices extensively by paying for
| them. Microsoft offers its best stuff in Edge for free. So
| for reading aloud a PDF or web page, Microsoft is far better.
| jameszhao00 wrote:
| By "SOTA" TTS I think you mean LLM-based TTS? With sound
| and language tokens trained GPT-style?
|
| Without going into too much detail, imo they're not really
| usable right now for TTS use cases.
| skybrian wrote:
| It's disappointing, but I wouldn't expect research
| algorithms to be available immediately unless they held it
| back until the product is ready. I guess Apple would do
| that?
| GordonS wrote:
| I used Azure TTS for a product demo voice-over recently, and
| nobody I showed it to knew it wasn't a human doing it!
|
| Some of Azure's voices are better than others, and the TTS web
| app has a few minor bugs, but overall I was really pleased with
| the whole experience.
| refulgentis wrote:
| > I also didn't realize Microsoft uses the best offerings for
| free TTS on edge so it blows google's default tts voices away.
|
| This sounds really interesting - can you share a bit more? I'm
| behind in this space, my parser got all jammed up, something
| like: "Microsoft uses [the best offerings for free TTS](as in
| FOSS libraries, or free as in beer SaaS?) [on edge](Edge
| browser, or on the edge as in client's computer?)(Is the
| implication that all TTS on the client's computer blows
| Google's default TTS voices away?)"
| GranPC wrote:
| I believe they mean that the free TTS feature in Microsoft
| Edge uses their best technology, and that said tech is better
| than Google's default offering.
| og_kalu wrote:
| The top voices you'd pay for on Azure's TTS service can be
| used for free to read web page (and PDF) text in Microsoft
| Edge. I don't mean open source.
|
| This is not the case with Google.
| wg0 wrote:
| I didn't know that. Edge is really good. I just downloaded it,
| and features like this are great.
| ShamelessC wrote:
| > public api's and/or user interfaces
|
| sigh. Google used to release _some_ models. Guess the fun early
| days are coming to an end.
| Legend2440 wrote:
| Google is a business and this is clearly a valuable product.
| ShamelessC wrote:
| Sure, but there was a time not too long ago when companies
| were still in the "goodwill" phase of handing out even
| highly valuable models like CLIP, guided-diffusion, etc.
| Come to think of it, it was mostly OpenAI doing this. And
| they kinda still do? But far more selectively. I'm just
| preemptively romanticizing that.
| rasz wrote:
| Product is something you sell to make money. The only real
| Google product is users sold to advertisers.
| vore wrote:
| Uh, what about all of their paid cloud offerings?
| rasz wrote:
| Distraction. It generated only 1% of overall profit last
| quarter, and that was the first time it didn't lose money.
| https://www.cnbc.com/2023/04/25/googles-cloud-business-
| turns...
| jsnell wrote:
| Google's non-advertising revenue in the latest quarter
| was about $15 billion. Is that a significant amount of
| non-ads product revenue? At least it is higher than the
| revenue of any of IBM, HP, Oracle, Intel, Cisco, Netflix,
| Broadcom, Qualcomm, or Salesforce in that same quarter.
|
| I think their non-ads businesses alone would be the 6th
| largest US tech company by revenue. (Amazon, Apple,
| Microsoft, the ads business of Alphabet, Meta. Am I
| forgetting something?)
| rasz wrote:
| Revenue is easy when you lose money on every dollar. Last
| quarter Ads printed $21B of income; the rest was a loss,
| except cloud, which didn't lose hundreds of millions of
| dollars for the very first time.
|
| https://abc.xyz/assets/investor/static/pdf/2023Q1_alphabe
| t_e...
| joezydeco wrote:
| [flagged]
| vore wrote:
| I don't want to defend Google's business practices, but
| this is such a trite comment someone always feels
| compelled to post on anything about Google, including
| even a research paper, apparently.
| joezydeco wrote:
| I'll argue it's not trite. It's a concise compilation of
| the thousands of teeth-gnashing comments here on HN and
| all over the internet whenever Google randomly drowns
| another one of its children.
|
| Just fucking stay away from Google products. Period.
| relativ575 wrote:
| First of all, it isn't a product. It's a f*king research
| paper, like dozens of others showing up on HN every day.
| Most of them never become a product.
|
| Second of all, by whining nauseatingly you drown out
| discussion of the merits of the technology, and chase
| people away. I hardly read Google news on HN now
| precisely because of that. Imagine if "Attention Is All
| You Need" came out now? [0]
|
| Save your complaint for when Google makes it a product.
|
| [0] - https://news.ycombinator.com/item?id=15938082
| serf wrote:
| >Save your complaint for when Google makes it a product.
|
| or save yourself the trouble and find alternatives to
| big-G.
|
| It's entirely their own fault that people now view all
| Google news as temporary and fleeting. People don't want
| to put time into things that'll get thrown away in a
| year.
|
| Reading G research papers seems like a shortcut to me: you
| know what will be thrown away in 2 years before it's a
| valid product in 1 year and someone gets huckleberry'd
| into devoting time and effort to implementing the
| dead-product-walking API.
| signatoremo wrote:
| > It's entirely their own fault that people now view all
| Google news as temporary and fleeting. People don't want
| to put time into things that'll get thrown away in a
| year.
|
| Most research doesn't become a product, from Google or
| anyone else. As research projects they still have value,
| unless you are saying Google research is garbage because
| Google has gotten into the habit of canceling its
| products.
|
| > or save yourself the trouble and find alternatives to
| big-G
|
| Totally valid point. No need to complain about it in a
| post about Google research though. It's tiresome.
| glimshe wrote:
| It's a very relevant comment. It tells you to not rely,
| or expect further development, on any new Google
| technology, even seemingly good ones, as it can go to the
| graveyard like many others.
| georgemcbay wrote:
| I don't bother to post the comment, but the high
| likelihood of any Google project/product being killed
| within a year or two is absolutely the first thought I
| have whenever a new Google project/product is announced
| (not because of HN posts, but because of their history),
| so good job on that Google.
| og_kalu wrote:
| Ha, I'm not even asking for code/model releases. It's just
| a bit funny that what you can *pay* Google to use is so
| far behind what they have up and running, collecting dust.
| ShamelessC wrote:
| Also true.
| Raed667 wrote:
| I'm speculating here, but it looks to me like the product
| (R&D) teams are not working closely with the research
| teams.
|
| Even the demo website is on Github Pages instead of a
| Google domain/blog.
| asutekku wrote:
| The most impressive part of this is that they are seemingly
| able to produce 30 seconds of TTS from just 3 seconds of
| source material. That is super cool, and honestly much
| further along the curve than I expected it to be.
| tagyro wrote:
| I've wasted (counting) about 300 seconds of my life listening to
| these audio files and they all sound and seem fake...
| svantana wrote:
| I found that in my (high quality) studio monitors, the audio
| sounded fine and hard to distinguish from 24kHz wav. But in
| headphones, the artifacts were pretty obvious. So probably some
| reverberation will do a lot to cover up artifacts. In the
| paper, they only do a subjective comparison between the
| generated audio and the _soundstream-encoded_ original audio,
| which seems a bit disingenuous. Listening to soundstream audio
| in headphones, I can hear those same artifacts.
| jeffbee wrote:
| Did you read the paper? They intentionally steered the quality
| to ensure they sound fake. Their generated speech is "very easy
| to detect" according to the reference at the end of the paper.
| tagyro wrote:
| Just to be clear, one could mistake them for some (voice)
| actor reading a book (maybe), but even to my untrained ear
| they sound fake and artificial.
|
| Am I missing something?
| kvn8888 wrote:
| It's meant to sound artificial. The focus is on speed and
| consistency
| willemmerson wrote:
| I don't have anything intelligent to say about this, but it's
| a LOT of fun making all the samples play at the same time --
| sort of like the HTML version of Ableton Live.
| [deleted]
| anigbrowl wrote:
| Good for fraudsters and spammers, bad for anyone who ever hoped
| to make a living from voice acting. I'm perplexed by AI
| technologists' seemingly incessant drive to automate away the
| existence of artistic performers.
| croes wrote:
| Why spare artists if everyone else gets replaced by technology?
| anigbrowl wrote:
| They don't, otherwise there would be many former CEOs living
| in tents. In reality, those who control large amounts of
| capital are quite willing (and increasingly say so in the
| open) to deprive others of their livelihoods, homes, and
| ability to feed themselves in order to realize a marginal
| increase in their own wealth.
| Legend2440 wrote:
| You are being deliberately pessimistic. There are a million
| fantastic, practical uses for text-to-speech.
| anigbrowl wrote:
| I am not. The use cases like interactive assistants for the
| blind will generate very little commercial activity compared
| to the uses (and abuses) for entertainment and marketing
| purposes. A good example of this from the real world is the
| absence of cheap/open ASL interpretation for deaf people.
| Legend2440 wrote:
| Imagine having an app on your phone that turns any ebook
| into an audiobook.
|
| Imagine replacing crappy phone menus with polite virtual
| assistants that actually understand what you're saying.
|
| Imagine an AI language tutor that speaks every language in
| the world fluently. Or a universal speech-to-speech
| translator.
|
| And that's just off the top of my head. Clever people will
| come up with a lot more uses, I'm sure.
| anigbrowl wrote:
| I don't need your help imagining use cases; I've been in
| this field a lot longer than you, and have talked up the
| technological possibilities of AI-powered TTS here for
| *years. I understand the technology very well and am
| bullish on it. What I'm saying is that too much of the
| effort is being spent in solving the wrong problems.
| Please try reading what I wrote instead of your imaginary
| subtext.
| signatoremo wrote:
| Ever notice big huge font on the phone of older people? So
| big that a screen may only contain a few lines of text. Or
| that people has to pull out their reading glasses every
| time they check their phone? Text to speech is a godsend in
| that case. Enormous benefits to an increasingly older
| population.
| anigbrowl wrote:
| 'helping blind people' was literally the first use case I
| mentioned. Maybe you should have read the comment before
| reacting to it.
| signatoremo wrote:
| Huh? How big is the blind group compared to the older
| population?
|
| You are saying it's not economical to use text-to-speech
| to support blind people. I'm saying the benefits are huge
| for the older population. It isn't just for fraudsters or
| spammers as you claim.
| anigbrowl wrote:
| No, I'm not saying that at all. I'm saying the resources
| invested in helping people will be dwarfed by those
| invested in crap designed to exploit them economically or
| criminally.
| signatoremo wrote:
| Setting aside the fact that you have absolutely no proof
| of that claim, the criminal world is tiny compared to the
| number of people who benefit from TTS (God forbid that
| isn't the case). Encryption, as an example, is hugely
| beneficial to regular people despite being used and
| exploited extensively in shady and questionable activities.
| wg0 wrote:
| LLMs aren't great and can't be relied upon in business setting
| or at least I would not.
|
| But think open world games. GTA VII for example where all NPCs
| have their dialogs auto generated in real time but also
| converted to audio in real time.
|
| That's going to be a world which would be a lot more
| spontaneous with lot less effort.
|
| Right now, If memory serves me right, GTA V dialogs alone are
| 5000 pages or more, hand written.
| anigbrowl wrote:
| That's all true, but I think it's a pity that the jobs that
| currently exist for voice artists will disappear. Gamers and
| consumers will have somewhat better interactive experiences,
| which is good. Indie game developers will also be able to put
| out games with lower budgets, which is nice for them. But the
| market for voice acting work is largely going to dry up and
| blow away for people who are not already at the top of that
| field. People who could previously have made a modest but
| sufficient living as voice performers will be replaced by
| computer-generated voices. It will be almost impossible to
| make a living in that field within 5 years.
| wg0 wrote:
| Generative models for images are nothing new and have
| been around for a while already. But even today, if you
| really want creative control and expression, you need a
| designer who's good with Photoshop or Illustrator etc.
|
| This applies to LLMs as well. You can get them to write
| plausible BS, but if you really want a well-articulated
| write-up about something that's rooted in reality, a
| human has to be brought on board.
|
| This equally extends to voice-over. If you really want
| expressive and creative control to produce an outstanding
| rendering of something, AI isn't going to cut it.
| anigbrowl wrote:
| This is only true if you assume AI isn't going to keep
| improving. It gets significantly better on a quarterly
| basis, far faster than the time it takes for an actor to
| develop their craft and career. The output quality of
| today's cutting-edge models would have been science
| fiction only 2-3 years ago.
| wg0 wrote:
| I'm not so sure about the future. These models -- all of
| them -- don't have a well-understood input-output mapping,
| and that's going to be a problem for a very long time.
___________________________________________________________________
(page generated 2023-07-16 23:00 UTC) |