|
| binary132 wrote:
| When people wax eloquent about how the artisans will just find
| something new to do for work, what they fail to mention is that
| the new work is often a menial and lower-paid job. When Amazon
| puts mom and pop shops out of business, they don't go start new
| businesses, they go get jobs at Wal-Mart.
| qwertox wrote:
| In CGI there were always these milestones I observed getting
| reached: trees with leaves finally looking close to realistic,
| wind blowing in grass looking almost realistic, hair, jelly.
| It was usually Pixar shorts pointing out what they had been
| focusing on, and then you'd see it applied to their movies.
|
| Then mocap, mapping digital faces on real actors which was first
| mind-blowing to see in Pirates of the Caribbean, then the apes in
| one of the Planet of the Apes movies... So much in the CGI
| industry has already reached a point where the hardest problems
| seem to have been solved.
|
| When I just clicked play on the first synthesized dialogue
| from "Dialogue Synthesis" ("Where did you go last summer? | I
| went to Greece, it was amazing."), I was blown away. It's as
| if we've now reached one of those milestones where a problem
| appears to be cracked. Machines will be able to really sound
| like humans, indistinguishable from them.
|
| 5-10 years ago, if you wanted to deal with TTS, the best
| option you had was to let your Android phone render TTS into
| an audio file, because everything else sounded really bad.
| Especially open-source stuff sounded absolutely horrible.
|
| So how long will it be until we will be able to download
| something of this quality onto a future-gen Raspberry Pi which
| can do some AI processing, where we make an HTTP call and it
| starts speaking through the audio out in a perfect voice without
| relying on the cloud? 5 years?
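| The HTTP-call-to-speaker idea above can be sketched with
| nothing but the Python standard library. Everything here is my
| own illustration, not any particular model's API: the
| `synthesize()` function is a placeholder that emits a
| sine-wave WAV, standing in for whatever on-device model
| eventually runs on the Pi.

```python
import io
import math
import struct
import wave
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

SAMPLE_RATE = 24000  # 24 kHz, matching the output rate of models like SoundStorm


def synthesize(text: str) -> bytes:
    """Placeholder for a local TTS model: returns half a second of a
    440 Hz sine tone as a WAV file, regardless of the input text."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 16-bit samples
        w.setframerate(SAMPLE_RATE)
        n = SAMPLE_RATE // 2   # 0.5 seconds of audio
        frames = b"".join(
            struct.pack("<h", int(8000 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE)))
            for i in range(n)
        )
        w.writeframes(frames)
    return buf.getvalue()


class SpeakHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /speak?text=Hello -> audio/wav response
        q = parse_qs(urlparse(self.path).query)
        audio = synthesize(q.get("text", [""])[0])
        self.send_response(200)
        self.send_header("Content-Type", "audio/wav")
        self.send_header("Content-Length", str(len(audio)))
        self.end_headers()
        self.wfile.write(audio)


def main():
    # On the device, this blocks and serves requests on port 8080.
    HTTPServer(("0.0.0.0", 8080), SpeakHandler).serve_forever()
```

| Only `synthesize()` would need to change once a real local
| model is available; the HTTP plumbing stays the same.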
| bckr wrote:
| I would bet 2 years tops
| amelius wrote:
| Another question, how long until we have systems that can sing
| 10 octaves and we don't need/want any actual human singers
| anymore?
| jayd16 wrote:
| People like to sing along though.
| tialaramex wrote:
| People like playing drums too, but a drum machine means
| that if you're not any good at it or too busy but you need
| drum sounds you can have drum sounds.
|
| There are rights issues if the result is that it replaces a
| particular singer: if you made it so that Sneaker Pimps could
| fire Kelli but still have her voice on subsequent songs,
| that's a problem. But suppose you're a bedroom musician,
| and you realise you've got a piece that really wants
| somebody with a different voice than yours to make it work
| - you _can_ pay someone, but technology like this offers a
| cheaper, easier option.
| ttul wrote:
| As a choral singer, if there's an app that one day allows me
| to sing with a fake choir of extremely good singers, I would
| enjoy doing that all day long. And it would allow my actual
| choir to practice way more, making our performances far
| better.
| nraford wrote:
| This exists right now!
|
| Not as an app exactly, but you should check out Holly
| Herndon and Mat Dryhurst's suite of tools called "Holly
| Plus":
|
| https://holly.plus/
|
| I'm pretty sure you can access their model somehow and even
| train your own voice using their "spawning" approach.
|
| She did an awesome TED talk demonstrating this:
|
| https://www.ted.com/talks/holly_herndon_what_if_you_could_s
| i...
|
| Here's a cool example, using Dolly Parton's song "Jolene":
|
| https://www.youtube.com/watch?v=kPAEMUzDxuo
|
| I don't think it's quite at the level of consumer use yet,
| but I know they're working on it. Definitely check it out.
| JonathanFly wrote:
| >So how long will it be until we will be able to download
| something of this quality onto a future-gen Raspberry Pi which
| can do some AI processing, where we make an HTTP call and it
| starts speaking through the audio out in a perfect voice
| without relying on the cloud?
|
| 5 years? It's probably possible roughly whenever the larger
| Whisper models can run on it. Probably the next Raspberry Pi,
| running quantized or optimized versions of some audio model.
|
| It may be almost possible right now if you tried really,
| really hard, and you used a small model fine-tuned on a single
| voice instead of something larger and more general-purpose
| that can do any voice. I think whisper-tiny works on a Pi in
| real time, right? And that's not leveraging the GPU on the Pi.
| (https://github.com/ggerganov/whisper.cpp/discussions/166)
|
| Edit: looks like medium is 30x slower on the Pi than tiny
| model, so I may have been overly optimistic. I didn't realize
| Whisper tiny was that much faster than medium.
|
| This method works pretty well with Tortoise, letting you use
| the super fast Tortoise quality settings but get quality
| similar to the larger models. Fine-tuning the whole thing on
| just one voice removes a lot of the cool capabilities of
| course. With Tortoise, that would still be way too slow for a
| Pi but potentially that same strategy could work with faster
| models like SoundStorm.
|
| In terms of quality there's still a lot of room to go with
| long-term coherence, like long audio segments. When a real
| person reads an audiobook, the words at the top of the page
| have a pretty big impact on how the words at the bottom of
| the page are read. And there can be some impact at any
| distance, page 10 to page 300. When you try audiobooks on
| super-high-end TTS models and listen carefully you really
| notice the mismatch. It's like the reader recorded the
| paragraphs out of order, or video game voice lines where you
| can tell the actors recorded all the lines separately and
| were not reacting to each other's performance.
|
| You can bump the context window: a minute, two minutes. That's
| going to get you closer, and probably good enough for some
| books. In the short term a human could simply adjust all the
| audio samples and manually tweak things to sound correct. So
| this will enable fan-created audiobooks where they take the
| time to get it right. But for fully automated books the
| mismatch drives me nuts. The performance is just soooo close
| for certain segments that when you get a tonal mismatch it
| hurts.
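| One cheap version of that bumped-context-window idea is to
| synthesize overlapping chunks and keep only the new audio from
| each. The sketch below (names and defaults are my own, not
| from any particular model) just does the text bookkeeping:

```python
def windowed_chunks(sentences, window=5, overlap=2):
    """Split a sentence list into chunks where each chunk carries the
    `overlap` sentences preceding it as conditioning context.

    The idea: synthesize context + body together so the model's prosody
    carries across chunk boundaries, but only keep the audio for the
    body sentences, which never overlap between chunks."""
    assert 0 <= overlap < window
    step = window - overlap  # how many new sentences each chunk contributes
    chunks = []
    for start in range(0, len(sentences), step):
        body = sentences[start:start + step]
        if not body:
            break
        context = sentences[max(0, start - overlap):start]
        chunks.append({"context": context, "body": body})
    return chunks
```

| A driver loop would then condition each synthesis call on the
| chunk's "context" sentences and keep only the audio for its
| "body" sentences, so each boundary inherits the previous
| prosody.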
| nine_k wrote:
| If you need a really compact form factor, you can buy a Jetson
| right now and run more complex models on it. It's pricey
| though.
| JonathanFly wrote:
| Interesting that SoundStorm was trained to produce dialog between
| two people using transcripts annotated with '|' marking changes
| in voice. But the exact same '|' characters seem to mostly work
| in the Bark model out of the box and also produce a dialog?
|
| Maybe a third or a bit more of Bark outputs are one person
| talking to _themselves_ -- and it often misses a voice change.
| But the pipe characters do reliably produce audio that sounds
| like a _dialog_ in the performance style.
|
| https://twitter.com/jonathanfly/status/1675987073893904386
|
| Is there some text-audio data somewhere in the training data that
| uses | for voice changes?
|
| Amusingly, Bark tends to render the SoundStorm prompts
| sarcastically. Not sure if that's a difference in style in the
| models, or just Google cherry picking the more straightforward
| line readings as the featured samples.
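| For experimenting with the '|' convention, a tiny helper (my
| own naming, not Bark's or SoundStorm's API) turns an annotated
| prompt into alternating speaker turns:

```python
def split_turns(transcript, speakers=("A", "B")):
    """Split a '|'-annotated transcript into (speaker, utterance) pairs,
    alternating voices at each pipe, as in the SoundStorm prompt style."""
    turns = []
    for i, part in enumerate(transcript.split("|")):
        part = part.strip()
        if part:  # skip empty segments from leading/trailing pipes
            turns.append((speakers[i % len(speakers)], part))
    return turns
```

| Running it on the demo prompt gives [('A', 'Where did you go
| last summer?'), ('B', 'I went to Greece, it was amazing.')],
| which is the turn structure a two-speaker pipeline would need.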
| og_kalu wrote:
| The creators won't say, as far as I know, but Bark looks to be
| trained on a lot of YouTube corpora (rather than typical ML
| audio datasets), where audio may have transcripts like that,
| which is why stuff like [laughs] works.
| neilv wrote:
| In the future, will children think it's normal to talk like,
| "Hey, what up, Youtube! ... Be sure to like and subscribe!
| ... Smash that like button! ... Let me know in the comments
| down below!"?
|
| I wonder how ML trained on the tone transitions to a
| sponsored segment dripping with secret shame... would infect
| general speech.
| JonathanFly wrote:
| Yeah, I often try to think about what might be in a YouTube
| caption when finding prompts that work in Bark. But the pipe
| character isn't one I remember seeing on YouTube. Maybe it's
| part of some other audio dataset, though. Or maybe it's on
| YouTube but only in non-English videos.
| butz wrote:
| With all the recent advances, are there any decent TTS voices
| for Linux that are not complicated to set up for a regular
| user?
| elAhmo wrote:
| This is nothing short of amazing. It is exciting, and a bit
| scary as well, what the future will bring.
|
| It just makes me sad that I cannot open this page in Safari.
| It will not play a single audio clip, yet Chrome plays them
| fine. So here we are, able to generate audio, video, and code,
| and do amazing things with AI, but a simple website with text
| and audio does not work on the most popular laptop out there.
| mg wrote:
| I wonder if work marketplaces like UpWork and Fiverr will adapt
| quickly enough to this new situation, where many of their
| services, which in the past were done by humans, can now be done
| by software.
|
| Their current marketplace interface seems inadequate for this.
| Instead of contacting a human and then waiting for them to
| finish the work, buyers will want to get results right away.
|
| Therefore they will have to change their platform to work like
| an app store, where the sellers connect their services and
| buyers can use them.
| seydor wrote:
| > where many of their services, which in the past were done by
| humans, can now be done by software.
|
| Their users are already using AI to do the work that they are
| supposed to do. I think that's fine.
| throw47474777j wrote:
| Why wouldn't people just use existing software markets?
| mg wrote:
| For example?
| throw47474777j wrote:
| App Stores, the web, etc. How else does software as a
| service get sold? It's not a new thing. Probably a lot of
| these things will just end up as features in existing
| systems.
| mg wrote:
| Existing app stores like the ones on iOS and Android mostly
| target casual use cases, mobile devices, and on-device
| software, not "buy once" experiences for work via software
| as a service. They also do not offer a unified experience.
| Two "text-to-speech" apps could have completely different
| user interfaces.
|
| The web does not have good discovery and reputation
| management and also does not provide a unified interface.
| That is why market places like Booking.com, Amazon,
| Spotify etc have become so big.
| Legend2440 wrote:
| Why does everybody focus on "how will this replace humans?"
| It's just a really good text-to-speech.
| pjmlp wrote:
| Maybe because I no longer hear friendly human voices at train
| stations, but rather computer-generated train announcements?
|
| While those people are now looking for jobs elsewhere.
| Legend2440 wrote:
| Fantastic! That's a massive efficiency gain.
|
| We will not run out of productive things to do with our
| time. Labor force participation has stayed at 60-70%
| despite centuries of automation.
| pjmlp wrote:
| Lovely capitalism.
| relativ575 wrote:
| Announcements often get played repeatedly -- "Train 101 to
| Lisbon is now on track 5". Why would you want to torture
| station workers with that?
|
| Instead, make an effort to start a conversation with your
| fellow travelers, or graciously respond to such an effort
| from them. Apologies if you already do.
| pjmlp wrote:
| Better a tortured job that puts food on the table than
| none at all.
| cpill wrote:
| tell that to the kids in Nike sweat shops
| ImHereToVote wrote:
| Personally I can't wait for all the streets to be lined with
| the homeless like in SF. So good.
| akaij wrote:
| It's kinda sad to see you believe that this is the
| inevitable outcome.
| pjmlp wrote:
| Well, if we imagine that the only thing that will be left
| are physical jobs that can't be done by computers.
|
| At least until they get clever enough to start a
| transformers line factory.
| Legend2440 wrote:
| This is the lump of labor fallacy. It's not about "what
| jobs will be left", it's about the new jobs we'll invent
| with all the time we'll have on our hands.
|
| There was never a fixed number of jobs, there's a fixed
| number of workers.
| pjmlp wrote:
| Well, we can also return to feudalism.
| PhasmaFelis wrote:
| Because it _will_ replace humans, and that's worth thinking
| about?
| nwoli wrote:
| Seems like we wouldn't be far at all from just correlating this
| to face movement (including subtle iris movement and blinks, not
| just the mouth). As long as you clearly label it as CGI it's
| harmless and I'm excited for the day to come. Might be quite fun
| to chat with a little buddy this way
| og_kalu wrote:
| It's good that Bing and Bard are using the latest Microsoft
| and Google Cloud offerings, but it would be nice to see these
| speech advances
| (along with audio palm - https://google-
| research.github.io/seanet/audiopalm/examples/ etc) hit public
| api's and/or user interfaces.
|
| Bard's TTS is alright but it's clearly behind.
|
| On that note, Bing's English/Korean TTS is really good. I also
| didn't realize Microsoft uses the best offerings for free TTS on
| edge so it blows google's default tts voices away.
| jameszhao00 wrote:
| Have you tried Google Cloud Studio voices?
|
| https://cloud.google.com/text-to-speech/docs/wavenet#studio_...
| og_kalu wrote:
| Yes. I'm not saying Google's top Cloud offerings are bad,
| although I still think Microsoft's stuff is better.
|
| Just that
|
| 1. It's behind their current sota research
|
| 2. You can only use those voices extensively by paying for
| them. Microsoft offers its best stuff in Edge for free. So
| for reading aloud a PDF or web page, Microsoft is far better.
| jameszhao00 wrote:
| By "SOTA" TTS I think you mean LLM-based TTS? With sound
| and language tokens trained GPT-style?
|
| Without going into too much detail, imo they're not really
| usable right now for TTS use cases.
| skybrian wrote:
| It's disappointing, but I wouldn't expect research
| algorithms to be available immediately unless they held it
| back until the product is ready. I guess Apple would do
| that?
| GordonS wrote:
| I used Azure TTS for a product demo voice-over recently, and
| nobody I showed it to knew it wasn't a human doing it!
|
| Some of Azure's voices are better than others, and the TTS web
| app has a few minor bugs, but overall I was really pleased with
| the whole experience.
| refulgentis wrote:
| > I also didn't realize Microsoft uses the best offerings for
| free TTS on edge so it blows google's default tts voices away.
|
| This sounds really interesting - can you share a bit more? I'm
| behind in this space, my parser got all jammed up, something
| like: "Microsoft uses [the best offerings for free TTS](as in
| FOSS libraries, or free as in beer SaaS?) [on edge](Edge
| browser, or on the edge as in client's computer?)(Is the
| implication that all TTS on the client's computer blows
| Google's default TTS voices away?)"
| GranPC wrote:
| I believe they mean that the free TTS feature in Microsoft
| Edge uses their best technology, and that said tech is better
| than Google's default offering.
| og_kalu wrote:
| The top voices you'd pay for on Azure's TTS service can be
| used for free to read web page (and PDF) text in Microsoft
| Edge. I don't mean open source.
|
| This is not the case with Google.
| wg0 wrote:
| I didn't know that. Edge is really good. I just downloaded it,
| and features like this are great.
| ShamelessC wrote:
| > public api's and/or user interfaces
|
| sigh. Google used to release _some_ models. Guess the fun early
| days are coming to an end.
| Legend2440 wrote:
| Google is a business and this is clearly a valuable product.
| ShamelessC wrote:
| Sure, but there was a time not too long ago when companies
| were still in the "goodwill" phase of handing out even
| highly valuable models like CLIP, guided-diffusion, etc.
| Come to think of it, it was mostly OpenAI doing this. And
| they kinda still do? But far more selectively. I'm just
| preemptively romanticizing that.
| rasz wrote:
| Product is something you sell to make money. The only real
| Google product is users sold to advertisers.
| vore wrote:
| Uh, what about all of their paid cloud offerings?
| rasz wrote:
| Distraction. It generated only 1% of overall profit last
| quarter, and that was the first time it didn't lose money.
| https://www.cnbc.com/2023/04/25/googles-cloud-business-
| turns...
| jsnell wrote:
| Google's non-advertising revenue in the latest quarter
| was about $15 billion. Is that a significant amount of
| non-ads product revenue? At least it is higher than the
| revenue of any of IBM, HP, Oracle, Intel, Cisco, Netflix,
| Broadcom, Qualcomm, or Salesforce in that same quarter.
|
| I think their non-ads businesses alone would be the 6th
| largest US tech company by revenue. (Amazon, Apple,
| Microsoft, the ads business of Alphabet, Meta. Am I
| forgetting something?)
| rasz wrote:
| Revenue is easy when you lose money on every dollar. Last
| quarter Ads printed $21B of income; the rest was a loss,
| except cloud, which didn't lose hundreds of millions of
| dollars for the very first time.
|
| https://abc.xyz/assets/investor/static/pdf/2023Q1_alphabe
| t_e...
| joezydeco wrote:
| [flagged]
| vore wrote:
| I don't want to defend Google's business practices, but
| this is such a trite comment someone always feels
| compelled to post on anything about Google, including
| even a research paper, apparently.
| joezydeco wrote:
| I'll argue it's not trite. It's a concise compilation of
| the thousands of teeth-gnashing comments here on HN and
| all over the internet whenever Google randomly drowns
| another one of its children.
|
| Just fucking stay away from Google products. Period.
| relativ575 wrote:
| First of all, it isn't a product. It's a f*king research
| paper, like dozens of others showing up on HN every day.
| Most of them never become a product.
|
| Second of all, by whining nauseatingly you drown out
| discussion of the merits of the technology, and chase
| people away. I hardly read Google news on HN now
| precisely because of that. Imagine if "Attention Is All
| You Need" came out now? [0]
|
| Save your complaint for when Google makes it a product.
|
| [0] - https://news.ycombinator.com/item?id=15938082
| serf wrote:
| >Save your complaint for when Google makes it a product.
|
| or save yourself the trouble and find alternatives to
| big-G.
|
| It's entirely their own fault that people now view all
| Google news as temporary and fleeting. People don't want
| to put time into things that'll get thrown away in a
| year.
|
| Reading G research papers seems like a shortcut to me: you
| know what will be thrown away in 2 years before it's a
| valid product in 1 year and someone gets huckleberry'd
| into devoting time and effort to implementing the
| dead-product-walking API.
| signatoremo wrote:
| > It's entirely their own fault that people now view all
| Google news as temporary and fleeting. People don't want
| to put time into things that'll get thrown away in a
| year.
|
| Most research doesn't become a product, from Google or
| anyone else. As research projects they still have value,
| unless you are saying Google research is garbage because
| Google has gotten into the habit of canceling its
| products.
|
| > or save yourself the trouble and find alternatives to
| big-G
|
| Totally valid point. No need to complain about it in a
| post about Google research though. It's tiresome.
| glimshe wrote:
| It's a very relevant comment. It tells you to not rely,
| or expect further development, on any new Google
| technology, even seemingly good ones, as it can go to the
| graveyard like many others.
| georgemcbay wrote:
| I don't bother to post the comment, but the high
| likelihood of any Google project/product being killed
| within a year or two is absolutely the first thought I
| have whenever a new Google project/product is announced
| (not because of HN posts, but because of their history),
| so good job on that Google.
| og_kalu wrote:
| Ha, I'm not even asking for code/model releases. It's just
| a bit funny that what you can *pay* Google to use is so
| far behind what they have up and running, collecting dust.
| ShamelessC wrote:
| Also true.
| Raed667 wrote:
| I'm speculating here, but it looks to me like the product
| (R&D) teams are not working closely with the research
| teams.
|
| Even the demo website is on Github Pages instead of a
| Google domain/blog.
| asutekku wrote:
| The most impressive part of this is that they are seemingly
| able to produce 30 seconds of TTS from just 3 seconds of
| source material. That is super cool, and honestly much
| further along the curve than I expected it to be.
| tagyro wrote:
| I've wasted (counting) about 300 seconds of my life listening to
| these audio files and they all sound and seem fake...
| svantana wrote:
| I found that in my (high quality) studio monitors, the audio
| sounded fine and hard to distinguish from 24kHz wav. But in
| headphones, the artifacts were pretty obvious. So probably some
| reverberation will do a lot to cover up artifacts. In the
| paper, they only do a subjective comparison between the
| generated audio and the _soundstream-encoded_ original audio,
| which seems a bit disingenuous. Listening to soundstream audio
| in headphones, I can hear those same artifacts.
| jeffbee wrote:
| Did you read the paper? They intentionally steered the quality
| to ensure they sound fake. Their generated speech is "very easy
| to detect" according to the reference at the end of the paper.
| tagyro wrote:
| Just to be clear, one could mistake them for some (voice)
| actor reading a book (maybe), but even to my untrained ear
| they sound fake and artificial.
|
| Am I missing something?
| kvn8888 wrote:
| It's meant to sound artificial. The focus is on speed and
| consistency
| willemmerson wrote:
| I don't have anything intelligent to say about this, but it's
| a LOT of fun making all the samples play at the same time --
| sort of like the HTML version of Ableton Live.
| [deleted]
| anigbrowl wrote:
| Good for fraudsters and spammers, bad for anyone who ever hoped
| to make a living from voice acting. I'm perplexed by AI
| technologists' seemingly incessant drive to automate away the
| existence of artistic performers.
| croes wrote:
| Why spare artists if everyone else gets replaced by technology?
| anigbrowl wrote:
| They don't, otherwise there would be many former CEOs living
| in tents. In reality, those who control large amounts of
| capital are quite willing (and increasingly say so in the
| open) to deprive others of their livelihoods, homes, and
| ability to feed themselves in order to realize a marginal
| increase in their own wealth.
| Legend2440 wrote:
| You are being deliberately pessimistic. There are a million
| fantastic, practical uses for text-to-speech.
| anigbrowl wrote:
| I am not. The use cases like interactive assistants for the
| blind will generate very little commercial activity compared
| to the uses (and abuses) for entertainment and marketing
| purposes. A good example of this from the real world is the
| absence of cheap/open ASL interpretation for deaf people.
| Legend2440 wrote:
| Imagine having an app on your phone that turns any ebook
| into an audiobook.
|
| Imagine replacing crappy phone menus with polite virtual
| assistants that actually understand what you're saying.
|
| Imagine an AI language tutor that speaks every language in
| the world fluently. Or a universal speech-to-speech
| translator.
|
| And that's just off the top of my head. Clever people will
| come up with a lot more uses, I'm sure.
| anigbrowl wrote:
| I don't need your help imagining use cases; I've been in
| this field a lot longer than you, and have talked up the
| technological possibilities of AI-powered TTS here for
| *years. I understand the technology very well and am
| bullish on it. What I'm saying is that too much of the
| effort is being spent in solving the wrong problems.
| Please try reading what I wrote instead of your imaginary
| subtext.
| signatoremo wrote:
| Ever notice big huge font on the phone of older people? So
| big that a screen may only contain a few lines of text. Or
| that people has to pull out their reading glasses every
| time they check their phone? Text to speech is a godsend in
| that case. Enormous benefits to an increasingly older
| population.
| anigbrowl wrote:
| 'helping blind people' was literally the first use case I
| mentioned. Maybe you should have read the comment before
| reacting to it.
| signatoremo wrote:
| Huh? How big is the blind group compared to the older
| population?
|
| You are saying it's not economical to use text-to-speech
| to support blind people. I'm saying the benefits are huge
| for the older population. It isn't just for fraudsters or
| spammers as you claim.
| anigbrowl wrote:
| No, I'm not saying that at all. I'm saying the resources
| invested in helping people will be dwarfed by those
| invested in crap designed to exploit them economically or
| criminally.
| signatoremo wrote:
| Setting aside the fact that you have absolutely no proof
| of that claim, the criminal world is tiny compared to the
| number of people who benefit from TTS (God forbid that
| isn't the case). Encryption, as an example, is hugely
| beneficial to regular people despite being used and
| exploited extensively in shady and questionable activities.
| wg0 wrote:
| LLMs aren't great and can't be relied upon in business setting
| or at least I would not.
|
| But think open world games. GTA VII for example where all NPCs
| have their dialogs auto generated in real time but also
| converted to audio in real time.
|
| That's going to be a world which would be a lot more
| spontaneous with lot less effort.
|
| Right now, If memory serves me right, GTA V dialogs alone are
| 5000 pages or more, hand written.
| anigbrowl wrote:
| That's all true, but I think it's a pity that the jobs that
| currently exist for voice artists will disappear. Gamers and
| consumers will have somewhat better interactive experiences,
| which is good. Indie game developers will also be able to put
| out games with lower budgets, which is nice for them. But the
| market for voice acting work is largely going to dry up and
| blow away for people who are not already at the top of that
| field. People who could previously have made a modest but
| sufficient living as voice performers will be replaced by
| computer-generated voices. It will be almost impossible to
| make a living in that field within 5 years.
| wg0 wrote:
| Generative models for images are nothing new and have
| been around for a while already. But even today, if you
| really want creative control and expression, you need a
| designer who's good with Photoshop or Illustrator etc.
|
| This applies to LLMs as well. You can get them to write
| plausible BS, but if you really want a well-articulated
| write-up about something that's rooted in reality, a
| human has to be brought on board.
|
| This equally extends to voice-over. If you really want
| expressive and creative control to produce an outstanding
| rendering of something, AI isn't going to cut it.
| anigbrowl wrote:
| This is only true if you assume AI isn't going to keep
| improving. It gets significantly better on a quarterly
| basis, far faster than the time it takes for an actor to
| develop their craft and career. The output quality of
| today's cutting-edge models would have been science
| fiction only 2-3 years ago.
| wg0 wrote:
| I'm not so sure about the future. These models -- all of
| them -- don't have a well-understood input-output mapping,
| and that's going to be a problem for a very long time.
___________________________________________________________________
(page generated 2023-07-16 23:00 UTC) |