LB’S NOTE: A few weeks ago, Fake Video, made possible by comparatively easy-to-use software that can replace people’s faces and expressions with those of others without an overt sign of doctoring, was the latest Big Scary Thing.
Now, however, we’re talking about something even more fascinating – to me anyway – AI voice replacement so natural that, at the very least, an awful lot of actors may soon be out of jobs. TVWriter™ Bob Tinsley tells the tale (in his own voice).
by Bob Tinsley
I’ve recently begun following the Voice First movement and the influence of AI in publishing.
Because of the boom in the audiobook market, writers often make more money from their audiobooks than they do from their print and ebook editions combined, and that divide increases every day. Fewer units sold, but bigger profit margins.
This is leading to an increasing preference for releasing audiobooks either before the print release or simultaneously with it. Usually, because of production times (see below), audiobook releases lag print and ebook by a couple of months.
The increasing use of voice interfaces with computers and web entities like Alexa, Siri, and Google Assistant has led to voice SEO.
If you ask Alexa to find books by a given author, it won’t find anything unless that author has audiobooks, and that’s in Amazon’s own environment! Same with Siri and Google Assistant. So if you don’t have voice products out there, you don’t get found.
So, Voice First. Of course this narrow search will probably get corrected somewhere down the line, but the smart ones aren’t waiting for that.
Fitting nicely with this is the disruptive role of AI.
Joanna Penn, who runs a high-six-figure, one-person publishing business and is always looking to the future to find ways to keep that business growing, is fascinated with the role of AI in voice emulation.
She went to a website called lyrebird.ai, gave them a 45-second sample of her voice, and the AI gave her back a generated file of a different text in her voice with her inflections. She could tell it wasn’t really her voice, but it was close. If their service were commercially ready, she said, she’d be using it to produce her podcasts now.
Amazon Web Services has a service called Polly. It will convert any text file into an MP3 file. I’ve been playing with it for a couple of days. It converted a 4,400-word short story into an MP3 file in a fraction of a second.
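For anyone who would rather script the conversion than click through it, here is a minimal sketch in Python. It is not a record of how the story above was converted. It assumes the third-party boto3 library is installed, that AWS credentials are already configured, and that a single request is capped at roughly 3,000 characters (check the current Polly documentation for the real limit). The helper names `chunk_text` and `synthesize_to_mp3` and the voice ID "Joanna" are illustrative choices.

```python
from typing import List

MAX_CHARS = 3000  # assumed per-request text limit; verify against the Polly docs


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split text into pieces of at most max_chars, breaking on whitespace."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word  # a single over-long word is kept whole
    if current:
        chunks.append(current)
    return chunks


def synthesize_to_mp3(text: str, out_path: str, voice_id: str = "Joanna") -> None:
    """Send each chunk to Polly and append the returned MP3 bytes to out_path."""
    import boto3  # third-party; imported here so chunk_text works without it

    polly = boto3.client("polly")
    with open(out_path, "wb") as out:
        for chunk in chunk_text(text):
            response = polly.synthesize_speech(
                Text=chunk, OutputFormat="mp3", VoiceId=voice_id
            )
            out.write(response["AudioStream"].read())
```

Calling `synthesize_to_mp3(open("story.txt").read(), "story.mp3")` would produce one file. Appending MP3 segments back to back usually plays fine, though a pass through an audio editor afterward is a safer way to join them.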
The voice (Polly has, at present, four female voices and three male voices available in 26 languages) is still robotic but better than any other widely available text-to-speech engine. The tone of these voices actually rises at the end of a question.
Not only is it fast, it’s essentially free. You can convert over 800,000 words per month for free. Beyond that, it costs an extremely reasonable $4.00 or so for each additional 160,000 words.
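To make that arithmetic concrete, here is a small sketch that estimates a monthly bill from a word count. The numbers are the rough ones quoted above, not an official AWS price list, and the proration of partial blocks is my assumption. The function name `polly_cost_usd` is made up for illustration.

```python
FREE_WORDS_PER_MONTH = 800_000  # free allowance as quoted in this article
BLOCK_WORDS = 160_000           # size of each additional paid block
BLOCK_PRICE_USD = 4.00          # approximate price per additional block


def polly_cost_usd(words_this_month: int) -> float:
    """Estimate the monthly cost, prorating any partial block (an assumption)."""
    billable_words = max(0, words_this_month - FREE_WORDS_PER_MONTH)
    return billable_words / BLOCK_WORDS * BLOCK_PRICE_USD


print(polly_cost_usd(100_000))    # one full-length novel: inside the free allowance
print(polly_cost_usd(1_000_000))  # ten novels in a month: 200,000 billable words
```

Under these figures a writer would have to push about eight full-length novels through in a single month before paying anything.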
Couple this info with Voice First. Producing a 100,000-word audiobook with a human narrator today will cost between $1,800 and $2,600 and take a couple of months, depending on how busy the narrator is. Polly will do it in minutes for nothing.
I am experimenting with having Polly narrate another of my stories with several voices. I divided my story into blocks, or scripts, for different voices then ran the blocks through Polly using two female voices and two male voices.
For a 3,500-word story, it took me about an hour to divide the text into four scripts and less than a second to convert each script. I need more than two male voices, but I can frequency-shift the ones I have to get the others.
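The dividing step is the slow part, and some of it could be automated if the story is tagged first. The sketch below assumes a markup convention I am inventing for the example: each line may start with a speaker tag in square brackets, such as `[BILL]`, and untagged lines go to a narrator. It is not the process described above, just one way it might be done. Each line keeps its position number so the audio clips can be put back in story order later.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical tag format: "[VOICE] text of the line"
TAG = re.compile(r"^\[(\w+)\]\s*(.*)$")


def split_by_voice(
    story: str, default_voice: str = "NARRATOR"
) -> Dict[str, List[Tuple[int, str]]]:
    """Group non-empty lines by voice, keeping each line's story position."""
    scripts: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    position = 0
    for raw_line in story.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines carry no speech
        match = TAG.match(line)
        if match:
            voice, text = match.group(1), match.group(2)
        else:
            voice, text = default_voice, line
        scripts[voice].append((position, text))
        position += 1
    return dict(scripts)


story = """[NARRATOR] Night fell over the river.
[BILL] Who goes there?
No answer came.
[BILL] Speak up!"""

for voice, lines in split_by_voice(story).items():
    print(voice, lines)
```

Each voice's list can then be sent to a different generated voice, and the position numbers say where each resulting clip belongs in the final mix.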
I loaded the MP3 files into Audacity and greatly improved intelligibility just by doing some amplifying, normalizing, and equalizing. So far, I’d say I’ve put about 2 hours into the project. Next comes production and editing. I figure I’ll probably come out under 6 hours in two or three days. My Wild Bill production with real live actors for Drift & Ramble took a month and around 20 hours for production.
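For the curious, normalizing is simple arithmetic: find the loudest sample and scale everything so that peak lands at a chosen level. The sketch below shows the idea on a plain list of samples in the range -1.0 to 1.0. It illustrates peak normalization in general, not Audacity's own code, and the -1 dB default target is my assumption.

```python
from typing import List


def peak_normalize(samples: List[float], target_dbfs: float = -1.0) -> List[float]:
    """Scale samples so the loudest one sits at target_dbfs (0 dBFS = full scale)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    target_peak = 10 ** (target_dbfs / 20)  # convert decibels to linear amplitude
    gain = target_peak / peak
    return [s * gain for s in samples]


quiet_clip = [0.25, -0.5, 0.1]
print(peak_normalize(quiet_clip, target_dbfs=0.0))
```

Running every voice's clips through the same target level keeps one character from sounding louder than another just because the generated files came out at different volumes.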
Now no one will mistake this for a human dramatic reading, but the advantages are obvious. A lot of people won’t be able to stand it for very long, but there are also a lot of people out there who listen at 1.5 to 3 times normal speed, want content, and don’t care what it sounds like as long as it’s intelligible. This technique seems perfect for that audience.
It’s also perfect for creators who want to start a podcast but are mic-shy, don’t have the training, don’t have a quiet recording space, or don’t have the equipment.
Write out your podcast, run it through Polly, and post it. Zero time investment past the writing, which our (not really so mythical) creator was going to do anyway.
I’m thinking that as time goes by, not only will audiences become more used to listening to generated voices, but those generated voices will improve until they are indistinguishable from human voices. When they do, audio performances of all kinds will have an exciting and potentially very profitable new twist.
We aren’t there yet, but we will be. A whole new generation of creative dreamers will have the best chance in history to share their dreams (and not go broke doing it).
(Admission: I stole that last paragraph from Gene Roddenberry and TVWriter™’s Larry Brody. It’s something Roddenberry said to LB at a script meeting a ton of years ago. Wonder what the Great Bird of the Universe will sound like via AI.)
Bob Tinsley is an artist, writer, boataholic and a new pro in the field of Audio Drama. In other words, he’s an expert in finding new marketplaces, as he’s showing us here.