






Grace Ebert examines the growing use of AI technology in audiobook generation. While cost-effective, can it replicate the human experience of speech? It seems to be getting closer to the “real thing.”
Note: This narration is read by a DeepZen narrator.
“Hi, I’m Kellie. I’m a DeepZen narrator. You may think that I’m a human, but my voice is generated by DeepZen technology. Our tech allows me to showcase seven different emotions.”
This friendly voice rings from a small play button on the homepage of the AI company DeepZen, and it does, in fact, uncannily resemble a voiceover artist’s live recording. The clip proceeds with Kellie performing her septet of impressions, allowing her voice to rise to an exuberant pitch as she declares her happiness before lowering to convey the irritation she’s drawing from the text. Her synthetic interpretation grows more surreal the longer you listen, and this isn’t surprising considering the voice is a clone, produced by replicating recordings of a person speaking.
Bots like Kellie are some of the newest narrators venturing into the publishing landscape, and if the emotional sampling on the site holds up, her interpretation of a text, either fiction or non, wouldn’t necessarily elicit questions from listeners about the nature of the voice emanating from their AirPods.
A long way from their cassette-tape predecessors, audiobooks have seen rapid growth in recent years and are part of a market projected to reach $15 billion by 2027. They certainly buoyed the publishing industry at the onset of the COVID-19 pandemic, when shipping delays were indeterminate, and again now, amid ongoing supply chain backlogs. And beyond audiobooks’ value for publishers and consumers, the medium offers greater accessibility for people who are unable to read from a page, a concept that myriad newspapers and magazines, including The New York Times, The Washington Post, and this publication, have embraced by offering audio versions of their content.
Historically, though, these forms have been costly to produce, largely because they’re a deeply human endeavor. Practiced voiceover artists and actors spend hours recording in the studio, followed by an editing process, with the result costing publishers an average of $5,000 to $10,000 per title. It’s no wonder that companies are gravitating toward AI models that greatly reduce those costs, especially when the services integrate easily into established publishing systems. The Washington Post’s process using Amazon Polly, for example, is as follows: “When an article is ready for publication, the written content management system (CMS) publishes the text article and simultaneously sends the text to the audio CMS, where the article text is processed by Amazon Polly to produce an audio recording of the article. The audio is delivered as an mp3 and published in conjunction with the written portion of the article.” Seamless.
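To make that flow concrete, here is a minimal sketch of what the Polly step in such a pipeline might look like, using the AWS SDK for Python (boto3). The function name and the CMS hand-off around it are illustrative assumptions, not the Post’s actual code; only the synthesize_speech call reflects Polly’s real API.

```python
# Illustrative sketch of a Polly text-to-audio step; not the Post's actual code.
# Assumes AWS credentials are already configured in the environment.
import boto3

def article_to_mp3(article_text: str, out_path: str) -> None:
    """Send article text to Amazon Polly and save the returned mp3."""
    polly = boto3.client("polly")
    # A single synthesize_speech request is capped at a few thousand
    # characters, so a production pipeline would chunk long articles.
    response = polly.synthesize_speech(
        Text=article_text,
        OutputFormat="mp3",
        VoiceId="Joanna",  # one of Polly's stock US English voices
    )
    with open(out_path, "wb") as f:
        f.write(response["AudioStream"].read())

# Hypothetical hand-off from the written CMS to the audio CMS:
article_to_mp3("When an article is ready for publication...", "article.mp3")
```

The appeal for publishers is visible even in a toy like this: the entire narration step collapses into one API call that returns an mp3 ready to publish alongside the text.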
There are plenty of reasons to justify the original price tag for audiobooks in particular, one being that the experience of listening to a novel or memoir narrated by a celebrity or the author herself is unparalleled. Hearing Patti Smith’s gritty voice describe drinking black coffee in the same Greenwich Village cafe enhances the overt coolness of M Train, and having LeVar Burton narrate Astrophysics for Young People in a Hurry creates a profound mix of nostalgia, comfort, and childlike wonder. When a narrator has the power to augment a work the way Smith and Burton do, what happens when a bot like Kellie is producing our listening experience?
As I mentioned, it’s not immediately apparent that Kellie’s performance is manufactured, but what narrations like hers might lack for seasoned listeners is the art. Veteran Washington Post reviewer Katherine Powers writes, “Your own imagination and interpretation have more independence when reading a book yourself than when a narrator’s voice controls the text. Audiobooks could be said to be a species of translation: Although true to the words, they are different in character from the original, the printed page.” For Powers, it’s the oral performance of the text, with its breathy pauses and refined intonations, that provides new insight and nuance for the listener. This notion is rooted in our brain’s wiring, as writer Jane Alison explains in her book on the patterns of narrative: “Neural activity registering sound is about the same whether a word is read silently or aloud; a part of the brain called Broca’s area generates the ‘sound’ of the word internally.” In other words, when our brains process language, we both envision an image and “hear” the text as if it were audible, whether it is or not. If you’re reading the printed version of this article, for example, notice the internal voice that has been running this entire time; likewise, when you hear the word “orange,” you likely also visualize it as type. This makes translation an apt comparison, considering our sensory processing differs when we scan a page ourselves versus when a text is filtered through a voice that isn’t produced in our own brains.

[Image: illustration of Broca’s area. Source: ThoughtCo / Gary Ferster.]
Valuing this kind of performance or transposition is not new. As Rand Faris writes in a piece on spoken word, “Poetry does not have a particular sound. That’s the beauty of it. Its sound is elastic and very personal to the reader—just as a cover of a song, to the original version.” The same is true, of course, of prose. In the case of audiobooks, a narration diverges from Faris’s elasticity and gives solidity to the words printed on the page, transposing each phrase into a newly formed structure for the listener to nestle into. This is apparent in the performances from Smith and Burton, whose distinct interpretations shape the emotional impact of the text.
When Kellie is narrating, we’re hearing an algorithmic logic determine cadence, pause, and ultimately, the effect of the work. Her voice is generated using text-to-speech (TTS) technology to convert the written text into an audible counterpart, and then natural language processing (NLP) to add emotion and expression. “Human language is separated into fragments so that the grammatical structure of sentences and the meaning of words can be analysed and understood in context,” DeepZen shares. “This helps computers read and understand spoken or written text in the same way as humans.” The structure we find ourselves enmeshed in when listening to audio by Kellie or another bot, although rooted in a real person’s performance, remains patently inhuman.
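DeepZen doesn’t publish its pipeline, so the following is only a toy sketch of the fragment-and-analyse idea described above: split text into sentence-level fragments, match each against a small emotion lexicon, and attach a prosody hint a TTS engine could act on. Every name here (EMOTION_WORDS, annotate, the pitch values) is invented for illustration; real systems rely on trained models, not word lists.

```python
# Toy sketch of "fragment, analyse, add expression"; not DeepZen's pipeline.
import re

# Invented mini-lexicon; a production system would use trained emotion models.
EMOTION_WORDS = {
    "happy": {"joy", "delighted", "exuberant", "happiness"},
    "irritated": {"annoyed", "irritation", "frustrated"},
}

def fragments(text: str) -> list[str]:
    """Naively split text into sentence-level fragments."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def label_emotion(fragment: str) -> str:
    """Tag a fragment with the first emotion whose cue words appear in it."""
    words = set(re.findall(r"[a-z']+", fragment.lower()))
    for emotion, cues in EMOTION_WORDS.items():
        if words & cues:
            return emotion
    return "neutral"

def annotate(text: str) -> list[dict]:
    """Pair each fragment with an emotion label and a pitch hint for TTS."""
    hints = {"happy": "+15%", "irritated": "-10%", "neutral": "0%"}
    annotated = []
    for frag in fragments(text):
        emotion = label_emotion(frag)
        annotated.append({"text": frag, "emotion": emotion, "pitch": hints[emotion]})
    return annotated

for entry in annotate("I am delighted to meet you. The delay caused real irritation."):
    print(entry)
```

The gap between this word-list toy and a voice like Kellie’s is the point: her expressiveness comes from models trained on a real narrator’s recordings, which is exactly why the result can feel so uncannily human and yet remain a product of rules.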
It’s understandable that this illusion makes people uncomfortable, especially since there are valid ethical arguments against programming an AI to replicate someone else’s voice, a controversy that surrounded the director’s choice to do so in Roadrunner: A Film About Anthony Bourdain. There are also questions about the future of human creativity when bots are able to write an entire article and co-author book reviews, and the latter concern seems more closely tied to most readers’ rejections of AI-narrated audiobooks, especially when those are (presumably) created with consent and licensing agreements.
Some platforms, like Audible, are reluctant to utilize AI, with its self-publishing branch stating, “Your submitted audiobook must be narrated by a human. TTS recordings are not allowed. Audible listeners choose audiobooks for the performance of the material, as well as the story. To meet that expectation, your audiobook must be recorded by a human.” Because Audible controls a sizable portion of the market, its stance signals clear resistance to the idea that a bot can create a worthy translation of a text.
Here, it’s the human investment, the subtle creative interpretations, and the singular voices that make a particular performance meaningful, a long-held understanding in literature. Famed translator Margaret Jull Costa says about evaluating different iterations of the same work, “I usually use the analogy of the many different Hamlets one has seen over the years. They’re all Hamlet, but the best have invested every word with meaning and with their own self and life experience, too, and some you like more than others.” This applies, too, to audiobooks.

As with any other art form, readers will determine through their dollars and attention when an AI narration is acceptable (non-fiction titles and educational texts tend to be an easier sell). Sometimes we do want the human connection, an impulse evidenced by the number of questions on Apple’s site about what to do when Siri’s infamous voice turns “robotic.” If a narration as jagged and mechanical as the reverted Siri is difficult to listen to even for answers to questions like “how late is the grocery store open?” or “which theater is that movie playing at?”, then sitting through a book-length work we expect to be lyrical and poetic would be arduous.
Ultimately, though, a wholesale rejection of the technology isn’t helpful either, especially when companies like DeepZen offer convincing alternatives to the rigid synthetic narrators of years past. If an AI can help free up publishers’ budgets and make more titles accessible through audiobooks, whether read by a bot or not, that’s a worthy goal.
As for the art, we can return here to the enduring questions posed by Walter Benjamin about mechanical reproduction: “Even the most perfect reproduction of a work of art is lacking in one element: its presence in time and space, its unique existence at the place where it happens to be.” No bot that sounds like Smith or Burton will actually be one of those figures, and listening to their voices is often what drives us to those translations in the first place: hearing a work of art, whether their own or another artist’s, interpreted and translated into something new.




