I Cloned Myself With AI. She Fooled My Bank and My Family.

In our lifetime—WSJ


Our columnist replaced herself with AI voice and video to see how humanlike the tech can be. The results were eerie.

The good news about AI Joanna: She never loses her voice, she has outstanding posture and not even a convertible driving 120 mph through a tornado could mess up her hair.

The bad news: She can fool my family and trick my bank.

Maybe you've played around with chatbots like OpenAI's ChatGPT and Google's Bard, or image generators like Dall-E. If you thought they blurred the line between AI and human intelligence, you ain't seen (or heard) nothing yet.

Over the past few months, I've been testing Synthesia, a tool that creates artificially intelligent avatars from recorded video and audio (aka deepfakes). Type in anything and your video avatar parrots it back.

Since I do a lot of voice and video work, I thought this could make me more productive, and take away some of the drudgery. That's the AI promise, after all. So I went to a studio and recorded about 30 minutes of video and nearly two hours of audio that Synthesia would use to train my clone. A few weeks later, AI Joanna was ready.

Then I attempted the ultimate day off, Ferris Bueller style. Could AI me, paired with ChatGPT-generated text, replace actual me in videos, meetings and phone calls? It was…eye-opening or, dare I say, AI-opening. (Let's just blame AI Joanna for my worst jokes.)

Eventually AI Joanna might write columns and host my videos. For now, she's at her best illustrating the double-edged sword of generative-AI voice and video tools.

My video avatar looks like an avatar.

Video is a lot of work. Hair, makeup, wardrobe, cameras, lighting, microphones. Synthesia promises to eradicate that work, and that's why corporations already use it. You know those boring compliance training videos? Why pay actors to star in a live-action version when AI can do it all? Synthesia charges $1,000 a year to create and maintain a custom avatar, plus an additional monthly subscription fee. It offers stock avatars for a lower monthly cost.

I asked ChatGPT to generate a TikTok script about an iOS tip, written in the voice of Joanna Stern. I pasted it into Synthesia, clicked "generate" and suddenly "I" was talking. It was like looking at my reflection in a mirror, albeit one that removes hand gestures and facial expressions. For quick sentences, the avatar can be quite convincing. The longer the text, the more her bot nature comes through.

On TikTok, where people have the attention span of goldfish, those computer-like attributes are less noticeable. Still, some quickly picked up on it. For the record, I would rather eat live eels than utter the phrase "TikTok fam," but AI me had no problem with it.

The bot-ness became very obvious on work video calls. I downloaded clips of her saying common meeting remarks ("Hey everyone!" "Sorry, I was muted."), then used software to pump them into Google Meet. Apparently AI Joanna's perfect posture and lack of wit were dead giveaways.

All this will get better, though. Synthesia has some avatars in beta that can nod up and down, raise their eyebrows and more.

My AI voice sounds a lot like me. 

When my sister’s fish died, could I have called with condolences? Yes. On a phone interview with Snap CEO Evan Spiegel, could I have asked every question myself? Sure. But in both cases, my AI voice was a convincing stand-in. At first.

I didn’t use Synthesia’s voice clone for those calls. Instead, I used one generated by ElevenLabs, an AI speech-software developer. My producer Kenny Wassus gathered about 90 minutes of my voice from previous videos and we uploaded the files to the tool—no studio visit needed. In under two minutes, it cloned my voice. In ElevenLabs’s web-based tool, type in any text, click Generate, and within seconds “my” voice says it aloud. Creating a voice clone with ElevenLabs starts at $5 a month.

My sister, whom I call several times a week, said the bot sounded just like me, but noticed the bot didn’t pause to take breaths. When I called my dad and asked for his Social Security number, he only knew something was up because it sounded like a recording of me.  

The potential for misuse is real. 

The ElevenLabs voice was so good it fooled my Chase credit card’s voice biometric system.

I cued AI Joanna up with several things I knew Chase would ask, then dialed customer service. At the biometric step, when the automated system asked for my name and address, AI Joanna responded. Hearing my bot’s voice, the system recognized it as me and immediately connected to a representative. When our video intern called and did his best Joanna impression, the automated system asked for further verification.

A Chase spokeswoman said the bank uses voice biometrics, along with other tools, to verify callers are who they say they are. She added that the feature is meant for customers to quickly and securely identify themselves, but to complete transactions and other financial requests, customers must provide additional information.

What’s most worrying: ElevenLabs made a very good clone without much friction. All I had to do was click a button saying I had the “necessary rights or consents” to upload audio files and create the clone, and that I wouldn’t use it for fraudulent purposes.

That means anyone on the internet could take hours of my voice (or yours, or Joe Biden's or Tom Brady's), clone it and use it. The Federal Trade Commission is already warning about AI voice-related scams.

Synthesia requires that the audio and video include verbal consent, which I did when I filmed and recorded with the company.  

ElevenLabs only allows cloning in paid accounts, so any use of a cloned voice that breaks company policies can be traced to an account holder, company co-founder Mati Staniszewski told me. The company is working on an authentication tool so people can upload any audio to check if it was created using ElevenLabs technology.

Both systems allowed me to generate some horrible things in my voice, including death threats. 

A Synthesia spokesman said my account was designated for use with a news organization, which means it can say words and phrases that might otherwise be filtered. The company said its moderators flagged and deleted my problematic phrases later on. When my account was changed to the standard type, I was no longer able to generate those same phrases. 

Mr. Staniszewski said ElevenLabs can identify all content made with its software. If content breaches the company's terms of service, he added, ElevenLabs can ban its originating account and, in cases of law-breaking, assist authorities.

This stuff is hard to spot. 

When I asked Hany Farid, a digital-forensics expert at the University of California, Berkeley, how we can spot synthetic audio and video, he had two words: good luck. 

“Not only can I generate this stuff, I can carpet-bomb the internet with it,” he said, adding that you can’t make everyone an AI detective.

Sure, my video clone is clearly not me, but it will only get better. And if my own parents and sister can’t really hear the difference in my voice, can I expect others to?

I got a bit of hope from hearing about the Adobe-led Content Authenticity Initiative. Over 1,000 media and tech companies, academics and more aim to create an embedded “nutrition label” for media. Photos, videos and audio on the internet might one day come with verifiable information attached. Synthesia is a member of the initiative.

I feel good about being a human. 

Unlike AI Joanna, who never smiles, real Joanna had something to smile about after this. ChatGPT generated text lacking my personality and expertise. My video clone lacked the things that make me me. And while my video producer likes using my AI voice in early edits to play with timing, my real voice has more energy, emotion and cadence.

Will AI get better at all of that? Absolutely. But I also plan to use these tools to afford me more time to be a real human. Meanwhile, I’m at least sitting up a lot straighter in meetings now.
