Imagine a situation where someone phones an elderly relative and it appears to be a favourite grandchild on the call, begging for money to help out in an emergency. It sounds exactly like the child and the situation seems genuinely urgent, so what should they do? The FBI suggests they might want to say something like "What does the donkey say to the aardvark?"
Despite rapid advances, generative AI still misses the mark at times when producing images and videos, but voice reproduction is already remarkably good. With vocal cloning increasingly being used to carry out scams and fraudulent claims, the US Federal Bureau of Investigation has issued a list of tips (via Ars Technica) to help protect yourself, including the advice that you should create a secret word or phrase that only you and your family know.
Vocal cloning is a process where various audio clips of a person speaking are used to train a generative AI model, which can then replicate that person's normal speech patterns, tone, timbre, and so on. For example, it was used to create Google's 'podcast hosts' for its NotebookLM system, and you'd be hard-pressed to tell that it isn't real people speaking when you listen to it.
You don't need to be an expert in AI to see how such a thing can be misused for nefarious purposes. And even though a scammer would need genuine clips of your voice and speech mannerisms to clone you, it means there's a chance that someone out there could attempt to use 'you' to carry off a scam of some kind.
This is where proof of humanity comes in: essentially, it's a password or phrase shared between you and your family, or more accurately, an MFA (multi-factor authentication) system where your voice is one factor and the password is another. A generative AI system is most likely going to be stumped by the bizarre and nonsensical, because such a phrase falls outside the predictable patterns of normal conversation.
Asking "What does the donkey say to the aardvark?" might seem like the opening line to a joke but if you're expecting "Wednesday afternoon, in Nepal" as a reply, few generative systems are going to offer that reply. So when you don't get that response, you can be immediately suspicious as to the genuineness of the call.
It's not a foolproof system for catching an AI scam, as it does require all parties involved to remember both the challenge and the correct response, and to recall them at a moment's notice. If my mother suddenly asked me about donkeys and aardvarks out of the blue, I think my first response would be 'Err…what?' and then the phone would get slammed down, every time.
And then there are situations where it's not a family member on the phone, but your boss or a manager at the company you work for. Are you expected to have a unique password for everyone there, to ensure the business never falls for a scam?
AI voice cloning is arguably a natural extension of the usual email scam, where a higher-up appears to be asking you to send them money or sensitive information—if that person is publicly well-known and clips of them talking can be sourced, such scams would be relatively easy to carry out.
I don't think there's any perfect solution to counter AI voice scams, unfortunately. Even if the top companies in artificial intelligence, such as OpenAI, Google, and Microsoft, integrate systems that always make it obvious that it's an AI 'talking' and not a real person, there will always be people out there able to create a similar model that doesn't.
Hopefully, smarter folks than me are working on solving this thorny problem.