Threat Intelligence · 7 min read · April 9, 2026

AI Voice Cloning Scams Are Here: How to Detect Synthetic Voices in 2026

FBI reports $25M+ lost to grandparent voice cloning scams in 2025. Learn how deepfake audio works, the spectral signals that expose it, and how Glance's Voice Guard feature detects synthetic voices locally in real time.

In 2025, the FBI's Internet Crime Complaint Center (IC3) documented more than $25 million in losses attributed directly to grandparent voice cloning scams — a category that did not meaningfully exist before 2023. A grandchild's voice, synthesized from a three-second TikTok clip, calls a grandmother and says “I've been arrested. Please don't tell mom. Send bail money now.” The voice is indistinguishable to the human ear. This article explains how that technology works, what signals expose it, and how Glance detects synthetic audio before it becomes a crisis.

FBI IC3 2025

Voice cloning fraud losses exceeded $25 million in the United States in 2025. The average victim was 67 years old. The average loss per incident was $9,400. Recovery rate: under 4%.

The Voice Cloning Threat

Voice cloning is not science fiction. It is a commodity service available to anyone with a browser and a credit card. Platforms like ElevenLabs, Resemble AI, and open-source models like Tortoise-TTS and XTTS-v2 can produce a convincing voice clone from as little as three to ten seconds of source audio. The attacker's workflow is straightforward: find a public audio sample of the target (a voicemail greeting, a YouTube video, a social media story), upload it to a cloning API, and generate the desired script.

The resulting audio passes casual human scrutiny because it reproduces the unique acoustic fingerprint of the original speaker — their formant frequencies, prosodic rhythm, and breathiness profile. What it cannot perfectly reproduce is the stochastic noise floor of real laryngeal tissue, and that is where detection lives.

The scam almost always follows the same script: a fabricated emergency (arrest, accident, kidnapping), a request for immediate financial transfer via wire, crypto, or gift cards, and a plea for secrecy. The emotional pressure is designed to override rational decision-making. Speed is the weapon. The call rarely lasts more than four minutes.

How Synthetic Voice Works

Modern neural text-to-speech (TTS) systems are autoregressive transformer models trained on thousands of hours of human speech. Given a short reference clip, a conditioning encoder maps the speaker's acoustic identity into a latent embedding. During generation, the decoder samples from that embedding to produce mel-spectrogram frames, which are then converted to a waveform by a vocoder (HiFi-GAN, UnivNet, or similar).

The weak point in this pipeline is the vocoder. Current vocoders introduce characteristic artifacts in the 4–8 kHz range where human phonation produces irregular, aperiodic excitation. Real speech also contains micro-pauses of 20–60 ms between phonemes caused by articulatory transitions. Cloned speech tends toward unnaturally smooth coarticulation because the model optimizes for intelligibility, not biological accuracy.

  • Synthesis artifacts in the 4–8 kHz spectral band (vocoder signature)
  • Overly smooth formant transitions — real speech has chaotic phoneme boundaries
  • Missing breath noise: cloned audio has uniform inter-utterance silence instead of inhale/exhale cycles
  • Flat prosodic variance — emotion is simulated via pitch shift, not the complex subglottal pressure changes of genuine distress
  • Micro-clipping artifacts at sentence boundaries where the TTS engine resets its autoregressive state

Detection Signals

Reliable synthetic voice detection operates on four signal categories simultaneously. No single signal is definitive — attackers can mask some artifacts by applying light telephone-quality compression (which blurs high-frequency artifacts). The combination of all four is what produces reliable discrimination.
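The fusion step can be sketched as a toy scorer that normalizes each of the four signals into a 0–1 evidence value and averages them into a 0–100 score. The normalization ranges, weights, and function names below are illustrative assumptions, not Glance's production logic:

```python
def clamp01(x):
    """Clip a value into the [0, 1] range."""
    return max(0.0, min(1.0, x))

def synthesis_score(flatness, jitter_pct, pause_chi2, phase_anomaly):
    """Fuse four per-category measurements into one 0-100 synthesis score.
    All normalization constants here are illustrative, not production values."""
    evidence = [
        clamp01((flatness - 0.5) / 0.4),    # higher flatness -> more suspicious
        clamp01((0.5 - jitter_pct) / 0.5),  # lower jitter -> more suspicious
        clamp01(pause_chi2 / 50.0),         # larger chi-squared -> more suspicious
        clamp01(phase_anomaly),             # assumed to arrive as a 0-1 score
    ]
    return 100.0 * sum(evidence) / len(evidence)

# Illustrative measurements for a cloned voice vs. a genuine one
print(synthesis_score(0.82, 0.2, 120.0, 0.7))  # high
print(synthesis_score(0.55, 0.9, 8.0, 0.1))    # low
```

An equal-weight average is the simplest possible fusion; a production system would more likely learn the weights, but the principle is the same: no single category decides the outcome.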

Spectral Envelope Flatness

Real speech has high spectral entropy in the 2–8 kHz band. Neural TTS vocoders produce a measurably smoother envelope. A flatness score above 0.78 in sustained vowel regions is strongly associated with synthesis.
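A flatness score of this kind is conventionally computed as the ratio of the geometric mean to the arithmetic mean of the power spectrum: near 1.0 for a smooth, noise-like envelope and near 0 for a jagged, strongly harmonic one. A minimal pure-Python sketch, using two synthetic spectra as illustrative stand-ins for vocoder output and real phonation:

```python
import math

def spectral_flatness(power_spectrum):
    """Geometric mean / arithmetic mean of a power spectrum.
    Returns a value in (0, 1]: ~1 for a flat envelope, near 0 for a peaky one."""
    eps = 1e-12  # guard against log(0)
    vals = [max(p, eps) for p in power_spectrum]
    geo_mean = math.exp(sum(math.log(v) for v in vals) / len(vals))
    arith_mean = sum(vals) / len(vals)
    return geo_mean / arith_mean

# A smooth, near-uniform envelope (as a vocoder might produce)
smooth = [1.0 + 0.05 * math.sin(i / 3) for i in range(64)]
# A jagged envelope with strong harmonic peaks (closer to real phonation)
peaky = [10.0 if i % 8 == 0 else 0.2 for i in range(64)]

print(spectral_flatness(smooth))  # close to 1.0
print(spectral_flatness(peaky))   # much lower
```

In practice the measure would be restricted to the 2–8 kHz band of sustained vowel regions, as the article describes; the 0.78 threshold applies to that context, not to arbitrary spectra.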

Glottal Pulse Regularity

Jitter (cycle-to-cycle pitch variation) and shimmer (amplitude variation) are measurably reduced in cloned speech. Real voices have jitter values of 0.5–1.2%; TTS output typically falls below 0.3%.
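Jitter can be estimated directly from a sequence of glottal cycle periods as the mean absolute cycle-to-cycle difference, expressed as a percentage of the mean period (shimmer is the analogous calculation on cycle amplitudes). A sketch using simulated cycle data, with noise levels chosen to land in the ranges cited above; extracting the periods from raw audio is a separate pitch-tracking problem:

```python
import random

def jitter_percent(periods_ms):
    """Local jitter: mean absolute difference between consecutive
    glottal-cycle periods, as a percentage of the mean period."""
    diffs = [abs(a - b) for a, b in zip(periods_ms, periods_ms[1:])]
    mean_abs_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods_ms) / len(periods_ms)
    return 100.0 * mean_abs_diff / mean_period

random.seed(7)
# ~100 Hz voice: 10 ms cycles. Real phonation wobbles cycle to cycle;
# TTS pitch tracks are far steadier (noise levels here are illustrative).
human_cycles = [10.0 + random.gauss(0.0, 0.08) for _ in range(300)]
tts_cycles = [10.0 + random.gauss(0.0, 0.02) for _ in range(300)]

print(jitter_percent(human_cycles))  # lands in the 0.5-1.2% human range
print(jitter_percent(tts_cycles))    # below the 0.3% TTS ceiling
```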

Pause Distribution

Human speakers produce pauses following a log-normal distribution. TTS pauses cluster tightly around a programmed silence constant. A chi-squared test on pause duration histograms flags this pattern in under 200ms.
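That comparison can be sketched as a Pearson chi-squared statistic over binned pause durations: observed counts from the call versus counts expected under a log-normal model of human pausing. The log-normal parameters, bin edges, and the simulated "silence constant" below are illustrative assumptions:

```python
import random

def histogram(values, edges):
    """Count values into the bins defined by consecutive edge pairs."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

def chi_squared(observed, expected):
    """Pearson chi-squared statistic against expected bin counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

random.seed(42)
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.2, 2.0]  # pause buckets in seconds

# Reference model: human pauses are roughly log-normal (parameters illustrative)
reference = [random.lognormvariate(-0.7, 0.5) for _ in range(10000)]
expected = [c / 100.0 for c in histogram(reference, edges)]  # scaled to n=100

human = [random.lognormvariate(-0.7, 0.5) for _ in range(100)]
tts = [0.45 + random.gauss(0.0, 0.02) for _ in range(100)]  # clustered pauses

print(chi_squared(histogram(human, edges), expected))  # small: fits the model
print(chi_squared(histogram(tts, edges), expected))    # large: flags the clone
```

Because the statistic only needs a histogram of a few dozen pauses, it can be computed incrementally as the call proceeds, which is what makes the sub-200 ms flagging latency plausible.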

Frequency Anomalies

The transition band between the telephone codec cutoff (3.4 kHz) and full-spectrum audio reveals phase anomalies in synthesized speech that are not present in recorded real-world calls.

How Glance Detects It

Glance's Voice Guard feature runs entirely on-device using the Web Audio API and WebRTC. When an incoming call is routed through the Glance-protected number, the audio stream is analyzed in real time using a lightweight ONNX model (under 4 MB) that evaluates all four signal categories simultaneously at 50 ms intervals.

No audio is ever uploaded to Glance servers. The model runs locally, produces a synthetic-voice probability score (0–100) for each 50 ms frame, and triggers an alert if the rolling 3-second average exceeds 72. The alert presents as a banner on the protected user's screen with the message: “Possible synthetic voice detected. Proceed with caution.”
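The rolling-average trigger described here can be sketched as a fixed-size window of per-frame scores. The 50 ms frame size, 3-second window, and 72 threshold come from the description above; the class name and structure are illustrative:

```python
from collections import deque

FRAME_MS = 50
WINDOW_FRAMES = 3000 // FRAME_MS  # 3-second rolling window = 60 frames
ALERT_THRESHOLD = 72              # rolling-average score that triggers the banner

class RollingAlert:
    """Keeps a rolling window of per-frame synthetic-voice scores (0-100)
    and reports when the full-window average exceeds the alert threshold."""
    def __init__(self):
        self.window = deque(maxlen=WINDOW_FRAMES)

    def push(self, score):
        self.window.append(score)
        avg = sum(self.window) / len(self.window)
        # Only alert once a full 3 seconds of audio has been scored
        return len(self.window) == WINDOW_FRAMES and avg > ALERT_THRESHOLD

alert = RollingAlert()
fired = False
# 3 s of low-score frames (genuine voice), then 3 s of high-score frames
for score in [20] * 60 + [95] * 60:
    if alert.push(score):
        fired = True
print(fired)  # True: the high-score frames eventually dominate the window
```

Averaging over the window is what keeps isolated noisy frames from triggering false alerts: a single 95-score frame in an otherwise clean call barely moves the mean.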

Because the model runs before the caller's words are processed semantically, it detects the forgery independently of what is being said. A perfectly scripted scam call is flagged the same way as a poorly scripted one. The deception signal lives in the physics of the voice, not the content of the message.

Voice Guard is available to Glance Pro and Family Circle subscribers. Setup takes under two minutes — no hardware required, no app download needed on the protected person's device.


What to Do If You Suspect a Clone

Technology is a backstop. The first line of defense is behavioral. If you receive an unexpected call from a family member claiming an emergency, follow this protocol before taking any financial action:

  1. Hang up immediately — do not let the caller pressure you to stay on the line.
  2. Call the family member back on a number you already have saved. Do not call back the number that called you.
  3. Ask for the family code word — a private phrase agreed on in advance that only real family members know.
  4. Contact another family member independently to verify the claimed situation.
  5. Never transfer money, purchase gift cards, or send cryptocurrency before verification is complete.

Frequently Asked Questions

Can voice cloning be done with only a few seconds of audio?

Yes. Modern TTS models like ElevenLabs and Tortoise-TTS can clone a voice from as little as three seconds of clean audio. Voicemails, YouTube videos, and social media clips are all viable sources for attackers.

Does Glance upload my audio to the cloud to analyze it?

No. Voice Guard runs a WebRTC-based analysis entirely on your device. No audio is ever transmitted to Glance servers. The detection model is downloaded once and runs locally in under 50 milliseconds per frame.

What should my family do right now to prepare?

Establish a family code word — a short phrase only your family knows. If someone calls claiming to be a family member in distress, ask for the code word before acting. This single step defeats virtually every grandparent voice cloning attack in use today.

Protect Your Family From Voice Cloning Scams

Voice Guard detects synthetic audio in real time, on-device, with zero cloud upload. Available on Glance Pro and Family Circle plans.

Try Voice Guard Free

Glance Security Team

Glance — Email Security for Families