Voice Identification In And Out Of The Gulag

In Aleksandr Solzhenitsyn’s novel, The First Circle, originally written in the 1950s, Soviet diplomat Innokenty Volodin makes an ill-advised phone call outside a Metro station. It is not long before he is rounded up by the authorities after his identity is partially confirmed by a “voiceprint”, the output of a hulking secret machine showing “frequency—across the tape; time—along the tape; and amplitude—by the density of the picture”.

So it is described by Major Roitman, a sleazy Stalinist principal investigator, who presides over a research lab of intellectual prisoners (the Gulag’s “first circle”). Anxious to win the approval of a stern high-ranking delegation, he offers up an underling: “And here is Candidate of Philological Sciences Rubin, the only person in the Soviet Union who can read visible speech”.

Solzhenitsyn did not make up the voiceprint, which is now usually called a spectrogram. In fact, he was probably involved with research conducted by the likes of the lowly Rubin and the unpleasant Roitman. Not only is the technical detail in this scene remarkably accurate, but Solzhenitsyn, a trained mathematician, met the people on whom the novel’s characters are based during his three-year stint at Special Prison No. 16, a “sharashka” for advanced scientific research.

Only in the past year was an unabridged English translation of "The First Circle" published. The previous version was nine chapters short, cut by Solzhenitsyn himself in the hope of skirting the ire of Soviet censors. Today, spectrograms, which MIT professor Victor Zue teaches his graduate students to read, are no secret.

Neither is the kind of automatic voice identification the “first circle” zeks, or prisoners, were studying. Tapes of Osama Bin Laden are analyzed for authenticity with software that has also been used to track down the likes of Juan Carlos Ramirez Abadia, a Colombian drug kingpin who repeatedly altered his facial features with plastic surgery but neglected to adjust the length of his vocal cords. In addition, many companies, mostly banks, are installing speaker verification systems to defend against fraud. In Dubai, police officers can call an automated fine-issuing system to report the license plates of polluters, and the calls are screened by voice identification software to limit abuse.

Some things have not changed since the 1950s. The spectrogram, a pictorial representation of a Fourier transform of an acoustic signal, is still the first step in automatic speaker verification. The Fourier transform, due to Joseph Fourier (also credited with discovering the greenhouse effect in 1824), decomposes the vibrations in the air detected by a microphone into a set of frequencies. A pure tone vibrates at exactly one frequency. Combining frequencies gives more complex sounds. A typical Fourier transform of a speech signal represents each 10-millisecond segment, or “frame”, with around 13 frequencies. Remarkably, neurons associated with hearing respond to specific frequencies, suggesting that our brains perform a kind of Fourier transform as well.
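For readers who like to see the arithmetic, here is a minimal sketch of a spectrogram computed from scratch. It is an illustration only, not the method used by any particular system: the function names, the 8000 Hz sample rate, and the test tone are all assumptions chosen for the example, and real systems add refinements (overlapping frames, windowing, compression down to a handful of numbers per frame) that are left out here.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the discrete Fourier transform of one frame.

    Only the non-negative frequency bins (0 .. n/2) are returned.
    """
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        magnitudes.append(abs(total))
    return magnitudes


def spectrogram(signal, frame_len):
    """Cut the signal into consecutive, non-overlapping frames and
    transform each one. Rows are time, columns are frequency bins."""
    return [dft_magnitudes(signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


# Assumed example values: a pure 1000 Hz tone sampled at 8000 Hz,
# cut into 10-millisecond frames of 80 samples each.
SAMPLE_RATE = 8000
FRAME_LEN = 80
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE // 10)]  # 0.1 seconds of sound

spec = spectrogram(tone, FRAME_LEN)

# Find the strongest frequency bin in the first frame and convert it to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(spec), "frames; strongest frequency:", peak_hz, "Hz")
```

With these values each frequency bin spans 100 Hz, so the pure tone shows up as a single bright column at 1000 Hz in every frame, which is what "a pure tone vibrates at exactly one frequency" looks like on the tape.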

Modern speaker verification involves statistical techniques for identifying characteristic features of a person’s voice. The most widely used method attempts to create a precise model for the kinds of individual frames a person generates. For the imaginative, think of each speech frame in an Osama Bin Laden tape as a point in 13-dimensional space. Lay out a few thousand such frames on 13-dimensional coordinate axes and patterns begin to emerge: small clusters of frames in particular regions. The statistical model, estimated from this data, can give the probability that some new frame was generated by Mr. Bin Laden.
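The idea of scoring a new frame against a speaker's model can be sketched in a few lines. This is a deliberately simplified stand-in, not the method the article describes in full: it fits a single bell curve per dimension (a diagonal Gaussian) where a real system would typically fit a mixture of many to capture the separate clusters, and it uses made-up 2-dimensional points in place of 13-dimensional speech frames. All names and numbers below are hypothetical.

```python
import math


def fit_diagonal_gaussian(frames):
    """Estimate a per-dimension mean and variance from training frames."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    # A small floor keeps a variance from collapsing to zero.
    variances = [max(sum((f[d] - means[d]) ** 2 for f in frames) / n, 1e-6)
                 for d in range(dims)]
    return means, variances


def log_likelihood(frame, model):
    """Log probability density of one frame under the fitted model."""
    means, variances = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(frame, means, variances))


def average_score(frames, model):
    """Average log-likelihood over all frames of a recording."""
    return sum(log_likelihood(f, model) for f in frames) / len(frames)


# Hypothetical toy data: each tuple stands in for one speech frame.
speaker_a = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (1.1, 2.1)]
speaker_b = [(4.0, -1.0), (4.2, -0.8), (3.8, -1.2), (4.1, -0.9)]

model_a = fit_diagonal_gaussian(speaker_a)
model_b = fit_diagonal_gaussian(speaker_b)

# Frames from a new recording whose speaker claims to be A.
new_frames = [(1.05, 1.95), (0.9, 2.1)]
claimed_a = average_score(new_frames, model_a) > average_score(new_frames, model_b)
print("Accept claim to be speaker A:", claimed_a)
```

Comparing the score under the claimed speaker's model with the score under an alternative model is one common way to turn probabilities into a yes-or-no answer. Here the alternative is simply a second speaker, which is an assumption made to keep the example short.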

While much active research in this field concerns higher-level features of speech (intonation, syllable duration, idiosyncratic word usage), the basic method is very hard to beat. In effect, it identifies a speaker by the unique geometry of the vocal cords, larynx, epiglottis, nasal cavity, mouth, tongue, teeth and lips that contribute to the particular sounds each person can make. For this reason, it is difficult to fool simply by faking an accent, and under good conditions, with a few minutes of training data and a few minutes of test speech, it is better than 99% accurate on the speaker verification task (are you who you claim to be?).

Of course, as Solzhenitsyn would remind us, conditions are not always good. Phones vary considerably from one to the next, and background noise (car stereos, fire trucks, chirping birds, angry neighbors) can overwhelm the signal. One of the biggest open problems in signal processing is source separation: isolating the specific contributions of different acoustic sources, such as picking out the cello line in a string quartet. Thus, despite some 60 years of progress, no speaker verification technology can provide conclusive evidence for a conviction, only partial evidence for innocence. Solzhenitsyn, who died on August 3rd of this year, at age 89, would want us to remember this too.
