The ideal goes something like this: A consumer at any location, whether a friend’s house or a car, could speak to a voice assistant or smart speaker and be instantly recognized by the device, no matter who owns it. That consumer could then make secure transactions, add items to a daily calendar or perform any of the other tasks possible in a world where voice recognition has become ubiquitous.
We are not there yet. But Todd Mozer, CEO of Sensory, which sells voice recognition technology, described that future, and how to get there, in a recent conversation with PYMNTS. The occasion was the release of Sensory’s TrulySecure Speaker Verification (TSSV) 2.0, a biometric technology whose latest version includes new embedded voice authentication features and accuracy improvements.
Voice recognition is making its way into more homes, with smart speakers the world’s “fastest-growing consumer technology segment,” according to a new report from Canalys, a technology market research firm. In the first quarter of 2018, Amazon, Google and other companies shipped 9 million smart speakers, a year-over-year growth rate of 201 percent.
Despite the popularity, the technology faces significant hurdles on its path toward ubiquity. Voice user interfaces today aren’t terribly discriminating about who can access them, for instance. With most systems, anybody within range can say the wake word and gain access to the features offered — and in some cases can even make purchases without any user authentication.
In general, voice recognition can do a decent job with “one or five or 10 people” within the same household, Mozer said. “The next stage, a database of voices stored on the cloud that can make a match from millions of voices — I don’t think we are there yet.”
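A rough back-of-the-envelope sketch suggests why the jump from a household to millions of voices is so steep. It assumes, purely for illustration, that every comparison against a non-matching enrolled voice carries the same small, independent chance of a false match. The 0.1 percent rate and the function name are hypothetical and do not come from Sensory.

```python
def chance_of_any_false_match(per_comparison_rate: float, enrolled_voices: int) -> float:
    """Probability that at least one enrolled voice falsely matches a stranger.

    Assumes each comparison is independent and equally likely to err,
    which is a simplification made for this sketch.
    """
    return 1.0 - (1.0 - per_comparison_rate) ** enrolled_voices


# Hypothetical per-comparison false match rate of 0.1 percent.
HYPOTHETICAL_RATE = 0.001

for enrolled in (1, 5, 10, 1_000_000):
    risk = chance_of_any_false_match(HYPOTHETICAL_RATE, enrolled)
    print(f"{enrolled:>9} enrolled voices: {risk:.4f}")
```

Under these assumptions the risk stays below 1 percent for a household of 10 but approaches certainty for a gallery of a million, which is one way to read Mozer's caution about cloud-scale matching.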
To get there, the technology must develop to a point where the device can better recognize, and more efficiently enroll, the various people who are speaking to it.
That’s the motivation behind some of the improvements in the newest version of TSSV. For example, it has what Sensory called a “built-in wake word capability for OS-based systems. Wake word creation is automated and fast once appropriate data is collected.” It also enables users to be “tracked seamlessly by their voice without specific passphrase requirements.”
A person can speak off a script for as little as 10 seconds to “train” the voice assistant, though more time can lead to better results. “After 30 seconds, we start to get much better performance,” Mozer said.
How the newest version of TSSV will work in daily life depends on how Sensory’s customers use it. The technology is embedded in client platforms, and a client might, for example, add a facial recognition tool that works through its mobile app to further authenticate a banking customer. The company sells its technology to clients and does not profit from user data, Mozer said, and Sensory has no access to that information.
So, where might this latest version of TSSV, along with similar technology, take consumers over the coming years? What’s next for voice recognition? After all, various types of firms — not just banks, but car companies and others — are betting that voice and other biometric authentication tools will appeal to consumers who are becoming ever more digital and ever more aware of security in a world where commerce revolves around mobile and the web.
To address that question, it pays to look at the limitations. Take automobiles, for instance, and the push not only to make vehicles better connected to the web, but also to employ software, cameras and other technology to secure access and monitor drivers for excessive tiredness and other traits. Voice recognition certainly has a role in those changes, but it cannot handle all the work, Mozer said.
He cautioned that it would be unwise to rely only on voice recognition to let a person access a vehicle. Such a system would likely produce both false acceptances and false rejections: imagine a driver whose voice is altered by a cold, Mozer said, or a setting with too much background noise. “I would recommend there be some sort of backup” in such a use case, he said, “like a keypad you can use if you get locked out.”
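The trade-off Mozer describes can be sketched in a few lines. The example below assumes a verifier that returns a similarity score between 0 and 1 and accepts anything at or above a threshold. The scores, thresholds and function names are made up for illustration and do not reflect how TSSV works internally.

```python
def count_errors(genuine_scores, impostor_scores, threshold):
    """Return (false_rejects, false_accepts) at the given threshold."""
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    return false_rejects, false_accepts


def unlock_vehicle(voice_score, threshold, keypad_code_ok=False):
    """Try voice first, then fall back to a keypad code if one was entered."""
    if voice_score >= threshold:
        return "voice"
    if keypad_code_ok:
        return "keypad"
    return "locked"


# Hypothetical scores: the owner on two normal days and one day with a cold,
# plus two strangers, one of whom sounds somewhat similar.
genuine = [0.90, 0.80, 0.55]
impostor = [0.20, 0.60]

for threshold in (0.5, 0.7):
    rejects, accepts = count_errors(genuine, impostor, threshold)
    print(f"threshold {threshold}: {rejects} false rejects, {accepts} false accepts")

# With the stricter threshold, the owner with a cold needs the keypad backup.
print(unlock_vehicle(0.55, 0.7, keypad_code_ok=True))
```

In this toy data, loosening the threshold lets the similar-sounding stranger in, while tightening it locks out the owner with a cold, which is the situation a backup keypad is meant to cover.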
Indeed, given the ample “processing power” of modern cars and trucks, a system that combines facial and voice recognition would likely be more appealing and pragmatic, he said, than one that relies too heavily on voice.
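One simple way to combine two biometric signals is to average their scores before making a decision. The sketch below is a generic illustration of that idea, not a description of Sensory's method. The weights, threshold, scores and function names are all assumptions.

```python
def fused_score(voice_score, face_score, voice_weight=0.5):
    """Weighted average of a voice score and a face score, both in [0, 1]."""
    if not 0.0 <= voice_weight <= 1.0:
        raise ValueError("voice_weight must be between 0 and 1")
    return voice_weight * voice_score + (1.0 - voice_weight) * face_score


def is_accepted(voice_score, face_score, threshold=0.7, voice_weight=0.5):
    """Accept the person when the combined score reaches the threshold."""
    return fused_score(voice_score, face_score, voice_weight) >= threshold


# Hypothetical case: a weak voice score (a driver with a cold) paired with
# a strong face score still clears the combined threshold.
print(is_accepted(voice_score=0.55, face_score=0.95))

# A weak voice score paired with a weak face score does not.
print(is_accepted(voice_score=0.55, face_score=0.30))
```

The design choice here is that neither signal has to be perfect on its own: a strong face match can compensate for a degraded voice sample, and a poor result on both still leads to rejection.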
For now, though, the focus is on all those little technological steps that will point toward the ideal, the point where smart speakers and related products can tell voices apart almost as well as people can.