Google Duplex AI makes realistic phone calls with real humans

Posted on Wednesday, May 09 2018 @ 11:25 CEST by Thomas De Maesschalck
One of the more impressive demonstrations at Google's 2018 I/O developer conference was Duplex, a new Assistant feature that will make phone calls for you. This artificial intelligence system sounds nearly as real as an actual human being, and it can schedule appointments and make reservations on your behalf.

It will be interesting to see how this works in real life, but the examples provided on the Google blog are definitely very impressive. Google notes that Duplex cannot carry out general conversations; the AI agent was specifically trained for very narrow domains. Here's a small snippet that highlights some of the work that went into making the bot sound natural:
We use a combination of a concatenative text to speech (TTS) engine and a synthesis TTS engine (using Tacotron and WaveNet) to control intonation depending on the circumstance.

The system also sounds more natural thanks to the incorporation of speech disfluencies (e.g. “hmm”s and “uh”s). These are added when combining widely differing sound units in the concatenative TTS or adding synthetic waits, which allows the system to signal in a natural way that it is still processing. (This is what people often do when they are gathering their thoughts.) In user studies, we found that conversations using these disfluencies sound more familiar and natural.

Also, it’s important for latency to match people’s expectations. For example, after people say something simple, e.g., “hello?”, they expect an instant response, and are more sensitive to latency. When we detect that low latency is required, we use faster, low-confidence models (e.g. speech recognition or endpointing). In extreme cases, we don’t even wait for our RNN, and instead use faster approximations (usually coupled with more hesitant responses, as a person would do if they didn’t fully understand their counterpart). This allows us to have less than 100ms of response latency in these situations. Interestingly, in some situations, we found it was actually helpful to introduce more latency to make the conversation feel more natural — for example, when replying to a really complex sentence.
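
To make the latency trade-off in the quote a bit more concrete, here is a minimal Python sketch of how such a decision could look. This is not Google's implementation: the function `plan_response`, the `ResponsePlan` fields, the model names, the word-count thresholds and the 300 ms pause are all hypothetical and chosen purely for illustration. The idea is simply that short prompts get a fast, more hesitant answer, while very long sentences get a deliberate pause with a filler sound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponsePlan:
    model: str           # hypothetical model tier that produces the reply
    hesitant: bool       # phrase the reply more tentatively
    filler: str          # disfluency spoken while "thinking" ("" for none)
    extra_delay_ms: int  # deliberate pause added before answering


def plan_response(utterance: str) -> ResponsePlan:
    """Pick a plan from a crude complexity estimate (word count).

    All thresholds and values are illustrative assumptions.
    """
    n_words = len(utterance.split())
    if n_words <= 2:
        # Simple prompts such as "hello?": answer right away with a fast
        # approximation and hedge the reply, since confidence is lower.
        return ResponsePlan("fast_approximation", True, "", 0)
    if n_words <= 15:
        # Ordinary sentences: use the full model, no artificial pause.
        return ResponsePlan("full_model", False, "", 0)
    # Really complex sentences: add a short pause and a filler so the
    # reply does not come back unnaturally fast.
    return ResponsePlan("full_model", False, "hmm", 300)


if __name__ == "__main__":
    for text in ("hello?", "Do you have a table for four tonight?"):
        print(text, "->", plan_response(text))
```

A real system would of course estimate complexity from far more than word count; the sketch only shows the shape of the decision.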


About the Author

Thomas De Maesschalck

Thomas has been messing with computers since early childhood and firmly believes the Internet is the best thing since sliced bread. He enjoys playing with new tech, is fascinated by science, and is passionate about financial markets. When not behind a computer, he can be found with running shoes on or lifting heavy weights in the weight room.


