In a traditional speech recognizer, the waveform spoken by a user is split into small consecutive slices or “frames” of 10 milliseconds of audio. Each frame is analyzed for its frequency content, and the resulting feature vector is passed through an acoustic model such as a DNN that outputs a probability distribution over all the phonemes (sounds) in the model. A Hidden Markov Model (HMM) helps to impose some temporal structure on this sequence of probability distributions. This is then combined with other knowledge sources such as a Pronunciation Model that links sequences of sounds to valid words in the target language, and a Language Model that expresses how likely given word sequences are in that language. The recognizer then reconciles all this information to determine the sentence the user is speaking. If the user speaks the word “museum” for example – /m j u z i @ m/ in phonetic notation – it may be hard to tell where the /j/ sound ends and where the /u/ starts, but in truth the recognizer doesn’t care where exactly that transition happens: all it cares about is that these sounds were spoken.
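The frame-slicing step described above can be sketched in a few lines. This is an illustrative toy, not the production pipeline: the function name and the 16 kHz sample rate are assumptions, and real systems typically use overlapping windows rather than the plain consecutive slices shown here.

```python
# Hypothetical sketch: splitting a waveform into consecutive 10 ms frames.
# Assumes a 16 kHz sample rate; real recognizers usually use overlapping,
# windowed frames before extracting frequency features.

def frame_signal(samples, sample_rate=16000, frame_ms=10):
    """Split a sample sequence into consecutive, non-overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000  # 160 samples at 16 kHz
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

# One second of audio yields 100 frames of 160 samples each.
frames = frame_signal([0.0] * 16000)
```

Each of these frames would then be converted to a feature vector (e.g. via a short-time Fourier transform) before being scored by the acoustic model.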
Our improved acoustic models rely on Recurrent Neural Networks (RNNs). RNNs have feedback loops in their topology, allowing them to model temporal dependencies: when the user speaks /u/ in the previous example, their articulatory apparatus is coming from the /j/ sound and the /m/ sound before it. Try saying it out loud – “museum” – it flows very naturally in one breath, and RNNs can capture that. The type of RNN used here is a Long Short-Term Memory (LSTM) RNN which, through memory cells and a sophisticated gating mechanism, retains information better than other RNNs. Adopting such models already improved the quality of our recognizer significantly.

The next step was to train the models to recognize the phonemes in an utterance without requiring a prediction at every time instant. With Connectionist Temporal Classification (CTC), the models are trained to output a sequence of “spikes” that reveals the sequence of sounds in the waveform – they can place those spikes however they like, as long as the sequence is correct. The tricky part was making this work in real time. After many iterations, we managed to train streaming, unidirectional models that consume the incoming audio in larger chunks than conventional models, but perform actual computations less often. This drastically reduced the amount of computation and made the recognizer much faster. We also added artificial noise and reverberation to the training data, making the recognizer more robust to ambient noise.

We now had a faster and more accurate acoustic model and were excited to launch it on real voice traffic. However, we had to solve another problem: the model was delaying its phoneme predictions by about 300 milliseconds – it had simply learned that it could make better predictions by listening further ahead in the speech signal! This was smart, but it meant extra latency for our users, which was not acceptable.
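The key property of CTC – that only the sequence of spikes matters, not their exact timing – is easiest to see in the decoding rule. Below is a minimal sketch of greedy CTC collapse (an assumption on our part, not the production decoder): repeated labels are merged and the special “blank” symbol is dropped, so different frame-level alignments of “museum” map to the same phoneme sequence.

```python
# Minimal sketch of CTC-style greedy collapse (illustrative, not the
# production recognizer). "_" stands in for the CTC blank symbol.

BLANK = "_"

def ctc_collapse(frame_labels):
    """Merge repeated labels and drop blanks from a per-frame sequence."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out

# Two different alignments of "museum" (/m j u z i @ m/) collapse
# to the same phoneme sequence, so the exact spike timing is irrelevant.
a = ctc_collapse(["m", "m", "j", "_", "u", "u", "z", "i", "_", "@", "m"])
b = ctc_collapse(["_", "m", "j", "j", "u", "_", "z", "z", "i", "@", "m", "_"])
```

The blank symbol is what lets the model output the same phoneme twice in a row when the word genuinely repeats a sound, since only blanks break up a run of identical labels.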
We solved this problem by training the model to output phoneme predictions much closer to the ground-truth timing of the speech.
Strategies you should adopt to benefit from voice search
Use long-tail keywords
Short-tail keywords have already diminished in importance, and conversational search is bound to decrease their prominence even further. People don’t use voice search the same way they ordinarily type into a search engine: they ask more direct questions to get more relevant answers. This is where long-tail keywords in your content come in handy. Using these keywords increases the chances of your content ranking in voice search engine result pages.
Focus content on answering FAQs
Your content should prioritise answering the ‘why’, ‘who’, ‘what’ and ‘how’ questions. Your FAQs should be conversational in nature and answer these questions directly.
Consider the questions people are more likely to ask
Now that you understand your target audience, consider what type of questions they are most likely to ask when looking for your products and services. The focus should be on providing a direct, concise answer for better ranking. Don’t focus only on the keyword; consider how the whole question will be phrased, including the extra words people use when they want a concise answer. Build your content around these queries.
Develop content with an informal tone
Unlike text search, voice search is not just more direct – it is often colloquial. Consider how people generally speak, and develop content that matches their tone.
Try out voice search
The best way to understand how voice search will impact your website (and business) is to actually try it out. Play around with voice search to find out how your competition is ranking. You’ll also have a chance to learn more about long-tail keywords and how they boost your ranking.
There are many ways you can capitalise on voice search to increase traffic to your website or even a brick-and-mortar store. If you’re a restaurant owner in New York, provide directions to your location and you’ll be surprised to find more people coming through the door. Whatever you do, make sure your SEO strategy incorporates voice search, thereby boosting your ranking on search engine results pages (SERPs).