A New Way with Words: Understanding Google’s BERT Update

13/02/2020, February 27th, 2020

Google’s recent BERT update is the most significant leap for search in 5 years


In October 2019, Google started rolling out what has been touted as the most significant leap in search since the introduction of RankBrain five years ago. Known as BERT (Bidirectional Encoder Representations from Transformers), the new NLP framework is set to significantly enhance the performance of the search engine powerhouse. The update came one year after Google AI published their research paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018) and subsequently open-sourced the framework. But what exactly is BERT, and why is it such a big deal? Here, we’ll explore the ins and outs of the new update.


What is NLP?


To truly understand BERT and how it impacts search, we need to first understand the wider discipline of Natural Language Processing (NLP). Melding elements of computer science, artificial intelligence, and linguistics, NLP is the process of teaching a machine how human language works; training computers to understand and recognise the nuances of human language.


Traditional ‘deep’ NLP as we know it emerged in the early 2010s, and today we see it applied in many aspects of everyday life: online chatbots, predictive text messages, trending topics on Twitter, and voice assistants like Siri, Alexa, and Google Assistant.


NLP goes beyond training machines to understand spelling and grammar; it also involves teaching machines to understand the meaning of a word in different contexts. For instance, the meaning of the word ‘running’ differs in the phrases ‘running an event’, ‘running away’, and ‘running for president’; NLP is used to help computers recognise and distinguish between these meanings based on the context of the overall input. It’s also used to help computers recognise the tone or sentiment behind a piece of text or a word. A great example of this is how tools like Grammarly can identify whether the tone of a passage is optimistic, aggressive, formal, neutral, etc.


Many traditional NLP models use a recurrent neural network (RNN). Recurrent neural networks allow a machine to retain the knowledge it has gained from earlier in a body of text and use it to predict what may come next. As the model reads a piece of text, each word enters the network as a numerical representation, forming what you can imagine as a self-organised map of every word in the passage. Words that are semantically related are grouped closer together; for instance, the words ‘cat’ and ‘dog’ will be placed close to one another, but the words ‘cat’ and ‘feline’ will be even closer. This helps the machine to recognise patterns and understand context as it scans the piece of text.
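To make the idea of words sitting ‘closer together’ a little more concrete, here is a minimal sketch in Python. The three-dimensional vectors are hypothetical and hand-picked purely for illustration; a real model learns vectors with hundreds of dimensions from data. Closeness is measured here with cosine similarity, one common choice among several.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration only.
# A trained model would learn these values from a large corpus.
EMBEDDINGS = {
    "cat":    [0.90, 0.80, 0.10],
    "feline": [0.88, 0.82, 0.12],
    "dog":    [0.80, 0.30, 0.20],
    "car":    [0.10, 0.20, 0.95],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity(word_a, word_b):
    """Look up two words in the toy table and compare their vectors."""
    return cosine_similarity(EMBEDDINGS[word_a], EMBEDDINGS[word_b])


for other in ("feline", "dog", "car"):
    print(f"cat vs {other}: {similarity('cat', other):.3f}")
```

With these made-up vectors, ‘cat’ scores highest against ‘feline’, lower against ‘dog’, and lowest against ‘car’, which mirrors the grouping described above.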


The limitation of this, however, is that RNN systems are unidirectional, meaning they can only understand the meaning of a word based on the words that precede it. If a machine can only understand a word based on the word that comes before it, then the true context cannot be determined until the end of the sentence, and this can cause errors. Additionally, traditional NLP systems require a lot of manual tagging, and for every new NLP task you must re-train the system to understand syntax and semantics. We’ll explore how BERT solves these problems below.


So what is BERT?


BERT is an NLP framework, but it is unlike any other that has come before it. It is a contextual language model that greatly improves the way computers can understand language and its nuances. BERT is an acronym for Bidirectional Encoder Representations from Transformers, and while that may sound quite complex, what BERT achieves is quite simple: it uses a number of innovative mechanisms and processes to understand human language better than any earlier NLP framework.


Alongside the publication of their research paper in 2018, Google also made the framework open source, meaning anybody can use and expand upon its architecture for any number of language-based tasks and problems.


Prior to being rolled out in search, BERT had already achieved state-of-the-art results for 11 different natural language processing tasks. If, for example, you wanted to create a chatbot for your business, you could take BERT’s pre-trained architecture and fine-tune it for this specific task and your specific products and customers. You could input a dataset containing thousands of product reviews, each tagged ‘positive’ or ‘negative’, and further train BERT in sentiment analysis to understand how to distinguish between future positive and negative reviews. Another example: Sadrach Pierre, Ph.D. recently experimented with BERT’s ability to classify articles as Fake News. There are a huge number of tasks BERT can be used for, and having been pre-trained with such a large corpus, all that’s required from programmers is a little bit of fine-tuning, which is a huge plus.


What makes BERT different?


While BERT utilises a number of mechanisms and models that are prevalent in NLP, it is how they’re used that sets BERT apart from other NLP frameworks:




BERT was the first deeply bidirectional NLP framework to be pre-trained using only an unlabelled, plain text corpus. It was pre-trained on the entire English-language Wikipedia (around 2.5 billion words) and BooksCorpus, a collection of books (around 800 million words), giving it more than 3 billion words to learn from. Before BERT, NLP models typically relied far more heavily on manually labelled, task-specific data. By pre-training with such a large corpus, BERT learns the nuances of natural language on its own and requires only fine-tuning for future tasks. This saves a great deal of time for programmers looking to apply NLP to a specific task or project.


Deeply Bidirectional:


Like many NLP frameworks, BERT is based on a neural network system, designed to recognise patterns in words and how they’re used. However, unlike traditional neural network models that process words sequentially, either one-by-one from left-to-right or right-to-left, BERT processes the relationships between each word in a query bidirectionally – regardless of their positioning.


In traditional, unidirectional systems, a model can only understand the context of a word based on either the words that precede it or the words that follow it, whereas BERT is able to learn the context of a word based on its relations with every word in the sentence. Unidirectional systems cannot obtain the context of a word appearing at the beginning of a sentence by using a word appearing at the end of the sentence, and this can cause issues. BERT’s deeply bidirectional model offers the solution, allowing a computer to see every word in a sentence simultaneously, like humans do.
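The difference in available context can be sketched in a few lines of Python. This is only an illustration of which words each kind of model is allowed to consult, not how any model is actually implemented; the function name `visible_context` and the mode names are assumptions made for this example.

```python
def visible_context(tokens, index, mode):
    """Return the words a model may consult when interpreting tokens[index].

    mode is one of "left_to_right", "right_to_left" or "bidirectional"
    (hypothetical labels used for this illustration only).
    """
    if mode == "left_to_right":
        return tokens[:index]
    if mode == "right_to_left":
        return tokens[index + 1:]
    if mode == "bidirectional":
        return tokens[:index] + tokens[index + 1:]
    raise ValueError(f"unknown mode: {mode}")


tokens = "she is running for president".split()
target = tokens.index("running")

for mode in ("left_to_right", "right_to_left", "bidirectional"):
    print(mode, visible_context(tokens, target, mode))
```

Reading left to right, the model has only ‘she is’ to go on when it reaches ‘running’; the words that settle the meaning, ‘for president’, come later. The bidirectional view has both sides available at once.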




Bidirectional processing is achieved through Transformers – models built for attention-based learning. Attention is a mechanism designed to identify where the focus of a sentence or piece of text lies. It involves a process of finding the connection between any two words in a sentence, and subsequently assigning weighting to these connections to determine which are the most important towards understanding the wider context of the sentence.


Through this system, each word is converted into a mathematical representation known as a ‘word embedding’. Within this embedding will be a set of attributes representing the word (this could relate to the sentiment of the word, whether it’s plural or singular, the part-of-speech of the word, etc.). These embeddings are mapped together in a broader space to understand how they relate to one another, where words that are similar to each other appear closer together in the embedding space. Attention is the mechanism that compares these embeddings with one another to determine the weight of the connection between words.


As an example, let’s consider the following sentence:


“Katie watched the parrot with binoculars”


A Transformer can take “Katie watched… with binoculars” and “parrot with binoculars” and determine that it is more likely that Katie used binoculars to watch the parrot than it is that Katie watched a parrot that was using binoculars. The connection between “watched” and “binoculars” is given more weight than “parrot” and “binoculars” based on what the model already understands about language (from pre-training) and based on the context provided by the word “with”. Other NLP models also use attention mechanisms, but because they work unidirectionally, they can’t achieve the same accuracy.
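For readers who want to see the arithmetic, below is a minimal sketch of scaled dot-product attention, the scoring step used inside Transformers, written in plain Python. The two-dimensional vectors for each word are hypothetical and hand-picked so that ‘watched’ comes out on top; a real model learns its vectors and also applies learned query, key and value projections across many attention heads, all of which are omitted here.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the attention weights and the weighted sum of the values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Hypothetical toy vectors, chosen by hand for illustration only.
words = ["Katie", "watched", "parrot"]
vectors = [[0.2, 0.9], [1.0, 0.1], [0.4, 0.6]]
binoculars = [1.0, 0.0]  # the word whose context we want to resolve

weights, _ = attention(binoculars, vectors, vectors)
for word, weight in zip(words, weights):
    print(f"binoculars -> {word}: {weight:.2f}")
```

The weights always sum to 1, so they can be read as how much of its ‘attention’ the word ‘binoculars’ spends on each of the other words.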


For another example of how attention is used to understand context, let’s consider the following sentence:


“The cat stayed inside the house because it didn’t like the cold”


The Transformer will use attention mechanisms to determine that the “it” in the sentence refers to the cat, and not the house.


So while traditional recurrent neural networks understand the context of a word based on the words that either precede or follow it, bidirectional Transformers use attention mechanisms to figure out the context of a word using all other words in the text, regardless of their positioning, in order to understand the role of the word within a sentence or a larger piece of text.


Training BERT is done through two key processes:


Masked Language Modelling (MLM)


When a piece of text is fed into BERT during pre-training, 15% of its words (strictly speaking, its tokens) are randomly ‘masked’. BERT then has to determine what the missing words are based on the remaining words in the passage. This prevents BERT from simply learning one definition of a word, and instead pushes it to form an understanding of the word’s role in the specific sentence where it is masked. This greatly helps BERT understand how words can have different meanings depending on context.
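The masking step can be sketched in plain Python. This is a simplified illustration rather than Google’s implementation: it splits on whitespace instead of using BERT’s WordPiece tokeniser, and it masks a fixed count of positions. It does follow the rule described in the BERT paper, where a chosen token is replaced with [MASK] 80% of the time, swapped for a random word 10% of the time, and left unchanged 10% of the time. The function name `mask_tokens` and the small vocabulary are assumptions made for this example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Pick roughly mask_prob of the positions and corrupt them.

    Returns the corrupted token list and a dict mapping each chosen
    position to the original word the model would have to predict.
    """
    rng = random.Random(seed)
    n_to_mask = min(len(tokens), max(1, round(len(tokens) * mask_prob)))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))

    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = MASK               # 80%: hide the word
        elif roll < 0.9:
            masked[pos] = rng.choice(vocab)  # 10%: swap in a random word
        # remaining 10%: leave the word as it is
    return masked, labels


sentence = "the cat stayed inside the house because it did not like the cold".split()
toy_vocab = ["dog", "garden", "warm", "ran"]  # hypothetical vocabulary

masked, labels = mask_tokens(sentence, toy_vocab, seed=42)
print(" ".join(masked))
print(labels)
```

During pre-training, the model only sees the corrupted sentence and is scored on how well it recovers the words stored in `labels`.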


Next Sentence Prediction (NSP)


Next Sentence Prediction is another key function used when training a BERT model. Through this technique, the BERT model is given pairs of sentences from a passage and must determine whether or not the second sentence in each pair is in fact the subsequent sentence to the first in the original body of text.


  • (A) “Daniel Galvin isn’t just a world-renowned, world-class colourist with a celebrity client list.”


  • (B) “Your GCSE grades are not as important as your passion for hair.”


  • (C) “Daniel Galvin is also a brand, with two generations of the Galvin family running this famous family business.”


In the example above, BERT would be trained to recognise that Sentence (C) is likely the next sentence sequentially to Sentence (A), and that Sentence (B) is the random sentence. This technique is very useful when training BERT for tasks like question answering. Next Sentence Prediction, when combined with Masked Language Modelling, trains BERT to understand not only the context and relationship between words, but also between sentences.
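As a rough sketch of how such training pairs could be assembled, the Python below builds labelled sentence pairs from a passage. In the BERT paper, half of the pairs use the true next sentence (labelled IsNext) and half use a random sentence from elsewhere in the corpus (labelled NotNext). The function name `make_nsp_pairs` and the way the ‘other’ sentences are supplied are assumptions made for this illustration.

```python
import random


def make_nsp_pairs(sentences, other_sentences, seed=None):
    """Build (first, second, label) training examples from a passage.

    For each consecutive pair in `sentences`, keep the true next sentence
    about half the time; otherwise substitute a random sentence drawn
    from `other_sentences` (text taken from a different document).
    """
    rng = random.Random(seed)
    pairs = []
    for first, true_next in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:
            pairs.append((first, true_next, "IsNext"))
        else:
            pairs.append((first, rng.choice(other_sentences), "NotNext"))
    return pairs


passage = [
    "Daniel Galvin isn't just a world-renowned, world-class colourist with a celebrity client list.",
    "Daniel Galvin is also a brand, with two generations of the Galvin family running this famous family business.",
]
unrelated = ["Your GCSE grades are not as important as your passion for hair."]

for first, second, label in make_nsp_pairs(passage, unrelated, seed=7):
    print(label, "|", second)
```

The model is then trained to predict the label from the two sentences alone.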


The BERT Update


So we’ve explored what BERT is from a theoretical standpoint, but what is Google’s BERT algorithmic update for search?


The simple answer is, Google is now using BERT to improve search results. While BERT can be applied to a number of NLP tasks, this update specifically pertains to search queries, and to helping Google fully understand the true intent of a query.


Let’s go back to our example using the word “running”, or, in the following example, “run”:


“How to run a charity in New York”

“How to run for charity in New York”


In these two queries, the words ‘a’ and ‘for’ change the definition of the word ‘run’, and the word ‘charity’ is crucial in understanding the overall context.


Traditional NLP frameworks would likely recognise and group together “charity in New York” without considering the context provided earlier in the query, and would perhaps provide results about established charities in New York, or would provide a mixed set of results regardless of whether ‘a’ or ‘for’ is used. BERT can break these queries down to fully understand them:


“How to run”

“run for charity”

“in New York”

“run” + “in New York”

“charity in New York”


BERT is able to recognise all of the ways that each word may interact and, using bidirectional transformers, can determine the true intent of the query, and subsequently provide the most relevant results.
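To give a feel for how many word-to-word relationships are in play, the short Python sketch below lists every ordered pair of words in the query. It only enumerates the pairs that an attention mechanism would score; it does not score them, and the function name `word_pairs` is an assumption made for this example.

```python
from itertools import permutations


def word_pairs(tokens):
    """Every ordered (word, other word) pair in the query."""
    return list(permutations(tokens, 2))


query = "how to run for charity in new york".split()
pairs = word_pairs(query)

print(len(pairs), "ordered pairs")
print([pair for pair in pairs if pair[0] == "run"])
```

Because the pairs are ordered, the relationship from ‘run’ to ‘charity’ and the one from ‘charity’ to ‘run’ are both considered, whichever word comes first in the query.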


Currently, 10% of Google searches in the U.S. use BERT to serve the most relevant results – typically on “longer, more conversational queries”. BERT is also currently only trained for the English language. While there is no defined timeline, Google are committed to expanding the update to both a larger percentage of queries and to more languages in the future.


What does the BERT update mean for users?


For Google users, BERT means improved search query results, and therefore an enhanced user experience. As BERT continues to develop and as Google continues to roll out the update, the search engine’s understanding of human language will continue to improve considerably. Search results will become more relevant and responsive, and better served for your specific needs. It will become easier and easier to find the information you need.


BERT is also used for Google’s featured snippets, again providing more relevant, accurate results.


What impact does BERT have on SEO?


You cannot optimise for BERT, so the only way for SEOs to really leverage this update is to ensure that their content is always focused on the audience and their needs. BERT is not a ranking tool, and it doesn’t assign values to pages; it is simply used so Google can better understand the intent of the user.


As search engines push towards a more human way of understanding queries, the content people are searching for should become more human too. The more focused your content is on the specific intent of the user, the more likely it is that BERT will recognise this connection. Understand your audience, what they search for and how they search for it; less keyword stuffing and more natural, human content is key.


The introduction of BERT by no means indicates an end for RankBrain, Google’s first major AI algorithm, introduced in 2015. Both methods will still be used to determine the best results. It’s not always one or the other: in many cases, one query may require multiple methods, like BERT and RankBrain, to determine the most accurate and relevant output. The BERT update is simply an addition, albeit a hugely significant one, to Google’s pre-existing ranking system.


How will BERT impact translation?


While current BERT models focus only on the English language, as it develops it will become hugely useful for machine translation. If BERT can learn the nuances of English, then it can do so for any language, and in time we will very likely see BERT or new frameworks built upon BERT’s architecture greatly improve the accuracy and performance of machine translation.


A system like BERT is capable of learning from the English language and applying these learnings to other languages. Already, BERT is being used to improve featured snippets in 24 countries, and this has seen improvements in languages such as Korean, Portuguese and Hindi.


Because BERT has been open-sourced, it’s an exciting opportunity for Welocalize to further solidify its position as a frontrunner in translation technology, by building upon this new advancement in machine learning and further driving forward the future of translation and localisation in the digital landscape.


About Michael