Google’s BERT Update: A New Way with Words

Google’s recent BERT update is the most significant leap for search in five years

By Michael de Alwis

Overview:

  • Google’s BERT update is the most significant algorithmic update since RankBrain
  • BERT is a model designed to improve accuracy and performance in NLP tasks
  • The BERT update currently impacts 10% of Google search queries in the US
  • BERT could have major implications in both search and translation
  • BERT utilises many techniques already prevalent in NLP, but it is how they’re used that sets it apart

Might the days of searching “Host party at Hilton Paris” on Google and receiving tabloid results about Paris Hilton’s most iconic party looks be gone? That may be a niche example, but it’s one that illustrates the potential power of BERT, nonetheless.


In October 2019, Google started rolling out what has been touted as the most significant leap in search since the introduction of RankBrain five years ago. Known as BERT – Bidirectional Encoder Representations from Transformers – the new NLP framework is set to significantly enhance the performance of the search engine powerhouse. The update came one year after Google AI published their research paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018) and subsequently open-sourced the framework.


But what is BERT exactly, and why is it such a big deal? Here, we’ll explore the ins and outs of the new update at a beginner level in Part 1, before delving into a deeper technical understanding of how the framework operates in Part 2.

Part 1: The Basics of Google’s BERT

What is NLP?

To truly understand BERT and how it impacts search, we first need to understand the wider discipline of Natural Language Processing (NLP). Melding elements of computer science, artificial intelligence, and linguistics, NLP is the field concerned with teaching machines how human language works – training computers to understand and recognise the nuances of human language.

‘Deep’ NLP as we know it emerged in the early 2010s, and today we see it applied in many aspects of everyday life – from online chatbots, to predictive text messages, to trending topics on Twitter, to voice assistants like Apple’s Siri, Amazon’s Alexa, and Google Assistant.

NLP goes beyond training machines to understand spelling and grammar; it also involves teaching machines to understand the different meanings a word can take in different contexts. For instance, the definition of the word ‘running’ differs in the phrases ‘running an event’, ‘running away’, and ‘running for president’; NLP is used to help computers recognise and distinguish between these definitions based on the context of the overall input. It’s also used to help computers recognise the tone or sentiment behind a piece of text or a word. A great example of this is how tools like Grammarly can identify whether the tone of a passage is optimistic, aggressive, formal, neutral, etc.

Many NLP models utilise a recurrent neural network (RNN) system to solve linguistic tasks. Recurrent neural networks allow a machine to retain the knowledge it has gained from earlier in a body of text and use it to predict what may come next. This helps the machine to recognise patterns and understand context as it scans the piece of text.
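To make this concrete, here is a minimal sketch of a recurrent step in plain Python. The one-number word ‘embeddings’ and weights are invented purely for illustration – real systems use learned, high-dimensional vectors – but the principle is the same: the hidden state carries forward a memory of everything read so far.

```python
import math

def rnn_step(hidden, word_value, w_h=0.5, w_x=0.9):
    # The new hidden state blends the previous state (the memory of
    # earlier words) with the current word's value, squashed by tanh.
    return math.tanh(w_h * hidden + w_x * word_value)

# Invented one-number "embeddings", purely for illustration.
sentence = [("running", 0.8), ("for", -0.2), ("president", 0.6)]

hidden = 0.0
for word, value in sentence:
    hidden = rnn_step(hidden, value)
    print(f"after '{word}': hidden = {hidden:+.3f}")
```

Notice that the final state depends on every word read so far – but a word processed early can never be reinterpreted in light of words that arrive later, which is exactly the unidirectional limitation described below.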

The limitation of this, however, is that most RNN systems are unidirectional, meaning they can only understand the meaning of a word based on the words that precede it – looking at a sequence of text left-to-right. If a machine can only understand a word based on the words that come before it, then the true context cannot be determined until the end of the sentence, and this can cause errors. Although there are elaborations on basic RNNs that allow them to read right-to-left as well (such as bidirectional LSTMs, which combine separate left-to-right and right-to-left passes), these have their limitations.

To understand this we can look back to the linguistic theory of lexical semantics, which posits that one cannot truly determine the meaning of a word on its own: we must use our understanding of the words that surround it, and our established understanding of language in general, in order to fully decipher its meaning. Traditional RNN systems help machines achieve this level of understanding, but fall short due to their unidirectional nature.

Additionally, traditional NLP systems require a lot of manual tagging, and for every new NLP task you undertake you must train the system to understand syntax and semantics from scratch. We’ll explore how BERT solves these problems in Part 2.

What is BERT then?

BERT is an NLP model, but it is unlike any that has come before it. It is a contextual language model that greatly improves the way computers understand language and its nuances. As mentioned above, BERT is an acronym for Bidirectional Encoder Representations from Transformers – and while that may sound complex, what BERT achieves is simple to state: it uses a number of innovative mechanisms and processes to understand human language better than any previous NLP framework.

BERT is taught a general understanding of how language works using a massive corpus of text data, and then this general knowledge can be fine-tuned for any specific language-related problem you might have. Alongside the publication of their research paper in 2018, Google also made the framework open source, meaning anybody can use and expand upon its architecture for any number of language-based tasks and problems.

Prior to being rolled out in search, BERT had already achieved state-of-the-art results on 11 different natural language processing tasks. If, for example, you wanted to create a chatbot for your business, you could take BERT’s pre-trained architecture and fine-tune it for this specific task and your specific products and customers. You could input a dataset containing thousands of product reviews, each tagged ‘positive’ or ‘negative’, and further train BERT in sentiment analysis to distinguish between future positive and negative reviews. In another example, Sadrach Pierre, Ph.D., recently experimented with BERT’s ability to classify articles as fake news.

There are a huge number of tasks the BERT algorithm can be used for, and having been pre-trained with such a large corpus, all that’s required from programmers is a little fine-tuning – which is a huge plus.
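As a toy illustration of the fine-tuning idea – frozen pre-trained representations with a small task-specific layer trained on top – here is a sketch in plain Python. The ‘pre-trained’ vectors and the reviews are invented for the example; in practice the representations would come from BERT itself.

```python
import math

# Invented stand-ins for pre-trained word representations; in practice
# these would come from BERT rather than a hand-written table.
pretrained = {"great": [1.0, 0.2], "love": [0.9, 0.1],
              "awful": [-1.0, 0.3], "broken": [-0.8, 0.4]}

def embed(review):
    # Average the frozen pre-trained vectors of the words we recognise.
    vecs = [pretrained[w] for w in review.split() if w in pretrained]
    if not vecs:
        return [0.0, 0.0]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(2)]

# Labelled fine-tuning data: 1 = positive review, 0 = negative review.
data = [("great product love it", 1), ("love it great", 1),
        ("awful quality broken", 0), ("broken and awful", 0)]

# Fine-tune a single logistic layer on top of the frozen representations.
w, b = [0.0, 0.0], 0.0
for _ in range(200):
    for review, label in data:
        x = embed(review)
        pred = 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
        err = pred - label
        w = [wi - 0.5 * err * xi for wi, xi in zip(w, x)]
        b -= 0.5 * err

def classify(review):
    x = embed(review)
    return "positive" if w[0] * x[0] + w[1] * x[1] + b > 0 else "negative"
```

Because the heavy lifting – learning the representations – is already done, only the small classification layer needs training, which is the sense in which fine-tuning is cheap relative to training from scratch.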

Is Google’s BERT Update different from BERT itself?

So we’ve explored what BERT is from a theoretical standpoint, but what is Google’s BERT algorithmic update for search?

The simple answer is that Google is now using BERT to improve search results. While BERT can be applied to a number of NLP tasks, this update specifically pertains to search queries, and to helping Google fully understand the true intent of a query.


Let’s go back to our example using the word “running”, or, in the following example, “run”:


“How to run a charity in New York”

“How to run for charity in New York”


In these two queries, the words ‘a’ and ‘for’ change the definition of the word ‘run’, and the word ‘charity’ is crucial in understanding the overall context.


Before BERT, Google’s search algorithm would likely recognise and group together “charity in New York” without considering the context provided earlier in the query – and would perhaps provide results about established charities in New York, or would provide a mixed set of results regardless of whether ‘a’ or ‘for’ are used. BERT can build a representation of the meaning for both the entire query and for each word simultaneously. The model is able to recognise all of the ways that each word may interact and, using bidirectional Transformers, can determine the true intent of the query, and subsequently provide the most relevant results.

Currently, 10% of Google searches in the U.S. use BERT to serve the most relevant results – typically on “longer, more conversational queries”. BERT is also currently only trained for the English language. While there is no defined timeline, Google are committed to expanding the update to both a larger percentage of queries and to more languages in the future.

What does the BERT update mean for users?

For users on Google, BERT means improved search query results, and therefore an enhanced user experience. As the BERT algorithm continues to develop and as Google continues to roll out the update, the search engine’s understanding of human language will improve considerably. Search results will become more relevant and responsive, and better suited to your specific needs. It will become easier and easier to find the information you need.

BERT is also used for Google’s featured snippets, again providing more relevant, accurate results. It is likely you’ll begin to notice these improvements in featured snippets like Answer Boxes and ‘People Also Ask’ lists.

What impact does BERT have on SEO?

You cannot optimise for BERT, so the only way for SEOs to really leverage this update is to ensure that their content is always focused on the audience and their needs. BERT is not a ranking tool, and it doesn’t assign values to pages; it is simply used so Google can better understand the intent of the user.

As search engines push towards a more human way of understanding queries, so too should the content people are searching for. The more focused your content is on the specific intent of the user, the more likely it is that BERT will recognise this connection. Understand your audience, what they search for and how they search for it; less keyword stuffing and more natural, human content is key.

The introduction of BERT by no means signals an end for RankBrain, Google’s first major AI algorithm, introduced in 2015. It’s not always one method or the other – a single query may call on several, BERT and RankBrain among them, to determine the most accurate and relevant output. The BERT update is simply an addition – albeit a hugely significant one – to Google’s pre-existing ranking system.

How will BERT impact translation?

While current BERT models concentrate only on the English language, as it develops it will become hugely useful for machine translation. If BERT can learn the nuances of English, then it can do so for any language, and in time we will very likely see BERT or new natural language processing models built upon BERT’s architecture greatly improve the accuracy and performance of machine translation.

A system like BERT is capable of learning from the English language and applying these learnings to other languages. Already, Google’s BERT algorithm is being used to improve featured snippets in 24 countries, and this has seen improvements in languages such as Korean, Portuguese and Hindi.

Part 2: The Nitty-Gritty

Now that we’ve explored BERT and its impact, we can begin to deconstruct the framework and form an understanding of how exactly BERT is able to achieve what it does. In order to do so, we need to discuss what it is that sets BERT apart from other NLP frameworks.

What makes BERT different?

While BERT utilises a number of mechanisms and models that are prevalent in NLP, it is how they’re used that sets BERT apart from its predecessors:


Pre-Training:

BERT is an NLP framework that has been pre-trained on an unlabelled, plain-text corpus. The Google BERT algorithm uses both the entire English-language Wikipedia and a selection of ebooks as its corpora, providing a dataset of over 2.5 billion words to learn from. What separates BERT from other models that have utilised unsupervised learning (i.e. not requiring manual tagging), like 2013’s word2vec, is the sheer size of the dataset used in pre-training, which allows BERT to build stronger representations of words and text during the training process itself. By pre-training on such a large corpus, BERT is able to learn the nuances of natural language autonomously and with greater accuracy, and will therefore only require fine-tuning for future tasks. This saves a great deal of time for programmers looking to apply NLP to a specific task or project.

Deeply Bidirectional:

Like many NLP frameworks, BERT is based on a neural network, designed to recognise patterns in words and how they’re used. However, unlike traditional neural network models that process words sequentially, either one-by-one from left-to-right or right-to-left, BERT processes the relationships between all words simultaneously, regardless of their positioning.

In traditional, unidirectional systems, a model can only understand the context of a word based on either the word that precedes it or the word that succeeds it, whereas BERT is able to learn the context of a word based on its relations with every word in the sentence. Unidirectional systems cannot obtain the context of a word appearing at the beginning of a sentence by reading a word appearing at the end of the sentence – and this can cause issues. BERT’s deeply bidirectional model offers the solution, allowing a machine to see every word in a sentence simultaneously – like humans do.

Transformers:

Bidirectional processing is achieved through Transformers – models built for attention- and self-attention-based learning. Attention and self-attention are mechanisms designed to identify where the focus of a sentence or piece of text lies: they find the connection between any two words in a sentence, and assign weights to these connections to determine which matter most for understanding the wider context of the sentence.

First, each word is converted into a mathematical representation known as a ‘word embedding’ or ‘embedding vector’. Within this embedding will be a set of attributes representing the word (this could relate to the sentiment of the word, whether it’s plural or singular, the part-of-speech of the word, etc.). These embeddings are the raw material for attention, but the most important mechanism in Transformers is actually self-attention.
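A toy illustration of the embedding-space idea: the vectors and the attribute dimensions below are invented for the example, but they show how words with similar meanings end up pointing in similar directions, which can be measured with cosine similarity.

```python
import math

# Invented toy embeddings; imagine each dimension as a learned attribute
# (say, "animal-ness", "royalty", "plurality").
embeddings = {
    "cat":   [0.9, 0.1, 0.0],
    "dog":   [0.8, 0.2, 0.1],
    "queen": [0.1, 0.9, 0.0],
}

def cosine(u, v):
    # Words with similar meanings point in similar directions, giving a
    # cosine close to 1; unrelated words give a value near 0.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(embeddings["cat"], embeddings["dog"]))    # high: close together
print(cosine(embeddings["cat"], embeddings["queen"]))  # low: far apart
```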

The self-attention mechanism allows BERT to incorporate every contextual word into the representation of a word. So, while attention can form embeddings for each word, self-attention updates these embeddings using the embeddings of other words in the text. The embeddings are mapped together in a broader space to understand how they relate to one another – words that are similar to each other land closer together in the embedding space. Self-attention is the mechanism that compares these embeddings against one another to determine the weight of the connection between words.

As an example, let’s consider the following sentence:


“Katie watched the parrot with binoculars”


A Transformer can take “Katie watched… with binoculars” and “parrot with binoculars” and determine that it is more likely that Katie used binoculars to watch the parrot than it is that Katie watched a parrot who was using binoculars. The connection between “watched” and “binoculars” is given more weight than “parrot” and “binoculars” based on what the model already understands about language (from pre-training) and based on the context provided by the word “with”.

For another example of how attention is used to understand context, let’s consider the following sentence:


“The cat stayed inside the house because it didn’t like the cold”


The Transformer will use attention mechanisms to determine that the “it” in the sentence refers to the cat, and not the house.

So while traditional, unidirectional recurrent neural networks understand the context of a word based on the words that precede it, bidirectional Transformers use self-attention mechanisms to figure out the context of a word using all other words in the text, regardless of their positioning, in order to understand the role of the word within a sentence or a larger piece of text.
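The calculation can be sketched in plain Python. This is a stripped-down, single-head version with the learned query/key/value projections omitted, and the toy embeddings are invented for illustration – but it shows the core idea: every word’s new representation is a weighted blend of every word in the sentence.

```python
import math

def softmax(scores):
    # Turn raw scores into attention weights: positive, summing to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    # For each word, score it against every word in the sentence (dot
    # product, scaled by sqrt of the embedding size), then build its new
    # contextual embedding as the weighted sum of all words' embeddings.
    d = len(embeddings[0])
    contextual = []
    for query in embeddings:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in embeddings]
        weights = softmax(scores)
        contextual.append([sum(wt * vec[i] for wt, vec in zip(weights, embeddings))
                           for i in range(d)])
    return contextual

# Invented 2-D embeddings for "Katie watched the parrot with binoculars".
toy = [[0.2, 0.9], [0.8, 0.1], [0.1, 0.1], [0.3, 0.7], [0.1, 0.2], [0.9, 0.2]]
contextual = self_attention(toy)
```

Each output vector now reflects the whole sentence at once – no left-to-right scan is involved, which is what makes the mechanism deeply bidirectional.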

How is BERT trained?

BERT is trained through two key processes:

Masked Language Modelling (MLM)

During training, when a piece of text is input into the BERT model, 15% of its words will be randomly ‘masked’. BERT must then determine what the missing words are based on the context – the remaining words in the passage. In predicting the masked words, BERT is able to ascertain their representations. This gives BERT the flexibility to learn multiple senses of a word, when doing so is useful for the MLM task.
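A simplified sketch of the masking step (real BERT tokenises into sub-words and, for a fraction of the selected tokens, substitutes a random word or leaves the token unchanged rather than always inserting [MASK]):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    # Replace roughly 15% of tokens with [MASK], remembering the originals
    # so the model can be scored on its predictions for those positions.
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = token  # the word BERT must predict from context
        else:
            masked.append(token)
    return masked, targets

tokens = "the cat stayed inside the house because it did not like the cold".split()
masked, targets = mask_tokens(tokens)
print(" ".join(masked))
```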

Next Sentence Prediction (NSP)

Next Sentence Prediction is another key objective used when training a BERT model. Through this technique, the BERT model is given pairs of sentences from a passage and it must determine whether or not the second sentence in each pair is in fact the subsequent sentence to the first in the original body of text.


  • (A) “Daniel Galvin isn’t just a world-renowned, world-class colourist with a celebrity client list.” 
  • (B) “Your GCSE grades are not as important as your passion for hair.”
  • (C) “Daniel Galvin is also a brand, with two generations of the Galvin family running this famous family business.”


In the example above, BERT would be trained to recognise that Sentence (C) is likely the next sentence sequentially to Sentence (A), and that Sentence (B) is the random sentence. This technique is very useful when training BERT for tasks like question answering.

Next Sentence Prediction, when combined with Masked Language Modelling, trains BERT to understand not only the context and relationship between words, but also between sentences.
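A sketch of how NSP training pairs might be constructed – the function name and the 50/50 split are our own framing of the description above, and a production implementation would draw the ‘NotNext’ sentence from a different document entirely:

```python
import random

def make_nsp_pairs(sentences, seed=0):
    # For each sentence, pair it 50/50 with either the sentence that
    # actually follows it (IsNext) or a randomly chosen one (NotNext).
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            pairs.append((sentences[i], rng.choice(sentences), "NotNext"))
    return pairs

passage = [
    "Daniel Galvin isn't just a world-renowned colourist.",
    "Daniel Galvin is also a brand, run by two generations of the family.",
    "Your GCSE grades are not as important as your passion for hair.",
]
for first, second, label in make_nsp_pairs(passage):
    print(label, "|", first, "->", second)
```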

BERT and Adapt Worldwide

As leaders in SEO, engaging with BERT as it continues to develop allows Adapt Worldwide the opportunity to anticipate and subsequently capitalise upon these innovations within the ever-evolving search environment. By fully understanding BERT and its implications, we can play a significant role in defining the future of search.

Moreover, because BERT has been open-sourced, it’s an exciting opportunity for Adapt Worldwide to further solidify its position as a frontrunner in translation technology, by building upon this new advancement in machine learning and continuing to drive forward the future of translation and localisation in the digital landscape.