Byte Pair Encoding — The Dark Horse of Modern NLP

A simple data compression algorithm, first introduced in 1994, that supercharges almost all of today's advanced NLP models (including BERT).

Background

The last few years have been an exciting time to be in the field of NLP. The evolution from sparse, frequency-based word vectors to dense semantic word representations through pre-trained models like Word2vec and GloVe set the foundation for learning the meaning of words. For many years, these embeddings served as reliable initializations for the embedding layer when training models in the absence of large amounts of task-specific data. However, because the word embedding models pre-trained on Wikipedia were limited either by vocabulary size or by the frequency of word occurrences, rare words like athazagoraphobia were never captured and surfaced as unknown <unk> tokens when they appeared in text.

Dealing with rare words

Character-level embeddings aside, the first real breakthrough in addressing the rare-word problem came from researchers at the University of Edinburgh, who applied subword units in neural machine translation using Byte Pair Encoding (BPE). Today, subword tokenization schemes inspired by BPE have become the norm in most advanced models, including the very popular family of contextual language models such as BERT, GPT-2 and RoBERTa. Some have called BERT the beginning of a new era, yet I call BPE the dark horse in this race because it gets less attention (pun intended) than it deserves for the success of modern NLP models. In this article, I shed some light on how Byte Pair Encoding is implemented and why it works.

The Origins of Byte Pair Encoding

Like many other deep learning techniques inspired by classical computer science, Byte Pair Encoding (BPE) subword tokenization has its roots in a simple lossless data compression algorithm. BPE was first introduced by Philip Gage in the article “A New Algorithm for Data Compression” in the February 1994 issue of the C Users Journal, as a compression technique that works by replacing the most common pair of consecutive bytes with a byte that does not appear in the data.
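For intuition, here is a minimal Python sketch of a single replacement round of that compression idea. The function name and the toy input are my own, purely for illustration; Gage's original C implementation also repeats the replacement over multiple rounds and records the substitutions so the data can be decompressed.

from collections import Counter

def compress_once(data: bytes):
    """One compression round: replace the most frequent pair of adjacent
    bytes with a single byte value that does not appear in the data."""
    pairs = Counter(zip(data, data[1:]))
    if not pairs:
        return data, None
    (a, b), count = pairs.most_common(1)[0]
    unused = next((x for x in range(256) if x not in set(data)), None)
    if unused is None or count < 2:
        return data, None  # nothing worth replacing, or no spare byte value
    out, i = bytearray(), 0
    while i < len(data):
        if i + 1 < len(data) and data[i] == a and data[i + 1] == b:
            out.append(unused)  # stand-in byte for the frequent pair
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out), (unused, (a, b))  # keep the rule for decompression

compressed, rule = compress_once(b'ABABCABAB')
print(compressed, rule)  # b'\x00\x00C\x00\x00' with rule (0, (65, 66))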

Repurposing BPE for Subword Tokenization

To perform subword tokenization, BPE is slightly modified: frequently occurring subword pairs are merged together rather than replaced by another byte for compression. This would, for example, split the rare word athazagoraphobia into more frequent subwords such as ['▁ath', 'az', 'agor', 'aphobia']. The procedure works as follows (a code sketch is given after the steps).

Step 0. Initialize vocabulary.

Step 1. Represent each word in the corpus as a sequence of its characters, followed by the special end-of-word token </w>.

Step 2. Count the frequency of every pair of adjacent symbols across all tokens in the vocabulary.

Step 3. Merge every occurrence of the most frequent pair and add the resulting new symbol (a character n-gram) to the vocabulary.

Step 4. Repeat steps 2 and 3 until the desired number of merge operations has been performed or the desired vocabulary size is reached (both are hyperparameters); see the sketch below.
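In code, the learning loop takes only a few lines of Python. The sketch below closely follows the reference implementation published with Sennrich et al. (2015) [2]; the toy word frequencies and the number of merges are illustrative choices, not fixed values.

import collections
import re

def get_stats(vocab):
    """Count how often each pair of adjacent symbols occurs across the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the given pair into a single new symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is written as space-separated characters plus the
# end-of-word marker </w>; the values are word frequencies (Step 1).
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10  # hyperparameter (Step 4)
merges = []
for _ in range(num_merges):
    pairs = get_stats(vocab)          # Step 2
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent pair
    vocab = merge_vocab(best, vocab)  # Step 3
    merges.append(best)

print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ...]

Each merge adds exactly one new symbol, so the final vocabulary size grows linearly with the number of merge operations.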

What makes BPE the secret sauce?

BPE strikes a balance between character-level and word-level representations, which lets it handle large corpora with a compact vocabulary. It also encodes rare and out-of-vocabulary words as sequences of known subword tokens, without introducing any “unknown” tokens. This is especially valuable for languages like German, where many compound words would otherwise make it hard to learn a complete word-level vocabulary. With this tokenization algorithm, every word can finally overcome its fear of being forgotten (athazagoraphobia).
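Once the merge rules are learned, encoding a new word simply replays those merges over its characters. The helper below is a hypothetical sketch of that step (the function name and the rule-by-rule application are illustrative; production tokenizers usually select the highest-priority applicable merge at each iteration, which yields the same segmentation here).

def encode_word(word, merges):
    """Segment a word by replaying learned merge rules in the order they were learned.
    `merges` is an ordered list of (left, right) symbol pairs, e.g. [('e', 's'), ('es', 't'), ...]."""
    symbols = list(word) + ['</w>']
    for left, right in merges:
        i, merged = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # collapse the pair into one subword
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

With the merges learned from the toy corpus above, encode_word('lowest', merges) returns ['low', 'est</w>']: a word that never appeared in training is still covered by known subwords, with no <unk> token needed.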

References

  1. Gage, P. (1994). A new algorithm for data compression. The C Users Journal, February 1994.
  2. Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
  3. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  4. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
