Byte Pair Encoding — The Dark Horse of Modern NLP

A simple data compression algorithm, first introduced in 1994, that supercharges almost all of today's advanced NLP models (including BERT).

Background

The last few years have been an exciting time to be in the field of NLP. The evolution from sparse, frequency-based word vectors to dense semantic word representations through pre-trained models like Word2vec and GloVe set the foundation for learning the meaning of words. For many years, these embeddings served as reliable initializations for the embedding layer when training models in the absence of large amounts of task-specific data. However, because the word embedding models pre-trained on Wikipedia were limited either by vocabulary size or by the frequency of word occurrences, rare words like athazagoraphobia were never captured and surfaced as unknown <unk> tokens when they appeared in text.

Dealing with rare words

Character-level embeddings aside, the first real breakthrough in addressing the rare-word problem came from researchers at the University of Edinburgh, who applied subword units in neural machine translation using Byte Pair Encoding (BPE). Today, subword tokenization schemes inspired by BPE have become the norm in most advanced models, including the very popular family of contextual language models such as BERT, GPT-2 and RoBERTa. Some have called BERT the beginning of a new era, yet I call BPE the dark horse in this race because it gets less attention (pun intended) than it deserves for the success of modern NLP models. In this article, I shed some light on how Byte Pair Encoding is implemented and why it works.

The Origins of Byte Pair Encoding

Like many other deep learning techniques inspired by classical computer science, Byte Pair Encoding (BPE) subword tokenization has its roots in a simple lossless data compression algorithm. BPE was first introduced by Philip Gage in the article “A New Algorithm for Data Compression” in the February 1994 issue of the C Users Journal, as a compression technique that works by replacing the most common pair of consecutive bytes with a byte that does not appear in the data.
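For intuition, here is a minimal Python sketch of a single replacement round of that compression idea. The function name and the toy input are my own, purely for illustration; Gage's original C implementation also repeats the replacement over multiple rounds and records the substitutions so the data can be decompressed.

from collections import Counter

def compress_once(data: bytes):
    """One compression round: replace the most frequent pair of adjacent
    bytes with a single byte value that does not appear in the data."""
    pairs = Counter(zip(data, data[1:]))
    if not pairs:
        return data, None
    (a, b), count = pairs.most_common(1)[0]
    unused = next((x for x in range(256) if x not in set(data)), None)
    if unused is None or count < 2:
        return data, None  # nothing worth replacing, or no spare byte value
    out, i = bytearray(), 0
    while i < len(data):
        if i + 1 < len(data) and data[i] == a and data[i + 1] == b:
            out.append(unused)  # stand-in byte for the frequent pair
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out), (unused, (a, b))  # keep the rule for decompression

compressed, rule = compress_once(b'ABABCABAB')
print(compressed, rule)  # b'\x00\x00C\x00\x00' with rule (0, (65, 66))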

Repurposing BPE for Subword Tokenization

To perform subword tokenization, BPE is slightly modified: frequently occurring subword pairs are merged together rather than replaced by another byte for compression. This would, for example, split the rare word athazagoraphobia into more frequent subwords such as ['▁ath', 'az', 'agor', 'aphobia']. The procedure works as follows (a code sketch is given after the steps).

Step 0. Initialize vocabulary.

Step 1. Represent each word in the corpus as a sequence of its characters, followed by the special end-of-word token </w>.

Step 2. Count the frequency of every pair of adjacent symbols across all tokens in the vocabulary.

Step 3. Merge every occurrence of the most frequent pair and add the resulting new symbol (a character n-gram) to the vocabulary.

Step 4. Repeat steps 2 and 3 until the desired number of merge operations has been performed or the desired vocabulary size is reached (both are hyperparameters); see the sketch below.
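In code, the learning loop takes only a few lines of Python. The sketch below closely follows the reference implementation published with Sennrich et al. (2015) [2]; the toy word frequencies and the number of merges are illustrative choices, not fixed values.

import collections
import re

def get_stats(vocab):
    """Count how often each pair of adjacent symbols occurs across the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the given pair into a single new symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is written as space-separated characters plus the
# end-of-word marker </w>; the values are word frequencies (Step 1).
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10  # hyperparameter (Step 4)
merges = []
for _ in range(num_merges):
    pairs = get_stats(vocab)          # Step 2
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent pair
    vocab = merge_vocab(best, vocab)  # Step 3
    merges.append(best)

print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ...]

Each merge adds exactly one new symbol, so the final vocabulary size grows linearly with the number of merge operations.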

What makes BPE the secret sauce?

BPE strikes a balance between character-level and word-level representations, which lets it handle large corpora with a compact vocabulary. It also encodes rare and out-of-vocabulary words as sequences of known subword tokens, without introducing any “unknown” tokens. This is especially valuable for languages like German, where many compound words would otherwise make it hard to learn a complete word-level vocabulary. With this tokenization algorithm, every word can finally overcome its fear of being forgotten (athazagoraphobia).
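Once the merge rules are learned, encoding a new word simply replays those merges over its characters. The helper below is a hypothetical sketch of that step (the function name and the rule-by-rule application are illustrative; production tokenizers usually select the highest-priority applicable merge at each iteration, which yields the same segmentation here).

def encode_word(word, merges):
    """Segment a word by replaying learned merge rules in the order they were learned.
    `merges` is an ordered list of (left, right) symbol pairs, e.g. [('e', 's'), ('es', 't'), ...]."""
    symbols = list(word) + ['</w>']
    for left, right in merges:
        i, merged = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # collapse the pair into one subword
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

With the merges learned from the toy corpus above, encode_word('lowest', merges) returns ['low', 'est</w>']: a word that never appeared in training is still covered by known subwords, with no <unk> token needed.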

References

  1. Gage, P. (1994). A new algorithm for data compression. The C Users Journal, February 1994.
  2. Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
  3. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  4. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
