The last few years have been an exciting time to be in the field of NLP. The evolution from sparse, frequency-based word vectors to dense, semantic word representations from pre-trained models like Word2vec and GloVe set the foundation for learning the meaning of words. For many years, they served as reliable embedding-layer initializations for training models in the absence of large amounts of task-specific data. But because word embedding models pre-trained on Wikipedia were limited by either vocabulary size or the frequency of word occurrences, rare words like athazagoraphobia would never be captured, resulting in unknown <unk> tokens whenever they occurred in the text.

Character-level embeddings aside, the first real breakthrough in addressing the rare-word problem came from researchers at the University of Edinburgh, who applied subword units to Neural Machine Translation using Byte Pair Encoding (BPE). Today, subword tokenization schemes inspired by BPE have become the norm in most advanced models, including the very popular family of contextual language models such as BERT, GPT-2, and RoBERTa. Some referred to BERT as the beginning of a new era, yet I think of BPE as the dark horse of this race because it gets less attention (pun intended) than it deserves for the success of modern NLP models. In this article, I plan on shedding some more light on how Byte Pair Encoding is implemented and why it works!

Like many other applications of deep learning inspired by classical computer science, Byte Pair Encoding (BPE) subword tokenization also finds its roots in a simple lossless data compression algorithm. BPE was first introduced by Philip Gage in the article “A New Algorithm for Data Compression” in the February 1994 issue of the C Users Journal, as a data compression technique that works by repeatedly replacing the most common pair of consecutive bytes with a byte that does not appear in the data.
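To make the original idea concrete, here is a minimal sketch of a single compression pass in Python (not Gage’s original C implementation): find the most frequent pair of adjacent bytes and replace every occurrence with a byte that does not occur in the data, remembering the substitution so it can be undone later.

```python
from collections import Counter

def compress_once(data: bytes):
    """Replace the most frequent adjacent byte pair with an unused byte.
    Returns the new data and the (new_byte, pair) substitution, or None
    if there is no repeated pair or no spare byte left."""
    pairs = Counter(zip(data, data[1:]))
    if not pairs:
        return None
    (a, b), count = pairs.most_common(1)[0]
    if count < 2:
        return None  # nothing worth replacing
    unused = set(range(256)) - set(data)
    if not unused:
        return None  # no spare byte available for the substitution
    new_byte = unused.pop()
    out = bytearray()
    i = 0
    while i < len(data):
        if i + 1 < len(data) and data[i] == a and data[i + 1] == b:
            out.append(new_byte)  # substitute the pair with the spare byte
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out), (new_byte, (a, b))

compressed, substitution = compress_once(b"ABABCABAB")
print(compressed, substitution)  # the pair (A, B) is replaced by a spare byte
```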

To perform subword tokenization, BPE is slightly modified in its implementation: frequently occurring subword pairs are merged together into new vocabulary entries instead of being replaced by another byte for compression. This ultimately leads the rare word athazagoraphobia to be split up into more frequent subwords such as ['▁ath', 'az', 'agor', 'aphobia'].

Step 0. Initialize vocabulary.

Step 1. Represent each word in the corpus as a sequence of its characters, followed by the special end-of-word token </w>.

Step 2. Count the frequencies of all symbol pairs across the tokens of the vocabulary.

Step 3. Merge every occurrence of the most frequent pair and add the resulting character n-gram to the vocabulary.

Step 4. Repeat steps 2 and 3 until the desired number of merge operations has been performed or the desired vocabulary size (a hyperparameter) is reached, as in the sketch below.
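A compact Python sketch of these steps, closely following the reference implementation published by Sennrich et al., is shown below; the toy corpus and the number of merges are purely illustrative.

```python
import re
from collections import Counter

def get_stats(vocab):
    """Step 2: count symbol pairs across all tokenized words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Step 3: merge every occurrence of the given pair into a single symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Steps 0 and 1: a toy corpus, each word spelled as space-separated characters plus </w>.
vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
}

num_merges = 10  # Step 4: the number of merges is a hyperparameter
for _ in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)  # e.g. ('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ...
```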

BPE strikes a balance between character- and word-level representations, which makes it capable of handling large corpora. This behavior also enables the encoding of any rare word with appropriate subword tokens, without introducing any “unknown” tokens. This is especially valuable for languages like German, where the presence of many compound words makes it hard to learn a rich full-word vocabulary otherwise. With this tokenization algorithm, every word can now overcome its fear of being forgotten (athazagoraphobia).
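Once the ordered list of merges has been learned, segmenting a new word amounts to replaying those merges greedily. The helper below is a hypothetical illustration of that step; the merge list is hand-picked for readability and would normally come from training on a real corpus, which is why the output differs from the SentencePiece-style split shown earlier.

```python
def apply_merges(word, merges):
    """Segment a word by applying learned merges in the order they were learned.
    `merges` is the ordered list of symbol pairs produced during training."""
    symbols = list(word) + ['</w>']
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # collapse the pair into one symbol
            else:
                i += 1
    return symbols

# Illustrative merge list (in practice this comes from training, as above).
merges = [('a', 't'), ('at', 'h'), ('i', 'a'), ('o', 'b'), ('ob', 'ia')]
print(apply_merges('athazagoraphobia', merges))
# ['ath', 'a', 'z', 'a', 'g', 'o', 'r', 'a', 'p', 'h', 'obia', '</w>']
```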

  1. Gage, P. (1994). A new algorithm for data compression. The C Users Journal, February 1994.
  2. Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
  3. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  4. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.