Chapter 2: Working with Text Data

Introduction

The purpose of this chapter is to show the early preparation of the data, following the diagram below. The focus is on stage 1: data preparation and sampling.

[Figure: Stage 1 of building an LLM, with a focus on data preparation and sampling]

2.1 Understanding Word Embeddings

An embedding is a mapping from discrete objects, such as words, images, or even entire documents, to points in a continuous vector space. The primary purpose of embeddings is to convert nonnumeric data into a numeric format that neural networks can process.

  • While word embeddings are the most common form of text embedding, there are also embeddings for sentences, paragraphs, or whole documents. Sentence or paragraph embeddings are popular choices for retrieval-augmented generation.
  • Retrieval-augmented generation combines generation (producing text) with retrieval (searching an external knowledge base) to pull in relevant information when generating text; this technique is beyond the scope of this book. Since our goal is to train GPT-like LLMs, which learn to generate text one word at a time, we will focus on word embeddings.
  • Several algorithms and frameworks have been developed to generate word embeddings. One of the earlier and most popular examples is the Word2Vec approach.
  • Word embeddings can have varying dimensions, from one to thousands. A higher dimensionality might capture more nuanced relationships but at the cost of computational efficiency.
  • While we can use pretrained models such as Word2Vec to generate embeddings for machine learning models, LLMs commonly produce their own embeddings that are part of the input layer and are updated during training. The advantage of optimizing the embeddings as part of the LLM training instead of using Word2Vec is that the embeddings are optimized to the specific task and data at hand.
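
As a rough illustration of the last point, here is a minimal sketch (assuming PyTorch; the vocabulary size, embedding dimension, and token IDs are all illustrative). It creates a small embedding layer that maps token IDs to continuous vectors; in an LLM, these weights are part of the model and are updated during training.

```python
import torch

# A small embedding layer: maps discrete token IDs to points in a
# continuous vector space. Weights are randomly initialized here and
# would be optimized during LLM training.
vocab_size = 6        # number of distinct tokens (illustrative)
embedding_dim = 3     # dimensionality of each embedding vector (illustrative)

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([2, 3, 5, 1])   # hypothetical token IDs
print(embedding_layer(token_ids))        # shape (4, 3): one vector per token
```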

2.2 Tokenizing text

Tokenization is the process of splitting input text into smaller parts called tokens.

These tokens are the building blocks that a language model (like GPT or BERT) uses to understand and generate text.

Why is Tokenization Important?

Before an LLM can work with text (like generating or classifying it), it must convert the raw text into numbers—and this conversion starts with tokenizing the text.

Tokens are passed to an embedding layer, which transforms them into vectors that the model can process.

What Are Tokens?

Tokens can be:

  • Whole words (see the tokenization sketch after this list)
    Example: "Finance is power" → ["Finance", "is", "power"]
  • Word pieces / subwords, used in models like BERT or GPT to handle unknown or rare words
    Example: "unhappiness" → ["un", "happiness"] or even ["un", "hap", "pi", "ness"]
  • Punctuation: many tokenizers treat punctuation as separate tokens
    Example: "Hello, world!" → ["Hello", ",", "world", "!"]
  • Special symbols used by LLMs: newline (\n), tab (\t), or model-specific tokens like <s> (start), </s> (end), etc.
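
As a rough illustration of whole-word and punctuation tokenization, here is a minimal sketch in plain Python (the sample sentence is illustrative). It splits text on whitespace and common punctuation while keeping the punctuation marks as separate tokens.

```python
import re

text = "Hello, world! Finance is power."

# Split on whitespace, commas, periods, exclamation and question marks,
# keeping the punctuation as separate tokens, then drop empty strings.
tokens = [t for t in re.split(r'([,.!?]|\s)', text) if t.strip()]
print(tokens)
# ['Hello', ',', 'world', '!', 'Finance', 'is', 'power', '.']
```

Subword tokenization, such as the byte pair encoding (BPE) scheme used by GPT models, is normally handled by a library. The snippet below is a sketch assuming the tiktoken package is installed.

```python
import tiktoken

# The GPT-2 BPE tokenizer splits rare words into subword pieces.
bpe_tokenizer = tiktoken.get_encoding("gpt2")
ids = bpe_tokenizer.encode("unhappiness")
print(ids)                                        # token IDs of the subword pieces
print([bpe_tokenizer.decode([i]) for i in ids])   # the corresponding subword strings
```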


2.3 Converting tokens into token IDs

To convert tokens into token IDs, an intermediate step before the token IDs are turned into embedding vectors, we need to build a vocabulary. This vocabulary covers every unique token in the text we are considering, as shown below, and defines how each unique word and special character is mapped to a unique integer, as shown in figure 2.6.
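
Here is a minimal sketch in plain Python (the token list is illustrative; in practice it would come from tokenizing the full training text). It builds a vocabulary and uses it to convert tokens into token IDs and back.

```python
# Tokens produced by a tokenizer (illustrative).
tokens = ["Hello", ",", "world", "!", "Finance", "is", "power", "."]

# Assign each unique token a unique integer ID (sorted for reproducibility).
vocab = {token: token_id for token_id, token in enumerate(sorted(set(tokens)))}
print(vocab)

# Convert a tokenized sentence into token IDs using the vocabulary.
token_ids = [vocab[t] for t in ["Finance", "is", "power", "."]]
print(token_ids)

# An inverse vocabulary maps token IDs back to tokens (useful for decoding).
inverse_vocab = {token_id: token for token, token_id in vocab.items()}
print([inverse_vocab[i] for i in token_ids])
```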

[Figure 2.6: Building a vocabulary that maps each unique token to a unique token ID]