LLM Notes

Published: July 05, 2026

Notes on Large Language Models.

Tokenization

Why tokenize?
- Working directly on raw bytes is inefficient because sequences become very long. Working on whole words gives a large, sparse, and unbounded vocabulary, so the model will still encounter unseen words at test time.
- Compression ratio: the average number of bytes per token. A higher ratio means shorter token sequences for the same text.
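
The trade-off above can be made concrete with a small sketch. The sample sentence, the `compression_ratio` helper, and the whitespace split standing in for a word-level tokenizer are all illustrative assumptions, not part of any particular library.

```python
def compression_ratio(text, tokens):
    """Average number of UTF-8 bytes represented by each token."""
    return len(text.encode("utf-8")) / len(tokens)

text = "the cat sat on the mat"  # hypothetical sample sentence

# Byte-level: one token per byte, so sequences are as long as they can get.
byte_tokens = list(text.encode("utf-8"))
byte_ratio = compression_ratio(text, byte_tokens)

# Word-level (naive whitespace split): short sequences, but an open vocabulary.
word_tokens = text.split()
word_ratio = compression_ratio(text, word_tokens)

# Any word outside the training vocabulary has no ID at test time.
word_vocab = set(word_tokens)
unseen = "dog" in word_vocab

print(f"bytes: {len(byte_tokens)} tokens, ratio {byte_ratio:.2f}")
print(f"words: {len(word_tokens)} tokens, ratio {word_ratio:.2f}")
print(f"'dog' in word vocabulary: {unseen}")
```

Byte tokens always give a ratio of exactly 1, while the word split compresses better but cannot represent `dog`. Subword schemes aim for a ratio between the two while keeping every input representable.
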
BPE (Byte Pair Encoding)
- Data-driven subword tokenization tailored to a training corpus
- Why train on UTF-8 bytes rather than UTF-16 or UTF-32:
  - UTF-8 is processed one byte at a time, so there are only $2^8=256$ possible base values. All of them fit in the initial vocabulary, so no pruning is needed.
  - UTF-16 and UTF-32 use 16-bit and 32-bit code units, so the training corpus would need to be very large to contain every possible value.
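
A quick way to see the difference is to encode the same string three ways and look at the base symbols each encoding produces. This is only a sketch: the sample string and the `code_units` helper are assumptions made for illustration, and the little-endian codec variants are used so that no byte-order mark is prepended.

```python
text = "h\u00e9llo \u4f60\u597d"  # hypothetical sample: "héllo 你好", 8 code points

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

def code_units(data, width):
    """Split little-endian bytes into integer code units of `width` bytes."""
    return [int.from_bytes(data[i:i + width], "little")
            for i in range(0, len(data), width)]

utf8_units = code_units(utf8, 1)    # each unit is one byte, always below 256
utf16_units = code_units(utf16, 2)  # 16-bit units, values up to 65535
utf32_units = code_units(utf32, 4)  # 32-bit units, one per code point

print(len(utf8_units), max(utf8_units))
print(len(utf16_units), max(utf16_units))
print(len(utf32_units), max(utf32_units))
```

Every UTF-8 unit stays within the 256-entry base vocabulary, whereas the UTF-16 and UTF-32 units for the CJK characters fall far outside it. A tokenizer built on those wider units would need a base entry for each value it might ever see.
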

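from collections import Counter

def train_bpe(text, target_size):
    corpus = list(text.encode("utf-8"))  # raw bytes are the initial tokens (IDs 0..255)
    num_unique_tokens = 256

    while num_unique_tokens < target_size:
        # 1. Find the most frequent pair of adjacent tokens
        pair_counts = Counter(zip(corpus, corpus[1:]))
        if not pair_counts:
            break  # fewer than two tokens left, nothing to merge
        most_frequent_pair = max(pair_counts, key=pair_counts.get)

        # 2. Assign a new token ID to the most frequent pair
        new_token_id = num_unique_tokens
        num_unique_tokens += 1

        # 3. Replace every occurrence of the pair with the new token ID
        merged, i = [], 0
        while i < len(corpus):
            hit = tuple(corpus[i:i + 2]) == most_frequent_pair
            merged.append(new_token_id if hit else corpus[i])
            i += 2 if hit else 1
        corpus = merged

    return corpus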
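
To check the merge loop by hand, it helps to run it on a tiny string and confirm that the merged tokens decode back to the original text. The following is a minimal self-contained sketch, not a reference implementation: the helper names `merge_pair`, `learn_merges`, and `decode`, the sample string, and the choice of three merges are all assumptions made for illustration.

```python
from collections import Counter

def merge_pair(tokens, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def learn_merges(text, num_merges):
    """Return (tokens, merges), where merges maps new ID -> (left, right)."""
    tokens = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, 256 + num_merges):
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break  # nothing left to merge
        pair = max(counts, key=counts.get)
        merges[new_id] = pair
        tokens = merge_pair(tokens, pair, new_id)
    return tokens, merges

def decode(tokens, merges):
    """Expand merged IDs back into bytes, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for new_id, (left, right) in merges.items():  # dict keeps learning order
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")

text = "aaabdaaabac"  # hypothetical sample with obvious repeats
tokens, merges = learn_merges(text, num_merges=3)
ratio = len(text.encode("utf-8")) / len(tokens)

print(tokens, merges)
print(decode(tokens, merges), ratio)
```

Here the first merge picks the pair of two `a` bytes, and after three merges the 11 input bytes are represented by 5 tokens. Keeping the merge table is what makes the result reusable, since decoding and tokenizing new text both depend on it.
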
    

Qianying Zhou