---
title: 'LLM Notes'
date: 2026-07-05
permalink: /blog/2026/07/llm-notes/
tags:
  - LLM
---

Notes on Large Language Models.

# LLM Notes

## Tokenization

- Why tokenize?
    - Working directly with raw bytes is inefficient because sequences become very long, which makes the space of possible inputs large and sparse. Working with whole words gives an unbounded vocabulary, and the model will still encounter unseen words at test time.
    - Compression ratio: the average number of bytes per token. A higher ratio means shorter token sequences for the same text.

- BPE (Byte Pair Encoding)
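    - Data-driven subword tokenization: the vocabulary is learned from a training corpus.
    - Why train on UTF-8 bytes rather than UTF-16 or UTF-32 code units:
        - A UTF-8 byte has only $2^8=256$ possible values, so the base vocabulary can contain every one of them and no pruning is needed.
        - UTF-16 and UTF-32 use 16-bit and 32-bit code units, so the base alphabet is far larger ($2^{16}=65536$ values for UTF-16 alone), and the training corpus would need to be very large to contain every option.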
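
To make the encoding comparison concrete, here is a small sketch that encodes one string three ways and looks at the range of values a base vocabulary would have to cover. The sample string is an arbitrary choice for illustration, not something from a real corpus.

```python
# Sample text (an assumption for illustration): "héllo 世界", written with escapes.
text = "h\u00e9llo \u4e16\u754c"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # the -le variants omit the byte-order mark
utf32 = text.encode("utf-32-le")

# Group the raw bytes into 2-byte (UTF-16) and 4-byte (UTF-32) code units.
utf16_units = [int.from_bytes(utf16[i:i + 2], "little") for i in range(0, len(utf16), 2)]
utf32_units = [int.from_bytes(utf32[i:i + 4], "little") for i in range(0, len(utf32), 4)]

# Sequence length under each encoding.
print(len(utf8), len(utf16_units), len(utf32_units))

# Largest symbol value seen: UTF-8 bytes always fit in a 256-entry table,
# while UTF-16 and UTF-32 code units can take much larger values.
print(max(utf8), max(utf16_units), max(utf32_units))
```

UTF-8 produces a longer sequence here, but every symbol is guaranteed to be in the 256-entry base vocabulary. BPE merges then recover the lost compression.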

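A minimal BPE training loop (`text` and `target_size` are placeholder inputs):

```python
from collections import Counter

text, target_size = "low lower lowest", 260  # placeholder inputs
corpus = list(text.encode("utf-8"))
num_unique_tokens = 256  # one base token per possible byte value

while num_unique_tokens < target_size and len(corpus) > 1:
    # 1. Find the most frequent pair of adjacent tokens
    pair_counts = Counter(zip(corpus, corpus[1:]))
    most_frequent_pair = max(pair_counts, key=pair_counts.get)

    # 2. Assign a new token ID to the most frequent pair
    new_token_id = num_unique_tokens
    num_unique_tokens += 1

    # 3. Replace each occurrence of the most frequent pair with the new token ID
    merged, i = [], 0
    while i < len(corpus):
        hit = tuple(corpus[i:i + 2]) == most_frequent_pair
        merged.append(new_token_id if hit else corpus[i])
        i += 2 if hit else 1
    corpus = merged
```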
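
The training loop above only builds the vocabulary. As a rough sketch of what might come next, the code below records the merges during training, replays them in the same order to tokenize new text, and measures the compression ratio defined earlier. The function names (`merge_pair`, `train_bpe`, `encode`, `decode`, `compression_ratio`) and the toy training string are assumptions made for this example, and production tokenizers use faster data structures than this quadratic replay.

```python
from collections import Counter


def merge_pair(tokens, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`, left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text, target_size):
    """Learn merges until the vocabulary reaches `target_size` or nothing is left to merge."""
    tokens = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, kept in the order learned
    next_id = 256
    while next_id < target_size and len(tokens) > 1:
        counts = Counter(zip(tokens, tokens[1:]))
        pair = max(counts, key=counts.get)
        merges[pair] = next_id
        tokens = merge_pair(tokens, pair, next_id)
        next_id += 1
    return merges


def encode(text, merges):
    """Tokenize text by replaying the learned merges in training order."""
    tokens = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        tokens = merge_pair(tokens, pair, new_id)
    return tokens


def decode(tokens, merges):
    """Map token IDs back to bytes, then to a string."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")


def compression_ratio(text, tokens):
    """Average number of UTF-8 bytes represented by one token."""
    return len(text.encode("utf-8")) / len(tokens)


# Toy usage: train on a short string, then tokenize a new one.
merges = train_bpe("the cat sat on the mat, the end", target_size=260)
tokens = encode("the cat", merges)
print(tokens, compression_ratio("the cat", tokens))
print(decode(tokens, merges))
```

Replaying merges in the order they were learned matters: later merges are defined in terms of token IDs created by earlier ones, so applying them out of order would miss pairs.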