How to understand the special tokens?

That’s certainly confusing…


You can understand what you’re seeing if you separate three different layers:

  1. The model’s vocabulary (all token IDs, including your added_tokens list).
  2. The low-level fast tokenizer (Rust tokenizers library; this is what tokenizer.json stores).
  3. The high-level Transformers wrapper (Python PreTrainedTokenizerFast / SpecialTokensMixin; this is driven by tokenizer_config.json and special_tokens_map.json and is what all_special_tokens comes from). (Hugging Face)

The confusion arises exactly because (2) and (3) use the word “special” differently.


1. First: what is a “special token” conceptually?

At the model/training level there are two broad kinds of tokens:

  • Normal (regular) tokens – subword pieces of natural language (“Hello”, “ing”, etc.).

  • Control / format tokens – tokens with special meaning in the training data, such as:

    • <|im_start|>, <|im_end|> – chat message boundaries.
    • <|vision_start|>, <|vision_end|>, <|vision_pad|> – multimodal boundaries/padding.
    • <tool_call>, </tool_call>, <tool_response>, </tool_response> – function-calling tags.
    • <think>, </think> – reasoning spans.
    • <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|> – fill-in-the-middle tokens.
    • <|endoftext|> – end-of-document token used in pretraining.

Qwen’s docs call these “control tokens”: tokens that represent special functionality rather than natural language itself. (Qwen)

From the model’s point of view, all of these are just token IDs. “Specialness” is about how the tokenizer and high-level library treat them.
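
A minimal sketch of that point (the path is a placeholder for whatever checkpoint you are inspecting):

from transformers import AutoTokenizer

# "path/to/your-model" is a placeholder for the checkpoint you are inspecting.
tok = AutoTokenizer.from_pretrained("path/to/your-model")

# To the model, a control token is just another row in the embedding table:
print(tok.convert_tokens_to_ids("<|im_start|>"))  # 151644 in your snippet below
print(tok.convert_tokens_to_ids("Hello"))         # an ordinary vocab ID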


2. What your added_tokens list in tokenizer.json actually is

The tokenizer.json file is the serialized fast tokenizer from the tokenizers library. It contains: vocabulary, merges, pre/post-processing, plus a list called added_tokens. (Hugging Face)

Your added_tokens snippet:

151643 <|endoftext|> true
151644 <|im_start|> true
151645 <|im_end|> true
151646 <|object_ref_start|> true
151647 <|object_ref_end|> true
151648 <|box_start|> true
151649 <|box_end|> true
151650 <|quad_start|> true
151651 <|quad_end|> true
151652 <|vision_start|> true
151653 <|vision_end|> true
151654 <|vision_pad|> true
151655 <|image_pad|> true
151656 <|video_pad|> true
151657 <tool_call> false
151658 </tool_call> false
151659 <|fim_prefix|> false
151660 <|fim_middle|> false
151661 <|fim_suffix|> false
151662 <|fim_pad|> false
151663 <|repo_name|> false
151664 <|file_sep|> false
151665 <tool_response> false
151666 </tool_response> false
151667 <think> false
151668 </think> false

Here:

  • The first column is the token ID.
  • The middle column is the string form of the token.
  • The last true/false is the Rust-tokenizer-level special flag. (paddlenlp.readthedocs.io)

What that flag does in the fast tokenizer:

  • special = true

    • The token is treated as an indivisible “added token”.
    • The pre-tokenizer will not split it into smaller pieces.
    • When you decode with skip_special_tokens=True, these tokens will be removed. (Hugging Face)
  • special = false

    • The token is just an extra vocab entry. It is still matched as a single piece during tokenization, but it gets no special handling in the tokenizer’s decode / skip logic: skip_special_tokens=True will not remove it.
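
You can observe the effect of the flag directly (sketch; placeholder path again):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/your-model")  # placeholder path

ids = tok.encode("<|im_start|>user hi<|im_end|><tool_call>", add_special_tokens=False)

# special=true tokens are dropped on decode; special=false tokens survive:
print(tok.decode(ids, skip_special_tokens=False))  # <|im_start|>user hi<|im_end|><tool_call>
print(tok.decode(ids, skip_special_tokens=True))   # roughly: user hi<tool_call>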

So:

What does the added_tokens list mean?
It is “all vocabulary items that were added on top of the base vocab”, along with a low-level special flag that controls how the fast tokenizer tokenizes/decodes them.

It is not “the list of all special tokens from Transformers’ point of view”.

You can see this design in the Transformers code: higher-level add_special_tokens() calls down into the fast tokenizer and creates AddedToken objects with special=True, but there can also be added tokens that are not special. (gemfury.com)
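
A sketch of the two entry points (the token strings here are made up for illustration; replace_additional_special_tokens exists in recent Transformers versions):

from transformers import AutoTokenizer, AddedToken

tok = AutoTokenizer.from_pretrained("path/to/your-model")  # placeholder path

# Added but NOT special: new vocab entry, serialized with special=false.
tok.add_tokens([AddedToken("<my_tag>", normalized=False)])

# Added AND special: serialized with special=true, and also registered as an
# additional special token at the Transformers level.
tok.add_special_tokens(
    {"additional_special_tokens": ["<my_marker>"]},
    replace_additional_special_tokens=False,  # keep the existing ones
)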


3. What tokenizer_config.json is doing

tokenizer_config.json is a wrapper configuration used by the Python transformers library. It does not contain the full vocab; it tells AutoTokenizer:

  • Which tokenizer class to instantiate ("tokenizer_class": "Qwen2Tokenizer").

  • Which tokens are:

    • bos_token, eos_token, pad_token, unk_token, etc.
    • additional_special_tokens (custom special tokens).
  • Behavior flags like model_max_length, padding_side, add_prefix_space, etc. (Hugging Face)

Your tokenizer_config.json says:

"eos_token": "<|im_end|>",
"pad_token": "<|vision_pad|>",
"additional_special_tokens": [
  "<|im_start|>",
  "<|im_end|>",
  "<|object_ref_start|>",
  "<|object_ref_end|>",
  "<|box_start|>",
  "<|box_end|>",
  "<|quad_start|>",
  "<|quad_end|>",
  "<|vision_start|>",
  "<|vision_end|>",
  "<|vision_pad|>",
  "<|image_pad|>",
  "<|video_pad|>"
]

So from Transformers’ perspective:

  • EOS = <|im_end|>
  • PAD = <|vision_pad|>
  • And these 13 tokens are “additional special tokens”.

This information is also mirrored in special_tokens_map.json for many models, and both files are loaded by AutoTokenizer. (Hugging Face)
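
You can confirm the mapping from Python; the IDs in the comments come from the added_tokens snippet above:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/your-model")  # placeholder path

print(tok.eos_token, tok.eos_token_id)   # <|im_end|> 151645
print(tok.pad_token, tok.pad_token_id)   # <|vision_pad|> 151654
print(tok.additional_special_tokens)     # the 13 chat/vision tokens above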


4. How tokenizer.all_special_tokens is computed

In the Transformers Python code, the SpecialTokensMixin class holds all the special-token attributes and exposes properties like all_special_tokens and all_special_ids. (Hugging Face)

Conceptually it does something like:

specials = []
for value in tokenizer.special_tokens_map_extended.values():
    # Values are strings or AddedToken objects; additional_special_tokens is a list.
    if isinstance(value, list):
        specials.extend(str(t) for t in value)
    else:
        specials.append(str(value))

# deduplicate while preserving order
all_special_tokens = list(dict.fromkeys(specials))

Where special_tokens_map_extended is built from:

  • bos_token, eos_token, pad_token, unk_token, etc.
  • additional_special_tokens (and sometimes their legacy variants). (Hugging Face)

Crucially:

all_special_tokens never looks at the raw added_tokens list in tokenizer.json.
It only looks at named special tokens (bos_token, eos_token, pad_token, etc.) and additional_special_tokens stored in the config.

That is exactly why your all_special_tokens output is:

[
 '<|im_end|>',
 '<|vision_pad|>',
 '<|im_start|>',
 '<|object_ref_start|>',
 '<|object_ref_end|>',
 '<|box_start|>',
 '<|box_end|>',
 '<|quad_start|>',
 '<|quad_end|>',
 '<|vision_start|>',
 '<|vision_end|>',
 '<|image_pad|>',
 '<|video_pad|>',
]

This is just:

  • eos_token (<|im_end|>)
  • pad_token (<|vision_pad|>)
  • plus everything in additional_special_tokens (deduplicated).

Notice:

  • <|endoftext|> is not in additional_special_tokens and is not declared as EOS in tokenizer_config.json.
  • Tool / FIM / <think> tokens are also not in additional_special_tokens and have special=false at the tokenizer level.

Therefore they do not appear in all_special_tokens. This is normal and also shows up in other models (e.g. LLaVA’s <image> token sometimes appears in added_tokens but not in all_special_tokens unless it was wired into additional_special_tokens). (Hugging Face Forums)
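
You can check this yourself by diffing the two views (sketch):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/your-model")  # placeholder path

added = set(tok.get_added_vocab())       # every entry in added_tokens
specials = set(tok.all_special_tokens)   # only the config-declared specials

# Added tokens that Transformers does NOT consider special:
print(sorted(added - specials))
# e.g. ['</think>', '</tool_call>', '<think>', '<tool_call>', '<|endoftext|>', ...]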

So:

Why is all_special_tokens different from the added_tokens list, and even from its special=true subset?
Because all_special_tokens is a higher-level view built from tokenizer_config.json (special-tokens map and additional_special_tokens), while added_tokens is the raw vocabulary list (with a low-level special flag). They are related but intentionally not the same set.


5. Relationship between the three things you see

Let’s put your exact objects side-by-side.

5.1. added_tokens (fast tokenizer, low-level)

  • Contains all tokens that were added after the base vocab, including:

    • Qwen control tokens: <|endoftext|>, <|im_start|>, <|im_end|>, <|vision_*|>, etc.
    • Tool tokens: <tool_call>, <tool_response>, <think>, etc.
    • FIM / repo tokens: <|fim_*|>, <|repo_name|>, <|file_sep|>.
  • The trailing true/false is the Rust-layer “special” flag for tokenization behavior.

5.2. tokenizer_config.json (Transformers wrapper, high-level)

Defines:

  • eos_token = "<|im_end|>"
  • pad_token = "<|vision_pad|>"
  • additional_special_tokens = the 13 multimodal/chat tokens.

These become:

  • tokenizer.eos_token, tokenizer.pad_token
  • tokenizer.additional_special_tokens

and then feed into:

  • tokenizer.all_special_tokens
  • tokenizer.all_special_ids

via SpecialTokensMixin. (Hugging Face)

5.3. tokenizer.all_special_tokens (Python view)

  • Computed from special_tokens_map / special_tokens_map_extended (EOS, PAD, additional specials, etc.), not from the raw added_tokens list.

Hence you only see:

  • <|im_end|>
  • <|vision_pad|>
  • and the 11 other additional special tokens.

<|endoftext|> and <tool_call> are not in that config, so they don’t appear even though they exist in added_tokens.
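
Concretely:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/your-model")  # placeholder path

# The token exists in the vocabulary...
print(tok.convert_tokens_to_ids("<|endoftext|>"))   # 151643
# ...but Transformers does not consider it special:
print("<|endoftext|>" in tok.all_special_tokens)    # False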


6. Difference in roles: tokenizer.json vs tokenizer_config.json

You can think of it like this:

6.1 tokenizer.json = “how to actually tokenize text”

  • Full definition of the fast tokenizer:

    • Vocabulary and merges (BPE/Unigram/etc.).
    • Normalizer, pre-tokenizer, post-processor.
    • added_tokens and their low-level special flag. (Hugging Face)
  • Used by anything that needs the exact same tokenization behavior:

    • PreTrainedTokenizerFast in Python.
    • transformers.js in JavaScript. (Hugging Face)
    • Inference frameworks that load HF tokenizers directly (vLLM, etc.).

If you change this file, you are changing how raw text is split into IDs.
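
For example, you can load tokenizer.json with the tokenizers library directly, bypassing the Transformers wrapper entirely (sketch):

from tokenizers import Tokenizer

# No tokenizer_config.json involved here -- only the serialized fast tokenizer.
rust_tok = Tokenizer.from_file("path/to/tokenizer.json")

enc = rust_tok.encode("<|im_start|>user hi<|im_end|>", add_special_tokens=False)
print(enc.tokens)  # added tokens stay whole; the rest is split by BPE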

6.2 tokenizer_config.json = “how Transformers should treat this tokenizer”

  • A small JSON that tells Transformers:

    • Which tokenizer class to use (Qwen2Tokenizer).
    • Which tokens are EOS, PAD, BOS, etc.
    • Which tokens are additional_special_tokens.
    • Max length, padding side, whether to add BOS by default, etc. (Hugging Face)
  • Also now often stores the chat template (chat_template), which renders chat messages into these control tokens.

If you change this file, you are changing metadata and behavior inside Transformers, not the raw tokenization algorithm.
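
For example, if you wanted skip_special_tokens=True to also strip <think> / </think>, you would change this layer rather than tokenizer.json. A sketch (recent Transformers versions):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/your-model")  # placeholder path

# Promote the think tags to Transformers-level special tokens,
# keeping the existing additional_special_tokens intact.
tok.add_special_tokens(
    {"additional_special_tokens": ["<think>", "</think>"]},
    replace_additional_special_tokens=False,
)

print("<think>" in tok.all_special_tokens)     # True
tok.save_pretrained("path/to/patched-model")   # rewrites tokenizer_config.json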

6.3 Other ancillary files

Many HF model repos also contain:

  • special_tokens_map.json – basically the same info as the special_tokens_map attribute: mapping from names (eos_token, pad_token, additional_special_tokens) to actual strings. (Hugging Face)
  • added_tokens.json – a separate, simpler listing of added tokens (often derived from tokenizer.json).
  • config.json / generation_config.json – model config and default generation parameters, including eos_token_id, pad_token_id which must be consistent with the tokenizer side. (Hugging Face)

When these files get out of sync (e.g. EOS ID in config.json vs EOS string in tokenizer_config.json vs tokenizer.json contents), you get classic bugs: generation not stopping, NaNs during training, etc. There are real Qwen bugs like this discussed in the wild.
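
A quick consistency check across files (sketch; assumes a standard HF repo layout):

from transformers import AutoTokenizer, GenerationConfig

tok = AutoTokenizer.from_pretrained("path/to/your-model")  # placeholder path
gen = GenerationConfig.from_pretrained("path/to/your-model")

# eos_token_id may be an int or a list of ints in generation_config.json.
print(tok.eos_token_id, gen.eos_token_id)
print(tok.pad_token_id, gen.pad_token_id)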


7. How to mentally understand special tokens in practice

A practical mental model that matches what you see:

  1. Vocabulary-level view (tokenizer.json / added_tokens)

    • “Which strings exist as single tokens?”
    • “Does the fast tokenizer treat them as special (never split, removable on decode)?”
  2. Transformers-level view (tokenizer_config.json / special_tokens_map.json)

    • “Which tokens does Transformers treat as EOS/PAD/BOS/CLS/SEP?”
    • “Which tokens are additional special tokens (additional_special_tokens)?”
    • This drives all_special_tokens, all_special_ids, skip_special_tokens=True, etc. (Hugging Face)
  3. Model/training-level view (chat template, data format)

    • “Which control tokens actually appear in the training data, and what do they mean?”

    • Qwen-style control tokens:

      • <|im_start|>, <|im_end|> – chat roles.
      • <|vision_*|>, <|image_pad|>, <|video_pad|> – multimodal.
      • <tool_call>, <tool_response>, <think> – tool + reasoning. (Qwen)

These three layers do not have to use the same subset of tokens, but they must be coherent for your use case.

For your specific tokenizer:

  • tokenizer.json lists all of those control tokens in added_tokens.
  • tokenizer_config.json chooses a subset as EOS / PAD / additional special tokens (mostly chat + vision).
  • tokenizer.all_special_tokens is the union of EOS/PAD plus additional_special_tokens, hence the 13-token list you’re seeing.
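
You can inspect all three layers in one go (sketch):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/your-model")  # placeholder path

# Layer 1: vocabulary view -- added tokens and their Rust-level special flag.
for token in tok.added_tokens_decoder.values():
    print(token.content, token.special)

# Layer 2: Transformers view -- named specials plus additional_special_tokens.
print(tok.special_tokens_map)

# Layer 3: training-format view -- how the chat template uses control tokens.
print(tok.apply_chat_template([{"role": "user", "content": "hi"}], tokenize=False))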

8. Summary

  • added_tokens list

    • All tokens added on top of the base vocab, with a low-level special flag used by the fast tokenizer (true = never split, removable on decode).
    • Contains many control tags (<tool_call>, <think>, FIM tokens, etc.) even if Transformers doesn’t treat them as “special”.
  • tokenizer.json

    • Full fast tokenizer definition (vocab, merges, normalizer, added_tokens).
    • Controls how text is split into tokens.
  • tokenizer_config.json

    • High-level wrapper config for Transformers.
    • Declares which tokens are EOS/PAD/etc and which are additional_special_tokens.
    • Controls what becomes tokenizer.eos_token, tokenizer.pad_token, tokenizer.additional_special_tokens, and ultimately tokenizer.all_special_tokens.
  • tokenizer.all_special_tokens

    • Computed from the special tokens map (EOS/PAD/BOS/UNK/etc.) plus additional_special_tokens.

    • Does not read the full added_tokens list, so it is normal and expected that:

      • all_special_tokens ⊂ added_tokens, and
      • It may differ from the subset of added_tokens with special=true.

9. Good reference links

A few high-quality references you can read in full:

  • Hugging Face docs – Tokenizer (high-level) and Fast tokenizers (low-level internals). (Hugging Face)
  • Transformers internals – PreTrainedTokenizerBase / SpecialTokensMixin (how special tokens and all_special_tokens are implemented). (Hugging Face)
  • Qwen docs – Key Concepts (explains regular vs control tokens in the Qwen family). (Qwen)
  • HF forum thread – “additional_special_tokens are not added” (LLaVA <image> token missing from all_special_tokens, same pattern as your issue). (Hugging Face Forums)
  • Example tokenizer configs – Qwen2-VL tokenizer_config.json (shows how Qwen actually declares EOS/PAD and additional special tokens). (Hugging Face)