That’s certainly confusing…
You can understand what you’re seeing if you separate three different layers:
1. The model's vocabulary (all token IDs, including your `added_tokens` list).
2. The low-level fast tokenizer (the Rust `tokenizers` library; this is what `tokenizer.json` stores).
3. The high-level Transformers wrapper (the Python `PreTrainedTokenizerFast` / `SpecialTokensMixin`; this is driven by `tokenizer_config.json` and `special_tokens_map.json` and is what `all_special_tokens` comes from). (Hugging Face)
Your confusion arises exactly because layers (2) and (3) use the word "special" differently.
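If you want to poke at all three layers from Python, a minimal sketch looks like this (the checkpoint path is a hypothetical placeholder; any Qwen2.5-style checkpoint behaves the same way):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/qwen-checkpoint")  # hypothetical path

# (1) the vocabulary: base vocab plus everything in added_tokens
print(len(tok))

# (2) the low-level Rust tokenizer, deserialized from tokenizer.json
print(type(tok.backend_tokenizer))   # tokenizers.Tokenizer

# (3) the Transformers wrapper, configured by tokenizer_config.json
print(tok.eos_token, tok.pad_token)
print(tok.all_special_tokens)
```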
1. First: what is a “special token” conceptually?
At the model/training level there are two broad kinds of tokens:
- Normal (regular) tokens – subword pieces of natural language ("Hello", "ing", etc.).
- Control / format tokens – tokens with special meaning in the training data, such as:
  - `<|im_start|>`, `<|im_end|>` – chat message boundaries.
  - `<|vision_start|>`, `<|vision_end|>`, `<|vision_pad|>` – multimodal boundaries/padding.
  - `<tool_call>`, `</tool_call>`, `<tool_response>`, `</tool_response>` – function-calling tags.
  - `<think>`, `</think>` – reasoning spans.
  - `<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>` – fill-in-the-middle tokens.
  - `<|endoftext|>` – end-of-document token used in pretraining.
Qwen’s docs call these “control tokens”: tokens that represent special functionality rather than natural language itself. (Qwen)
From the model’s point of view, all of these are just token IDs. “Specialness” is about how the tokenizer and high-level library treat them.
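A tiny illustration of that point (a sketch, using the same hypothetical checkpoint path as above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/qwen-checkpoint")  # hypothetical path

print(tok("Hello, world!").input_ids)   # ordinary text -> a few ordinary IDs
print(tok("<|im_end|>").input_ids)      # -> [151645]: the control token is just another ID
```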
2. What your added_tokens list in tokenizer.json actually is
The tokenizer.json file is the serialized fast tokenizer from the tokenizers library. It contains: vocabulary, merges, pre/post-processing, plus a list called added_tokens. (Hugging Face)
Your added_tokens snippet:
151643 <|endoftext|> true
151644 <|im_start|> true
151645 <|im_end|> true
151646 <|object_ref_start|> true
151647 <|object_ref_end|> true
151648 <|box_start|> true
151649 <|box_end|> true
151650 <|quad_start|> true
151651 <|quad_end|> true
151652 <|vision_start|> true
151653 <|vision_end|> true
151654 <|vision_pad|> true
151655 <|image_pad|> true
151656 <|video_pad|> true
151657 <tool_call> false
151658 </tool_call> false
151659 <|fim_prefix|> false
151660 <|fim_middle|> false
151661 <|fim_suffix|> false
151662 <|fim_pad|> false
151663 <|repo_name|> false
151664 <|file_sep|> false
151665 <tool_response> false
151666 </tool_response> false
151667 <think> false
151668 </think> false
Here:
- The first column is the token ID.
- The middle column is the string form of the token.
- The last column (`true`/`false`) is the Rust-tokenizer-level `special` flag. (paddlenlp.readthedocs.io)
What that flag does in the fast tokenizer:
- `special = true`
  - The token is treated as an indivisible "added token".
  - The pre-tokenizer will not split it into smaller pieces.
  - When you decode with `skip_special_tokens=True`, these tokens will be removed. (Hugging Face)
- `special = false`
  - The token is just an extra vocab token. It may still be one piece, but it does not get special handling in the tokenizer's decode / skip logic.
So:
What does the `added_tokens` list mean?
It is "all vocabulary items that were added on top of the base vocab", along with a low-level `special` flag that controls how the fast tokenizer tokenizes/decodes them.
It is not "the list of all special tokens from Transformers' point of view".
You can see this design in the Transformers code: higher-level add_special_tokens() calls down into the fast tokenizer and creates AddedToken objects with special=True, but there can also be added tokens that are not special. (gemfury.com)
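If you add tokens yourself, the same split is visible from Python. A hedged sketch (the token strings `<my_tag>` and `<my_control>` are made-up examples):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/qwen-checkpoint")  # hypothetical path

# A plain added token: lands in added_tokens with special=false.
tok.add_tokens(["<my_tag>"])

# A special added token: lands in added_tokens with special=true AND is
# registered as an additional special token in the high-level wrapper.
# (Note: depending on your transformers version, this may replace rather
# than extend an existing additional_special_tokens list.)
tok.add_special_tokens({"additional_special_tokens": ["<my_control>"]})

print("<my_tag>" in tok.all_special_tokens)      # False
print("<my_control>" in tok.all_special_tokens)  # True

# Low-level view: every added token with its AddedToken(..., special=...) flag.
print(tok.added_tokens_decoder)
```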
3. What tokenizer_config.json is doing
tokenizer_config.json is a wrapper configuration used by the Python transformers library. It does not contain the full vocab; it tells AutoTokenizer:
- Which tokenizer class to instantiate (`"tokenizer_class": "Qwen2Tokenizer"`).
- Which tokens are:
  - `bos_token`, `eos_token`, `pad_token`, `unk_token`, etc.
  - `additional_special_tokens` (custom special tokens).
- Behavior flags like `model_max_length`, `padding_side`, `add_prefix_space`, etc. (Hugging Face)
Your tokenizer_config.json says:
"eos_token": "<|im_end|>",
"pad_token": "<|vision_pad|>",
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>"
]
So from Transformers’ perspective:
- EOS = `<|im_end|>`
- PAD = `<|vision_pad|>`
- And these 13 tokens are "additional special tokens".
This information is also mirrored in special_tokens_map.json for many models, and both files are loaded by AutoTokenizer. (Hugging Face)
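Concretely, after loading with AutoTokenizer, those config entries surface as attributes (a sketch, hypothetical path again):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/qwen-checkpoint")  # hypothetical path

print(tok.eos_token, tok.eos_token_id)   # '<|im_end|>', 151645
print(tok.pad_token, tok.pad_token_id)   # '<|vision_pad|>', 151654
print(tok.additional_special_tokens)     # the 13 tokens listed above
print(tok.special_tokens_map)            # named specials + additional_special_tokens
```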
4. How tokenizer.all_special_tokens is computed
In the Transformers Python code, the SpecialTokensMixin class holds all the special-token attributes and exposes properties like all_special_tokens and all_special_ids. (Hugging Face)
Conceptually it does something like:
specials = []
for v in tokenizer.special_tokens_map_extended.values():
    # values are either a single token (eos_token, pad_token, ...) or the
    # list stored under additional_special_tokens; entries may be AddedToken objects
    if isinstance(v, list):
        specials.extend(v)
    else:
        specials.append(v)

# deduplicate while preserving order, reporting plain strings
all_special_tokens = list(dict.fromkeys(str(s) for s in specials))
Where special_tokens_map_extended is built from:
- `bos_token`, `eos_token`, `pad_token`, `unk_token`, etc.
- `additional_special_tokens` (and sometimes their legacy variants). (Hugging Face)
Crucially:
`all_special_tokens` never looks at the raw `added_tokens` list in `tokenizer.json`.
It only looks at named special tokens (`bos_token`, `eos_token`, `pad_token`, etc.) and `additional_special_tokens` stored in the config.
That is exactly why your all_special_tokens output is:
[
'<|im_end|>',
'<|vision_pad|>',
'<|im_start|>',
'<|object_ref_start|>',
'<|object_ref_end|>',
'<|box_start|>',
'<|box_end|>',
'<|quad_start|>',
'<|quad_end|>',
'<|vision_start|>',
'<|vision_end|>',
'<|image_pad|>',
'<|video_pad|>',
]
This is just:
- `eos_token` (`<|im_end|>`)
- `pad_token` (`<|vision_pad|>`)
- plus everything in `additional_special_tokens` (deduplicated).
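You can check this relation directly on your tokenizer. A small sketch (it assumes, as in your config, that only EOS/PAD plus `additional_special_tokens` are declared, with no BOS/UNK):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/qwen-checkpoint")  # hypothetical path

# all_special_tokens == the named specials (EOS, PAD here) + additional_special_tokens
expected = {tok.eos_token, tok.pad_token, *tok.additional_special_tokens}
assert set(tok.all_special_tokens) == expected

# ...and it is only a strict subset of the raw added-token vocabulary from tokenizer.json
added_strings = {t.content for t in tok.added_tokens_decoder.values()}
assert set(tok.all_special_tokens) < added_strings
```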
Notice:
- `<|endoftext|>` is not in `additional_special_tokens` and is not declared as EOS in `tokenizer_config.json`.
- Tool / FIM / `<think>` tokens are also not in `additional_special_tokens` and have `special=false` at the tokenizer level.
Therefore they do not appear in all_special_tokens. This is normal and also shows up in other models (e.g. LLaVA’s <image> token sometimes appears in added_tokens but not in all_special_tokens unless it was wired into additional_special_tokens). (Hugging Face Forums)
So:
Why is `all_special_tokens` different from the `added_tokens` list and from the `special=true` subset of it?
Because `all_special_tokens` is a higher-level view built from `tokenizer_config.json` (the special-tokens map and `additional_special_tokens`), while `added_tokens` is the raw vocabulary list (with a low-level `special` flag). They are related but intentionally not the same set.
5. Relationship between the three things you see
Let’s put your exact objects side-by-side.
5.1. added_tokens (fast tokenizer, low-level)
- Contains all tokens that were added after the base vocab, including:
  - Qwen control tokens: `<|endoftext|>`, `<|im_start|>`, `<|im_end|>`, `<|vision_*|>`, etc.
  - Tool tokens: `<tool_call>`, `<tool_response>`, `<think>`, etc.
  - FIM / repo tokens: `<|fim_*|>`, `<|repo_name|>`, `<|file_sep|>`.
- The trailing `true`/`false` is the Rust-layer "special" flag for tokenization behavior.
5.2. tokenizer_config.json (Transformers wrapper, high-level)
Defines:
eos_token = "<|im_end|>"pad_token = "<|vision_pad|>"additional_special_tokens =the 13 multimodal/chat tokens.
These become:
tokenizer.eos_token,tokenizer.pad_tokentokenizer.additional_special_tokens
and then feed into:
tokenizer.all_special_tokenstokenizer.all_special_ids
via SpecialTokensMixin. (Hugging Face)
5.3. tokenizer.all_special_tokens (Python view)
- Computed from `special_tokens_map` / `special_tokens_map_extended` (EOS, PAD, additional specials, etc.), not from the raw `added_tokens` list.

Hence you only see:

- `<|im_end|>`
- `<|vision_pad|>`
- and the 11 other additional special tokens.
<|endoftext|> and <tool_call> are not in that config, so they don’t appear even though they exist in added_tokens.
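You can confirm both halves of that statement (a sketch):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/qwen-checkpoint")  # hypothetical path

# Both tokens still have their own IDs in the vocabulary (from added_tokens)...
print(tok.convert_tokens_to_ids("<|endoftext|>"))   # 151643
print(tok.convert_tokens_to_ids("<tool_call>"))     # 151657

# ...but neither is reported by the high-level special-token view.
print("<|endoftext|>" in tok.all_special_tokens)    # False
print("<tool_call>" in tok.all_special_tokens)      # False
```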
6. Difference in roles: tokenizer.json vs tokenizer_config.json
You can think of it like this:
6.1 tokenizer.json = “how to actually tokenize text”
- Full definition of the fast tokenizer:
  - Vocabulary and merges (BPE/Unigram/etc.).
  - Normalizer, pre-tokenizer, post-processor.
  - `added_tokens` and their low-level `special` flag. (Hugging Face)
- Used by anything that needs the exact same tokenization behavior:
  - `PreTrainedTokenizerFast` in Python.
  - `transformers.js` in JavaScript. (Hugging Face)
  - Inference frameworks that load HF tokenizers directly (vLLM, etc.).
If you change this file, you are changing how raw text is split into IDs.
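That is also why `tokenizer.json` alone is enough to reproduce tokenization outside of Transformers, for instance with the `tokenizers` library directly (a sketch, hypothetical path):

```python
from tokenizers import Tokenizer

rust_tok = Tokenizer.from_file("path/to/qwen-checkpoint/tokenizer.json")  # hypothetical path

enc = rust_tok.encode("<|im_start|>user\nhi<|im_end|>")
print(enc.ids)      # added tokens come out as single IDs (151644 first, 151645 last)
print(enc.tokens)
```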
6.2 tokenizer_config.json = “how Transformers should treat this tokenizer”
- A small JSON that tells Transformers:
  - Which tokenizer class to use (`Qwen2Tokenizer`).
  - Which tokens are EOS, PAD, BOS, etc.
  - Which tokens are `additional_special_tokens`.
  - Max length, padding side, whether to add BOS by default, etc. (Hugging Face)
- Also now often stores:
  - `chat_template` (for `apply_chat_template`). (Hugging Face)
If you change this file, you are changing metadata and behavior inside Transformers, not the raw tokenization algorithm.
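Because these are wrapper-level settings, you can also override them at load time without touching `tokenizer.json`. A sketch (the `pad_token` override is just an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "path/to/qwen-checkpoint",   # hypothetical path
    pad_token="<|endoftext|>",   # re-declare PAD at the Transformers level only
)

print(tok.pad_token, tok.pad_token_id)   # '<|endoftext|>', 151643
# How raw text is split into IDs has not changed; only the metadata the wrapper reports has.
```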
6.3 Other ancillary files
Many HF model repos also contain:
- `special_tokens_map.json` – basically the same info as the `special_tokens_map` attribute: a mapping from names (`eos_token`, `pad_token`, `additional_special_tokens`) to actual strings. (Hugging Face)
- `added_tokens.json` – a separate, simpler listing of added tokens (often derived from `tokenizer.json`).
- `config.json` / `generation_config.json` – model config and default generation parameters, including `eos_token_id` and `pad_token_id`, which must be consistent with the tokenizer side. (Hugging Face)
When these files get out of sync (e.g. EOS ID in config.json vs EOS string in tokenizer_config.json vs tokenizer.json contents), you get classic bugs: generation not stopping, NaNs during training, etc. There are real Qwen bugs like this discussed in the wild.
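A quick consistency check you can run on any checkpoint (a sketch; it assumes config.json and generation_config.json sit next to the tokenizer files):

```python
from transformers import AutoConfig, AutoTokenizer, GenerationConfig

path = "path/to/qwen-checkpoint"   # hypothetical path
tok = AutoTokenizer.from_pretrained(path)
cfg = AutoConfig.from_pretrained(path)
gen = GenerationConfig.from_pretrained(path)

print("tokenizer eos:", tok.eos_token, tok.eos_token_id)
print("config.json eos_token_id:", cfg.eos_token_id)
print("generation_config.json eos_token_id:", gen.eos_token_id)
# If these disagree, generation may never emit the stop token the model was trained on.
```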
7. How to mentally understand special tokens in practice
A practical mental model that matches what you see:
- Vocabulary-level view (`tokenizer.json` / `added_tokens`)
  - "Which strings exist as single tokens?"
  - "Does the fast tokenizer treat them as special (never split, removable on decode)?"
- Transformers-level view (`tokenizer_config.json` / `special_tokens_map.json`)
  - "Which tokens does Transformers treat as EOS/PAD/BOS/CLS/SEP?"
  - "Which tokens are additional special tokens (`additional_special_tokens`)?"
  - This drives `all_special_tokens`, `all_special_ids`, `skip_special_tokens=True`, etc. (Hugging Face)
- Model/training-level view (chat template, data format)
  - "Which control tokens actually appear in the training data, and what do they mean?"
  - Qwen-style control tokens:
    - `<|im_start|>`, `<|im_end|>` – chat roles.
    - `<|vision_*|>`, `<|image_pad|>`, `<|video_pad|>` – multimodal.
    - `<tool_call>`, `<tool_response>`, `<think>` – tool + reasoning. (Qwen)

These three layers do not have to use the same subset of tokens, but they must be coherent for your use case.
For your specific tokenizer:
- `tokenizer.json` lists all of those control tokens in `added_tokens`.
- `tokenizer_config.json` chooses a subset as EOS / PAD / additional special tokens (mostly chat + vision).
- `tokenizer.all_special_tokens` is the union of EOS/PAD plus `additional_special_tokens`, hence the 13-token list you're seeing.
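One last sanity check that ties the layers together: decode behavior follows the same split. A sketch (the exact decoded string may vary slightly with cleanup settings):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/qwen-checkpoint")  # hypothetical path

ids = tok("<|im_start|>assistant\n<tool_call>{}</tool_call><|im_end|>").input_ids

# <|im_start|>/<|im_end|> are special at both levels and get stripped;
# <tool_call>/</tool_call> are plain added tokens and survive the decode.
print(tok.decode(ids, skip_special_tokens=True))
# -> roughly 'assistant\n<tool_call>{}</tool_call>'
```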
8. Summary
- `added_tokens` list
  - All tokens added on top of the base vocab, with a low-level `special` flag used by the fast tokenizer (`true` = never split, removable on decode).
  - Contains many control tags (`<tool_call>`, `<think>`, FIM tokens, etc.) even if Transformers doesn't treat them as "special".
- `tokenizer.json`
  - Full fast tokenizer definition (vocab, merges, normalizer, `added_tokens`).
  - Controls how text is split into tokens.
- `tokenizer_config.json`
  - High-level wrapper config for Transformers.
  - Declares which tokens are EOS/PAD/etc. and which are `additional_special_tokens`.
  - Controls what becomes `tokenizer.eos_token`, `tokenizer.pad_token`, `tokenizer.additional_special_tokens`, and ultimately `tokenizer.all_special_tokens`.
- `tokenizer.all_special_tokens`
  - Computed from the special tokens map (EOS/PAD/BOS/UNK/etc.) plus `additional_special_tokens`.
  - Does not read the full `added_tokens` list, so it is normal and expected that:
    - `all_special_tokens` ⊂ `added_tokens`, and
    - it may differ from the subset of `added_tokens` with `special=true`.
9. Good reference links (clickable via citations)
A few high-quality references you can read in full:
- Hugging Face docs – Tokenizer (high-level) and Fast tokenizers (low-level internals). (Hugging Face)
- Transformers internals – `PreTrainedTokenizerBase` / `SpecialTokensMixin` (how special tokens and `all_special_tokens` are implemented). (Hugging Face)
- Qwen docs – Key Concepts (explains regular vs control tokens in the Qwen family). (Qwen)
- HF forum thread – "`additional_special_tokens` are not added" (LLaVA `<image>` token missing from `all_special_tokens`, same pattern as your issue). (Hugging Face Forums)
- Example tokenizer configs – Qwen2-VL `tokenizer_config.json` (shows how Qwen actually declares EOS/PAD and additional special tokens). (Hugging Face)