Methodology

AXL v2.1 validated its alphabet against cl100k_base (OpenAI). v2.2 extends the audit across five tokenizer families using public documentation, NFKC normalization testing, and inference from model cards. Sources: OpenAI tiktoken repo, Google Gemma 3 blog (confirms SentencePiece for Gemini), Meta Llama docs (SentencePiece with BOS token), Mistral docs (V3 Tekken = tiktoken-based), xAI documentation (variable tokenizers across Grok versions), Anthropic docs (token counting documented, tokenizer undisclosed).
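The per-symbol audit described above can be sketched as a small harness. This is an illustrative sketch, not the project's actual tooling: `audit_symbol` and the stand-in byte-level encoder are assumptions; a real run would plug in tiktoken's cl100k_base `encode` or a SentencePiece model.

```python
import unicodedata

def audit_symbol(symbol, encode):
    """Audit one candidate symbol under a tokenizer.

    encode: a callable str -> list of token ids (e.g. tiktoken's encode).
    Returns the token count and whether the symbol survives NFKC unchanged.
    """
    return {
        "symbol": symbol,
        "token_count": len(encode(symbol)),
        "nfkc_stable": unicodedata.normalize("NFKC", symbol) == symbol,
    }

# Stand-in encoder for demonstration only: one token per UTF-8 byte.
byte_encode = lambda s: list(s.encode("utf-8"))

report = audit_symbol("|", byte_encode)
```

The same harness runs unchanged against any tokenizer that exposes an `encode` callable, which is how one audit can cover several families.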

Audit Table

Symbol      Role             cl100k (GPT)   Claude       SentencePiece (Llama/Gemini)  Tekken (Mistral V3)  Grok
|           field delimiter  1 token        likely 1     likely 1                      likely 1             likely 1
:           subfield         1 token        likely 1     likely 1                      likely 1             likely 1
.           op separator     1 token        likely 1     likely 1                      likely 1             likely 1
+           evidence chain   1 token        likely 1     likely 1                      likely 1             likely 1
$ @ # !     tags             1 token each   likely 1     likely 1                      likely 1             likely 1
~ ^         tags             1 token each   likely 1     1-2 tokens                    likely 1             unknown
OBS INF etc operations       1 token each   likely 1-2   likely 1-3                    likely 1-2           unknown
RE:         relation         stable         likely 1-2   likely 1-3                    likely 1-2           unknown
π           identity         1 token        unsafe       unsafe                        unsafe               unsafe
← → ↑ ↓     direction        1 token each   unsafe       unsafe                        unsafe               unsafe
ID:         ascii identity   1-2 tokens     safe         safe                          safe                 safe
<- =>       ascii direction  1-2 tokens     safe         safe                          safe                 safe
up down EQ  ascii direction  1 token each   safe         safe                          safe                 safe
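The unsafe/safe pairings in the table suggest a simple fallback pass for ASCII mode. The mapping below is an assumption reconstructed from the table rows (π → ID:, arrows → <- => up down); the exact replacement strings in a real AXL implementation may differ.

```python
# Hypothetical ASCII-mode fallback: non-ASCII symbols flagged "unsafe"
# in the audit table are replaced with their ASCII equivalents.
ASCII_FALLBACK = {
    "π": "ID:",   # identity marker
    "←": "<-",    # direction
    "→": "=>",
    "↑": "up",
    "↓": "down",
}

def to_ascii_mode(text):
    """Rewrite unsafe symbols into their table-listed ASCII forms."""
    for sym, repl in ASCII_FALLBACK.items():
        text = text.replace(sym, repl)
    return text
```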

NFKC Normalization

SentencePiece applies NFKC normalization. Key findings:
  • µ (micro sign, U+00B5) normalizes to μ (Greek mu, U+03BC) — breaks identity
  • π ← → ↑ ↓ survive NFKC — safe at normalization level, but may still multi-tokenize
  • All ASCII symbols survive NFKC unchanged
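The three findings above are directly checkable with Python's standard library; `survives_nfkc` is a helper name introduced here for illustration.

```python
import unicodedata

def survives_nfkc(s):
    """True if the string is unchanged by NFKC normalization."""
    return unicodedata.normalize("NFKC", s) == s

# µ (micro sign, U+00B5) folds to μ (Greek mu, U+03BC), breaking identity:
micro_ok = survives_nfkc("\u00b5")                              # False
# π and the arrows pass at the normalization level:
pi_ok = all(survives_nfkc(c) for c in "π←→↑↓")                  # True
# Every printable ASCII character survives NFKC unchanged:
ascii_ok = all(survives_nfkc(chr(c)) for c in range(0x21, 0x7f))  # True
```

Note this only tests the normalization layer; a symbol that survives NFKC may still multi-tokenize, which is what the audit table tracks.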

Conclusion

The universal-safe core is ASCII punctuation plus typed prefixes. Tokenizer neutrality is framed as orthographic invariance, not identical token counts. v2.2 ASCII mode guarantees parser correctness everywhere. Compression efficiency is profile-relative.
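The parser-correctness guarantee reduces to a one-line invariant: an ASCII-only serialization is byte-stable under every tokenizer family audited here, even though token counts vary by profile. A minimal sketch (the function name is illustrative):

```python
def is_ascii_safe(text):
    """v2.2 ASCII-mode invariant: the serialized form contains only
    ASCII characters, so no tokenizer's normalization can alter it."""
    return text.isascii()
```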