# What Is Tokenization and How AI Reads Your Text

When you type a sentence into a large language model, the artificial intelligence does not see individual letters, syllables, or words; instead, it processes "tokens," which are mathematically assigned numerical IDs representing statistically frequent chunks of text. This invisible translation layer dictates everything from how much an AI API costs to why highly advanced models inexplicably fail to count the number of letters in a simple word. Understanding tokenization provides the ultimate mechanical key to making sense of a language model's cognitive blind spots, its memory limits, and the inherent geographic biases baked into its architecture.

## The Illusion of Words: How Machines Perceive Text

To understand generative artificial intelligence, you must first abandon the assumption that machines process language in a manner even remotely similar to human cognition. When a human reads the word "elephant," the brain instantly activates an interconnected semantic network: concepts of large grey mammals, trunks, ivory, Africa, and memory are retrieved simultaneously. Humans read for meaning, continuously parsing letters into morphemes and words into structural grammar. 

When a large language model (LLM) like GPT-4 encounters the word "elephant," it perceives only discrete numerical identifiers (an ID such as `4321`, for illustration) drawn from a massive, predetermined vocabulary lookup table of roughly 100,000 entries [cite: 1]. The machine does not "know" what an elephant is at this stage. The actual semantic understanding only emerges later, deep within the neural network's embedding layers and transformer attention heads, where these numbers are mapped into high-dimensional vector space to determine their relationships with other numbers [cite: 1, 2]. But the very first step, the literal perception of the text, happens through tokenization.

This translation from text to numbers is an absolute structural necessity. Neural networks are fundamentally massive calculators; they operate exclusively on continuous mathematics, calculus, and linear algebra. Before a model can process, predict, or generate language, that language must be converted into discrete numerical inputs. Long before the invention of the modern transformer architecture in 2017, computer scientists and linguists debated the optimal methodology for breaking down human language into machine-readable units [cite: 1, 2]. 

### The Character Age and the Sequence Length Problem

In the early days of computing, the most logical approach to this problem was character-level tokenization. The story begins in the 1960s with the American Standard Code for Information Interchange (ASCII), which assigned a specific numerical value to every uppercase letter, lowercase letter, punctuation mark, and digit [cite: 1]. From a vocabulary management perspective, character-level tokenization possesses a beautiful, unassailable simplicity. An AI model would only need to memorize a tiny vocabulary of about 128 to 256 unique tokens to represent any text in the English language [cite: 1]. Because every word is just a combination of these base characters, the model would never encounter an "unknown" word. 

However, this method proved computationally devastating for early natural language processing (NLP). In a character-level model, sequences become extremely long. The simple sentence "The quick brown fox jumps over the lazy dog" requires 43 separate tokens (35 letters plus 8 spaces) just to be processed [cite: 1, 3]. A long document translates into hundreds of thousands of tokens. Early neural architectures, such as Recurrent Neural Networks (RNNs), lacked the computational bandwidth to handle this. As they processed a text character by character, they suffered from the "vanishing gradient problem": by the time the network reached the end of a long sentence, the training signal tied to the beginning of the sentence had shrunk toward zero, causing the model to effectively forget how the sentence started [cite: 1]. Furthermore, individual characters carry no semantic weight. The letter "c" communicates absolutely nothing about a feline; the model wastes immense computational resources just trying to learn that c-a-t forms a coherent concept before it can even begin to learn grammar [cite: 1, 4].
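To make the sequence-length cost concrete, here is a minimal sketch of character-level tokenization. The function name `char_tokenize` is hypothetical, and using Unicode code points as IDs is an assumption for illustration (for plain English text they coincide with ASCII values).

```python
def char_tokenize(text: str) -> list[int]:
    """Map every character, including spaces, to its code point."""
    return [ord(ch) for ch in text]


sentence = "The quick brown fox jumps over the lazy dog"
ids = char_tokenize(sentence)

print(len(ids))  # 43: one token per character (35 letters + 8 spaces)
print(ids[:3])   # [84, 104, 101] for "T", "h", "e"
```

Every ID stays below 128, so the vocabulary is tiny, but the sequence is as long as the text itself.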

### The Word-Level Explosion

To solve the sequence length problem, researchers pivoted to the opposite extreme: word-level tokenization. This approach seems highly intuitive because words are the natural, semantic units of human language [cite: 1, 5]. If models learned entire words as single tokens, sequence lengths would be drastically reduced. "The quick brown fox" would be just four tokens. This logic powered an entire generation of early NLP systems, including Bag of Words models, TF-IDF systems, and early Word2Vec embeddings [cite: 1, 6].

Yet, word-level tokenization introduced a fatal flaw: vocabulary explosion [cite: 7]. In a word-level system, a model requires a unique, discrete token ID for every possible variation of a word. "Run," "runs," "running," and "ran" are treated as completely unrelated mathematical entities, forcing the model to learn the semantic meaning of each one from scratch [cite: 4, 7, 8]. When you factor in proper nouns, scientific jargon, typos, and newly coined internet slang, the required vocabulary size balloons into the millions. Managing an embedding table with millions of entries requires crippling amounts of memory. 

Worse, if a word-level model encounters a string of text it was not explicitly trained on, it has no fallback mechanism. It simply outputs a generic `<UNK>` (unknown) token, effectively blinding the AI to new expressions and severely degrading its utility in real-world applications [cite: 7, 9, 10, 11]. 
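The failure mode can be sketched in a few lines. This is a toy illustration, not how any particular system was built: the helper names `build_vocab` and `word_tokenize` are hypothetical, and splitting on whitespace is a simplifying assumption.

```python
UNK = "<UNK>"


def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign one ID per distinct whitespace-separated word."""
    vocab = {UNK: 0}
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def word_tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Look up each word; anything unseen collapses to the <UNK> ID."""
    return [vocab.get(word, vocab[UNK]) for word in text.lower().split()]


vocab = build_vocab(["the dog runs", "the dog ran"])
print(vocab)                                    # {'<UNK>': 0, 'the': 1, 'dog': 2, 'runs': 3, 'ran': 4}
print(word_tokenize("the dog running", vocab))  # [1, 2, 0]
```

Note that "runs" and "ran" receive unrelated IDs, and "running" is lost entirely even though it shares a stem with both.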

## The Subword Solution: Byte-Pair Encoding

Modern LLMs require a pragmatic compromise: a tokenization system that keeps the total vocabulary size small and manageable (like character tokenization) while preserving the high semantic density and short sequence lengths of word tokenization [cite: 4, 7, 9]. 

The breakthrough that powers almost every major contemporary language model—including OpenAI's GPT-4, Anthropic's Claude, and Meta's Llama—is subword tokenization, driven primarily by an algorithm called Byte-Pair Encoding (BPE) [cite: 1, 9, 11, 12]. BPE was originally developed as a simple data compression algorithm for general computer files, but in 2015, Rico Sennrich and his colleagues at the University of Edinburgh adapted it for NLP [cite: 1]. 

Byte-Pair Encoding operates on a brilliant, greedy statistical logic. Rather than telling the machine what a word is, the algorithm discovers the most efficient way to parse language by looking at raw data. The process works as follows:
1. The algorithm starts with a base vocabulary of individual characters.
2. It scans a massive training corpus and counts the frequencies of all adjacent pairs of characters. 
3. It identifies the most frequent adjacent pair and merges it into a new, single token [cite: 9, 13]. 
4. It updates the corpus with this new token and repeats the process.

For example, if the algorithm repeatedly sees the letters "h" and "e" next to each other, it merges them into a single token: "he". In the next iteration, if "t" and "he" frequently appear together, it merges them into "the" [cite: 7]. The algorithm repeats this process tens of thousands of times until it reaches a target vocabulary size defined by the developer, typically between 32,000 and 200,000 unique tokens [cite: 10, 13, 14]. 
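The merge loop described above can be sketched as a toy trainer. This is a simplified illustration under stated assumptions: the function name `train_bpe` is hypothetical, the corpus is a short word list rather than a massive dataset, merges never cross word boundaries, and ties are broken by first occurrence.

```python
from collections import Counter


def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Greedy step: pick the most frequent pair and record it as a merge rule.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus so each occurrence of the pair becomes one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


merges = train_bpe(["the", "the", "the", "then", "he", "hen"], num_merges=2)
print(merges)  # [('h', 'e'), ('t', 'he')]
```

On this tiny corpus the first merge is "h" + "e" and the second is "t" + "he", mirroring the example in the text. A production tokenizer runs the same idea for tens of thousands of iterations over a far larger corpus.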

This creates a highly optimized hybrid vocabulary. Highly frequent words—like "The," "apple," or "computer"—are merged into single, whole-word tokens, which keeps the sequence length short and efficient [cite: 9, 11, 15]. However, if the model encounters a rare word, a complex medical term, or a new slang word, it does not crash or throw an `<UNK>` error. Instead, it gracefully decomposes the unfamiliar word into smaller, highly recognizable subword chunks [cite: 9, 11]. For instance, the word "unhappiness" might be split into the tokens `un`, `happi`, and `ness`. The model can infer the meaning of the rare word by relying on the semantic weight of these common prefixes and suffixes, allowing it to "understand" words it has never explicitly seen before [cite: 16, 17].

### Byte-Level BPE: True Universality

In 2019, the release of GPT-2 introduced a critical evolution to this algorithm: byte-level BPE [cite: 1]. Rather than starting with Unicode characters, which can be messy and massive, byte-level BPE drops down to the foundational building blocks of computing: raw UTF-8 bytes [cite: 11, 18]. 

In UTF-8 encoding, any character that exists in any language can be represented by a sequence of 1 to 4 bytes. By setting the base vocabulary to all 256 possible byte values, developers ensured that the tokenizer could represent absolutely any text that exists or will ever exist—including rare alphabets, obscure mathematical symbols, and novel emojis [cite: 1, 18]. If a byte-level BPE tokenizer encounters a completely alien symbol, it simply falls back to rendering it as a sequence of base bytes. Because every text can be represented as bytes, and all bytes are in the vocabulary, the dreaded `<UNK>` token was effectively eradicated from modern architecture [cite: 1]. 
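A short sketch shows why a 256-entry byte vocabulary never needs an unknown token. The helper names `byte_encode` and `byte_decode` are hypothetical; the example covers only the fallback layer, before any learned merges are applied.

```python
def byte_encode(text: str) -> list[int]:
    """Represent text as raw UTF-8 byte values, each in the range 0..255."""
    return list(text.encode("utf-8"))


def byte_decode(ids: list[int]) -> str:
    """Invert byte_encode; lossless for any valid string."""
    return bytes(ids).decode("utf-8")


for sample in ["A", "é", "中", "🙂"]:
    ids = byte_encode(sample)
    # Byte counts are 1, 2, 3, and 4 respectively.
    print(sample, len(ids), byte_decode(ids) == sample)
```

Because every possible byte value already has an ID, any string round-trips exactly, at the cost of longer sequences for characters outside the ASCII range.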

OpenAI's open-source tokenizer library, `tiktoken`, relies heavily on byte-level BPE. Its core is written in the Rust programming language for speed and exposed through Python bindings. `tiktoken` uses precomputed vocabulary tables (such as `cl100k_base` for GPT-4 and `o200k_base` for GPT-4o) to rapidly encode text into arrays of token IDs before they reach the neural network, and to decode the model's output IDs back into text [cite: 1, 14, 19, 20].

## Alternative Tokenization: WordPiece and SentencePiece

While BPE dominates the generative AI landscape, it is not the only subword algorithm in production. Understanding the alternatives is critical for AI engineers, as swapping a tokenizer fundamentally breaks a model's performance [cite: 14].

**WordPiece**
Developed by Google and famously used in the BERT (Bidirectional Encoder Representations from Transformers) family of models, WordPiece shares BPE's bottom-up merging philosophy but changes the mathematical criteria for *how* merges are selected [cite: 10, 11]. While BPE is a "greedy" algorithm that blindly merges the most frequent pairs, WordPiece measures the likelihood of the training data. It evaluates a potential merge by dividing the frequency of the pair by the product of the individual frequencies of its parts [cite: 11]. 

In simpler terms, WordPiece asks: "Is this combination of characters occurring far more often than we would expect by random chance?" If the tokens "g" and "s" appear together vastly more than their independent frequencies would predict, WordPiece merges them [cite: 11]. WordPiece also uses a specific formatting quirk, adding a `##` prefix to any subword that does not appear at the beginning of a word, helping the model keep track of word boundaries [cite: 10, 14].
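The difference between the two selection rules can be seen on a toy corpus. This is a sketch under simplifying assumptions: the function name `pair_scores` is hypothetical, the word counts are invented for illustration, and the `##` continuation prefix is omitted so that only the scoring rule is shown.

```python
from collections import Counter


def pair_scores(words: list[str]) -> dict[tuple[str, str], tuple[int, float]]:
    """Return, for each adjacent pair, (raw count, count / (count_a * count_b))."""
    corpus = Counter(tuple(word) for word in words)
    symbol_counts, pair_counts = Counter(), Counter()
    for symbols, freq in corpus.items():
        for symbol in symbols:
            symbol_counts[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq
    return {
        pair: (count, count / (symbol_counts[pair[0]] * symbol_counts[pair[1]]))
        for pair, count in pair_counts.items()
    }


# Invented toy corpus: word frequencies chosen only to make the contrast visible.
words = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
scores = pair_scores(words)

bpe_choice = max(scores, key=lambda pair: scores[pair][0])        # highest raw count
wordpiece_choice = max(scores, key=lambda pair: scores[pair][1])  # highest normalized score

print(bpe_choice)        # ('u', 'g')
print(wordpiece_choice)  # ('g', 's')
```

The pair "u" + "g" is the most frequent, so a BPE-style rule picks it. The pair "g" + "s" is rarer, but "s" almost never appears without a preceding "g", so the normalized score favors it.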

**SentencePiece and Unigram**
Another Google innovation, SentencePiece, is designed for true language independence. Most tokenizers rely on a "pre-tokenization" step, where text is split by spaces and punctuation before the subword algorithm is even applied [cite: 12, 21]. However, relying on whitespace is a deeply Western-centric assumption; languages like Mandarin Chinese and Japanese do not use spaces between words [cite: 19, 21]. 

SentencePiece treats the space itself as just another normal character (it replaces each space with the visible marker `▁`, U+2581, which resembles an underscore), completely bypassing the need for whitespace pre-tokenization [cite: 14, 19, 22]. This makes it well suited to multilingual models. SentencePiece is often paired with the Unigram language model algorithm, a "top-down" approach that starts with a massively bloated vocabulary of all possible substrings and iteratively removes the least useful tokens until it hits its target size, balancing the probabilities along the way [cite: 1, 19, 23].
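The whitespace handling can be illustrated without the library itself. The sketch below is not the SentencePiece API: `to_symbols` and `from_symbols` are hypothetical helpers that show only the idea of turning spaces into an ordinary symbol (the marker character U+2581) so that the original text can be rebuilt exactly.

```python
SPACE_MARK = "\u2581"  # visible stand-in for a space; it looks like a low underscore


def to_symbols(text: str) -> list[str]:
    """Turn spaces into a regular symbol, then split into single characters."""
    return list(text.replace(" ", SPACE_MARK))


def from_symbols(symbols: list[str]) -> str:
    """Rebuild the original string, restoring spaces from the marker."""
    return "".join(symbols).replace(SPACE_MARK, " ")


for text in ["the lazy dog", "敏捷的棕色狐狸"]:
    symbols = to_symbols(text)
    print(symbols, from_symbols(symbols) == text)
```

The Chinese sample passes through unchanged because it contains no spaces, and the English sample keeps its word boundaries as explicit symbols. A subword algorithm can then learn merges over both without a language-specific splitting step.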

### The Danger of Glitch Tokens

Because tokenizers are trained on massive datasets using statistical algorithms, they occasionally capture bizarre anomalies that embed themselves permanently into a model's architecture. 

In 2023, researchers discovered a phenomenon known as "glitch tokens" across several major models, including GPT-3, GPT-2, and LLaMA [cite: 18, 24]. The most infamous example was a token corresponding to the string `SolidGoldMagikarp`. When users prompted GPT-3 to repeat or explain this string, the model would hallucinate wildly, evade the question, or spout incoherent garbage [cite: 24]. 

This catastrophic failure occurred because of a misalignment between the data used to train the tokenizer and the data used to train the actual language model. The tokenizer had ingested massive amounts of raw Reddit data, where "SolidGoldMagikarp" was a highly active user. Because the string appeared so frequently, the BPE algorithm dutifully merged it into a single, permanent token ID [cite: 18]. However, when developers subsequently filtered and cleaned the dataset to train the actual neural network, they removed much of that Reddit data. 

As a result, the token ID existed in the model's vocabulary, but the model had almost no training examples to learn what it meant. The embedding vector for `SolidGoldMagikarp` remained randomly initialized [cite: 18, 24]. When a user input that specific string, it injected pure mathematical noise into the neural network's attention mechanisms, causing the system to derail. This highlights a critical rule in AI development: tokenizer vocabularies must perfectly reflect the distributions of the model's training data [cite: 12, 24].

## The Strawberry Problem: Why AI Cannot Count

Because language models perceive the world strictly through the lens of subword tokens, they suffer from profound, structural cognitive blind spots regarding the granular makeup of text. This architectural quirk is the root cause of the viral "strawberry problem" that has plagued nearly all modern language models.

When a user asks standard models like GPT-4 or Claude, "How many 'r's are in the word strawberry?", the model often confidently and incorrectly answers "two" [cite: 25, 26, 27]. To a human, this seems like an absurd failure of basic intelligence. To an AI engineer, it is a perfectly logical outcome of BPE tokenization.

When the word "strawberry" is passed through OpenAI's `cl100k_base` tokenizer, the model never sees the ten discrete letters s-t-r-a-w-b-e-r-r-y. Because "strawberry" is a relatively common word, the tokenizer has learned to compress it. It splits the word into three distinct token IDs representing the chunks `str`, `aw`, and `berry` (specifically, token IDs `496`, `675`, and `15717`) [cite: 25].

From the model's perspective, "strawberry" is just a sequence of three opaque mathematical objects [cite: 24, 25]. Asking the model to count the individual letters inside those tokens is akin to asking a human to count the number of atoms in a brick while only being allowed to look at a completed wall [cite: 15, 28]. The granular, character-level information simply does not exist in the model's direct line of sight; the tokenizer erased it before the neural network could analyze it. 
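A small sketch makes the gap visible. The three pieces and their IDs are copied from the text above and have not been re-verified against the real vocabulary; the `pieces` dictionary is a hypothetical stand-in for the tokenizer's lookup table.

```python
# Pieces and IDs as quoted in the text; treat them as illustrative.
pieces = {496: "str", 675: "aw", 15717: "berry"}
token_ids = [496, 675, 15717]

# The network receives only these integers. Letters reappear only after
# decoding through the vocabulary table, which happens outside the network.
word = "".join(pieces[t] for t in token_ids)
per_piece = [pieces[t].count("r") for t in token_ids]

print(token_ids)        # [496, 675, 15717]
print(word)             # strawberry
print(per_piece)        # [1, 0, 2]
print(word.count("r"))  # 3
```

Counting is trivial once the characters are available. The difficulty is that the correct total is spread across two separate IDs, neither of which exposes its spelling to the model.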


This token boundary blindness severely impacts other domains, most notably mathematics and arithmetic. Language models often fail at multiplying large, multi-digit numbers (like 2-digit by 5-digit multiplication) not because they lack deductive logic, but because the numbers are arbitrarily sliced into subword chunks based on frequency. For example, the number `12345` might be tokenized into `123` and `45`, or `12` and `345` [cite: 13]. The mathematical carry-overs cross token boundaries that the model never explicitly sees, forcing the neural network to memorize complex patterns of chunked numbers rather than learning the underlying algorithmic rules of addition and multiplication [cite: 28, 29]. 

### The Word Count Dilemma

Tokenization is also the primary reason why large language models are notoriously terrible at following strict length constraints, such as "write exactly 200 words for this essay." 

To a human, counting 200 words is trivial. But an LLM does not possess an internal counter; it operates purely as an autoregressive sequence predictor, calculating the probability of the next optimal token based on the previous tokens [cite: 2, 30, 31]. The relationship between tokens and words is loose and highly variable. On average, one token equals roughly 0.75 words in standard English [cite: 18, 32]. However, this ratio changes wildly depending on punctuation, formatting, capitalization, and the rarity of the vocabulary used. 

Asking a language model to stop generating text at exactly 200 words is effectively asking a system that thinks in subword probabilities to dynamically monitor a secondary, invisible metric (word boundaries) while simultaneously executing complex semantic generation. As one metaphor accurately describes it, it is like asking a jazz musician to improvise a beautiful solo but forcing them to stop on exactly the 137th note without being allowed to count [cite: 33]. The model has no structural mechanism to compute a global word constraint over an exponentially large space of possible token sequences [cite: 30, 33].

## Enter OpenAI o1: Bypassing Token Limits with Reasoning

To bypass the structural limitations imposed by subword tokenizers, AI researchers have begun developing models that integrate "Chain of Thought" (CoT) reasoning directly into the inference generation process. The most prominent example of this is OpenAI's o1 model series, released in September 2024 (previously rumored under the codenames Q* and Project Strawberry) [cite: 25, 26, 34]. 

Rather than generating an immediate answer using rapid, intuitive "System 1" prediction, the o1 model is trained to engage in slower, analytical "System 2" deliberation [cite: 34, 35]. When given a complex prompt, o1 does not immediately output text to the user. Instead, it utilizes a hidden "scratchpad" to generate a sequence of internal reasoning tokens [cite: 34, 36, 37]. 

This process was achieved through a training breakthrough known as process supervision and reinforcement learning. Traditional models use outcome supervision, where the AI is rewarded solely if its final answer is correct. In process supervision, the model is rewarded for each correct, logical step it takes in a chain of thought, teaching it to break down complex tasks, recognize its own errors, backtrack, and try alternative strategies [cite: 27, 38].

Crucially, this hidden reasoning stage allows the o1 model to circumvent the "strawberry problem." When asked to count the letters in "strawberry," the model does not attempt to guess based on the opaque token IDs `496`, `675`, and `15717`. Instead, it uses its reasoning tokens to explicitly spell out the word in its hidden context window (e.g., "S - T - R - A - W..."), effectively forcing the tokenizer to process each letter as a distinct token [cite: 25, 27]. By laying the characters out individually, the model's self-attention mechanism can easily iterate over them, tally the 'r's, verify its logic, and then confidently output the correct answer [cite: 25, 27, 39]. 

While this chain-of-thought architecture vastly improves performance in mathematics, physics, and coding—allowing o1 to exceed human PhD-level accuracy on benchmarks like GPQA—it comes with significant trade-offs [cite: 36, 40]. Reasoning tokens consume test-time compute. They are slower to generate, and because these invisible tokens are injected directly into the model's context window, they consume the user's available token budget and drive up inference costs significantly [cite: 34, 36, 41]. 

## The AI Language Tax: Tokenization's Geographic Bias

While byte-level BPE is computationally brilliant for English, its greedy, data-driven nature has created a severe global inequity in the AI ecosystem. This phenomenon is widely known among researchers as the "tokenization penalty" or the "language tax" [cite: 23, 42]. 

Tokenizer vocabularies are constructed by scanning massive training corpora to find the most statistically frequent character pairings. However, the datasets used to train models like GPT-4 or Llama 3 are overwhelmingly biased toward English and programming languages (for instance, Llama 3's training data was 95% English and code, with only 5% dedicated to all other world languages combined) [cite: 42, 43, 44]. 

As a result, tokenizer vocabularies are saturated with English prefixes, suffixes, and whole words. In English, a common word usually maps cleanly to a single token. But for non-Latin scripts, low-resource languages, or languages with complex morphology—such as Arabic, Hindi, Burmese, or Mandarin—the tokenizer lacks the statistical frequency to have learned their common words [cite: 4, 42, 43, 45]. 

Consequently, when a user prompts a model in a non-Western language, the tokenizer shatters the text into tiny, inefficient fragments. Because Arabic or Hindi characters live in high Unicode blocks, they require multiple UTF-8 bytes to represent a single character [cite: 43, 45, 46]. An English-first tokenizer, having never seen these characters frequently enough to merge them, defaults to a byte-level fallback, splitting a single non-Latin character into two, three, or even four separate tokens [cite: 45, 46]. 
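The byte overhead behind this fallback can be measured directly. In this sketch, `utf8_profile` is a hypothetical helper and the sample words are chosen only for illustration. The bytes-per-character ratio is the worst case for a byte-level tokenizer that has learned no merges for a script.

```python
def utf8_profile(text: str) -> tuple[int, int]:
    """Return (code points, UTF-8 bytes), ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    n_bytes = sum(len(ch.encode("utf-8")) for ch in chars)
    return len(chars), n_bytes


samples = {"English": "dog", "Arabic": "كلب", "Hindi": "कुत्ते", "Mandarin": "狗"}

for language, text in samples.items():
    n_chars, n_bytes = utf8_profile(text)
    # ASCII letters take 1 byte each, Arabic letters 2, Devanagari and CJK 3.
    print(f"{language}: {n_chars} code points, {n_bytes} bytes, ratio {n_bytes / n_chars:.1f}")
```

Learned merges reduce this overhead for well-represented languages, which is exactly what an English-heavy training corpus fails to provide for the others.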

The impact of this bias is measurable and financially punishing. When we translate a simple English sentence into various languages and process it through OpenAI's `cl100k_base` tokenizer, the token count explodes for non-Western scripts [cite: 19, 28].

### Token Count Comparison by Language

| Language | Translated Sentence (Meaning: "The quick brown fox jumps over the lazy dog") | Token Count | Cost Multiplier |
| :--- | :--- | :--- | :--- |
| **English** | The quick brown fox jumps over the lazy dog. | 10 | 1.0x |
| **Spanish** | El zorro marrón rápido salta sobre el perro perezoso. | 11 | 1.1x |
| **German** | Der schnelle braune Fuchs springt über den faulen Hund. | 13 | 1.3x |
| **Mandarin** | 敏捷的棕色狐狸跳过懒狗。 | 18 | 1.8x |
| **Japanese** | 素早い茶色の狐は怠け者の犬を飛び越えた。 | 23 | 2.3x |
| **Hindi** | तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है। | 41 | 4.1x |
| **Burmese** | မြန်ဆန်သော အညိုရောင်မြေခွေးသည် ပျင်းရိသောခွေးကို ခုန်ကျော်သွားသည်။ | 58 | 5.8x |

*Note: Token counts are approximate, based on standard BPE implementations for modern LLMs (e.g., GPT-4/o1). The ratios illustrate the fragmentation penalty [cite: 18, 19, 28].*


This token over-fragmentation acts as a silent tax on the developing world [cite: 23]. Because commercial AI API providers (like OpenAI, Anthropic, and Google) bill customers based strictly on the volume of tokens processed, a sovereign government or enterprise operating in Hindi or Arabic pays drastically more to process the exact same semantic information as an English counterpart [cite: 23, 42, 46].
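The billing effect follows directly from the token counts in the comparison table. In the sketch below, the counts are copied from that table, while `PRICE_PER_MILLION_TOKENS` is a hypothetical price chosen only to make the arithmetic concrete; it is not any provider's actual rate.

```python
# Approximate token counts for the same sentence, taken from the table above.
TOKEN_COUNTS = {
    "English": 10, "Spanish": 11, "German": 13, "Mandarin": 18,
    "Japanese": 23, "Hindi": 41, "Burmese": 58,
}
PRICE_PER_MILLION_TOKENS = 5.00  # hypothetical USD price, for illustration only


def cost_multiplier(language: str, baseline: str = "English") -> float:
    """How many times more tokens a language needs than the baseline."""
    return TOKEN_COUNTS[language] / TOKEN_COUNTS[baseline]


def cost_usd(language: str, requests: int) -> float:
    """Cost of sending the sentence `requests` times at the assumed price."""
    total_tokens = TOKEN_COUNTS[language] * requests
    return total_tokens * PRICE_PER_MILLION_TOKENS / 1_000_000


for language in ("English", "Hindi", "Burmese"):
    print(language, cost_multiplier(language), cost_usd(language, 1_000_000))
```

At the assumed price, one million English requests cost 50 dollars, while the same requests in Hindi cost 205 dollars for identical meaning.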

Furthermore, this inefficiency actively degrades model performance. NLP researchers measure this using two metrics: Tokenization Parity (TP) and Information Parity (IP) [cite: 45]. When a language suffers from poor TP, the model is forced to dedicate vast amounts of its computational depth just to assemble basic linguistic concepts from raw byte fragments. This leaves the model with less attention bandwidth available for higher-level semantic reasoning, resulting in slower inference latency and degraded performance in complex tasks like reading comprehension or summarization [cite: 43, 44, 45]. 

Recognizing this critical bottleneck, AI developers are actively attempting to mitigate the language tax. Newer generations of models have drastically expanded their token vocabularies. Meta's Llama 3 increased its vocabulary to 128,000, while OpenAI's `o200k_base` tokenizer expanded to 200,000 tokens [cite: 15, 22]. Google's Gemma 3 tokenizer utilizes a 256,000 vocabulary size specifically to keep African and Indic language tokenization penalties under a 3x multiplier [cite: 19]. Similarly, regional models like Qwen (optimized for Chinese) and Falcon (optimized for Arabic) rely on custom vocabularies built specifically around those character distributions, resulting in vastly improved regional compression [cite: 15, 46]. 

## Context Windows and the Memory Bottleneck

Understanding how tokenization compresses (or fragments) language is vital for navigating the operational limits of AI systems, most notably the "context window." 

A large language model is inherently stateless. It possesses no persistent, episodic memory of past interactions [cite: 47, 48]. When you are chatting with an AI, the only reason it remembers what you said three messages ago is because the application interface invisibly re-sends the entire history of the conversation back into the model with every new prompt [cite: 48, 49]. The context window is the absolute maximum amount of text—measured strictly in tokens, not words—that the model can hold in its working memory at a single time [cite: 47, 48, 49].

As sequences grow longer, processing them becomes computationally expensive. To prevent the model from recalculating the mathematics of every previous token from scratch each time a new word is generated, models utilize a Key-Value Cache (KV Cache) [cite: 46, 50]. The KV cache temporarily stores the attention calculations for past tokens. However, this cache requires significant hardware VRAM. 
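A back-of-the-envelope estimate shows why the cache is memory-hungry. The formula below covers standard attention, which stores one key and one value vector per layer, per head, per token. The model dimensions are hypothetical and do not describe any specific model, and `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # 2 bytes per number assumes 16-bit floats
    batch_size: int = 1,
) -> int:
    """Memory for cached keys and values; the leading 2 covers keys plus values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


# Hypothetical configuration: 32 layers, 32 key/value heads, 128 dimensions per head.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full_window = kv_cache_bytes(32, 32, 128, seq_len=128_000)

print(per_token / 2**20)    # 0.5  (MiB per token)
print(full_window / 2**30)  # 62.5 (GiB for a 128,000-token window)
```

Because the cost is linear in sequence length, every extra token produced by inefficient tokenization consumes cache memory, whether or not it adds meaning.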

When tokenization is inefficient, the context window fills up rapidly, burning through the KV cache. For example, if a model has a 128,000 token limit, it can hold roughly 96,000 standard English words [cite: 18]. But if a user inputs complex medical terminology, raw source code, or a non-Latin language like Arabic, the tokenizer fractures the text [cite: 1, 18, 43]. Medical jargon might inflate token counts by 50%. The window still holds 128,000 tokens, but they now cover only about 64,000 words, the same amount of text that would fit in roughly 85,000 tokens of ordinary English [cite: 1]. Once the absolute token limit is breached, older information is dropped or truncated, leading to the frustrating phenomenon where an AI suddenly "forgets" foundational instructions provided at the beginning of a long session [cite: 48, 49].
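The arithmetic behind these figures is short enough to write out. The sketch assumes the rule of thumb of 0.75 words per token quoted earlier; `words_that_fit` is a hypothetical helper, and `inflation` is the factor by which a given kind of text needs more tokens than ordinary English.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for standard English, from the text


def words_that_fit(context_tokens: int, inflation: float = 1.0) -> float:
    """Words a window can hold when text needs `inflation` times more tokens."""
    return context_tokens * WORDS_PER_TOKEN / inflation


print(words_that_fit(128_000))                 # 96000.0 words of ordinary English
print(words_that_fit(128_000, inflation=1.5))  # 64000.0 words of jargon-heavy text
print(round(128_000 / 1.5))                    # 85333 English-equivalent tokens
```

The token limit itself never changes; what shrinks is the amount of real content each token carries.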

To circumvent this hard mathematical limit, developers employ Retrieval-Augmented Generation (RAG). Instead of pasting a massive, 500-page PDF directly into the prompt (which would exhaust the context window and the token budget), a RAG system uses a vector database to search the document for only the most relevant paragraphs [cite: 48, 50]. It then injects those highly specific tokens directly into the LLM's context window alongside the user's question, providing the model with the exact information it needs without overflowing its working memory [cite: 48].
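The retrieval step can be sketched without a vector database. The toy below swaps embedding search for plain word overlap, which is an assumption made to keep the example self-contained. The function names, the sample passages, and the 0.75 words-per-token estimate are all illustrative.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from the 0.75 words-per-token rule of thumb."""
    return round(len(text.split()) / 0.75)


def score(query: str, passage: str) -> float:
    """Fraction of query words that also appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / (len(query_words) or 1)


def retrieve(query: str, passages: list[str], token_budget: int) -> list[str]:
    """Take the best-scoring passages that still fit inside the token budget."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    chosen, used = [], 0
    for passage in ranked:
        cost = estimate_tokens(passage)
        if used + cost <= token_budget:
            chosen.append(passage)
            used += cost
    return chosen


def build_prompt(query: str, passages: list[str], token_budget: int) -> str:
    """Place only the retrieved passages in front of the user's question."""
    context = "\n".join(retrieve(query, passages, token_budget))
    return f"Context:\n{context}\n\nQuestion: {query}"


passages = [
    "Byte-level BPE starts from 256 byte values.",
    "The KV cache stores attention keys and values for past tokens.",
    "Top-p sampling keeps the smallest set of tokens above a cumulative probability.",
]
print(build_prompt("what does the kv cache store", passages, token_budget=20))
```

With a budget of 20 estimated tokens, only the KV cache passage is included. A real system would rank passages by embedding similarity and count tokens with the model's actual tokenizer.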

## From Probabilities to Text: Decoding Strategies

Once the model has ingested the tokenized prompt and processed it through its transformer layers, it must output an answer. However, the model does not output a single, definitive text string; it outputs a massive probability distribution, scoring every single token in its 100,000+ vocabulary on how likely it is to be the next token [cite: 51]. The process of converting these probabilities back into human-readable text is called "decoding."

If the model simply selected the single most probable token every time—a method known as "greedy decoding"—the resulting text would be incredibly flat, repetitive, and robotic (e.g., getting stuck in loops like "I am sorry I am sorry I am sorry") [cite: 51]. 

To introduce natural variation and creativity, engineers apply sampling algorithms that dictate how the model chooses from the distribution of probabilities [cite: 51]:

*   **Top-K Sampling:** The model restricts its choice to only the *K* most likely tokens (e.g., the top 50 choices). It completely ignores all other tokens, preventing it from selecting bizarre or hallucinatory words [cite: 51].
*   **Top-p (Nucleus) Sampling:** Instead of a fixed number of tokens, the model looks at cumulative probability. If *p* is set to 0.90, the model considers only the smallest set of most likely tokens whose combined probability reaches at least 90%. This dynamically adapts: if the model is highly confident, it might only consider 2 tokens; if it is uncertain, it might consider 40 tokens [cite: 51].
*   **Min-p Sampling:** A newer, highly effective strategy popular in open-source frameworks. Rather than using a fixed probability mass, Min-p looks at the probability of the absolute top token and sets a dynamic cutoff threshold. If the top token has a 60% probability, and Min-p is set to 0.1 (10%), it throws away any token that has less than a 6% chance (10% of 60%) [cite: 51].
*   **Temperature:** This is a scalar that divides the raw scores (logits) before they are turned into probabilities. A low temperature (e.g., 0.2) sharpens the distribution, making the most likely tokens even more dominant, resulting in strict, analytical output. A higher temperature (e.g., 0.9) preserves more of the original spread, and values above 1.0 flatten the distribution further, giving lower-probability tokens a higher chance of being selected, which increases creativity but also the risk of hallucinations [cite: 51].
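The four controls above can be sketched in plain Python. This is a minimal illustration, not any framework's implementation: the function names are hypothetical, the example distribution is invented, and real decoders work on tensors over the full vocabulary.

```python
import math
import random


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs: list[float], keep: list[int]) -> list[float]:
    """Zero out every token not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs: list[float], k: int) -> list[float]:
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs: list[float], p: float) -> list[float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


def min_p_filter(probs: list[float], min_p: float) -> list[float]:
    """Keep tokens whose probability is at least min_p times the top probability."""
    cutoff = min_p * max(probs)
    return renormalize(probs, [i for i, q in enumerate(probs) if q >= cutoff])


def sample(probs: list[float]) -> int:
    """Draw one token index according to the (filtered) distribution."""
    return random.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.60, 0.25, 0.10, 0.04, 0.01]  # invented next-token distribution
print(sum(q > 0 for q in top_k_filter(probs, 2)))    # 2 candidates survive
print(sum(q > 0 for q in top_p_filter(probs, 0.9)))  # 3 candidates survive
print(sum(q > 0 for q in min_p_filter(probs, 0.1)))  # 3 candidates survive (cutoff is 6%)
print(sample(min_p_filter(probs, 0.1)))              # index 0, 1, or 2
```

Greedy decoding corresponds to always taking the index of the largest probability. The filters differ only in how they decide which of the remaining candidates stay eligible before sampling.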

### Practical Security Implications

Because LLMs respond exclusively to token IDs rather than rendered text, tokenization presents unique vulnerabilities for prompt injection and cybersecurity. 

Malicious actors frequently use invisible Unicode characters, homoglyphs (such as a Cyrillic 'a' instead of a Latin 'a'), or zero-width joiners. To a human reviewer looking at a screen, the text appears normal or harmless. However, the tokenizer processes these invisible characters as entirely different numerical tokens, allowing attackers to smuggle hidden instructions into the model [cite: 28]. For engineers building safety guardrails, it is imperative to inspect the actual token array the model receives, rather than relying on human-readable strings [cite: 28]. 
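A first-pass check for such characters needs only the standard library. The sketch below assumes the surrounding text is expected to be Latin-script, so any other alphabet is flagged. The function name `suspicious_characters` is hypothetical, and this heuristic is a starting point rather than a complete defense.

```python
import unicodedata


def suspicious_characters(text: str) -> list[tuple[int, str, str]]:
    """Flag invisible format characters and non-Latin letters with their positions."""
    findings = []
    for index, ch in enumerate(text):
        name = unicodedata.name(ch, "UNNAMED")
        code = f"U+{ord(ch):04X}"
        if unicodedata.category(ch) == "Cf":
            # Category Cf covers zero-width joiners, zero-width spaces, and similar.
            findings.append((index, code, "invisible format character: " + name))
        elif ch.isalpha() and ord(ch) > 127 and not name.startswith("LATIN"):
            findings.append((index, code, "non-Latin letter: " + name))
    return findings


print(suspicious_characters("paypal"))         # []
print(suspicious_characters("p\u0430ypal"))    # flags the Cyrillic letter at index 1
print(suspicious_characters("ig\u200dnore"))   # flags the zero-width joiner at index 2

# The look-alike strings differ at the byte level, so they tokenize differently.
print("a".encode("utf-8"), "\u0430".encode("utf-8"))
```

Running a check like this on the raw input, alongside inspecting the token IDs the model actually receives, catches differences that are invisible in rendered text.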

## Bottom line

Large language models do not comprehend human language; they process statistical relationships between numbered, subword tokens. This foundational architecture allows artificial intelligence to compress and generate text at incredible scale, but it creates rigid, structural blind spots. Because tokenizers merge letters into opaque subword chunks, models naturally struggle with character-level reasoning, accurate arithmetic, and strict word counts. Furthermore, the statistical bias of tokenizers imposes a heavy "language tax" on non-Western text, skyrocketing API costs and depleting context windows for billions of global users. While developers are actively mitigating these flaws through larger vocabularies and hidden "reasoning tokens," the fundamental reality of AI remains unchanged: to master these systems, you must first understand the invisible numerical pieces they use to construct the world.

## Sources

1. [Medium - The secret language of AI](https://medium.com/@rkuma18/the-secret-language-of-ai-how-chatgpt-actually-reads-your-text-9f3eeda08564)
2. [Blog.gopenai - Your LLM doesn't understand words](https://blog.gopenai.com/your-llm-doesnt-understand-words-it-understands-tokens-5f15e27e7c11)
3. [Hugging Face - ChatterBox](https://huggingface.co/rahul7star/LLM-Brain/blob/main/ChatterBox.md)
4. [Arxiv - Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/pdf/2212.04356)
5. [Scribd - NLP Unit 1](https://www.scribd.com/document/906969495/NLP-Unit-1)
6. [ResearchGate - CNN-Based Accent Similarity](https://www.researchgate.net/publication/401227179_CNN-Based_Accent_Similarity_Detection_Using_Masked_Spectrogram_Reconstruction)
7. [CTE Univ - Linguistics Complete Intro](https://cte.univ-setif2.dz/moodle/pluginfile.php/244044/mod_resource/content/3/%5BHornsby%2C_David%5D_Linguistics___a_complete_introduc%28z-lib.org%29.pdf)
8. [YouTube - Byte Pair Encoding BPE](https://www.youtube.com/watch?v=-YKXSkmitaU)
9. [YouTube - LLM Decoding Strategies](https://www.youtube.com/watch?v=o-_SZ_itxeA)
10. [LLM Calculator - Tokenization Performance Benchmark](https://llm-calculator.com/blog/tokenization-performance-benchmark/)
11. [Reddit - I built a tool to benchmark tokenizers across 100+ languages](https://www.reddit.com/r/MachineLearning/comments/1n0r8b7/i_built_a_tool_to_benchmark_tokenizers_across_100/)
12. [Hosn - Tokenizer efficiency Arabic LLM](https://hosn.om/blog/tokenizer-efficiency-arabic-llm.html)
13. [OpenAI Community - All languages are not created tokenized equal](https://community.openai.com/t/all-languages-are-not-created-tokenized-equal/216407)
14. [Medium - Comprehensive Tokenizer Performance Analysis](https://atul4u.medium.com/tokenizer-comparison-part2-comprehensive-tokenizer-performance-analysis-a8e0613bed0d)
15. [Reddit - Hypothesis: The Strawberry Problem is one of BPE](https://www.reddit.com/r/LocalLLaMA/comments/1fj7pts/hypothesis_the_strawberry_problem_is_one_of/)
16. [Arxiv - StochasTok](https://arxiv.org/html/2506.01687v3)
17. [Stanford CS224N - Tokenization Multilinguality](https://web.stanford.edu/class/cs224n/slides_w26/cs224n-2026-lecture14-guest-julie-tokenization-multilinguality.pdf)
18. [Digital Orientalist - Pitfalls of Chinese Tokenization](https://digitalorientalist.com/2025/02/04/to-merge-or-not-to-merge-the-pitfalls-of-chinese-tokenization-in-general-purpose-llms/)
19. [Hugging Face - Dangers of Tokenizer Recycling](https://huggingface.co/blog/catherinearnett/dangers-of-tokenizer-recycling)
20. [Reddit - Why LLMs suck at following word counts](https://www.reddit.com/r/AI_Agents/comments/1slbeq5/why_llms_suck_at_following_word_counts_its/)
21. [Hacker News - Ask HN: Why do LLMs struggle with word count?](https://news.ycombinator.com/item?id=45149038)
22. [Medium - Why LLMs input is measured in tokens](https://medium.com/@rongalinaidu/why-llms-input-is-measured-in-tokens-not-words-and-why-it-matters-017001340ae4)
23. [Dust - Understanding LLM limitations](https://docs.dust.tt/docs/understanding-llm-limitations-counting-and-parsing-structured-data)
24. [Medium - Why Large Language Models struggle when counting letters](https://medium.com/@ligtleyang/why-large-language-models-struggle-when-counting-letters-in-a-word-9e22f38719f1)
25. [Medium - The secret language of AI](https://medium.com/@rkuma18/the-secret-language-of-ai-how-chatgpt-actually-reads-your-text-9f3eeda08564)
26. [Blog.gopenai - Your LLM doesn't understand words](https://blog.gopenai.com/your-llm-doesnt-understand-words-it-understands-tokens-5f15e27e7c11)
27. [Plain English - Cracking the Code Byte Pair Encoding](https://ai.plainenglish.io/cracking-the-code-byte-pair-encoding-bpe-explained-ec28a8fe7e03)
28. [Medium - WordPiece Tokenization](https://medium.com/@atharv6f_47401/wordpiece-tokenization-a-bpe-variant-73cc48865cbf)
29. [APXML - NLP Fundamentals Advanced Tokenization](https://apxml.com/courses/nlp-fundamentals/chapter-1-nlp-text-processing-techniques/advanced-tokenization-methods)
30. [Hugging Face - Tokenizer Summary](https://huggingface.co/docs/transformers/tokenizer_summary)
31. [Substack - Understanding Byte Pair Encoding](https://vizuara.substack.com/p/understanding-byte-pair-encoding)
32. [Prompt Engineer - Why ChatGPT Can't Count Rs in Strawberry](https://prompt.16x.engineer/blog/why-chatgpt-cant-count-rs-in-strawberry)
33. [Hypotenuse AI - OpenAI Strawberry or o1 Preview](https://www.hypotenuse.ai/blog/openais-strawberry-or-o1-preview)
34. [Levelup Gitconnected - OpenAI GPT o1 A Leap in AI Reasoning](https://levelup.gitconnected.com/openais-gpt-o1-a-leap-in-ai-reasoning-86257d86b9b8)
35. [Simon Willison - OpenAI o1](https://simonwillison.net/2024/Sep/12/openai-o1/)
36. [OpenAI - Learning to Reason with LLMs](https://openai.com/index/learning-to-reason-with-llms/)
37. [Lee Hanchung - Reasoning Understanding o1](https://leehanchung.github.io/blogs/2024/10/08/reasoning-understanding-o1/)
38. [Medium - Exploring Reasoning Capabilities of OpenAI o1](https://viveksmenon.medium.com/exploring-the-reasoning-capabilities-of-openais-o1-models-7b8f3487075a)
39. [ML Radio - How OpenAI o1 Models Simulate Human Reasoning](https://mlrad.io/how-openai-o1-models-simulate-human-reasoning)
40. [Reddit - OpenAI o1 uses reasoning tokens](https://www.reddit.com/r/LocalLLaMA/comments/1ffg1fg/openai_o1_uses_reasoning_tokens/)
41. [OpenAI API Docs - Reasoning Best Practices](https://developers.openai.com/api/docs/guides/reasoning-best-practices)
42. [Arxiv - Language-Specific Tokenization Penalties](https://arxiv.org/pdf/2601.13328)
43. [ResearchGate - Do All Languages Cost the Same Tokenization](https://www.researchgate.net/publication/376393760_Do_All_Languages_Cost_the_Same_Tokenization_in_the_Era_of_Commercial_Language_Models)
44. [Arxiv - Tokenization Penalties (HTML)](https://arxiv.org/html/2601.13328v1)
45. [ACL Anthology - EMNLP Main 1224](https://aclanthology.org/2025.emnlp-main.1224.pdf)
46. [Frontiers In - Artificial Intelligence](https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1538165/full)
47. [Medium - What is LLM Context Window](https://medium.com/@tahirbalarabe2/what-is-llms-context-window-understanding-and-working-with-the-context-window-641b6d4f811f)
48. [AI Primer - What LLMs actually do](https://www.aiprimer.net/library/ai-fundamentals/what-large-language-models-actually-do)
49. [Medium - Why LLMs sometimes forget your conversation](https://medium.com/@techintel0211/why-large-language-models-sometimes-forget-your-conversation-understanding-context-windows-32b3d12684f9)
50. [HKDCA - LLM Stack Practical Guide](https://www.hkdca.com/wp-content/uploads/2025/08/llm-stack-practical-guide-understanding-ai-electric-minds.pdf)
51. [Can Tech IT - The Human Context Window](https://cantechit.com/)
52. [Substack Nidly - What is Tokenization](https://nidly.substack.com/p/what-is-tokenization-in-ai-how-it)
53. [ES Publisher - Article 2094](https://www.espublisher.com/uploads/article_pdf/es2094.pdf)
54. [ITech Creations - Intro to NLP](https://www.itechcreations.in/cbse-class-10/introduction-to-natural-language-processing-how-ai-understands-human-language-class-10/)
55. [BAOU - MSCDS 303](https://baou.edu.in/assets/pdf/2025_MSCDS_303.pdf)
56. [Blog.gopenai - Your LLM doesn't understand words](https://blog.gopenai.com/your-llm-doesnt-understand-words-it-understands-tokens-5f15e27e7c11)
57. [AgentSwarms - Learn](https://agentswarms.fyi/learn)
58. [Lapasserelle - How LLMs Work](https://lapasserelle.com/documents/how_llms_work.pdf)
59. [Juejin - Post 7601374668961316914](https://juejin.cn/post/7601374668961316914)
60. [Buttondown - AI News Gemma 2 Tops rLocalLLaMA](https://buttondown.com/ainews/archive/ainews-gemma-2-tops-rlocalllama-vibe-check/)
61. [Blog.gopenai - Your LLM doesn't understand words](https://blog.gopenai.com/your-llm-doesnt-understand-words-it-understands-tokens-5f15e27e7c11)
62. [Medium - The secret language of AI](https://medium.com/@rkuma18/the-secret-language-of-ai-how-chatgpt-actually-reads-your-text-9f3eeda08564)
63. [Medium - The Art of Tokenization](https://medium.com/data-science/the-art-of-tokenization-breaking-down-text-for-ai-43c7bccaed25)
64. [Medium - The Invisible Building Blocks of AI](https://medium.com/data-science-collective/the-invisible-building-blocks-of-ai-what-you-need-to-know-about-tokenization-acadd86a63ba)
65. [Shanoj - AI Engineering](https://shanoj.com/tag/ai-engineering/)
66. [BAOU - MSCDS 303](https://baou.edu.in/assets/pdf/2025_MSCDS_303.pdf)
67. [PolyMTL - Article 58326](https://publications.polymtl.ca/58326/1/2024_AndressaStefanySilvaDeOliveira.pdf)
68. [Medium - OpenAI Strawberry Mathematical Foundations](https://medium.com/autonomous-agents/open-ai-strawberry-mathematical-foundations-and-emergent-reasoning-in-chain-of-thought-models-e20f2b738fba)
69. [Lee Hanchung - Reasoning Understanding o1](https://leehanchung.github.io/blogs/2024/10/08/reasoning-understanding-o1/)
70. [Hypotenuse AI - OpenAI Strawberry or o1 Preview](https://www.hypotenuse.ai/blog/openais-strawberry-or-o1-preview)
71. [Levelup Gitconnected - OpenAI GPT o1 A Leap in AI Reasoning](https://levelup.gitconnected.com/openais-gpt-o1-a-leap-in-ai-reasoning-86257d86b9b8)
72. [Louis Bouchard - OpenAI o1](https://www.louisbouchard.ai/openai-o1/)
73. [Zenodo - Mathematics Is All You Need](https://zenodo.org/records/19080172/files/Mathematics_Is_All_You_Need.pdf)
74. [Zenodo - Mathematics Is All You Need (Duplicate)](https://zenodo.org/records/19080172/files/Mathematics_Is_All_You_Need.pdf)
75. [SecWest - Strawberry](https://www.secwest.net/strawberry)
76. [Prompt Engineer - Why ChatGPT Can't Count Rs in Strawberry](https://prompt.16x.engineer/blog/why-chatgpt-cant-count-rs-in-strawberry)
77. [Hypotenuse AI - OpenAI Strawberry or o1 Preview](https://www.hypotenuse.ai/blog/openais-strawberry-or-o1-preview)
78. [Lee Hanchung - Reasoning Understanding o1](https://leehanchung.github.io/blogs/2024/10/08/reasoning-understanding-o1/)
79. [One Useful Thing - Something New on OpenAI's Strawberry](https://www.oneusefulthing.org/p/something-new-on-openais-strawberry)
80. [Medium - The secret language of AI](https://medium.com/@rkuma18/the-secret-language-of-ai-how-chatgpt-actually-reads-your-text-9f3eeda08564)
81. [Blog.gopenai - Your LLM doesn't understand words](https://blog.gopenai.com/your-llm-doesnt-understand-words-it-understands-tokens-5f15e27e7c11)
82. [Blog.gopenai - Your LLM doesn't understand words](https://blog.gopenai.com/your-llm-doesnt-understand-words-it-understands-tokens-5f15e27e7c11)
83. [Medium - The secret language of AI](https://medium.com/@rkuma18/the-secret-language-of-ai-how-chatgpt-actually-reads-your-text-9f3eeda08564)
84. [Medium - Tokens Embeddings Semantic Space](https://learncsdesigns.medium.com/day-4-tokens-embeddings-semantic-space-5e8e906f679a)
85. [Arxiv - HTML 2309.13638](https://arxiv.org/html/2309.13638v1)
86. [IoT Digital Twin - LLM Tokenization](https://iotdigitaltwinplm.com/llm-tokenization-bpe-sentencepiece-tiktoken-2026/)
87. [Blog.gopenai - Your LLM doesn't understand words](https://blog.gopenai.com/your-llm-doesnt-understand-words-it-understands-tokens-5f15e27e7c11)
88. [AgentSwarms - Learn](https://agentswarms.fyi/learn)
89. [Tom Archer - How LLMs Tokenize Text](https://tomarcher.io/posts/how-large-language-models-tokenize-text/)
90. [ToolSnak - AI Token Counter](https://www.toolsnak.com/en/ai-token-counter)
91. [Medium - The secret language of AI](https://medium.com/@rkuma18/the-secret-language-of-ai-how-chatgpt-actually-reads-your-text-9f3eeda08564)
92. [Zenn - Tokenization Algorithms](https://zenn.dev/shinyay/books/getting-started-with-tokens-en/viewer/04-tokenization-algorithms)
93. [Crates.io - Tiktoken](https://crates.io/crates/tiktoken)
94. [Github - Toksum](https://github.com/kactlabs/toksum)

**Sources:**
1. [gopenai.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFmHw9jpFoTwXFhKSNxlfzqa2g-t3_2QfpoowHn4Q3zHTIH7OO7WohVJk8Oa3e6XQYZ28ndiAU3wvXCcQKCI6lR8qpwmSc8DhRiJ7BjMYN7PSWSpcng8Qjxqyo13pKkbWp9u6eJZxG1x0DcF66CCz6G8YirDBnuusPTYnnFt4IKX6Yt6jM-2x4Y2FVwS0QrDbe34A==)
2. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQF7TAaHhxwPR2EVz0JJALuYK44EfXW5VclUoVYAtRENZ-xjBe85tMw9vaOIuiA6xrDpHBzwH5I0hYJnkrnxOaiMimmpMIEkzmrM8qK60DiuiCjOEpkxHPjAbhsjb2K18iEzSEAqHXxuv-Nv4JZslM6oxw7cSs28XmaTnz3nf44zCss-uwsipfXmV7Qmx0-gP6eg_ob2PZmbAk_1fGCycrhDQQTAOg==)
3. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEAZb4fEbTy-ostOEITafydsZO-8aFEKmkUyfOzepDmdEgCDhTT7H7Z7RGRvd6_3bt52Pi0qLA7_QNjoQzN3TLA4sdS0yOTylDBDPI4tkFo_sFnlnI_ujq_3J-nviGQifXm7SNqict0WLLQog8snOZWDZNsRN1FTf1f_BsLLMrbFQfm3fpXJ_bHA7p1dOGgdelFisf01cLfyIbafG3J9cUa9AmSjC61XWvpStMm6lBFaLXo1feZvMouebpq)
4. [substack.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFqzzoHNjFNqcKjheOu7jspk4jS2l0iP_5rfGOhVA_noFcDpNfCcQW129jAmWFNTJXIZYAhkcvWJb8-Z9TdnKyTWUxzzPkk2PsTx9rqdgGz1epwBou6oKhQte0WZpxx5I50cGt4mnSby6p0t9yP6gdgOyzsQg==)
5. [espublisher.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGkjHCwBI5aYbeVLl9oW3elVDFZHSZ1oQ6IQ2VtN77IDpRUAIzICbXP_swH5n5l7AX2InTzMTr7rOQImAgvHkoAP5zYUOzYaCR5dB23x03E0ZDoQmr-frTBA1Zti8djmdHPH6G9zDeXWi-BnsZNgVfm)
6. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQE4mXQh6SfqmPHLtnkjGFAXerR65OnnxSjtHY6eZ843fmjEzfrDh8M-yABSQsZUhtqNOcTF9yDEypUyt-Dhk9VwVERMYnaDpSHXja0PwaCC22-82s1FdjrD2kn4VUXcwAKH0xc4vb57hX7m296K-105i1A1AkB6sQXSkAvXCpA-grs2ksHplFVN4P3WdbxKx9_7qfb0)
7. [apxml.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGtR4xZ4vQRlZUUjPmIXSFVLsMq5YR2kRFF7YUq8VwKUmbRZ5_t1wokFfvz5rhd5nshxHNNfrSmi-F5Tol_cGx44ehFCJe8DNEEgrqvaIyV-O5_Cj33bkFTs2EgJd4KWbuRZBz0p9IINddePDFKxFYN-l0JgVbEsn4aMkVVmlfhcLA7UpzoJfeNfDS1vHoQKOlPxKPET1qeOs78zXRwrAyfUDSltSr_xA==)
8. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQH_gj9mvnlumZMWrUPn7dDEkoY97cSaOeG7vK2bZO0Ig1C-a4khgpCLxDMLii7vwIScMtTZU3bwT28HMib8-TjJCkT1a7-KvL4bCfkMKC-W67Jg5-o5h5J5MlZNNm3bw13WWCPoTn3M7R_nlu9vS9jIcUbkJP_7WAIumM9tYfIo1Ro9NdQAg9aPJOdBGMmLXdbTE9jYOAigZCg5B7sX1Q4lZQPjFGA=)
9. [plainenglish.io](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEXEJP753xA9JLbc-UVR7FbqdCIRNvxPVt1d2LRR6p1tysTKYaNoh2xzf3Zl1qTiG7v0U0L9acAM4hTkmtZ-YVVkZJt_udUZTrCN8J3IMuPEMPnseC-hEBqKdw1Pg_qYtc_ZJI5fux95wSTJqj_8XwazjFeXkMuyQGnyLKbf2LGpUgq7qwP7_ypAW4EL1tROnI=)
10. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGqxcdIXRhEMVJHF2XK72HSJydRkkFQaCSEVpWHxLPuouvyJ6QU1S-mgw3gQ57rKamSKYFWBzEaOK2X6MpLgSC5-dVlJbHdXlmT67EQ2dQ7N_JVAF4KCkpGv4SX94h1WFsvNOx5NVfX0jDghdT11LQK8d3d9l4jjdOZRl54IAfS7wqIJguqsEoLssA=)
11. [huggingface.co](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFIdsCOzYqSk6EAAzA_iX28Gso5nRAXITbCQJoJPZiZaOaj8PVe2ee4eMwGeRhI6SfzPz1cPW0NzVNg6xCnsB_bJLggsSHtGMhhGECYhOCOJYIELRhyTAHNk68DgaqXZHKBVD_KTfxiwlNP3x4mm1kB)
12. [huggingface.co](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHfiOhXJ7WqmCWlh52Xl8HWbfCutSNqd9UH-qKwGEgs7YjIoHrAq_qSR9I3b4s4aU36lp-hrRC7I4BrZjICkbNmPclXdn1J4OYyzJYvS2_SEG5WEQ8gehXJlo8C6gguMzST8yIMvKIupHzwGOvB5MmT-1dYZhzdEt53dULb-nxh9g==)
13. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHTfopjPBJ4nOI6740jrtCm1bxdH3QtrsgqFU7G_Dpu35h7QSXxs2yIn9J2XoFqmFIWsdKpjhhjUp-dlfYto8ZPMt7cMq0zyy5_c7inKJYnLo6SeUwGZ3deHXzk7mcKsNXIcozi2iIUkRfHarg8O4iCNiWpVZ5Uh62QTwgeqUJ7HcmlpQuiC9RKkJFJ)
14. [zenn.dev](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEjlzU4TcA3m_SlVHM3HFCji3WDGrR6448WnHRcGXaCPni18kPso8W1YUwu-hcynhP8KByJmCcL3mDiDc8uZ8oR4xenePscchcPGHN8Ey9oBU1dqUuEY0rGDQy4LW1isFpIsE0QGw3di-6ppQVm9EQ9avOXt_X5BUnNFGS6zeABT8z5JuMiteHY3zoC1UePwu5txc1_sw==)
15. [juejin.cn](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFqlWhjvwJ10Z3mz5fThDMbiz_UfLHBZVRiNN6u10UgFiXV7ZRMsGAxhSV9-_HIVs9a3wZ4eaV6bHxLW1g92aLaSoLUxsZTVlYZ8z_pQDzA7Yvk4pVLFtTA57bnKzO2QOY=)
16. [scribd.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQE5R7qoag26FhwXO_OAuZAMeqWLfxBbTPIBT4tWeQluk_A0--LmzCye6DvIoiOJiTVFHFvLqpVrwjfycnzpoX6kh5Glz0hjoj-Crfk1WoKtIw0KA8bSOgqqRij60lid7oREFwBH3NDmbpo0)
17. [itechcreations.in](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEfcWVMZK7vLa8vribxBsNwVHU6_9WRnnrohZ_3U5XChG5yzSZL_ba6FeGKA8qgfA8XCpnq2Nan9AiXERDcz6CraBBjaYckP_JPLGPNtvKtt6bPeE3WPEgmFEgq7eeWGdpaWmr7B9khVV17ene5uiQ16wDBVsUpApIU2oeugppsAafaq_p8XJmMhvfq7XH64fHce5ekUzNmp7cUpO7kJWguIoBQA9tKlG7XpUQzD1ttoNJ24NctF_tx6GU=)
18. [tomarcher.io](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEZdJLI98aZl-r2-q1iQzW52_LTTTSfNPAddcL7ecDjnL4Nzt5_s8LQS3eDhjnaLqZA0khU1zQjzoTm22kgkD1RtpILzUk_52vgQVAOqNzg2e8SnKqn_RyCupNaFVU83TqNXqLoa5JexNunyPLnFkJUP13iSyFsJXbO)
19. [iotdigitaltwinplm.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHAjBcjfVyhmEtv9Xe2mIw4FoDUg4oosT3OKW8ErZrm_xTtkUirhYfa409k0LGe_PiUlQnp_-uiu0yR6DmdGKmSvv81wt6lo9wjGUYaX1u7Kp4Y7NjaCIUSDNHzSY-MXZAlqLeNz-EQUnibLrCUDOeLxgkjL0oF8Yb6TddVB0G134eBKDIb)
20. [crates.io](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGR8Sos96eKTi3bs4VvJZY5iJ8_escnKsZ6jTDjFJ8OiBp0t_so8w--VFf2KrFOcQypz-wOHYgNVq_KyLbB-7kTuSfGxjG5na469_FumMJpyTfoEJdEQVM=)
21. [digitalorientalist.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGG18ok3WDB8gkxv4-e3Sd7enQz3csbjYz00IQS0phPRHjNR5M-jl-pbZqJS0Tz846pcGvreRfqvNoaDOZMMq4FNB7sqbQCrXzbVn7GFsoE5oR9vMb4MRDtTBaTKOTEAM0gJzx-r0fby0iJFLfTVMwN1e_sGd0K9Yl6wLnutvgCO6lq3dX2ts_Z_pv9Qy6tOV-O_kaUAamrIJVIr1zg5eGuo5itQb7FVlE3oY-_9JdJdg_xdymxUVY=)
22. [llm-calculator.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFufEn_FsTrU9NJKCA8flwZVRu4ePeoxyL57tMOVBBHu6VLUHfD1eSIUNI3Cw4Et-DZsoMG2yDh19eKyloDVwPwZqdhnzdr0HvS0uWRMrB6GRboM2uZwheaf_6wOnZZG1PpfxoKquO2Hbq0xZ5qNoVjIdCM90mxGcvC)
23. [researchgate.net](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEJ8Urkwx2rdGzJAiQ16xUbrvOvsecKp9gvrGObDd1YrMNDCFoVnuTNOX8vZGOZsWxbvPO27FHKt8GJF-Hq7M56aJvDC9v1eFmWoTQa415SQDAfqsRmbzDVGBlgwy6w9whi1Lp9rF0CLW6N4ZL3MAICRDqx_-yynmYTCGeUpFeIpBYHhfiSSkmgt7MOOpKdb_FGSwXXoRJB2me1vh8jjtMxi3aurQCqVyY_LK65hyCkGqVFgkcJJNjd7E-xYZO2)
24. [stanford.edu](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHV5DkBQ21j9QA1cndST_PhEQW6GSw3nwu_OJVklqlfeK-HLr5q9RBk7WP_5N13UoG0-cexyKw7SusHa8hQEVWKEGI5y6L7zkeZb98FvjgfdAeUUlFgjozyHgtuCs70KFrA7Fld-JpvXIXGujnjgmnuF8V_brAOytEJB0ORE_YtnNjfxjcBtJNaS73xp73QrvguKhncCBbQvPPcqjWDWGf8YnaL7aXnfoLG)
25. [16x.engineer](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQF8pzuebXA_gn8qNTgv8Spk1WCyOxVZiZYbRP_ucqyAcbuENNMU20wDXk3CiNWpBZRi0jSlHvQIf-QHWTdMrowC7OUrC5TKxGD65Km3mCc04bifMdOVx0nYKTsfeRZ31fLxqBZWAsGF1oua6hUEydNuNgechbswVXpq22oFjo0=)
26. [hypotenuse.ai](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGPMe5-4vv2Cr16Fju5czWrkgYRkVeCgPouOYZ4QRbCFcw97KOusJ9oFutF0BHrcF5dBE2S57KDcAk-8IRNNQpbQgL-T02s4paItcce_vqOvlHPGyEAPYAaa-Anp6HPuxl2T8B3D_IvWO-Ku0yq01gDT9TxmNs=)
27. [secwest.net](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHjmN6orNqYy_RV6aKFyRGcy8Ks4XN881ACvfKFcUiEExwIpW5PrbUMriewMAogdYyVtJ_aLoi-i87w1PCfNncjJPra6LZxxAfTrQBMjsU_rzROZG13_VgN)
28. [agentswarms.fyi](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQF1jpVBSYxACmP0Hqjle4-13D1auaO25hpgjTzvtu0JdYisGDFQTdyjlxPsqQHShuiKxm5zbtcu8-jwDSoi9Mu-WnXF8CTEtPl4FqxoU74zcWHxoA==)
29. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFGhpvJIJJS3_9YXTpVYy8U9p7hoiV62LYo7tu3wWZfi4TV8e_aOM2WscwD-EHwBfjWNKGdRfG_9ejoCsiano0Kovlz1LRCIqpMC6LbAlzho6Oj_jp8_VIuZg==)
30. [reddit.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGWk2fHIS_IVFObHpebeAi_KCUoySTTgTQTupNamURfstP3HSATNfqAcbw3-wlrawDwypVQqhC6nyY8Ep-WJHk7jbVoD_BNcB27G87dMx7GoGa4gWPe8Jt4gPAbfNfsAPa25mhFS8kJeZ9x_2bbDsJ1jaWGyOMBwT41mtChLkwL-w_RIc-QmF2ngEEXkfwSwCObHZ6c7A==)
31. [dust.tt](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHLRAAOKGQtKVtEfiQU8YTmR1X22sUFGLpYPtDCJBPoq0iSXbB4gRNICUaDkcDyypdHg8HjHdeQfJfkOdFPRIhLRG8M-Awct_YtFiQeqJY0ZVr4pCd4zzr_RziENWMhCQ_qwT1m_2rfhM-0iLtNPA6yGKBqJjLNVbrQ_hmNQ4UD8ybTi5sjemNfrnhKN505Qu3r9w==)
32. [toolsnak.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEggecnlykLTMKPkgebFonoHH-wTds9CtyhhvnHXZqNRsC3sKv5Duf482tBUmhJp6iijg6wKiPeYQdsuHDyOsITH7RucwjUmdkku3PqE3TA266c9fsXDZsl_11T3CUWs1KwYg==)
33. [ycombinator.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQE-j4YM_SNhdh-pJNIkKg1V6RD9x984Km8VKU9eF4WUkknjmVQ8tKlx70sxmS0wo53wnj4zaYGoNEEDvximz8L7-_Rrcf9v0uxfOS1eJ97lA3IZ8XBg-vTvJxvGrFbNlaWWMco=)
34. [github.io](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGwluDtD1KzkxYpcmiyL9D4oLN7LalRfhg64N589DuLDrMHatk_lakwuDR1Qt5kiTU1bgNIXIGGD4VCg07Qr8LlaSEFlBJ-CaTsArYCPw0y-N2PqBhWjpTblM2WU8V9zZbYeIwjsjhH5WAYLua3fs2tfMFBidGeLSREGOnj8ODHGQ==)
35. [louisbouchard.ai](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEROPmjbgJNchBQkWojJhARR5ycYAOrIWpdsxJWpjQd2vdGx1ZWHdFbform51mwoiuS1y3nSLYV5Nm0yGOhPCCcuaNUJj5TemBKcYmfezHM8WPTiJfpsgB6yl18eiI=)
36. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQECwx2cQ1IGyWM4jrOY1A8H9QbNtxaIjRSuHFqccURNDQtDb5yuhmj-YE8FMfdypbk1aDfpQTttfjHfkFtbF6tmrL5Dz0jHpHfFvs9adGwjwJ2VFlgA54ryQBMQRgpwtU9v6z8HYXVtZzemivEuURCPTCIHtJVNw3pBAiRO-9EUaoYsLPCWty5WqqkCfdp313A204HyULSXcqVG8A==)
37. [reddit.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGlgyjAyqAW8L3rynBOCET-4Xz6DnwC2J5SrqF8veK7QntZ9EXiIXa5r8IEIG_B1e66Gl6eXdIfma94OcoMy0b0CmGAVQOX7qJWOPeYOp04Zwud0qpcjW48QkQjUb55rH9T1sgt2OPmIa2ZAut2M4okJz6oMGUzTHLhKnL23NciZfkEoOHVbae4MrRd)
38. [gitconnected.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHP39J744NGZzeMywKU75IVuZCLVXRCnG3BKnWPC6tKCa5tomXJ9qUAjsJD45PuVFmFkt_cbMFjjn2NgCcAsMcXcFdnxACrU_Ldm212WdQh1ZoIsdmhN4lROpoyppHMFPVvrsfv4S9A5bPMDtRmv7YnxhLt0R_0Ki-TBBRszqm_LIDiPzgdSJdIKQ==)
39. [oneusefulthing.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQH7wlwFDCum5UaUM3yvA0rh6d2V5Ny-IcQ9to0dCUdDD3i8l9LlkaBOQ9tE63_5h9XBIAp834k7118CYn4mOrhWTRcFBosD0cZSweKqGNCeRRsejhS61uzVcrrx2D-vigpkcOHQf7tfZl6OvZswFBrTOs2g0ymXqoqUcg==)
40. [openai.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGXq_pA7-YeXVle-dVGE3tKszsgi0Ht1iMJzfcKxTaeIv1hOrq2nAFr5uUfWgIcmTZOJOFSDYY20hD95bBeOB6KglQYonHLInoR9UzANsW6_-t2y607Hhix2RCFOtunSOgWwV4NNsgq1t2cKVE=)
41. [simonwillison.net](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGY7HCi3qf7dvyAlPELr9v4QUyZwMT-W09yGsME4gXaELywWX05Mih7xFoq_PSotKMBCfZaTYXuFZTTiS23q-GY2ThHAlg79iwvJ4ZTnMJpHf7sW9x-TQRkJ1Zqw1JPTQ6mlAisoc0=)
42. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQH6GMtunlpB7AtLxmq0GmsiLXiu8SMX0xGPHdQLI7l_mQ6uXV24XJGv819Wxc6wvDVVwkTu0aByVvvPHTbx3DP9HZ_jPzj5AGWoBVUrGruTY0scMqexUA==)
43. [reddit.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHfQ-RWX9eY2Ui0vSjECjZKVeXJuCE32P-TbixGTrLO6ik9Kt_OXJGl6uMcTleCVLVucrELudSOudBO-v8-hgeo_E4AB0dWPQndbcVK-P7aUmPLU0ngNLBB8O8v6yZRWEqwzzTW5FtVeA2fyKL-f7GWcRNiXn84xqrlhDf5i4qDtjB6a7N0T9_wfHQAG384RMwIZBueJgZ9m8cg0pUAzkQvMbc=)
44. [frontiersin.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQG0QYdNRS6ivtVo8SouqI_-SB5mFPc99Wy-ASU0MAQgE2yZOBKybIJ2bA6VonTsoMgKDzIrTmdCh_PyenNo2fCwAp2JW-SjCc9Ug3QbI1QuN17T3_h5I2aRsYS-e91-SPZbtA3M-oGPDY2PPf9YcWmFqfCBYg7GaOPMOlaXRE-0ufrTY5NejkiyILFcos0_Jma42Q78ZivjSXW3)
45. [aclanthology.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHMjlkjq3BmY0kUf3vmRtN6lUMVEqXUi_C3j1X-GpzwAmTmbCiGLM8LQbBlyQ4BWCYVwgfraKMV8DydIkiB5oVx2-YlK5Jes_y6Tvkx7Wun-YvnFpojv8DhefXZWpb-v_HGhWWuTfM7)
46. [hosn.om](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQF-3r826yybYrrTi0xZLm6sEEcfkrEYA1YwRxbvu0gd5-qVl7qeR1SFSbRtEFqxhSoquZtnR2GK4ZzILzXB_6ipFFyoitTdnItsyTGlzZH4WC3zJaEbIfHcOt6AH2iVY1-HY6imfKydvLRcY9RolYQ=)
47. [aiprimer.net](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEX28c_oly3aNrONxywCrrpgN-nkXYirhuTnT_ZO_zDS4BFcD5EAta7hvcV213CalLuYgC_fhrhyV62xnq6wGRuOLB7rKmAQKOWzctFbvJ2Sd7LjKRsh4qyAtzIkqpk3fmS73JYt0jDyoSzh4DjXDQ2XmH_FHm--G3zYxVBkvXR_oi60yvq-AF2wGfV9iw=)
48. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQH0WKkJbMTz5AvUwil8fFTVHO0POljHd8psx6wkVbS56_-oRCV8Z_l2cXLetTK8BgdajWSbflQhyVM_SHT_4CV9OKEoyrnje1eSO_HEwTssDywm1yJ6sdwuu7418FoJGvm7yby91-C6CI6vAcxBrJUwhNjKBn4u9T_Ayu5cGa6S7HnCZrrQ2UdI3aFDeLRqI-05kRiFpjNCdO68U7JhoMr-I2UMpSc9H556oev04yOXTn1CazOm5GOTFmN9x6eZMrs=)
49. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQH8gFJ1h0S8bRvyjUfFVv6Z8VhlGpyYBsfn8mcXvDu-i6FqSdH0SFCBVJsPd2S_7pcrAQCE8ZygbBtndDdKkoO0VY78TYCiYRZ81YjP4qNr0YYKCEIbdwSlvPbA2IKEBeG-qHLcqmxwFLL4t0cVorG2DkNtXFx1C7eB-UYQ1YyI8nHrtaZ-7a2dZCWQNuzj2RSSWfZXFswTa1ye_Z1gse7LKfFcXg_r6x4ze6sXrsVyWdtHJA==)
50. [hkdca.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFVqo0wgbRx-28plOCSLe5eydEV6kbR7JdU2cUSrNTQkKMrbJdCo6iCIjuvSfzY9LEpS5BsLyqnyBOZW1m0Zo6cTwHbJLH3CGbPFr5c3Pq9xqey5hud3PpB8Kq6I6T5RhuGtIEvxWSWiGsz0jpulVR2r-BR5n_g-vzzmUaSloaqBb-i5aEBwwyG_44XBSIYf0eM5XBK-PI4FSxZmkDThXdnXwweAg==)
51. [youtube.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEHq1N6CGDn1bxdMFIjA0HkGSi0cwldFg6QEjHwwh5p_p-pm0jNu1ewaqNFWhf9CMjuGJQUvxGBzrxGG2jcGRGORlymp7td6RaW1qY3t1OHEFSGlcgOeMG7isj6z3Gx88az)