# How Transformer Attention Works in Plain English

The attention mechanism allows large language models to weigh the relevance of every word in a text against every other word simultaneously, creating a deep contextual understanding that mimics human reading. By breaking text into mathematical queries, keys, and values, transformers bypass older sequential processing methods, enabling massively parallel computation. Modern breakthroughs have further optimized this mechanism, allowing models to process millions of words without succumbing to computer memory limitations.

## The Sequential Memory Dilemma

Before the summer of 2017, the artificial intelligence industry relied heavily on Recurrent Neural Networks (RNNs) and their more advanced variants, such as Long Short-Term Memory (LSTM) networks, to process text [cite: 1, 2, 3]. The underlying logic of an RNN was elegant but fundamentally flawed for large-scale applications: it read text strictly sequentially, word by word, from left to right. It maintained an internal "hidden state" or memory that carried information from previous words into the processing of subsequent ones [cite: 1, 3]. 

However, this architecture suffered from what researchers often refer to as the "telephone game" problem [cite: 1]. As the model progressed through a long paragraph or document, the memory of the first few sentences became increasingly diluted or distorted. If a pronoun at the bottom of a page referred back to a noun at the very beginning, the RNN would often lose the connection [cite: 3]. 

Furthermore, because RNNs had to process a word before they could process the subsequent word, their training could not be easily parallelized [cite: 3, 4]. Modern Graphics Processing Units (GPUs) contain thousands of cores designed to do math simultaneously, but the sequential nature of RNNs forced these massive chips to wait in line, creating a severe bottleneck that prevented models from scaling on massive datasets [cite: 4].

This bottleneck was shattered by a team of researchers from Google Brain, Google Research, and the University of Toronto in their landmark 2017 paper, *Attention Is All You Need* [cite: 1, 5]. The paper introduced the "Transformer" architecture, which completely discarded sequential recurrence in favor of a mechanism called "self-attention" [cite: 1, 5].

Instead of reading a sentence sequentially, the Transformer ingests the entire sequence at once. It then allows every token (a word or sub-word) to directly "look at" and draw context from every other token in the sequence simultaneously, regardless of physical distance [cite: 3, 5]. This not only solved the memory degradation problem—because the distance between any two words was suddenly reduced to a single mathematical operation—but it also allowed the model's training to be massively parallelized, kicking off the era of modern Large Language Models (LLMs) [cite: 3, 4, 5].

## The Mechanics of Self-Attention

To understand how self-attention works, we must look at how the model processes raw text. When a sentence enters a Transformer, each token is converted into a high-dimensional mathematical vector called an embedding [cite: 5, 6]. But an embedding alone only represents a word's static dictionary definition; it lacks context. For instance, the word "bank" has the exact same initial embedding whether the sentence is "I sat on the river bank" or "I deposited money in the bank" [cite: 7].

Self-attention is the mechanism that updates these static dictionary embeddings into rich, context-aware representations [cite: 1, 8]. It does this through a framework of three distinct vectors generated for every single token: Queries, Keys, and Values [cite: 2, 6, 9].

### Queries, Keys, and Values (QKV)

The QKV concept is loosely inspired by retrieval systems, much like searching for a video in a database [cite: 4]. 

1. **The Query (What I am looking for):** Think of the Query as a token asking a question. It projects a vector that essentially says, "What other words in this sentence help explain my specific grammatical or semantic meaning?" [cite: 4, 9].
2. **The Key (What I contain):** Think of the Key as a token's index tag. It projects a vector broadcasting, "Here is the grammatical and semantic information I possess" [cite: 4, 9].
3. **The Value (My actual substance):** Think of the Value as the core meaning of the token that will be passed along if another word decides it is highly relevant [cite: 9].

When processing a sentence, the Transformer calculates the "attention score" between one word and every other word. It does this by taking the mathematical dot product of the first word's Query vector and the second word's Key vector [cite: 9, 10, 11]. The dot product is a way to measure alignment in linear algebra. If the Query of the word "bank" strongly aligns with the Key of the word "river," the dot product will yield a high numerical score. If it aligns poorly with the word "the," the score will be low [cite: 9].
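
The alignment idea can be made concrete with a tiny sketch. The 4-dimensional vectors below are hand-picked, hypothetical numbers (real models learn vectors with dozens or hundreds of dimensions); the point is only that a Query pointing in a similar direction to a Key produces a larger dot product.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical vectors, chosen by hand for illustration only.
query_bank = [1.0, 0.5, 0.0, 2.0]    # Query for "bank"
key_river = [0.9, 0.4, 0.1, 1.8]     # Key for "river": similar direction
key_the = [-0.2, 0.1, 0.9, 0.0]      # Key for "the": mostly unrelated direction

score_river = dot(query_bank, key_river)
score_the = dot(query_bank, key_the)
print(score_river, score_the)        # the "river" score is much larger
```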

### Scaling and the Softmax Function

Once these raw dot-product scores are calculated, they must be stabilized. If the dimensionality of the vectors is very large, dot products can grow into massive numbers, which pushes the softmax function (described next) into saturated regions where gradients become vanishingly small and training stalls [cite: 12]. To fix this, the mechanism employs "Scaled Dot-Product Attention," dividing the scores by the square root of the key dimension [cite: 11, 12, 13].

These scaled scores are then passed through a "softmax" function [cite: 1, 4, 13]. The softmax function acts as a normalizer, converting the raw scores into percentages (probabilities) that always sum exactly to 1.0 (or 100%) for a given token [cite: 1, 4, 14]. 

For example, when evaluating the word "bank" in the context of a river, the softmax function might dictate that "bank" gives 70% of its attention to "river", 20% to "sat", and 10% to itself [cite: 1, 9].
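
As a rough sketch of scaling followed by softmax, the snippet below starts from hypothetical raw scores and an assumed key dimension of 64, both chosen so that the resulting weights land near the 70/20/10 split described above. It also shows why scaling matters: without it, the same scores collapse almost all attention onto a single token.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                                     # assumed key dimension
raw_scores = [24.0, 14.0, 8.0]               # hypothetical scores vs. "river", "sat", "bank"
scaled = [s / math.sqrt(d_k) for s in raw_scores]

weights = softmax(scaled)                    # roughly 0.70, 0.20, 0.10
unscaled_weights = softmax(raw_scores)       # almost everything goes to "river"
print([round(w, 2) for w in weights])
print([round(w, 4) for w in unscaled_weights])
```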

Finally, the model multiplies these percentage weights by the Value vectors of the respective words and adds them all together [cite: 1, 4, 9]. The result is a brand-new, updated vector for the word "bank" that is heavily tinted by the "river" vector. The word is no longer a static dictionary definition; it is deeply contextualized and aware of its surroundings.
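
Putting the pieces together, single-head attention is usually written as $\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$. The following is a minimal pure-Python sketch of that formula, not an optimized implementation; the 2-dimensional `Q`, `K`, and `V` rows are hypothetical toy values, whereas a real model derives them from learned projection matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)                       # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head; one output row per query."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # 1. Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # 2. Normalize the scores into attention weights.
        weights = softmax(scores)
        # 3. Blend the value vectors using those weights.
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
    return outputs

# Hypothetical Q, K, V rows for three tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
contextualized = self_attention(Q, K, V)
print(contextualized)
```

Each output row is a weighted average of the Value rows, so it always stays within the range spanned by those values.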



## Why One Head Isn't Enough: Multi-Head Attention

A single self-attention mechanism (or "head") computing a single attention distribution can only focus on one type of relationship at a time [cite: 15, 16]. If a token's Query is busy searching for grammatical subject-verb agreements, it might miss subtle semantic nuances, causal links, or contextual tone elsewhere in the sentence [cite: 15, 16]. 

To solve this, the Transformer splits its operations into "Multi-Head Attention" [cite: 1, 2, 15]. Instead of executing one massive attention calculation, the model runs multiple self-attention operations in parallel [cite: 2, 12, 17]. It takes the high-dimensional token embeddings and splits them across several smaller subspaces [cite: 15, 16]. For example, if a model has an embedding dimension of 512 and uses 8 attention heads, each head operates in a 64-dimensional subspace [cite: 12, 15]. 
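
The 512-into-8 arithmetic can be sketched as a simple slicing operation. This is an illustration only: in typical implementations the split is applied to the projected Query, Key, and Value vectors rather than to the raw embedding, and `split_heads` is a hypothetical helper name.

```python
def split_heads(vector, num_heads):
    """Slice one model-dimension vector into equal per-head chunks."""
    if len(vector) % num_heads != 0:
        raise ValueError("model dimension must be divisible by the number of heads")
    head_dim = len(vector) // num_heads
    return [vector[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]

projected = [float(i) for i in range(512)]   # stand-in for a 512-dimensional vector
heads = split_heads(projected, num_heads=8)
print(len(heads), len(heads[0]))             # 8 heads, 64 dimensions each
```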

By splitting the workload, each attention head naturally develops a specialization—even though the human programmers never explicitly code them to look for specific things [cite: 18, 19]. This emergent behavior is one of the most fascinating discoveries in artificial intelligence research, effectively serving as an analogue for how human teams divide labor [cite: 15, 19].

### The Emergence of Linguistic Specialization (BERTology)

In 2019, researchers from Stanford University and Facebook AI Research published an exhaustive analysis of the attention heads inside Google's BERT language model [cite: 20, 21, 22]. They discovered that specific attention heads autonomously specialize into distinct linguistic parsers [cite: 20, 22]. Some act as grammar engines, others track narrative flow, and others handle positional logic [cite: 15, 19].

The literature broadly categorizes these specialized heads into several distinct functional roles [cite: 15, 19, 20, 23]:

| Attention Head Specialization | Function and Behavior | Examples from Research Literature |
| :--- | :--- | :--- |
| **Syntactic Heads** | Focus on grammatical structure, parsing sentences into verbs, nouns, and modifiers. | Head 8-10 in BERT links direct objects to verbs with 86.8% accuracy; Head 8-11 connects determiners to nouns at 94.3% accuracy [cite: 20, 22]. |
| **Positional Heads** | Focus heavily on adjacent words, strictly attending to the token immediately preceding or following. | Crucial for assembling compound words, understanding local phrases, and maintaining sequence order [cite: 8, 19, 23]. |
| **Coreference Heads** | Focus on tracking pronouns and connecting them to their original noun subjects across distances. | Allows the model to correctly identify what "it," "he," or "they" refers to across multiple sentences [cite: 15, 18, 20]. |
| **Semantic Heads** | Focus on broad topic relationships and synonyms regardless of grammar or distance. | Connects conceptually related words like "agent," "test suite," and "pull request" to understand software engineering context [cite: 15]. |
| **Delimiter Heads** | Focus on special structural tokens (like sentence separators or the start of a prompt). | Helps the model manage transitions between distinct thoughts or document sections [cite: 20, 21, 23]. |

Once all these parallel heads have finished computing their specific viewpoints, their outputs are concatenated (stitched together) and passed through a final linear transformation [cite: 2, 11, 12]. The result is a singularly unified, profoundly deep representation of the text [cite: 16].
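
A minimal sketch of that final step, assuming two hypothetical heads with 2-dimensional outputs and a hand-written output matrix `w_o` (in a real model this matrix is learned):

```python
def concat_heads(head_outputs):
    """Stitch per-head output vectors back into one long vector."""
    return [x for head in head_outputs for x in head]

def linear(vector, weight):
    """Apply a linear map; `weight` holds one row per output feature."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

head_outputs = [[1.0, 2.0], [3.0, 4.0]]      # hypothetical outputs of two heads
merged = concat_heads(head_outputs)          # [1.0, 2.0, 3.0, 4.0]

# Hypothetical output projection that mixes information across heads.
w_o = [
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
unified = linear(merged, w_o)
print(unified)
```

The projection is what lets information found by one head influence features that another head wrote to.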

## Visualizing Attention via Heatmaps

Because attention scores are deterministic weights between 0 and 1, researchers can easily extract them from a running model and visualize them as heatmaps [cite: 24, 25]. This provides a rare, transparent window into the "black box" of deep learning.

In an attention heatmap, the words of a sentence are placed along both the X and Y axes [cite: 25]. The intersection of any two words is shaded based on the attention score [cite: 25]. Dark, cooler colors (like dark blue) represent a low attention score (near 0%), while bright, warmer colors (like red or yellow) represent a high attention score [cite: 24, 25]. 
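
For readers without a plotting library at hand, the same idea can be sketched as a text heatmap, where denser characters stand in for warmer colors. The attention weights below are hypothetical values made up for the example, and `render_heatmap` is an illustrative helper rather than part of any library.

```python
def render_heatmap(tokens, weights, shades=" .:-=+*#%@"):
    """Render an attention matrix as text; denser characters mean higher weight."""
    lines = []
    for token, row in zip(tokens, weights):
        cells = "".join(
            shades[min(int(w * len(shades)), len(shades) - 1)] for w in row
        )
        lines.append(f"{token:>6} |{cells}|")
    return "\n".join(lines)

tokens = ["I", "sat", "river", "bank"]
# Hypothetical weights: row = attending token, column = attended token.
weights = [
    [0.85, 0.05, 0.05, 0.05],
    [0.35, 0.45, 0.12, 0.08],
    [0.05, 0.13, 0.62, 0.20],
    [0.03, 0.18, 0.72, 0.07],
]
heatmap = render_heatmap(tokens, weights)
print(heatmap)
```

In the last row, the densest character sits in the third column, showing "bank" attending mostly to "river".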

By inspecting these heatmaps, AI researchers gain a starting point for "mechanistic interpretability" [cite: 25, 26]. If a model correctly answers a riddle, researchers can look at the heatmap to see which words received the most attention weight while it produced the answer, though attention weights alone do not fully explain a model's reasoning [cite: 25]. If a specific attention head consistently produces heatmaps linking verbs to their direct objects, that is strong evidence the head has learned this piece of English syntax without human intervention [cite: 20, 25].

## The Quadratic Bottleneck: The Memory Wall

If attention is so powerful, why can't we simply feed an entire library of books or an hour of video into an AI all at once? The answer lies in the harsh mathematical reality of how standard self-attention scales.

In standard dot-product attention, every single token in a sequence must generate a Query and compare it against the Key of *every other token* [cite: 10, 27, 28]. This creates an $N \times N$ matrix of relationships, where $N$ is the number of tokens in the sequence [cite: 10, 29]. 
* If a sequence is 100 tokens long, the model computes 10,000 interactions.
* If a sequence is 1,000 tokens long, the model computes 1,000,000 interactions.
* If a sequence is 100,000 tokens long, the model must compute 10,000,000,000 (ten billion) interactions.

This is known as $O(N^2)$ or "quadratic" time and memory complexity [cite: 10, 27, 30]. As the sequence length (the "context window") grows, the amount of GPU memory required to compute and store these massive attention matrices explodes [cite: 10, 27, 31]. It quickly exceeds the physical limits of standard AI accelerators like Nvidia's H100 GPUs [cite: 31]. 
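
A back-of-the-envelope calculation makes the growth tangible. The sketch below assumes each attention score is stored in 16-bit precision (2 bytes) and counts a single attention matrix for one head in one layer; a full model multiplies this by its number of heads and layers.

```python
def attention_matrix_bytes(num_tokens, bytes_per_score=2):
    """Memory for one N x N attention matrix (one head, one layer)."""
    return num_tokens * num_tokens * bytes_per_score

for n in (100, 1_000, 100_000):
    interactions = n * n
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>7} tokens -> {interactions:>14,} interactions, {gib:.4f} GiB")
```

At 100,000 tokens, one such matrix already needs roughly 18.6 GiB under these assumptions.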

The industry refers to this fundamental limitation as the "Memory Wall" [cite: 27, 29]. Over the past few years, the race to build models capable of reading massive codebases or entire novels has driven researchers to heavily optimize both the attention algorithm and the hardware data pathways [cite: 31, 32].

## The Efficiency Era: Rethinking the Architecture (2022–2026)

To bypass the quadratic bottleneck, researchers have developed brilliant algorithmic workarounds. These techniques do not change the fundamental linguistic philosophy of the Transformer, but they drastically alter how the mathematics are executed on the silicon. 

### FlashAttention: Rewriting the IO Pathway

Introduced by Tri Dao and colleagues in 2022, and continually refined into FlashAttention-2 and FlashAttention-3, this algorithm achieved massive speedups without losing any accuracy [cite: 27, 31]. It computes exact attention, but does so with profound "IO-awareness" (Input/Output awareness) [cite: 27, 31].

Modern GPUs contain two critical types of memory: 
1. **HBM (High Bandwidth Memory):** Massive capacity but relatively slow to read from and write to [cite: 31, 33].
2. **SRAM (Static Random-Access Memory):** Tiny capacity (on the order of a few hundred kilobytes per streaming multiprocessor, tens of megabytes across the whole chip) but blisteringly fast [cite: 31, 33].

Standard attention implementations move massive $N \times N$ matrices back and forth between the slow HBM and the fast SRAM repeatedly during calculation, creating a massive data traffic jam [cite: 27, 31]. FlashAttention bypasses this using a technique called "tiling." It loads smaller blocks of Queries, Keys, and Values from the slow HBM into the fast SRAM, calculates the attention for that block entirely on-chip, and writes only the final output back to the HBM [cite: 27, 31]. 

By avoiding the materialization of giant intermediate attention matrices in the slow memory, FlashAttention achieves roughly 2x to 4x speedups over standard attention implementations and radically reduces memory usage. The later FlashAttention-3 release reports pushing GPU utilization up to 75% of the theoretical maximum FLOPs on H100 GPUs [cite: 31, 33].
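
The key trick that makes tiling possible is an "online" softmax: the running maximum, running sum, and running output are rescaled as each new block arrives, so the full row of scores never has to exist at once. The sketch below illustrates that idea for a single query in plain Python. It is a conceptual illustration under simplifying assumptions (one query, tiny blocks, no GPU memory management), not the actual FlashAttention kernel, and the function names are hypothetical.

```python
import math

def attention_row_full(q, keys, values):
    """Reference: compute every score first, then softmax, then blend."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(len(values[0]))]

def attention_row_tiled(q, keys, values, block_size=2):
    """Same result, but keys/values are consumed one small block at a time."""
    scale = math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [a + p * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Deterministic toy data (hypothetical values).
keys = [[math.sin(i), math.cos(i)] for i in range(7)]
values = [[float(i), float(i % 3)] for i in range(7)]
query = [0.3, -1.1]
full = attention_row_full(query, keys, values)
tiled = attention_row_tiled(query, keys, values, block_size=2)
print(full, tiled)
```

Because the rescaling is exact, the tiled result matches the reference up to floating-point rounding, which is why this style of algorithm counts as exact attention rather than an approximation.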

### Taming the KV Cache: MQA, GQA, and MLA

During inference (when the model is actually generating text), Transformers predict one token at a time autoregressively [cite: 6, 34]. To predict token 1,001, the model needs the attention context of the previous 1,000 tokens. Recalculating the Keys and Values for those previous 1,000 tokens every single step would be incredibly slow, so models store them in what is known as the "KV Cache" [cite: 32, 34]. 
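
The saving can be sketched by counting how many Key/Value projections each strategy performs. In the snippet below, `project` is a hypothetical stand-in for the learned projections (it merely scales the vector), and the attention step itself is omitted to keep the focus on the cache.

```python
class KVCache:
    """Keeps one Key and one Value vector per token already processed."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

def project(vector, scale):
    """Hypothetical stand-in for a learned Key or Value projection."""
    return [scale * x for x in vector]

def decode_with_cache(token_vectors):
    cache, projections = KVCache(), 0
    for vec in token_vectors:
        # Only the newest token is projected; older K/V come from the cache.
        cache.append(project(vec, 0.5), project(vec, 2.0))
        projections += 1
    return cache, projections

def decode_without_cache(token_vectors):
    projections = 0
    for step in range(1, len(token_vectors) + 1):
        # Every step recomputes K/V for the entire prefix seen so far.
        keys = [project(v, 0.5) for v in token_vectors[:step]]
        values = [project(v, 2.0) for v in token_vectors[:step]]
        projections += len(keys)
    return projections

tokens = [[float(i), 1.0] for i in range(1000)]
cache, cached_cost = decode_with_cache(tokens)
naive_cost = decode_without_cache(tokens)
print(cached_cost, naive_cost)
```

For 1,000 tokens, the cached path projects each token once, while the naive path repeats the work for every prefix.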

However, as context lengths grow to hundreds of thousands of tokens, the KV Cache consumes massive amounts of GPU memory [cite: 10, 32, 34]. Researchers developed several attention variants to compress this cache:
* **Multi-Query Attention (MQA):** Instead of every attention head having its own set of Keys and Values, MQA forces all Query heads to share a single Key and Value head [cite: 6, 17, 32]. This drastically shrinks the cache but can slightly degrade model performance on complex reasoning [cite: 17, 32].
* **Grouped-Query Attention (GQA):** Used by models like Llama and Mistral, GQA strikes a balance [cite: 6, 17]. It groups several Query heads together and assigns one Key-Value pair to each group (e.g., 32 Query heads sharing 8 Key-Value heads) [cite: 6]. This preserves high performance while slashing memory costs [cite: 6, 17].
* **Multi-head Latent Attention (MLA):** Pioneered by DeepSeek, MLA takes a radical approach by compressing each token's Keys and Values into a single, low-rank latent vector, from which per-head Keys and Values are reconstructed when needed [cite: 17, 32]. DeepSeek reports KV cache reductions of up to 93.3% while maintaining top-tier performance, vastly improving inference speed [cite: 32].
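
To see why the number of Key-Value heads matters, the cache size can be estimated with simple arithmetic. The configuration below (32 layers, 128-dimensional heads, 100,000 tokens, 16-bit values) is a hypothetical example rather than the specification of any particular model, and MLA is left out because its savings depend on the chosen latent size.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache Keys and Values for one sequence."""
    # Factor 2: one Key and one Value vector per token, layer, and KV head.
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value

config = dict(seq_len=100_000, num_layers=32, head_dim=128)
mha = kv_cache_bytes(num_kv_heads=32, **config)   # every query head has its own K/V
gqa = kv_cache_bytes(num_kv_heads=8, **config)    # 32 query heads share 8 K/V heads
mqa = kv_cache_bytes(num_kv_heads=1, **config)    # all query heads share one K/V head

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 2**30:.1f} GiB")
```

Under these assumptions, GQA with 8 Key-Value heads needs a quarter of the multi-head cache, and MQA needs one thirty-second of it.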

### Sparse and Sliding Window Attention

Another strategy is to stop computing attention for tokens that are too far apart to matter. Mistral models utilize "Sliding Window Attention" (SWA) [cite: 10, 32, 35]. Instead of every token looking at every previous token, a token is only allowed to look at a fixed local window—for example, the previous 4,096 tokens [cite: 10, 32]. This changes the computational complexity from quadratic $O(N^2)$ to linear $O(N \times W)$, where W is the window size [cite: 10, 32].
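
A sliding window is easiest to picture as a mask over the attention matrix. The sketch below assumes a causal model and a window that counts the current token itself; exact conventions differ between implementations, and `sliding_window_mask` is an illustrative helper name.

```python
def sliding_window_mask(num_tokens, window):
    """mask[i][j] is True when token i may attend to token j."""
    return [[(j <= i) and (i - j < window) for j in range(num_tokens)]
            for i in range(num_tokens)]

def count_allowed(mask):
    return sum(cell for row in mask for cell in row)

n, w = 8, 3
windowed = sliding_window_mask(n, w)
full_causal = sliding_window_mask(n, n)      # window as long as the sequence

for row in windowed:
    print("".join("x" if cell else "." for cell in row))
print(count_allowed(windowed), "allowed pairs vs", count_allowed(full_causal))
```

The windowed mask forms a narrow diagonal band, so the number of allowed pairs grows roughly as $N \times W$ instead of $N^2$.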

While it sounds like SWA would cause the model to lose long-term memory, Transformers stack dozens of attention layers on top of each other [cite: 6, 9, 10]. If Token B attends to Token C in Layer 1, and Token A attends to Token B in Layer 2, then Token A indirectly receives information from Token C, even if C lies outside A's own window [cite: 6, 9, 10]. The effective "receptive field" grows larger with every layer, similar to how Convolutional Neural Networks process images [cite: 6, 9, 10].
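
The growth of the receptive field can be checked with a small simulation that tracks which input positions can influence each token after a given number of layers. It assumes a causal window that includes the current token, so information travels at most `window - 1` positions further back per layer; other conventions shift the numbers slightly.

```python
def reachable_positions(num_tokens, window, num_layers):
    """For each token, the set of input positions that can influence it."""
    reach = [{i} for i in range(num_tokens)]
    for _ in range(num_layers):
        new_reach = []
        for i in range(num_tokens):
            # Token i sees itself plus the (window - 1) tokens before it.
            visible = range(max(0, i - window + 1), i + 1)
            merged = set()
            for j in visible:
                merged |= reach[j]
            new_reach.append(merged)
        reach = new_reach
    return reach

one_layer = reachable_positions(num_tokens=20, window=4, num_layers=1)
three_layers = reachable_positions(num_tokens=20, window=4, num_layers=3)
print(min(one_layer[19]), min(three_layers[19]))
```

With a window of 4, the last token reaches back 3 positions after one layer and 9 positions after three layers, even though no single layer looks further than the window.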

Similarly, models using "Native Sparse Attention" (NSA) or "Mixture of Attention" (MoA) dynamically select only the most important tokens to attend to, routing different sparse patterns to different heads based on their function [cite: 30, 32, 34]. Some heads might focus locally, while others act as global sentinels [cite: 34]. 

### Scaling the Hardware: Ring Attention

To handle sequences that simply cannot fit on one device, researchers at UC Berkeley developed "Ring Attention" [cite: 36, 37, 38]. Other distribution methods (like DeepSpeed Ulysses) split the work across attention heads, so each device still has to process the entire sequence for its heads, which bottlenecks at scale [cite: 36].

Ring Attention splits the input sequence into blocks and distributes them across a cluster of GPUs (or TPUs) arranged in a logical ring topology [cite: 36, 39]. Each device holds a portion of the Queries, while the Key-Value blocks are passed sequentially around the ring from device to device [cite: 36, 40]. Because no single device is ever forced to store the full sequence, Ring Attention allows context sizes to scale linearly with the number of devices added [cite: 36, 37]. In testing, this allowed models to handle contexts exceeding 100 million tokens without making mathematical approximations [cite: 37, 39].
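
The ring schedule can be simulated on a single machine to show that passing Key-Value blocks around still yields exact attention. The sketch below is a simplified, non-causal simulation with hypothetical toy data: "devices" are just list indices, communication is a list rotation, and each device folds incoming blocks into a running softmax. It does not model the overlap of communication and computation that a real system relies on.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def full_attention(queries, keys, values):
    """Reference: ordinary softmax attention over the whole sequence."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [dot(q, k) / scale for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append([sum(e * v[j] for e, v in zip(exps, values)) / total
                        for j in range(len(values[0]))])
    return outputs

def ring_attention(q_blocks, k_blocks, v_blocks):
    """Each simulated device keeps its queries and sees one K/V block per step."""
    num_devices = len(q_blocks)
    scale = math.sqrt(len(k_blocks[0][0]))
    d_v = len(v_blocks[0][0])
    state = [[{"max": float("-inf"), "sum": 0.0, "acc": [0.0] * d_v}
              for _ in q_blk] for q_blk in q_blocks]
    kv = list(zip(k_blocks, v_blocks))       # kv[d]: block currently held by device d
    for _ in range(num_devices):
        for dev in range(num_devices):
            keys, values = kv[dev]
            for q, st in zip(q_blocks[dev], state[dev]):
                for k, v in zip(keys, values):
                    s = dot(q, k) / scale
                    new_max = max(st["max"], s)
                    rescale = math.exp(st["max"] - new_max)
                    p = math.exp(s - new_max)
                    st["sum"] = st["sum"] * rescale + p
                    st["acc"] = [a * rescale + p * x for a, x in zip(st["acc"], v)]
                    st["max"] = new_max
        kv = kv[-1:] + kv[:-1]               # hand each block to the next device
    return [[[a / st["sum"] for a in st["acc"]] for st in dev_state]
            for dev_state in state]

# Six toy tokens split across three simulated devices (two tokens each).
Q = [[math.sin(i), math.cos(i)] for i in range(6)]
K = [[math.cos(2 * i), math.sin(3 * i)] for i in range(6)]
V = [[float(i), 1.0 - i] for i in range(6)]
blocks = lambda seq: [seq[i:i + 2] for i in range(0, 6, 2)]
ring_out = [row for blk in ring_attention(blocks(Q), blocks(K), blocks(V)) for row in blk]
reference = full_attention(Q, K, V)
```

No simulated device ever holds more than one Key-Value block at a time, yet `ring_out` agrees with `reference` up to rounding.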

### Compressing History: Infini-Attention

To push context windows toward practical infinity, Google researchers introduced **Infini-attention** in 2024 [cite: 41, 42, 43]. Google has not publicly confirmed whether production models such as Gemini 1.5, which boasts a 1-million to 2-million token window, use this technique.

Infini-attention splits the workload into two simultaneous pathways within a single Transformer block:
1. **Local Masked Attention:** Standard, highly precise dot-product attention is applied only to the most recent segment of text [cite: 7, 43, 44].
2. **Compressive Memory:** Instead of discarding older tokens that fall outside the local window, their Key-Value states are mathematically compressed and stored in a fixed-size memory matrix [cite: 7, 41, 43]. 

When the model needs to recall a specific fact from hundreds of pages ago, its current Query vector interacts with this dense compressive memory matrix using a linear attention mechanism to retrieve the historical data [cite: 7, 41, 42]. This approach fuses the high resolution of local attention with the large capacity of a compressed archive, aiming to ease the "lost in the middle" problem where LLMs forget facts buried deep in their context [cite: 7, 43].
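
The compressive memory pathway can be sketched as a linear-attention store: Keys and Values from old segments are folded into a fixed-size matrix, and a Query later reads from it. The sketch below follows the commonly described formulation with an ELU-plus-one feature map, and should be read as an approximation of the idea under that assumption rather than the authors' implementation. The learned gate that mixes memory output with local attention is omitted, and the class name is hypothetical.

```python
import math

def elu_plus_one(x):
    """ELU(x) + 1, a feature map that keeps every value strictly positive."""
    return x + 1.0 if x > 0 else math.exp(x)

class CompressiveMemory:
    """Fixed-size store; its size never grows with the number of tokens."""
    def __init__(self, d_key, d_value):
        self.matrix = [[0.0] * d_value for _ in range(d_key)]
        self.norm = [0.0] * d_key

    def update(self, keys, values):
        """Fold a segment's Keys and Values into the memory."""
        for k, v in zip(keys, values):
            sk = [elu_plus_one(x) for x in k]
            for i, ki in enumerate(sk):
                self.norm[i] += ki
                for j, vj in enumerate(v):
                    self.matrix[i][j] += ki * vj

    def retrieve(self, query):
        """Read a blended Value back out (call only after at least one update)."""
        sq = [elu_plus_one(x) for x in query]
        denom = sum(a * b for a, b in zip(sq, self.norm))
        return [sum(sq[i] * self.matrix[i][j] for i in range(len(sq))) / denom
                for j in range(len(self.matrix[0]))]

memory = CompressiveMemory(d_key=2, d_value=3)
memory.update([[0.3, -1.2]], [[1.0, 2.0, 3.0]])          # first (old) segment
single_recall = memory.retrieve([5.0, -0.7])             # only one item stored so far
memory.update([[1.5, 0.2], [-0.4, 0.9]], [[0.0, 1.0, 0.0], [4.0, 4.0, 4.0]])
blended_recall = memory.retrieve([5.0, -0.7])
print(single_recall, blended_recall)
```

With a single stored pair, retrieval returns that pair's Value exactly; as more segments are folded in, retrieval becomes a lossy blend, which is the trade-off for keeping the memory a constant size.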

| Attention Optimization | Primary Bottleneck Solved | Mechanism of Action |
| :--- | :--- | :--- |
| **FlashAttention** | GPU Memory IO latency | Reorders operations (tiling) to keep data in fast SRAM, minimizing slow HBM reads/writes without altering accuracy [cite: 27, 31]. |
| **GQA / MLA** | KV Cache memory bloat | Groups Query heads to share Key/Value heads, or compresses the entire cache into a latent vector to save memory during inference [cite: 6, 17, 32]. |
| **Sliding Window / Sparse** | Quadratic scaling complexity | Restricts attention to a local window or select subsets of tokens, relying on stacked layers to build global context [cite: 10, 32, 34]. |
| **Ring Attention** | Single-GPU memory limits | Distributes sequence blocks in a circular network across multiple GPUs, allowing context to scale linearly with hardware [cite: 36, 37, 40]. |
| **Infini-Attention** | Context length constraints | Combines exact local attention with a continuously updated compressed memory matrix for historical retrieval [cite: 7, 43]. |

### Case Study: Naver's HyperCLOVA X

The practical impact of combining these optimizations is evident in enterprise-scale models like Naver's HyperCLOVA X, a model optimized specifically for the Korean language alongside English and coding [cite: 45, 46, 47]. 

To build an efficient system capable of processing massive multilingual datasets without exorbitant training costs, Naver utilized advanced attention techniques like Grouped-Query Attention (GQA) and rotary position embeddings [cite: 45, 48]. Furthermore, they integrated aggressive pruning (removing low-importance parameters) and knowledge distillation, allowing a smaller model like HyperCLOVA X-SEED-0.5B to train with nearly 39 times greater resource efficiency than comparable models [cite: 49]. This demonstrates how attention optimization is not just academic theory; it dictates the commercial viability and sovereign capability of national AI models [cite: 46, 49].

## Manipulating Attention: The Science of Prompt Engineering

Understanding how the attention mechanism works under the hood is critical for interacting with LLMs effectively. The emerging field of "Context Engineering" or "Prompt Engineering" is fundamentally the practice of manipulating a model's attention allocation [cite: 28, 50]. 

Because the self-attention mechanism forces every token to evaluate every other token, context is a finite and fragile resource [cite: 28, 51]. As context lengths increase, the model's "attention budget" gets stretched thin across thousands of tokens, increasing the likelihood of the model hallucinating or losing focus [cite: 28, 51]. Modern prompt engineering relies on structural techniques to guide the QKV mechanism toward the right information:

* **XML Tagging:** Structuring prompts with tags like `<instructions>` or `<examples>` gives each section an unambiguous, easily recognized boundary. Because LLMs are trained on large amounts of code and markup, they tend to treat these tags as structural delimiters, which plausibly lets delimiter-focused heads segment the input and helps prevent the model from confusing instructions with reference data [cite: 52, 53, 54].
* **Chain-of-Thought (CoT):** Asking a model to "Think step-by-step" is not a psychological trick; it is an attention allocation strategy [cite: 50, 55]. By forcing the model to generate intermediate reasoning tokens, those new tokens enter the context window [cite: 50]. The model's attention heads can now attend to these explicit, logical stepping stones when generating the final answer, rather than trying to map a complex problem to a solution in a single, massive attention leap [cite: 50, 55].
* **Context Layering:** Prompt engineers organize information hierarchically—placing the system role, explicit rules, examples, and the immediate user query in strict sequence [cite: 51, 53, 56]. This structured flow aligns with how Positional Heads and Syntactic Heads expect information to be presented, reducing the "noise" that the Query vectors must filter out [cite: 51, 53, 56].
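
As a small illustration of tagging and layering together, the helper below assembles a prompt in a fixed order with explicit section boundaries. The `<instructions>` and `<examples>` tags come from the discussion above; the `<example>` and `<query>` tags, the function name, and the sample text are hypothetical choices made for this sketch.

```python
def build_prompt(instructions, examples, user_query):
    """Assemble a layered prompt with explicit XML-style section boundaries."""
    example_block = "\n".join(f"<example>{ex}</example>" for ex in examples)
    return (
        f"<instructions>\n{instructions}\n</instructions>\n"
        f"<examples>\n{example_block}\n</examples>\n"
        f"<query>\n{user_query}\n</query>"
    )

prompt = build_prompt(
    instructions="Summarize the text in one sentence. Think step-by-step first.",
    examples=["Input: a long memo -> Output: a one-line summary",
              "Input: a bug report -> Output: a one-line summary"],
    user_query="Summarize the attached release notes.",
)
print(prompt)
```

Keeping rules, examples, and the live query in separate, consistently ordered sections makes it harder for reference material to be mistaken for an instruction.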

## Neuroscience and Artificial Cognition

As LLMs scale and their attention mechanisms become more sophisticated, researchers are observing striking functional alignments between artificial attention architectures and biological cognition [cite: 26, 57]. 

A body of research emerging between 2024 and 2026 suggests that the functional architecture of highly capable LLMs mirrors organizational patterns found in human Functional Brain Networks (FBNs) [cite: 57, 58]. In a study analyzing "cognitive heads" using the CogQA dataset, researchers mapped specific attention heads to distinct cognitive tasks like memory retrieval or logical inference [cite: 58]. 

They found an extreme degree of sparsity: for any given cognitive function, fewer than 7% of the attention heads were actually highly active [cite: 58]. The models rely on highly specialized, localized subnetworks to solve specific problems—paralleling the human brain's tendency to localize functions in specific cortical regions [cite: 26, 57, 58].

Institutions like Google DeepMind and NTT Research are actively exploring how to bridge this gap further [cite: 59, 60, 61]. DeepMind's proposals around "Nested Learning" and "Reinforced Attention Learning" represent a fundamental shift [cite: 59]. Rather than merely rewarding an AI for predicting the correct next word, Nested Learning rewards the model for *paying attention to the right internal information* at different temporal speeds [cite: 59]. 

By creating a continuum memory system with slow-updating long-term components and fast-updating short-term components, researchers are attempting to mimic how the human brain manages lifelong learning [cite: 59]. If successful, this could solve the catastrophic forgetting problem, moving AI away from being a static snapshot of training data toward becoming a continuously learning system [cite: 59, 60].

## Bottom line

The attention mechanism revolutionized artificial intelligence by allowing models to dynamically weigh the relationships between all parts of a sequence simultaneously, breaking free from the constraints of linear, sequential processing. Through the use of parallel query, key, and value vectors, "attention heads" autonomously specialize to decipher syntax, semantics, and context with remarkable precision. While the quadratic computational cost of attention remains a hurdle, modern engineering innovations, ranging from FlashAttention to Infini-attention, are steadily lowering the memory wall, paving the way for models with far longer context windows and increasingly brain-like functional architecture.

## Sources

1. [The Latest Research Progress of Attention Mechanism in Deep Learning](https://www.researchgate.net/publication/392209609_The_Latest_Research_Progress_of_Attention_Mechanism_in_Deep_Learning)
2. [The silent revolution: how the attention mechanism has rewritten the rules](https://medium.com/@gianluca.mondillo/the-silent-revolution-how-the-attention-mechanism-has-rewritten-the-rules-of-artificial-378a0f9f6191)
3. [DeepMind's nested learning and reinforced attention](https://www.youtube.com/watch?v=MNi5s0KaNX4)
4. [DeepMind Research Publications](https://deepmind.google/research/publications/)
5. [Efficient Attention Mechanisms for Large Language Models: A Survey](https://arxiv.org/abs/2507.19595)
6. [FlashAttention-3 release and explanation](https://tridao.me/blog/2024/flash3/)
7. [FlashAttention: Efficient Long-Sequence Modeling in Transformers](https://www.scribd.com/document/1011312085/Flash-Attention)
8. [FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135)
9. [FlashAttention OpenReview Paper](https://openreview.net/pdf?id=H4DqfPSibmx)
10. [Github paper notes on FlashAttention](https://github.com/AkihikoWatanabe/paper_notes/issues/688)
11. [Advanced Attention Mechanisms in Transformer LLMs](https://pub.towardsai.net/advanced-attention-mechanisms-in-transformer-llms-44cac04ec356)
12. [Mistral AI attention mechanism optimizations sliding window attention](https://www.youtube.com/watch?v=NGR2Axsg008)
13. [Sliding Window Attention: How Mistral Works](https://typefully.com/hrishioa/sliding-window-attention-how-mistral-works-jZnXRqh)
14. [Mastering Mistral AI: From Sliding Window Attention to Efficient Inference](https://medium.com/@sayedebad.777/mastering-mistral-ai-from-sliding-window-attention-to-efficient-inference-22d944384788)
15. [Sliding Window Attention in Mistral with Receptive Field in CNNs](https://medium.com/@ramponnana.2011/sliding-window-attention-in-mistral-with-receptive-field-in-cnns-bdc5f8d5d055)
16. [Current Time CN Search](https://www.google.com/search?q=time+in+CN)
17. [Stanford CS224N Transformer lecture notes W25](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1254/slides_w25/cs224n-2025-lecture08-transformers.pdf)
18. [Stanford CS224N Transformer lecture notes W24](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1244/slides/cs224n-2024-lecture08-transformers.pdf)
19. [Stanford CS224N Transformer lecture notes Spr24](https://web.stanford.edu/class/cs224n/slides/cs224n-spr2024-lecture08-transformers.pdf)
20. [UBC Transformer Attention Lecture Notes](https://www.cs.ubc.ca/~dsuth/440/22w2/slides/11-attention.pdf)
21. [Hands-on Transformer Deep Dive Part 2: Multi-head attention variants](https://xiaolishen.medium.com/hands-on-transformer-deep-dive-part-2-multi-head-attention-variants-with-code-1d76c8ae65cd)
22. [Attention Is All You Need Explained Like You’re Smart and Busy](https://medium.com/@adnanmasood/attention-is-all-you-need-explained-like-youre-smart-and-busy-2a3d7436144f)
23. [Attention Is All You Need Wikipedia](https://en.wikipedia.org/wiki/Attention_Is_All_You_Need)
24. [Multi-head attention mechanism development](https://www.youtube.com/watch?v=PZE6Ev-pEXk)
25. [Attention Is All You Need NeurIPS Paper](https://proceedings.neurips.cc/paper/7181-attention-is-all-you-need.pdf)
26. [Paper Review: Attention Is All You Need](https://medium.com/@redbeet1007/paper-review-attention-is-all-you-need-vaswani-2017-1d79b986cccf)
27. [Dive into Deep Learning Tsinghua](https://d2l.ai/)
28. [Tsinghua CoAI Research Highlights 2024](https://coai.cs.tsinghua.edu.cn/storage/form/file/TEqX00Qs_Tsinghua-CoAI-research-highlights-2024-full-version.pdf)
29. [Mixture of Attention (MoA) Research](https://arxiv.org/html/2406.14909v1)
30. [Effective Context Engineering for AI Agents Anthropic](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)
31. [Anthropic Prompt Engineering Template](https://www.reddit.com/r/PromptEngineering/comments/1n08dpp/anthropic_just_revealed_their_internal_prompt/)
32. [Anthropic Prompt Engineering Guide Overview](https://www.aiwithgrant.com/guides/anthropic-prompt-engineering-overview)
33. [How to Start Anthropic Prompt Engineering](https://blog.promptlayer.com/how-to-start-anthropic-prompt-engineering/)
34. [Prompt Engineering with Anthropic Claude](https://medium.com/promptlayer/prompt-engineering-with-anthropic-claude-5399da57461d)
35. [What Does BERT Look At? An Analysis of BERT's Attention (ACL)](https://aclanthology.org/W19-4828/)
36. [What Does BERT Look At Bibbase](https://bibbase.org/network/publication/clark-khandelwal-levy-manning-whatdoesbertlookatananalysisofbertsattention-2019)
37. [What Does BERT Look At Stanford PDF](https://www-nlp.stanford.edu/pubs/clark2019what.pdf)
38. [What Does BERT Look At Hugging Face](https://huggingface.co/papers/1906.04341)
39. [What Does BERT Look At arXiv](https://arxiv.org/abs/1906.04341)
40. [LLM Transformer Book: Multi-Head Attention](https://www.waylandz.com/llm-transformer-book-en/chapter-11-multi-head-attention)
41. [How Transformers Compute Attention in Parallel](https://medium.com/@punya8147_26846/multi-head-attention-how-transformers-compute-attention-in-parallel-using-multiple-attention-heads-ca64cd09eaa9)
42. [Attention Head Specialization Emergent Mind](https://www.emergentmind.com/topics/attention-head-specialization)
43. [Multi-Head Attention in Transformers ProjectPro](https://www.projectpro.io/article/multi-head-attention-in-transformers/1166)
44. [Multi-Head Attention Mechanism GeeksforGeeks](https://www.geeksforgeeks.org/nlp/multi-head-attention-mechanism/)
45. [Current Time Japan Search](https://www.google.com/search?q=time+in+Japan)
46. [How to interpret attention heatmap](https://attentioninsight.com/knowledgebase/how-to-interpret-attention-heatmap/)
47. [How to Improve Readability with Attention Heatmaps](https://mouseflow.com/blog/how-to-improve-readability-with-attention-heatmaps/)
48. [Alpha.one Glossary: Attention Heatmaps](https://www.alpha.one/resources/glossary/attention-heatmaps)
49. [Understanding Text with Attention Heatmaps](https://muneebsa.medium.com/deep-learning-101-lesson-30-understanding-text-with-attention-heatmaps-efe968a51bc2)
50. [Contentsquare Heatmaps Examples](https://contentsquare.com/guides/heatmaps/examples/)
51. [NTT Scientists Present Breakthrough Research at ICLR 2025](https://ntt-research.com/ntt-scientists-present-breakthrough-research-on-ai-deep-learning-at-iclr-2025/)
52. [NTT CHI 2025 Papers Accepted](https://group.ntt/en/topics/2025/04/25/chi2025.html)
53. [NTT Applied Neuroscience Technology](https://www.rd.ntt/e/research/JN202311_23740.html)
54. [Attention Mechanism Neural Networks Survey](https://arxiv.org/pdf/2204.13154)
55. [NTT Scientists Contribute to NeurIPS 2025](https://ntt-research.com/ntt-scientists-contribute-fifteen-research-papers-to-neurips-2025/)
56. [Naver HyperCLOVA X Tech Report](https://ai-scholar.tech/en/articles/large-language-models/HyperCLOVA-X)
57. [Making HyperCLOVA X Lightweight and Efficient](https://clova.ai/en/tech-blog/small-but-mighty-making-hyperclova-x-both-lightweight-and-efficient)
58. [HyperCLOVA X Paper (arXiv)](https://arxiv.org/html/2404.01954v1)
59. [Introducing HyperCLOVA X](https://clova.ai/en/tech-blog/introducing-hyperclova-x-our-state-of-the-art-ai-models-optimized-for-the-korean-language)
60. [Naver Unveils HyperCLOVA X (w.media)](https://w.media/korean-search-giant-naver-unveils-ai-model-hyperclova-x/)
61. [Time to Take AI Consciousness Seriously](https://www.secondbest.ca/p/time-to-take-ai-consciousness-seriously)
62. [Implementing Attention Mechanisms LLM Log](https://datahacker.rs/llm_log-005-implementing-attention-mechanisms-from-simplified-self-attention-to-multi-head-attention/)
63. [Cognitive Heads Research (arXiv)](https://arxiv.org/html/2512.10978v1)
64. [Attention Heads of Large Language Models (PubMed)](https://pubmed.ncbi.nlm.nih.gov/40041856/)
65. [Visual Attention Variants (Sebastian Raschka)](https://magazine.sebastianraschka.com/p/visual-attention-variants)
66. [ACL Anthology W19-4828](https://aclanthology.org/W19-4828/)
67. [A Primer in BERTology](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00349/96482/A-Primer-in-BERTology-What-We-Know-About-How-BERT)
68. [What Does BERT Look At? An Analysis of BERT's Attention (ResearchGate)](https://www.researchgate.net/publication/335778955_What_Does_BERT_Look_at_An_Analysis_of_BERT's_Attention)
69. [What Does BERT Look At? An Analysis of BERT's Attention (ResearchGate, alternate link)](https://www.researchgate.net/publication/333717618_What_Does_BERT_Look_At_An_Analysis_of_BERT's_Attention)
70. [What Does BERT Look At? An Analysis of BERT's Attention (arXiv abstract)](https://arxiv.org/abs/1906.04341)
73. [Infini-attention Explanation (YouTube)](https://www.youtube.com/watch?v=pn_TcECWm1E)
74. [Infini-attention (Louis Bouchard)](https://www.louisbouchard.ai/infini-attention/)
75. [Insights on Infini-attention (Medium)](https://onlyoneaman.medium.com/insights-from-paper-by-google-on-infinite-context-length-e13acca1cf2c)
76. [Google Infini-attention (Search Engine Journal)](https://www.searchenginejournal.com/google-infini-attention/514869/)
77. [Leave No Context Behind: Infini-attention (arXiv)](https://arxiv.org/html/2404.07143v1)
78. [Ring Attention (OpenReview)](https://openreview.net/forum?id=WsRHpHH4s0)
79. [Ring Attention (arXiv HTML)](https://arxiv.org/html/2310.01889v1)
80. [Ring Attention (ICLR Paper)](https://proceedings.iclr.cc/paper_files/paper/2024/file/1119587863e78451f080da2a768c4935-Paper-Conference.pdf)
81. [Ring Attention (arXiv)](https://arxiv.org/abs/2310.01889)
82. [Ring Attention (YouTube)](https://www.youtube.com/watch?v=jTJcP8iyoOM)
83. [Best Prompt Engineering Techniques](https://blog.prateekanand.com/best-prompt-engineering-techniques-the-practical-guide-to-llm-strategies-and-ai-thinking)
84. [Complete Prompt Engineering Guide 2025](https://aloaguilar20.medium.com/the-complete-prompt-engineering-guide-for-2025-mastering-cutting-edge-techniques-dfe0591b1d31)
85. [Prompt Engineering Guide 2025](https://garrettlanders.com/prompt-engineering-guide-2025/)
86. [Prompt Engineering Techniques (K2View)](https://www.k2view.com/blog/prompt-engineering-techniques/)
87. [Guide to Prompt Engineering 2025 (Dev.to)](https://dev.to/fonyuygita/the-complete-guide-to-prompt-engineering-in-2025-master-the-art-of-ai-communication-4n30)

**Sources:**
1. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGKMOmu11PnUaOQB1sc6a_93PhvgHzNwseJRNk0QhWe8CieCGyLtL78RuoY5pHI_aaLseNYq6M8OlMyloWFHNG2qFttgz-2iQZBFhbhd_uLuoB2l9f508S0HSHukM7vEj3Ec7vnbUmrq9ZrsjqlKdUzeKuyOvnmAGzKnv5xyP0_FlCvA0Xkx2j61Be4cF2ZvMtAHoMbn6xC1AzGbpSfNbh7-bZpqqtmMCSoy__zrEjX9AFnLcPXrJTpV1bG0uvY5__Z)
2. [wikipedia.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQERNjtK2avYBECW0DhTnJtEK61oo3HwgqoHwOkNd57iC36pybG-PIKWlB7zBfffqW-Nn6A0dv2CqPEujICTW96Z-1WtzMRcxAOHNk1H76m-0OYzgeghv-mmN0ad2vg2oUO9ObcFNK07sVYGnuAY)
3. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQF8cOjv1JVL6IxweFNpOyRsNeenVSReGxcbyWoQepBmb_UIlFBthPy3f_hBVmYk8Z0ttIU7OtaMdQzY7fb3BRc7geAH50WpwA-o9Ez3Vtnnzugx4OUuSWu135lX9pMhFUKTpiNZFUStXA67mifGJcadyWKiRQM7Lb5tJZOD0vb_4u1_GQ0ajuf6o6WKqtiVxuTLwyIXztQ=)
4. [stanford.edu](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGF3Ks5VRrqeVg-Ctpvx-eWWIqrtj1p_Ozu4LFkijWT0QWr8C_ztgJNjvK9swU1_-B3D5KyqXHYl9S5ufudM2Gkf4BSw-e-r3Lt04p-g3tC3v2pm07YAhlPOhiI7K_MM55NDNoVGXqurSmViv2OmjB1rGonv7zrkRmP-4gP7P4WrEYxulSEuJBB9ObTCQ==)
5. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQERolUEdZTq7Qfd4Thv2PGDkuxjFofKk1a4-aiTN3JEZeri5TXvPjHLi6UoHBNXR_IH1AhbTMwF-QAsN29SceVt6j92C0vrVE-pHXUPAgl6WNPJrRyZMPezS1jn5oUCdYMtw3hScj6GvhcCgb5OztQtYivTC9vb-5Fe3FAFDEu1FgOBBNPfxdvXx0myHTW8nr6qfLPJTkqForIV3vt_HBde)
6. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGeiTkEv791_ldBU4rXxYKXaup6EO-PDF4VCBIa9lGqvxc6lj3doKRywQfT0WIDkVIO11mt9cliWboMukgbtyLzJLNORiTQf2-7uEYTF4DvQMl0C-AnUhezAEoD054qauz03cZfQXFKdsL9GFSpENlKTqR1bXIC9_Ctof4HkrddCJ3kGN8Ow6rwZKQiVBqlqfrWpzGo2KfFj8NvpfU5gvtPY3q3TD0yTHKnucU7WSU=)
7. [louisbouchard.ai](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHqcdIeDlIiDVV6FzaoV6-k5PpQVJfBZtMJ1ToQkAGT4T960M2oPGaWAeFKSKDkws-RJajLdM2gfThl4tQ7rysOABe4CSs1aHDX27CWp0AdmonZi9sFy6A2v85ZCDTPYNpcyNoa)
8. [sebastianraschka.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEmsAZUBsdKfmGF3LKMUslcLD-VXARea-og0x8-cFpqpcg62QFHtV8sc1exRRCDtDX3Eq-tIJH6SqArIzvvNRkEU6viCEi7grRM9Nuw0BFV5GcBLON6vyh1HFeLijH-1dby7xfD0W9a51rz2ici3pHmNiIU31S9rg==)
9. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHVPwqJK9nkNkRtyaIe9dbf7G7GN426PSIRuvJgrXFxqGHC0K8CoRYkwg2DlzdIE8ji0gRdd8Ivo7hH1H-LdhbmDAu4KP-Hnmj8OXfp2ODkE_NwP1vVIGOyTqUaxXgiBrxqG1DdYHn2ColvXJPCSx2Eh1FvOkrAATqN8El65vOiyQJ43sbWoGXLOKyHRE6T5qEpWBk8KUzUINSdKRrWiJYCjz4WWRzalQ==)
10. [youtube.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQF-D8gFN5_pb2c56jV9nl6kjQwaxYkMEScaqXGF0Ur6DV_VJWQcohVkPf_iA_M_lAo4fvk05SdlniVNn2Ti_AovQ4e6bu9V4fDMSpgMZd8GBbVQUpgvXys6Kt9WrmscPaFM)
11. [youtube.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEcAkD6oiLdQqKdq1uOBO27LhYNlPupQlECz0RAr4JFwutXj1pIha3sWzC2SVK4EHKIzbYboPlfSjKwPdMRKa70l8h5Ha56yNbUCA-kEafVyPrcVgxEx3_TDtkGc8Qhu72y)
12. [neurips.cc](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHOL7Pkb4dtG-SsB8VoKSVbjVIYsG2DkSM7BAKli4RyHsaGGQNqObIYVBFzR3W9LHHzgX3S1an-Tql7MPnyJy2gUkNQ1eJS-I5yt_UVNlj5mN6MCqu_nQxV7fiM7eyCXfTN2PkgrfwD3AW2j9pPgnHAJMv57qzTvou3i-8UUQ==)
13. [researchgate.net](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGGa_UyMbzFV1S8a3E4FghUnLYSvOuU0gXVwOQMxZR6vMgNX7YWGgikPQeO5MtP5rEX0A5vGJURP2s_xTPNpSx6kz-Cy25fPf0OlAot5ynkfoXmdD8c_kMv2bN2Dlew2TqlAU6WzrE7LrMjwETIcepBAnme22zDytXwXz1f_iwZ4lKTY17Z8NmVZpjRq3hYjYJk9vHJnEdKV6KcUm9Ulya5EEDXzpJxqr5uPxCYEA==)
14. [ubc.ca](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFLp29AqCE4xxGV2IWCVQcB28PuEz_AgW4JNvzjetoefVVFmPtnJ4J8mzX7QqFnbMOOfaMJ68noSbzoH6K96uXzw8TtAHKjTzlNHxF40Cp1Wzn62_7tvUpBasgJpdOl8tF3M86hjb-t-8gsXue20jfbPFAS)
15. [waylandz.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFPoL7DBa0hw7tLkFAdZ9zr7gEBCC4c32XJTv54TxzeSCl7HjPXt76yNJox5wYqXpkfbCvHPo_j40ojYmxQ9KxkSP0w9O7OVvvZ3NW6KH3_lq-qqPdzXvMDIM_eljqSpEXjh2c0s8ctCQ-hHVxNZt9YG77S0fFK0INhPghP_R9frqH8E9pp7A==)
16. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHOiJCjb49xtIVawW0UOb03S3vT8qR8R35VHRG8k4bHjvM_Ju4jIL4WvCjzh30iY3nnCxLGsnSRKSGM1wfN9AnFttF0R5blWNgF81qbkid05iXBAkKJsZQ24Ga3eg1rOIxvE_Z0yKgiIdY-8HGJRNGdj02DuwUR5UMm5ffVNArafF1UHRUMy8TAB8pAvYJc5zlzUmAZKjwVFLZNiY_m2JuEOA-o3zLB4i0wbVj851vaaa7pYEQ57hd_1J_X-ca9ia6CNMe2-9-nx59v)
17. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGJKzYEVhDnGQFV_p7XkfijQVBjeQ0eeYAX7Ti8o7R0FNCtcAcZon8YE6vpWgEYire40quCEKf6C5enkiw9VbV0Z81ONRLfSVc4NyIZ6L1KQJ0BeZQgPFXQ1tVt7g1XbBtgVJlNQEBLY6m-gEGQL6s_x7FsK4qleBJj8gVP7fej6tnOGAzEgklStZO_eXbcnycQz4YXFPsJrGxRbukUllBnd5AiL7YDhlQqVz83Ab4=)
18. [stanford.edu](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQED5UjbkER2K-v0KBjsIyBYeMMvfVj7q8vCrbcSweDQHpTut1S1T1Gr3PHik9CpCIrqP3ZgXcn0_8Yi5ZDDc9luzFXjOK8EthDbshEZOIX4VYuFJN-lMuvN_uwbn4D463NdpFn-42lI4FU=)
19. [emergentmind.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQENqfS2cso5aLH74EQuP7hQ6oXm5PNwaKwoET9rezwQJEF3yWdePUhUwlf-KEEOPdywoYMIvuDsr90jtPjJaC0Dth_dX1ufX1wys9cxtXIn_Xx-IeZn6GziajIBFM2ybU6uuOnFkBxJaVuaVjATTGs5xnaAjUx0kQ==)
20. [aclanthology.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFwArXEOr8ZoZp9u6WBDo_WSNUTJKRr0EsihCsjhnI0biJQvQkJ9M01P5zC-f3KmUQHJZq22lwfzL87MJknUIPfm4H6uGs2qQ19ZkCYJm5xmzMgR0vQ8mAh)
21. [bibbase.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQH2AoowvQnfvwh51mHrnWkU5BFGbnhagsTY-XfYiKa8VFRutp6GG_TpOr76ADdwBY2WXzJi8wnWFFMp4-ix_qG4LYIU59AiPs9CmfoPClW9FaSbmeWlIabgzZ-6noch3eyXYXGthvrRADDt1JVcKGgfK5oxN9aYX4GpzpwXf7p4ohkAeMh4ddSTrHPT97NLj5sv5uHEvCC2xoPRMkhCMqSjsR0xMrjxorPNbIfhow==)
22. [researchgate.net](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQE7ekDtwFJ_r7bxXWmYsR5bc54h4RwTeVT_VetSrjE7k-WzQCryumd7Y2ZdqbbJB2k55twn0btO_J7CEPrb4CsCDwPnskJgSbk_ARhboSRJjMirPLpFmA5hJbQ9uYDYXFqGI7HWLPip75ZnAXmwy7jfMpdJG1NvJ8qyPSkLpZDNt4GoVAAweBcmXviPwNylawxrAnWIf48Ud3xGk31YXw4=)
23. [mit.edu](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFP-JYlRQd3CbQUpCqo6Uw99J1-nb2OPtX21ueWN6vJpuVUv9Ve-r_MXgsfcFQA7Zhp4WCJQc6lIkkzmH-yCEFOfZnDCgPhtykmvlaVKxjEeHWko7soHX2BdPMtamBa-O3sa548X9qiKjSYxXHBbAlSNJKoiHcN3UZlv7ts-xSvqCqfwOlQE_AKYFlfQNvg2pUbk3IYD7G-0kcrCmUTy6RXpfUct3J1aVV7gw==)
24. [attentioninsight.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGyrGmuM_kx93dXGJfvoSJcpBY7DZfeBXITxgqo0OPkEIDuJ_kh1ATeAgl8a9kagGqh2IKpDtlbAl-It_EU6wdRGPxUVyjjX8cSKJz2qmrbJIY6m4-SoJvqhS_sLx2B3pQTMWmsqq_Mik10Ub58EhrlBpqGEseIcJiKhFZWKMVE7gCY1Us=)
25. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHSiNGKhz3BsJpJZfYw-2qCKNOmw_q6Y_bY459e4sJkS3KYz-PYCQrsoNwfjrtzpDbmQ1UzQ7F_uMqb1uGKlqlK3UmQDuIvwyY7C8WbJUGteyNq3EqcNUGwAMKtOcliZnVZEuwW0svfYxd11YlTNwI5h5y_8RW-lIEtPWu9isFeQp4pJn6xwmoiwrtGbtPbCU3_aElcAP682wVFvUG4tvJjqtXx_-Q=)
26. [nih.gov](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEfF_Zrzp4f6Z40JGaDZ3bZNw8XaiD9KbivtpysZ2KCK_gnsmOVC7YUB3HexeuyLlhBKEW000SMZ_vKf7tjFYiUKNf4Y6PjhUXq0dUrLsnaTMkufByLYHv5nc8qFoD2nQ==)
27. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFX8jtHEOWLQcWpFSulyouzGxpxVRLmj1qVRmYOSKjipekB76KB_pll17edIb-1_YMiN_73mA_9mlfGr9ZevZ4gRH8tXgvdIzJkNjMFnY7j-ITmePFpyA==)
28. [anthropic.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQE-vi79n5O5WflrWjIq8DxxL26H0Kdrf2LlK4WafQWGaboukvDKAXb1gUAWMp4T1oypXYyDcZwl4GDiGoSieXDCOzfU4ju-K_8gMOsUUjyxA5IRUTmhVaao5_LdrGiXHUCIXXbylaQfdu-J0Zqt_rbcdbyRHQMjOhidAoXQy23ylDXgmuAdz2g=)
29. [openreview.net](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEP2h468Q2wcIsPIErCXWgZJAVhEYzOZZftFvcVAjhXML7vE0iZ3Wqf5_m5AjThFSqBTRMIdyTIYV_IMkMX4tNT6gtZ5lQhTTqL3h-lRgFcSTBMuFlI948iI8VR0QutXQ==)
30. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHR9xbA3EVABgTMrFq3-Pofepu075S4L0iumibRbPfpM-zI3v6yVfEF88cN7ybtISdDuWOsx2j_quOLAegZJJHjnQuPeAoNYCPXSF-l7v9yi1NnZvJX5A==)
31. [tridao.me](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEqFjihivBlrSBSC8v5SvwDXMY9WQFIQv-kZf_3x5DfPRyW5n6A7SjAvvA8rQorzGrEF8-yfrNtamU3QU5UyanuMZsVspWuavhChUTRgLYZS-Zu0xoME_erFA==)
32. [towardsai.net](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGumyFRHWx96mC2Sx6t8CwWhp7Gh866JFiUMGQ9f6DLfeSRYeRi22FaBKK4Hqt_dLbt-oi-dvSRSXM2xUUCOJcHMruhvszcj9_vWo0na5MfF7cQZlfxB24qM2kl8DkIZlB7PgXkqRl9mpAM7TvUEn2pfbg1SQIHrTxMgL1mmtfUM-WuMxIvgfoYrCBXpxRu)
33. [scribd.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHHydtW_fCa0gHUeDwYG68NqGePVPuQK2HSu6IAq4jdLe3O2lo4_e9qO4HAAD8AUFXsyi2o_jbrKOi2a8d_oEaSSZlEcxiEjRisMEsmDpZEattlp3pFI56JRVYzfgIxA5sKDw17ZbBCTRd_xfTcgPzY)
34. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQE9N0PW1L2iUJNdeeNaYGF1_K6kcyUc9h_ARvhFIm-Z48H0LQ7PQLgSLId8uaqKce48KavFRaX_xdV8Wsxhu4VNEXymWu-RJk5QKSXmDHp4Zfa-ZtqvSxb8FA==)
35. [typefully.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHEKlIWvEYQ-_9lZfZ3IvkO_-s9RICACoqRqkB0zrmeHG90PY1GlpdH7IeaZHm_HjhRvBe02IjGanLtkkQHpkIbY37xFmXVUajlLeE4iCRvoLT_Quk2TClAlezw0wY73_Hq3G7xYuKFa3OPyI7riGsb69ypga2LLL3oRG_indeu2K99uWiC9UM=)
36. [openreview.net](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEOs1ZseYCxnN_dUIogXq9x0M9a6N9HUz0hmzIiCYad_vmK7CFcYPpJxUvZCZbtGdjCkaeD8ufkwFUU9HWafihBjETiyUoOace0WNWeK41dXNDqYqwQtnAhPOz9DyGilKg=)
37. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGGNYZwtrTfg_jO9WgXTc0Q_CSvfBqSWeq1bR5Fzj9dlMQIVNS2mzAOz2udebYVPzVGrfVWqV-i0mrG4U-1ONlmjUt6Yo7nNEZr2pLr-GJcHH8pXwCQjMvS8Q==)
38. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQF2i2aiHqX9fWb6xH34VZAVSlygE59MUQGImjtS0-MGuOjpuo_F-YHf-VaMSFS_Pejy0AofVY2Qhi8PxcQBnAnw_QbSOGmT8Zu8RrvSJEAdu_-W4UH_kg==)
39. [iclr.cc](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEsflQ6iy-5SFYMF__6eUr0NgIO4Rr5qQXOI8io4U6OZbmBtoju1qXQTFKCijpMjDxeS-Y61XbYlcecYx44Yq1zPef57Sw7hTomwK6YTnOJKXgnCaj_eBzdv1_RHD2R7Fy9eJ37HINKi9jX2RH-Ly329aY3VGGwWJOzOVAeifQjAstq5s0_TyIz3pTZJevuYeGDMwmwayZVILndw5z90oASl3X7)
40. [youtube.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFntp5h7DMw40NJmGLB0Acw1SFb1PHvjXCzrUCgvU1HJgfG6nPi8w3dX0W-v8MZO3gq2nCW4kOfSex7UnyZosEtoYHNq_kWjJDhMq3MXGdB2oT_YfpJT5_GRRoyu8-yvhnZ)
41. [youtube.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHxqHWbME0dkxygVLwG7QErGsey1xi3O5_lsICnrSyW3lv78c8XaCVlx-lhuSm2VGq_G3Q0kSRLznvtSiCk-QwusgHDfX9xu8z0DIpHsrXv4QO_jVgSLPBmSEbrwlZiPDK7)
42. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGqc_DyZWDyHllzGaJTqcJAeoWZlOIUFygwHtzqzScNgfYGGQsHxyPnSdKHHSjRkNrHzYz09HsBmLELwx32cd7DsDK_sEXMFReyJqoFRx4SJNBWVPh9qXbDG7YiLQkxftJJufanBPfpOLfuQY1iOovyLqo76QQ1IofPkSbt8fgjOFY76bQ7k93wfg_vodxjDSRLOEbiPAcd-SD1)
43. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQG07ZGAUUhvAaUDRndGjnOW5IbkrNWenQGPsFzjhhDBahvWdNFlD__umW2wv350F6yCqsWkSlruqcSI9b1zcpNgK_x5f5vLLN5FAnVda3FWbrRYbwZCZEEnnA==)
44. [searchenginejournal.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGwwwU_hSWBIrYQMBTJfDyoLWoNrcCvIXxKtCIbC4a6rMZd1qNao1w6Ms3zWYAitdxl6b445QHOe7HQbB1uEIgGyk92z_H1okLwqsC8zLRUovLq2XaTscmORXy3CQUm1XxmxAibtowztSP5qHcMBSAZbzfC7I4guA5y)
45. [ai-scholar.tech](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQF7dcaD-orfwhbx7RSx8I1I_Mz2GnS7VCvmSPYP-623H414dYNTflKm9eyWjKonvjhoIYN7q2FRZTa85JzZOpWkDSOWSUlozfXgawYOG0R4sgfzzrHHwbYxv4vaZp8-CsaNs3L8mgcQ7AW0UHuPweXZ1eoYvUd-4soOaTy_)
46. [clova.ai](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFK0-NN2E2PY0DOn7h-WPwMjAa-yfulsBWhFoI2i6SqoO9-nlT8BhKj6g_4wjLI6YTfiw_-2Tj1nuGrOEoEVzJ-mc0OiKoEOi6kqAKDNU1yvDGBHaVDZhin2lBBTjeIY4A4g0cXuT8mr3-9WmN7i_hkvT1PALO5gonZ_Xzxmxv_-wSlQTCF0MYrC8YaZbc52Fwu_HKccWNbW-sxToOpIwrPd0fjFuqm4QTiRWtEow==)
47. [w.media](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQG-qcxa9hlsMJeuLsbHgRqbiZVE0bGPFWv7uhH24chCKIj7hrt8ka3A6j94FPXBJOJiQ9mkfPInjZYCsyuSy9FJOZojC1_Mk3GLMOWSHDVXDnKC8fakGgXej7Q4h-8OsosEEGCeN4O0Pg9sF9UbBnU3P4ER6pcty5Z3ozYPe-g=)
48. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQE64dwTdY4haUyzDNsOdg3p7r_VC_hK-QsaTBcWxrhN-brm2R3GiaS4q8HoGgJ5Rs0fcxnDYpkfEaifffSF_H8TrooyMoVNPJv8QUZZuvWJ3q_Pha-eX37I0Q==)
49. [clova.ai](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFxTPZjajqvRG_S6XS_4BxGXvP_lVkZafCZYD1rauPJ6CmNT-8Lwn2WSWiyx2nsoYl4ziwjENuiLY4Pb-9l0BG1uF3e6qV977vHvHjVxFyf4WXEYOOBEAM1x0f-cWE849k6Q44hhgAilCqwazBjyBg9KkFZtPggCzibW3QiP3DJRLTwjaTcI9_4Kk4bqrUfjnHThVFT00uF)
50. [prateekanand.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQH5Ka91euVF9jYrFXI0H5OYNUvWYqpZoLrbnezGDjpl7WIdd0jROmVTaNCVZKJyUiqeXyDGpLlewHZjwhUsg-xmgMd878qEL_pGcm-wjCQhlGniP4j82uXbVvKOaQc_iZAlGLwdDWDE_1y36Bdp-5KBHl13TwlgH2r0_yMPOf18GURi6eITVNcyz1PHjpfXbczphphWMt5GcOhuD6mIE_ndxHmuAnpJPFv4DOlO)
51. [garrettlanders.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHHpf8FRtj-ifcL97myAkMSRthEFOm_asX5XX_bMFsaVr5oQ8yR1efesBpOo6cYSZHXkLaDXiyl0Fb_D01yBmPAc2Wt46AMsJ9uTGo06NF9VPza5F-nsu3Yz0uiWjkKv5Gojqj545kyw3Q3Yy9U3F8=)
52. [aiwithgrant.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEY724uKHxn_3Mp8LGWtBoIEJOwP8Oqhnqj2Ng_-u-T_RcOy6oe_RA5FTn88ng0XjEL1KoafA9FilC1f72BrKERHa9UMQjedOvcUAnDZ9rjOswlwjT_NJCKjxN22pmyN2A7SdxJ0y_iZZZSlVa2EIp4cyQHvhrnHCV5AYdcYpg=)
53. [promptlayer.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFcBCubF0FpDuszOImj7huMfsGYDyNdDdLUBg2VRPEDoXjIUI59vi98b5aMW0J3cY-vO2IvCwf0V2lOzCW-_pBqmDwZcfgLpaqmQ7cHwTOo2XSA1guVxAdvz4PAW0udxmtgeD7v2Kk__zu6px1oE2L9OQx8qniI4ozpJFuh8A==)
54. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQH82E4C7hUIcq6SjQw4Wc3KSNSTVzmPwbTNJ6BeToGF7fPwDzcuKEx3A-u2HHWM5q9ioeejERFOlVc4O0bzBbtZ73ZiaH9CVY-kn89W5ogd8WAhZG1IrDo8IQ8BdMrxteJL41L99bx5c8E0akhwAISpg0vVWYSVeJgYfaDd7VZNFOJNlwCnC_sP9p4=)
55. [k2view.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEuMiwoYPsmyNn66CU-jyaUJ_HzpyFOdHWycO6z47mMgQKZUr6nItJbr-_c-GbLIDbdQaFD-WL5d6dy6UKsqRLDjeCXTAnyfzwAOaN2o4KAUgVywXZ0gmmh0X6VErNyAy7n1mgGBM8fW49CiGS6XanF)
56. [reddit.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFZ7Ox8cCqql_vJL06nCdJaYyU0zxrfk-lIP-9AEA1GJNfWhMwr5PkLwz0D-NXDTPVrhgdQzLlM3EQP54SPat95IyxcfxGWOU79eNpIwbse5CiyMZiQQ-ediqIYqJNzgazoVXS2HJi8j3LX0PqA8q1z4wK6M2CcwSgdpnUWaujhfLNQHU4qhGQVMpTXQGcUI7O6A7SR7jtkm5qqE_F1lYjA)
57. [secondbest.ca](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQES3lQiR8i_sXwk0ym7xKySrXdMTRMXxZXI_eIH2jVp92gzw_XOcgWC_iv6lHU26SYO2elusKLsK6aZdakMppR9irmKGtIFkbwmvI1Ramc7bThdfosmlFVwyJWyFlwAuDIywsoc2cLnj92D3C8eEMZZhfdjlscL6drJ)
58. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHD-_n2CcOdRBPuME_BV_-L2ZKElRCdGeJvhP0nd5Y3vuQ22Gf30XHb2Cy5wlu19cibFfAaa_40tvfL266RRpfkuT3WtJkDG2jJ125FcyvNogx6Xify2-VFsg==)
59. [youtube.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQE_ctdo9KnppEaWsMiHCArb3jqccDpR3htKOpfvpY7QpQtF8znCro8k9I4PBYgvjo9zRHfhJjFJTjSAZ1mgdLUQ_AAHlityHonZ1HmEo6szYtO3pRuN2DgWvzJrETKWChTx)
60. [ntt-research.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFtMy9o7bNph0gRjofLGzwPLqTEloDgO7bx_rSh78I-FOYlK_Hl9lXIkannapQyv0y-86SiMpY3EHfZ1nSR4sswTodC0RlE43jhZ9ApjC3iFg8m5fkB4yApThoG0VdN-QO5hrPTUFtO8-RIVSPb0aao1hccnEPIBuih5fwyWtXGPw1ZrcB294NBJOQRSJuyTx0eRjIPrA5I51xjMqGZ)
61. [ntt-research.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHPsuGMcJaM8Ogn8O1ONI515i5z0vy5vSQeBcMG-HW4xMFbfmiUQCX6LuGM1WiODjQWvCPFLPk7x6DWVH3IMiAGdRBl14jsgXXfJ-GPjty3wvBfRoiPlU3bWbOHhhH_lVLeio46Z-1az5FwDWNTw0Efcwxs8OCc8TVmYikHAAWJIF4BUXMmbUtJ1a7hLwlfq_n0)
