# Financial and General Large Language Models for Alpha Generation

## Introduction to Semantic Alpha and Market Prediction

The integration of natural language processing into quantitative finance has fundamentally altered the landscape of algorithmic trading. Historically, financial sentiment analysis relied upon rigid, lexicon-based approaches, such as the Loughran-McDonald dictionary or VADER, to parse corporate disclosures and financial news [cite: 1, 2]. These methodologies mapped words to predefined sentiment scores, offering high interpretability but failing to capture context-dependent modifiers, complex financial terminology, and subtle shifts in corporate tone. The advent of transformer-based architectures has introduced a new paradigm of semantic arbitrage, enabling market participants to extract highly nuanced, context-aware signals from unstructured data streams, including earnings call transcripts, regulatory filings, and macroeconomic news.

A central debate within contemporary quantitative research centers on the comparative efficacy of domain-specific financial language models against state-of-the-art general-purpose large language models. Domain-specific models, such as FinBERT and BloombergGPT, are explicitly trained on financial corpora to master the unique lexicon of capital markets. Conversely, general-purpose models, including OpenAI's GPT-4, Meta's Llama 3, and Anthropic's Claude 3, leverage massive parameter scales and vast internet-scale training data to achieve emergent analytical capabilities. Evaluating these models for alpha generation requires an exhaustive analysis of their zero-shot accuracy on financial benchmarks, their translation into actionable trading signals, their resilience to transaction costs, their latency profiles in real-time execution, and their susceptibility to data contamination.

As institutional investors increasingly rely on these computational tools, understanding the structural advantages and limitations of each architectural approach is paramount. The objective is not merely to classify text accurately, but to synthesize unstructured narratives into profitable, risk-adjusted trading strategies while navigating the complex market microstructures that dictate trade execution.

## Architectural Paradigms and Computational Scale

The performance differential between financial and general-purpose models stems directly from their architectural scale, pre-training methodologies, and the composition of their underlying datasets. The landscape is broadly categorized into compact domain-adapted encoder models, massive domain-specific decoder models, and general-purpose frontier models.

### Compact Domain-Specific Encoders

Domain-specific models are engineered to address the distinct linguistic characteristics of financial texts, which frequently feature highly specialized terminology, forward-looking statements, and numerical contextualization. FinBERT represents the canonical encoder-only architecture in this space. Built upon the foundational BERT framework, FinBERT models typically operate with approximately 110 million parameters [cite: 3, 4]. Rather than being trained from scratch, the model starts from general-domain BERT weights and is then further pre-trained and fine-tuned on specialized financial corpora, such as Reuters news, corporate reports, and the Financial PhraseBank [cite: 3, 4]. Due to its compact size, FinBERT offers highly efficient inference and serves as a robust baseline for basic sentiment classification, capturing financial semantics far more effectively than traditional dictionaries [cite: 3, 5].

### Large-Scale Domain-Specific Decoders

BloombergGPT scales the domain-specific paradigm up by more than two orders of magnitude in parameter count. Announced in March 2023, it is a 50.6-billion-parameter decoder-only model trained on a hybrid dataset totaling 709 billion tokens [cite: 6, 7]. The model architecture features 70 transformer layers with 40 attention heads and a hidden dimension of 7,680 [cite: 7]. For training efficiency, the development team used ZeRO stage 3 sharding (through the SageMaker Model Parallelism library, SMP, together with MiCS), activation checkpointing, and BF16 mixed-precision training, consuming roughly 1.3 million GPU hours on 512 NVIDIA A100 GPUs [cite: 7].

The training corpus, named FinPile, constitutes 51.27% of the total data (363 billion tokens) and comprises four decades of proprietary Bloomberg archives, including news, SEC filings, press releases, and financial web-scraped documents [cite: 6, 7]. The remaining 48.73% (345 billion tokens) was drawn from public datasets such as The Pile, C4, and Wikipedia [cite: 8]. This mixed-dataset approach was explicitly designed to retain general natural language capabilities while optimizing the model for financial tasks, yielding superior performance over similarly-sized open models like GPT-NeoX and OPT-66B on named entity recognition and financial news classification [cite: 6, 9, 10].
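The corpus shares quoted above follow directly from the two token counts. The check below is plain arithmetic on the rounded figures given in this section; with those roundings the total comes to 708 billion rather than the 709 billion headline figure, which appears to be a rounding difference.

```python
# Token counts in billions, as quoted in this section (rounded figures).
finpile_tokens = 363
public_tokens = 345
total_tokens = finpile_tokens + public_tokens

# Percentage share of each component in the combined corpus.
finpile_share = round(100 * finpile_tokens / total_tokens, 2)
public_share = round(100 * public_tokens / total_tokens, 2)

print(total_tokens, finpile_share, public_share)
```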

Other notable open-source financial frameworks include FinMA (part of the PIXIU project), which is instruction-tuned on broad financial datasets to support multi-task financial context awareness, and FinGPT, a framework utilizing Low-Rank Adaptation (LoRA) to fine-tune existing models on financial data for rapid, cost-effective updates [cite: 11].
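The cost advantage of LoRA comes from training two small low-rank factors instead of a full weight matrix: the frozen weight `W` (shape `d_out x d_in`) is adjusted by a product `B @ A`, where `B` is `d_out x r` and `A` is `r x d_in`. The sketch below only counts parameters; the layer shape and rank are hypothetical values chosen for illustration, not FinGPT's actual configuration.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the dense weight matrix W (d_out x d_in)."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical projection-layer shape and LoRA rank (illustration only).
d_out, d_in, rank = 4096, 4096, 8

full = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
fraction = lora / full

print(f"full={full:,} lora={lora:,} trainable fraction={fraction:.4%}")
```

Under these assumed dimensions, the adapter trains well under one percent of the layer's parameters, which is what makes frequent re-tuning on fresh financial data affordable.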

### General-Purpose Frontier Models

In contrast, general-purpose models achieve financial competency as a byproduct of their immense scale and diverse training data. Models such as Llama 3.1 and GPT-4 operate on vastly different scales compared to legacy encoders. Llama 3.1, for instance, encompasses models ranging from 8 billion to 405 billion parameters, trained on over 15 trillion tokens [cite: 12, 13]. GPT-4, while its exact parameter count remains proprietary, is estimated to operate at the trillion-parameter scale [cite: 7]. 

These models rely on extensive multi-domain exposure rather than exclusive financial pre-training. Their massive context windows—up to 128,000 tokens for GPT-4o and 131,072 for Llama 3.1 8B Instruct—allow them to process entire 10-K filings, extensive earnings call transcripts, or multiple research reports in a single prompt [cite: 14, 15]. While BloombergGPT outperforms comparable mid-sized general models on specific financial extractions, frontier models like GPT-4 consistently surpass domain-specific models in complex problem-solving, largely due to their parameter scale and advanced instruction-tuning [cite: 7].

| Model Specification | BloombergGPT | FinBERT | Llama 3.1 8B Instruct | GPT-4o |
| :--- | :--- | :--- | :--- | :--- |
| **Parameter Count** | 50.6 Billion | ~110 Million | 8 Billion | Proprietary (Est. >1T) |
| **Training Tokens** | 709 Billion (corpus total) | Not stated (domain-adapted from BERT) | 15+ Trillion | Proprietary |
| **Context Window** | 2,048 tokens | 512 tokens | 131,072 tokens | 128,000 tokens |
| **Financial Data Mix** | 51.27% FinPile (proprietary) | Financial corpora on top of general-domain BERT | Unknown share of 15T+ | Unknown |
| **Access Model** | Proprietary | Open Source | Community License | Proprietary API |

## Performance on Financial Sentiment Benchmarks

To systematically quantify the capabilities of these models, the academic and quantitative finance communities utilize standardized natural language benchmarks, most notably the Financial PhraseBank (FPB) and the Financial Question Answering Sentiment Analysis (FiQA-SA) datasets. These datasets provide a rigorous environment for assessing a model's ability to interpret market sentiment, separating positive, negative, and neutral financial phrasing.

### Zero-Shot and Few-Shot Evaluation Accuracy

Empirical evaluations demonstrate that advanced general-purpose models match or exceed the performance of domain-specific models on basic sentiment classification. In target-based financial sentiment analysis tasks, generative models such as ChatGPT-4, ChatGPT-o1, and DeepSeek-R1 have been shown to outperform discriminative transformer models like FinBERT and DistilFinRoBERTa across precision, recall, and F1-score metrics without requiring task-specific fine-tuning [cite: 2, 16, 17].

A comparative evaluation of FinMA, BloombergGPT, and GPT-4 reveals highly competitive zero-shot and few-shot capabilities. FinMA-7B achieves F1-scores of 87.0% on FPB and 79.0% on FiQA-SA in zero-shot settings, improving to 93.4% and 82.6% respectively in 5-shot configurations [cite: 18]. However, GPT-4 and Llama-3 variants frequently approach or exceed these metrics. In specific sector-based news sentiment analyses, prompt-engineered implementations of GPT-4o outperformed FinBERT by up to 10% in accuracy depending on the sector [cite: 19]. Even smaller open-weight models, such as Llama-3-8B, have demonstrated stable accuracy nearing 88.9% on PhraseBank and 81.7% on FiQA-SA when utilizing parameter-efficient fine-tuning techniques like QLoRA [cite: 20].
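For reference, a minimal sketch of how an F1-score is computed for three-class sentiment labels. It uses macro averaging (an unweighted mean of per-class F1); the studies cited above may use weighted or micro averaging instead, so this is an assumption about the convention, and the labels below are toy data rather than benchmark samples.

```python
LABELS = ("positive", "neutral", "negative")


def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating it as the positive class (one-vs-rest)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy annotations and predictions (illustration only).
y_true = ["positive", "neutral", "negative", "positive", "neutral", "negative"]
y_pred = ["positive", "neutral", "neutral", "positive", "negative", "negative"]

print(round(macro_f1(y_true, y_pred), 4))
```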

### The Overthinking Paradox in Sentiment Classification

While general models exhibit superior complex problem-solving capabilities, explicit step-by-step reasoning can degrade performance on straightforward financial sentiment tasks. A study comparing GPT-4o, GPT-4.1, o3-mini, and FinBERT on the Financial PhraseBank dataset tested Chain-of-Thought (CoT) prompting, which elicits deliberate, step-by-step reasoning, against direct classification, which elicits a fast, intuitive answer [cite: 4].

The analysis revealed a counterintuitive pattern: prompting large models to reason step by step reduced their alignment with human-annotated sentiment labels, particularly in low-ambiguity cases [cite: 4]. The highest agreement with human annotations was achieved by GPT-4o using a "No-CoT" strategy, effectively mirroring fast, intuitive decision-making [cite: 4]. Models optimized for internal reasoning chains, such as o3-mini, yielded the lowest performance on these direct classification tasks. This suggests that additional reasoning introduces misalignment in binary or ternary classification settings, a phenomenon the researchers call "overthinking" [cite: 4]. FinBERT, trained exclusively for direct classification, remains highly competitive in these settings because it emits a label without any intermediate generated reasoning [cite: 4, 21].
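The two prompting styles and the agreement metric can be sketched as follows. The prompt wording is hypothetical and is not taken from the cited study; the model call itself is omitted, so `agreement_rate` simply compares already-collected labels against human annotations.

```python
def direct_prompt(sentence: str) -> str:
    """No-CoT style: ask for the label only."""
    return (
        "Classify the sentiment of this financial sentence as positive, "
        "neutral, or negative. Answer with one word.\n"
        f"Sentence: {sentence}"
    )


def cot_prompt(sentence: str) -> str:
    """CoT style: ask for explicit reasoning before the label."""
    return (
        "Classify the sentiment of this financial sentence as positive, "
        "neutral, or negative. Think step by step, then give the final "
        "label on the last line.\n"
        f"Sentence: {sentence}"
    )


def agreement_rate(model_labels, human_labels) -> float:
    """Share of items where the model label equals the human label."""
    if len(model_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(human_labels)
```

Running both prompt variants over the same sentences and comparing their `agreement_rate` against the human labels is one simple way to test for the effect described above on a given model.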

| Task & Dataset | FinMA-7B (Zero-Shot) | FinMA-7B (5-Shot) | GPT-4o / Llama 3 | FinBERT (Fine-Tuned Baseline) |
| :--- | :--- | :--- | :--- | :--- |
| **FiQA-SA (F1-Score)** | 79.0% | 82.6% | ~81.7% (Llama-3-8B, QLoRA) | ~75.0% |
| **FPB (F1-Score)** | 87.0% | 93.4% | ~88.9% (Llama-3-8B, QLoRA) | ~86.0% |
| **News Headline Classification** | 97.0% | 93.5% | Dominant across tests | Moderately lower accuracy |

## Signal Generation and Simulated Portfolio Performance

The ultimate utility of financial language models is not measured by cross-entropy loss or static F1-scores, but by their capacity to generate economically meaningful trading signals (alpha) when deployed in simulated or live markets. These models transform unstructured text, such as news headlines, earnings calls, or macroeconomic reports, into directional signals: long, short, or no position in a given asset.

### Directional Forecasting from Financial News

A common quantitative methodology involves instructing a model to evaluate whether a news item is positive, negative, or neutral for a specific company's stock price, focusing the analysis on the immediate short-term horizon. Academic research by Lopez-Lira and Tang investigating GPT-4's predictive capacity on U.S. equities demonstrated striking results using post-knowledge-cutoff headlines. By parsing these headlines, GPT-4 achieved a portfolio-day hit rate of approximately 90% for capturing the initial market reaction [cite: 22, 23]. Furthermore, the sentiment scores significantly predicted subsequent price drift over the following trading days, particularly for small-cap stocks and negative news events [cite: 22, 23].

When formalized into systematic long-short portfolios, these sentiment signals yield substantial theoretical returns. A study using the OPT model (Meta's open GPT-3-class decoder) alongside FinBERT to analyze 965,375 U.S. financial news articles over a 13-year period found that the OPT model predicted stock market returns with 74.4% accuracy [cite: 1]. A daily-rebalanced, zero-net-investment long-short strategy based on these signals yielded a cumulative return of 355% over a two-year out-of-sample period (August 2021 to July 2023), achieving an exceptional Sharpe ratio of 3.05 [cite: 1, 24]. FinBERT also demonstrated strong predictive power, achieving a Sharpe ratio of 2.07 in the same framework, which significantly outperformed traditional lexicon dictionaries that posted a Sharpe ratio of only 1.23 [cite: 24].
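A minimal sketch of how sentiment scores become a long-short portfolio return for one day. It assumes equal weighting within each leg and a simple sign rule (long positive scores, short negative scores); the cited studies may use different weighting or ranking schemes, and the tickers and numbers are toy values.

```python
def long_short_return(scores: dict, next_day_returns: dict) -> float:
    """Equal-weighted return of (long positive-score names) minus (short negative-score names)."""
    longs = [t for t, s in scores.items() if s > 0]
    shorts = [t for t, s in scores.items() if s < 0]
    long_leg = sum(next_day_returns[t] for t in longs) / len(longs) if longs else 0.0
    short_leg = sum(next_day_returns[t] for t in shorts) / len(shorts) if shorts else 0.0
    # Shorting a falling stock earns money, hence the subtraction.
    return long_leg - short_leg


# Toy sentiment scores (+1 positive, -1 negative, 0 neutral) and next-day returns.
scores = {"AAA": 1, "BBB": -1, "CCC": 0, "DDD": 1}
next_day_returns = {"AAA": 0.02, "BBB": -0.01, "CCC": 0.005, "DDD": 0.0}

day_return = long_short_return(scores, next_day_returns)
print(round(day_return, 4))
```

Scores must be formed from news available before the return window opens; using same-day closing returns with same-day evening headlines would introduce look-ahead bias.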

### Hybrid Integration and Multi-Stage Filtering

Despite generating high absolute returns in isolated backtests, standalone sentiment strategies possess inherent vulnerabilities. Sentiment evaluation alone is often noisy, as models may fail to differentiate between genuinely market-moving news and routine corporate announcements [cite: 25]. To mitigate this, advanced quantitative frameworks deploy hybrid architectures that merge high-throughput classification with deep contextual filtering.

One such multi-stage AI framework integrated FinBERT for high-throughput initial filtering and Google Gemini for deep contextual evaluation to surface high-conviction signals from over 9 million SEC filings and financial news items [cite: 26]. These signals were subsequently executed within a dollar-neutral long/short framework, augmented by macroeconomic regime filters and technical trend confirmations. Over a 16-year testing period, this hybrid model generated a mean excess return of 51.02% per annum net of transaction costs, achieving a Sharpe ratio of 1.06 and a Sortino ratio of 2.61 [cite: 26]. The significant divergence between the Sharpe and Sortino ratios highlights the strategy's highly asymmetric risk profile, effectively capturing upside volatility while strictly limiting downside risk [cite: 26].
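The gap between the two ratios arises because the Sharpe ratio penalizes all volatility, while the Sortino ratio penalizes only returns below a target. A minimal sketch, assuming a zero risk-free rate, a zero target return, and 252 trading periods per year; the return series is toy data built to have large gains and small losses.

```python
import math


def annualized_sharpe(returns, periods_per_year=252):
    """Mean return divided by sample standard deviation, annualized."""
    mean = sum(returns) / len(returns)
    variance = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)


def annualized_sortino(returns, periods_per_year=252, target=0.0):
    """Mean excess over target divided by downside deviation, annualized."""
    mean_excess = sum(r - target for r in returns) / len(returns)
    downside_sq = sum(min(r - target, 0.0) ** 2 for r in returns) / len(returns)
    if downside_sq == 0.0:
        return math.inf  # no observations below the target
    return mean_excess / math.sqrt(downside_sq) * math.sqrt(periods_per_year)


# Toy series: occasional large gains, small and uniform losses.
toy_returns = [0.03, -0.005, 0.04, -0.005, 0.02, -0.005]

sharpe = annualized_sharpe(toy_returns)
sortino = annualized_sortino(toy_returns)
print(round(sharpe, 2), round(sortino, 2))
```

When most of a strategy's volatility comes from upside moves, the denominator of the Sortino ratio is much smaller than that of the Sharpe ratio, so the Sortino figure is correspondingly larger.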

Other studies confirm that combining real-time sentiment analysis from GPT-2 and FinBERT with classic technical indicators (such as dual MACD configurations and time-series models like ARIMA) creates a synergistic effect. The sentiment layer captures the fundamental "why" behind market moves, while the technical layer filters out sluggishness and provides the specific trade timing [cite: 27].
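One way to combine the two layers is to act on a sentiment signal only when a momentum indicator agrees with it. The sketch below uses a single standard MACD (12/26/9) rather than the dual-MACD configuration mentioned above, seeds each exponential moving average with the first observation, and uses a simple sign-agreement rule; all three are simplifying assumptions, not the cited studies' method.

```python
def ema(values, span):
    """Exponential moving average, seeded with the first value."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


def macd_histogram(prices, fast=12, slow=26, signal=9):
    """MACD line minus its signal line; positive values suggest upward momentum."""
    macd_line = [f - s for f, s in zip(ema(prices, fast), ema(prices, slow))]
    signal_line = ema(macd_line, signal)
    return [m - s for m, s in zip(macd_line, signal_line)]


def combined_signal(sentiment_score, histogram_value):
    """Return +1 or -1 only when sentiment and momentum agree in sign, else 0."""
    if sentiment_score > 0 and histogram_value > 0:
        return 1
    if sentiment_score < 0 and histogram_value < 0:
        return -1
    return 0


# Toy price path: a steady uptrend.
prices = [100.0 + i for i in range(60)]
hist = macd_histogram(prices)

print(combined_signal(0.8, hist[-1]), combined_signal(-0.8, hist[-1]))
```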

## Market Regime Dependency and Drawdown Risks

Empirical evaluations spanning broader cross-sections—analyzing decades of data across hundreds of equity symbols—reveal that autonomous trading strategies exhibit severe market regime dependency. Research utilizing the FINSABER backtesting framework found that timing-based strategies are overly conservative during bull markets, frequently underperforming passive long-only benchmarks [cite: 28, 29]. Conversely, these same strategies become excessively aggressive during bear markets, failing to implement adequate risk controls and incurring heavy losses [cite: 28, 29].

To achieve consistent alpha generation, sentiment signals must be strictly conditioned on market regimes rather than applied unconditionally. Strategies that employ volatility indices (such as the VIX) as threshold indicators can dynamically switch between sentiment-prone and sentiment-immune portfolios depending on prevailing market conditions [cite: 24]. By exploiting behavioral inefficiencies—specifically delayed arbitrage—through systematic regime detection, researchers successfully extracted 20% to 40% annualized returns in environments where unconditional sentiment strategies generated net losses [cite: 24]. 
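A regime switch of this kind can be sketched as a threshold rule on the prior day's VIX close. The threshold of 25 and the mapping of high volatility to the sentiment-immune portfolio are hypothetical choices for illustration; the cited research may use a different threshold or the opposite mapping.

```python
def select_portfolio(vix_level, threshold=25.0):
    """Pick a portfolio label from the volatility regime (assumed mapping)."""
    return "sentiment_immune" if vix_level >= threshold else "sentiment_prone"


def regime_switched_returns(vix, prone_returns, immune_returns, threshold=25.0):
    """Return on day t comes from the portfolio chosen using the VIX close on day t-1."""
    out = []
    for t in range(1, len(vix)):
        choice = select_portfolio(vix[t - 1], threshold)
        out.append(immune_returns[t] if choice == "sentiment_immune" else prone_returns[t])
    return out


# Toy inputs (illustration only).
vix = [15.0, 30.0, 18.0]
prone = [0.0, 0.01, -0.02]
immune = [0.0, 0.002, 0.003]

switched = regime_switched_returns(vix, prone, immune)
print(switched)
```

Lagging the VIX by one day keeps the regime decision free of look-ahead: the portfolio held on day t is selected only from information available at the close of day t-1.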

Furthermore, cross-market tests indicate that regime-conditioned strategies do not transfer automatically when the underlying market microstructure differs. Replications of VIX-conditioned sentiment strategies in Chinese equity markets failed to produce alpha due to differences in valuation variance and strict short-selling constraints, indicating that semantic signals are tied to the specific structural mechanics of the target market [cite: 24].

## Transaction Costs and Execution Friction

A critical flaw in much of the academic literature surrounding simulated alpha generation is the gross underestimation of trading frictions. Discovering a semantic signal that correlates with asset returns is entirely insufficient if that correlation cannot overcome the economic realities of brokerage commissions, execution slippage, and market impact [cite: 30, 31].

### The Illusion of Zero-Cost Execution

Backtests assuming zero transaction costs vastly overstate achievable returns, manufacturing theoretical profits out of minor statistical edges [cite: 32]. Realistic quantitative modeling requires the rigorous integration of execution costs. While flat transaction cost models accurately reflect basic brokerage fees, they completely fail to capture slippage (the difference in price between decision time and execution time) and liquidity constraints [cite: 31]. Large block trades require quadratic transaction cost models to accurately represent non-linear market impact, as trading substantial volume inevitably moves the underlying asset price unfavorably [cite: 31].
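The difference between the cost models can be sketched as follows. The quadratic form used here (impact proportional to notional times participation in average daily volume) is one common simplification, and the impact coefficient and volumes are hypothetical; it is not presented as the specific model in the cited work.

```python
def flat_cost(notional: float, fee_bps: float) -> float:
    """Brokerage-style cost: a fixed number of basis points of traded notional."""
    return notional * fee_bps / 1e4


def quadratic_impact_cost(notional, fee_bps, impact_coeff, adv_notional):
    """Flat fee plus an impact term that grows with the square of trade size."""
    participation = notional / adv_notional  # share of average daily volume
    return flat_cost(notional, fee_bps) + impact_coeff * participation * notional


def slippage_bps(decision_price, fill_price, side):
    """Signed slippage in bps; side is +1 for a buy, -1 for a sell."""
    return side * (fill_price - decision_price) / decision_price * 1e4


# Hypothetical trade sizes against a 50M average daily volume.
small = quadratic_impact_cost(1_000_000, fee_bps=10, impact_coeff=0.1, adv_notional=50_000_000)
large = quadratic_impact_cost(2_000_000, fee_bps=10, impact_coeff=0.1, adv_notional=50_000_000)

print(round(small, 2), round(large, 2), round(slippage_bps(100.0, 100.05, +1), 2))
```

In this toy setup, doubling the trade size doubles the flat fee but quadruples the impact term, which is why a flat model understates the cost of large block trades.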

When researchers impose strict transaction cost hurdles on sentiment strategies, such as a standard 10 basis point (bps) per-trade assumption (20 bps round-trip) combined with daily rebalancing, the economic viability of many semantic signals contracts sharply [cite: 24, 33]. However, the strongest models remain viable under stress. In the Lopez-Lira and Tang study of GPT-4 headline signals discussed above, raising transaction costs from 5 bps to 10 bps, and eventually to 25 bps per transaction, steadily eroded total profits [cite: 33]. Yet even under these conservative cost assumptions, the GPT-4 strategy maintained positive cumulative returns [cite: 33].
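A cost sweep of this kind can be sketched by subtracting a per-trade charge, scaled by daily turnover, from the gross return series and compounding the result. The gross returns and the turnover of 2.0 (closing and reopening the full book each day) are toy assumptions, not figures from the study.

```python
def net_daily_returns(gross_returns, turnover, cost_bps):
    """Subtract cost_bps per unit of traded notional from each day's gross return."""
    cost = cost_bps / 1e4
    return [g - t * cost for g, t in zip(gross_returns, turnover)]


def cumulative_return(returns):
    """Compound a sequence of simple returns."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


# Toy gross daily returns and full daily turnover (illustration only).
gross = [0.004, -0.001, 0.003, 0.002]
turnover = [2.0, 2.0, 2.0, 2.0]

sweep = {
    bps: cumulative_return(net_daily_returns(gross, turnover, bps))
    for bps in (0, 5, 10, 25)
}
for bps, total in sweep.items():
    print(f"{bps:>2} bps per trade -> cumulative {total:+.4%}")
```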

### The Correlation Profitability Threshold

To avoid expensive and misleading backtests, quantitative analysts rely on a correlation-based profitability threshold. This framework assesses the intercept from the signal regression, the correlation coefficient between the semantic signal and asset returns, return volatility, and signal volatility [cite: 30]. By calculating the explicit correlation threshold required for profitability, analysts can reject weak semantic signals early in the research pipeline. If the empirical correlation falls below the calculated threshold, the signal's expected gross return does not cover trading costs, and subsequent portfolio optimization is unlikely to rescue it [cite: 30].
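To make the idea concrete, here is a simplified threshold under stated assumptions; it is an illustrative derivation, not the exact formula of the cited framework. Assume a zero intercept, a standardized signal `z` that is jointly normal with returns, a position of `sign(z)` each period, and a cost `c` paid every period. The expected gross return is then `rho * sigma_r * E|z| = rho * sigma_r * sqrt(2/pi)`, so the signal breaks even when `rho = c / (sigma_r * sqrt(2/pi))`. Because the signal is standardized, signal volatility drops out of this version.

```python
import math


def min_profitable_correlation(cost_per_period: float, return_vol: float) -> float:
    """Break-even correlation rho* = c / (sigma_r * sqrt(2/pi)) under the stated assumptions."""
    return cost_per_period / (return_vol * math.sqrt(2.0 / math.pi))


def passes_cost_hurdle(empirical_corr: float, cost_per_period: float, return_vol: float) -> bool:
    """True when the measured correlation exceeds the break-even level."""
    return empirical_corr > min_profitable_correlation(cost_per_period, return_vol)


# Hypothetical inputs: 20 bps round-trip cost per day, 2% daily return volatility.
threshold = min_profitable_correlation(cost_per_period=0.002, return_vol=0.02)
print(round(threshold, 4))
```

With these hypothetical inputs the break-even correlation is about 0.125, which is high for a daily return predictor; halving turnover or trading more volatile names lowers the bar proportionally.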

Furthermore, empirical observations demonstrate that semantic alpha exhibits rapid decay. Strategy returns consistently decline as the adoption of advanced models rises across the institutional industry, driving faster information incorporation into asset prices and establishing new, higher baselines for market efficiency [cite: 22, 23].

## Inference Latency and Computational Infrastructure

For institutional deployment in live markets, the absolute analytical accuracy of a model must be continuously balanced against the physical constraints of infrastructure costs and inference latency. The deployment of dedicated computational infrastructure is bound by a fundamental trilemma: optimizing for throughput, minimizing latency, and controlling hardware capital and operational expenditures [cite: 34].

### Speed and Latency in High-Frequency Environments

In high-frequency and latency-sensitive trading environments, processing delays measured in milliseconds dictate the ability to capture alpha. Massive general-purpose models like GPT-4 face substantial operational bottlenecks compared to smaller, specialized counterparts due to the stateful nature of inference and the limits of memory bandwidth [cite: 34]. 

FinBERT, operating at roughly 110 million parameters, processes textual inputs with minimal computational overhead, making it highly suitable for sweeping thousands of daily headlines in real-time [cite: 3, 21]. Conversely, massive API-bound models generate significant Inter-Token Latency (ITL) and limit overall throughput. Standard benchmarks indicate that the GPT-4 API throughput hovers around 36 tokens per second [cite: 35, 36]. 

However, recent advancements in open-weight models and specialized hardware have disrupted this dichotomy. Meta's Llama 3.1 8B Instruct, when deployed on highly optimized processing units (such as Groq's LPU architecture), achieves extraordinary throughput ranging from 147 to over 300 tokens per second [cite: 35, 36, 37].
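The throughput figures quoted above translate into generation time as follows. This is simple arithmetic that ignores network round-trips, queuing, and time to first token, and the 150-token response length is a hypothetical value.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response at a steady decode rate."""
    return output_tokens / tokens_per_second


# Throughput figures quoted in this section (tokens per second).
THROUGHPUT = {
    "GPT-4 API": 36.0,
    "Llama 3.1 8B (low end)": 147.0,
    "Llama 3.1 8B (high end)": 300.0,
}
RESPONSE_TOKENS = 150  # hypothetical response length

for name, tps in THROUGHPUT.items():
    print(f"{name}: {generation_seconds(RESPONSE_TOKENS, tps):.2f} s")

speedup_low = THROUGHPUT["Llama 3.1 8B (low end)"] / THROUGHPUT["GPT-4 API"]
speedup_high = THROUGHPUT["Llama 3.1 8B (high end)"] / THROUGHPUT["GPT-4 API"]
print(f"speedup range: {speedup_low:.1f}x to {speedup_high:.1f}x")
```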

For real-time applications such as live document parsing or immediate headline reaction trading, the throughput advantage of an 8-billion-parameter open-weight model over a massive proprietary model, up to roughly 9x on optimized hardware, represents a meaningful structural edge in trade execution [cite: 36].

### The Economics of API versus Local Deployment

The financial economics of inference heavily favor open-weight and smaller-scale models for high-volume tasks. Analyzing millions of financial documents incurs prohibitive costs when utilizing proprietary APIs. 

Processing data via the GPT-4o API costs approximately $2.50 per 1 million input tokens and $10.00 per 1 million output tokens [cite: 15]. By contrast, locally hosted instances of Llama 3.1 8B drop these costs to roughly $0.03 per million tokens, rendering GPT-4o over 83x more expensive for input processing and 333x more expensive for generation [cite: 15]. Even the highly efficient GPT-4o-mini, priced at $0.15 (input) and $0.60 (output) per million tokens, remains 5x to 20x more expensive than deploying an 8-billion parameter open model [cite: 13, 38]. While self-hosting requires significant capital expenditure—a robust cluster with H100 GPUs can cost upwards of $70,000 monthly in cloud infrastructure—the crossover point where open-source economics surpass API fees typically occurs at a volume of 50 to 100 million tokens per month [cite: 36].
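The cost multiples quoted above can be reproduced directly from the per-million-token prices. The helper names are hypothetical, the prices are the ones stated in this section, and the monthly volumes in the example are illustrative.

```python
# Prices in USD per 1 million tokens, as quoted in this section.
API_PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}
SELF_HOSTED_PER_MILLION = 0.03  # locally hosted Llama 3.1 8B


def cost_multiple(api_price, self_hosted_price=SELF_HOSTED_PER_MILLION):
    """How many times more expensive the API is per token."""
    return api_price / self_hosted_price


def monthly_api_cost(input_millions, output_millions, model):
    """API bill for a month, given token volumes in millions."""
    price = API_PRICES[model]
    return input_millions * price["input"] + output_millions * price["output"]


for model, price in API_PRICES.items():
    print(
        f"{model}: input {cost_multiple(price['input']):.0f}x, "
        f"output {cost_multiple(price['output']):.0f}x"
    )

# Hypothetical monthly volume: 100M input tokens, 10M output tokens.
print(monthly_api_cost(100, 10, "gpt-4o"))
```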

Furthermore, attempts to bridge the knowledge gap of smaller models via Retrieval-Augmented Generation (RAG) introduce new operational costs. Implementing RAG pipelines on models like GPT-4o-mini can increase absolute accuracy on financial reasoning benchmarks by roughly 10 percentage points, but this augmentation expands total token consumption by 18x and increases total execution time by a factor of 20 [cite: 39]. Consequently, deploying a fine-tuned, lightweight model locally often offers the best trade-off among cost, speed, and data privacy in systematic quantitative workflows [cite: 36, 40].

## Data Leakage and the Contamination of Benchmarks

Perhaps the most critical threat to evaluating general-purpose language models in finance is the pervasive issue of data contamination, also known as data leakage. Because models like GPT-4, Llama 3, and Claude 3 are trained on vast, opaque snapshots of the public internet (such as Common Crawl snapshots from 2021 through 2023), prominent financial datasets like FPB, FiQA, and extensive historical pricing data frequently bleed into their pre-training corpora [cite: 41, 42, 43]. 

When a model is evaluated on a dataset it has already memorized during training, zero-shot accuracy metrics become artificially inflated. This violates the foundational machine learning principle of evaluating on held-out data the model has never seen, creating a false sense of model capability [cite: 44]. Researchers probing data contamination use specialized protocols like "Testset Slot Guessing" (TS-Guessing), wherein models are prompted to fill in missing metadata or blank options within a benchmark. Applying these probes, researchers discovered that GPT-4 could guess missing benchmark data with an exact match rate of 57%, indicating severe memorization of the underlying test sets [cite: 42].
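A probe in the spirit of TS-Guessing can be sketched as follows: hide one answer option of a multiple-choice item, ask the model to reproduce it verbatim, and measure the exact-match rate. The prompt wording and normalization are hypothetical, and the model call is left out, so the functions below only build prompts and score collected guesses.

```python
def mask_option(question, options, mask_index):
    """Build a fill-in prompt with one option hidden; return (prompt, hidden_text)."""
    letters = "ABCDEFGH"
    shown = ["[MASK]" if i == mask_index else opt for i, opt in enumerate(options)]
    lines = [f"{letters[i]}. {text}" for i, text in enumerate(shown)]
    prompt = (
        "Fill in the [MASK] option exactly as it appears in the original benchmark.\n"
        + question
        + "\n"
        + "\n".join(lines)
    )
    return prompt, options[mask_index]


def exact_match_rate(guesses, targets):
    """Share of guesses equal to the hidden text after trimming and lowercasing."""
    hits = sum(g.strip().lower() == t.strip().lower() for g, t in zip(guesses, targets))
    return hits / len(targets)


# Toy item (not drawn from any real benchmark).
prompt, hidden = mask_option(
    "Which statement best describes the outlook?",
    ["Revenue guidance raised", "Dividend suspended", "No change announced"],
    mask_index=1,
)
print(prompt)
```

A model that has never seen the item has little basis for reproducing an arbitrary incorrect option word for word, so a high exact-match rate is evidence of memorization rather than reasoning.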

In live trading simulations, this contamination manifests as the "Profit Mirage." Autonomous trading agents report extraordinary double or triple-digit annualized returns when backtested on historical data that aligns with their pre-training window [cite: 45]. However, the FinLake-Bench evaluation framework demonstrated that moving these agents just one step beyond their knowledge cutoff triggers a dramatic performance collapse. Re-evaluating agents on fresh market data released after the underlying model's cutoff date resulted in Sharpe ratio decays ranging from 51.48% to 62.23%, with total returns decaying by up to 71.85% for specific agent architectures [cite: 45]. 

These models exhibit high "Prediction Consistency" under counterfactual perturbations: their outputs barely change when the inputs are altered, which indicates that they recall historical outcomes rather than learning robust, causal financial principles [cite: 45]. To combat this, the industry must rely heavily on air-gapped deployment, specialized counterfactual testing simulators (such as the FactFin framework), and uncompromising temporal validation, ensuring backtests strictly use data published after the model's exact knowledge cutoff date [cite: 33, 45, 46].
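Temporal validation can be enforced mechanically before any backtest runs. The sketch below filters records to those published strictly after a knowledge cutoff plus an optional embargo, and computes the relative decay of a metric between in-window and post-cutoff evaluation. The cutoff date, embargo length, record layout, and metric values are all hypothetical.

```python
from datetime import date, timedelta


def post_cutoff_only(records, knowledge_cutoff, embargo_days=0):
    """Keep records published strictly after the cutoff plus an embargo buffer."""
    first_allowed = knowledge_cutoff + timedelta(days=embargo_days)
    return [r for r in records if r["published"] > first_allowed]


def relative_decay(in_window_metric, post_cutoff_metric):
    """Fractional drop of a metric (for example, Sharpe ratio) after the cutoff."""
    return (in_window_metric - post_cutoff_metric) / in_window_metric


# Toy headlines and a hypothetical model knowledge cutoff.
headlines = [
    {"published": date(2023, 3, 1), "text": "toy headline A"},
    {"published": date(2023, 12, 15), "text": "toy headline B"},
    {"published": date(2024, 2, 1), "text": "toy headline C"},
]
cutoff = date(2023, 12, 1)

clean = post_cutoff_only(headlines, cutoff, embargo_days=30)
print([r["text"] for r in clean], round(relative_decay(2.0, 0.9), 2))
```

The embargo buffer is a conservative choice: stated cutoff dates are often approximate, so leaving a gap reduces the chance that late training data overlaps the test window.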

## Global Generalization and Geographic Bias

The performance of sentiment models degrades precipitously when shifted away from U.S. equity markets and English-language corpora, highlighting systemic geographic and linguistic biases embedded within global training sets.

General-purpose frontier models developed by U.S. technology firms predominantly rely on English-centric pre-training data. Consequently, they exhibit profound "foreign bias" in global financial prediction [cite: 47]. Research comparing U.S.-based ChatGPT to China-based DeepSeek revealed that ChatGPT is systematically more optimistic about Chinese equities than local models, yet significantly less accurate in directional forecasts [cite: 47]. This discrepancy is driven by an information-availability mechanism: Western models lack exposure to granular local media coverage and analyst sentiment, forcing them to fall back on broader, generalized heuristics. Crucially, researchers demonstrated that injecting translated local Chinese financial news into the context window eliminates this prediction gap, indicating that the limitation is data-driven rather than architectural [cite: 47].

The imbalance extends to standard non-English datasets. The Language Ranker metric—which benchmarks internal model representations across languages—confirms a strong correlation between a model's operational accuracy and the volumetric proportion of a language within its pre-training corpus [cite: 48, 49]. High-resource languages (such as English, German, and French) yield superior sentiment extraction, while models struggle significantly with cultural nuances, idioms, and local financial jargon in low-resource linguistic environments [cite: 48]. 

For global applicability, localized domain adaptation is essential. In European and Asian markets, local organizations either fine-tune open-weight models on regional financial documents or develop bespoke systems capable of understanding region-specific regulations and central bank communications [cite: 50]. For example, the development of KPI-BERT allowed for the advanced extraction of key performance indicators from German financial documents, a task general models handled poorly due to linguistic drift [cite: 50]. When assessing non-U.S. markets, models explicitly aligned with local information ecosystems inherently exhibit better calibration and alpha-generation capability than generalized models operating out-of-distribution [cite: 47, 51].

## Conclusion

The pursuit of alpha generation through natural language processing has successfully evolved from elementary lexicon counting to deep semantic inference. General-purpose models, particularly GPT-4 and Llama 3, demonstrate extraordinary capabilities in parsing unstructured financial text, frequently matching or exceeding the baseline accuracy of domain-specific models like FinBERT and BloombergGPT on standardized evaluation benchmarks. Their massive scale grants them deep emergent analytical capabilities, allowing for advanced synthesis of complex market dynamics.

However, the application of general language models to systematic trading is fraught with systemic risks. The profit mirage caused by data contamination and look-ahead bias severely inflates backtested performance, demanding rigorous out-of-sample testing strictly past the model's knowledge cutoff date. Furthermore, raw sentiment accuracy is entirely insufficient for profitability; semantic signals must be integrated with dynamic risk management, conditional market regime filtering, and uncompromising execution cost analysis to overcome the friction of live trading. 

Ultimately, the choice of model is dictated by infrastructure economics and latency requirements. While massive frontier models dominate complex, low-frequency analytical tasks, such as in-depth report summarization and regulatory document parsing, parameter-efficient models like FinBERT and 8-billion-parameter open-weight models offer the superior throughput, sub-second latency, and cost-effectiveness required for high-frequency, real-time alpha generation. As financial technology matures, the most robust quantitative strategies will likely feature hybrid architectures: lightweight, highly efficient models handle continuous news stream ingestion and escalate only high-conviction items to larger proprietary models for contextual verification.

## Sources
1. [Stanford CRFM BloombergGPT](https://crfm.stanford.edu/ecosystem-graphs/index.html?asset=BloombergGPT)
2. [BloombergGPT arXiv v3](https://arxiv.org/html/2303.17564v3)
3. [BloombergGPT PDF](https://arxiv.org/pdf/2305.05862)
4. [BloombergGPT Abstract](https://arxiv.org/abs/2303.17564)
5. [HuggingFace BloombergGPT](https://huggingface.co/papers/2303.17564)
6. [ACL Anthology Sentiment Trading](https://aclanthology.org/2025.jeptalnrecital-industrielle.2.pdf)
7. [MDPI Hybrid Sentiment Trading](https://www.mdpi.com/2673-2688/7/4/138)
8. [LLMQuant Signal Strength](https://llmquant.substack.com/p/how-strong-must-an-alpha-signal-be)
9. [ResearchGate Backtesting Sentiment](https://www.researchgate.net/publication/393476731_Backtesting_Sentiment_Signals_for_Trading_Evaluating_the_Viability_of_Alpha_Generation_from_Sentiment_Analysis)
10. [Permutable AI Alpha Generation](https://permutable.ai/llm-driven-alpha-generation/)
11. [FinGPT Data Leakage](https://arxiv.org/pdf/2602.19073)
12. [FinGPT Data Leakage HTML](https://arxiv.org/html/2602.19073v1)
13. [Medium Benchmarking FinQA](https://medium.com/@jacqueline.garrahan/lessons-in-benchmarking-finqa-0a5e810b8d15)
14. [NeurIPS FinBen Benchmark](https://proceedings.neurips.cc/paper_files/paper/2024/file/adb1d9fa8be4576d28703b396b82ba1b-Paper-Datasets_and_Benchmarks_Track.pdf)
15. [AI Benchmarks Broken](https://forum.gnoppix.org/t/ai-benchmarks-are-broken-and-the-industry-keeps-using-them-anyway-study-finds/3890)
16. [AIMultiple Finance LLM](https://aimultiple.com/finance-llm)
17. [DigitalOcean Inference Tradeoffs](https://www.digitalocean.com/blog/llm-inference-tradeoffs)
18. [Economics of LLM Inference](https://mlechner.substack.com/p/the-economics-of-llm-inference-batch)
19. [Economic Evaluation LLMs](https://arxiv.org/html/2507.03834v1)
20. [Dell Inferencing Economics](https://www.delltechnologies.com/asset/en-in/solutions/business-solutions/industry-market/esg-inferencing-on-premises-with-dell-technologies-analyst-paper.pdf)
21. [PM Research Global LLMs](https://www.pm-research.com/content/iijpormgmt/51/2/162)
22. [MIT Media Lab LLMs in Finance](https://web.media.mit.edu/~xdong/paper/jpm24b.pdf)
23. [ACL Anthology Global LLMs](https://aclanthology.org/2025.clicit-1.74.pdf)
24. [NIH LLM Equity Research](https://pmc.ncbi.nlm.nih.gov/articles/PMC12421730/)
25. [Microsoft Wall Street Benchmarking](https://techcommunity.microsoft.com/blog/microsoft365copilotblog/llms-can-read-but-can-they-understand-wall-street-benchmarking-their-financial-i/4412043)
26. [MDPI FinBERT vs GPT-4](https://www.mdpi.com/2079-9292/14/6/1090)
27. [SHURA Sentiment Analysis](https://shura.shu.ac.uk/34390/1/BDCC-08-00143.pdf)
28. [CLIC Target Sentiment Analysis](https://clic2025.unica.it/wp-content/uploads/2025/09/73_main_long.pdf)
29. [ResearchGate Sentiment Sectors](https://www.researchgate.net/publication/389734886_Comparative_Investigation_of_GPT_and_FinBERT's_Sentiment_Analysis_Performance_in_News_Across_Different_Sectors)
30. [System 1 vs System 2 Thinking](https://arxiv.org/html/2506.04574v1)
31. [FailSafeQA Financial Benchmark](https://ajithp.com/2025/02/15/failsafeqa-evaluating-ai-hallucinations-robustness-and-compliance-in-financial-llms/)
32. [FinLake-Bench Leakage](https://arxiv.org/html/2510.07920v1)
33. [Benchmark Transparency](https://arxiv.org/abs/2404.18824)
34. [Hybrid LLMs Llama 3 GPT-4](https://medium.com/@oren.dinai/hybrid-llms-for-confidential-financial-analysis-blending-gpt-4-and-llama-3-81b6951bdd2a)
35. [Daloopa Financial Retrieval](https://daloopa.com/benchmark/an-open-source-benchmark-to-measure-llm-accuracy-in-financial-retrieval)
36. [Transaction Costs Sentiment Trading](https://arxiv.org/html/2507.09739v1)
37. [LLM Sentiment Generation](https://navnoorbawa.substack.com/p/how-llm-sentiment-analysis-generated)
38. [ResearchGate Trading with LLMs](https://www.researchgate.net/publication/378995378_Sentiment_trading_with_large_language_models)
39. [MDPI Strategy Development](https://www.mdpi.com/0718-1876/20/2/77)
40. [QuantStart Backtesting](https://www.quantstart.com/articles/Successful-Backtesting-of-Algorithmic-Trading-Strategies-Part-II/)
42. [Harvard Business School Foreign Bias](https://www.hbs.edu/ris/Publication%20Files/26-013_c6b42163-3a78-4df0-8275-dcb6d1b089e1.pdf)
43. [NIH LLM Applications Stock Investing](https://pmc.ncbi.nlm.nih.gov/articles/PMC12421730/)
44. [Market Regimes Long Run LLMs](https://arxiv.org/html/2505.07078v5)
45. [Market Regimes FINSABER](https://arxiv.org/html/2505.07078v2)
46. [PM Research European LLMs](https://www.pm-research.com/content/iijpormgmt/51/2/162)
47. [Microsoft Multilingual Benchmarking](https://www.microsoft.com/en-us/research/publication/benchmarking-large-language-models-across-languages-modalities-models-and-tasks/)
48. [IBM Japanese Financial Benchmarks](https://research.ibm.com/publications/large-language-model-evaluation-on-financial-benchmarks)
49. [Language Ranker Low-Resource](https://arxiv.org/html/2404.11553v2)
50. [Language Ranker OpenReview](https://openreview.net/forum?id=tkbIJpb6tO)
51. [LLM-Stats GPT-4 vs Llama 3.1](https://llm-stats.com/models/compare/gpt-4-0613-vs-llama-3.1-8b-instruct)
52. [Vellum Llama 3 vs GPT-4](https://www.vellum.ai/blog/llama-3-70b-vs-gpt-4-comparison-analysis)
53. [Labellerr Language Models](https://www.labellerr.com/blog/comparing-language-models-through-parameters-vs-real-life-experiments/)
54. [OpenXcell Llama 3 vs GPT-4](https://www.openxcell.com/blog/llama-3-vs-gpt-4/)
55. [PromptLayer Llama 3 vs GPT-4](https://blog.promptlayer.com/llama-3-vs-gpt-4/)
56. [Rajat Gautam Latency Comparison](https://rajatgautam.com/blog/llama-3-vs-gpt-4/)
57. [FinGPT Evaluation](https://arxiv.org/html/2507.08015v1)
58. [SourceForge BERT vs GPT-4 vs Llama 3](https://sourceforge.net/software/compare/BERT-vs-GPT-4-vs-Llama-3/)
59. [ACL Data Leakage HumanEval](https://aclanthology.org/2024.findings-emnlp.772.pdf)
60. [Data Contamination LLMs](https://arxiv.org/pdf/2311.09783)
61. [Frontiers Privacy Leakage](https://www.frontiersin.org/journals/digital-health/articles/10.3389/fdgth.2026.1761624/full)
62. [Leak-LLM Survey](https://leak-llm.github.io/)
63. [TS-Guessing OpenReview](https://openreview.net/pdf?id=a34bgvner1)
64. [IBM Japanese European Finance](https://research.ibm.com/publications/large-language-model-evaluation-on-financial-benchmarks)
65. [Japanese Sentiment Biases](https://arxiv.org/html/2411.00420v1)
66. [MDPI TLFSA Bloomberg](https://www.mdpi.com/1999-4893/18/1/46)
67. [ResearchGate QLoRA Sentiment](https://www.researchgate.net/publication/398882676_Financial_Sentiment_Analysis_with_Large_Language_Models)
68. [GBSPress Market Sentiment](https://www.gbspress.com/index.php/JCSSR/article/view/473)
69. [ChatGPT Return Predictability](http://wp.lancs.ac.uk/fofi2024/files/2024/04/FoFI-2024-139-Alejandro-Lopez-Lira.pdf)
70. [RePEc ChatGPT Forecasting](https://ideas.repec.org/p/arx/papers/2304.07619.html)
71. [ChatGPT Stock Movements](https://www.dirk.org/wp-content/uploads/2023/06/Can-ChatGPT-Forecast-Stock-Price-Movements.pdf)
72. [ChatGPT Return Predictability arXiv](https://arxiv.org/pdf/2304.07619)
73. [SCIRP Risk-Aware PPO](https://www.scirp.org/reference/referencespapers?referenceid=3981842)
74. [ResearchGate FOMC Llama 3](https://www.researchgate.net/publication/385819587_Is_Small_Really_Beautiful_for_Central_Bank_Communication_Evaluating_Language_Models_for_Finance_Llama-3-70B_GPT-4_FinBERT-FOMC_FinBERT_and_VADER)
75. [FinMA Benchmarks](https://arxiv.org/html/2510.05151v1)
76. [Medium S&P 500 Trading](https://antonio-velazquez-bustamante.medium.com/how-large-language-models-like-finbert-are-changing-s-p-500-trading-1be88af703c8)
77. [FinLLM Leaderboard FPB](https://finllm-leaderboard.readthedocs.io/en/latest/datasets/bloomberggpt/fpb.html)
78. [HuggingFace FinBench](https://huggingface.co/blog/leaderboard-finbench)
79. [FinGPT Real World](https://arxiv.org/html/2507.08015v1)
80. [MDPI Sector Sentiment](https://www.mdpi.com/2079-9292/14/6/1090)
81. [SambaNova Llama 3 RAG](https://sambanova.ai/blog/outperforming-gpt-4o-with-llama-3-8b-fine-tuning-rag)
82. [LLM-Stats GPT-4o vs Llama 3.1](https://llm-stats.com/models/compare/gpt-4o-2024-08-06-vs-llama-3.1-8b-instruct)
83. [ResearchGate FinBERT GPT](https://www.researchgate.net/publication/389734886_Comparative_Investigation_of_GPT_and_FinBERT's_Sentiment_Analysis_Performance_in_News_Across_Different_Sectors)
84. [Frontiers LSTM Sentiment](https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1559900/full)
85. [UVT Earnings Calls](http://arno.uvt.nl/show.cgi?fid=188999)
86. [ResearchGate Informer Prediction](https://www.researchgate.net/publication/388092371_Stock_Price_Prediction_Using_LLM-Based_Sentiment_Analysis)
87. [Medium Stock Market GPT-4o](https://medium.com/@mollelmike/stock-market-analysis-with-gpt-4o-and-llama-index-a-deep-dive-into-ai-powered-insights-a-case-of-d63fff3f7dcd)
88. [DIVA Social Media Sentiment](https://www.diva-portal.org/smash/get/diva2:1985458/FULLTEXT01.pdf)
89. [ODSC LLM Financial Markets](https://odsc.com/speakers/large-language-model-applications-for-financial-markets/)
90. [PM Research KPI-BERT](https://www.pm-research.com/content/iijpormgmt/51/2/162)
91. [ESMA Turing Report](https://www.esma.europa.eu/sites/default/files/2025-06/LLMs_in_finance_-_ILB_ESMA_Turing_Report.pdf)
92. [Survey LLMs Financial Apps](https://arxiv.org/html/2406.11903v1)
93. [KeyMakr Specialized LLMs](https://keymakr.com/blog/llm-finance-specialized-language-models-for-banking-investment/)
94. [BloombergGPT Statistics](https://www.quantumrun.com/consulting/bloomberggpt-statistics/)
95. [BloombergGPT arXiv Tokens](https://arxiv.org/html/2303.17564v3)
96. [AlphaArchitect BloombergGPT](https://alphaarchitect.com/where-large-language-models-and-finance-meet/)
97. [Reddit Financial LLMs](https://www.reddit.com/r/AIportfolio/comments/1pmhl3k/financial_large_language_models_for_investing_key/)
98. [CLIC Target Sentiment Analysis](https://clic2025.unica.it/wp-content/uploads/2025/09/73_main_long.pdf)
99. [CometAPI Llama 3 vs GPT-4o](https://www.cometapi.com/comparing-llama-3-1-8b-and-gpt-4o-mini/)
100. [AIMLAPI Benchmarks Speed](https://aimlapi.com/comparisons/llama-3-1-8b-vs-chatgpt-4o-mini)
101. [Medium Mistral GPT-4o Llama](https://medium.com/ai-business-asia/a-three-way-fight-gpt-4o-mini-vs-llama-3-1-405b-vs-large-2-961a2615ea6e)
102. [DocsBot Llama vs GPT-4o](https://docsbot.ai/models/compare/llama-3-8b-instruct/gpt-4o-mini)
103. [LLM-Stats GPT-4o mini](https://llm-stats.com/models/compare/gpt-4o-mini-2024-07-18-vs-llama-3.1-8b-instruct)

**Sources:**
1. [researchgate.net](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGi88PdEeNSlyL5fef63OwxMixhjxnH-e3vYpKRGlPMnVlkOpMKhQhKZ4M_-vLdzTyoTtFfdJMFvIpUV14jLEbEjIfX77BNrlS4WAbyo-NO2YIrgHmeqqx0I95OYc2238Hrs5mcuxhTW0uIu2VNodv0vncNkHeccwyFPT6VELESzgUDJOb0tyjzk05V-_SWWh88TIno)
2. [mdpi.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGz4dW0otydRkcW1nS8113Td7lYPUF8dUfWgbePriMxnfYTg4GJFQYsDvWjHhBrOjdjuuaXBDusYYVNqGMcafqOoHmBf8zNj6IRp9_YE10uWEKouWFzv85p9RKy)
3. [mdpi.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGlngzOqRfg2E1o1qkDy7uv12o11cskh2AuHvZjhD_lKRXLkbJ9_V5qcteAK-0L8WhN8wJhvJBAKwNyDHuGBQUJotyEEGRH9nKnPTNgvD0YaQd1dJ0zlAIICAHhFDM=)
4. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEc36eyUojhrwQ71sI9KTudqs-XuaKgVxIgtZ5UHe891T4MLwmT1NGlDl11HXDe7SkdgD0VHwTQ8igDfN287P7SLnKI3P8w4gwLa0qtKNVubaQb0WddaatF)
5. [uvt.nl](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQE6lzDkptxbKq37Cx6Sbo7ULf-Zga19N4arSwB-d3jW9r_qEFtABHXGPI5U4bUuCjkOHpguNfvprYls2TnjKisPYiwEEn5ksSLgc4BsDjPrYCyXucDMk_7KdzwX)
6. [stanford.edu](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGwwvsfoHhvWfxbqp1i7aqwLGAdSWYSMImuouNgcktbLHg1DYssbZVPyrHN1FyIBmaHNfUnORaHZWC_IvOVLckxHjOVMNqs0FWZdcwvNP14xZfiTvW56VqVwr5zMzyBlZuf5i3daV_QTUX08aJX8O59fpT2rFA594_3dv6yfA==)
7. [quantumrun.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHYyyO4-zfoFbTar-agI3I3q7aIbN9fI17z0rbO0q-QlFx6NzSdbNzOXXHbQtPGd67h5Njp19-4C33DCYp_wD9TpdEEPpil7m3f7agw5XYynT3d1C13f54lQL2cTZXs_zcEyluE3HzESgqJHo-6_W7sewqE)
8. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHKywIJWlOBUdTZZ8jQmRjhxdVEsRLjWpa0aLLNLOnIYn2RsoyWSnCAdVyaBdDDm1R1mlGJyRgJdNb0bTATvPK0C2tSofEGeP5IF7fo1NMus3fI-HHFjd3p)
9. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGpatQka_uC_BpD_z7DH9odvYHjMBLobHB32fWpSZo3TB0Y9pedMbuaghvaVhmc-dpbgXBBUbHaRPs63hBkadEAGRAt54Xu7PR_GfIxAeW68KcuyInV)
10. [alphaarchitect.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGFprbk4mKndgWMKr0jsA1Ie4CbJ-vuMVHn--XH5I6lEBD-PWNmaEmBesXqwd77VTkNKo4nkIyEC1WYYLIGOPvoGGX4FLJdSMZX4BXdW7GwZYLoL2PtMu41UY_xvgh0ZvnUnBm0oxgsAiCWXM1o6JcPXSzNPKJEdSfKm1TaGQ==)
11. [reddit.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFOoKAg1LcmaUxakR0qxee2YxtPOxPYID2q38-E9F2V9A8XlNN5O9aO0uEqDD3b16peBThCjNNyBUzi8Vx808UnNhhDqX0-CrOfpiwBfxiR0KHDwFgR8eggAHVJ9oZ14UwSyQL9HcdQHzAsMvVNDLOHeEUHUZ2ZM_hrhWvZHeQ_8i9XwirEgf74X9RzG7HceQY1EcOAX8SNT9YxEs_z)
12. [openxcell.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFjMrwwsjVaFWx11adzVi3wdOVR6sBeqnMY8e1c1zDmi37by7xHLma7HftwXTDQeFHUBA2RwlDSlNTJ8g3Tb8uklPLcSejLT1fpRqWfJyYGf9rSE9BgB-6LPkOT-IHR_kApxZYGXg==)
13. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGWaQvDrurX5jj5nuL6FBS46g2vN1E0DfHQGktfosio1W5ybupbF7ZxxVbIKm2qlIu3OeCbuzpP8ZA1s-32tJcO00cezznulcnH2AKNB6uKWJcE52Veg219rQvpVriqSvAt-6G6LAqA1Xh-VWnWYsYq-6hyVfV78oHqwIQYGkTLbrgDGUSm6oo1XxeHfEkV8yNS1sXwQO1w_d1kQTTAIc9l)
14. [llm-stats.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHnyO6ekuQmd_6pqgsEfJ_LDHCYUerJSOD0y9hJy2brxmEfS_c-7kLyq0Zo3uZiIvGj0UjEkKO3kc0eSw-MplY7BgRu0GcaVpG8q7HLU2f2lSu8F0EJZURvBHvqIyf9S-i9ssySTW89OqlzTn8mv0S6JF7WKEsG8kSOw2vCOg==)
15. [llm-stats.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFSi7MMdGtlfO7ioLIUzbc19wWy0ppwXD5hqbl-FT2pCj7bTidbpwXMK-nomI5ASiwiNZgl9b26BpuqC6EMlBOMbbZnEWvX-Q2JhWi3v-Tk6MoUKhac9DtSj3CG5-dzh6iSVTBkp-kV45JTfSnNGCJewMiFq7rpw9pccx3ZyfkCwHsv7r4=)
16. [aclanthology.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGbsV1Ef-DPJ47psU43Iy5yfLdVyYdpypMMXS1LPOVgb2AlGlJ13KQLGFM_jt7oWUiffKOKyRX0bbv_Nt0kL3fHVG3WvKmyhzNuq8oCLERlv7u3lRNGXW0Y2404xRUBWbPM7g==)
17. [unica.it](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQH9aNqIsbBruD1sSghC035Q2uGd6JIMnyfL1PHEEFFkFL0yfulqE42-jcIv_arhMnjXhZACdX9MnaeMcmzuxvxIlXaUpoOKIWFZ6g5nvU0v3oVKf1iCk8NoW5-f2cKFp9ePrnNvlDr-Z9mH3HNs7cXWLdf4IpAxcbJSdA==)
18. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGSVTs7-BhOxCJok4C4EjVYzxEMVbxG9McODP9Fj1Cyuth_9f_EIt5FCq49e2Y7bBJdLLzv_WdD1-OItSPebN7KAYwBO84aHfhMVHplsEPdM6_lqO4gd4aP)
19. [researchgate.net](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQE63jCrqQFw39qIqbhsmhP_pJLqeB-wV5eXRb33fEzdmednNEp90PE5t5p5mqayz-iHP57pkERDS8qSrDerpRT6diPPIEsxZFplbVgTZyU09N0qsuJEnzTNNiEqoXn9WKp_WLnlAW3dOSN7scH5rHnMsS-icwKj4VV5W4Q7CqVqRrMr_3D_vLTYtVpTREncNox1EIlCSWaOwUFX63N6EExMxgRLoU-LmMEY-uCeANWCeRU0W173CXlf0p7DTkrR2S5g6Ujv5P15imuPGWAZdTq9vqlT3Vd8VA==)
20. [researchgate.net](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEKjjrt_Mpc1mWgDoJV9B2f7i0aTr2WAh_IRfe0rBOwon7N2LCTHcjU4UHQoXInKUZez39unwKzWRuMMgVdQKJBu2FdQHEUVSsOzteT1zTnQa27qbXfMHIjwDhzboXZWToOx56U2rHwd4zaXl9N_GIUGy3sg6GBprODggNaLcbG_uPYx0KVip1b8_5GA2cUnr6wCvTvtc5XvSIfsgMdwRo=)
21. [shu.ac.uk](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHqrGMxCZ1wKJIixu7Jf8xaYa1FFOyK_JEWOjYe-xhkGV--qPwQIqVKQquSI3FW6wrwWsJOSmyMo6i1CwWMWkldsVWxTWnpeM7Z5mhRno_BRTF-dGl-WRUqbcYBBnojH6IkMPtgC_Y=)
22. [repec.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEmR077awmFbY4zgRtbZSQuuPjO802WHL4U6CssojOu2AoxT3RR6jNv8mrGIvK52d9FsJz-L5V0WjaTyDt2Tv6kd0GPMdOooKvUkDipmylxHoZNps0OreTZvZg669t6tZho36biTj5GJHg=)
23. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEA6kJ048Zgx7q744yCrbzjV4CWW3yTMWtKmPcta2JK_skwk7pNse6aL_4_mhjIBkCB1AIiGk_lRelRc5cWCTpVT7X0byyeeom3YILXdmmbYcCtbYY1)
24. [substack.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHoWyiCbh0_3obS9gIZ0D7leX-VSelgJ55r8H6A-wx-7r9qJXVrqNIfqsfwWNv2l-oSZUW_J0NRNsthuCVNrqend-VFREp_2O7tEwIpnukSPInG_x9fGjVUomzRLiNvojQatSHPoczR80cfjyWpM4SsCNUzyYVZLRc59Fr6)
25. [permutable.ai](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGkNM_3Y1s_3GEeVTgQ2IDE558RfQ2fm9qMetlp4ELcrwBzAbwU55JmBDr0meYmerYmf8U_b4L9stsacz-Lff4czvX1_vI7GaXfIvyB9xsuVEZ_FTPtsm9K8At5HKTGqorNLXPF4XNi)
26. [mdpi.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEs4tJsxs4xJrm2QK5BIrkX82M7zLnU9e7qyjDaoA9p-hMtBNF6QFGaG1z8pgW0QVO-3ttnICbAoYdSgkf7tS-sMRwQ_V4P_g5kyHY1nM7_L5u6WlXTAL1Evjp8)
27. [medium.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEuzatsVJz6hOAcbQkyr8G-W9LBrozDQPNm4I56AAis_ClLwc0ohsdZ1Dfl2xIaGPc9p--3cvjPuWxs5rrqTKncbD4osW8DsgGMwu6uYkoG2bcOjMYy2jZwr3U9CvrdlJ3Uc3opq1Vf3bYydgBK1RQ_EjB1ae-CPxURVg19e4IQt-IOG7_x9IobOku1SWes-ElK5Q2nBflYLMYNGU5kdU2FUEbFP68gyyE6OxZkT5hmqy8TrOZ4tg==)
28. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEZbR9GkE1JgX1Z5E2ZDDLxxZ13JK0qG4XHdCNgz3Gg9adHNTXgZoKPV49E_5wnbzogmdcg10jigVlimLBecSR0YedoXjA9TkppBAGxBMpaW5iJ2Oc6v3DS)
29. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQH1oNX3Zg_cykwhHXiolu7uJtXc6ghUTfs3XVBLOizdAMuHXNLxBNml2AKTQu3wRlBAfJTFkR0Bl7zj1sEwutzVXYRb8WILtYIpl0akPiJMw4cwSt9FC5kk)
30. [substack.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHef_Kcd1Dx1VulgkZ3TN3NjTSxp2YaGK2Uahu43-7kuc07_o8ZD9K9ujnCYdK0gdhF3VUawL7zo59LPVQufgtssLCkRHO0t5hUhR2rYsCvKV6S--7bhGGLEnlrqQiNWoPynddFVB1pG5v21q_jYOuco8-3fBZoAg==)
31. [quantstart.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEJq8tvX8pJhCIui-jM8CLV_U9xUiJNAIZo5aeKCXQGnLqt7-trb3pjtinEEgEFFfMRygxKke_2HGgUX07gmURt_V306fbuAX2Hsu94nms-hS_3YY3qo9EOKbwetkRzbw355rQXhUgOrxDI94rrYt-S4yZswSmaB0f01WCI_Wk-oEJoNYfxO_zNoUvKqOxpRsgTDf7fUmu8c35z)
32. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFH5N2klpYgOp7t8vzwje41xIL-9f8X0aAp5jylUBzYY8gRXrx5q0B1TaMrYAKRZ9RKOSRGa-DcWFY1V6WnBDNp4_YZcwj5oItIqomjG_NXZSE_4YLQXALt)
33. [lancs.ac.uk](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQG15ba8e1U3o2aaiouwTl3BmZPWUSjO3OKrLPfkElDwHaEwytKAimhtq0adSfqLEA_2IwMdPpSS8Ema2knUVlMgi4Yy_WGBwSTgoNHLf0ZCx8Q-Vtt81_Hu7DX0tHmrem-Wwao-PGaEWs54sPYSDqarXKHR1nqTkUcIBL-t-M0YggmdEkJNtueB)
34. [digitalocean.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHa1Gr76ULPPIbYdvky8kqlC0t_IMtamnozrYxhmist8pN6cH35fRmvbAuEox5BfnBmD8NMnbNA-9nsHCbtq4sYBf0cQXZJMUu00gKth1iHmIfCYy9GMyUGsDiV7sFWJb2Td2UeLZH79D6zwt7cEQ==)
35. [vellum.ai](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFK7qaGqXqjfmWi6ubYsFLOJaWkigyG_dUcXwBFFp0wkI7DgkeJNIReIhqrMogVCZEdTF_7cOJ5V_0XJr4kFVd1M4d3sLY6229Ul3S8khIRlen8dubjKFDDiW2scSat7xrjech8xoZRylItVB81JtezlFP57Uah_Jk=)
36. [rajatgautam.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQE8FmM7oa9LZNE-osFlKgBgqXbDXmkwAkcT7zeBdZ9KKea1D48vbwtYIW3cSmqTfswArtLhImqmgz5pohrXQRFnCjKmxaVJVJnRD0XId6y62OxO18bVh0f53xTF1H7WvLQnPss=)
37. [cometapi.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHJK7czXd-VWF0Lhs2MGobU1idWqB-EVM4gQBk8_G1wMDqtHEVhnAjx0nJIwZNm-HXcpLOibvix7V7nsT5CVBjHU-l4NCv7mkbGLGcXrrJXaumYJMMWEwNaah2dIQrf15svRDlCkBnK4Ye3EwVVbygy9aumGoo=)
38. [llm-stats.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFnSWvFtbs6Fjpbsgdf-8rx6WjgKFmBBFGDEUu6CwnDJgam8kpT2_0a4XAuAg3bURLsUBQytMLxtKQVM7sT8qctpt8FVn1E7Qjg2SrrE07eWOA5Z7DAM7JsfBZjkYCQuvjJ0cI3WWDN_DONvdjAV3OJHK_1DZ53Ckekd0BLjav_yAVBcE6I3Jl2lg==)
39. [aimultiple.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEjpzbLuv8Qs_KHNnoQlTma1W3ZCo7l2FEoD2t23KDeyZx0_LeOfBZWfYNMEHeF9bfSVavGPOi6aHyQAt8aBnh_bHQjyviiuXhN2UfTXdXJ6iDW3_M-lo0=)
40. [sambanova.ai](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHvUVW6ItCeBscK5bR6R20_wyZpFBZYHsNjyJIOL9KbLwfUsC5ExR3_wv6HDgOGDlwzIroKyjKyawcaUWDDMnVJySbzAmYrJbqo3aqVz0k48aDeyOjm6ZO_hGm1EViVcQbTjGeqI-JY2qnTFfhNg28URFR6jVQiPlsF61NGgo41ALGzaA==)
41. [gnoppix.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHaF_LSwQGL_8Sjrrk1XFLt6fPWOYvuxtSpSi0khdDOM5Yq6R6S4--RlYsO_TFv6-d_XiCTMoliPm3DGDzOuGaR9ceRUsOz7_rJlqBp-thvQewd4qfDN99ltQFw21TlWsKpIf414RfeFukWcmeRNfzpnixlS-w8ZHLd-etvl8jzxcQ3eXtMn9A9azGXcOhvT2kp3DRxJvPGv7Uz4BzlPDXt-Bb5)
42. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFcRqUDtWU0JRKvPijqUoE1SucUscG00Q6AKkRktL4Kh5lUrr0C7qrpBn9NXyxepCqS3fDwnWbpy_qTKfb_d2mWIjbNzaH7BAZhwuKmU03jqY5VCAvJ)
43. [github.io](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGzJQGdwC2grUdo7uCu5oGEtVFMPLK_50zVq9cPqXV_zwc96heCYLGVdIZA_z-LVbuVwN2orJwmeUnhPsDfREowuuGBckpJ8luQe_RJA42gGg==)
44. [aclanthology.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFj2kbm0Fy37f5CGvppo3o7d6XBQOw_Le_fSiOG7QqtKJZBw3VmjU0hzY7O3WFFBakhZOoVVkuwOG6tF5mhO3GjVb18nUX-orYcXUW1MIwAwXMhHveZ5zh2R36mTfScEUnnIFpmyI2eDZ0=)
45. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQE5axsExx47q8DBZbHqOF3aO9huADY3cSd6__oNzy0Q4KNHkTw3E54MZCVLc1VLcjTSz4Cv9FREadoZiTE9EoAw7sJRXDFZoztzaSaWySv-VR-OPerTYW11)
46. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFQhlVG4W8F8rAmCOS_1_VXoRCRBkRMGT0QuxYDb-5X7Q7bFIzfdaospiHmprR4kHj_4aIdb4OFrFbaQCNApZjlOgKl-wtbYznDYG9uy5afCpvMbcSB)
47. [hbs.edu](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEPlhU7ZVCSYvShukXUzweuhkVzlfIk9KsRNMyHmcLnP9ou3iSlB-9PiZ6YmypuNDYv8UXlK0I5P4JeI2Vd9C98FiqxeXf-nPhu-oIgiPPpPIa87x4PAsXr944eWCxYngob1metK2MyWhjwjQuH2oi5Ev0rnx5RNF1XO-6sFbhRCbVkFHqG9E4eYTmqPs6alU4=)
48. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHXm2vj9WCcGPYD7gRL4pjBKv1o385XEZKoXQuHqNKtcZW5P3lV9idN0x4_kdc1Y6lBWpa5vS2cNVBGDh7-7CRsgXlD1afmT8mfiZJoWOcZ9gJVhtE4NVak)
49. [openreview.net](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFQlo3XCchissM3kkeVRzkgcvqmN2p45KzRLCD3lzQyTikm9aDUAzIjQrsEvHyTLQxwDJqG2kLwHX86vgVIQn5yaU0fuuanOJcGK4U2CEIJzOxoLNjsrIhzPKzHJTX83w==)
50. [pm-research.com](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQFEOWlQ8vg6lL3h8VH6uUK_RRlebCEbUS-BIAGFV7suLzavaHD8GBuB1VmGmMEnjCDzpxAyEtMrfiN1vMJlbvgmctN-tGXsraFZDFMGyzpLuXfVJDeMf6z50zD9PdSGf6tlFcjyMQVO5Qt8_t4=)
51. [arxiv.org](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQElM-NLu_lONE2N-TsR7X12zBY2bOXYbEqUWhWp3nZHofrwmyKSHVZU8rD_RuvFG7cUOlS9AHAukh_YbeiW4MyPlERuASNLwTTxWDlP2jyYYffQ8gG1yO6n)
