Why do AI trading strategies struggle to beat a passive buy-and-hold approach?

AI strategies often fail due to high portfolio turnover, overfitting historical data, and transaction costs like bid-ask spreads and slippage, which erode gross profits.

Do AI-managed ETFs consistently outperform standard market benchmarks?

No. Longitudinal data and academic analyses show that AI-managed ETFs generally mirror or slightly underperform passive index benchmarks once active management fees are deducted.

How well do Large Language Models perform as stock pickers?

A 2026 NBER study showed that LLM-managed portfolios produced excess returns that were statistically insignificant compared to the S&P 500, while introducing risky concentration in specific tech sectors.

What is overfitting in AI trading models?

Overfitting occurs when a complex machine learning model memorizes historical market noise rather than capturing structural trends, causing it to fail when deployed on live, unseen data.

Updated 2026-06-14

Key takeaways

AI trading strategies typically fail to outperform a passive buy-and-hold approach for average investors over long time horizons once fees and risks are accounted for.
Up to 70% of machine learning models that succeed in theoretical backtesting fail in live markets due to overfitting, survivorship bias, and look-ahead bias.
Large language models and professionally managed AI ETFs generally fail to generate consistent alpha, often just mirroring standard market benchmarks.
The gross profits of sophisticated AI models are frequently erased by the high transaction costs, bid-ask spreads, and slippage caused by excessive portfolio turnover.
As more AI algorithms interact in the market, they create endogenous noise and learning externalities that degrade overall performance and reduce profitability.

Despite its sophisticated data processing, artificial intelligence generally fails to beat a basic buy-and-hold trading strategy for the average long-term investor. While machine learning models often boast massive hypothetical returns, these results are typically skewed by flawed backtesting and overfitting to historical noise. In live markets, the fleeting profits of AI algorithms are quickly erased by high transaction costs, slippage, and portfolio turnover drag. Consequently, passive investing remains the most reliable and statistically superior path to wealth.

Can AI Trading Strategies Beat Buy-and-Hold

Artificial intelligence trading strategies excel at identifying short-term market inefficiencies and executing complex technical patterns, but they frequently fail to beat a basic buy-and-hold approach for the average investor over long time horizons. The gross profits generated by machine learning models are typically erased by high portfolio turnover, bid-ask spreads, and algorithms that overfit historical data. While institutional quantitative funds achieve distinct advantages using vast infrastructure, retail AI tools and AI-managed funds tend to either underperform or merely mirror standard market benchmarks once transaction costs and risk are fully accounted for.

The New Era of Algorithmic Dominance

Artificial intelligence has fundamentally reshaped the mechanics of modern financial markets. By 2025, algorithmic systems were responsible for handling nearly 89% of global trading volume, a dramatic paradigm shift from the days of manual human execution and traditional floor trading ¹. This ongoing transformation is driven by the theoretical promise of machine learning: the ability to process terabytes of unstructured data - ranging from global news sentiment and social media velocity to satellite imagery and macroeconomic indicators - vastly faster than any human analyst ²³.

For institutional giants, the integration of advanced quantitative models has been historically lucrative. Top-tier quantitative hedge funds utilizing deep reinforcement learning and natural language processing (NLP) have posted remarkable returns. For example, some leading systematic funds, such as D.E. Shaw's Oculus fund, returned upwards of 36% in a single year, while the broader quantitative hedge fund industry delivered a record $543 billion in investor gains in 2025 ¹¹. These institutions employ specialized teams of physicists, mathematicians, and computer scientists, leveraging massive computational infrastructure to execute trades in microseconds and capture fleeting arbitrage opportunities before the broader market can react.

However, a massive retail industry has emerged in the wake of this institutional success, offering "AI-powered" trading bots, large language model (LLM) portfolio managers, and machine-learning exchange-traded funds (ETFs) to everyday investors ²⁶. The underlying marketing narrative suggests that artificial intelligence will democratize high finance, allowing the average retail participant to consistently generate "alpha" - returns that exceed a benchmark like the S&P 500.

Yet, when financial researchers strip away the marketing terminology and analyze peer-reviewed academic studies, live market data, and net-of-cost performance metrics, the reality of AI trading becomes far more nuanced. While artificial intelligence is an incredibly powerful tool for data synthesis, transitioning a successful mathematical model into a profitable live trading strategy is hindered by profound structural barriers ¹³⁸.

The Seductive Trap of Backtesting

To understand why AI trading strategies frequently fail to beat a passive buy-and-hold approach, it is essential to understand how these strategies are developed and validated. Machine learning models are trained on historical market data and then validated through backtesting, a simulation of how the resulting rules would have traded over that history. The algorithm is fed years of historical price action, volume data, and alternative metrics, and is tasked with optimizing a set of trading rules to maximize hypothetical returns ⁴¹⁰.
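To make the process concrete, here is a minimal backtest sketch in Python. Everything in it is an illustrative assumption rather than any vendor's engine: the function names, the long/flat position convention, and the momentum rule with its `lookback` parameter are hypothetical. The key detail is that the position held on each day is computed only from prices observed before that day, and `optimize_lookback` plays the role of the optimization step that keeps whichever parameter produced the best hypothetical return.

```python
from typing import Callable, List, Sequence

# A signal maps the price history seen so far to a position: 1 = long, 0 = flat.
Signal = Callable[[Sequence[float]], int]


def backtest(prices: Sequence[float], signal: Signal) -> List[float]:
    """Simulate daily returns of a long/flat rule on historical prices.

    The position held over day t is decided from prices[:t] only (history through
    the previous close), so the simulated trade never peeks at the price it earns.
    """
    returns = []
    for t in range(1, len(prices)):
        position = signal(prices[:t])
        returns.append(position * (prices[t] / prices[t - 1] - 1.0))
    return returns


def total_return(returns: Sequence[float]) -> float:
    """Compound a series of per-period returns into one total return."""
    wealth = 1.0
    for r in returns:
        wealth *= 1.0 + r
    return wealth - 1.0


def momentum_signal(lookback: int) -> Signal:
    """Hypothetical rule: be long if the last close is above the close `lookback` days earlier."""
    def signal(history: Sequence[float]) -> int:
        if len(history) <= lookback:
            return 0  # not enough history yet, stay flat
        return 1 if history[-1] > history[-1 - lookback] else 0
    return signal


def optimize_lookback(prices: Sequence[float], candidates: Sequence[int]) -> int:
    """The 'optimization' step: keep whichever parameter had the best hypothetical return."""
    return max(candidates, key=lambda k: total_return(backtest(prices, momentum_signal(k))))
```

Because `optimize_lookback` simply keeps the best historical performer, it is exactly the step that overfitting and data biases can exploit: the more parameters it searches, the more likely the winner is fitted to noise.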

On paper, backtested AI strategies look flawless. It is not uncommon to see retail AI bots or newly developed algorithmic models boasting historical win rates of 98% or hypothetical annualized returns exceeding 100% ⁸. However, financial researchers consistently note that up to 70% of strategies that perform exceptionally well in a backtest fail entirely during live forward testing ⁴.

This massive divergence between theoretical and actual performance stems from three core pillars of machine learning failure: overfitting, survivorship bias, and look-ahead bias.

Overfitting: Memorizing the Exam

The most pervasive problem in applying machine learning to financial markets is overfitting. Financial data is notoriously noisy, meaning it is filled with random, unpredictable price fluctuations that do not represent underlying structural trends. The signal-to-noise ratio in equities, commodities, and foreign exchange is incredibly low ¹¹.

When an AI model is allowed to become too complex - incorporating dozens of parameters and indicators - it does not just learn the underlying market signal; it learns the noise. In data science, overfitting is often compared to a student "memorizing the exam" rather than actually understanding the subject material ⁵. Alternatively, the CFA Institute describes overfitting as tailoring a custom suit so specifically that it only fits one person in one exact, rigid pose ⁶. On historical data, the overfitted model looks like a genius because it has perfectly memorized every random market twitch of the past decade.

But markets are dynamic, adaptive systems. When the overfitted AI is deployed in live trading and encounters new, unseen data, its rigidly memorized patterns fall apart ¹¹⁷.

Research chart 1

A model might discover a statistical anomaly showing that buying technology stocks on a Tuesday when market volatility spikes yielded a 15% return in 2021, but this spurious correlation has zero predictive power for 2026. Therefore, while a simple buy-and-hold investor quietly accumulates market-average returns, the overfitted AI bot executes a series of highly confident, entirely incorrect trades, rapidly bleeding capital ¹⁰⁸.
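The selection effect behind this kind of spurious pattern can be reproduced with a small simulation. This is a hedged sketch under loud assumptions: daily returns are generated as pure noise with no exploitable signal, and each hypothetical "rule" is an arbitrary sequence of long/flat/short positions standing in for patterns like the Tuesday anomaly. The sketch picks the rules with the best backtest Sharpe ratio from a large search, then scores the same rules on the unseen second half of the data. The seed, sample sizes, and the `snooping_experiment` name are illustrative choices.

```python
import math
import random


def sharpe(returns):
    """Per-period Sharpe ratio (mean / standard deviation, not annualized)."""
    n = len(returns)
    mean = math.fsum(returns) / n
    var = math.fsum((x - mean) ** 2 for x in returns) / n
    return mean / math.sqrt(var) if var > 0 else 0.0


def snooping_experiment(n_days=500, n_rules=1000, top_k=20, seed=7):
    """Search many random rules on pure-noise returns; compare backtest vs. live Sharpe."""
    rng = random.Random(seed)
    # Assumption: daily returns are pure noise with zero expected value (no real signal).
    market = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
    half = n_days // 2  # first half = backtest data, second half = "live" data

    scored = []
    for _ in range(n_rules):
        # Hypothetical rule: an arbitrary long (+1) / flat (0) / short (-1) choice per day.
        positions = [rng.choice((-1, 0, 1)) for _ in range(n_days)]
        pnl = [p * r for p, r in zip(positions, market)]
        scored.append((sharpe(pnl[:half]), sharpe(pnl[half:])))

    # Keep the rules that looked best in the backtest, as an optimizer would.
    best = sorted(scored, reverse=True)[:top_k]
    backtest_sr = math.fsum(b for b, _ in best) / top_k
    live_sr = math.fsum(live for _, live in best) / top_k
    return backtest_sr, live_sr


if __name__ == "__main__":
    backtest_sr, live_sr = snooping_experiment()
    print(f"top rules, backtest Sharpe: {backtest_sr:.3f}")
    print(f"same rules, live Sharpe:    {live_sr:.3f}")
```

The top rules show a clearly positive backtest Sharpe ratio while their live Sharpe ratio hovers around zero, because there was never any signal to find. Widening the search makes the best backtest look even better without improving live results.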

Look-Ahead and Survivorship Bias

Beyond overfitting, theoretical AI returns are frequently contaminated by flawed data sets. Look-ahead bias occurs when a backtest accidentally uses information that was not actually available to the public at the exact time of the simulated trade ⁴⁹¹⁷. For example, a model might trigger a buy signal based on a company's end-of-quarter earnings report, but in reality, that report is often not published and verified until weeks after the quarter ends. If the AI is trading "today" on data that will not be released until "tomorrow," the backtest creates a mathematically impossible scenario.
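The standard defence is a point-in-time lookup that keys each figure to its public release date rather than to the period it describes. The sketch below contrasts the two lookups; the earnings records, dates, and function names are hypothetical illustrations, not real company data or any data vendor's API.

```python
from datetime import date
from typing import Optional

# Hypothetical earnings records: (fiscal quarter end, public release date, EPS).
# Dates and values are illustrative only. Records are sorted oldest first.
EARNINGS = [
    (date(2025, 3, 31), date(2025, 4, 24), 1.10),
    (date(2025, 6, 30), date(2025, 7, 24), 1.25),
    (date(2025, 9, 30), date(2025, 10, 23), 0.95),
]


def latest_eps_naive(as_of: date) -> Optional[float]:
    """Look-ahead bug: treats a quarter's EPS as known on the quarter-end date."""
    known = [eps for period_end, _, eps in EARNINGS if period_end <= as_of]
    return known[-1] if known else None


def latest_eps_point_in_time(as_of: date) -> Optional[float]:
    """Point-in-time lookup: uses only figures whose release date has passed."""
    known = [eps for _, released, eps in EARNINGS if released <= as_of]
    return known[-1] if known else None


trade_day = date(2025, 7, 10)  # after Q2 ended, before Q2 results were published
print(latest_eps_naive(trade_day))          # Q2 figure the market could not yet see
print(latest_eps_point_in_time(trade_day))  # Q1 figure, the latest actually public
```

A backtest built on `latest_eps_naive` would "know" second-quarter results two weeks before anyone could trade on them.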

Survivorship bias is an equally dangerous pitfall for machine learning models. Many retail and amateur quantitative backtests only train their AI models against companies that are currently listed in major indices like the S&P 500 ¹⁷¹⁰. They ignore the hundreds of companies that went bankrupt, were delisted, or plummeted in value over the last twenty years. By only testing the AI against companies that "survived," the historical returns are artificially inflated. This preinclusion bias leads investors to believe the AI is a stock-picking savant, when it is actually benefiting from retroactive filtering ¹⁷.
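The arithmetic of this inflation is easy to see on a toy universe. In the sketch below, the firm names and ten-year returns are made up for illustration; the only point is how dropping firms that no longer exist changes an equal-weighted average.

```python
# Hypothetical 10-year total returns (1.80 = +180%, -1.00 = wiped out).
# Firm names and numbers are made up purely for illustration.
HISTORY = {
    "FIRM_A": (1.80, True),    # (total return, still listed today?)
    "FIRM_B": (0.60, True),
    "FIRM_C": (0.30, True),
    "FIRM_D": (-1.00, False),  # went bankrupt
    "FIRM_E": (-0.70, False),  # delisted after a collapse
}


def average_return(history, survivors_only):
    """Equal-weighted average total return across the chosen universe."""
    picks = [ret for ret, listed in history.values() if listed or not survivors_only]
    return sum(picks) / len(picks)


print(f"survivors only: {average_return(HISTORY, True):+.0%}")
print(f"full universe:  {average_return(HISTORY, False):+.0%}")
```

Testing only today's survivors reports a 90% average return here, while the universe an investor could actually have bought ten years ago averaged 20%.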

Real-World Performance: AI vs. Buy-and-Hold

When quantitative analysts force AI models to transition from theoretical backtests to live, real-world execution, the efficacy of algorithmic trading can be evaluated across three distinct categories: Large Language Model (LLM) portfolio management, AI-managed ETFs, and retail trading bots.

Large Language Models as Stock Pickers

Following the mainstream adoption of advanced LLMs like ChatGPT-4, Claude 3.5, and Gemini, a wave of retail investors began treating chatbots as informal financial advisors. But can a text-prediction engine actually construct a market-beating portfolio?

A rigorous 2026 working paper from the National Bureau of Economic Research (NBER) tested this exact premise by tracking LLM-generated portfolios in real-time over an eight-month period ¹⁹. To completely eliminate the look-ahead bias inherent to historical testing, the researchers submitted daily prompts to several top-tier models (including ChatGPT 5.0, Anthropic's Claude Sonnet 4.5, Google's Gemini 2.5 Flash, and xAI's Grok 4), asking them to actively manage a portfolio. The researchers then tracked these forward-looking returns against the S&P 500 ¹⁹.

The live results dismantled the narrative of LLM stock-picking supremacy:
* Buy-and-Hold LLM Portfolios: For portfolios where the AI selected stocks to hold long-term, the excess returns over the S&P 500 were mostly statistically insignificant across time horizons from one day to six months. After adjusting for peer-group performance, the AI models offered no measurable alpha ¹⁹.
* Actively Managed LLM Portfolios: When models were prompted to actively trade and rebalance their holdings daily, the adjusted returns were positive on average but statistically indistinguishable from zero ¹⁹.
* Dangerous Concentration Risks: The LLMs exhibited massive, unhedged biases that a traditional human manager would flag as reckless. For instance, ChatGPT placed nearly 20% of its entire portfolio wealth into a single stock (Nvidia) throughout the sample period. The AI portfolios were heavily over-concentrated in the semiconductor and computer hardware industries (averaging 41% of the portfolio, compared to roughly 20% of the S&P 500) ¹⁹.
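"Statistically insignificant" asks whether the average excess return is large relative to its own noise. A simplified check is the t-statistic of the mean per-period excess return, sketched below under the assumption of independent, identically distributed periods. This is not the NBER paper's exact methodology, which also adjusted for peer-group performance, and the function name and sample returns are hypothetical. As a rough large-sample rule of thumb, an absolute t-statistic below about 2 means the excess return cannot be distinguished from zero at the conventional 5% level.

```python
import math


def excess_return_t_stat(strategy, benchmark):
    """t-statistic of the mean per-period excess return (strategy minus benchmark).

    Assumes independent, identically distributed periods; no peer-group or factor adjustment.
    """
    diffs = [s - b for s, b in zip(strategy, benchmark)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two periods")
    mean = math.fsum(diffs) / n
    var = math.fsum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var == 0:
        raise ValueError("excess returns have zero variance")
    return mean / math.sqrt(var / n)


# Hypothetical daily returns for an LLM-picked portfolio and the index.
portfolio = [0.012, -0.004, 0.007, -0.010, 0.015, 0.001, -0.006, 0.009]
index = [0.010, -0.002, 0.006, -0.008, 0.011, 0.002, -0.007, 0.006]
t = excess_return_t_stat(portfolio, index)
print(f"t = {t:.2f} ({'significant' if abs(t) > 2 else 'not significant'} by the rough |t| > 2 rule)")
```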

When retail investors rely on LLMs to pick stocks, they are not receiving proprietary algorithmic insights; they are generally receiving an echo chamber of recent financial news sentiment, heavily skewed toward the mega-cap tech stocks most prevalent in the model's training data ¹⁷¹⁹. Because LLMs process text rather than raw numerical order-book data, their outputs mirror the consensus rather than predicting future market inefficiencies.

AI-Managed Exchange Traded Funds (ETFs)

If LLM chatbots struggle to beat the market, how do professionally managed, legally regulated AI-driven ETFs perform? Over the last decade, several funds have launched that rely heavily on machine learning, natural language processing, and proprietary neural networks to actively pick stocks. Prominent examples include the Amplify AI Powered Equity ETF (AIEQ) utilizing IBM Watson technology, the QRAFT AI-Enhanced U.S. Large Cap Momentum ETF (AMOM), and the LG QRAFT AI-Powered U.S. Large Cap Core ETF (LQAI) ¹¹¹².

These institutional-grade funds analyze millions of data points daily - including global news sentiment, social media metrics, and real-time financial filings - to continuously adjust their portfolio weightings ¹¹¹². Yet, longitudinal data reveals a sobering reality: AI-managed ETFs generally do not exhibit a clear, consistent advantage over passive market benchmarks ²²¹³.

While certain funds experience brief periods of outperformance - such as AMOM outperforming the S&P 500 by over 10% during a specific 12-month window prior to late 2024 - their long-term, multi-year averages tend to simply mirror the broader market, often dragging slightly behind once their active management fees are deducted ²⁴. A comprehensive academic analysis of 47 different AI ETFs (holding over $32 billion in assets) through 2025 revealed no definitive advantage for active AI management versus passive index investing ²²¹³. In fact, some machine-learning strategies failed entirely, such as the Teucrium AiLA Long-Short Agriculture and Base Metals ETFs (OAIA and OAIB), which were liquidated and closed in 2024 due to poor performance and lack of assets ¹¹.

AI ETF Ticker	Strategy & Technology Engine	Live Performance Insight (vs. S&P 500 / Benchmarks)
AIEQ (Amplify AI Powered Equity)	IBM Watson; analyzes 10 years of historical data, news, and sentiment ¹².	Delivered ~7.9% average annual returns over a 3-year period ending in 2024, closely tracking but not definitively beating major indices ²⁴.
AMOM (QRAFT AI Momentum)	AI engine applied to momentum factor investing, heavily tech-weighted ¹².	Showed periods of strong outperformance (e.g., beating S&P 500 by 10% in a trailing 12-month 2024 window), but highly sensitive to tech sector volatility ¹²²⁴.
LQAI (LG QRAFT AI Core)	Rebalances every 4 weeks holding 100 U.S. large-cap stocks based on LG AI Research ¹².	Consistent performance adapting to different regimes, though lacks evidence of massive, sustained long-term alpha over passive alternatives ¹²²²²⁴.
OAIA / OAIB (Teucrium AiLA)	Long/short strategies in agriculture and base metals employing machine learning ¹¹.	Both funds were liquidated in 2024, highlighting that AI models cannot always survive real-world market dynamics ¹¹.

Note: Live performance data illustrates that while specific AI methodologies can be competitive, they are not immune to market drawdowns, and active management fees often close the gap between AI returns and passive benchmark returns ¹¹²²²⁴.
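How a fee gap compounds can be sketched with a simple growth model. The 8% gross return and the 0.03% and 0.75% expense ratios below are hypothetical inputs, not the actual fees of any fund named above, and the model simplifies by subtracting the fee from the annual return instead of accruing it daily against net asset value.

```python
def terminal_wealth(gross_annual, expense_ratio, years, start=10_000.0):
    """Grow `start` at a constant gross return minus an annual expense ratio.

    Simplification: the fee is subtracted once a year from the return rather than
    accrued daily against net asset value.
    """
    return start * (1.0 + gross_annual - expense_ratio) ** years


if __name__ == "__main__":
    # Hypothetical inputs: identical 8% gross returns, different fee levels.
    for years in (10, 20, 30):
        passive = terminal_wealth(0.08, 0.0003, years)
        active = terminal_wealth(0.08, 0.0075, years)
        print(f"{years} yrs  passive {passive:,.0f}  active {active:,.0f}  gap {passive - active:,.0f}")
```

Because both funds start from the same gross return here, the active fund needs roughly 0.72 percentage points of extra gross return every year just to tie.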

Retail Crypto Bots and Algorithmic Day Trading

The retail market is heavily saturated with algorithms promising automated, emotionless profits, particularly in the cryptocurrency space where markets operate 24/7. In these highly volatile environments, some AI-driven strategies - such as Dollar Cost Averaging (DCA) bots or grid trading bots - have shown absolute profitability. For instance, data from 2024 to 2026 showed that some DCA bots on retail platforms averaged 18.7% annualized returns across verified user accounts ¹¹.

However, absolute profitability is a flawed metric. The most important framing question for any investor is: profitable compared to what? ²⁵

From January 2024 to January 2026, a simple buy-and-hold strategy for Bitcoin returned over 200% with zero management fees, zero software subscription costs, and no active trading risks ²⁵. During this massive bull run, a significant portion of retail bots that achieved "positive returns" actually drastically underperformed the underlying asset they were trading ²²⁵. While grid bots occasionally outperform during sideways, range-bound markets (where they buy the dip and sell the rip), their underlying logic severely caps upside potential during massive, sustained bull runs ². An AI bot that beats sitting in cash but loses heavily to a simple buy-and-hold strategy imposes a steep opportunity cost on the investor ²⁵.
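The upside cap is easy to reproduce with a toy grid bot. The sketch below is a simplified, hypothetical model rather than any platform's algorithm: it starts holding 10 units bought at the first price, sells one unit each time price rises one grid step, buys one back each time price falls one step, and assumes fills exactly at the grid line with no fees or slippage.

```python
def grid_bot(prices, step, units):
    """Toy grid bot: start holding `units` bought at prices[0]; return final value.

    Sells one unit each time price rises one grid step above the last traded level
    and buys one back each time price falls one step (if cash allows).
    Fills are assumed at the grid line itself, with no fees or slippage.
    """
    cash = 0.0
    held = units
    level = prices[0]  # last grid line traded at
    for p in prices[1:]:
        while p >= level + step and held > 0:  # sell the rip
            level += step
            cash += level
            held -= 1
        while p <= level - step and cash >= level - step:  # buy the dip
            level -= step
            cash -= level
            held += 1
    return cash + held * prices[-1]


def buy_and_hold(prices, units):
    """Final value of simply keeping the starting units."""
    return units * prices[-1]


trend = list(range(100, 201))      # steady rally from 100 to 200
chop = [100, 105] * 10 + [100]     # range-bound back-and-forth
for name, path in (("trend", trend), ("chop", chop)):
    print(name, grid_bot(path, 5, 10), buy_and_hold(path, 10))
```

Starting from the same 1,000 of capital, the bot sells its last unit at 150 during the rally and finishes at 1,275 against 2,000 for buy-and-hold, while in the choppy range it collects the step on every round trip and ends at 1,050 against 1,000.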

Furthermore, statistical tracking indicates that 89% to 95% of retail day traders ultimately lose money. Out of the automated algo-traders who do manage to turn a profit, the vast majority still trail a basic buy-and-hold index strategy ¹¹⁴.

The "Toll Booth" Problem: Gross Returns vs. Net Reality

If institutional machine learning models can find genuine alpha, why do retail and active AI strategies fail to translate that theoretical edge into take-home wealth? The answer lies in the harsh mechanics of market microstructure and the devastating, compounding impact of transaction costs - often referred to as "turnover drag" or the "toll booth" effect.

The Invisible Tax of Algorithmic Trading

In financial markets, every transaction comes with a cost. These costs are categorized as explicit (brokerage commissions, exchange fees, taxes) and implicit (the bid-ask spread and price slippage) ¹⁴¹⁵.

Financial analysts often use the metaphor of a highway toll booth to explain this drag. Imagine a highway where a driver must pay a $1 toll every time they change lanes ²⁸. A passive buy-and-hold investor gets on the highway, stays in the middle lane for thirty years, and pays the toll exactly once. An AI algorithmic trader, attempting to capture micro-fluctuations in traffic speed, changes lanes hundreds of times a day. Even if the AI successfully finds a slightly faster lane, the cumulative, compounding cost of the toll booths completely wipes out the speed advantage ²⁸¹⁶.

This dynamic is perfectly illustrated in a landmark 2023 academic study by Azevedo, Hoegner, and Velikov, which analyzed the performance of various deep learning models, including Feedforward Neural Networks (FFNN) and Long Short-Term Memory (LSTM) models, across vast datasets ¹⁷.

The researchers found that highly sophisticated machine learning models could predict stock movements well enough to generate strong gross returns. However, those returns come from what the literature classifies as "high-turnover anomalies": the AI constantly buys and sells assets as new data flows in. The study found that these AI models exhibited a two-sided monthly turnover rate between 120% and 140% ¹⁷, meaning the strategies traded more than the full value of their portfolios every month.

Slippage and Market Impact

Even if a retail trader uses a "zero-fee" brokerage app, they cannot escape implicit costs. The bid-ask spread is the difference between the highest price a buyer is willing to pay and the lowest price a seller will accept ¹⁴²⁸. High-turnover AI strategies cross this spread continuously, starting every single trade at an immediate micro-deficit ²⁸.

Furthermore, there is the persistent issue of slippage and market impact. In a backtest, an AI algorithm assumes it can seamlessly buy 10,000 shares of a stock at exactly $50.00. In live trading, the very act of placing a large order consumes available market liquidity, driving the price up to $50.05 before the order is completely filled ¹⁴. This 5-cent difference seems trivial, but for an AI strategy designed to capture a 10-cent profit margin on a short-term momentum swing, half of the expected alpha has just evaporated into thin air ¹⁰³¹³².
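Market impact comes from walking up the order book: once the best ask is exhausted, the rest of the order fills at worse prices. The sketch below uses a hypothetical, static snapshot of ask levels (real books refill and move while an order executes) and a made-up function name.

```python
def average_fill_price(asks, quantity):
    """Average price paid by a market buy order that walks up the ask side.

    asks: list of (price, size) levels sorted from best (lowest) price upward.
    The book is a hypothetical static snapshot; real books move while you trade.
    """
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    remaining = quantity
    cost = 0.0
    for price, size in asks:
        take = min(size, remaining)  # consume as much of this level as needed
        cost += take * price
        remaining -= take
        if remaining == 0:
            return cost / quantity
    raise ValueError("not enough displayed liquidity to fill the order")


book = [(50.00, 4_000), (50.05, 4_000), (50.10, 5_000)]
avg = average_fill_price(book, 10_000)
print(f"average fill {avg:.4f}, slippage vs best ask {avg - book[0][0]:.4f} per share")
```

In this made-up book the 10,000-share order fills at an average of $50.04, so four of the ten cents a short-term strategy hoped to capture are gone before the position is even closed.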

When researchers attempt to mitigate these costs by forcing the AI to hold assets longer (reducing portfolio turnover), the gross returns of the AI drop so significantly that the strategy is no longer profitable ¹⁷. The AI is trapped in a paradox: it must trade rapidly to find an edge, but trading rapidly incurs structural costs that destroy the edge.

Research chart 2

Evaluating Net Returns of Machine Learning Models

When evaluating the actual performance of these models in academia, the devastating impact of trading costs becomes starkly clear. The table below details the performance of several architectures, from linear benchmarks (OLS-Huber, Elastic Net) to deep neural networks (FFNN, LSTM), tracking their net monthly excess returns and their two-sided monthly turnover rates as identified in the 2023 Azevedo study ¹⁷.

Machine Learning Architecture	Net Monthly Excess Return	Two-Sided Monthly Turnover	Profitability Assessment (Net of Costs)
LSTM 1 (Long Short-Term Memory - 1 layer)	1.42%	129.56%	Statistically significant; the clear winner among models tested ¹⁷.
LSTM 2	1.06%	129.07%	Profitable, but lower net returns than the single-layer model ¹⁷.
ENSEMBLE (Average of Deep Models)	0.83%	129.91%	Profitable; useful for smoothing out individual model variance ¹⁷.
FFNN 3 (Feedforward Neural Network)	0.75%	128.40%	Moderate profitability after costs ¹⁷.
ENET (Elastic Net)	0.64%	139.76%	Lowest return among profitable models; highest turnover drag ¹⁷.
OLS-HUBER (Traditional Regression)	0.29%	122.11%	Not statistically significant ¹⁷.

While advanced architectures like the LSTM 1 achieve impressive net positive returns in highly controlled academic environments, their nearly 130% monthly turnover means they are extraordinarily difficult to execute for retail investors without access to institutional-grade, low-latency infrastructure and heavily negotiated clearing fee structures ¹⁷.

Emerging Markets: The Next Frontier for AI Trading?

One of the foundational concepts in modern finance is the Efficient Market Hypothesis (EMH), which suggests that all available information is instantly priced into an asset ³³. If the U.S. stock market is highly efficient, then AI algorithms will struggle to find any sustained edge, eventually reverting to average market performance.

However, many quantitative analysts look toward the Adaptive Market Hypothesis (AMH). This theory applies evolutionary principles to markets: participants adapt to changing conditions, so inefficiencies can persist, particularly in less transparent environments where information diffuses slowly ³³. Because of this, institutional AI trading strategies are increasingly being deployed in Emerging Markets (EM) like Latin America, Eastern Europe, and developing parts of Asia ³⁴.

The Paradox of Emerging Market Alpha

In developed markets like the United States, massive liquidity and algorithmic saturation mean that any price discrepancy is arbitraged away in milliseconds. Emerging markets, by contrast, have less structured data coverage, fewer institutional players, and heavily restricted liquidity ³⁴. For a machine learning model capable of analyzing alternative data - like local social media sentiment, supply chain shipping logs, or satellite imagery of agricultural yields - these inefficiencies offer genuine opportunities for generating alpha ³³³. Some major investment banks, including JPMorgan, heavily favored emerging market equities heading into 2026 for their AI exposure, citing better fundamental valuations and high growth potential in semiconductor ecosystems across Taiwan and Korea ¹⁸¹⁹.

Yet, the exact conditions that create alpha in emerging markets also amplify the "toll booth" problem. Lower liquidity means wider bid-ask spreads and severe slippage ³¹. For example, while the spread on a highly liquid developed currency pair like the EUR-USD might be negligible (roughly 0.02%), the spread on an emerging market currency like the Polish Zloty can be five times higher ³⁷.

An AI algorithm may easily spot a pricing anomaly in an Indian mid-cap stock or a Brazilian commodity provider, but executing the trade without moving the market price - and overcoming the massive transaction costs inherent to less liquid exchanges - often negates the entirety of the profit ³¹³⁴. Consequently, quantitative models operating in emerging markets are forced to adapt by trading less frequently, relying on longer-term directional holds rather than high-frequency execution ³⁴.
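Why lower trading frequency becomes necessary can be sketched with spread costs alone. Assumptions, stated plainly: each round trip (buy at the ask, later sell at the bid) costs roughly one full quoted spread, the 8% gross annual edge is hypothetical, the two spreads mirror the 0.02% and five-times-wider figures above, and slippage, fees, and compounding are ignored.

```python
def annual_spread_drag(spread, round_trips_per_year):
    """Yearly cost of crossing the full quoted spread once per round trip.

    spread: bid-ask spread as a fraction of the mid price (0.0002 = 0.02%).
    Buying at the ask and later selling at the bid loses roughly one full spread.
    """
    return spread * round_trips_per_year


def net_edge(gross_annual_edge, spread, round_trips_per_year):
    """Gross annual edge left after spread costs (slippage and fees ignored)."""
    return gross_annual_edge - annual_spread_drag(spread, round_trips_per_year)


# Hypothetical inputs: 0.02% for a liquid major pair, five times that for a thinner
# market, and an 8% gross annual edge before costs.
for label, spread in (("liquid pair", 0.0002), ("thin market", 0.0010)):
    for trips in (250, 20):
        print(f"{label:12} {trips:4d} round trips/yr -> net edge {net_edge(0.08, spread, trips):+.2%}")
```

Trading daily, the same edge survives in the liquid market but turns sharply negative in the thin one; cutting to a few dozen round trips a year restores a positive net edge, which is the adaptation described above.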

AI Interacting with AI: The Learning Externality

Perhaps the most fascinating element of the future AI trading landscape is the concept of market equilibrium and algorithmic feedback loops. In the 1980s, standard algorithmic trading revolutionized finance. Today, as AI chatbots, sentiment scrapers, and neural networks become widely accessible to both institutions and retail traders, the market is approaching a saturation point where algorithms are primarily trading against other algorithms.

When an individual AI agent acts in isolation, it can learn efficiently from price signals, capitalize on human behavioral errors, and enhance market efficiency ²⁰. But financial markets are multi-agent environments. A 2025 Wharton research study explored what happens when thousands of deep reinforcement learning (DRL) algorithms are deployed simultaneously. The researchers found that dense algorithmic ecosystems generate "learning externalities" ²⁰.

A learning externality occurs when the noise generated by one algorithm's exploratory trading interferes with the learning process of another algorithm ²⁰. If AI Agent A executes a trade merely to test a market hypothesis, it subtly alters the price of the stock. AI Agent B detects this price movement, incorrectly interprets it as a genuine, fundamental market signal, and adjusts its own behavior accordingly.

When multiple AI agents interact and adapt jointly, they inject massive endogenous noise into the price process ²⁰. The algorithms begin chasing each other's shadows rather than responding to economic reality. In robust economic simulations, researchers have found that while a single AI trader can match a rational benchmark, the presence of many interacting AI traders significantly distorts return signals ²⁰. This endogenous feedback loop degrades the performance of the algorithms, reduces overall market liquidity, and ultimately lowers the profitability of the AI strategies ²⁰.

Interestingly, other studies by the Federal Reserve suggest that AI-powered agents make more "rational" decisions than humans, suppressing the herd behavior and "animal spirits" that traditionally cause massive asset bubbles ²¹. However, in a world where every market participant is armed with sophisticated machine learning tools, the competitive edge inherently vanishes. The market adapts, the structural alpha decays, and the baseline return of the market, which low-cost passive buy-and-hold investing captures almost in full, once again reigns supreme.

Bottom line

While artificial intelligence offers unparalleled data processing capabilities and rapid pattern recognition, the empirical evidence strongly suggests that retail AI trading tools, AI-managed ETFs, and large language models fail to consistently outperform a standard buy-and-hold strategy. Even when highly sophisticated institutional models identify genuine market inefficiencies, the resulting gross profits are severely diminished by real-world transaction costs, slippage, and high portfolio turnover. Until an investor can completely eliminate the "invisible tax" of market execution and guarantee an algorithm is not simply overfitting historical noise, passive long-term investing remains the statistically superior and far less volatile path to wealth accumulation.

About this research

This article was produced with AI-assisted research using mmresearch.app and reviewed by a human. (CuriousCrane_56)