Can a Machine Learning Model Really Predict the Market? (Part 1) | by Luigi Cheng

I’ve worked with data for years. Built pipelines, cleaned messy datasets, and helped teams make sense of numbers.

But the tough part? Almost everything I’ve done is private. I can’t show it on a portfolio or GitHub.

Now I’m at a point where I want to move forward in my career. That means showing what I can do.

I needed a project that reflects the skills I’ve built over time, but also lets me explore something new.

That’s where this series comes in.

I’ve always been curious about the stock market. I never had a reason to dig into it until now.

So I gave myself a challenge:

Can I build a machine learning model that predicts if the market will go up or down tomorrow?

Sounds simple. It’s not.

The hard part wasn’t the modeling. It was understanding how the market works.

What counts as a “good” prediction? What kind of accuracy matters in this domain?

In most ML projects, you want your model to hit 70 or 80 percent accuracy. That’s the benchmark.

But in trading, the bar is way lower. If your model hits even 51 percent consistently, that can be enough to make money.

That blew my mind. And it changed the way I thought about the entire pipeline.

Suddenly, the most important task wasn’t picking the best model. It was choosing the right features.

It was deciding how to label the data, how to handle noise, and when to trust the output.

The more I learned about financial markets, the more I realized that domain knowledge makes or breaks the project.

This blog series is about that journey. I’ll walk through everything step by step.

From raw data to prediction. From model accuracy to trading logic.

And it all starts with a basic question:

Can a machine really predict the market using just past prices?

This first project is all about building a baseline.

No fancy tricks. No insider data. Just historical prices from the S&P 500 index.

I wanted to see how far you can get by using common indicators and standard models.

The goal was simple:

Can we predict whether the market will close higher or lower tomorrow?

To answer that, I trained three types of models:

Logistic Regression
Random Forest
XGBoost

These models go from simple to more advanced.

Logistic regression gives you a good baseline. Random forest adds some power. XGBoost often performs best in competitions, so I wanted to see how it handles market data.

But this isn’t just a modeling exercise.

Before you can train anything useful, you need to shape the data the right way.

That means picking the right features, cleaning noisy data, and deciding how to label your outcomes.

Do we call it “up” if the market rises even 0.1%? What about flat days? These small choices matter.

In this first post, I walk through that whole process.

How I built the dataset, created features, handled class imbalance, and evaluated model performance.

The results might surprise you.

But the main takeaway is this: building financial models takes more than just code. You have to think like a trader, too.

To keep things clean and focused, I started with historical data from the S&P 500 index.

This includes the usual price data: open, high, low, close, and volume. No external indicators or news data, just raw price action.

I used this to build a simple binary classification task:

Can we predict if the market will close higher tomorrow?

To do that, I labeled each row in the dataset based on whether the next day’s closing price was higher than the current day’s.

# Label = 1 if the next day's close is higher than today's close
df["label"] = (df["Close"].shift(-1) > df["Close"]).astype(int)
# Drop the last row because it has no next-day close to compare against.
# dropna() would not catch it: comparing with NaN already produced a 0 label.
df = df.iloc[:-1].copy()

So if tomorrow’s close is higher, label = 1.

If it’s the same or lower, label = 0.

It’s a simple rule, but even this tiny choice shapes how your model behaves.

For example, most market days have small movements, up or down just a fraction of a percent.

So the line between “up” and “down” can be noisy.
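One way to see how much that noise matters is to label with a small dead zone around zero, so near-flat days are set aside instead of being forced into one class. The sketch below is only an illustration, not the labeling used in this series: the helper name `label_with_threshold`, the 0.1% `threshold`, and the prices are all made up for the example.

```python
import numpy as np
import pandas as pd


def label_with_threshold(close: pd.Series, threshold: float = 0.001) -> pd.Series:
    """Return 1.0 for up days, 0.0 for down days, NaN for near-flat or unknown days.

    `threshold` is the minimum absolute next-day return (0.001 = 0.1%)
    that counts as a real move. Hypothetical helper for illustration only.
    """
    next_return = close.shift(-1) / close - 1  # NaN on the last row
    label = pd.Series(np.nan, index=close.index)
    label[next_return > threshold] = 1.0
    label[next_return < -threshold] = 0.0
    return label


# Made-up closing prices, only to show the behavior
close = pd.Series([100.0, 100.05, 101.0, 100.0, 100.0])
labels = label_with_threshold(close)
print(labels.tolist())
```

Rows labeled NaN can then be dropped or handled separately. Whether that helps is an empirical question, and it shrinks the dataset.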

I checked the label distribution to see how balanced the classes were:

df["label"].value_counts(normalize=True)

This shows the ratio of up days to down days. In the S&P 500, up days are a bit more common.

You’ll often see something like 52% up and 48% down. It’s close, but not perfectly balanced.

Why does that matter? Because if your model just predicts “up” every day, it’ll still look like it’s right a little more than half the time.

That’s a trap. And it’s why class balance is something to watch out for, even when you think the setup is simple.
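To make the trap concrete, here is a small sketch of the “always predict the majority class” baseline. The helper name `majority_baseline_accuracy` is hypothetical, and the labels are made up to match the rough 52/48 split mentioned above. They are not real S&P 500 data.

```python
import pandas as pd


def majority_baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy you get by always predicting the most frequent class."""
    return float(labels.value_counts(normalize=True).max())


# Made-up labels: 52 up days and 48 down days
labels = pd.Series([1] * 52 + [0] * 48)
baseline = majority_baseline_accuracy(labels)
print(f"Always-up baseline accuracy: {baseline:.2f}")
```

Any model accuracy is worth reading against this number, not against 50%.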

In the next step, I added technical indicators to give the model more to work with.

Raw prices aren’t enough on their own.

Now that I had labels, the next step was giving the model something useful to learn from.

Models need patterns. And in trading, those patterns often come from technical indicators.

I used several well-known ones that many traders rely on. These aren’t magic, but they help capture momentum, trend shifts, and volatility.

I used the ta library in Python, which makes it easy to calculate these indicators.

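import ta

# Momentum: RSI with the library's default 14-day window
df["rsi"] = ta.momentum.RSIIndicator(close=df["Close"]).rsi()
# Trend: MACD difference (MACD line minus its signal line)
macd = ta.trend.MACD(close=df["Close"])
df["macd"] = macd.macd_diff()
# Short-term and long-term simple moving averages
df["ma_10"] = df["Close"].rolling(window=10).mean()
df["ma_50"] = df["Close"].rolling(window=50).mean()
# 1 when the short-term average is above the long-term one
df["ma_crossover"] = (df["ma_10"] > df["ma_50"]).astype(int)
# Day-over-day percentage change in the close
df["daily_return"] = df["Close"].pct_change()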

Relative Strength Index (RSI)

RSI shows whether the market is overbought or oversold.

It’s a number between 0 and 100. Above 70 usually means “overbought.” Below 30 means “oversold.”

MACD (Moving Average Convergence Divergence)

MACD measures the strength and direction of a trend.

I used the MACD difference (the MACD line minus its signal line), which highlights momentum shifts.

Moving Averages

I added two common moving averages:

10-day (short-term)
50-day (long-term)

To make that more useful, I created a crossover feature.

This checks if the short-term average is above the long-term one, which some traders use as a bullish signal.

Daily Return

I also calculated daily returns as a basic momentum signal.
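For readers curious what the RSI and MACD calls compute, here is a rough pandas-only sketch. It assumes the conventional 12/26/9 windows for MACD and a 14-day window for RSI. The helper names `macd_diff` and `rsi` are illustrative, and the values may differ slightly from the ta library’s output, especially in the warm-up rows.

```python
import numpy as np
import pandas as pd


def macd_diff(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9) -> pd.Series:
    """MACD difference: (fast EMA - slow EMA) minus the EMA of that line."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    return macd_line - signal_line


def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """RSI using Wilder-style exponential smoothing of gains and losses."""
    delta = close.diff()
    gains = delta.clip(lower=0)
    losses = (-delta).clip(lower=0)
    avg_gain = gains.ewm(alpha=1 / window, adjust=False, min_periods=window).mean()
    avg_loss = losses.ewm(alpha=1 / window, adjust=False, min_periods=window).mean()
    rs = avg_gain / avg_loss  # inf when there were no losses, which maps to RSI = 100
    return 100 - 100 / (1 + rs)


# Synthetic random-walk prices, used only to exercise the functions
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 300).cumsum())
features = pd.DataFrame({"rsi": rsi(close), "macd": macd_diff(close)})
print(features.tail())
```

Seeing the formulas makes one thing clear: every one of these features is a transformation of past closing prices, so they repackage the same information and add nothing from outside.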

Splitting the Data

I made sure to split the data in a time-aware way.

We’re trying to predict the future, so we don’t want to mix past and future when training and testing.

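feature_cols = ["rsi", "macd", "ma_10", "ma_50", "ma_crossover", "daily_return"]
# Drop warm-up rows where the indicators are still NaN (a no-op if already dropped)
df = df.dropna(subset=feature_cols)

X = df[feature_cols]
y = df["label"]

# Chronological 80/20 split: train on the earlier rows, test on the later ones
split_index = int(len(df) * 0.8)
X_train = X.iloc[:split_index]
y_train = y.iloc[:split_index]
X_test = X.iloc[split_index:]
y_test = y.iloc[split_index:]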
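As a rough sketch of how a chronological split like this feeds the baseline models, the example below fits a logistic regression and a random forest with scikit-learn and compares them to a majority-class baseline. Everything in it is an assumption for illustration: the data is random noise standing in for the real feature table, the column names `f1` to `f3` are placeholders, the hyperparameters are arbitrary, and XGBoost is left out to keep the example dependency-light.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature table (random data with no real signal)
rng = np.random.default_rng(42)
n_rows = 1000
X = pd.DataFrame(rng.normal(size=(n_rows, 3)), columns=["f1", "f2", "f3"])
y = pd.Series(rng.integers(0, 2, size=n_rows), name="label")

# Chronological 80/20 split: earlier rows train, later rows test
split_index = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

models = {
    # Scaling helps logistic regression converge; trees do not need it
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Reference point: always predict the majority class seen in training
majority_class = y_train.mode().iloc[0]
scores["majority_baseline"] = accuracy_score(y_test, np.full(len(y_test), majority_class))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

On pure noise, all three scores should hover around 50%. That makes this a useful sanity check: a pipeline that reports much more than that on random data is probably leaking future information.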
