How I Trained a Neural Network to Play Tetris Using Reinforcement Learning

By Tim Hanewich | December 2024


Over the Christmas/New Year holiday break of December 2024, I learned and applied Q-Learning, a form of reinforcement learning, to train a neural network to play a simplified game of Tetris. In this article I’ll detail how I did this. I hope it is helpful for anyone interested in applying reinforcement learning to a new domain!

As you can see in the GIF below, the model successfully “learned” how to play a simplified game of Tetris effectively and efficiently after training over ~6,000 games, or ~85,000 individual moves:

To demonstrate the process of Q-Learning on as simple a playing field as possible, I used a very simplified version of the game of Tetris. The “board” is 4 rows by 4 columns, a 4×4 matrix. The game is very simple: the AI agent must choose which of the 4 columns to “drop” a single box into, one by one, with the goal of eventually filling all 16 squares of the board.

Despite this game seeming extremely easy and obvious to play to you and me, there is actually a lot that can go wrong. If the AI is not capable of learning the nature of the game, it will repeatedly play random moves, eventually playing a move that is illegal.

For example, consider the position below:

If the model doesn’t understand the game and is essentially playing at random, there is a 25% chance it will choose to drop a square into column 4 (the column on the far right). This would be an illegal move, invalidating the game and preventing the AI from reaching the 100% score of 16!

My goal in training this model is for the model to play the game effectively and productively, avoiding illegal moves and reaching the top score of 16 in every game.
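To make the rules concrete, below is a minimal sketch of how this 4×4 game could be represented in Python. It is illustrative only: the score() and column_depths() methods mirror the ones the author’s tetris module exposes later in this article, while the internal board representation and the drop() method are my own assumptions, not his exact code.

    # Illustrative sketch of the simplified 4x4 game (assumed implementation, not the author's exact code)
    class GameState:
        def __init__(self):
            # board[row][col]: True = occupied, False = empty
            self.board = [[False] * 4 for _ in range(4)]

        def column_depth(self, col: int) -> int:
            """Number of empty squares remaining in a column."""
            return sum(1 for row in range(4) if not self.board[row][col])

        def column_depths(self) -> list:
            return [self.column_depth(col) for col in range(4)]

        def score(self) -> int:
            """Number of occupied squares; 16 is a perfect game."""
            return sum(cell for row in self.board for cell in row)

        def drop(self, col: int) -> None:
            """Drop a single square into a column, filling the lowest empty row; a full column is an illegal move."""
            for row in range(3, -1, -1):  # search from the bottom row upward
                if not self.board[row][col]:
                    self.board[row][col] = True
                    return
            raise ValueError(f"Illegal move: column {col} is already full")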

To train an AI to play this simplified game of Tetris, we will use Q-Learning. Q-Learning is a type of reinforcement learning algorithm in machine learning. The goal of Q-Learning is to find the optimal action-selection policy for any possible game state. But what does that mean?

Before we dive into the Q-Learning methodology, let’s first understand what a “Q-Table” is, as this is a core concept in Q-Learning.

You can think of a Q-Table as a long list (a table) that a system records and stores, mapping, for any given situation, the reward each possible next action would bring. Consider the table below as an example:
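The table in the original post is an image and isn’t reproduced here, so the stand-in below uses made-up values purely for illustration. It is laid out the way the next paragraph describes: a two-integer game state on the left, and the expected reward for dropping a square into each of the four columns on the right.

    State     Drop col 1   Drop col 2   Drop col 3   Drop col 4
    (2, 0)       0.1          0.3          0.2         -1.0
    (2, 1)       0.2          0.5          0.3         -1.0
    (3, 1)       0.4          0.1          0.2          0.3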

Using the Q-Table above, a system can “look up” the best next action for any state of the game (the state of the game is expressed as two integers here, a very simple representation) by looking at the values for each of the possible next actions for that state. For example, for state (2,1), dropping in column 2 would be the best next action because it would lead to an expected reward of 0.5, the highest of every possible move in this position.

By playing the game over and over on its own, eventually a system can play every possible move from every possible state of the game! And once it does this, it has a glossary of which moves are best to pick in any scenario in the game. The only problem is that, in the vast majority of games, the number of possible scenarios is far too high. In most games there are millions, if not billions or even trillions, of possible unique states. It would be impossible to store a table of such length. And to play through all of those examples? It would take forever!

Because a Q-Table would simply be far too big to feasibly accumulate and store, we call upon neural networks for help. Neural networks are much smaller in size and, through observation of the game being played and the rewards being collected, learn underlying patterns in the game. These patterns allow the neural network to understand and estimate the rewards associated with certain moves in certain positions, without needing to store a table; this is a lot like how we humans learn! Effectively, the model is learning to emulate a “Q-Table”.

When we set up a model to train via Q-Learning, we repeatedly allow the model to encounter, act, and observe through self-play. In other words, at each “step” of a game, the current situation of the game (which the model has to make a decision against) is called the state. The model sees the state and then decides what the next move should be; this move is called the action. After the chosen move (action) is executed against the game, the model observes what happened: did the move make the situation better or worse? This is called the reward.

The model plays the game on its own over and over and over again… many thousands of times! Eventually, the model has collected so many state, action, reward pairs (called “experiences”) that it can learn from them and develop a decent understanding of which moves in which states lead to the highest rewards (the most success).
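As a rough illustration of what one of these stored “experiences” can look like in code (the field names below are my own, not necessarily the ones used in the author’s repository):

    from dataclasses import dataclass

    @dataclass
    class Experience:
        state: list       # the 16-integer board representation before the move
        action: int       # which column (0-3) the square was dropped into
        reward: float     # the change in score caused by the move
        next_state: list  # the board representation after the move
        done: bool        # True if the game ended after this move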

A TensorFlow sequential neural network will be used to estimate the expected rewards for each possible move in a given position, as described above. I used the Keras API to make high-level manipulation of the neural network easier.

As mentioned above, every time the model is asked to decide what move to play next, it is presented with the current “state” of the game. This is a simple representation of the full situation of the game, including all criteria the model should consider when deciding what to do next.

In the case of this mini Tetris game, the “state” is quite simple: with a 4×4 board of 16 squares, that leaves 16 unique inputs. Each of the 16 inputs will be represented and “shown” to the model as a 1 or 0: if the particular square is occupied, that square’s position will be expressed as a 1; if it is empty, a 0.

For example, consider the following board state:

The board state in the image above can be represented as the following array of integers, with each square expressed as a 0 or 1 and each square’s position in the array labeled in the image above:

    [0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1]
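The conversion itself is simple. Below is a minimal sketch of what such a flattening function might look like, assuming the board representation from the GameState sketch above; the author’s actual conversion function is linked a little further down.

    def board_to_inputs(state: GameState) -> list:
        """Flatten the 4x4 board, row by row, into 16 integers: 1 = occupied, 0 = empty."""
        return [1 if cell else 0 for row in state.board for cell in row]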

So, our neural network will consider 16 inputs when evaluating a position and deciding what move to play next. What about outputs? In Q-Learning, the neural network is designed to predict the “Q-Value”, or estimated current/future rewards, for every possible move in any given state. Since the game we are playing has 4 possible moves (drop a square in column 1, 2, 3, or 4), our neural network will have 4 outputs.

Our neural network will also have several hidden layers, which are mathematical layers of neurons connected between the input and output layers. These hidden layers serve as “the brain” of the neural network, constantly adjusting through training to “learn” the nature of the game and the relationship between states, moves, and their associated rewards.

Below is the code used to construct the full model:

    # build layers
    input_board = keras.layers.Input(shape=(16,), name="input_board")
    carry = keras.layers.Dense(64, "relu", name="layer1")(input_board)
    carry = keras.layers.Dense(64, "relu", name="layer2")(carry)
    carry = keras.layers.Dense(32, "relu", name="layer3")(carry)
    output = keras.layers.Dense(4, "linear", name="output")(carry)

    # assemble the model (this snippet lives inside a class; it assumes the Keras API is imported as "keras")
    self.model = keras.Model(inputs=input_board, outputs=output)
    self.model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.003), loss="mse")
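Once built, using the model is just a matter of feeding it a flattened board and reading back four Q-Value estimates, one per column. A quick illustrative usage, assuming numpy and the board_to_inputs sketch from earlier (model here stands in for the self.model constructed above):

    import numpy as np

    inputs = np.array([board_to_inputs(game_state)])  # shape (1, 16)
    q_values = model.predict(inputs, verbose=0)[0]    # shape (4,): one estimate per column
    best_column = int(np.argmax(q_values))            # the move the model currently rates highest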

To see the code that constructs and trains the neural network, check it out on GitHub here. To see how a game state is converted to a flattened integer array representation, check out that conversion function here.

As mentioned above, the neural network is designed to approximate the Q-Value of each possible move from any given state of the game. “Q-Value” is just a fancy term for a combination of current and future rewards (i.e. “in the short term and the long term, how well will this move serve me?”). If the model is able to approximate the reward for all 4 possible moves from any state, it can simply select the move it thinks will return the highest reward as the suggested next optimal move.
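For reference, the standard Q-Learning target that captures this “current plus discounted future reward” idea can be written as a tiny helper like the one below; the discount factor GAMMA and the function name are my own choices for illustration, not necessarily the values used in the author’s training script.

    GAMMA = 0.9  # discount factor: how much future reward matters relative to immediate reward (assumed value)

    def q_target(reward: float, next_q_values: list, done: bool) -> float:
        """Bellman-style target: immediate reward plus the best discounted future reward."""
        if done:
            return reward  # no future moves remain to account for
        return reward + GAMMA * max(next_q_values)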

But how will the model intrinsically know which moves are “good”, which are “not so good”, and which are “bad”? That’s where we, as the humans designing this process, need to give the AI agent a bit of guidance. This guidance is called the reward function.

In short, the reward function is just a simple function we write to mathematically calculate how good any potential move is. Remember, for each move the AI makes, it observes what reward it got for making that move. We are the ones who define a high-level function that roughly calculates whether the move was good or bad.

The reward function I used for this mini Tetris AI is quite simple and can be found in the score_plus() function in the GameState class in the tetris module:

    def score_plus(self) -> float:

        # start at score
        ToReturn: float = float(self.score())

        # penalize for standard deviation of the column depths
        stdev: float = statistics.pstdev(self.column_depths())  # requires "import statistics" at the top of the module
        ToReturn = ToReturn - (stdev * 2)

        return ToReturn

First, I set up my system to determine the reward based simply on the difference between this “score_plus” after the move and before the move. In other words, the model observes the score_plus before the move is made, makes the move, and then observes the score_plus again, with the difference (increase) being the reward.

My reward function is quite simple. First, the score of the game is tallied up; this is just the number of squares of the board that are occupied. After this, I use a simple standard deviation function to calculate the deviation in “column depth”, or how many squares of each column are not occupied.

A greater standard deviation means the board is being built up in a very unbalanced way (i.e. one side is very tall while the other is not), which is not good for a game of Tetris. A very “level” board will instead equate to a small standard deviation. By subtracting the column-depth standard deviation from the total score, we penalize the model for building uneven, unbalanced boards, incentivizing the construction of balanced boards.
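As a quick worked example of the arithmetic (the positions here are made up for illustration): suppose that before a move the board has 5 squares filled and column depths of [2, 3, 4, 2], so pstdev ≈ 0.83 and score_plus ≈ 5 - 2 * 0.83 ≈ 3.34. If the move fills a sixth square in the deepest column, the depths become [2, 3, 3, 2], pstdev drops to 0.5, score_plus rises to 6 - 2 * 0.5 = 5.0, and the observed reward is roughly 5.0 - 3.34 ≈ 1.66.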

With our underlying model built and the reward function established, it’s now time to train the model! As mentioned previously, the model will be left to its own devices, playing the game over and over on its own. It starts with zero knowledge of how to play the game, just the ability to observe the game, make a decision, and see what reward it got for that decision.

By repeating this over and over through self-play and training on those outcomes, the neural network eventually forms the connection between the current state of the board, a potential decision that could be made, and the typical reward for such a decision. And once this understanding is cemented, playing a game as well as possible is easy; all we have to do is always pick the move (action) that the model anticipates will reap the highest reward!

More specifically, the following is the boiled-down training process; a condensed code sketch follows the list. The full training script can be found in train.py.

1. Initialize a new neural network.
2. Collect several hundred state, action, reward experiences through self-play, mixing the moves the model thinks are best with the occasional random move. For each move:
   • Convert the state of the current game to a list of integers.
   • Select a move to play: either the one the model thinks is best (which probably isn’t, because the model doesn’t know anything yet!) or a random move. A random move is occasionally chosen to encourage exploration; read more about exploration vs. exploitation here.
   • Execute (play) the move and observe the reward that was given for that move.
   • Store this state, action, reward “experience” in a collection.
3. Loop through all of the collected state, action, reward experiences, one by one, each time training on the experience (updating the neural network’s weights) to better approximate the correct reward given a state and an action. For each experience:
   • Calculate what the Q-Value should be (a blend of immediate and future rewards that represents the total reward of this decision).
   • Ask the model to predict what it thinks the reward (Q-Value) would be.
   • The model’s prediction will likely be incorrect, because it doesn’t know anything yet.
   • “Correct” the model by training it against the correct Q-Value calculated above.
   • Do this over and over for every experience.
4. Repeat the above steps continuously until the model learns how to play the game effectively and legally!
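To make the loop above concrete, here is a heavily condensed sketch of what such a training process can look like in Python. It reuses the GameState, board_to_inputs, Experience, and q_target sketches from earlier plus the score_plus() reward shown above, and it uses a simple epsilon-greedy rule to balance exploration and exploitation. The constants (EPSILON, the batch size, the illegal-move penalty) and the overall structure are my assumptions, not a copy of the author’s train.py.

    import random
    import numpy as np

    EPSILON = 0.2                 # chance of playing a random, exploratory move (assumed value)
    EXPERIENCES_PER_ROUND = 250   # how many experiences to collect before training (assumed value)

    def collect_experiences(model) -> list:
        """Play games with the current model and record (state, action, reward, next_state, done) experiences."""
        experiences = []
        game = GameState()
        while len(experiences) < EXPERIENCES_PER_ROUND:
            state = board_to_inputs(game)
            if random.random() < EPSILON:
                action = random.randint(0, 3)  # explore: random column
            else:
                q_values = model.predict(np.array([state]), verbose=0)[0]
                action = int(np.argmax(q_values))  # exploit: column the model currently rates highest
            before = game.score_plus()
            try:
                game.drop(action)
                reward = game.score_plus() - before  # reward = change in score_plus
                done = game.score() == 16
            except ValueError:
                reward = -10.0  # assumed penalty for an illegal move
                done = True
            experiences.append(Experience(state, action, reward, board_to_inputs(game), done))
            if done:
                game = GameState()  # start a fresh game
        return experiences

    def train_on_experiences(model, experiences: list) -> None:
        """Nudge the model's predicted Q-Value for each experienced move toward the computed target."""
        for exp in experiences:
            next_q = model.predict(np.array([exp.next_state]), verbose=0)[0]
            target_q = model.predict(np.array([exp.state]), verbose=0)[0]
            target_q[exp.action] = q_target(exp.reward, list(next_q), exp.done)
            model.fit(np.array([exp.state]), np.array([target_q]), epochs=1, verbose=0)

    # Repeat: collect a round of experiences, train on them, and keep going until play is perfect.
    # for _ in range(30):
    #     train_on_experiences(model, collect_experiences(model))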

Check out the full training script on GitHub here.

After setting up the training process described above in the train.py module, I let it run for ~4 hours. After training on 85,000 state, action, reward experiences over those 4 hours, my model successfully learned to play the game perfectly. The model plays the game perfectly from any state, whether a fresh game position (blank board) or even a “randomized” position. Every time it plays, it always scores 16 (a perfect score) and never makes an illegal move.

My model trained on 85,000 experiences (moves), but I don’t think that many were necessary. As seen in the training log file, optimal performance seemed to be achieved around the 4,500-experience (move) mark.

You can download my trained model from the Model Checkpoints section below and run it with the evaluate.py script.

All code for this project is open-source and available under the MIT license on GitHub here!


