😲 Quantifying Surprise – A Data Scientist’s Intro To Information Theory – Part 1/4: Foundations

During the telecommunication boom, Claude Shannon, in his seminal 1948 paper¹, posed a question that would revolutionise technology:

How can we quantify communication?

Shannon’s findings remain fundamental to expressing information quantification, storage, and communication. These insights made major contributions to the creation of technologies ranging from signal processing and data compression (e.g., Zip files and compact discs) to the Internet and artificial intelligence. More broadly, his work has significantly impacted diverse fields such as neurobiology, statistical physics and computer science (e.g., cybersecurity, cloud computing, and machine learning).

[Shannon’s paper is the] Magna Carta of the Information Age

This is the first article in a series that explores information quantification – an essential tool for data scientists. Its applications range from enhancing statistical analyses to serving as a go-to decision heuristic in cutting-edge machine learning algorithms.

Broadly speaking, quantifying information is assessing uncertainty, which may be phrased as: “how surprising is an outcome?”.

The idea for this article quickly grew into a series since I found the topic both fascinating and diverse. Most researchers, at one stage or another, come across commonly used metrics such as entropy, cross-entropy/KL-divergence and mutual information. Diving into this topic I found that in order to fully appreciate these, one needs to learn a bit about the basics, which we cover in this first article.

By reading this series you will gain an intuition and tools to quantify:

Bits/Nats – Unit measures of information.
Self-Information – The amount of information in a specific event.
Pointwise Mutual Information – The amount of information shared between two specific events.
Entropy – The average amount of information of a variable’s outcome.
Cross-entropy – The misalignment between two probability distributions (also expressed by its derivative KL-Divergence – a distance-like measure).
Mutual Information – The co-dependency of two variables by their conditional probability distributions. It expresses the information gain of one variable given another.

No prior knowledge is required – just a basic understanding of probabilities.

I demonstrate using common statistics such as coin and dice 🎲 tosses as well as machine learning applications such as supervised classification, feature selection, model monitoring and clustering assessment. As for real world applications, I’ll discuss a case study of quantifying DNA diversity 🧬. Finally, for fun, I also apply these tools to the popular brain twister commonly known as the Monty Hall problem 🚪🚪 🐐 .

Throughout I provide Python code 🐍 , and try to keep formulas as intuitive as possible. If you have access to an integrated development environment (IDE) 🖥 you might want to plug 🔌 and play 🕹 around with the numbers to gain a better intuition.

This series is divided into four articles, each exploring a key aspect of Information Theory:

😲 Quantifying Surprise: 👈 👈 👈 YOU ARE HERE In this opening article, you’ll learn how to quantify the “surprise” of an event using self-information and understand its units of measurement, such as bits and nats. Mastering self-information is essential for building intuition about the subsequent concepts, as all later heuristics are derived from it.
🤷 Quantifying Uncertainty: Building on self-information, this article shifts focus to the uncertainty – or “average surprise” – associated with a variable, known as entropy. We’ll dive into entropy’s wide-ranging applications, from Machine Learning and data analysis to solving fun puzzles, showcasing its adaptability.
📏 Quantifying Misalignment: Here, we’ll explore how to measure the distance between two probability distributions using entropy-based metrics like cross-entropy and KL-divergence. These measures are particularly valuable for tasks like comparing predicted versus true distributions, as in classification loss functions and other alignment-critical scenarios.
💸 Quantifying Gain: Expanding from single-variable measures, this article investigates the relationships between two. You’ll discover how to quantify the information gained about one variable (e.g., target Y) by knowing another (e.g., predictor X). Applications include assessing variable associations, feature selection, and evaluating clustering performance.

Each article is crafted to stand alone while offering cross-references for deeper exploration. Together, they provide a practical, data-driven introduction to information theory, tailored for data scientists, analysts and machine learning practitioners.

Disclaimer: Unless otherwise mentioned, the formulas analysed are for categorical variables with c≥2 classes (2 meaning binary). Continuous variables will be addressed in a separate article.

🚧 Articles (3) and (4) are currently under construction. I’ll share links once available. Follow me to be notified 🚧

Quantifying Surprise with Self-Information

Self-information is considered the building block of information quantification.

It is a way of quantifying the amount of “surprise” of a specific outcome.

Formally, self-information, also called Shannon Information or information content, quantifies the surprise of an event x occurring based on its probability, p(x). Here we denote it as hₓ:

hₓ = -log₂(p(x))

Self-information hₓ is the information of event x that occurs with probability p(x).

The units of measure are called bits. One bit (binary digit) is the amount of information for an event x that has a probability of p(x)=½. Let’s plug in to verify: hₓ=-log₂(½)=log₂(2)=1 bit.
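As a quick check of the definition, here is a minimal Python sketch. The function name `self_information` is my own choice for illustration and does not come from any library.

```python
import math

def self_information(p):
    """Self-information in bits of an event with probability p, for 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    # -log2(p) written as log2(1/p), mirroring -log2(1/2) = log2(2) in the text
    return math.log2(1 / p)

print(self_information(0.5))  # 1.0 -> a fair coin flip carries exactly one bit
```

Feeding in other probabilities is an easy way to get a feel for how quickly the value grows as events become rarer.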

This heuristic serves as an alternative to probabilities, odds and log-odds, with certain mathematical properties which are advantageous for information theory. We discuss these below when learning about Shannon’s axioms behind this choice.

It’s always informative to explore how an equation behaves with a graph:

Figure: Bernoulli trial self-information h(p). Key features: Monotonic, h(p=1)=0, h(p→0)→∞.

To deepen our understanding of self-information, we’ll use this graph to explore the axioms that justify its logarithmic formulation. Along the way, we’ll also build intuition about key features of this heuristic.

To emphasise the logarithmic nature of self-information, I’ve highlighted three points of interest on the graph:

At p=1 an event is guaranteed, yielding no surprise and hence zero bits of information. A useful analogy is a trick coin (where both sides show HEAD).
Decreasing the probability by a factor of two (p=½) increases the information to hₓ=1 bit. This, of course, is the case of a fair coin.
Further reducing it by a factor of four results in hₓ(p=⅛)=3 bits.

If you are interested in coding the graph, here is a Python script:
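The script itself did not survive in this copy of the article, so what follows is a minimal sketch rather than the author’s code. It assumes matplotlib is installed; the function names `curve` and `plot_self_information` are hypothetical and chosen for illustration only.

```python
import math

def self_information(p):
    """Self-information in bits for a probability 0 < p <= 1."""
    return math.log2(1 / p)

def curve(n_points=200):
    """Return probabilities on an even grid over (0, 1] and their self-information."""
    ps = [i / n_points for i in range(1, n_points + 1)]
    hs = [self_information(p) for p in ps]
    return ps, hs

def plot_self_information():
    """Draw h(p) and mark the three points of interest discussed above."""
    import matplotlib.pyplot as plt  # assumption: matplotlib is available

    ps, hs = curve()
    plt.plot(ps, hs)
    for p in (1, 0.5, 0.125):  # trick coin, fair coin, one side of an 8-sided die
        h = self_information(p)
        plt.scatter([p], [h])
        plt.annotate(f"p={p}, h={h:.0f}", (p, h))
    plt.xlabel("probability p")
    plt.ylabel("self-information h(p) [bits]")
    plt.title("Bernoulli trial self-information")
    plt.show()
```

Calling `plot_self_information()` opens the figure. The grid starts just above zero because h(p) is unbounded as p approaches 0.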

To summarise this section:

Self-Information hₓ=-log₂(p(x)) quantifies the amount of “surprise” of a specific outcome x.

Three Axioms

Referencing prior work by Ralph Hartley, Shannon chose -log₂(p) as a way to satisfy three axioms. We’ll use the equation and graph to examine how these are manifested:

An event with probability 100% is not surprising and hence does not yield any information. In the trick coin case this is evident by p(x)=1 yielding hₓ=0.
Less probable events are more surprising and provide more information. This is apparent from self-information decreasing monotonically with increasing probability.
The property of Additivity – the total self-information of two independent events equals the sum of the individual contributions. This will be explored further in the upcoming fourth article on Mutual Information.

There are mathematical proofs (which are beyond the scope of this series) that show that only the log function adheres to all three².
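The additivity axiom is easy to check numerically. The sketch below is illustrative only: it pairs a fair coin with a fair eight-sided die (my choice of example) and relies on independence, so the joint probability is the product of the two.

```python
import math

def self_information(p):
    """Self-information in bits for a probability 0 < p <= 1."""
    return math.log2(1 / p)

p_heads = 1 / 2              # fair coin lands heads
p_seven = 1 / 8              # fair eight-sided die shows 7
p_both = p_heads * p_seven   # independent events: probabilities multiply

h_sum = self_information(p_heads) + self_information(p_seven)
h_joint = self_information(p_both)

print(h_sum, h_joint)  # 4.0 4.0 -> probabilities multiply, information adds
```

This is the practical appeal of the logarithm: it turns products of probabilities into sums of bits.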

The application of these axioms reveals several intriguing and practical properties of self-information:

Important properties:

Minimum bound: The first axiom, hₓ(p=1)=0, together with the monotonic decrease, establishes that self-information is non-negative, with zero as its lower bound. This is highly practical for many applications.
Monotonically decreasing: The second axiom ensures that self-information decreases monotonically with increasing probability.
No Maximum bound: At the extreme where p→0, self-information grows without bound, hₓ(p→0) → ∞, a feature that requires careful consideration in some contexts. However, when averaging self-information – as we will later see in the calculation of entropy – probabilities act as weights, effectively limiting the contribution of highly improbable events to the overall average. This relationship will become clearer when we explore entropy in detail.
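To see how the probability weighting tames the unbounded growth, the following sketch tabulates h(p) next to the weighted term p·h(p) for a handful of shrinking probabilities. The probabilities are arbitrary values picked for illustration.

```python
import math

def self_information(p):
    """Self-information in bits for a probability 0 < p <= 1."""
    return math.log2(1 / p)

# (probability, self-information, probability-weighted contribution)
rows = [(p, self_information(p), p * self_information(p))
        for p in (0.5, 0.1, 0.01, 0.001, 1e-6)]

for p, h, weighted in rows:
    print(f"p={p:<8g} h={h:6.2f} bits   p*h={weighted:.6f} bits")
```

While h keeps climbing as p shrinks, the weighted term falls towards zero for these values, which is why very rare events add little to an average.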

It’s useful to understand the close relationship to log-odds. To do so we define p(x) as the probability of event x happening and p(¬x)=1-p(x) as the probability of it not happening. Then log-odds(x) = log₂(p(x)/p(¬x)) = h(¬x) – h(x).
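The log-odds identity can be verified numerically. In this sketch p=0.75 is an arbitrary example value and the helper names are mine.

```python
import math

def self_information(p):
    """Self-information in bits for a probability 0 < p <= 1."""
    return math.log2(1 / p)

def log_odds(p):
    """Base-2 log-odds of an event with probability 0 < p < 1."""
    return math.log2(p / (1 - p))

p = 0.75
lhs = log_odds(p)                                    # log2(0.75 / 0.25) = log2(3)
rhs = self_information(1 - p) - self_information(p)  # h(not x) - h(x)

print(f"{lhs:.4f} {rhs:.4f}")  # both print 1.5850
```

Positive log-odds therefore mean that the event not happening would be the more surprising outcome.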

The main takeaways from this section are

Axiom 1: An event with probability 100% is not surprising.

Axiom 2: Less probable events are more surprising and, when they occur, provide more information.

Self-information (1) monotonically decreases (2) with a minimum bound of zero and (3) no upper bound.

In the next two sections we further discuss units of measure and choice of normalisation.

Information Units of Measure

Bits or Shannons?

A bit, as mentioned, represents the amount of information associated with an event that has a 50% probability of occurring.

The unit is also sometimes referred to as a Shannon, a naming convention proposed by mathematician and physicist David MacKay to avoid confusion with the term ‘bit’ in the context of digital processing and storage.

After some deliberation, I decided to use ‘bit’ throughout this series for several reasons:

This series focuses on quantifying information, not on digital processing or storage, so ambiguity is minimal.
Shannon himself, encouraged by mathematician and statistician John Tukey, used the term ‘bit’ in his landmark paper.
‘Bit’ is the standard term in much of the literature on information theory.
For convenience – it’s more concise.

Normalisation: Log Base 2 vs. Natural

Throughout this series we use base 2 for logarithms, reflecting the intuitive notion of a 50% chance of an event as a fundamental unit of information.

An alternative commonly used in machine learning is the natural logarithm, which introduces a different unit of measure called nats (short for natural units of information). One nat corresponds to the information gained from an event occurring with a probability of 1/e where e is Euler’s number (≈2.71828). In other words, 1 nat = -ln(p=(1/e)).

The relationship between bits (base 2) and nats (natural log) is as follows:

1 bit = ln(2) nats ≈ 0.693 nats.

Think of it as similar to a currency exchange or converting centimeters to inches.
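The exchange rate can be written down directly. This is a small sketch; the constant and helper names are illustrative and not taken from a library.

```python
import math

NATS_PER_BIT = math.log(2)  # 1 bit = ln(2) nats, roughly 0.693

def bits_to_nats(bits):
    return bits * NATS_PER_BIT

def nats_to_bits(nats):
    return nats / NATS_PER_BIT

p = 1 / math.e          # the event that defines one nat
h_nats = -math.log(p)   # natural log gives nats
h_bits = -math.log2(p)  # base 2 gives bits

print(f"{h_nats:.3f} nats = {h_bits:.3f} bits")  # 1.000 nats = 1.443 bits
print(f"1 bit = {bits_to_nats(1):.3f} nats")     # 1 bit = 0.693 nats
```

Only the scale changes between the two units, so any ranking of events by surprise is the same in either.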

In his seminal publication Shannon explained that the optimal choice of base depends on the specific system being analysed (paraphrased slightly from his original work):

“A device with two stable positions […] can store one bit of information” (bit as in binary digit).
“A digit wheel on a desk computing machine that has ten stable positions […] has a storage capacity of one decimal digit.”³
“In analytical work where integration and differentiation are involved the base e is sometimes useful. The resulting units of information will be called natural units.”

Key aspects of machine learning, such as popular loss functions, often rely on integrals and derivatives. The natural logarithm is a practical choice in these contexts because it can be differentiated and integrated without introducing additional constants. This likely explains why the machine learning community frequently uses nats as the unit of information – it simplifies the mathematics by avoiding the need to account for factors like ln(2).
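To make the “additional constants” concrete, compare the derivative of self-information with respect to p in nats and in bits. This is a standard calculus identity, shown here only for illustration:

```latex
\frac{d}{dp}\bigl[-\ln p\bigr] = -\frac{1}{p},
\qquad
\frac{d}{dp}\bigl[-\log_2 p\bigr] = -\frac{1}{p \,\ln 2}
```

The base-2 version carries an extra factor of 1/ln(2) through every gradient, which is the bookkeeping that working in nats avoids.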

As shown earlier, I personally find base 2 more intuitive for interpretation. In cases where normalisation to another base is more convenient, I will make an effort to explain the reasoning behind the choice.

To summarise this section on units of measure:

bit = amount of information to distinguish between two equally likely outcomes.

Now that we are familiar with self-information and its unit of measure, let’s examine a few use cases.

Quantifying Event Information with Coins and Dice

In this section, we’ll explore examples to help internalise the self-information axioms and key features demonstrated in the graph. Gaining a solid understanding of self-information is essential for grasping its derivatives, such as entropy, cross-entropy (or KL divergence), and mutual information – all of which are averages over self-information.

The examples are designed to be simple, approachable, and lighthearted, accompanied by practical Python code to help you experiment and build intuition.

Note: If you feel comfortable with self-information, feel free to skip these examples and go straight to the Quantifying Uncertainty article.

Image generated using Gemini.

To further explore self-information and bits, I find analogies like coin flips and dice rolls particularly effective, as they are often useful analogies for real-world phenomena. Formally, these can be described as multinomial trials with n=1 trial. Specifically:

A coin flip is a Bernoulli trial, where there are c=2 possible outcomes (e.g., heads or tails).
Rolling a die represents a categorical trial, where c≥3 outcomes are possible (e.g., rolling a six-sided or eight-sided die).

As a use case we’ll use simplistic weather reports limited to featuring sun 🌞 , rain 🌧 , and snow ⛄️.

Now, let’s flip some virtual coins 👍 and roll some funky-looking dice 🎲 …

Fair Coins and Dice

We’ll start with the simplest case of a fair coin (i.e., 50% chance for success/Heads or failure/Tails).

Imagine an area for which on any given day there is a 50:50 chance of sun or rain. We can write the probability of each event as: p(🌞 )=p(🌧 )=½.

As seen above, according to the self-information formulation, when 🌞 or 🌧 is reported we are provided with h(🌞 )=h(🌧 )=-log₂(½)=1 bit of information.

We will continue to build on this analogy later on, but for now let’s turn to a variable that has more than two outcomes (c≥3).

Before we address the standard six-sided die, to simplify the maths and intuition, let’s assume an 8-sided one (c=8) as in Dungeons & Dragons and other tabletop games. In this case each event (i.e., landing on each side) has a probability of p(🔲 ) = ⅛.

When a die lands with one side facing up, e.g., value 7️⃣, we are provided with h(🔲 =7️⃣)=-log₂(⅛)=3 bits of information.

For a standard six-sided fair die: p(🔲 ) = ⅙ → an event yields h(🔲 )=-log₂(⅙)≈2.58 bits.

Comparing the amount of information from the fair coin (1 bit), 6-sided die (2.58 bits) and 8-sided die (3 bits) we identify the second axiom: the less probable an event is, the more surprising it is and the more information it yields.
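For fair objects the calculation collapses to log₂(c), since every one of the c outcomes has probability 1/c. A short sketch, with a helper name of my own choosing:

```python
import math

def fair_outcome_information(c):
    """Bits conveyed by one outcome of a fair c-sided object: -log2(1/c) = log2(c)."""
    return math.log2(c)

for name, c in [("coin", 2), ("six-sided die", 6), ("eight-sided die", 8)]:
    print(f"{name}: {fair_outcome_information(c):.2f} bits")
# coin: 1.00 bits
# six-sided die: 2.58 bits
# eight-sided die: 3.00 bits
```

Doubling the number of equally likely outcomes always adds exactly one bit, which is why the eight-sided die gives such a tidy number.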

Self-information becomes even more interesting when probabilities are skewed to favour certain events.

Loaded Coins and Dice

Let’s assume a region where p(🌞 ) = ¾ and p(🌧 )= ¼.

When rain is reported the amount of information conveyed is not 1 bit but rather h(🌧 )=-log₂(¼)=2 bits.

When sun is reported less information is conveyed: h(🌞 )=-log₂(¾)≈0.42 bits.

As per the second axiom, a rarer event, like p(🌧 )=¼, reveals more information than a more likely one, like p(🌞 )=¾ – and vice versa.

To further drive this point home, let’s now assume a desert region where p(🌞 ) =99% and p(🌧 )= 1%.

If sunshine is reported – that’s kind of expected – so nothing much is learnt (“nothing new under the sun” 🥁) and this is quantified as h(🌞 )=0.01 bits. If rain is reported, however, you can imagine being quite surprised. This is quantified as h(🌧 )=6.64 bits.

In the following Python scripts you can examine all the above examples, and I encourage you to play with your own to get a feeling.

First let’s define the calculation and printout function:

import numpy as np

def print_events_self_information(probs):
    """Print the self-information (in bits) of every event in each distribution."""
    for ps in probs:
        print(f"Given distribution {ps}")
        for event in ps:
            if ps[event] != 0:
                self_information = -np.log2(ps[event])  # same as: -np.log(ps[event]) / np.log(2)
                text_ = f'When `{event}` occurs {self_information:0.2f} bits of information is communicated'
                print(text_)
            else:
                # a zero-probability event never occurs, so there is nothing to report
                print(f'a `{event}` event cannot happen p=0 ')
        print("=" * 20)

Next we’ll set a few example distributions of weather frequencies

# Setting multiple probability distributions (each sums to 100%)
# Fun fact - 🐍  💚  Emojis!
probs = [{'🌞   ': 0.5, '🌧   ': 0.5},   # half-half
         {'🌞   ': 0.75, '🌧   ': 0.25},  # more sun than rain
         {'🌞   ': 0.99, '🌧   ': 0.01},  # mostly sunshine
]

print_events_self_information(probs)

This yields the printout

Given distribution {'🌞   ': 0.5, '🌧   ': 0.5}
When `🌞   ` occurs 1.00 bits of information is communicated
When `🌧   ` occurs 1.00 bits of information is communicated
====================
Given distribution {'🌞   ': 0.75, '🌧   ': 0.25}
When `🌞   ` occurs 0.42 bits of information is communicated
When `🌧   ` occurs 2.00 bits of information is communicated
====================
Given distribution {'🌞   ': 0.99, '🌧   ': 0.01}
When `🌞   ` occurs 0.01 bits of information is communicated
When `🌧   ` occurs 6.64 bits of information is communicated
====================

Let’s examine the case of a loaded three-sided die, e.g., weather information for an area that reports sun, rain and snow at uneven probabilities: p(🌞 ) = 0.2, p(🌧 )=0.7, p(⛄️)=0.1.

Running the following

print_events_self_information([{'🌞 ': 0.2, '🌧 ': 0.7, '⛄️': 0.1}])

yields

Given distribution {'🌞 ': 0.2, '🌧 ': 0.7, '⛄️': 0.1}
When `🌞 ` occurs 2.32 bits of information is communicated
When `🌧 ` occurs 0.51 bits of information is communicated
When `⛄️` occurs 3.32 bits of information is communicated
====================

What we saw for the binary case applies to higher dimensions.

To summarise – we clearly see the implications of the second axiom:

When a highly expected event occurs – we don’t learn much, the bit count is low.
When an unexpected event occurs – we learn a lot, the bit count is high.

Event Information Summary

In this article we embarked on a journey into the foundational concepts of information theory, defining how to measure the surprise of an event. The notions introduced serve as the bedrock of many tools in information theory, from assessing data distributions to unraveling the inner workings of machine learning algorithms.

Through simple yet insightful examples like coin flips and dice rolls, we explored how self-information quantifies the unpredictability of specific outcomes. Expressed in bits, this measure encapsulates Shannon’s second axiom: rarer events convey more information.

While we’ve focused on the information content of specific events, this naturally leads to a broader question: what is the average amount of information associated with all possible outcomes of a variable?

In the next article, Quantifying Uncertainty, we build on the foundation of self-information and bits to explore entropy – the measure of average uncertainty. Far from being just a beautiful theoretical construct, it has practical applications in data analysis and machine learning, powering tasks like decision tree optimisation, estimating diversity and more.

Claude Shannon. Credit: Wikipedia


About This Series

Even though I have twenty years of experience in data analysis and predictive modelling, I always felt quite uneasy about using concepts in information theory without truly understanding them.

The purpose of this series was to put me more at ease with concepts of information theory and hopefully provide for others the explanations I needed.

🤷 Quantifying Uncertainty – A Data Scientist’s Intro To Information Theory – Part 2/4: Entropy. Gain intuition into Entropy and master its applications in Machine Learning and Data Analysis. Python code included. 🐍 (medium.com)


Footnotes

¹ A Mathematical Theory of Communication, Claude E. Shannon, Bell System Technical Journal 1948.

It was later republished as a book, The Mathematical Theory of Communication, in 1949.

[Shannon’s “A Mathematical Theory of Communication” is] the blueprint for the digital era – Historian James Gleick

² See the Wikipedia page on Information Content (i.e., self-information) for a detailed derivation showing that only the log function meets all three axioms.

³ The decimal digit was later renamed to a hartley (symbol Hart), a ban or a dit. See the Hartley (unit) Wikipedia page.

Credits

Unless otherwise noted, all images were created by the author.

Many thanks to Will Reynolds and Pascal Bugnion for their useful comments.
