I Tried Making my Own (Bad) LLM Benchmark to Cheat in Escape Rooms

Recently, DeepSeek announced their latest model, R1, and article after article came out praising its performance relative to cost, and how the release of such open-source models could genuinely change the course of LLMs forever. That’s really exciting! And also, too big of a scope to write about… but when a model like DeepSeek comes out of nowhere with a steel chair, boasting similar performance levels to other models, what does performance really mean in this context?

If you follow AI releases, you’ve seen this dance before. Every new model drops with its graphs showing how it’s somehow simultaneously better than GPT-4 on math problems while being smaller and more efficient. But what exactly are these benchmarks measuring? How are they created? And more importantly, how can we cut through the hype to create our own benchmarks for specific use cases?

I wanted to learn more about LLM benchmarking.

Part 1: What’s a Benchmark? (in 3 seconds)

TL;DR: The SATs (multiple, actually) for LLMs.

Part 1.1: What’s a Benchmark? (in more than 3 seconds)

Before we dive into the nitty-gritty of specific benchmarks, let’s take a moment to unpack what we even mean by “LLM Benchmark.” Because calling them the “SATs for AI” feels both right and also slightly oversimplified.

LLM benchmarks are, at their core, structured tests used to measure how well large language models perform on certain tasks. These tasks can be anything from identifying if a statement is true or false, to summarizing a legal document, to generating valid Python functions. Think of them as curated obstacle courses specially designed by AI researchers to test every relevant muscle these models might have. These frameworks typically provide a dataset of inputs with known correct outputs, allowing for consistent comparison between models.

Modern benchmarks employ various evaluation methodologies. Classification metrics like accuracy work for tasks with discrete correct answers, while overlap-based metrics (BLEU, ROUGE) evaluate free-form text generation. Some benchmarks use functional testing for code generation, or employ other LLMs as judges to evaluate response quality.
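To make the difference between those metric families concrete, here is a minimal sketch in plain Python. The function names are my own, and the overlap score is a simplified unigram F1 in the spirit of ROUGE-1, not the official BLEU or ROUGE implementation.

```python
from collections import Counter


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after light normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)


def unigram_f1(prediction, reference):
    """Token-overlap F1 between two strings (a simplified, ROUGE-1-like score)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Discrete answers: accuracy is the natural fit.
    print(exact_match_accuracy(["True", "false", "true"], ["true", "true", "true"]))
    # Free-form text: partial credit for overlapping words.
    print(unigram_f1("pour water into the tube", "pour water into tube"))
```

Accuracy gives no partial credit, which is fine for true/false questions and useless for summaries; the overlap score does the opposite, rewarding shared words even when the meaning is off.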

A typical benchmark usually comes packaged as:

A standardized dataset of questions, prompts, or tasks (with correct or reference answers).
An evaluation protocol specifying how to measure success, like accuracy, F1 score, BLEU/ROUGE for text generation, or pass/fail rates for coding tasks.
A leaderboard or some form of comparative scoreboard, often with big flashy graphs.

Some really well-known benchmarks include MMLU for testing multitask language understanding, TruthfulQA for assessing factual accuracy, and HumanEval for measuring coding capabilities. Results are quite often published on public leaderboards, which lets people perform some transparent comparison between different models.

What Makes a Good Benchmark?

A Clear Task Definition: We want tasks that are unambiguous. The more straightforward and well-specified the challenge, the easier it is to trust the results.
Data Integrity: The test set shouldn’t be floating around in the training data. Because if the model’s seen the exact same question 50 times before, the evaluation is about as useful as giving a math quiz to someone who already has the answer key.
Quantifiable Metrics: You need a standard for scoring performance, like how many times the model’s code passes test cases or how close the generated summary is to a “ground-truth” summary.
Task Diversity & Difficulty: If a benchmark is too easy, everyone just ACES it on day one, and we learn… well, nothing. If it’s too niche (like “We test only the model’s ability to count the digits of Pi for 20 minutes”), that’s also not so helpful.
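The data integrity point can be roughly checked in code. The sketch below is one simple heuristic, assuming you have some sample of training-like text to compare against: it measures what share of a test item’s word n-grams also show up in that text. The function names and the default of 5-word n-grams are my own choices, not a standard.

```python
def ngrams(text, n=5):
    """Return the set of word n-grams in a piece of text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(test_item, corpus_docs, n=5):
    """Share of the test item's n-grams that also appear somewhere in the corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        # Item shorter than n words: nothing to compare.
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


if __name__ == "__main__":
    question = "a ping pong ball is stuck in a narrow tube"
    leaked = ["yesterday a ping pong ball is stuck in a narrow tube again"]
    clean = ["the suction cup sticks to the clean dry surface of the can"]
    print(contamination_ratio(question, leaked))
    print(contamination_ratio(question, clean))
```

A ratio near 1 suggests the item appears almost verbatim in the comparison text; a ratio near 0 only says that this particular sample does not contain it, which is much weaker evidence.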

Life Ain’t All about The Grades

Benchmarks capture only a slice of what LLMs can do. In the real world, your chatbot might need to juggle domain knowledge, keep track of conversation context, abide by your company’s policies, and produce fluent, non-offensive replies. No single standardized test out there fully covers that. As we’ll see in the upcoming case studies, the design and execution of a benchmark can heavily shape the picture you get of your model’s performance… and sometimes lead you astray if you’re not careful with how you measure success.

Now that we have a sense of what LLM benchmarks are designed to accomplish (and where they might fall short), let’s explore a couple of examples to see how people actually build and use them in practice, with mixed results!

Case Study #1: Leetcode as an LLM Benchmark

As a student in the tech space, the word “Leetcode” popping up during my search for cool benchmarks raised my blood pressure by a statistically significant amount. Unlike Leetcode, which sucks, the paper “A Performance Study of LLM-Generated Code on Leetcode” was very interesting. It asks a deceptively simple question: can we use Leetcode to benchmark LLM code generation? Their findings reveal both the promise and pitfalls of this approach.

The Benchmark Design

The researchers built a three-stage validation system. Local tests catch basic errors, Leetcode’s judge verifies correctness, and a custom benchmarking setup measures performance. This setup revealed something critical: benchmarking code performance is harder than it looks.

When they compared local measurements to Leetcode’s metrics, they found only a 0.28 correlation. Leetcode’s measurements showed much higher variation (0.089 vs 0.035 locally). Even worse, Leetcode’s rankings proved unstable: identical solutions could drop from the 77th to the 54th percentile just based on submission timing.
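For anyone who wants to run the same kind of sanity check on their own timing data, here is a small sketch of a Pearson correlation in plain Python. The runtimes in the example are made-up placeholders, not numbers from the paper, and I am assuming Pearson as the measure; the paper may define its correlation differently.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical runtimes (ms) for the same five solutions, measured two ways.
    local_ms = [12.0, 15.5, 9.8, 20.1, 14.2]
    platform_ms = [40.0, 38.0, 45.0, 41.0, 39.0]
    print(round(pearson(local_ms, platform_ms), 2))
```

A value close to 1 would mean the platform ranks solutions the same way your local measurements do; a value near 0.28 means the two mostly disagree about which solutions are fast.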

Source: “A Performance Study of LLM-Generated Code on Leetcode,” in 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024), Salerno, Italy (2024)

The Real Problems

Three major issues emerged that challenge Leetcode’s viability as a benchmark:

Data Contamination: Using public problems risks LLMs having seen the solutions during training. The researchers had to use only problems from 2023 to mitigate this.

Platform Instability: Leetcode’s metrics drift over time. Memory measurements showed a -0.24 correlation with test date, which makes reproducible benchmarking nearly impossible.

Measurement Reliability: The weak correlation between local and platform measurements raises questions about what we’re actually testing.

What It Means for LLM Benchmarking

This study doesn’t just critique Leetcode. It highlights what we need in a code generation benchmark: reproducible measurements, reliable performance metrics, and guaranteed training-test separation. Until we have platforms built specifically for this purpose, we need to be extremely cautious about using competition platforms as benchmarks.

So! We know that not all benchmarks are viable benchmarks. What about a more mainstream one?

Case Study #2: SuperGLUE, Building a Better Language Understanding Benchmark

The SuperGLUE paper tackles a fascinating problem in AI benchmarking: what do you do when models get too good at your tests? When GLUE became insufficient (with models surpassing human performance), the researchers had to rethink how we measure language understanding.

The Benchmark Design

SuperGLUE’s core innovation is its task selection methodology. The researchers collected task proposals from the NLP community and filtered them through a rigorous process: each task needed clear evaluation metrics, public training data, and, most importantly, significant headroom between machine and human performance.

This resulted in eight tasks (I’ve simplified the table from the paper here; it’s a little less readable, but you should get the sense of what the questions are asking):

Source: “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems,” in 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada (2019)

What makes these tasks special is their diversity in format. Unlike GLUE’s focus on sentence classification, SuperGLUE includes coreference resolution, reading comprehension, and more complex reasoning tasks. Each task measures different aspects of language understanding while maintaining clear, quantifiable metrics.

Part 2: Let’s Build a Physical Reasoning Benchmark: To Cheat at Escape Rooms

After looking at some benchmarks like SuperGLUE and Leetcode, I had an idea: what if we tested LLMs on something completely different, like physical reasoning… through escape room puzzles?

It’s a pretty valid idea. Escape rooms pose possibilities and consequences for failure: screw up one too many puzzles, and your friends will think you’re pretty stupid and relegate you to spectator duty. Luckily for us, however, they (or the poor staff) don’t know that you can sneak a phone into an escape room, and you know just who to ask for the answers. Today, LLMs face off against the puzzles of a physical escape room.

Note: This is NOT a rigorous academic benchmark (please don’t cite this in papers, why would you even want to do that?), or even close to it. It’s just supposed to be a fun way to test LLM benchmarking and evaluation. Please don’t destroy my prompts, I’m aware they’re bad.

Why Physical Reasoning?

For real, though… most LLM benchmarks focus on linguistic tasks (like SuperGLUE) or code generation (like Leetcode). And for good reason: these are well-defined domains with clear evaluation metrics. But real-world problem solving often requires understanding physical principles and their interactions. The famous “Can GPT-4 do physics?” debates usually center on mathematical problem-solving, not practical physical reasoning.

Looking at existing benchmarks taught me a few key principles:

Clear evaluation metrics are crucial (from SuperGLUE’s task-specific scores)
Problems should have unambiguous solutions (from HumanEval’s test cases)
The benchmark should test distinct capabilities (from MMLU’s subject categories)

Designing the Problems

I settled on escape room puzzles for two reasons. First, they naturally combine physical reasoning with clear goals. Second, they have unambiguous success conditions: either you solve it the intended way, or you don’t. Third, and most importantly, they let me include “red herrings,” irrelevant items that test if the LLM can identify what matters physically. Fourth, I just really like doing escape rooms (did I mention that already?).

I’m aware that this is more than two reasons, but if LLMs can’t count how many r’s there are in strawberry, I’m allowed to mess up once in a while too.

Here’s how I structured the five core problems:

Fluid Dynamics (FLUID_001) (Ping pong ball stuck in a tube)

Tests understanding of buoyancy and fluid displacement
Inspired by classic physics problems but in a practical context
Includes intentionally irrelevant items (like squishy food models)

Light Properties (UV_001) (UV light on a push number lock)

Tests understanding of UV fluorescence and material properties
Combines multiple physical principles (light, materials science)
Requires understanding of environmental conditions

Mechanical Understanding (CIPHER_001) (A cipher ring)

Tests spatial reasoning and mechanical alignment
No red herrings: tests for correlating a dial to a cipher wheel
Requires understanding rotational symmetry

Force Application (VAC_001) (Can stuck in hole)

Tests understanding of vacuum forces and surface adhesion
Multiple possible solution approaches
Requires understanding force multiplication

Collaborative Physics (COLLAB_001) (Can two people shimmy a key?)

Tests understanding of physical constraints in multi-agent scenarios
Requires combining multiple physical principles
Tests understanding of tool creation and friction

Sounds really fancy… but it’s just some basic physical puzzles. You can access them on my GitHub.

The Technical Part

The benchmark implementation has three main components:

1. Problem Definition Layer

Problems are defined in a structured JSON format that enforces consistent evaluation:

{
    "problem_id": "FLUID_001",
    "setup": {
        "situation": "A ping pong ball is at the bottom of a narrow tube...",
        "available_items": ["bottle of water", "squishy food models"...],
        "constraints": ["tube too narrow for manual retrieval"]
    },
    "physical_principles": ["buoyancy", "fluid displacement"],
    "red_herrings": ["squishy food models", "milk carton"],
    "answer": {
        "steps": ["pour water into tube", "allow ball to float"],
        "key_insights": ["water displaces air", "ping pong ball less dense"]
    }
}

This structure draws from SuperGLUE’s design: each component is clearly separated and machine-readable. The physical_principles field explicitly lists what’s being tested, while red_herrings helps in scoring the LLM’s ability to ignore irrelevant information.
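Because the format is machine-readable, it is cheap to sanity-check a problem file before spending API credits on it. The loader below is a sketch of my own, not the code from the repository; in particular, the rule that every red herring must also be listed in available_items is an assumption I am adding for illustration.

```python
import json

# Top-level keys taken from the problem format shown above.
REQUIRED_KEYS = {"problem_id", "setup", "physical_principles", "red_herrings", "answer"}


def load_problem(raw_json):
    """Parse one problem definition and check its basic shape."""
    problem = json.loads(raw_json)
    if not isinstance(problem, dict):
        raise ValueError("a problem definition must be a JSON object")
    missing = REQUIRED_KEYS - set(problem)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # Assumed rule: a red herring only makes sense if the player can pick it up.
    available = set(problem["setup"].get("available_items", []))
    stray = [item for item in problem["red_herrings"] if item not in available]
    if stray:
        raise ValueError(f"red herrings not in available_items: {stray}")
    return problem


if __name__ == "__main__":
    raw = json.dumps({
        "problem_id": "DEMO_001",  # hypothetical ID, not one of the five puzzles
        "setup": {"available_items": ["ruler", "suction cup"], "constraints": []},
        "physical_principles": ["vacuum suction"],
        "red_herrings": ["ruler"],
        "answer": {"steps": ["use suction cup"], "key_insights": []},
    })
    print(load_problem(raw)["problem_id"])
```

Failing early on a malformed file is a lot cheaper than discovering it halfway through a paid evaluation run.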

2. Evaluation Framework

The evaluation system uses Python’s asyncio for concurrent testing, with retry logic for a little bit more API stability:

import aiohttp
from typing import Dict
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
async def evaluate_response(self, criteria: JudgingCriteria) -> Dict:
    """Evaluate a model's response using GPT-4 as judge."""
    async with aiohttp.ClientSession() as session:
        # ... evaluation logic
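If you want to see the concurrency and retry idea without any API keys, here is a dependency-free sketch. It swaps the decorator-based retry for a hand-rolled helper, and every name in it (with_retries, evaluate_all, fake_evaluate) is hypothetical rather than taken from the benchmark’s code.

```python
import asyncio


async def with_retries(make_call, attempts=3, base_delay=1.0):
    """Await make_call(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)


async def evaluate_all(problem_ids, evaluate, max_concurrency=2, base_delay=1.0):
    """Run evaluate(problem_id) for every ID, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(problem_id):
        async with semaphore:
            return await with_retries(lambda: evaluate(problem_id),
                                      base_delay=base_delay)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(pid) for pid in problem_ids))


async def fake_evaluate(problem_id):
    """Stand-in for a real judge call; sleeps briefly instead of hitting an API."""
    await asyncio.sleep(0.01)
    return {"problem_id": problem_id, "status": "evaluated"}


if __name__ == "__main__":
    results = asyncio.run(evaluate_all(["FLUID_001", "UV_001", "VAC_001"], fake_evaluate))
    for result in results:
        print(result)
```

The semaphore is there because firing every request at once is a good way to hit rate limits, and the backoff gives a flaky API a moment to recover before the next attempt.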

The scoring system looks at three components:

Physical Understanding Score (PUS) ∈ [0,2]

Measures understanding of relevant physical principles
Calculated as a normalized sum of demonstrated principles

Solution Path Score (SPS) ∈ [0,2]

Evaluates completeness and correctness of solution steps
Considers practical feasibility of proposed solutions

Red Herring Handling (RHH) ∈ {0,1}

A binary score for avoiding irrelevant items
Tests ability to focus on physically relevant factors
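The three components can be held in a small container that refuses out-of-range values. This is a sketch, not the benchmark’s actual code: the class name, the unweighted total, and the exact normalization (fraction of required principles demonstrated, scaled to 2) are all assumptions on my part.

```python
from dataclasses import dataclass


def normalized_principle_score(demonstrated, required):
    """Scale the share of required principles that were demonstrated to [0, 2]."""
    required = set(required)
    if not required:
        raise ValueError("a problem needs at least one required principle")
    return 2 * len(set(demonstrated) & required) / len(required)


@dataclass(frozen=True)
class PuzzleScore:
    physical_understanding: float  # PUS, in [0, 2]
    solution_path: float           # SPS, in [0, 2]
    red_herring_handling: int      # RHH, 0 or 1

    def __post_init__(self):
        if not 0 <= self.physical_understanding <= 2:
            raise ValueError("PUS must be between 0 and 2")
        if not 0 <= self.solution_path <= 2:
            raise ValueError("SPS must be between 0 and 2")
        if self.red_herring_handling not in (0, 1):
            raise ValueError("RHH must be 0 or 1")

    @property
    def total(self):
        """Unweighted sum of the three components (maximum 5)."""
        return self.physical_understanding + self.solution_path + self.red_herring_handling


if __name__ == "__main__":
    pus = normalized_principle_score(
        {"vacuum suction"},
        {"vacuum suction", "surface adhesion", "force multiplication"},
    )
    print(PuzzleScore(pus, 1.5, 0).total)
```

Validating ranges at construction time matters more than it looks when the numbers come from an LLM judge, which will occasionally hand back a 3 on a 0-to-2 scale.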

And yes, there are also so many other scoring methods, better and worse, that could be used! For example, RHH could be about how many irrelevant items are used in the solution, or it could be a measure of how viable the use is… the point is that picking these metrics is oftentimes pretty arbitrary, but it is crucial to making your benchmark credible, which mine is very much not.

Additionally, I didn’t want to rewrite any code after. Sue me.
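For the curious, the count-based RHH variant mentioned above could look something like the sketch below. It is not what the benchmark uses; the function name is made up, and the naive substring matching will happily miss paraphrases like “the measuring stick.”

```python
def red_herrings_avoided(response, red_herrings):
    """Fraction of red herring items that the response never mentions."""
    if not red_herrings:
        return 1.0
    text = response.lower()
    # Naive check: an item counts as "used" if its name appears anywhere in the text.
    used = [item for item in red_herrings if item.lower() in text]
    return 1 - len(used) / len(red_herrings)


if __name__ == "__main__":
    response = ("Press the suction cup onto the can, pull it out, "
                "then shine the UV light and copy the code into the notebook.")
    print(red_herrings_avoided(response, ["ruler", "notebook", "UV light"]))
```

A graded score like this separates a model that grabs one unnecessary item from one that grabs all of them, which the binary version cannot do.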

3. Model Interface Layer

The benchmark supports multiple LLM backends through a common interface:

class ModelInterface:
    """Interface for different LLM APIs."""
    async def generate_response(self, prompt: str) -> str:
        raise NotImplementedError

class GPT4Interface(ModelInterface):
    async def generate_response(self, prompt: str) -> str:
        # GPT-4 specific implementation
        ...

class ClaudeInterface(ModelInterface):
    async def generate_response(self, prompt: str) -> str:
        # Claude specific implementation
        ...

Two models… I can’t really afford any more, please understand.
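One nice side effect of a common interface is that you can plug in a free, offline stand-in while debugging the pipeline. The sketch below redefines the base class so it runs on its own; CannedModel and ask_all are hypothetical names, not part of the benchmark.

```python
import asyncio


class ModelInterface:
    """Interface for different LLM APIs (repeated here so the sketch is self-contained)."""
    async def generate_response(self, prompt: str) -> str:
        raise NotImplementedError


class CannedModel(ModelInterface):
    """Hypothetical offline backend that always returns a fixed answer."""
    def __init__(self, name: str, answer: str):
        self.name = name
        self.answer = answer

    async def generate_response(self, prompt: str) -> str:
        return self.answer


async def ask_all(models, prompt):
    """Send the same prompt to every backend and collect answers by model name."""
    answers = await asyncio.gather(*(m.generate_response(prompt) for m in models))
    return {m.name: a for m, a in zip(models, answers)}


if __name__ == "__main__":
    models = [
        CannedModel("model-a", "Use the suction cup."),
        CannedModel("model-b", "Measure the can with the ruler."),
    ]
    print(asyncio.run(ask_all(models, "How do I get the can out of the hole?")))
```

The rest of the pipeline only ever calls generate_response, so swapping a canned model for a real backend should not require touching the evaluation code.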

Let’s Test It!

So after some refilling of API balances, I was ready to go. I ran the benchmark, and after totally not wasting credits on like, five different errors, the results were in. And they were… interesting! I used GPT-4 and Claude 3.5 Haiku (I’m aware these are not similarly spec-ed models) as our victims.

The Challenge: A can is stuck in a hole that almost exactly matches its volume. A combination is written on the bottom of the can that needs to be retrieved.

"VAC_001": {
        "problem_id": "VAC_001",
        "setup": {
            "available_items": [
                "ruler",
                "notebook",
                "suction cup",
                "UV light"
            ],
            "constraints": [
                "can fits too tightly for finger grip",
                "hole is same diameter as can",
                "can is clean and dry"
            ]
        },
        "physical_principles": [
            "vacuum suction",
            "surface adhesion",
            "force multiplication"
        ],
        "red_herrings": [
            "ruler",
            "notebook",
            "UV light"
        ],
        "answer": {
            "steps": [
                "inspect suction cup and can surface",
                "ensure can surface is clean",
                "firmly press suction cup onto can top",
                "create vacuum seal",
                "pull can out using suction cup"
            ],
            "key_insights": [
                "suction cup creates vacuum seal",
                "vacuum force allows extraction",
                "other items don't provide necessary force"
            ]
        }
}

GPT-4’s Solution

Physical Principles: The main physical principle involved in this solution is the concept of air pressure differential, which is the same principle that allows a suction cup to stick to a surface. Air pressure is higher on one side of the suction cup (outside) compared to the other (inside), and this difference in pressure causes the cup to stick to the surface.

Solution Steps:

Attach the suction cup to the top of the can. The flat surface of the can should provide a good contact for the suction cup to adhere to.

Once the suction cup is firmly attached, gently pull upwards. The air pressure differential should cause the suction cup (and thus the can) to be lifted out of the hole.

Once the can is out of the hole, remove the suction cup from the top of the can.

Use the UV light to illuminate the combination written on the bottom of the can. Some inks only become visible under UV light, which is why this step is necessary.

Write down the combination on the notebook to ensure it’s not forgotten.

Required Items:

Suction cup: This is required to adhere to the can and pull it out of the hole, as the can fits too tightly for finger grip.

UV light: This is required to reveal the combination written on the bottom of the can, as it may be written in ink that only becomes visible under UV light.

Notebook: This is required to record the combination once it’s visible, ensuring it’s not forgotten.

Not bad at all! GPT-4 manages to figure out the physical principle behind the solution (even if it is very wordy), and manages to describe step by step how to solve the problem. It does end up using the UV light and notebook, but that’s more of a problem on my end: the prompt doesn’t punish the use of unnecessary items enough. Also, the uses for these items seem decently reasonable… so why not!

Haiku-3.5’s Solution

Physical Principles: The key physical principles involved in this scenario are the concept of volume and the use of suction to extract the can from the hole. Since the can fits almost exactly into the hole, it is likely that the can is stuck due to the tight fit and the lack of any air gaps or space between the can and the hole. This tight fit creates a vacuum-like effect, making it difficult to grip and pull the can out.

Solution Steps:

Measure the diameter of the can using the ruler. This will help determine the size of the hole and confirm that the can fits tightly within it.

Inspect the bottom of the can to locate the combination written on it. Use the UV light to illuminate the bottom of the can and make the combination visible, if it is not already clearly visible.

Attach the suction cup to the top of the can. The suction cup will create a temporary seal on the can’s surface, allowing you to grip and pull it out of the hole.

Gently pull the can upwards using the suction cup. The suction force should be strong enough to overcome the vacuum-like effect and extract the can from the hole.

Once the can is out, record the combination written on the bottom of the can in the notebook.

Required Items:

Ruler: To measure the diameter of the can and confirm the tight fit in the hole.

UV light: To illuminate the bottom of the can and make the combination visible, if it is not already clearly visible.

Suction cup: To create a temporary seal on the can’s surface and provide the necessary grip to pull the can out of the hole.

Notebook: To record the combination written on the bottom of the can.

On the other hand, Haiku’s solution was… certainly a solution of all time.

First, it wants to measure the diameter of the can using the ruler??? To determine the size of the hole and confirm that the can fits tightly inside it? Why would we need to do this? And do we need a ruler for that?

Second, it tells us to inspect the bottom of the can to locate the combination, when the entire problem is about not being able to pull the can out of the hole conventionally. This might just be an issue of order, but now I truly understand my friends’ feelings whenever I would tell them “just fix it man” to their numerous problems.

But it eventually does get the solution. So… not the worst.

Here’s a fancy radar graph of the results!

We see that both models are pretty similar in their capabilities, with GPT-4 being slightly better in physical understanding and solution path, and Haiku being slightly better in red herring handling. Overall though, both models kind of suck. Dang.

There are also only… 5 questions.

If you’d like to see the full breadth of questions, they’re on my GitHub.

LLM-as-a-Judge

By the way, the method I used to generate the evaluations, LLM-as-a-judge, has gained significant traction in the AI community, particularly after the work of Zheng et al. in their 2023 paper “Judging LLM-as-a-Judge.” The technique has proven remarkably effective, achieving over 80% agreement with human evaluators in tasks ranging from code assessment to dialogue quality evaluation!

Here’s where my experiment gets kind of cool (arguably, maybe, subjectively): I used this technique and had GPT-4 judge other LLMs’ physical reasoning abilities. Yes, I’m using an AI to judge other AIs.

Why does this work? Well, judging a response is actually a simpler task than generating one. When GPT-4 generates a solution to a physical puzzle, it needs to:

Understand the physical principles involved
Plan a sequence of steps
Consider all constraints
Generate a coherent explanation

But when judging, it only needs to check if specific criteria are met in an existing solution. The evaluation prompt is very focused:

def _create_evaluation_prompt(self, criteria: JudgingCriteria) -> str:
    return f"""You are an expert judge evaluating an LLM's understanding of physical reasoning puzzles.

Evaluate based on three criteria:
1. Physical Understanding Score (0-2): Does the solution correctly apply relevant physical principles?
2. Solution Path Score (0-2): Are the steps complete and feasible?
3. Red Herring Handling (0-1): Does it avoid using irrelevant items?
Situation: {criteria.situation}
Physical Principles Required: {criteria.correct_principles}
Solution Given: {criteria.model_response}
"""
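The prompt above does not pin down a reply format, so turning the judge’s answer into numbers is left open. One possible approach, sketched below, assumes you additionally ask the judge to end its reply with a JSON object such as {"pus": 2, "sps": 1.5, "rhh": 0}. That request, the key names, and the parse_judge_reply helper are all my own assumptions.

```python
import json
import re


def parse_judge_reply(reply):
    """Extract and validate the three scores from a judge reply containing JSON."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    scores = json.loads(match.group(0))  # JSONDecodeError is a ValueError subclass
    try:
        pus = float(scores["pus"])
        sps = float(scores["sps"])
        rhh = scores["rhh"]
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"malformed judge scores: {scores!r}") from exc
    # Enforce the ranges stated in the evaluation prompt.
    if not (0 <= pus <= 2 and 0 <= sps <= 2 and rhh in (0, 1)):
        raise ValueError(f"scores out of range: {scores!r}")
    return {"pus": pus, "sps": sps, "rhh": int(rhh)}


if __name__ == "__main__":
    reply = 'The solution is mostly sound. {"pus": 2, "sps": 1.5, "rhh": 0}'
    print(parse_judge_reply(reply))
```

Rejecting malformed or out-of-range replies, instead of quietly clamping them, makes it obvious when the judge has wandered off the rubric.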

To validate this approach, I followed the validation framework suggested by Zheng et al., performing spot-checks of GPT-4’s evaluations against my own judgments. Surprisingly (or perhaps unsurprisingly, given the broader research on LLM evaluation), it was remarkably consistent in identifying both correct physical understanding and flawed reasoning.
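A spot-check like this can be boiled down to a single agreement number. The sketch below is a simplified stand-in: it compares per-response scores directly, whereas Zheng et al. measure agreement on pairwise preferences, and the scores in the example are placeholders rather than my actual results.

```python
def agreement_rate(judge_scores, human_scores, tolerance=0.0):
    """Share of items where the judge's score is within `tolerance` of the human's."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    matches = sum(abs(j - h) <= tolerance
                  for j, h in zip(judge_scores, human_scores))
    return matches / len(judge_scores)


if __name__ == "__main__":
    # Placeholder scores for four responses, not real benchmark data.
    judge = [2, 1, 0, 2]
    human = [2, 1, 1, 2]
    print(agreement_rate(judge, human))
    print(agreement_rate(judge, human, tolerance=1))
```

With only five puzzles, a number like this is anecdotal at best, but it does force you to write your own scores down before peeking at the judge’s.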

Is this perfect? Absolutely not. There’s something philosophically weird about using one LLM to evaluate another. But in practice, it can work surprisingly well, just like how I moan and groan about the visual presentation of a dish on Masterchef, while setting my kitchen aflame trying to microwave a hot dog.

What I Learned

Building this benchmark taught me several things about benchmark design:

Clear Metrics Matter: Even for complex tasks like physical reasoning, you need unambiguous scoring criteria.

Red Herrings Are Powerful: Including irrelevant items reveals a lot about an LLM’s reasoning process.

Context Control is Hard: Ensuring LLMs don’t “hallucinate” additional physical context is challenging.

Is this a perfect benchmark? Not even close. Please don’t rub it in. Is it scientifically rigorous? Definitely not. But it’s been a fascinating exploration into an aspect of LLM capabilities, and sometimes the best we can learn can come from just trying things out and seeing what happens.

Now, if you’ll excuse me, I will be sneaking in a phone with an internet connection into my next escape room, for reasons that I’m legally unmotivated to disclose.

[1] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), Datasets and Benchmarks Track (2023)

[2] T. Coignion, C. Quinton, R. Rouvoy, “A Performance Study of LLM-Generated Code on Leetcode,” In 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024), Salerno, Italy (2024)

[3] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems,” In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada (2019)

[5] DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y. Wu, Z.F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao et al., “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948 (2025)

[6] Unless otherwise stated, all images are created by the author.
