Semantically Compress Text to Save On LLM Costs | by Lou Kratz

LLMs are nice… if they’ll match all your information

Picture by Christopher Burns on Unsplash

Initially revealed at https://blog.developer.bazaarvoice.com on October 28, 2024.

Massive language fashions are improbable instruments for unstructured textual content, however what in case your textual content doesn’t match within the context window? Bazaarvoice confronted precisely this problem when constructing our AI Evaluate Summaries function: thousands and thousands of consumer opinions merely received’t match into the context window of even newer LLMs and, even when they did, it could be prohibitively costly.

On this submit, I share how Bazaarvoice tackled this drawback by compressing the enter textual content with out lack of semantics. Particularly, we use a multi-pass hierarchical clustering strategy that lets us explicitly regulate the extent of element we need to lose in trade for compression, whatever the embedding mannequin chosen. The ultimate method made our Evaluate Summaries function financially possible and set us as much as proceed to scale our enterprise sooner or later.

Bazaarvoice has been gathering user-generated product opinions for practically 20 years so we have now rather a lot of information. These product opinions are utterly unstructured, various in size and content material. Massive language fashions are wonderful instruments for unstructured textual content: they’ll deal with unstructured information and determine related items of data amongst distractors.

LLMs have their limitations, nonetheless, and one such limitation is the context window: what number of tokens (roughly the variety of phrases) might be put into the community directly. State-of-the-art massive language fashions, resembling Athropic’s Claude model 3, have extraordinarily massive context home windows of as much as 200,000 tokens. This implies you may match small novels into them, however the web remains to be an unlimited, every-growing assortment of information, and our user-generated product opinions are not any totally different.

We hit the context window restrict whereas constructing our Evaluate Summaries function that summarizes the entire opinions of a particular product on our purchasers web site. Over the previous 20 years, nonetheless, many merchandise have garnered hundreds of opinions that shortly overloaded the LLM context window. In truth, we even have merchandise with thousands and thousands of opinions that might require immense re-engineering of LLMs to have the ability to course of in a single immediate.

Even when it was technically possible, the prices can be fairly prohibitive. All LLM suppliers cost based mostly on the variety of enter and output tokens. As you strategy the context window limits for every product, of which we have now thousands and thousands, we will shortly run up cloud internet hosting payments in extra of six figures.

To ship Evaluate Summaries regardless of these technical, and monetary, limitations, we centered on a quite easy perception into our information: Many opinions say the identical factor. In truth, the entire concept of a abstract depends on this: evaluate summaries seize the recurring insights, themes, and sentiments of the reviewers. We realized that we will capitalize on this information duplication to scale back the quantity of textual content we have to ship to the LLM, saving us from hitting the context window restrict and decreasing the working price of our system.

To realize this, we wanted to determine segments of textual content that say the identical factor. Such a job is simpler mentioned than performed: usually individuals use totally different phrases or phrases to specific the identical factor.

Thankfully, the duty of figuring out if textual content is semantically related has been an lively space of analysis within the pure language processing subject. The work by Agirre et. al. 2013 (SEM 2013 shared job: Semantic Textual Similarity. In Second Joint Convention on Lexical and Computational Semantics) even revealed a human-labeled information of semantically related sentences often known as the STS Benchmark. In it, they ask people to point if textual sentences are semantically related or dissimilar on a scale of 1–5, as illustrated within the desk beneath (from Cer et. al., SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation):

The STSBenchmark dataset is commonly used to judge how nicely a textual content embedding mannequin can affiliate semantically related sentences in its high-dimensional house. Particularly, Pearson’s correlation is used to measure how nicely the embedding mannequin represents the human judgements.

Thus, we will use such an embedding mannequin to determine semantically related phrases from product opinions, after which take away repeated phrases earlier than sending them to the LLM.

Our strategy is as follows:

First, product opinions are segmented the into sentences.
An embedding vector is computed for every sentence utilizing a community that performs nicely on the STS benchmark
Agglomerative clustering is used on all embedding vectors for every product.
An instance sentence — the one closest to the cluster centroid — is retained from every cluster to ship to the LLM, and different sentences inside every cluster are dropped.
Any small clusters are thought-about outliers, and people are randomly sampled for inclusion within the LLM.
The variety of sentences every cluster represents is included within the LLM immediate to make sure the burden of every sentiment is taken into account.

This will likely appear simple when written in a bulleted listing, however there have been some devils within the particulars we needed to kind out earlier than we might belief this strategy.

First, we had to make sure the mannequin we used successfully embedded textual content in an area the place semantically related sentences are shut, and semantically dissimilar ones are far-off. To do that, we merely used the STS benchmark dataset and computed the Pearson correlation for the fashions we desired to contemplate. We use AWS as a cloud supplier, so naturally we needed to judge their Titan Text Embedding fashions.

Beneath is a desk exhibiting the Pearson’s correlation on the STS Benchmark for various Titan Embedding fashions:

Source link

STOP Building Useless ML Projects – What Actually Works

Implementing IBCS rules in Power BI

Become a Better Data Scientist with These Prompt Engineering Tips and Tricks

STOP Building Useless ML Projects – What Actually Works

I Tried Buying a Car Through Amazon: Here Are the Pros, Cons

Amazon and eBay to pay ‘fair share’ for e-waste recycling

Artificial Intelligence Concerns & Predictions For 2025

Barbara Corcoran: Entrepreneurs Must ‘Embrace Change’

Most Popular

These States Have the Most Private Jet Flights: New Data

Building a Modern Dashboard with Python and Gradio

Build Your Own OCR Engine for Wingdings

Our Picks

STOP Building Useless ML Projects – What Actually Works

Credit Risk Scoring for BNPL Customers at Bati Bank | by Sumeya sirmula | Jul, 2025

The New Career Crisis: AI Is Breaking the Entry-Level Path for Gen Z

Semantically Compress Text to Save On LLM Costs | by Lou Kratz | Dec, 2024

LLMs are nice… if they’ll match all your information

Related Posts