    How I Fine-Tuned Granite-Vision 2B to Beat a 90B Model — Insights and Lessons Learned

    Team_AIBS News · July 26, 2025 · 25 min read


    Fine-tuning large language models or vision-language models is a powerful technique that unlocks their potential on specialized tasks. However, despite their effectiveness, these approaches are often out of reach for many users due to their high computational cost and the need for GPUs with large VRAM, resources that only a small fraction of end users can access.

    In this project, I fine-tuned IBM's Granite-Vision 2B, a relatively small yet powerful vision-language model, to tackle the challenge of converting images of tables into clean, structured HTML code.

    What makes this project particularly exciting is that the fine-tuning was carried out on a consumer-grade GPU, the NVIDIA RTX 4070 Ti SUPER, and the resulting 2-billion-parameter model was still able to outperform much larger models, including meta-llama/Llama-3.2-90B-Vision, on this image-to-text generation task. This success not only demonstrates the power of parameter-efficient fine-tuning methods like LoRA but also highlights the practical value of building specialized small models tailored to specific problems.

    In this post, I'll walk you through the motivation behind this work, the model and dataset choices, the custom HTML similarity metric I adapted, the experiments and results, and finally the key insights and lessons learned along the way. Whether you're interested in vision-language models, fine-tuning techniques, or practical AI applications, I hope this journey offers useful takeaways. The fine-tuning code used for this project was adapted from Hugging Face's Granite Vision fine-tuning cookbook, authored by Eli Schwartz, who in turn adapted the original code from Sergio Paniego.

    Motivation

    While working on Retrieval-Augmented Generation (RAG) projects, I ran into a major challenge: accurately extracting large and complex tables from PDFs, especially when those tables appeared as images. Despite trying different approaches, including tools like Unstructured and large vision-language models such as Meta's Llama 90B, the results often fell short of the accuracy needed.

    This led me to consider a different approach: a small, specialized vision-language model focused exclusively on table understanding and extraction. Such a model could serve as a dedicated preprocessing step to significantly improve RAG pipelines that rely on accurate table extraction.

    Around the same time, IBM released Granite-Vision 2B, a vision-language model with just the right balance of size and power. It is capable enough to handle complex tables, yet small enough to be fine-tuned on consumer-grade GPUs with 16 GB of VRAM. This made it an ideal candidate for my project.

    The Task: Image to HTML (Table Extraction)

    One important design choice was the target format: HTML. By converting tables into clean HTML code, we obtain a structured and widely supported representation that can be easily converted into other formats. For example, HTML tables can be readily imported into data analysis tools like Pandas as DataFrames, making downstream processing and analysis much more efficient.
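
    As a quick illustration of that convenience, here is a minimal sketch (the table string and numbers are made up for the example) showing how a generated HTML table drops straight into a Pandas DataFrame:

    import io
    import pandas as pd

    # Hypothetical model output: a small HTML table
    html = "<table><tr><th>Model</th><th>HTML Similarity</th></tr><tr><td>Granite-Vision 2B (fine-tuned)</td><td>0.77</td></tr></table>"

    # read_html parses every <table> in the input and returns a list of DataFrames
    df = pd.read_html(io.StringIO(html))[0]
    print(df)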

    The original plan was to build a custom dataset by extracting HTML table tags, rendering them as images, and pairing each image with its corresponding HTML code. Fortunately, I found an existing solution: the PubTabNet-HTML dataset, which contains over 568,000 image-HTML pairs, far more than needed for this project.

    PubTabNet was developed by IBM and is based on scientific articles from the PubMed Central Open Access Subset (commercial use collection). The tables were extracted by aligning the PDF and XML versions of the articles. The annotations (i.e., the HTML labels) are licensed under the Community Data License Agreement - Permissive - Version 1.0, and while IBM does not own the images, they are used in accordance with the PMC Open Access Subset Terms of Use. This makes the dataset suitable for both research and commercial applications, provided the license terms are followed.

    Custom Metric: HTML Similarity

    Standard text similarity metrics like BLEU or ROUGE are insufficient for evaluating HTML table generation because they focus primarily on surface-level text matching and ignore important structural and stylistic aspects of HTML code.

    To better capture the quality of generated HTML tables, I adapted a custom HTML Similarity metric that combines several complementary components, where the most important ones (style and structure) are imported from niteru:

    • Style similarity (S): Extracts the CSS classes of each HTML document and computes the Jaccard similarity of the two sets of classes.
    • Structural similarity (T): Uses sequence comparison of the HTML tags to compute the similarity.
    • Content similarity (C): Based on the normalized edit distance between the extracted plain-text content of the tables.
    • Token overlap similarity (J): The Jaccard similarity between the sets of content tokens.

    The final similarity score M is a weighted sum of these components, using the weights from the implementation below:

    M = 0.10 * S + 0.40 * T + 0.30 * C + 0.20 * J

    I manually tested the metric on various example outputs, iteratively adjusting the weighting coefficients to better capture meaningful similarities. This process resulted in a balanced evaluation that fairly rewards accurate table structure and style alongside precise text content. The Python implementation is as follows:

    import re

    from bs4 import BeautifulSoup
    from torchmetrics.text import EditDistance
    from niteru import style_similarity, structural_similarity

    ed_distance = EditDistance()

    def extract_table_text(html):
        """Extracts only the text from an HTML table in row-wise, space-separated format."""
        soup = BeautifulSoup(html, "html.parser")
        table = soup.find("table")  # Find the first table
        if not table:
            return ""
        # Extract rows and join cells with spaces
        return "\n".join(
            " ".join(cell.get_text(strip=True) for cell in row.find_all(["th", "td"]))
            for row in table.find_all("tr")
        )

    def extract_html_table(html):
        """Extracts the <table>...</table> block from text"""
        match = re.search(r'<table.*?</table>', html, re.DOTALL | re.IGNORECASE)
        if match:
            return match.group()
        return html

    def html_similarity(html1, html2):
        html1 = extract_html_table(html1)
        html2 = extract_html_table(html2)
        # Compute individual similarity scores
        style_sim = style_similarity(html1, html2)  # Returns a value in [0, 1]
        struct_sim = structural_similarity(html1, html2)  # Returns a value in [0, 1]
        txt1, txt2 = extract_table_text(html1), extract_table_text(html2)
        content_sim = 1 - (ed_distance(txt1, txt2) /
                           (max(len(txt1), len(txt2)) + 1e-10))  # Avoid division by zero
        jaccard_sim = (len(set(txt1.split()).intersection(set(txt2.split()))) /
                       (len(set(txt1.split()).union(set(txt2.split()))) + 1e-10))

        # Weighted sum of the similarities
        final_score = (0.10 * style_sim) + (0.40 * struct_sim) + (0.30 * content_sim) + (0.20 * jaccard_sim)
        # Clamp the final score to [0, 1]
        final_score = max(0, min(1, final_score))
        return final_score
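
    A quick sanity check of the metric, using two tiny hand-written tables that differ in a single cell (illustrative values only):

    reference = "<table><tr><th>A</th><th>B</th></tr><tr><td>1</td><td>2</td></tr></table>"
    prediction = "<table><tr><th>A</th><th>B</th></tr><tr><td>1</td><td>3</td></tr></table>"

    # Identical style and structure, almost identical content, so the score should be close to 1
    print(html_similarity(prediction, reference))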
    
    
    
    

    The metric also includes a regex-based function to extract only the HTML content inside the <table>...</table> tags. This was necessary because one of the reference models generated incomplete or extra HTML outside of the table structure. By focusing the comparison strictly on the table content, the metric provides a fairer and more meaningful evaluation across models.

    Developing a custom evaluation metric like this is crucial for reliably tracking model improvements and benchmarking performance against the reference models.

    Training Setup

    To fine-tune the model efficiently on my NVIDIA RTX 4070 Ti SUPER, which has 16 GB of VRAM, I used LoRA (Low-Rank Adaptation). This allowed me to update only a small number of parameters, significantly reducing GPU memory usage. In fact, during training the model used only about half of the available VRAM, with enough headroom to play with longer sequences, but not enough to handle more than one sample per batch. Additionally, LoRA is generally faster to train than approaches like QLoRA.

    LoRA Setup

    I used the following LoRA configuration:

    from peft import LoraConfig

    # Setup LoRA: collect the projection layers of the chosen layer types (vision, language, or both)
    target_modules = []
    for layer_type in layers_to_tune:
        target_modules.extend(
            name for name, _ in model.named_modules()
            if (layer_type in name)
            and '_proj' in name
        )
    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.1,
        target_modules=target_modules,
        use_dora=True,
        init_lora_weights="gaussian"
    )

    Key points:

    • r=16: This low-rank dimension provides a good balance between model capacity and GPU memory usage.
    • use_dora=True: DoRA (Weight-Decomposed Low-Rank Adaptation) improves the learning capacity and stability of LoRA by decomposing the pretrained weights into magnitude and direction components, helping the model come closer to the capacity of full fine-tuning, all without adding inference overhead. It performed slightly better than the default setting.
    • init_lora_weights="gaussian": No particular reason; I didn't want to experiment with this parameter.
    • target_modules: This flexible setup allows selectively targeting the vision layers, the language layers, or both, depending on the experiment. In practice, the vision layers remained unaffected, even with use_dora=False, since DoRA currently supports only embedding, linear, and Conv2d layers. As a result, I fine-tuned only the language layers (a quick way to verify this is sketched below).
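
    To check which modules actually end up trainable with the configuration above, a minimal sketch (assuming the same model and peft_config objects) is to wrap the model and inspect the trainable parameters:

    from peft import get_peft_model

    # Wrap the base model with the LoRA/DoRA configuration defined above
    peft_model = get_peft_model(model, peft_config)

    # Print the number and percentage of trainable parameters
    peft_model.print_trainable_parameters()

    # List a few of the adapted parameters to confirm only language '_proj' layers were matched
    trainable = [n for n, p in peft_model.named_parameters() if p.requires_grad]
    print(trainable[:5])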

    Dataset Setup

    During my initial experiments, I kept running into out-of-memory (OOM) errors, even though there was still plenty of GPU VRAM available after loading the model, LoRA layers, and optimizer parameters (around 4 GB still free). There were no memory spikes during training, but the crashes consistently occurred at the same training step.

    After some investigation, I realized that the problem was caused by large tables, which resulted in extremely long token sequences. To address this, I adjusted the max_seq_length parameter and filtered out samples that exceeded this limit. After experimentation, I found that using max_seq_length = 1024 allowed me to fine-tune the model reliably without triggering OOM errors.

    To filter out oversized tables, I wrote a simple data processing function that:

    • Filters out samples whose HTML token length exceeds max_seq_length
    • Automatically balances the number of training and test samples
    • Uses streaming to avoid loading the full dataset into memory (PubTabNet-HTML is quite large, around 10 GB on disk)


    from datasets import Dataset, load_dataset
    from tqdm import tqdm

    def load_process_filter_dataset(dataset, max_seq_length, num_train_images, num_test_images, system_message):
        global processor
        ds = load_dataset(dataset, split='train', streaming=True)
        max_html_tokens = max_seq_length - len(processor.tokenizer.tokenize(system_message))
        num_total_needed = num_train_images + num_test_images
        filtered_samples = []
        p_bar = tqdm(total=num_total_needed, desc="Filtering dataset samples")
        for sample in ds:
            processed = process_and_filter_example(sample, max_html_tokens)
            if processed:
                filtered_samples.append(processed)
                p_bar.update(1)
            if len(filtered_samples) >= num_total_needed:
                break
        p_bar.close()
        # Convert to an in-memory dataset
        ds_filtered = Dataset.from_list(filtered_samples)
        # Split into train/test
        ds_train = ds_filtered.select(range(num_train_images))
        ds_test = ds_filtered.select(range(num_train_images, num_total_needed))
        return ds_train, ds_test

    def process_and_filter_example(example, max_html_tokens):
        global processor
        extracted_table = extract_html_table(example['html_table'])
        token_count = len(processor.tokenizer.tokenize(extracted_table))
        if token_count < max_html_tokens:
            example['html_table'] = extracted_table
            return example
        return None

    The final configuration used num_train_images=10000 and num_test_images=250 to compute the evaluation loss.
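
    For reference, a sketch of how the helper above would be called with that final configuration (the dataset identifier and system_message are placeholders; substitute the ones used in your own setup):

    train_dataset, test_dataset = load_process_filter_dataset(
        dataset="your-org/pubtabnet-html",  # placeholder Hub id for the PubTabNet-HTML dataset
        max_seq_length=1024,
        num_train_images=10000,
        num_test_images=250,
        system_message=system_message,  # the system prompt defined elsewhere in the project
    )
    print(len(train_dataset), len(test_dataset))  # 10000, 250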

    Fine-Tuning Configuration

    For training, I used the SFTTrainer (from Hugging Face's TRL library) to fine-tune the model:

    from trl import SFTConfig, SFTTrainer

    # Training arguments
    training_args = SFTConfig(
        output_dir=f"src/models/{model_name.split('/')[-1].replace('-', '_', 1).split('-')[0]}/checkpoints/{experiment_name}",
        num_train_epochs=1,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=gradient_accumulation_steps,
        max_seq_length=max_seq_length,
        warmup_steps=10,
        learning_rate=3e-4,
        weight_decay=0.01,
        logging_strategy="steps",
        eval_strategy='steps',
        logging_steps=25,
        save_strategy="steps",
        save_steps=50,
        save_total_limit=1,
        greater_is_better=False,
        load_best_model_at_end=True,
        optim="adamw_torch_fused",
        bf16=True,
        push_to_hub=False,
        report_to="wandb" if not debug else "none",
        remove_unused_columns=False,
        gradient_checkpointing=True,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        dataset_num_proc=8
    )

    # Setup Trainer
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        data_collator=collate_fn,
        peft_config=peft_config,
        processing_class=processor.tokenizer
    )

    Key points:

    • num_train_epochs=1: The dataset is very large, and to run multiple experiments efficiently I chose to train for just one full epoch while maximizing learning per sample and the number of training samples.
    • per_device_train_batch_size=1: Larger batch sizes would not fit in GPU memory without significantly reducing max_seq_length, which would hurt performance on large tables. Keeping longer sequences was more important for this task.
    • gradient_accumulation_steps=8: Used to effectively simulate a larger batch size and help stabilize the learning process, compensating for the small physical batch. This is the final value, but I experimented with gradient_accumulation_steps=4 as well.
    • optim="adamw_torch_fused" and bf16=True: These settings leverage modern NVIDIA architectures (Ada Lovelace) to accelerate training and reduce memory usage, as recommended for this hardware.
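
    For completeness, a minimal sketch of how the configured trainer is launched and the resulting adapter saved (a sketch under the setup above, not additional project code):

    # Run one epoch of training; with load_best_model_at_end=True the trainer
    # reloads the checkpoint with the lowest evaluation loss when training finishes
    trainer.train()

    # Persist the LoRA/DoRA adapter weights (not the full base model)
    trainer.save_model(training_args.output_dir)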

    Evaluation Loss Workaround

    At the time of developing this project, there was a known issue in the Transformers + LoRA integration that caused an error when running evaluation with a validation dataset during training. Fortunately, a community-tested workaround is available (although not yet merged into the main branch), and I successfully used this fix in my experiments.

    Evaluation (Inference) Setup

    The evaluation dataset used for final scoring was completely independent from the eval_dataset used during training. It consists of 500 randomly selected images, none of which were included in either the train_dataset or the training eval_dataset.

    Once fine-tuning was complete, I used the best model checkpoint, selected based on the lowest evaluation loss, to run inference on these 500 samples.

    Initially, I attempted to perform inference by simply loading the LoRA/DoRA adapter on top of the base model. However, I found that inference with DoRA adapters is extremely slow when they are not merged into the model weights (as explained in the official PEFT docs). In fact, generating one random test sample took about 90 seconds in this configuration.

    To resolve this, I merged the adapter weights into the base model, which is the recommended practice, and after merging, inference speed improved dramatically: down to roughly 20 seconds for the same sample, making full evaluation runs much more practical.
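
    A sketch of this merge step with the PEFT API (the checkpoint id and adapter path below are assumptions; adjust them to your own layout):

    import torch
    from transformers import AutoModelForVision2Seq
    from peft import PeftModel

    base_model = AutoModelForVision2Seq.from_pretrained(
        "ibm-granite/granite-vision-3.1-2b-preview",  # assumed Granite-Vision 2B checkpoint id
        torch_dtype=torch.bfloat16,
        device_map="cuda",
    )
    model = PeftModel.from_pretrained(base_model, "checkpoints/lang_table_only_4")  # assumed adapter path
    model = model.merge_and_unload()  # fold the DoRA weights into the base weights for fast inference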

    The reference models used for comparison with my fine-tuned models are:

    • meta-llama/Llama-3.2-90B-Vision: Meta's massive 90-billion-parameter model, the main baseline I aimed to surpass through specialization and parameter-efficient fine-tuning of a much smaller VLM.
    • KennethTM/pix2struct-base-table2html: A much smaller model fine-tuned from Google's pix2struct-base, highly specialized for exactly the same dataset I used in this project. Thanks to its smaller size, the developer(s) were able to train it on many more samples and over longer training runs, demonstrating the key advantage of using smaller, targeted models for specific tasks.

    These two baselines allowed me to benchmark both scaling-based performance (vs. the 90B model) and specialization efficiency (vs. the smaller, dedicated Pix2Struct model).

    Experiments & Results

    A total of 9 experiments were conducted, iteratively modifying one or two components at a time. The goal was to understand the effect of each change on model performance, gradually refining the setup to achieve the best possible HTML Similarity score compared to the reference models.

    The experimental process was incremental: whenever a change improved the results, it was incorporated into the next round of experiments while I continued exploring new variations.

    The experiments focused on adjusting the following components:

    1. Vision vs. Language Layers
    • 1.1 lang_only
    • 1.2 vision_only
    • 1.3 lang_vision

    2. Ground Truth Output Format

    3. Training Framework

    • 3.1 lang_table_unsloth
    • 3.2 vision_table_unsloth

    4. Gradient Accumulation

    5. Prompt Format

    6. Gradient Accumulation & Dataset Size

    Both the evaluation loss and the HTML Similarity metric were used to assess model performance, and I found them to be well correlated, confirming that HTML Similarity is a good proxy for how well the model is learning the task.

    Before diving into the results of each experiment, let's first take a look at GPU memory usage during training, which is often the most critical factor in determining whether a model can be fine-tuned on consumer hardware.

    GPU Memory Usage During Training | Image by author from wandb.ai

    As shown in the graph, GPU usage remained stable throughout training, averaging around 75% VRAM utilization, or roughly 12 GB on my GPU. Most of that (~5.5 GB) is the frozen model weights. LoRA gradients and optimizer states take very little (well under 1 GB). Activations and overhead fill the rest (~5-6 GB), which depends on batch_size and max_seq_length.
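
    These figures are straightforward to reproduce; a rough sketch for logging them during training (assumes a single CUDA device):

    import torch

    def vram_report():
        allocated = torch.cuda.memory_allocated() / 1024**3  # tensors currently held (weights, optimizer, activations)
        reserved = torch.cuda.memory_reserved() / 1024**3    # total memory reserved by the caching allocator
        return f"allocated: {allocated:.1f} GB | reserved: {reserved:.1f} GB"

    print(vram_report())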

    First Run: lang_only

    This experiment uses the following initial components/parameters:

    These were the starting values for the first experiment. In subsequent runs, I modified many of them as I refined the approach. This first experiment focused only on tuning the language layers, while training the model to predict the full raw HTML output, including everything inside and around the <table> tags.

    Since this was the first run, I'll include the training loss curve here to illustrate how it behaves. For later experiments, I'll omit this graph, since the behavior was similar across runs, with minor variations. In practice, the evaluation loss is more useful for comparing performance across experiments.

    Training Loss | Image by author from wandb.ai

    One important note about the logging configuration: logging_steps=25 means that the training loss is only logged every 25 steps, where each logged value is the average over gradient_accumulation_steps=4. As a result, the largest drop in loss appears at the second log point, where most of the initial learning happens. After that, the model continues learning more gradually, with a slowly decreasing trend, depending on the difficulty of the training samples.

    Now, let's take a look at the evaluation loss:

    Validation Loss 1 | Image by author from wandb.ai

    Since we are evaluating on the same set of 250 validation samples, the evaluation loss curve gives us a more stable and meaningful view of model learning, and will serve as a baseline for comparisons across future runs.

    Here, we observe a clear and consistent downward trend throughout training. The initial loss starts close to 0.03 and gradually improves as training progresses, eventually stabilizing just below 0.015.

    The smooth nature of this curve, compared to the more variable training loss, reflects the uniform structure of the validation set and confirms that the model is generalizing well to unseen samples, even with a small batch size and a single epoch of training.

    Now, let's compare the performance of this fine-tuned model against the reference models on the HTML Similarity metric:

    As we can see, this first experiment already delivers strong performance gains, improving on the base Granite-Vision 2B model by a large margin (+0.18) and clearly outperforming Llama 90B Vision on this specialized task. Only Pix2Struct retains a slight lead at this stage.

    Second Run: vision_only

    There isn't much to analyze in this run. I tested several variations that could potentially unblock learning in the vision layers, including drastically increasing the learning rate, but without success.

    While the base code suggests that fine-tuning the vision layers should be possible, in practice I found that it was not working in this setup. The following evaluation loss curve confirms that no learning occurred; the loss remained constant throughout training. To avoid wasting compute resources, I stopped the run early:

    Validation Loss 2 | Image by author from wandb.ai

    Additionally, training was noticeably faster in this run than in the previous lang_only experiment, suggesting that the language layers (which contain the bulk of the model's parameters) remained frozen and only the small vision layers were being processed:

    Validation Samples per Second 1 | Image by author from wandb.ai

    Third Run: lang_vision

    At this point, it was clear that only the language layers were being effectively trained. In this lang_vision run, where both the language and vision layers were selected, I expected results similar to lang_only.

    Indeed, the evaluation loss curve confirmed this expectation, showing nearly identical behavior to lang_only:

    Validation Loss 3 | Image by author from wandb.ai

    Once this was clear, I again stopped training early to conserve resources and proceeded to test new approaches.

    Fourth Run: lang_table_only

    This experiment modified the following component:

    The goal of this run was to train the model to predict only the table content, without any surrounding HTML wrapper code. This approach could help improve learning by removing unnecessary tokens, and it also aligns the training behavior more closely with Pix2Struct's model.

    Additionally, by stripping out the wrapper HTML, the target sequences became shorter, which allowed longer and more complex tables to fit within the model's context window. This change could also improve the model's ability to generalize to larger or more detailed tables.

    Let's look at the evaluation loss compared to the first run:

    Validation Loss 4 | Image by author from wandb.ai

    At first glance, the higher evaluation loss might seem counterintuitive. However, there is a clear explanation: the wrapper HTML code is trivial for the model to learn, since it tends to be nearly identical across many training samples. These repetitive tokens reduce the cross-entropy loss, artificially lowering the average loss in earlier runs. By removing them, the model now focuses exclusively on the more challenging and variable table content, resulting in a higher but more meaningful loss value.

    Now, let's see how this change impacted the HTML Similarity metric:

    In this first test, we observe no significant gain or degradation from using the new output format. It is possible that the model would need more epochs or more training samples to fully adapt to it. Another idea is to update the prompt so that from the very first step the model understands it should focus only on the table content, rather than having to infer this behavior through training alone. This is explored in the next experiments.

    Fifth / Sixth Run: lang_table_unsloth, vision_table_unsloth

    In these experiments, I explored the following components:

    At this point, I discovered the promising Unsloth framework, which claims to offer 2x faster training with up to 70% lower memory usage. Naturally, I wanted to test whether it could accelerate my workflow.

    My first idea was to leverage the improved memory handling to run longer sequences (max_seq_length=2048), but in my case this quickly led to out-of-memory (OOM) errors, so I reverted to my previous configuration.

    The training speed improvements, however, were undeniable: almost 4x faster than my previous runs:

    Validation Samples per Second 2 | Image by author from wandb.ai

    Unfortunately, this came at a clear cost in loss performance:

    Validation Loss 5 | Image by author from wandb.ai

    Given this noticeable drop in quality, I paused the experiment to investigate further, particularly to see whether Unsloth would let me train the vision layers, which is one of its advertised advantages. However, I encountered exactly the same behavior as with Hugging Face Transformers: no actual learning in the vision layers.

    With these results in mind, I decided to set Unsloth aside for this project and continue using Hugging Face Transformers, which had shown more reliable learning in earlier runs.

    Seventh Run: lang_table_only_2

    Here are the new parameters for this run:

    Going back to the previous configuration, I wanted to investigate the impact of a larger virtual batch size (via a higher gradient_accumulation_steps).

    The results were promising: the evaluation loss became smoother and trended closer to the original lang_only run, even though the model was now predicting only the table content:

    Validation Loss 6 | Image by author from wandb.ai

    Based on this positive result, I decided to keep the gradient_accumulation_steps=8 setting for the final experiment.

    Evaluating this model on HTML Similarity showed a small but meaningful improvement, finally reaching parity with Pix2Struct:

    Naturally, the goal is not just to match Pix2Struct but to surpass it. Two important levers remained to explore: dataset size and the prompt.

    Eighth Run: lang_table_only_3

    The updated parameters for this run were:

    I accidentally reverted gradient_accumulation_steps back to 4 in this run and only realized it once training was nearly complete, but this actually gave me an extra chance to observe its effect on learning.

    The main goal here was to double the training size (to 10K images) and to test the updated, clearer prompt format. Unfortunately, a random CUDA error caused training to halt at around 80% completion; even so, the improvement was clear:

    Validation Loss 7 | Image by author from wandb.ai

    As expected, some smoothness was lost due to the smaller virtual batch size, but the new prompt proved very effective, noticeably boosting model learning.

    This set the stage perfectly for the final experiment: the improved prompt, 10K training samples, and gradient_accumulation_steps restored to 8.

    Final Run: lang_table_only_4

    The final set of parameters is:

    The evaluation loss for this final run:

    Validation Loss 7 | Image by author from wandb.ai

    As expected, restoring gradient_accumulation_steps to 8 smoothed the loss curve, reducing spikes and achieving slightly lower overall loss values. With a full epoch of training on 10K images, this became the best-performing model across all experiments.

    Now, let's look at the final results on the HTML Similarity metric:

    Final HTML Similarity Results | Image by author from matplotlib

    The goal of this project was achieved: the fine-tuned model now surpasses both reference models on this task. Compared with the original Granite-Vision 2B, LoRA fine-tuning improved performance to 0.77, a gain of 21 percentage points, all achieved in under 8 hours on a consumer-grade GPU.

    Qualitative Results

    To better illustrate how much the model improved through fine-tuning, let's look at a specific example: Image ID 618932.

    PubTabNet Evaluation Sample with ID 618932 | Image from PMC

    This table is particularly challenging: under the Kappa column there are sub-headers (Present study and King et al. 2001). Such complex layouts usually challenge generic VLMs, especially when they have not been exposed to enough similar examples during training. Models can often understand these sub-headers and answer questions about them, but generating the full table structure in HTML typically requires extra prompt tuning and specialized fine-tuning.

    Let's first see how the base, non-fine-tuned Granite-Vision 2B model performs on this task.

    Baseline: Raw Granite-Vision 2B

    The model can answer questions based on the table correctly:

    prompt = 'What is the Kappa value for the question "Do you communicate with this power?" in the present study?'
    res = predict(sample['image'], prompt=prompt)
    print(res)

    Out[1]:

    74

    However, when asked to generate the full HTML table, the model struggles:

    prompt = "Convert table to HTML (<table>...</table>)"
    html = predict(sample['image'], prompt=prompt)
    html = '' if '<table>' not in html else html
    display(HTML(html))

    Out[2]:

    And the HTML Similarity metrics for this attempt:

    Style similarity: 1.0000
    Structural similarity: 0.4091
    Lev-Edit Distance: 0.1434
    Final HTML Similarity Score: 0.3619

    Fine-Tuned Model: lang_table_only_4

    Now, let's run the exact same test using the fine-tuned model:

    from src.models.granite_vision.transformers_library import LLM as granite_vision

    model = granite_vision(
        model_path,
        adapter='lang_table_only_4'
    )

    Out[4]:

    Model loaded
    Adapter 'lang_table_only_4' loaded
    Adapter 'lang_table_only_4' merged
    Using cuda: NVIDIA GeForce RTX 4070 Ti SUPER

    And the same prediction prompt:

    prompt = "Convert table to HTML (<table>...</table>)"
    html = model.predict(sample['image'], max_new_tokens=1024, query=prompt)
    display(HTML(html))

    Out[5]:

    The fine-tuned model now produces an output that closely matches the ground truth, correctly capturing the table structure and sub-headers, something the base model struggled with.

    Final HTML Similarity metrics:

    Style similarity: 1.0000
    Structural similarity: 0.9231
    Lev-Edit Distance: 1.0000
    Final HTML Similarity Score: 0.9615

    This example also shows a clear quantitative improvement: from a score of 0.36 to 0.96 on a complex table structure, confirming that fine-tuning on this specialized task dramatically boosts the model's capability.

    Inference Speed

    One major advantage of using a smaller model, aside from the ability to fine-tune it on consumer-grade hardware, is inference speed. Even when larger models offer competitive performance, latency and throughput remain key factors, especially in production settings.

    Let's compare the inference speed of the different models:

    Inference Speed | Image by author from matplotlib

    As shown in the plot, Pix2Struct is by far the fastest model. For some use cases, such as batch-processing thousands of documents for table extraction, this speed advantage could translate into significant time savings and lower compute costs.

    However, the fine-tuned Granite-Vision 2B strikes a good balance when the volume of documents to process is not massive, offering superior accuracy on this specialized task and reasonably fast inference without the need for very large compute infrastructure.

    Conclusions

    This project demonstrated that with LoRA-based fine-tuning and a targeted task (table extraction to HTML), a small vision-language model (Granite-Vision 2B) can outperform much larger models, even Meta's 90B Llama Vision, while requiring only a consumer GPU and less than a day of training.

    A few key takeaways:

    • Small, specialized models matter: you don't always need 70B+ models to solve specific problems with high accuracy.
    • Parameter-efficient fine-tuning (LoRA) is a game-changer: adapting large foundation models becomes accessible to most practitioners.
    • Prompt design and training targets have a big influence: small changes (like switching to lang_table_only or refining the prompt) directly impacted performance.
    • Having a custom metric (HTML Similarity) was critical for tracking meaningful progress beyond generic text-based metrics.
    • Smaller models not only train faster, they also infer faster, which is ideal for production pipelines with high volume.

    Finally, and perhaps most importantly, this kind of experimentation shows that you can move fast and iterate even with limited hardware. Fine-tuning powerful open models and adapting them to real-world tasks is no longer reserved for big labs.

    I hope this encourages other AI engineers to experiment with small VLMs and fine-tuning techniques for their own projects and solutions, and to see that powerful results are possible even without massive compute budgets!

    What's Next?

    There are definitely some interesting follow-up ideas that could be explored next:

    • Prompt engineering refinements: Final tests (while writing this blog) showed that separating prompts into a system message (defining behavior) and a user message (providing task instructions) significantly improved the base model's performance. Applying this strategy during fine-tuning could further enhance the model's ability to consistently generate accurate HTML. This will be tested in upcoming experiments.
    • Training vision layers: Currently, only the language layers are fine-tuned, as training the vision layers through a text-only loss proved ineffective. A more advanced approach could involve adding an auxiliary vision loss, for example contrastive learning between vision outputs and HTML structure, to better adapt the vision backbone for table extraction tasks.
    • Improved generalization: The current model is fine-tuned on a single dataset. Expanding training to include more diverse document layouts, table styles, and noisy OCR scenarios could improve robustness and transferability to real-world data.

    Links


    If you liked this post, feel free to reach out or share your own experiments!


