
    Segmenting Water in Satellite Images Using Paligemma | by Dr. Carmen Adriana Martínez Barbosa | Dec, 2024

    By Team_AIBS News | December 30, 2024 | 10 Mins Read


    Multimodal models are architectures that simultaneously integrate and process different data types, such as text, images, and audio. Some examples include CLIP and DALL-E from OpenAI, both released in 2021. CLIP understands images and text jointly, allowing it to perform tasks like zero-shot image classification. DALL-E, on the other hand, generates images from textual descriptions, allowing the automation and enhancement of creative processes in gaming, advertising, and literature, among other sectors.

    Visual language models (VLMs) are a special case of multimodal models. VLMs generate language based on visual inputs. One prominent example is Paligemma, which Google released in May 2024. Paligemma can be used for Visual Question Answering, object detection, and image segmentation.

    Some blog posts explore the capabilities of Paligemma in object detection, such as this excellent read from Roboflow:

    However, by the time I wrote this blog, the existing documentation on preparing data to use Paligemma for object segmentation was vague. That is why I wanted to evaluate whether it is easy to use Paligemma for this task. Here, I share my experience.

    Before going into detail on the use case, let’s briefly revisit the inner workings of Paligemma.

    Architecture of Paligemma2. Source: https://arxiv.org/abs/2412.03555

    Paligemma combines a SigLIP-So400m vision encoder with a Gemma language model to process images and text (see figure above). In the new version of Paligemma released in December 2024, the vision encoder can preprocess images at three different resolutions: 224px, 448px, or 896px. The vision encoder preprocesses an image and outputs a sequence of image tokens, which pass through a linear projection and are then concatenated with the input text tokens. This combined token sequence is further processed by the Gemma language model, which outputs text tokens. The Gemma model comes in different sizes, from 2B to 27B parameters.
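
To get a feel for the sequence lengths involved, the sketch below counts the image tokens the vision encoder would emit at each of the three resolutions. It assumes the 14-pixel patch size of the SigLIP-So400m/14 encoder, and the helper name `num_image_tokens` is hypothetical; treat it as a back-of-the-envelope illustration rather than code from the Paligemma repository.

```python
PATCH_SIZE = 14  # assumption: SigLIP-So400m/14 splits images into 14x14 pixel patches


def num_image_tokens(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of image tokens for a square image with the given side length."""
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


for res in (224, 448, 896):
    print(f"{res}px -> {num_image_tokens(res)} image tokens")
```

Under this assumption, each doubling of the resolution quadruples the number of image tokens the language model has to process, which is worth keeping in mind when choosing a checkpoint for a memory-limited GPU.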

    An example of model output is shown in the following figure.

    Example of an object segmentation output. Source: https://arxiv.org/abs/2412.03555

    The Paligemma model was trained on various datasets such as WebLI, OpenImages, WIT, and others (see this Kaggle blog for more details). This means that Paligemma can identify objects without fine-tuning. However, such abilities are limited. That’s why Google recommends fine-tuning Paligemma in domain-specific use cases.

    Input format

    To fine-tune Paligemma, the input data needs to be in JSONL format. A dataset in JSONL format has each line as a separate JSON object, like a list of individual records. Each JSON object contains the following keys:

    Image: The image’s name.

    Prefix: This specifies the task you want the model to perform.

    Suffix: This provides the ground truth the model learns from to make predictions.

    Depending on the task, you must change the JSON object’s prefix and suffix accordingly. Here are some examples:

    {"image": "some_filename.png", 
    "prefix": "caption en" (to indicate that the model should generate an English caption for an image),
    "suffix": "This is an image of a big, white boat traveling in the ocean."
    }
    {"image": "another_filename.jpg", 
    "prefix": "How many people are in the image?",
    "suffix": "ten"
    }
    {"image": "filename.jpeg", 
    "prefix": "detect airplane",
    "suffix": " airplane" (four corner bounding box coordinates)
    }

    If you have several categories to be detected, add a semicolon (;) between each category in the prefix and suffix.
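
As a minimal sketch of the format described above, the snippet below writes a couple of records to a JSONL file with Python's standard `json` module and reads them back. The helper names (`detection_prefix`, `write_jsonl`, `read_jsonl`) and the file name `train.jsonl` are hypothetical, and the exact spacing around the semicolon is an assumption.

```python
import json
import os
import tempfile


def detection_prefix(classes):
    """Build a 'detect' prefix; several classes are separated by a semicolon."""
    # Assumption: one space on each side of the semicolon.
    return "detect " + " ; ".join(classes)


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [
    {"image": "some_filename.png", "prefix": "caption en",
     "suffix": "This is an image of a big, white boat traveling in the ocean."},
    {"image": "another_filename.jpg",
     "prefix": "How many people are in the image?", "suffix": "ten"},
]

path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(records, path)
loaded = read_jsonl(path)
print(len(loaded), "records written")
print(detection_prefix(["airplane", "boat"]))
```

Writing each record with `json.dumps` keeps quoting and escaping correct, which is easy to get wrong when the lines are assembled by hand.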

    A complete and clear explanation of how to prepare the data for object detection in Paligemma can be found in this Roboflow post.

    {"image": "filename.jpeg", 
    "prefix": "detect airplane",
    "suffix": " airplane"
    }

    Note that for segmentation, apart from the object’s bounding box coordinates, you need to specify 16 extra segmentation tokens representing a mask that fits within the bounding box. According to Google’s Big Vision repository, these tokens are codewords with 128 entries (…). How do we obtain these values? In my personal experience, it was challenging and frustrating to get them without proper documentation. But I’ll give more details later.

    If you are interested in learning more about Paligemma, I recommend these blogs:

    As mentioned above, Paligemma was trained on different datasets. Therefore, this model is expected to be good at segmenting “traditional” objects such as cars, people, or animals. But what about segmenting objects in satellite images? This question led me to explore Paligemma’s capabilities for segmenting water in satellite images.

    Kaggle’s Satellite Image of Water Bodies dataset is suitable for this purpose. This dataset contains 2841 images with their corresponding masks.

    Here is an example from the water bodies dataset: the RGB image is shown on the left, while the corresponding mask appears on the right.

    Some masks in this dataset were incorrect, and others needed further preprocessing. Faulty examples include masks with all values set to water, while only a small portion of water was present in the original image. Other masks did not correspond to their RGB images. When an image is rotated, some masks make the areas outside the image appear as if they contain water.

    Example of a rotated mask. When reading this image in Python, the area outside the image appears as if it had water. In this case, image rotation is needed to correct this mask. Image made by the author.

    Given these data limitations, I selected a sample of 164 images for which the masks did not have any of the problems mentioned above. This set of images is used to fine-tune Paligemma.
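
A simple automated check can help shortlist masks for manual review before a selection like this one. The sketch below flags masks that are almost entirely water; it is not the procedure used for this post. The function names and the 0.95 threshold are hypothetical, and masks are plain nested lists so the example runs without dependencies (in practice they would usually be NumPy arrays).

```python
def water_fraction(mask):
    """Fraction of pixels marked as water (non-zero) in a 2D mask."""
    total = sum(len(row) for row in mask)
    water = sum(1 for row in mask for value in row if value)
    return water / total if total else 0.0


def is_suspicious(mask, max_fraction=0.95):
    """Flag masks that are almost entirely water so they can be checked by eye."""
    # Assumption: 0.95 is an arbitrary illustrative threshold.
    return water_fraction(mask) > max_fraction


all_water = [[255] * 4 for _ in range(4)]
partial = [
    [0, 0, 255, 255],
    [0, 0, 0, 255],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

print(is_suspicious(all_water), is_suspicious(partial))
```

A check like this only catches the "everything is water" failure; mismatched or rotated masks still need a visual comparison against the RGB image.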

    Preparing the JSONL dataset

    As explained in the previous section, Paligemma needs entries that represent the object’s bounding box coordinates in normalized image-space (…) plus an extra 16 segmentation tokens representing 128 different codewords (…). Obtaining the bounding box coordinates in the desired format was easy, thanks to Roboflow’s explanation. But how do we obtain the 128 codewords from the mask? There was no clear documentation or examples in the Big Vision repository that I could use for my use case. I naively thought that the process of creating the segmentation tokens was similar to that of making the bounding boxes. However, this led to an incorrect representation of the water masks, which led to wrong prediction results.
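
The bounding box part can be sketched as follows: take the extent of the non-zero mask pixels and map each corner coordinate to one of 1024 location bins, in the order y_min, x_min, y_max, x_max. The function names are hypothetical, the exact rounding rule is an assumption, and nested lists stand in for image arrays so the example has no dependencies.

```python
def mask_bounding_box(mask):
    """Return (y_min, x_min, y_max, x_max) of the non-zero pixels, or None if empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None
    # Max coordinates are exclusive, i.e. one past the last water pixel.
    return min(rows), min(cols), max(rows) + 1, max(cols) + 1


def to_loc_token(value, size, bins=1024):
    """Map a pixel coordinate to a <locXXXX> token with `bins` discrete positions."""
    # Assumption: floor to the bin index and clamp to the last bin.
    index = min(int(value / size * bins), bins - 1)
    return f"<loc{index:04d}>"


def bbox_to_loc_tokens(bbox, height, width):
    """Concatenate the four location tokens in y_min, x_min, y_max, x_max order."""
    y_min, x_min, y_max, x_max = bbox
    return (to_loc_token(y_min, height) + to_loc_token(x_min, width)
            + to_loc_token(y_max, height) + to_loc_token(x_max, width))


mask = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
bbox = mask_bounding_box(mask)
print(bbox, bbox_to_loc_tokens(bbox, height=4, width=4))
```

Because the coordinates are normalized by the image size, the same tokens describe the box regardless of the resolution the image is later resized to.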

    By the time I wrote this blog (beginning of December), Google announced the second version of Paligemma. Following this event, Roboflow published a nice overview of preparing data to fine-tune Paligemma2 for different applications, including image segmentation. I use part of their code to finally obtain the correct segmentation codewords. What was my mistake? Well, first of all, the masks need to be resized to a tensor of shape [None, 64, 64, 1], and then a pre-trained variational auto-encoder (VAE) is used to convert the annotation masks into text labels. Although the usage of a VAE model was briefly mentioned in the Big Vision repository, there is no explanation or examples on how to use it.
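
The two steps around the VAE can be sketched without the model itself: cropping the mask to its bounding box and resizing it to 64×64, and then turning the 16 codeword indices the encoder returns into the final suffix. The VAE call is deliberately left out, the codeword indices used below are placeholder values, and the function names are hypothetical. Nearest-neighbor resizing on nested lists is an assumption chosen to keep the example dependency-free.

```python
def crop_and_resize(mask, bbox, size=64):
    """Crop `mask` to `bbox` and resize it to size x size (nearest neighbor)."""
    y_min, x_min, y_max, x_max = bbox
    box_h, box_w = y_max - y_min, x_max - x_min
    resized = []
    for i in range(size):
        src_y = y_min + i * box_h // size
        row = []
        for j in range(size):
            src_x = x_min + j * box_w // size
            row.append(1.0 if mask[src_y][src_x] else 0.0)
        resized.append(row)
    # Before encoding, a batch and a channel axis would be added: [None, 64, 64, 1].
    return resized


def segmentation_suffix(loc_tokens, codeword_indices, label):
    """Join the location tokens, 16 <segXXX> tokens, and the class name."""
    if len(codeword_indices) != 16:
        raise ValueError("expected exactly 16 codeword indices")
    if any(not 0 <= i < 128 for i in codeword_indices):
        raise ValueError("codeword indices must be in [0, 128)")
    seg_tokens = "".join(f"<seg{i:03d}>" for i in codeword_indices)
    return f"{loc_tokens}{seg_tokens} {label}"


mask = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
resized = crop_and_resize(mask, bbox=(1, 2, 3, 4))

placeholder_indices = list(range(16))  # placeholder: real values come from the VAE encoder
suffix = segmentation_suffix("<loc0256><loc0512><loc0768><loc1023>",
                             placeholder_indices, "water")
print(len(resized), len(resized[0]))
print(suffix)
```

Validating the number and range of indices before writing the suffix catches malformed training labels early, instead of at prediction time.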

    The workflow I use to prepare the data to fine-tune Paligemma is shown below:

    Steps to convert one original mask from the filtered water bodies dataset to a JSON object. This process is repeated over the 164 images of the train set and the 21 images of the test dataset to build the JSONL dataset.

    As observed, the number of steps needed to prepare the data for Paligemma is large, so I don’t reproduce the full code here. However, if you want to explore the code, you can visit this GitHub repository. The script convert.py has all the steps mentioned in the workflow shown above. I also added the selected images so you can play with this script immediately.

    When decoding the segmentation codewords back into segmentation masks, we can see how these masks cover the water bodies in the images:

    Resulting masks when decoding the segmentation codewords in the train set. Image made by the author using this Notebook.

    Before fine-tuning Paligemma, I tried its segmentation capabilities on the models uploaded to Hugging Face. This platform has a demo where you can upload images and interact with different Paligemma models.

    Default Paligemma model at segmenting water in satellite images.

    The current version of Paligemma is generally good at segmenting water in satellite images, but it’s not perfect. Let’s see if we can improve these results!

    There are two ways to fine-tune Paligemma, either through Hugging Face’s Transformers library or by using Big Vision and JAX. I went for this last option. Big Vision provides a Colab notebook, which I modified for my use case. You can open it by going to my GitHub repository:

    I used a batch size of 8 and a learning rate of 0.003. I ran the training loop twice, which translates to 158 training steps. The total running time using a T4 GPU machine was 24 minutes.

    The results were not as expected. Paligemma did not produce predictions in some images, and in others, the resulting masks were far from the ground truth. I also obtained segmentation codewords with more than 16 tokens in two images.
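
Malformed predictions like these can be detected automatically before any attempt to decode them into masks. The sketch below parses a model output string with regular expressions and checks that it holds exactly 4 location tokens and 16 segmentation tokens. It assumes a single object per output, and the function names are hypothetical.

```python
import re

LOC_RE = re.compile(r"<loc(\d{4})>")
SEG_RE = re.compile(r"<seg(\d{3})>")


def parse_segmentation_output(text):
    """Extract location and segmentation token values from a model output string."""
    locs = [int(v) for v in LOC_RE.findall(text)]
    segs = [int(v) for v in SEG_RE.findall(text)]
    return locs, segs


def is_well_formed(text):
    """True when the output holds exactly 4 location and 16 segmentation tokens."""
    locs, segs = parse_segmentation_output(text)
    return len(locs) == 4 and len(segs) == 16 and all(s < 128 for s in segs)


good = ("<loc0256><loc0512><loc0768><loc1023>"
        + "".join(f"<seg{i:03d}>" for i in range(16)) + " water")
too_long = good.replace(" water", "<seg099> water")  # 17 segmentation tokens

print(is_well_formed(good), is_well_formed(too_long), is_well_formed(""))
```

Counting how many test images fail this check gives a quick, mask-independent indicator of whether a fine-tuning run has degraded the output format.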

    Results of the fine-tuning where there were predictions. Image made by the author.

    It’s worth mentioning that I use the first Paligemma version. Perhaps the results would improve when using Paligemma2 or by tweaking the batch size or learning rate further. In any case, these experiments are out of the scope of this blog.

    The demo results show that the default Paligemma model is better at segmenting water than my fine-tuned model. In my opinion, UNET is a better architecture if the aim is to build a model specialized in segmenting objects. For more information on how to train such a model, you can read my previous blog post:

    Other limitations:

    I want to mention some other challenges I encountered when fine-tuning Paligemma using Big Vision and JAX.

    • Setting up different model configurations is difficult because there’s still little documentation on these parameters.
    • The first version of Paligemma has been trained to handle images of different aspect ratios resized to 224×224. Make sure to resize your input images to this size only. This prevents exceptions from being raised.
    • When fine-tuning with Big Vision and JAX, you might run into JAX GPU-related problems. Ways to overcome this issue are:

    a. Reducing the samples in your training and validation datasets.

    b. Increasing the batch size from 8 to 16 or higher.

    • The fine-tuned model has a size of ~ 5GB. Make sure to have enough space in your Drive to store it.

    Discovering a new AI model is exciting, especially in this age of multimodal algorithms transforming our society. However, working with state-of-the-art models can sometimes be challenging due to the lack of available documentation. Therefore, the launch of a new AI model should be accompanied by comprehensive documentation to ensure its smooth and widespread adoption, especially among professionals who are still inexperienced in this area.

    Despite the difficulties I encountered fine-tuning Paligemma, the current pre-trained models are powerful at doing zero-shot object detection and image segmentation, which can be used for many applications, including assisted ML labeling.

    Are you using Paligemma in your Computer Vision projects? Share your experience fine-tuning this model in the comments!

    I hope you enjoyed this post. Once more, thanks for reading!

    You can contact me via LinkedIn at:

    https://www.linkedin.com/in/camartinezbarbosa/

    Acknowledgments: I want to thank José Celis-Gil for all the fruitful discussions on data preprocessing and modeling.


