In my previous approach for SmartScan, I designed it primarily for image organization using a classification model. While the app functioned as intended, the rigid nature of a classification model ultimately became a limitation. This approach required users to train it to accurately categorize their images, making it less flexible when dealing with the variety of user photos. In contrast, embedding models generate feature vectors, allowing us to compute cosine similarity between images and category representations, an approach that adapts far more gracefully to diverse inputs. Although frameworks like ONNX and LiteRT support on-device training, implementing this would have added a layer of complexity that could negatively impact the usability of the app.
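For readers unfamiliar with the metric, cosine similarity simply measures the angle between two feature vectors. A minimal NumPy sketch, illustrative rather than the app's actual code:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(a, b) = (a . b) / (|a| * |b|): 1 means same direction,
    # 0 means unrelated, -1 means opposite.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```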
By transitioning to an embedding-based approach, the app not only improved its image organization capabilities but also gained a powerful text-to-image search feature, enabling users to find images using natural language queries.
Initially, I used CLIP embedding models (image and text) for a zero-shot classification approach, computing the cosine similarity between the textual representation of folder names and the embeddings of new images. While this allowed for dynamic categorization, the method was not always accurate because a folder name alone didn't capture the full variability of its contents. For example, during testing with Twitter and Reddit screenshots, the model frequently misclassified Twitter screenshots as Reddit, with the cosine similarities between the two often being extremely close.
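A sketch of that zero-shot step, assuming `encode_image` and `encode_text` are hypothetical helpers wrapping the CLIP image and text encoders:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_zero_shot(image_path: str, folder_names: list[str],
                       encode_image, encode_text) -> str:
    # Compare the image embedding against the embedding of each
    # folder's name and return the closest folder.
    img = encode_image(image_path)
    return max(folder_names, key=lambda name: cosine(img, encode_text(name)))
```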
To improve accuracy, I introduced a few-shot learning strategy using prototype embeddings. With this new approach, each destination folder is represented by a prototype embedding: the average of all image embeddings within that folder. New images are then compared to these prototype embeddings, leading to much better matching accuracy.
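Under the same assumptions as above, the prototype approach replaces the folder-name embedding with the mean of the folder's own image embeddings:

```python
import numpy as np

def build_prototype(image_embeddings: list[np.ndarray]) -> np.ndarray:
    # Prototype = average of the folder's image embeddings,
    # re-normalized so dot products behave like cosine similarity.
    proto = np.mean(image_embeddings, axis=0)
    return proto / np.linalg.norm(proto)

def classify_few_shot(image_embedding: np.ndarray,
                      prototypes: dict[str, np.ndarray]) -> str:
    # Match a new image against each folder's prototype.
    img = image_embedding / np.linalg.norm(image_embedding)
    return max(prototypes, key=lambda name: float(img @ prototypes[name]))
```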
Another key reason for switching to an embedding model was its suitability for implementing a text-to-image search feature, which has now been added. Embedding models naturally map images and text into a shared feature space, making similarity comparison seamless and intuitive for users searching their gallery with textual queries.
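Search then falls out almost for free: embed the query with the text encoder and rank the gallery's precomputed image embeddings by similarity. Again a sketch, with `encode_text` standing in for the real encoder:

```python
import numpy as np

def search_gallery(query: str, gallery: dict[str, np.ndarray],
                   encode_text, top_k: int = 10) -> list[str]:
    # Rank image paths by cosine similarity to the text query embedding.
    q = encode_text(query)
    q = q / np.linalg.norm(q)
    ranked = sorted(
        gallery,
        key=lambda p: float(q @ (gallery[p] / np.linalg.norm(gallery[p]))),
        reverse=True,
    )
    return ranked[:top_k]
```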
Overall, using the embedding approach for classification offers several significant advantages:
- Flexibility: Embedding models easily adapt to various user-defined categories without the constraints of discrete labels.
- Dynamic User Control: Users can seamlessly add or modify categories without retraining a complex classification model.
- Enhanced Scalability: The prototype embedding strategy scales gracefully as new image types and categories are added.
To enable efficient use of the CLIP models on user devices, I carried out the following steps:
- Conversion to ONNX: I converted the CLIP visual and text encoder models to the ONNX format using ONNX Runtime, ensuring compatibility with mobile environments.
- Model Quantization: Both models were quantized, reducing their sizes by roughly 4x: the image encoder shrank from 351.6MB to 95.6MB and the text encoder from 254.2MB to 64.4MB. This quantization not only minimizes the APK size but also improves inference performance on mobile devices. A rough sketch of this step follows the list.
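The exact scripts aren't shown here, but with ONNX Runtime's quantization tooling the size-reduction step might look roughly like this; the file names and the choice of dynamic uint8 weight quantization are my assumptions:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization stores the weights as 8-bit integers,
# which is what yields the roughly 4x reduction in file size.
# File names are illustrative, not the ones used in the app.
for src, dst in [
    ("clip_image_encoder.onnx", "clip_image_encoder.quant.onnx"),
    ("clip_text_encoder.onnx", "clip_text_encoder.quant.onnx"),
]:
    quantize_dynamic(src, dst, weight_type=QuantType.QUInt8)
```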
Switching to an embedding model has significantly enhanced the app's flexibility and overall user experience. The initial zero-shot classification approach using embeddings served as a useful stepping stone. However, by transitioning to the few-shot learning approach with prototype embeddings, the categorization process became far more accurate and robust. Moreover, the embedding approach now fully supports the built-in text-to-image search feature, providing a natural and powerful way to explore image collections through simple text queries.
Reflecting on my shift from zero-shot classification to few-shot learning, I couldn't help but consider the debate on AI replacing software developers. Not to toot my own horn, but this exemplifies why that won't happen entirely. While AI will dominate code generation, the complexities of system design, debugging, and problem-solving will always need human insight.
For those interested in exploring the app further, it's open source. You can download it and check out the GitHub repository here. If you find it useful or interesting, support it by giving it a star! It should also be released on F-Droid sometime this week.