---
title: Multimodal Product Classification
emoji: 🏷️
colorFrom: purple
colorTo: yellow
sdk: gradio
sdk_version: 5.44.0
app_file: app.py
pinned: true
license: mit
short_description: Product classification using image and text
---
# 🏷️ Multimodal Product Classification with Gradio
## Table of Contents

1. [Project Description](#1-project-description)
2. [Methodology & Key Features](#2-methodology--key-features)
3. [Technology Stack](#3-technology-stack)
4. [Model Details](#4-model-details)
## 1. Project Description

This project implements a **multimodal product classification system** for Best Buy products. The core objective is to categorize products using both their text descriptions and images. The system was trained on a dataset of **almost 50,000** items.

The entire system is deployed as a lightweight web application using **Gradio**. The app allows users to:

- Use both text and an image for the most accurate prediction.
- Run predictions using only text or only an image to understand the contribution of each modality.

This project showcases the power of combining different data types to build a more robust and intelligent classification system.
> [!IMPORTANT]
>
> - Check out the deployed app here: 🏷️ [Multimodal Product Classification App](https://huggingface.co/spaces/iBrokeTheCode/Multimodal_Product_Classification) 🏷️
> - Check out the Jupyter Notebook for a detailed walkthrough of the project here: 🏷️ [Jupyter Notebook](https://huggingface.co/spaces/iBrokeTheCode/Multimodal_Product_Classification/blob/main/notebook_guide.ipynb) 🏷️

![Demo](./static/app-demo.gif)
## 2. Methodology & Key Features

- **Core Task:** Multimodal product classification on a Best Buy dataset.
- **Pipeline:**
  - **Data:** A dataset of ~50,000 products, each with a text description and an image.
  - **Feature Extraction:** Pre-trained models convert raw text and image data into high-dimensional embedding vectors.
  - **Classification:** A custom-trained **Multilayer Perceptron (MLP)** performs the final classification based on the embeddings.
- **Key Features:**
  - **Multimodal:** Combines text and image data for a more accurate prediction.
  - **Single-Service Deployment:** The entire application runs as a single, deployable Gradio app.
  - **Flexible Inputs:** The app supports multimodal, text-only, and image-only prediction modes.
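The classification step of the pipeline can be sketched as a small Keras MLP over concatenated embeddings. Random vectors stand in for the real embeddings here; the 384/768 dimensions match the text and image encoders used in this project, while the layer sizes and the 20-class count are illustrative assumptions, not the trained model's actual architecture.

```python
import numpy as np
import tensorflow as tf

TEXT_DIM, IMAGE_DIM, NUM_CLASSES = 384, 768, 20  # class count is an assumption

# Random vectors stand in for the real text/image embeddings.
rng = np.random.default_rng(0)
X = rng.random((256, TEXT_DIM + IMAGE_DIM)).astype("float32")
y = rng.integers(0, NUM_CLASSES, size=256)

# A simple MLP head over the concatenated embedding vector.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(TEXT_DIM + IMAGE_DIM,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(X, y, epochs=1, verbose=0)

# Each prediction is a probability distribution over the categories.
probs = model.predict(X[:1], verbose=0)
```

Because the encoders stay frozen, only this small head needs training, which keeps both training and deployment lightweight.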
## 3. Technology Stack

This project was built using the following technologies:

**Deployment & Hosting:**

- [Gradio](https://gradio.app/) – interactive web app frontend.
- [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces) – for cost-effective deployment.

**Modeling & Training:**

- [TensorFlow / Keras](https://www.tensorflow.org/) – used to train the final MLP classification model.
- [Sentence-Transformers](https://www.sbert.net/) – for generating text embeddings.
- [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) – for the image feature extractor (`TFConvNextV2Model`).

**Development Tools:**

- [Ruff](https://github.com/charliermarsh/ruff) – Python linter and formatter.
- [uv](https://github.com/astral-sh/uv) – fast Python package installer and resolver.
## 4. Model Details

The final classification is performed by a custom-trained **Multilayer Perceptron (MLP)** model that takes the extracted embeddings as input.

- **Text Embedding Model:** `SentenceTransformer` (`all-MiniLM-L6-v2`)
- **Image Embedding Model:** `TFConvNextV2Model` (`convnextv2-tiny-22k-224`)
- **Classifier:** A custom MLP model trained on top of the embeddings.
- **Classes:** The model classifies products into a set of specific Best Buy product categories.
| Model               | Modality     | Accuracy | Macro Avg F1-Score | Weighted Avg F1-Score |
| :------------------ | :----------- | :------- | :----------------- | :-------------------- |
| Random Forest       | Text         | 0.90     | 0.83               | 0.90                  |
| Logistic Regression | Text         | 0.90     | 0.84               | 0.90                  |
| Random Forest       | Image        | 0.80     | 0.70               | 0.79                  |
| Random Forest       | Combined     | 0.89     | 0.79               | 0.89                  |
| Logistic Regression | Combined     | 0.89     | 0.83               | 0.89                  |
| **MLP**             | **Image**    | **0.84** | **0.77**           | **0.84**              |
| **MLP**             | **Text**     | **0.92** | **0.87**           | **0.92**              |
| **MLP**             | **Combined** | **0.92** | **0.85**           | **0.92**              |
> [!TIP]
>
> On the held-out test set, the multimodal MLP achieved an excellent **92% accuracy** and **92% weighted F1-score**, matching the strongest text-only model while leveraging both text and image data.