---
datasets:
- GetSoloTech/FoodStack
language:
- en
base_model:
- lerobot/smolvla_base
library_name: transformers
tags:
- Robotics
- Lerobot
- Food
- PickPlace
- VLA
- SmolVLA
- PhysicalAI
---

### SmolVLA Fine-Tuned for Food Stacking

**Summary**: A fine-tuned version of `lerobot/smolvla_base` for stacking food objects (e.g., burgers, sandwiches), trained on the `GetSoloTech/FoodStack` dataset with the LeRobot framework.
### Model details
- **Base model**: `lerobot/smolvla_base`
- **Task**: Vision-Language-Action control for manipulation (stacking)
- **Domain**: Food item stacking (burger, sandwich, etc.)
- **Params**: ~450M (SmolVLA)
- **Library**: LeRobot (`lerobot`)

### Quick start
Install LeRobot with SmolVLA extras:

```bash
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[smolvla]"
```

Load the policy from this repo and run inference:

```python
import torch

from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Replace with your actual model ID on the Hub
model_id = "GetSoloTech/SmolVLA-FoodStack"

policy = SmolVLAPolicy.from_pretrained(model_id)
policy.eval()

# Example placeholders for a single observation. Key names and tensor shapes
# must match the features the policy was trained with.
observation = {
    "observation.images.top": ...,  # camera frame, e.g. (1, 3, H, W) float tensor
    "observation.state": ...,       # proprioceptive state, e.g. (1, state_dim)
    "task": "Stack the burger: bun, patty, cheese, lettuce, bun.",  # language instruction
}

# Depending on your pipeline, you may wrap this in your control loop
with torch.no_grad():
    action = policy.select_action(observation)

# Send the action to your robot controller
# send_action_to_robot(action)
```

For end-to-end examples (policy loops, camera/robot IO), see the LeRobot docs and examples.
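
The sketch below illustrates one way to call the policy inside a fixed-rate control loop, assuming `policy` was loaded as above. `get_observation` and `send_action_to_robot` are hypothetical placeholders for your camera/robot IO; the LeRobot examples show how to wire these up with real hardware.

```python
import time

import torch

# Hypothetical placeholders for your camera/robot IO; replace with your own
# capture and actuation code.
def get_observation() -> dict:
    ...

def send_action_to_robot(action: torch.Tensor) -> None:
    ...

FPS = 30  # control rate; match the rate the dataset was recorded at
task = "Stack the burger: bun, patty, cheese, lettuce, bun."

policy.reset()  # clear any cached action chunks before starting an episode

for _ in range(10 * FPS):  # run for roughly 10 seconds
    start = time.perf_counter()

    observation = get_observation()
    observation["task"] = task  # attach the language instruction

    with torch.no_grad():
        action = policy.select_action(observation)

    send_action_to_robot(action)

    # Sleep the remainder of the control period to keep a fixed rate
    time.sleep(max(0.0, 1 / FPS - (time.perf_counter() - start)))
```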

Notes:
- Tune batch size/steps and augmentation to your hardware and dataset split (an example training command is sketched below).
- Make sure the observation preprocessing used at inference matches what was used at train time.
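
For reference, LeRobot fine-tuning is typically launched through its training script. The command below is a sketch, assuming the `lerobot/scripts/train.py` CLI from current LeRobot releases; the batch size, step count, output directory, and job name are illustrative, not the exact settings used to produce this checkpoint.

```bash
python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=GetSoloTech/FoodStack \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/smolvla_foodstack \
  --job_name=smolvla_foodstack \
  --policy.device=cuda \
  --wandb.enable=false
```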

### Limitations
- Specializes in food stacking; may not generalize to unseen objects/layouts.
- Sensitive to perception domain shift (lighting, textures, camera intrinsics).
- Requires correct observation normalization consistent with training.

### Dataset
- **Training data**: `GetSoloTech/FoodStack`
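
To inspect the training data locally, the dataset can be loaded with LeRobot's `LeRobotDataset` class. A minimal sketch, assuming the import path used by the LeRobot releases referenced above:

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Downloads (or reuses the cached copy of) the dataset from the Hub
dataset = LeRobotDataset("GetSoloTech/FoodStack")

print(dataset.meta)   # dataset metadata (features, fps, episode counts)
print(len(dataset))   # total number of frames

sample = dataset[0]   # a single frame as a dict of tensors
print(sample.keys())  # e.g. observation.* keys, action, task metadata
```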

### Resources and references
- SmolVLA base: https://huggingface.co/lerobot/smolvla_base
- SmolVLA overview: https://smolvla.net/index_en.html
- LeRobot: https://github.com/huggingface/lerobot