| language: en | |
| license: mit | |
| tags: | |
| - computer-vision | |
| - object-detection | |
| - yolov8 | |
| - gesture-recognition | |
| - gaming | |
| pipeline_tag: object-detection | |
| library_name: ultralytics | |
| # Model Description | |
| ### Overview | |
| This model detects hand gestures for use as input controls for video games. It uses object detection to recognize specific hand poses from a webcam or standard camera and translate them into game actions. | |
| The goal of the project is to explore whether computer vision–based gesture recognition can provide a low-cost and accessible alternative to traditional game controllers. | |
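The step from detection to game input can be sketched as follows. This is a minimal illustration, not the author's implementation: the class names are assumed to match the class table in this card, the key bindings are hypothetical, and `detections_from_frame` expects an already loaded Ultralytics `YOLO` model.

```python
# Class names are assumed to match this card's class table.
# The key bindings are hypothetical and only illustrate the mapping step.
GESTURE_TO_KEY = {
    "Forward": "w",
    "Backward": "s",
    "Jump": "space",
    "Attack": "j",
}


def pick_action(detections, min_conf=0.5):
    """Return the key bound to the most confident detection, or None.

    `detections` is a list of (class_name, confidence) pairs.
    """
    candidates = [
        (conf, name)
        for name, conf in detections
        if conf >= min_conf and name in GESTURE_TO_KEY
    ]
    if not candidates:
        return None
    _, best_name = max(candidates)
    return GESTURE_TO_KEY[best_name]


def detections_from_frame(model, frame):
    """Run an Ultralytics YOLO model on one frame and list its detections."""
    result = model.predict(source=frame, verbose=False)[0]
    return [
        (result.names[int(cls)], float(conf))
        for cls, conf in zip(result.boxes.cls, result.boxes.conf)
    ]
```

Filtering on a minimum confidence keeps low-confidence detections from triggering inputs; the 0.5 default is an arbitrary starting point, not a tuned value.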
| ### Training Approach | |
| The model was trained using the nano version of YOLOv8 (YOLOv8n) through the Ultralytics training framework. | |
| Training started from pretrained YOLOv8n weights, which were then fine-tuned on a custom hand gesture dataset. | |
| ### Intended Use Cases | |
| * Gesture-controlled video games with simple control schemes | |
| * Touchless interfaces | |
| * Interactive displays | |
| * Public kiosks | |
| * Smart home media controls | |
| * Desktop navigation | |
| *** | |
| # Training Data | |
| ### Dataset Sources | |
| **The training dataset was constructed from two sources:** | |
| Rock-Paper-Scissors dataset | |
| * Source: Roboflow Universe | |
| * Creator: Audrey | |
| * Used for the first three gesture classes | |
| * Dataset URL: https://universe.roboflow.com/audrey-x3i6m/rps-knmjj | |
| Custom gesture dataset | |
| * Created by recording a 30-second video of the author performing gestures | |
| * Video parsed into frames at 10 frames per second | |
| * Images manually selected and annotated | |
| ### Dataset Size | |
| | Property | Value | | |
| | ---------------- | --------- | | |
| | Original Images | 444 | | |
| | Augmented Images | 1066 | | |
| | Image Resolution | 512 × 512 | | |
| ### Class Distribution | |
| | Class | Gesture | Annotation Count | | |
| | -------- | ----------- | ---------------- | | |
| | Forward | Open Palm | 169 | | |
| | Backward | Closed Fist | 210 | | |
| | Jump | Peace Sign | 187 | | |
| | Attack | Thumbs Up | 121 | | |
| ### Data Collection Methodology | |
| The dataset combines stock gesture images with a custom dataset created from recorded video frames. | |
| **The custom dataset was generated by:** | |
| * Recording a short gesture demonstration video | |
| * Extracting frames at 10 FPS | |
| * Selecting usable frames | |
| * Annotating gesture bounding boxes | |
| This process produced 236 custom images, which were merged with the stock dataset. | |
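The frame-extraction step could look like the sketch below. It assumes OpenCV for video decoding and a 30 FPS source video in the usage example; neither is stated in this card, and the function names and output file pattern are illustrative.

```python
import math
from pathlib import Path


def frames_to_keep(total_frames, source_fps, target_fps=10.0):
    """Indices of the frames kept when sampling a video at target_fps."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Number of source frames per kept frame; never sample faster than the source.
    step = max(source_fps / target_fps, 1.0)
    count = math.ceil(total_frames / step)
    return [int(i * step) for i in range(count)]


def extract_frames(video_path, out_dir, target_fps=10.0):
    """Write the sampled frames of a video to out_dir as JPEG files."""
    import cv2  # imported lazily so frames_to_keep works without OpenCV

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    source_fps = cap.get(cv2.CAP_PROP_FPS)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    keep = set(frames_to_keep(total, source_fps, target_fps))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index in keep:
            cv2.imwrite(str(out / f"frame_{index:05d}.jpg"), frame)
        index += 1
    cap.release()
```

For a 30-second clip recorded at an assumed 30 FPS, `frames_to_keep(900, 30, 10)` yields 300 candidate frames, from which usable frames are then selected by hand.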
| ### Annotation Process | |
| All annotations were created manually using Roboflow. | |
| Bounding boxes were drawn around the visible hand gesture in each image. | |
| Because the annotation metadata from the original dataset could not be imported, all 444 images were annotated manually. | |
| Estimated annotation time: 2–3 hours | |
| ### Train / Validation / Test Split | |
| | Dataset Split | Image Count | | |
| | ------------- | ----------- | | |
| | Training | 933 | | |
| | Validation | 88 | | |
| | Test | 45 | | |
| ### Data Augmentation | |
| **The following augmentations were applied:** | |
| * Rotation: ±15 degrees | |
| * Saturation adjustment: ±30% | |
| *These augmentations expanded the dataset from 444 to 1066 images.* | |
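The image counts can be cross-checked with a little arithmetic. This card does not say which splits were augmented; the check below tests the hypothesis that only the training split was augmented, with three output images per original. That hypothesis is an inference from the numbers, not a statement from the author.

```python
original_images = 444
split_counts = {"train": 933, "valid": 88, "test": 45}

# Total after augmentation, as reported in the dataset size table.
total_after_augmentation = sum(split_counts.values())

# Hypothesis: each original training image yields 3 images after augmentation,
# while validation and test images are left untouched.
outputs_per_training_image = 3
original_train = split_counts["train"] // outputs_per_training_image
reconstructed_original = original_train + split_counts["valid"] + split_counts["test"]

# Share of each split in the augmented dataset, in percent.
split_share = {
    name: round(100 * count / total_after_augmentation, 1)
    for name, count in split_counts.items()
}

print(total_after_augmentation, reconstructed_original, split_share)
```

Under this hypothesis, 311 original training images plus 88 validation and 45 test images add up to the 444 originals, and the augmented splits are roughly 87.5% / 8.3% / 4.2%.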
| ### Dataset Availability | |
| The dataset is available on Roboflow Universe: https://universe.roboflow.com/b-data-497-ws/hand-gesture-controls | |
| ### Known Dataset Biases and Limitations | |
| * Small dataset size | |
| * Class imbalance (thumbs-up has fewer examples) | |
| * Mixed image quality between stock and custom images | |
| * Limited diversity in backgrounds and lighting conditions | |
| * Limited number of subjects (primarily one person) | |
| *These factors may affect model generalization.* | |
| *** | |
| # Training Procedure | |
| ### Framework | |
| Training was performed in Google Colab. The training code was adapted for YOLOv8n from a YOLOv11 training example, available [here](https://oceancv.org/book/TrainandDeployObj_YOLO.html). | |
| ### Model Architecture | |
| Base model: YOLOv8n (Nano) | |
| **Reasons for selection:** | |
| * Lightweight architecture | |
| * Low inference latency | |
| * Lower hardware requirements | |
| * Faster training times | |
| * Suitable for real-time applications | |
| ### Training Configuration | |
| | Parameter | Value | | |
| | ----------------------- | ---------------------------- | | |
| | Epochs | 200 (training stopped early) | | |
| | Early stopping patience | 10 | | |
| | Image size | 512 × 512 | | |
| | Batch size | 64 | | |
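With the Ultralytics Python API, the configuration above could be expressed roughly as follows. This is a sketch rather than the author's training script: `data.yaml` is a hypothetical path to the dataset definition exported from Roboflow, and all settings not listed in the table are left at the library defaults.

```python
# Values taken from the training configuration table.
TRAIN_SETTINGS = {
    "epochs": 200,   # upper bound; early stopping ended the run sooner
    "patience": 10,  # epochs without improvement before stopping
    "imgsz": 512,
    "batch": 64,
}


def train(data_yaml="data.yaml", weights="yolov8n.pt"):
    """Fine-tune pretrained YOLOv8n weights with the settings above."""
    from ultralytics import YOLO  # imported lazily; requires `pip install ultralytics`

    model = YOLO(weights)
    return model.train(data=data_yaml, **TRAIN_SETTINGS)
```

Calling `train()` starts the run and writes weights and metric plots to the Ultralytics run directory.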
| ### Training Hardware | |
| | Component | Specification | | |
| | ------------- | ---------------- | | |
| | GPU | A100 (High-RAM) | | |
| | VRAM | 80 GB | | |
| | Training Time | ~40 minutes | | |
| ### Preprocessing Steps | |
| * Images resized to 512×512 | |
| * Bounding box annotations normalized | |
| * Augmented images generated before training | |
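Bounding box normalization, as used by the YOLO label format, can be sketched as below. The card does not spell out the format, so this assumes the usual Ultralytics convention of center coordinates and box size expressed as fractions of the image dimensions; the function name is illustrative.

```python
def to_yolo_box(x_min, y_min, x_max, y_max, img_w=512, img_h=512):
    """Convert a pixel-space box to normalized (x_center, y_center, width, height)."""
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return x_center, y_center, width, height


# A 256 x 256 pixel box centered in a 512 x 512 image.
print(to_yolo_box(128, 128, 384, 384))  # (0.5, 0.5, 0.5, 0.5)
```

Normalized coordinates stay valid when an image is resized, which is why labels do not need to be rewritten after the 512×512 resize.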
| *** | |
| # Evaluation Results | |
| ### Overall Metrics | |
| **Final model performance at epoch 41:** | |
| | Metric | Score | | |
| | --------- | ----- | | |
| | mAP@50 | 0.97 | | |
| | mAP@50–95 | 0.78 | | |
| | Precision | 0.93 | | |
| | Recall | 0.91 | | |
| | F1 Score | 0.94 | | |
| *These results exceed the predefined project success criteria.* | |
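As a quick consistency check, F1 is the harmonic mean of precision and recall. For the reported precision and recall it comes out at about 0.92, slightly below the 0.94 in the table. One possible explanation, which is an assumption and not confirmed by this card, is that the 0.94 was read from the peak of the F1 curve, which averages per-class F1 values rather than combining the averaged precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the metrics table.
print(round(f1_score(0.93, 0.91), 2))  # 0.92
```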
| **Per-Class Performance** | |
| <img alt= "Per-Class Performance" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/perclass_perf.png" width="1000" height="180"></img> | |
| **Sample Class Images** | |
| <img alt= "Sample Images" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/sample_images.png" width="1100" height="700"></img> | |
| ### Key Visualizations | |
| <img alt= "Confusion Matrix" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/confusion_matrix_normalized.png" width="1100" height="700"></img> | |
| <img alt= "F1 Curve" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/BoxF1_curve.png" width="1100" height="700"></img> | |
| <img alt= "Precision-Recall Curve" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/BoxPR_curve.png" width="1100" height="700"></img> | |
| ### Performance Analysis | |
| The model achieved high precision and recall across all gesture classes, indicating strong detection performance on the test set. | |
| Several factors contributed to this performance: | |
| * A small number of distinct gesture classes | |
| * Highly visible and consistent hand poses | |
| * A roughly balanced dataset for three of the four classes (Attack / Thumbs Up has fewer annotations) | |
| However, the dataset size is relatively small, which may inflate evaluation scores and limit generalization. | |
| Failure cases were observed in several situations: | |
| * Complex or cluttered backgrounds | |
| * Low confidence detections | |
| * Ambiguous or blurred gesture poses | |
| These issues highlight areas where the model could be improved with more diverse training data. | |
| *** | |
| # Limitations and Biases | |
| ### Known Failure Cases | |
| <img alt= "Failure Cases" src="https://huggingface.co/cvtechniques/VideoGameHandGestures/resolve/main/failure_cases.png" width="1100" height="700"></img> | |
| The model struggled with some of the photos from the Rock-Paper-Scissors (RPS) dataset, because these images contain complex backgrounds, partially occluded hands, or ambiguous gestures. | |
| ### Data Biases | |
| Potential biases include: | |
| * limited subject diversity | |
| * similar backgrounds across many images | |
| * dataset partially composed of stock imagery | |
| * limited environmental variability | |
| ### Environmental Limitations | |
| Model performance may degrade when: | |
| * lighting conditions vary significantly | |
| * gestures are performed at unusual angles | |
| * hands are partially occluded | |
| * gestures appear at extreme scales or distances | |
| ### Inappropriate Use Cases | |
| This model should not be used for: | |
| * complex gesture recognition (for example, 3D control schemes) | |
| * sign language recognition | |
| * high-precision human-computer interaction systems | |
| * any safety-critical applications | |
| ### Sample Size Limitations | |
| The dataset is relatively small for object detection training, which may limit generalization to new users or environments. | |
| The most likely path to improving the model is a larger and more diverse dataset. The best course of action would be to drop the stock image dataset and compile gesture videos recorded with diverse individuals, backgrounds, and lighting conditions. | |
| *** | |
| # Future Work | |
| Potential improvements include: | |
| * collecting a larger and more diverse gesture dataset | |
| * increasing the number of gesture classes | |
| * improving image quality and environmental diversity | |
| * exploring hand keypoint detection models instead of object detection | |
| Keypoint estimation could allow detection of more complex hand gestures and improve gesture recognition accuracy. | |