01 · Problem The Challenge
Sign language translation systems face a two-fold challenge: spatial complexity (recognizing exact hand shapes and body posture) and temporal dependency (understanding how these shapes change sequentially over time to form words).
While massive transformer models or full-frame 3D-CNNs can solve this, they require expensive GPUs. For an educational tool aimed at the general public in developing nations, the key constraint is running inference directly on older laptops or mobile devices without thermal throttling or perceived lag.
02 · Solution Methodology & Architecture
To work within these constraints, the project used an experimental hybrid CNN-LSTM architecture optimized for edge devices.
The system used MobileNetV2 to extract spatial features from each frame and passed them to an LSTM network to capture sequential gesture dynamics. The model was trained with L2 regularization and then compressed with post-training quantization, which converts 32-bit weights into 8-bit integers (about a quarter of the original weight storage), to reduce the memory footprint without sacrificing accuracy.
1. Dataset Prep
A constrained dataset of low-resolution Indonesian Sign Language images was manually labeled, resized, normalized, and augmented to simulate real-world variability.
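The preparation steps above can be sketched without any dependencies. This is only an illustration: the source does not say which tools, target resolution, or augmentations were used, so the function names, the 4x4 target size, and the brightness shift below are all assumptions.

```python
import random


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of a 2-D grayscale image (list of rows)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalize(image):
    """Scale 8-bit pixel values into the range [0.0, 1.0]."""
    return [[p / 255.0 for p in row] for row in image]


def augment(image, rng, max_shift=0.1):
    """Assumed augmentation: one random brightness shift, clipped to [0, 1]."""
    shift = rng.uniform(-max_shift, max_shift)
    return [[min(1.0, max(0.0, p + shift)) for p in row] for row in image]


def prepare(image, size=4, rng=None):
    """Resize, normalize, then augment a single image."""
    rng = rng or random.Random(0)
    return augment(normalize(resize_nearest(image, size, size)), rng)


# Hypothetical 2x2 input standing in for a labelled sign image.
sample = prepare([[0, 255], [255, 0]], size=4)
```

A real pipeline would normally use an image library for resizing and a richer set of augmentations; the point here is only the order of operations.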
2. Feature Extraction
MobileNetV2 extracts spatial features from each frame; the resulting feature sequence is then passed to an LSTM network, which captures sequential gesture dynamics.
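To make the temporal half of this step concrete, here is a minimal pure-Python sketch of an LSTM recurrence over per-frame feature vectors. It is not the project's model: the feature vectors are made-up stand-ins for MobileNetV2 outputs, the weights are random rather than trained, and every name and size is an assumption chosen for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def init_params(input_size, hidden_size, seed=0):
    """Random (untrained) weights for the four LSTM gates."""
    rng = random.Random(seed)

    def gate():
        width = input_size + hidden_size
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                   for _ in range(hidden_size)]
        return weights, [0.0] * hidden_size

    return {name: gate() for name in ("i", "f", "o", "g")}


def lstm_step(x, h, c, params):
    """One time step: x is a frame's features, h and c the carried state."""
    z = list(x) + list(h)  # concatenate input with previous hidden state
    gates = {}
    for name, (weights, bias) in params.items():
        pre = [p + b for p, b in zip(matvec(weights, z), bias)]
        act = math.tanh if name == "g" else sigmoid
        gates[name] = [act(p) for p in pre]
    # Forget part of the old cell state, then add the gated candidate.
    c_new = [fg * cp + ig * gg for fg, cp, ig, gg
             in zip(gates["f"], c, gates["i"], gates["g"])]
    h_new = [og * math.tanh(cv) for og, cv in zip(gates["o"], c_new)]
    return h_new, c_new


def encode_sequence(frame_features, params, hidden_size):
    """Run the LSTM over all frames and return the final hidden state."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for x in frame_features:
        h, c = lstm_step(x, h, c, params)
    return h
```

Because the state is carried from frame to frame, feeding the same frames in a different order produces a different final vector, which is what lets a classifier on top distinguish gestures that share hand shapes but differ in motion.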
3. Compression
The model was trained with L2 regularization and then compressed with post-training quantization to shrink its memory footprint without losing accuracy.
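The quantization step can be illustrated with a small worked example. The source does not name the toolchain or the exact scheme, so the affine (scale and zero-point) mapping below is an assumption, as are the function names and the sample weights.

```python
from array import array


def quantize_int8(weights):
    """Map float weights to int8 using an affine scale and zero point."""
    # The representable range must include 0.0 so that zero stays exact.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / 255.0 or 1.0  # fall back if all weights are 0
    zero_point = round(-128 - w_min / scale)
    quantized = [
        max(-128, min(127, round(w / scale) + zero_point)) for w in weights
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float weights from their int8 codes."""
    return [(q - zero_point) * scale for q in quantized]


# Hypothetical weights, purely for illustration.
weights = [-0.5, -0.1, 0.0, 0.25, 0.77]
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize(codes, scale, zero_point)

# Storage per weight drops from 4 bytes (32-bit float) to 1 byte (int8).
float_bytes = len(array("f", weights).tobytes())
int8_bytes = len(array("b", codes).tobytes())
```

Each restored weight differs from the original by at most about half a quantization step, which is why accuracy often holds up after quantization; whether it does for a given model still has to be measured.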
03 · Results Key Findings
The experiments showed that constraining the model to coordinate-based inputs and using a heavily pooled architecture drastically reduced the model footprint.