Hybrid CNN-LSTM Research

"Traditional Deep Learning models for continuous sign language recognition achieve high accuracy but are computationally expensive, restricting their deployment to powerful cloud servers. In this research, we propose a lightweight Hybrid CNN-LSTM architecture leveraging spatial landmarks to achieve 95.30% validation accuracy while maintaining strict low-latency requirements for real-time browser inference on consumer hardware."

01 · Problem: The Challenge

Sign language translation systems face a two-fold challenge: spatial complexity (recognizing exact hand shapes and body posture) and temporal dependency (understanding how these shapes change sequentially over time to form words).

While massive transformer models or full-frame 3D-CNNs can solve this, they require expensive GPUs. For an educational tool aimed at the general public in developing nations, the key constraint is running inference directly on older laptops or mobile devices without thermal throttling or perceived lag.

02 · Solution: Methodology & Architecture

To work within these constraints, we built an experimental hybrid CNN-LSTM architecture optimized for edge devices. The pipeline has three stages: dataset preparation, feature extraction, and compression.

The system uses MobileNetV2 to extract spatial features from each frame and passes them to an LSTM network that captures sequential gesture dynamics. The model is trained with L2 regularization and then compressed with post-training quantization, which converts 32-bit floating-point weights into 8-bit integers and shrinks the memory footprint without sacrificing accuracy.
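As a rough illustration of the CNN-to-LSTM hand-off described above, the following Keras sketch wraps MobileNetV2 in a `TimeDistributed` layer so it runs on every frame, then feeds the per-frame feature vectors to an LSTM. The function name `build_hybrid_model`, the clip length, frame size, width multiplier, LSTM width, dropout rate, L2 strength, and class count are all assumptions chosen for illustration, not values taken from this project. `weights=None` is used so the sketch runs without downloading pretrained weights.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers


def build_hybrid_model(num_frames=16, frame_size=96, num_classes=10, l2_weight=1e-4):
    """Sketch of a per-frame CNN followed by an LSTM (all sizes are assumed)."""
    # Spatial feature extractor: a narrow MobileNetV2 without its classifier head.
    # Global average pooling turns each frame into a single feature vector.
    cnn = tf.keras.applications.MobileNetV2(
        input_shape=(frame_size, frame_size, 3),
        include_top=False,
        weights=None,
        pooling="avg",
        alpha=0.35,
    )

    # Input is a clip: (frames, height, width, channels).
    inputs = tf.keras.Input(shape=(num_frames, frame_size, frame_size, 3))

    # Apply the same CNN to every frame, giving a sequence of feature vectors.
    x = layers.TimeDistributed(cnn)(inputs)

    # Temporal model: the LSTM summarizes the sequence into one vector.
    x = layers.LSTM(64, kernel_regularizer=regularizers.l2(l2_weight))(x)
    x = layers.Dropout(0.3)(x)

    outputs = layers.Dense(
        num_classes,
        activation="softmax",
        kernel_regularizer=regularizers.l2(l2_weight),
    )(x)
    return tf.keras.Model(inputs, outputs)
```

Calling `build_hybrid_model()` returns an uncompiled model that maps a clip of frames to one probability per class. Swapping `weights=None` for `weights="imagenet"` would load pretrained backbone weights, at the cost of a download on first use.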

1. Dataset Prep

A constrained dataset of low-resolution Indonesian Sign Language images was manually labeled, resized, normalized, and augmented to simulate real-world variability.

2. Feature Extraction

MobileNetV2 extracts spatial features from each frame; an LSTM network then consumes the resulting feature sequence to capture sequential gesture dynamics.

3. Compression

The model is trained with L2 regularization and compressed with post-training quantization to minimize its memory footprint without losing accuracy.
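To make the 32-bit to 8-bit conversion concrete, here is a minimal pure-Python sketch of affine (scale and zero-point) quantization applied to a handful of weights. It illustrates the arithmetic only and is not the converter used in this project; the helper names `quantize_int8` and `dequantize` and the sample weights are made up for the example.

```python
def quantize_int8(weights):
    """Map floats onto the int8 range [-128, 127] with a scale and zero point."""
    lo, hi = min(weights), max(weights)
    # Widen the range to include 0.0 so that zero is represented exactly.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    # 255 steps span the range; fall back to 1.0 for an all-zero tensor.
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = round(-128 - lo / scale)
    quantized = [
        max(-128, min(127, round(w / scale) + zero_point)) for w in weights
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the int8 codes."""
    return [(v - zero_point) * scale for v in quantized]


# Hypothetical weights, for illustration only.
weights = [-0.62, -0.1, 0.0, 0.27, 0.91]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage cost: 4 bytes per float32 weight versus 1 byte per int8 weight.
float32_bytes = len(weights) * 4
int8_bytes = len(weights) * 1
```

Each stored weight shrinks from four bytes to one, which is where the roughly fourfold reduction in weight storage comes from. The price is a rounding error of at most half a quantization step per weight.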

03 · Results: Key Findings

The experimental results show that constraining the model to coordinate-based inputs and using a heavily pooled architecture substantially reduced the model footprint.

Validation Accuracy: 95.3% (early stopping & dropout)

Inference Speed: 45 ms (40-50 ms per batch on CPU)

Model Weight: <25 MB (ideal for web transmission)
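A per-batch latency figure like the one above can be checked with a small timing harness. The sketch below is one hedged way to do it with the standard library: `measure_latency_ms` and `dummy_inference` are hypothetical names, and the dummy workload stands in for a real model call.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=3, runs=20):
    """Return (median, worst) wall-clock latency of fn() in milliseconds."""
    # Warm-up calls absorb one-time costs such as caching or lazy setup.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), max(samples)


def dummy_inference():
    # Placeholder workload; replace with the real model's predict call.
    return sum(i * i for i in range(10_000))


median_ms, worst_ms = measure_latency_ms(dummy_inference)

# Throughput implied by the reported 45 ms per batch.
batches_per_second = 1000.0 / 45.0
```

Reporting the median alongside the worst case helps separate typical latency from occasional stalls. At 45 ms per batch, the model can process roughly 22 batches per second.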

04 · Stack: Tech Stack

Data & Processing: Python, NumPy, MediaPipe Holistic, OpenCV

Model Architecture: TensorFlow / Keras, MobileNetV2, LSTM, Matplotlib
