Academic Research

Hybrid CNN-LSTM Architecture

A highly optimized, lightweight neural network architecture designed for real-time sign language recognition on edge devices.

95.3% Accuracy · Edge Inference · Published Research
Role: Lead Researcher / Author
Timeline: Personal Project
Domain: Computer Vision
Core Tech: TensorFlow, MediaPipe

"Traditional deep learning models for continuous sign language recognition achieve high accuracy but are computationally expensive, which restricts their deployment to powerful cloud servers. In this research, we propose a lightweight hybrid CNN-LSTM architecture that leverages spatial landmarks to achieve 95.3% validation accuracy while meeting strict low-latency requirements for real-time browser inference on consumer hardware."

01 · Problem The Challenge

Sign language translation systems face a two-fold challenge: spatial complexity (recognizing exact hand shapes and body posture) and temporal dependency (understanding how these shapes change sequentially over time to form words).
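The temporal side of the problem means a classifier needs a short history of frames rather than a single image. As a minimal sketch (the class name, window size, and feature layout here are illustrative assumptions, not taken from the project), a fixed-length sliding window over per-frame features could look like this:

```python
from collections import deque

WINDOW = 16  # hypothetical number of frames per gesture clip


class FrameWindow:
    """Keeps the most recent `size` per-frame feature vectors."""

    def __init__(self, size=WINDOW):
        self.frames = deque(maxlen=size)

    def push(self, features):
        """Add one frame's features; return a full clip, or None if too short."""
        self.frames.append(list(features))
        if len(self.frames) < self.frames.maxlen:
            return None  # not enough temporal context yet
        return list(self.frames)  # nested list of shape (size, feature_dim)


# Usage with a tiny window and two-dimensional stand-in features.
window = FrameWindow(size=3)
clips = [window.push([t, t + 0.5]) for t in range(5)]
```

The first two calls return `None` because the window is not yet full; after that, every new frame yields a clip that overlaps the previous one by all but one frame.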

Massive transformer models or full-frame 3D CNNs can handle both, but they require expensive GPUs. For an educational tool aimed at the general public in developing nations, the key constraint is running inference directly on older laptops or mobile devices without thermal throttling or noticeable lag.

02 · Solution Methodology & Architecture

To overcome these constraints, we built an experimental hybrid CNN-LSTM architecture optimized for edge devices. MobileNetV2 extracts spatial features, an LSTM network captures sequential gesture dynamics, and post-training quantization converts 32-bit weights into 8-bit integers to shrink the memory footprint. The pipeline has three stages:

1. Dataset Prep

A constrained dataset of low-resolution Indonesian Sign Language images was manually labeled, resized, normalized, and augmented to simulate real-world variability.
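The preparation steps can be sketched with NumPy alone. The target resolution, shift range, and brightness range below are assumptions chosen for illustration; the original pipeline's actual values and augmentation types are not stated.

```python
import numpy as np

TARGET_SIZE = 96  # hypothetical input resolution


def resize_nearest(image, size=TARGET_SIZE):
    """Nearest-neighbour resize of an (H, W, C) image to (size, size, C)."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]


def normalize(image):
    """Scale uint8 pixel values to float32 values in [0, 1]."""
    return image.astype(np.float32) / 255.0


def augment(image, rng, max_shift=4, max_brightness=0.2):
    """Random brightness offset plus a wrap-around horizontal shift."""
    delta = float(rng.uniform(-max_brightness, max_brightness))
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(image, shift, axis=1)  # pixels wrap around the edge
    return np.clip(shifted + delta, 0.0, 1.0)


rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(120, 160, 3), dtype=np.uint8)  # stand-in frame
sample = augment(normalize(resize_nearest(raw)), rng)
```

In practice a library resize such as OpenCV's would replace `resize_nearest`; the hand-written version only keeps the sketch dependency-free.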

2. Feature Extraction

MobileNetV2 extracts spatial features from each frame, and an LSTM network then models the sequence of per-frame features to capture gesture dynamics.
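One common way to wire a per-frame CNN into an LSTM in Keras is to wrap the backbone in `TimeDistributed`. The sketch below is an assumption about how such a model might be assembled, not the published architecture: the clip length, frame size, class count, width multiplier, LSTM size, dropout rate, and L2 strength are all hypothetical.

```python
import tensorflow as tf

NUM_FRAMES = 16   # hypothetical clip length
IMAGE_SIZE = 96   # hypothetical frame resolution
NUM_CLASSES = 26  # hypothetical number of signs


def build_cnn_lstm():
    # MobileNetV2 without its classifier head; pooling="avg" yields one
    # feature vector per frame. weights=None skips the ImageNet download.
    backbone = tf.keras.applications.MobileNetV2(
        input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3),
        alpha=0.35,
        include_top=False,
        weights=None,
        pooling="avg",
    )
    clip = tf.keras.Input(shape=(NUM_FRAMES, IMAGE_SIZE, IMAGE_SIZE, 3))
    # Apply the same CNN to every frame: (batch, frames, feature_dim).
    features = tf.keras.layers.TimeDistributed(backbone)(clip)
    # The LSTM summarizes the frame sequence into a single vector.
    x = tf.keras.layers.LSTM(64)(features)
    x = tf.keras.layers.Dropout(0.3)(x)
    outputs = tf.keras.layers.Dense(
        NUM_CLASSES,
        activation="softmax",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),
    )(x)
    return tf.keras.Model(clip, outputs)


model = build_cnn_lstm()
```

Passing `weights="imagenet"` instead would start from pretrained features, which is the usual choice when the labeled dataset is small.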

3. Compression

The model was trained with L2 regularization and then compressed with post-training quantization, which converts 32-bit weights into 8-bit integers to shrink the memory footprint while preserving accuracy.
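To make the 32-bit to 8-bit conversion concrete, here is a simplified, pure-Python sketch of symmetric per-tensor quantization. It is an illustration only: real toolchains add details such as per-channel scales and activation calibration, and the weight values and parameter count below are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the 8-bit integers back to approximate float values."""
    return [q * scale for q in quantized]


def size_in_mb(num_weights, bits_per_weight):
    """Storage for the weights alone, ignoring file-format overhead."""
    return num_weights * bits_per_weight / 8 / 1_000_000


weights = [0.42, -1.27, 0.003, 0.9, -0.55]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
# Rounding error per weight is bounded by half a quantization step.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now occupies one byte instead of four, so `size_in_mb(1_000_000, 32)` versus `size_in_mb(1_000_000, 8)` shows the fourfold reduction for a hypothetical one-million-parameter model.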

03 · Results Key Findings

The experimental results showed that constraining the model to coordinate-based inputs and using a heavily pooled architecture drastically reduced the model footprint.

Validation Accuracy: 95.3% (with early stopping and dropout)
Inference Speed: 45 ms (40-50 ms per batch on CPU)
Model Size: <25 MB (ideal for web transmission)
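Per-batch latency figures like the one above are typically obtained by timing repeated calls after a warm-up. The helper below is a generic sketch of that procedure, not the benchmark used for the reported numbers; `dummy_predict` is a stand-in for a real model call.

```python
import statistics
import time


def benchmark(predict, batch, warmup=3, runs=20):
    """Time predict(batch) and return latency statistics in milliseconds."""
    for _ in range(warmup):
        predict(batch)  # warm-up calls are not timed
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(batch)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
    }


def dummy_predict(batch):
    """Stand-in for a real model call."""
    return [sum(frame) for frame in batch]


stats = benchmark(dummy_predict, [[0.1] * 100] * 16)
```

Reporting the median rather than the mean keeps a single slow run, for example one interrupted by the operating system, from skewing the result.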

04 · Stack Tech Stack

Data & Processing: Python, NumPy, MediaPipe Holistic, OpenCV
Model Architecture: TensorFlow / Keras, MobileNetV2, LSTM, Matplotlib