| CTC + KD | ASR is all you need: cross-modal distillation for… | 59.80 | 2019-11-28 |
| TM-seq2seq | Deep Audio-Visual Speech Recognition | 58.90 | 2018-09-06 |
| EG-seq2seq | Discriminative Multi-modality Speech Recognition | 57.80 | 2020-05-12 |
| CTC-V2P | Large-Scale Visual Speech Recognition | 55.10 | 2018-07-13 |
| Hyb + Conformer | End-to-end Audio-visual Speech Recognition with C… | 43.30 | 2021-02-12 |
| VTP | Sub-word Level Lip Reading With Visual Attention | 40.60 | 2021-10-14 |
| RNN-T | Recurrent Neural Network Transducer for Audio-Vis… | 33.60 | 2019-11-08 |
| CTC/Attention (LRW+LRS2/3+AVSpeech) | Visual Speech Recognition for Multiple Languages … | 31.50 | 2022-02-26 |
| SyncVSR | SyncVSR: Data-Efficient Visual Speech Recognition… | 31.20 | 2024-06-18 |
| VTP (more data) | Sub-word Level Lip Reading With Visual Attention | 30.70 | 2021-10-14 |
| AV-HuBERT Large | Learning Audio-Visual Speech Representation by Ma… | 26.90 | 2022-01-05 |
| DistillAV | Audio-Visual Representation Learning via Knowledg… | 26.20 | 2025-02-09 |
| AV-HuBERT Large + Relaxed Attention + LM | Relaxed Attention for Transformer Models | 25.51 | 2022-09-20 |
| VSP-LLM | Where Visual Speech Meets Language: VSP-LLM Frame… | 25.40 | 2024-02-23 |
| RAVEn Large | Jointly Learning Visual and Auditory Speech Repre… | 23.40 | 2022-12-12 |
| USR (self-supervised) | Unified Speech Recognition: A Single Model for Au… | 22.30 | 2024-11-04 |
| SyncVSR | SyncVSR: Data-Efficient Visual Speech Recognition… | 21.50 | 2024-06-18 |
| USR (self + semi-supervised) | Unified Speech Recognition: A Single Model for Au… | 21.50 | 2024-11-04 |
| Auto-AVSR | Auto-AVSR: Audio-Visual Speech Recognition with A… | 19.10 | 2023-03-25 |
| LP + Conformer | Conformers are All You Need for Visual Speech Rec… | 12.80 | 2023-02-17 |