Audio-Visual Speech Separation in Noisy Environments
with a Lightweight Iterative Model


Julius Richter2
Kai Li1
Xiaolin Hu1,3,*
Timo Gerkmann2,*

1. Department of Computer Science and Technology, Tsinghua University, Beijing, China
2. Signal Processing (SP), Department of Informatics, Universität Hamburg, Hamburg, Germany
3. Chinese Institute for Brain Research (CIBR), Beijing, China
*Corresponding Author

Sample results on LRS3+WHAM!


We generated synthetic mixtures containing two speakers plus noise by combining the LRS3 [1] and WHAM! [2] datasets. The signal-to-noise ratio (SNR) is sampled uniformly from [-5, 5] dB for the clean speech signals and from [-6, 3] dB for the noise, following the mixing procedure introduced with WHAM! [2].
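To make the mixing procedure concrete, the following is a minimal NumPy sketch of how such a 2-speaker + noise mixture can be constructed. The helper names (mix_at_snr, make_mixture) are hypothetical and not part of any released code; the official WHAM! scripts additionally use loudness-based scaling and clipping checks, which are simplified here to a plain power-based SNR computation.

```python
import numpy as np

def mix_at_snr(reference, interferer, snr_db, eps=1e-8):
    """Return `interferer` rescaled so that the power ratio between `reference`
    and the rescaled interferer equals `snr_db` (simple power-based SNR)."""
    p_ref = np.mean(reference ** 2) + eps
    p_int = np.mean(interferer ** 2) + eps
    gain = np.sqrt(p_ref / (p_int * 10.0 ** (snr_db / 10.0)))
    return gain * interferer

def make_mixture(speech1, speech2, noise, rng=None):
    """Hypothetical sketch of the 2-speaker + noise mixing described above.
    Speaker-to-speaker SNR ~ U[-5, 5] dB, speech-to-noise SNR ~ U[-6, 3] dB."""
    rng = rng or np.random.default_rng()
    speaker_snr = rng.uniform(-5.0, 5.0)  # SNR of speaker 1 w.r.t. speaker 2
    noise_snr = rng.uniform(-6.0, 3.0)    # SNR of total speech w.r.t. noise (simplification)
    speech2 = mix_at_snr(speech1, speech2, speaker_snr)
    noise = mix_at_snr(speech1 + speech2, noise, noise_snr)
    mixture = speech1 + speech2 + noise
    return mixture, speech1, speech2  # mixture plus the scaled target sources
```

In practice the two speech clips and the noise clip would first be resampled to a common rate and truncated to a common length before being passed to make_mixture, so that the element-wise sums above are valid.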
The following table shows the results for 10 randomly selected examples from the test set.

Columns: Video ground truth | Input mixture | ConvTasNet [3] | DPRNN [4] | AVConvTasNet [5] | LAVSE [6] | L2L [7] | A-FRCNN-8 [8] | AVLIT-8 (Ours)

References

[1] Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. “LRS3-TED: a large-scale dataset for visual speech recognition”. In: arXiv preprint arXiv:1809.00496 (2018).

[2] Gordon Wichern et al. “WHAM!: Extending speech separation to noisy environments”. In: arXiv preprint arXiv:1907.01160 (2019).

[3] Yi Luo and Nima Mesgarani. “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 27.8 (2019), pp. 1256–1266.

[4] Yi Luo, Zhuo Chen, and Takuya Yoshioka. “Dual-Path RNN: efficient long sequence modeling for time-domain single-channel speech separation”. In: 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 46–50.

[5] Jian Wu et al. “Time domain audio visual speech separation”. In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE. 2019, pp. 667–673.

[6] Shang-Yi Chuang et al. “Lite Audio-Visual Speech Enhancement”. In: Proc. Interspeech 2020. 2020, pp. 1131–1135. DOI: 10.21437/Interspeech.2020-1617. URL: http://dx.doi.org/10.21437/Interspeech.2020-1617.

[7] Ariel Ephrat et al. “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation”. In: arXiv preprint arXiv:1804.03619 (2018).

[8] Xiaolin Hu et al. “Speech Separation Using an Asynchronous Fully Recurrent Convolutional Neural Network”. In: Thirty-Fifth Conference on Neural Information Processing Systems. 2021. URL: https://openreview.net/forum?id=SlxH2AbBBC2.