We generated synthetic mixtures of two speakers plus noise by combining the LRS3 [1] and WHAM! [2]
datasets.
The signal-to-noise ratio (SNR) is sampled uniformly from [-5, 5] dB for the clean speech and from
[-6, 3] dB for the noise, following the mixing procedure originally described in WHAM! [2].
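The sampling step can be sketched as follows. This is a minimal illustration, not the code used to build the dataset: the function names (`power`, `gain_for_snr`, `mix`) are hypothetical, signals are plain lists of floats of equal length, and it assumes that both sampled ratios are measured against the first speaker, which the text above does not specify.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a signal given as a sequence of floats."""
    return sum(v * v for v in x) / len(x)


def gain_for_snr(reference, interferer, snr_db):
    """Gain g such that 10*log10(power(reference) / power(g * interferer)) == snr_db."""
    return math.sqrt(power(reference) / (power(interferer) * 10 ** (snr_db / 10)))


def mix(speaker_1, speaker_2, noise, rng):
    """Mix two speakers and noise at uniformly sampled ratios.

    Assumption: speaker_1 is the reference for both sampled ratios.
    """
    speech_snr_db = rng.uniform(-5.0, 5.0)  # range stated for the clean speech
    noise_snr_db = rng.uniform(-6.0, 3.0)   # range stated for the noise
    g_speech = gain_for_snr(speaker_1, speaker_2, speech_snr_db)
    g_noise = gain_for_snr(speaker_1, noise, noise_snr_db)
    mixture = [
        a + g_speech * b + g_noise * n
        for a, b, n in zip(speaker_1, speaker_2, noise)
    ]
    return mixture, speech_snr_db, noise_snr_db
```

In practice the signals would be waveform arrays loaded from LRS3 and WHAM!, cropped to a common length before mixing. Plain lists are used here only to keep the sketch dependency-free.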
The following table shows the results for 10 randomly selected examples from the test set.
Video ground truth | Input mixture | ConvTasNet [3] | DPRNN [4] | AVConvTasNet [5] | LAVSE [6] | L2L [7] | A-FRCNN-8 [8] | AVLIT-8 (Ours) |
---|---|---|---|---|---|---|---|---|
[1] Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. “LRS3-TED: a large-scale dataset for visual speech recognition”. In: arXiv preprint arXiv:1809.00496 (2018).
[2] Gordon Wichern et al. “WHAM!: Extending speech separation to noisy environments”. In: arXiv preprint arXiv:1907.01160 (2019).
[3] Yi Luo and Nima Mesgarani. “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 27.8 (2019), pp. 1256–1266.
[4] Yi Luo, Zhuo Chen, and Takuya Yoshioka. “Dual-path rnn: efficient long sequence modeling for time-domain single-channel speech separation”. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 46–50.
[5] Jian Wu et al. “Time domain audio visual speech separation”. In: 2019 IEEE automatic speech recognition and understanding workshop (ASRU). IEEE. 2019, pp. 667–673.
[6] Shang-Yi Chuang et al. “Lite Audio-Visual Speech Enhancement”. In: Proc. Interspeech 2020. 2020, pp. 1131–1135. DOI: 10.21437/Interspeech.2020-1617. URL: http://dx.doi.org/10.21437/Interspeech.2020-1617.
[7] Ariel Ephrat et al. “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation”. In: arXiv preprint arXiv:1804.03619 (2018).
[8] Xiaolin Hu et al. “Speech Separation Using an Asynchronous Fully Recurrent Convolutional Neural Network”. In: Thirty-Fifth Conference on Neural Information Processing Systems. 2021. URL: https://openreview.net/forum?id=SlxH2AbBBC2.