Audio-Visual Speech Separation in Noisy Environments
with a Lightweight Iterative Model

Julius Richter2
Kai Li1
Xiaolin Hu1,3,*
Timo Gerkmann2,*

1. Department of Computer Science and Technology, Tsinghua University, Beijing, China
2. Signal Processing (SP), Department of Informatics, Universität Hamburg, Hamburg, Germany
3. Chinese Institute for Brain Research (CIBR), Beijing, China
*Corresponding Author

Sample results on LRS3+WHAM!

We generated synthetic mixtures of two speakers plus noise by combining the LRS3 [1] and WHAM! [2] datasets. The signal-to-noise ratio (SNR) is sampled uniformly from [-5, 5] dB for the clean speech and from [-6, 3] dB for the noise, following the mixing procedure originally described for WHAM!.
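As a rough illustration of this kind of mixing, the sketch below scales a second speaker and a noise clip against a first speaker at uniformly sampled SNRs and sums them. It is a minimal sketch under stated assumptions, not the scripts used to build the dataset: the function names are hypothetical, levels are measured with plain RMS, and both SNRs are taken relative to the first speaker. The original WHAM! procedure may measure levels and choose the reference differently.

```python
import numpy as np


def rms(x):
    """Root-mean-square level of a signal (small epsilon avoids division by zero)."""
    return np.sqrt(np.mean(x ** 2) + 1e-12)


def scale_to_snr(reference, signal, snr_db):
    """Scale `signal` so that the reference-to-signal level ratio equals `snr_db`."""
    gain = rms(reference) / (rms(signal) * 10 ** (snr_db / 20))
    return signal * gain


def mix_two_speakers_with_noise(s1, s2, noise, rng):
    """Hypothetical helper: build one noisy two-speaker mixture.

    Assumption: both SNRs are defined relative to the first speaker `s1`.
    """
    # Trim everything to the shortest clip so the signals can be summed.
    n = min(len(s1), len(s2), len(noise))
    s1, s2, noise = s1[:n], s2[:n], noise[:n]

    # Sample the SNRs uniformly from the ranges quoted in the text.
    speech_snr_db = rng.uniform(-5.0, 5.0)
    noise_snr_db = rng.uniform(-6.0, 3.0)

    s2 = scale_to_snr(s1, s2, speech_snr_db)
    noise = scale_to_snr(s1, noise, noise_snr_db)

    mixture = s1 + s2 + noise
    return mixture, (s1, s2), (speech_snr_db, noise_snr_db)


# Usage with random stand-ins for real LRS3 speech and WHAM! noise clips.
rng = np.random.default_rng(0)
speaker_1 = rng.standard_normal(16000)
speaker_2 = 0.3 * rng.standard_normal(16000)
noise_clip = 2.0 * rng.standard_normal(20000)
mixture, sources, snrs = mix_two_speakers_with_noise(
    speaker_1, speaker_2, noise_clip, rng
)
```

The scaled sources are returned alongside the mixture because a separation model is typically trained and evaluated against the sources as they appear in the mixture, not the unscaled originals.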
The following table shows the results for 10 randomly selected examples from the test set.

Video ground truth | Input mixture | ConvTasNet [3] | DPRNN [4] | AVConvTasNet [5] | LAVSE [6] | L2L [7] | A-FRCNN-8 [8] | AVLIT-8 (Ours)


