Technion: Self-content-based audio inpainting

The popularity of voice over Internet protocol (VoIP) systems is continuously growing.
Such systems depend on unreliable Internet communication, in which chunks of data
often get lost during transmission. Various solutions to this problem have been proposed, most
of which are better suited to low rates of data loss. This work addresses the problem by
filling in missing data using examples taken from prior recorded audio of the same user.
Our approach also harnesses statistical priors and data inpainting smoothing techniques.
The effectiveness of the proposed solution is demonstrated experimentally, even for large
data gaps, which cannot be handled by standard packet loss concealment techniques.
Voice over Internet protocol (VoIP) systems have become a basic tool with ever-growing popularity. However, they commonly rely on an unreliable communication channel, such as the Internet, and are therefore subject to frequent events of data loss. These events are usually realized as lost data packets carrying audio information, which in turn leads to temporal gaps in the received audio sequences. Left untreated, such gaps create breaks in the audio (e.g. missing syllables in speech signals). A high packet loss rate (above 20%) can render speech unintelligible. For this reason, VoIP applications regularly incorporate a packet loss concealment (PLC) mechanism that counters the degradation in audio quality by filling in for the missing audio data using various techniques. A PLC mechanism should not impose high computational loads or extensive memory usage. Specifically, PLC should operate in real time. Moreover, intense computations consume more power, which is a limited resource in mobile devices.

Most existing PLC techniques have difficulties handling long audio gaps. This paper presents an approach for handling such gaps, corresponding to high packet loss rates. We suggest using an example-based principle that exploits audio examples collected from past audio signals. Once an audio gap is encountered, our algorithm harnesses the audio data surrounding the gap to look for the most suitable audio example to fill it. A mixture of audio features and prior knowledge of the statistical nature of the audio signal is used to find the most appropriate set of candidate examples. Once found, our solution applies a series of steps for isolating the best-fitting example and pre-processing the exact portion of the audio to be extracted from it. This portion is smoothly inlaid to fill the audio gap.

Inpainting is a term commonly used in the context of filling in missing pixels in images. It was borrowed by Adler et al. to describe filling short audio gaps in a signal using the intact portions surrounding each gap. Our work has a similar flavour, but differs from that work in several important aspects. The novelty in our work lies in using a self-content-based approach, while exploiting a higher-level model of the audio signal. These enable handling longer temporal audio gaps, which the approach of Adler et al. cannot handle, as we observed when experimenting with such long gaps.
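The search-and-inlay idea can be illustrated with a minimal sketch. The actual method uses richer audio features, statistical priors, and smoothing steps that are not reproduced here; purely for illustration, this sketch matches the raw pre-gap samples against an example bank of prior recordings by normalised cross-correlation, takes the samples that followed the best match, and inlays them with a linear cross-fade at the trailing boundary. All function names and parameters are hypothetical:

```python
import numpy as np

def best_match_offset(context, bank):
    """Find where `bank` best continues the pre-gap `context`, by maximising
    the normalised cross-correlation between the context and every equally
    long window of the example bank (hypothetical helper, not from the paper)."""
    n = len(context)
    c = (context - context.mean()) / (context.std() + 1e-12)
    best_off, best_score = 0, -np.inf
    for off in range(len(bank) - n):
        w = bank[off:off + n]
        w = (w - w.mean()) / (w.std() + 1e-12)
        score = float(np.dot(c, w) / n)
        if score > best_score:
            best_off, best_score = off, score
    return best_off, best_score

def fill_gap(signal, gap_start, gap_end, bank, ctx=256, fade=64):
    """Example-based concealment sketch: match the `ctx` samples before the
    gap against `bank`, copy what followed the best match into the gap, and
    cross-fade the patch tail into the intact post-gap samples.
    Assumes `fade` intact samples exist after the gap."""
    out = signal.copy()
    gap_len = gap_end - gap_start
    context = signal[gap_start - ctx:gap_start]
    # restrict the search so the patch never runs off the end of the bank
    searchable = bank[:len(bank) - gap_len - fade]
    off, _ = best_match_offset(context, searchable)
    patch = bank[off + ctx:off + ctx + gap_len + fade]
    # the patch was chosen to continue the pre-gap context, so the leading
    # boundary needs no fade; fill the gap directly
    out[gap_start:gap_end] = patch[:gap_len]
    # linear cross-fade from the patch tail back into the received signal
    ramp = np.linspace(1.0, 0.0, fade)
    out[gap_end:gap_end + fade] = (ramp * patch[gap_len:gap_len + fade]
                                   + (1.0 - ramp) * signal[gap_end:gap_end + fade])
    return out
```

In the paper's setting the bank would hold prior recorded audio of the same user; the cross-fade stands in for the smoothing step, avoiding an audible discontinuity where the inlaid portion meets the received samples.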

Published in Signal Processing, December 2014.