A number of recent advances in neural audio synthesis rely on upsampling layers, which can introduce undesired artifacts. In computer vision, upsampling artifacts have been studied and are known as checkerboard artifacts (due to their characteristic visual pattern). However, their effect has been overlooked so far in audio processing. Here, we address this gap by studying this problem from the audio signal processing perspective. The main sources of upsampling artifacts are:
Tonal artifacts: additive periodic noise perceived as a high-frequency buzzing noise.
Filtering artifacts: the attenuation of some bands, de-emphasizing high-end frequencies.
Spectral replicas of signal offsets, which can introduce additional artifacts.
The following video gently introduces those concepts:
We benchmark different upsampling layers for music source separation: transposed and subpixel convolutions, different interpolation upsamplers like nearest neighbor and stretch interpolation, and wavelet-based upsamplers.
Our results show that filtering artifacts, associated with interpolation upsamplers like nearest neighbor, are perceptually preferable (higher MOS), even if they tend to achieve worse SDR scores. Check our papers for further experimental details.
2.70 / 5
2.48 / 5
3.30 / 5
2.83 / 5
2.45 / 5
Upsampling layers can introduce unpleasant tonal artifacts. You can listen to them in the following examples of separated vocals!
The following spectrogram depicts the kind of tonal artifacts (horizontal lines) one can hear throughout this website. These are particularly noticeable in high-frequency and silent regions. To easily identify and understand them, we recommend using headphones or visualizing their spectrograms!
Now note that nearest neighbor does not introduce tonal artifacts. Observe this behavior in the following examples of separated drums!
The following spectrogram depicts the kind of filtering artifacts (the highlighted horizontal valley) present throughout this website. They attenuate some bands, and are not necessarily noticeable unless we visualize the spectrograms.
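The band attenuation described above can be illustrated with a minimal sketch. It assumes the standard signal-processing view of nearest neighbor upsampling as zero insertion followed by a hold filter of `factor` ones; the function name and FFT size are illustrative and not taken from the paper's code.

```python
import numpy as np

def nearest_neighbor_response(factor, n_fft=1024):
    """Magnitude response of the hold filter implied by nearest neighbor upsampling.

    Assumption: nearest neighbor upsampling == zero insertion + FIR filter of
    `factor` ones. Frequencies are relative to the output sampling rate.
    """
    kernel = np.ones(factor)
    return np.abs(np.fft.rfft(kernel, n=n_fft))

n_fft = 1024
freqs = np.fft.rfftfreq(n_fft)          # 0 .. 0.5 cycles per output sample
mag_x2 = nearest_neighbor_response(2, n_fft)
mag_x4 = nearest_neighbor_response(4, n_fft)

# Bins where the x4 response (numerically) vanishes: the attenuated "valleys".
notches_x4 = freqs[mag_x4 < 1e-9]
print("x2 gain at DC and Nyquist:", mag_x2[0], mag_x2[-1])
print("x4 notch frequencies (cycles/sample):", notches_x4)
```

Under this assumption, the x2 upsampler rolls off smoothly towards the highest frequency, while the x4 upsampler places notches at a quarter and a half of the output sampling rate, which is the kind of horizontal valley visible in a spectrogram.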
So far, we only discussed filtering and tonal artifacts.
Filtering artifacts are introduced by the frequency response of the non-learnable components of interpolation and wavelet upsamplers.
Tonal artifacts are introduced by the structure (and initialization) of transposed and subpixel convolution architectures.
However, spectral replicas can introduce additional upsampling artifacts.
For this reason, stretch and lazy wavelet can also introduce tonal artifacts.
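To make the architectural source of tonal artifacts more concrete, here is a minimal sketch of a 1D transposed convolution fed with a constant signal. It only covers one structural cause (a kernel that overlaps unevenly because its length is not a multiple of the stride); the helper `transposed_conv1d` and the all-ones kernel are illustrative assumptions, not the models used in our experiments.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """Naive 1D transposed convolution: each input sample pastes a scaled
    copy of the kernel into the output, shifted by `stride` samples."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        out[i * stride : i * stride + len(kernel)] += value * kernel
    return out

x = np.ones(8)  # constant input, e.g. a silent region with an offset

# Kernel length 3, stride 2: neighboring kernel copies overlap on every
# other output sample, so the output alternates even for a constant input.
uneven = transposed_conv1d(x, np.ones(3), stride=2)

# Kernel length 2, stride 2: no overlap, so the output stays constant.
even = transposed_conv1d(x, np.ones(2), stride=2)

print(uneven[2:14])
print(even)
```

The alternating pattern in `uneven` repeats every `stride` samples, which corresponds to a tone at a fixed frequency: a horizontal line in the spectrogram.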
Note the above-mentioned artifacts in the following examples of separated drums!
Spectral replicas of signal offsets
Definition: offsets are constants with zero frequency. Hence, their frequency transform contains an energy component at frequency zero. When upsampling, zero-frequency spectral replicas can appear in-band, introducing tonal artifacts.
Key observation: note that the architecture of stretch interpolation can only introduce filtering artifacts (not tonal artifacts). Yet, the above stretch separations contain tonal artifacts. This is because the spectral replicas of signal offsets (across feature maps) can appear in-band when upsampling. More details in our paper.
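The observation above can be reproduced with a small sketch, assuming stretch interpolation is implemented as zero insertion and taking a constant feature map as the offset. The helper name `stretch` and the signal sizes are illustrative.

```python
import numpy as np

def stretch(x, factor):
    """Stretch interpolation modeled as zero insertion (an assumption):
    keep each sample and insert `factor - 1` zeros after it."""
    out = np.zeros(len(x) * factor)
    out[::factor] = x
    return out

offset = np.ones(64)                 # a feature map that is a pure offset
upsampled = stretch(offset, factor=4)

spectrum = np.abs(np.fft.rfft(upsampled))
freqs = np.fft.rfftfreq(len(upsampled))   # cycles per output sample

# Frequencies that carry energy after upsampling.
active = freqs[spectrum > 1e-6]
print("energy at (cycles/sample):", active)
```

In this sketch the offset, which originally only had energy at frequency zero, also shows up at a quarter and a half of the output sampling rate. Those in-band replicas are constant-frequency components, that is, tonal artifacts.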
Practical advice: the spectral replicas of offsets interact with filtering artifacts. Importantly, interpolation-based upsamplers (like nearest neighbor, which introduces filtering artifacts) can attenuate the exact bands where spectral replicas of signal offsets appear. Hence, filtering artifacts are a powerful tool to combat the spectral replicas of signal offsets. More details in our paper.
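As a hedged illustration of this interaction, the sketch below upsamples the same constant offset with stretch interpolation (modeled as zero insertion) and with nearest neighbor (modeled with `np.repeat`), and lists the FFT bins that carry energy in each case. The helper `active_bins` and the tolerance are illustrative choices.

```python
import numpy as np

factor = 4
offset = np.ones(64)                       # constant feature map (offset)

# Stretch interpolation, assumed here to be plain zero insertion.
stretched = np.zeros(len(offset) * factor)
stretched[::factor] = offset

# Nearest neighbor: each sample is repeated `factor` times.
nearest = np.repeat(offset, factor)

def active_bins(signal, tol=1e-6):
    """Indices of the real-FFT bins whose magnitude exceeds `tol`."""
    return np.flatnonzero(np.abs(np.fft.rfft(signal)) > tol).tolist()

print("stretch:", active_bins(stretched))
print("nearest:", active_bins(nearest))
```

With stretch interpolation the offset is replicated at bins 64 and 128 (a quarter and a half of the output sampling rate), whereas with nearest neighbor only the zero-frequency bin remains: the notches of its filtering artifacts fall exactly on the offset replicas.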
Negative result: we tried using normalization layers, as a way to mitigate
the spectral replicas of signal offsets. However, informal listening reveals that
normalization layers did not remove the tonal artifacts as intended. Listen to the subpixel convolution separations (a model which uses normalization layers, see our paper for more details).
Spectral replicas as a source of coherent high-frequency content
Hypothesis definition: we can
sort interpolation upsamplers by how strong their filtering artifacts
are, for example: sinc → linear → nearest neighbor → stretch. Note that
sinc interpolation strongly filters the signal, and stretch interpolation
introduces no filtering artifacts. While sinc interpolation is widely
used in audio because it removes all spectral replicas, this might not
be desirable for deep learning.
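Part of this ordering can be checked numerically with a minimal sketch. It assumes x2 upsampling expressed as zero insertion followed by a small FIR kernel, and compares how much each kernel attenuates one high frequency relative to frequency zero. Sinc interpolation is left out because it is an ideal low-pass filter that would need a truncated approximation; the kernel table and the probe frequency are illustrative choices.

```python
import numpy as np

# x2 interpolation kernels applied after zero insertion (assumed models).
KERNELS = {
    "stretch": np.array([1.0]),            # no filtering at all
    "nearest": np.array([1.0, 1.0]),       # sample-and-hold
    "linear": np.array([0.5, 1.0, 0.5]),   # triangular kernel
}

def relative_gain(kernel, freq):
    """|H(freq)| / |H(0)|, with `freq` in cycles per output sample."""
    n = np.arange(len(kernel))
    response = np.sum(kernel * np.exp(-2j * np.pi * freq * n))
    return np.abs(response) / np.abs(np.sum(kernel))

probe = 0.375  # a frequency inside the band where spectral replicas live
gains = {name: relative_gain(k, probe) for name, k in KERNELS.items()}

# Strongest filtering (lowest gain) first.
ranking = sorted(gains, key=gains.get)
print(ranking)
for name in ranking:
    print(f"{name}: {gains[name]:.3f}")
```

At this probe frequency the sketch orders the kernels as linear, nearest neighbor, stretch, from strongest to weakest attenuation, with stretch leaving the spectral replicas untouched.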
Positive result: we hypothesized that allowing spectral replicas across feature maps is beneficial, as it allows the model to have access to coherent high-frequency feature maps for wide-band synthesis. In our paper we experimentally validate this hypothesis, and we recommend using nearest neighbor or stretch interpolation.
Post-networks to palliate upsampling artifacts
Negative result: previous research explored using post-processing networks, as an “a posteriori” mechanism to palliate upsampling artifacts (link, link).
However, we found that post-networks were unable to fully remove tonal artifacts. Listen to the following examples: