Researchers from the University of Michigan have developed a completely crazy technique that generates spectrograms which look like images and, when played back as audio, produce the sounds those images depict. They call the results “sound pictures”.
Their approach is simple and requires no special training. It relies on pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. During generation, the two denoising models work on the same latent simultaneously, each guided by its own text prompt: one describing the desired image, the other the desired sound.
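Conceptually, the shared-latent denoising can be sketched roughly as below. This is a minimal sketch assuming a diffusers-style setup in which both models are Stable Diffusion-compatible latent diffusion pipelines sharing one VAE latent space; the checkpoint names, the square latent size, and the simple averaging of the two noise predictions are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch: jointly denoise ONE latent with an image diffusion model and a
# spectrogram diffusion model, each conditioned on its own text prompt.
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder checkpoints (assumed): an image model and a spectrogram model
# that live in the same latent space.
image_pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
audio_pipe = StableDiffusionPipeline.from_pretrained("auffusion/auffusion-full").to(device)

image_prompt = "a medieval castle with tall bell towers, lithograph"
audio_prompt = "church bells ringing"

def encode(pipe, prompt):
    # Encode a text prompt with the pipeline's own tokenizer and text encoder.
    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            truncation=True, return_tensors="pt").to(device)
    return pipe.text_encoder(tokens.input_ids)[0]

img_emb = encode(image_pipe, image_prompt)
aud_emb = encode(audio_pipe, audio_prompt)

scheduler = image_pipe.scheduler
scheduler.set_timesteps(50)

# One shared latent, denoised jointly by both models.
latents = torch.randn(1, image_pipe.unet.config.in_channels, 64, 64,
                      device=device) * scheduler.init_noise_sigma

for t in scheduler.timesteps:
    latent_in = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        eps_img = image_pipe.unet(latent_in, t, encoder_hidden_states=img_emb).sample
        eps_aud = audio_pipe.unet(latent_in, t, encoder_hidden_states=aud_emb).sample
    # Combine the two denoising directions (a plain average here, as an assumption).
    eps = 0.5 * (eps_img + eps_aud)
    latents = scheduler.step(eps, t, latents).prev_sample

# Decode the final latent once: viewed as an image it shows the scene,
# and interpreted as a mel spectrogram it can be vocoded into sound.
result = image_pipe.vae.decode(latents / image_pipe.vae.config.scaling_factor).sample
```

In practice the spectrogram latent is usually non-square and the way the two denoising signals are weighted and guided matters a lot, but the core idea is exactly this: one latent, two models, two prompts.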
The result is amazing! The method produces spectrograms that, viewed as images, show a castle with towers and, played back as audio, ring like bells. Or a tiger whose stripes hide the sound pattern of its roar.
To evaluate their hack, the researchers used quantitative metrics such as CLIP and CLAP scores, as well as human perception studies. Their method outperforms alternative approaches, generating patterns that closely match the text prompts in both modalities. They also show that colorizing the spectrograms yields images that are more pleasing to the eye while preserving the sound.
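The two-modality check can be reproduced roughly as follows. This is a minimal sketch assuming open CLIP and CLAP checkpoints from Hugging Face; the model names, prompts, and file names are illustrative assumptions, not the authors' exact evaluation setup.

```python
# Sketch: score one generated "sound picture" against its two prompts.
import torch
import soundfile as sf
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, ClapModel, ClapProcessor

image_prompt = "a medieval castle with bell towers"
audio_prompt = "church bells ringing"

# CLIP: does the spectrogram, viewed as an image, match the image prompt?
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("sound_picture.png")  # hypothetical generated spectrogram image
clip_inputs = clip_proc(text=[image_prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    clip_score = clip(**clip_inputs).logits_per_image.item()

# CLAP: does the audio rendered from the spectrogram match the audio prompt?
# Note: this CLAP checkpoint expects 48 kHz audio.
clap = ClapModel.from_pretrained("laion/clap-htsat-unfused")
clap_proc = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
audio, sr = sf.read("sound_picture.wav")  # hypothetical vocoded audio
clap_inputs = clap_proc(text=[audio_prompt], audios=[audio], sampling_rate=sr,
                        return_tensors="pt", padding=True)
with torch.no_grad():
    clap_score = clap(**clap_inputs).logits_per_audio.item()

print(f"CLIP (image vs. prompt): {clip_score:.2f}")
print(f"CLAP (audio vs. prompt): {clap_score:.2f}")
```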
This work reveals an intersection between the distributions of natural images and audio spectrograms: despite their differences, they share low-level features such as contours, curves, and corners. That overlap makes it possible to compose visual and acoustic elements in unexpected ways, for example a single line that traces both the attack of a bell and the outline of a belfry.
The authors see it as progress in compositional multimodal generation and as a new form of audio-visual artistic expression: a kind of steganography that hides images in a soundtrack, revealed only when the audio is transformed into a spectrogram.
To recreate this method at home, simply go to the project's GitHub repository and follow the instructions there.