Make some noise: Teaching the language of audio to an LLM using sound tokens

Video Link: https://www.youtube.com/watch?v=RIAsFyr3zs8





August 22, 2024
Speaker: Shivam Mehta
Host: Hannes Gamper

We investigate the use of low-bitrate, causal, quantized audio representations to fine-tune large language models (LLMs) with LoRA for both audio comprehension and audio generation. Unlike earlier approaches that rely on continuous audio representations for audio comprehension, our approach learns a discretized language of audio through causal variational quantization, yielding an ultra-low bitrate of 0.293 kbps. The resulting audio tokens are then used to fine-tune the Llama 7B model for multimodal tasks involving audio understanding and generation. By treating audio as a language with a left-to-right inductive bias similar to that of text, we can use these tokens to train a multimodal model and conduct qualitative multimodal analysis.
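
To put the 0.293 kbps figure in perspective, the bitrate of a discrete token stream can be estimated as tokens per second times bits per token. The sketch below is illustrative arithmetic only: the abstract does not state the codebook size or token rate, so the codebook size of 1024 used here is a hypothetical assumption, and the function names are made up for this example.

```python
import math


def bitrate_kbps(tokens_per_second: float, codebook_size: int,
                 num_codebooks: int = 1) -> float:
    """Bitrate (kilobits per second) of a discrete audio token stream.

    Each token drawn from a codebook of size K carries log2(K) bits.
    """
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * num_codebooks * bits_per_token / 1000.0


def tokens_per_second_for(target_kbps: float, codebook_size: int,
                          num_codebooks: int = 1) -> float:
    """Token rate implied by a target bitrate and a codebook size."""
    bits_per_token = math.log2(codebook_size)
    return target_kbps * 1000.0 / (num_codebooks * bits_per_token)


if __name__ == "__main__":
    # ASSUMPTION: a single codebook with 1024 entries (10 bits per token).
    # This value is hypothetical and is not taken from the talk.
    assumed_codebook_size = 1024

    rate = tokens_per_second_for(0.293, assumed_codebook_size)
    print(f"Implied token rate: {rate:.1f} tokens/s")

    # Sequence length the language model would see for a 10-second clip.
    clip_seconds = 10
    print(f"Tokens for a {clip_seconds} s clip: {rate * clip_seconds:.0f}")
```

Under that assumption, a few hundred tokens would cover several seconds of audio, which suggests why such a low bitrate keeps audio sequences short enough to fit alongside text in a language model's context. A different codebook size or multiple codebooks would change the implied token rate accordingly.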