ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models
TL;DR
Existing contrastive audio–text embeddings are tuned mainly for caption matching, which limits their support for diverse retrieval goals and controllable behavior. ALM2Vec transfers the audio understanding and instruction-following abilities of pretrained large audio–language models (LALMs) into a unified embedding space for retrieval across domains and tasks, including instruction-aware search for audio QA and aspect-conditioned retrieval. It is competitive on standard audio and speech retrieval benchmarks and shows promising compositional and controllable retrieval capabilities as a single model across domains, tasks, and user intents.
Demo
We showcase ALM2Vec across four retrieval settings. Each part contains several illustrative examples.
instruction-aware audio → audio
The same audio clips are encoded in query mode under different instructions. Because each embedding follows its instruction, the resulting similarities reflect instruction following: the same audio pair lands closer together or farther apart depending on what the instruction asks the model to attend to.
audio → text retrieval
Each example contains three audio–text pairs. The 3×3 matrix reports the pairwise similarity between every audio and every text, with the diagonal (matching pairs) expected to dominate.
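The 3×3 matrix described above can be sketched in a few lines. This is a minimal illustration, not the ALM2Vec implementation: it assumes cosine similarity as the similarity measure (the text only says "pairwise similarity"), and the vectors below are hand-written placeholders standing in for real audio and text embeddings. The helper names `cosine`, `similarity_matrix`, and `diagonal_dominates` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(rows, cols):
    """Entry [i][j] is the similarity between rows[i] and cols[j]."""
    return [[cosine(r, c) for c in cols] for r in rows]


def diagonal_dominates(sim):
    """True if the largest entry of every row lies on the diagonal."""
    return all(
        max(range(len(row)), key=row.__getitem__) == i
        for i, row in enumerate(sim)
    )


# Placeholder embeddings (assumption): pair i is (audio_embs[i], text_embs[i]).
audio_embs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
text_embs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.1, 1.0]]

sim = similarity_matrix(audio_embs, text_embs)
for row in sim:
    print(" ".join(f"{s:.2f}" for s in row))
print("diagonal dominates:", diagonal_dominates(sim))
```

Swapping the two arguments of `similarity_matrix` gives the same scores viewed from the text side, since cosine similarity is symmetric.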
text → audio retrieval
Symmetric to audio→text: three text–audio pairs and the 3×3 pairwise similarity matrix between every text and every audio.
audio + question → answer
An audio clip paired with a question forms the query. Several candidate answers act as independent documents; the query×doc similarities (a 1×N row) should peak on the correct answer.
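The 1×N scoring step can be sketched as follows. As before, this is an illustration under stated assumptions rather than the actual model code: cosine similarity is assumed, `query_emb` is a placeholder for the embedding of an audio clip plus its question, and the candidate answers and their vectors are made up for the example. `rank_answers` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_answers(query_emb, answer_embs):
    """Score every candidate against the query; return (best index, scores)."""
    scores = [cosine(query_emb, a) for a in answer_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Placeholder data (assumption): one query, three independent candidate documents.
query_emb = [0.2, 0.9, 0.1]
answers = ["a dog barking", "a piano melody", "rain on a roof"]
answer_embs = [[0.9, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.2, 0.9]]

best, scores = rank_answers(query_emb, answer_embs)
for text, score in zip(answers, scores):
    print(f"{score:.2f}  {text}")
print("predicted answer:", answers[best])
```

Each candidate is embedded and scored independently, so adding or removing an answer does not change the scores of the others.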
License
The repository is licensed under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International).
Citation
If you find this work useful, please consider contributing to this repo and citing:
@article{ALM2Vec2026,
title={ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models},
author={Fengjie Lu and Chenang Jiang and Jiarui Hai and Helin Wang and Aaron Yee},
journal={arXiv preprint arXiv:TBD},
year={2026}
}