ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models

Fengjie Lu, Chenang Jiang, Jiarui Hai, Helin Wang, Aaron Yee

ZJU JHU Humanify

TL;DR

Existing contrastive audio–text embeddings are tuned mainly for caption matching, which limits their support for diverse retrieval goals and controllable behavior. ALM2Vec transfers the audio understanding and instruction-following abilities of pretrained large audio–language models (LALMs) into a unified embedding space for retrieval across domains and tasks, including instruction-aware search for audio QA and aspect-conditioned retrieval. ALM2Vec is competitive on standard audio and speech retrieval benchmarks, and as a single model it also shows promising compositional and controllable retrieval across domains, tasks, and user intents.

Figure: ALM2Vec model overview
Figure: ALM2Vec benchmark evaluation results

Demo

We showcase ALM2Vec across four retrieval settings. Each part contains several illustrative examples.

instruction-aware audio → audio

The same audio clips are encoded in query mode under different instructions. Because each embedding follows its instruction, the resulting similarities reflect instruction following: the same audio pair lands closer together or farther apart depending on what the instruction asks the model to attend to.

audio → text retrieval

Each example contains three audio–text pairs. The 3×3 matrix reports the pairwise similarity between every audio and every text, with the diagonal (matching pairs) expected to dominate.
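
As a rough illustration of how such a 3×3 matrix can be read, the sketch below builds a pairwise similarity matrix from toy vectors and checks that each row peaks on its diagonal entry. It assumes cosine similarity as the scoring function, which this page does not state; the embedding values and the helper names `cosine`, `similarity_matrix`, and `diagonal_dominates` are hypothetical and are not part of ALM2Vec.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(audio_embs, text_embs):
    """Rows index audio clips, columns index texts."""
    return [[cosine(a, t) for t in text_embs] for a in audio_embs]

def diagonal_dominates(sim):
    """True if every row's largest entry sits on the diagonal."""
    return all(
        max(range(len(row)), key=row.__getitem__) == i
        for i, row in enumerate(sim)
    )

# Toy 3-dimensional vectors standing in for real embeddings (made up for illustration).
audio_embs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
text_embs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.1, 1.0]]

sim = similarity_matrix(audio_embs, text_embs)
for row in sim:
    print(" ".join(f"{s:.3f}" for s in row))
print("diagonal dominates:", diagonal_dominates(sim))
```

The check here is row-wise only, which matches the audio → text direction (each audio picks its best text); a stricter check would also require each column to peak on the diagonal.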

text → audio retrieval

This setting mirrors audio → text retrieval: each example contains three text–audio pairs, and the 3×3 matrix reports the pairwise similarity between every text and every audio.

audio + question → answer

An audio clip paired with a question forms the query. Several candidate answers act as independent documents; the query×doc similarities (a 1×N row) should peak on the correct answer.
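
The 1×N scoring described above can be sketched as follows. This is a minimal example under stated assumptions: the query vector stands in for a joint audio-plus-question embedding, the candidate vectors stand in for answer embeddings, and cosine similarity is assumed as the score. All values, the candidate strings, and the helper `rank_answers` are hypothetical and do not come from ALM2Vec.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_answers(query_emb, answer_embs):
    """Return the 1xN similarity row and the index of the top-scoring answer."""
    scores = [cosine(query_emb, a) for a in answer_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores, best

# Hypothetical embedding of the (audio clip, question) query.
query_emb = [0.2, 0.9, 0.1]

# Hypothetical candidate answers, each embedded as an independent document.
candidates = ["a dog barking", "a car horn", "rain falling"]
answer_embs = [[0.1, 1.0, 0.0], [0.9, 0.1, 0.2], [0.0, 0.1, 1.0]]

scores, best = rank_answers(query_emb, answer_embs)
for text, score in zip(candidates, scores):
    print(f"{score:.3f}  {text}")
print("retrieved answer:", candidates[best])
```

Because each candidate is scored independently, adding or removing candidates changes the length of the row but not the score of any existing answer.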

License

The repository is licensed under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International).

Citation

If you find this work useful, please consider contributing to this repo and citing:

@article{ALM2Vec2026,
  title={ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models},
  author={Fengjie Lu and Chenang Jiang and Jiarui Hai and Helin Wang and Aaron Yee},
  journal={arXiv preprint arXiv:TBD},
  year={2026}
}