NeurIPS 2024 · Past · Speech & audio · Generative models

Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation


Submission deadline: Sep 21, 2024, 23:59 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (48)

Fetched from OpenReview (v2) on 2026-06-10.

  1. 3D Audio-Visual Segmentation

    Artem Sokolov, Swapnil Bhosale, Xiatian Zhu · PDF
  2. A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation

    Alexander H. Liu, Qirui Wang, Yuan Gong, James R. Glass · PDF
  3. Articulatory Synthesis of Speech and Diverse Vocal Sounds via Optimization

    Luke Mo, Manuel Cherep, Nikhil Singh, Quinn Langford, Patricia Maes · PDF
  4. AudioSetCaps: Enriched Audio Captioning Dataset Generation Using Large Audio Language Models

    Jisheng Bai, Haohe Liu, Mou Wang, Dongyuan Shi, Wenwu Wang, Mark D Plumbley, Woon-Seng Gan, Jianfeng Chen · PDF
  5. AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

    Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, Yapeng Tian · PDF
  6. Benchmarking Music Generation Models and Metrics via Human Preference Studies

    Ahmet Solak, Florian Grötschla, Luca A Lanzendörfer, Roger Wattenhofer · PDF
  7. BLAP: Bootstrapping Language-Audio Pre-training for Music Captioning

    Luca A Lanzendörfer, Constantin Pinkl, Nathanaël Perraudin, Roger Wattenhofer · PDF
  8. Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation

    Junwon Lee, Modan Tailleur, Laurie M. Heller, Keunwoo Choi, Mathieu Lagrange, Brian McFee, Keisuke Imoto, Yuki Okamoto · PDF
  9. Coarse-to-Fine Text-to-Music Latent Diffusion

    Luca A Lanzendörfer, Tongyu Lu, Nathanaël Perraudin, Dorien Herremans, Roger Wattenhofer · PDF
  10. Contextual Speech Emotion Recognition with Large Language Models and ASR-Based Transcriptions

    Enshi Zhang, Christian Poellabauer · PDF
  11. Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation

    Marco Pasini, Javier Nistal, Stefan Lattner, George Fazekas · PDF
  12. Contrastive Lyrics Alignment with a Timestamp-Informed Loss

    Timon Kick, Florian Grötschla, Luca A Lanzendörfer, Roger Wattenhofer · PDF
  13. DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech

    Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans · PDF
  14. Decoding Musical Perception: Music Stimuli Reconstruction from Brain Activity

    Matteo Ciferri, Matteo Ferrante, Nicola Toschi · PDF
  15. Decoding Strategy with Perceptual Rating Prediction for Language Model-Based Text-to-Speech Synthesis

    Kazuki Yamauchi, Wataru Nakata, Yuki Saito, Hiroshi Saruwatari · PDF
  16. DGFM: Full Body Dance Generation Driven by Music Foundation Models

    Xinran Liu, Zhenhua Feng, Diptesh Kanojia, Wenwu Wang · PDF
  17. Diffusion-based Speech Enhancement: Demonstration of Performance and Generalization

    Julius Richter, Timo Gerkmann · PDF
  18. Disentangling Multi-instrument Music Audio for Source-level Pitch and Timbre Manipulation

    Yin-Jyun Luo, Kin Wai Cheuk, Woosung Choi, Wei-Hsiang Liao, Keisuke Toyama, Toshimitsu Uesaka, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Simon Dixon, Yuki Mitsufuji · PDF
  19. Do music LLMs learn symbolic concepts? A pilot study using probing and intervention

    Wenye Ma, Xinyue Li, Gus Xia · PDF
  20. Efficient Generative Multimodal Integration (EGMI): Enabling Audio Generation from Text-Image Pairs through Alignment with Large Language Models

    Taemin Kim, Wooyeol Baek, Heeseok Oh · PDF
  21. FSD: Acoustic Echo Cancellation with Fewer Step Diffusion

    Yang Liu, Li Wan, Yiteng Huang, Ming Sun, Changsheng Zhao, Zhaoheng Ni, Xinhao Mei, Yangyang Shi, Florian Metze · PDF
  22. Generating Vocals from Lyrics and Musical Accompaniment

    Georg Streich, Luca A Lanzendörfer, Florian Grötschla, Roger Wattenhofer · PDF
  23. High Fidelity Text-Guided Music Editing via Single-Stage Flow Matching

    Gael Le Lan, Bowen Shi, Zhaoheng Ni, Sidd Srinivasan, Anurag Kumar, Brian Ellis, David Kant, Varun K. Nagaraja, Ernie Chang, Wei-Ning Hsu, Yangyang Shi, Vikas Chandra · PDF
  24. Improving Musical Accompaniment Co-creation via Diffusion Transformers

    Javier Nistal, Marco Pasini, Stefan Lattner · PDF
  25. Improving Source Extraction with Diffusion and Consistency Models

    Tornike Karchkhadze, Mohammad Rasool Izadi, Shuo Zhang · PDF
  26. Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses

    Suhita Ghosh, Frank Dreyer, Tim Thiele, Frederic Lorbeer, Sebastian Stober · PDF
  27. Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM

    Robin Shing-Hei Yuen, Timothy Tin-Long Tse, Jian Zhu · PDF
  28. Latent Diffusion Model for Audio: Generation, Quality Enhancement, and Neural Audio Codec

    Haohe Liu, Wenwu Wang, Mark D Plumbley · PDF
  29. LOCKEY: A Novel Approach to Model Authentication and Deepfake Tracking

    Mayank Kumar Singh, Naoya Takahashi, Wei-Hsiang Liao, Yuki Mitsufuji · PDF
  30. LoVA: Long-form Video-to-Audio Generation

    Xin Cheng, Xihua Wang, Yihan Wu, Yuyue Wang, Ruihua Song · PDF
  31. MLADDC: Multi-Lingual Audio Deepfake Detection Corpus

    Arth Juhul Shah, Ravindrakumar M. Purohit, Dharmendra H. Vaghera, Hemant Patil · PDF
  32. Multi-Source Music Generation with Latent Diffusion

    Zhongweiyang Xu, Debottam Dutta, Yu-Lin Wei, Romit Roy Choudhury · PDF
  33. MusicScore: A Dataset for Music Score Modeling and Generation

    Yuheng Lin, Zheqi Dai, Qiuqiang Kong · PDF
  34. Neural Audio Codec for Latent Music Representations

    Luca A Lanzendörfer, Florian Grötschla, Amir Dellali, Roger Wattenhofer · PDF
  35. One-shot Text-aligned Virtual Instrument Generation Utilizing Diffusion Transformer

    Qihui Yang, Jiahe Lei, Qiuqiang Kong · PDF
  36. Parrot: Autoregressive Spoken Dialogue Language Modeling with Decoder-only Transformers

    Ziqiao Meng, Qichao Wang, Wenqian Cui, Yifei Zhang, Bingzhe Wu, Irwin King, Liang Chen, Peilin Zhao · PDF
  37. SNAC: Multi-Scale Neural Audio Codec

    Hubert Siuzdak, Florian Grötschla, Luca A Lanzendörfer · PDF
  38. Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions

    Yi Yuan, Dongya Jia, Xiaobin Zhuang, Yuanzhe Chen, Zhengxi Liu, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xubo Liu, Xiyuan Kang, Mark D Plumbley, Wenwu Wang · PDF
  39. SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

    Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji · PDF
  40. Spatially-Aware Losses for Enhanced Neural Acoustic Fields

    Christopher A. Ick, Gordon Wichern, Yoshiki Masuyama, François Germain, Jonathan Le Roux · PDF
  41. Style Mixture of Experts for Expressive Text-To-Speech Synthesis

    Ahad Jawaid, Shreeram Suresh Chandra, Junchen Lu, Berrak Sisman · PDF
  42. Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation

    Chenxu Xiong, Ruibo Fu, Shuchen Shi, Zhengqi Wen, Tao Wang, Chenxing Li, Chunyu Qiang, Yuankun Xie, Xin Qi, Guanjun Li, Zizheng Yang · PDF
  43. Text-to-Audio Generation via Bridging Audio Language Model and Latent Diffusion

    Zhenyu Wang, Chenxing Li, Yong Xu, Chunlei Zhang, John H. L. Hansen, Dong Yu · PDF
  44. Three-modal guidance for symbolic music generation: melody, structure, texture

    Daniel Alexander Lucht, David Philip Leins, Dimitri von Rütte, Alexandra Moringen · PDF
  45. Towards Temporally Synchronized Visually Indicated Sounds Through Scale-Adapted Positional Embeddings

    Xinhao Mei, Gael Le Lan, Haohe Liu, Zhaoheng Ni, Varun K. Nagaraja, Anurag Kumar, Yangyang Shi, Vikas Chandra · PDF
  46. VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

    Bing Han, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Yanmin Qian, Eric Liu, Sheng Zhao, Jinyu Li, Furu Wei · PDF
  47. Vision Language Models Are Few-Shot Audio Spectrogram Classifiers

    Satvik Dixit, Laurie Heller, Chris Donahue · PDF
  48. What do MLLMs hear? Examining the interaction between LLM and audio encoder components in Multimodal Large Language Models

    Enis Berk Çoban, Michael I Mandel, Johanna Devaney · PDF