Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation
- Submission deadline: Sep 21, 2024, 23:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (48)
Fetched from OpenReview (v2) on 2026-06-10.
- 3D Audio-Visual Segmentation
- A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation
- Articulatory Synthesis of Speech and Diverse Vocal Sounds via Optimization
- AudioSetCaps: Enriched Audio Captioning Dataset Generation Using Large Audio Language Models
- AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation
- Benchmarking Music Generation Models and Metrics via Human Preference Studies
- BLAP: Bootstrapping Language-Audio Pre-training for Music Captioning
- Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation
- Coarse-to-Fine Text-to-Music Latent Diffusion
- Contextual Speech Emotion Recognition with Large Language Models and ASR-Based Transcriptions
- Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation
- Contrastive Lyrics Alignment with a Timestamp-Informed Loss
- DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech
- Decoding Musical Perception: Music Stimuli Reconstruction from Brain Activity
- Decoding Strategy with Perceptual Rating Prediction for Language Model-Based Text-to-Speech Synthesis
- DGFM: Full Body Dance Generation Driven by Music Foundation Models
- Diffusion-based Speech Enhancement: Demonstration of Performance and Generalization
- Disentangling Multi-instrument Music Audio for Source-level Pitch and Timbre Manipulation
- Do music LLMs learn symbolic concepts? A pilot study using probing and intervention
- Efficient Generative Multimodal Integration (EGMI): Enabling Audio Generation from Text-Image Pairs through Alignment with Large Language Models
- FSD: Acoustic Echo Cancellation with Fewer Step Diffusion
- Generating Vocals from Lyrics and Musical Accompaniment
- High Fidelity Text-Guided Music Editing via Single-Stage Flow Matching
- Improving Musical Accompaniment Co-creation via Diffusion Transformers
- Improving Source Extraction with Diffusion and Consistency Models
- Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses
- Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM
- Latent Diffusion Model for Audio: Generation, Quality Enhancement, and Neural Audio Codec
- LOCKEY: A Novel Approach to Model Authentication and Deepfake Tracking
- LoVA: Long-form Video-to-Audio Generation
- MLADDC: Multi-Lingual Audio Deepfake Detection Corpus
- Multi-Source Music Generation with Latent Diffusion
- MusicScore: A Dataset for Music Score Modeling and Generation
- Neural Audio Codec for Latent Music Representations
- One-shot Text-aligned Virtual Instrument Generation Utilizing Diffusion Transformer
- Parrot: Autoregressive Spoken Dialogue Language Modeling with Decoder-only Transformers
- SNAC: Multi-Scale Neural Audio Codec
- Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions
- SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation
- Spatially-Aware Losses for Enhanced Neural Acoustic Fields
- Style Mixture of Experts for Expressive Text-To-Speech Synthesis
- Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation
- Text-to-Audio Generation via Bridging Audio Language Model and Latent Diffusion
- Three-modal guidance for symbolic music generation: melody, structure, texture
- Towards Temporally Synchronized Visually Indicated Sounds Through Scale-Adapted Positional Embeddings
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
- Vision Language Models Are Few-Shot Audio Spectrogram Classifiers
- What do MLLMs hear? Examining the interaction between LLM and audio encoder components in Multimodal Large Language Models