UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching
Neta Glazer, Aviv Navon, Yael Segal-Feldman, Aviv Shamsian, Hilit Segev, Asaf Buchnick, Menachem Pirchi, Gil Hetz, Joseph Keshet
aiOla Research
Workshop on Machine Learning for Audio, ICML 2025
Abstract
Recent advances in Text-to-Speech (TTS) have enabled highly natural speech synthesis, yet integrating speech with complex background environments remains challenging. We introduce UmbraTTS, a flow-matching-based TTS model that jointly generates speech and environmental audio, conditioned on text and acoustic context. Our model allows fine-grained control over background volume and produces diverse, coherent, and context-aware audio scenes. A key challenge is the lack of data in which speech and background audio are aligned in natural context. To overcome this, we propose a self-supervised framework that extracts speech, background audio, and transcripts from unannotated recordings. Extensive evaluations demonstrate that UmbraTTS significantly outperforms existing baselines, producing natural, high-quality, environmentally aware audio.
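To make the term "flow matching" in the abstract concrete, the following is a minimal sketch of a conditional flow matching training loss. It is illustrative only: it assumes a straight-line path from Gaussian noise to the data, which is one common choice and not necessarily the one UmbraTTS uses, and every name in it (`interpolate`, `target_velocity`, `flow_matching_loss`, `predict_velocity`, the `cond` dictionary) is hypothetical rather than taken from the paper.

```python
import random


def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t (assumed path choice)."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, cond, rng):
    """One-sample estimate of the loss: squared error between predicted and target velocity.

    predict_velocity(xt, t, cond) stands in for the model; cond is a placeholder
    for conditioning such as text and background audio (hypothetical structure).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample, same length as the data
    t = rng.random()                        # time drawn uniformly from [0, 1)
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t, cond)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)


# Toy usage: a "model" that always predicts zero velocity.
rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]  # stand-in for a short audio feature vector
cond = {"text": "hello", "background": [0.0, 0.0, 0.0]}  # hypothetical conditioning
zero_model = lambda xt, t, cond: [0.0] * len(xt)
loss = flow_matching_loss(zero_model, x1, cond, rng)
```

In a real system the model would be a neural network trained to minimize this loss over many samples, and generation would integrate the learned velocity field from noise to audio features. The sketch only shows the shape of the objective.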
Architecture
TTS With Environmental Conditioning
Background Condition | SER = 0 | SER = 0.5 | SER = 1
1. I can't believe how fast this year is flying by.
2. Let me know if you need anything, I'm happy to help.
3. Excuse me, you dropped something!
4. I need to find an ATM, do you know where one is?
5. Do you want to meet up for lunch tomorrow?
6. I just finished a great book. Do you like to read?
7. Oh no, my phone battery is about to die!
8. I think I've seen you around here before.
9. Are you free this weekend? We should do something fun!
10. Wow, it's really crowded today!
11. Wow, it's really crowded today!
12. I love going for a walk in the evening, it's so relaxing.
13. Do you know what time the next train arrives?
14. I'm trying to learn a new language. It's pretty challenging!
15. Are you free this weekend? We should do something fun!
16. Ugh, I forgot my umbrella and now it's raining!
17. I think I got lost. Can you help me find this address?