We study generative models for creating realistic and controllable content. Our primary focus is on speech and audio generation, but we also explore generative approaches for images, video, and other media.
We build AI that can engage in natural, real-time conversations. Our research focuses on AI that can both understand and express emotions for empathetic communication, as well as on full-duplex, proactive interaction: listening and speaking simultaneously, and even initiating conversations on its own, like a human companion. Alongside this, we develop assistant capabilities such as tool use and information retrieval.
We aim to extend speech-centric AI toward systems that can see, hear, read, and act. By integrating speech with vision and text understanding, we seek to create intelligent agents that can perceive, reason, and take action in real-world environments.