Today, we are talking about the development of Moshi, a speech-text foundation model designed for real-time dialogue. Moshi builds on Helium, a text language model, and Mimi, a neural audio codec, to handle both the linguistic and acoustic aspects of dialogue. The authors introduce the RQ-Transformer architecture, which enables efficient modeling of long audio sequences. A further innovation is the Inner Monologue training procedure, which improves the model's fluency and linguistic capabilities. The authors also investigate the impact of model compression on Moshi's performance and explore methods for watermarking and identifying content generated by the model.