Voice is our primary means of communication, and telephony has enabled us to connect using our voices for over a century. The phone call as we know it has evolved from analogue to digital, from fixed to mobile, and from low speech quality to natural speech quality. One major advancement, however, was still lacking: a way to transmit fully authentic, immersive sound live.
The introduction of the IVAS (Immersive Voice and Audio Services) codec, standardized by 3GPP in Release 18 in June this year, represents a major advancement in audio technology. Unlike traditional monophonic voice calls, IVAS enables the transmission of immersive, three-dimensional audio, offering a richer, more lifelike communication experience. This innovation is made possible by new audio formats optimized for conversational spatial audio experiences. One such example is the new Metadata-Assisted Spatial Audio format, MASA, which uses only two audio channels plus metadata to describe the spatial audio scene. Spatial audio calls allow users to experience sound as though it were happening in real life, complete with features like head tracking.
Below we will explore the challenges of bringing 3D live calling to mobile phones, the requirements addressed in spatial communication and the new IVAS codec, and the game-changing impact live 3D audio will have for people, mobile operators, and businesses.
Head of Product Management, Nokia Technologies.
Bringing 3D calling to Mobile Phones
The last major innovation in voice calling was the EVS codec, introduced in 2014 and recognized by consumers as HD Voice+. While it significantly enhanced call quality, like all previous codecs, it only offered a monophonic listening experience.
The introduction of 3D audio calling, the biggest leap in voice-calling audio technology in decades, brings the challenge of creating an authentic, immersive experience in everyday communication. Voice technology has evolved significantly, from analogue to digital, from fixed to mobile, and from low quality to natural speech quality, but spatial audio, where sounds are perceived as naturally coming from all around, is far more complex to recreate in mobile environments.
Achieving this level of immersive sound has been easier in controlled settings like movie theaters and video games, where sound design is a core element. Reproducing it in everyday mobile calls introduces a range of technical hurdles, including real-time spatial sound processing, hardware constraints, and compatibility across devices.
The Immersive Voice and Audio Services (IVAS) voice codec is therefore the most significant step forward in voice-call audio technology for decades.
How to Tackle and Overcome Spatial Communication Challenges
Several challenges had to be overcome for Immersive Voice to become a robust spatial audio solution. A key issue is noise reduction, which is crucial for speech clarity in settings like concerts or the outdoors. Traditional noise reduction methods tend to filter out only continuous sounds, such as air conditioning hums or traffic noise, and leave other background noise in place. Wind interference poses a further challenge by introducing unwanted noise and causing fluctuations in audio levels.
However, recent advancements in machine learning and intelligent noise reduction have addressed these issues. Immersive audio technology, for example, is designed to intelligently adjust how much background noise is reduced depending on the surrounding environment, while also letting users manually adjust the level of noise reduction. This ensures that the essential sounds are transmitted while unwanted background noise is minimized.
Immersive audio setups with multiple microphones and loudspeakers also face a major obstacle – acoustic echo. This happens when microphones pick up sound from nearby speakers, causing unwanted feedback. The problem is even more challenging in setups with spatial audio, where the placement and number of loudspeakers affect sound quality and the device's ability to capture spatial audio. Traditional Acoustic Echo Cancellation (AEC) methods often do not work well in these complex environments. To solve this, a machine-learning-based spatial AEC solution was created, which removes the loudspeaker sound from the microphone input using a reference signal. This improves audio quality, especially for spatial audio in real-time voice applications.
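To make the idea of removing loudspeaker sound "using a reference signal" concrete, here is a minimal sketch of a classical single-channel normalized LMS (NLMS) echo canceller. This is not the machine-learning-based spatial AEC described above, whose details the article does not give; it is a textbook stand-in that only illustrates the reference-signal principle. The function names, filter length, step size, and the simulated echo path are all assumptions chosen for illustration.

```python
import random


def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Classical NLMS sketch: subtract an adaptive echo estimate from the mic signal.

    mic  -- microphone samples (echo, plus near-end sound in a real call)
    ref  -- loudspeaker reference samples, time-aligned with mic
    taps -- length of the adaptive filter (illustrative value)
    """
    weights = [0.0] * taps   # running estimate of the loudspeaker-to-mic echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    output = []
    for mic_sample, ref_sample in zip(mic, ref):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate          # echo-reduced output sample
        energy = sum(x * x for x in history) + eps  # normalization term
        weights = [w + mu * error * x / energy for w, x in zip(weights, history)]
        output.append(error)
    return output


def demo():
    """Simulate a pure-echo microphone signal; return (echo power, residual power)."""
    rng = random.Random(0)
    n = 4000
    ref = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    echo_path = [0.6, 0.3, -0.2]  # hypothetical room response, for illustration only
    mic = [
        sum(h * ref[i - k] for k, h in enumerate(echo_path) if i - k >= 0)
        for i in range(n)
    ]
    out = nlms_echo_cancel(mic, ref)

    def power(samples):
        return sum(s * s for s in samples) / len(samples)

    # Compare power over the last 500 samples, after the filter has adapted.
    return power(mic[-500:]), power(out[-500:])


if __name__ == "__main__":
    echo_power, residual_power = demo()
    print(f"echo power: {echo_power:.4f}, residual power: {residual_power:.2e}")
```

In this simplified setting the microphone hears only the echo, so the residual falls toward zero once the filter has adapted. A real call also contains near-end speech and, for spatial audio, several loudspeakers and microphones, which is where such simple linear approaches tend to struggle.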
Introducing the IVAS codec
To bring spatial audio to mobile phone calling, in addition to Over-the-Top (OTT) services, the 3rd Generation Partnership Project (3GPP) recently adopted a new voice codec standard. Developed through the collaboration of 13 companies, the IVAS codec standard was included in 3GPP Release 18 and builds on the widely used Enhanced Voice Services (EVS) codec. Importantly, the IVAS codec maintains full backwards compatibility, ensuring seamless interoperability with existing voice services.
One of the key innovations during IVAS standardization was the creation of a new parametric audio format, Metadata-Assisted Spatial Audio (MASA), designed specifically for devices with limited form factors, like smartphones. The IVAS codec integrates a built-in renderer that supports head-tracked binaural audio and multi-loudspeaker playback using the MASA format.
Additionally, an immersive voice client SDK can serve as the IVAS front-end, capturing spatial audio from device microphones and converting it into the standardized MASA format. This technology enables true 3D immersive audio experiences for various types of voice calls.
The Power of 3D Live Audio: What it Means for People, Operators, and Businesses
New immersive 3D audio revolutionizes the audio experience for consumers, enterprises, and industries. For consumers, it deepens engagement in interactions with friends and family by sharing local sounds, whether live-streamed or recorded, and offers full immersion in synchronized metaverse experiences. For enterprises, 3D audio voice calling unlocks new capabilities, from enhanced customer experience through directional audio to transforming team collaboration and decision-making. In industrial settings, audio analytics can drive automated processes like predictive maintenance, streamlining operations, and boosting efficiency.
In order to enable these experiences across diverse network conditions, service providers need scalable solutions that optimize performance regardless of bandwidth constraints. The 3GPP IVAS standard codec accommodates bitrates ranging from 13.2 to 512 kbit/s, ensuring immersive audio quality whether used in congested networks or high-quality streaming environments. This scalability empowers service providers to support more users while delivering rich audio experiences.
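As a rough illustration of what that bitrate range means per packet, the sketch below converts a bitrate into payload bytes per audio frame and picks the highest rate that fits an available bandwidth. Only the 13.2 and 512 kbit/s endpoints come from the text above; the 20 ms frame duration, the intermediate ladder values, and the function names are assumptions for illustration, and transport overhead such as RTP headers is ignored.

```python
import math

# Illustrative ladder: only the 13.2 and 512 kbit/s endpoints come from the article.
EXAMPLE_LADDER_KBPS = (13.2, 24.4, 48.0, 96.0, 256.0, 512.0)


def payload_bytes_per_frame(bitrate_kbps, frame_ms=20):
    """Codec payload size for one frame, assuming a fixed frame duration."""
    bits = round(bitrate_kbps * frame_ms)  # kbit/s * ms = bits
    return math.ceil(bits / 8)


def pick_bitrate(available_kbps, ladder=EXAMPLE_LADDER_KBPS):
    """Highest ladder bitrate that fits the available bandwidth, or None."""
    fitting = [rate for rate in ladder if rate <= available_kbps]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    for rate in (13.2, 512.0):
        print(rate, "kbit/s ->", payload_bytes_per_frame(rate), "bytes per frame")
    print("100 kbit/s available ->", pick_bitrate(100.0), "kbit/s")
```

Under these assumptions the lowest rate carries 33 bytes of codec payload per frame and the highest carries 1,280 bytes, roughly a 39-fold spread that gives operators room to trade audio richness against network load.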
Looking to the future, voice-based user behavior is expected to continue evolving. Beyond traditional calls, spatial audio communication will expand to include semi-synchronous messaging through popular apps, such as people sending voice clips to each other, and more extensive use of group calls. With the rise of extended reality devices and services across industries, the scope of voice communication is set to become even broader, with immersion as a defining feature. A key factor in this evolution will be standardization and the integration of the IVAS codec into the latest 5G-Advanced standard, which is essential to ensure the interoperability needed to bring 3D calling to every phone at the push of a button.
This article was produced as part of TechRadarPro's Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro