Module 3: Approaches
How current technologies reconfigure sound: channels, objects, scenes, and binaural listening.
Next Generation Audio (NGA)
Let’s begin with the concept of Next Generation Audio (NGA), which serves as a general framework to group and make sense of various ideas related to recent transformations in the production, distribution, and experience of sound. It is a broad category through which different technical and aesthetic developments in audio are articulated.
Next Generation Audio (NGA) refers to a set of audio technologies and formats designed to offer more immersive, flexible, and adaptive listening experiences.
These technologies allow sound content to be represented in separate layers—objects, channels, metadata—which not only enhances spatiality, but also enables dynamic personalization of the listening experience based on the environment, playback device, or user preferences.
Historical Context of Next Generation Audio (NGA)
To understand the concept of Next Generation Audio (NGA), it is necessary to place it within the technological transformation context that gave rise to it. Traditional audio mixing systems are based on a channel-based structure, where each sound is assigned to a specific channel. For example, the voice is directed to the center channel; main sound effects to the left and right channels.
This model began to show its limitations in the mid-2000s, alongside the expansion of flat-screen TVs. Between 2000 and 2008, high-profile events such as the FIFA World Cup in Germany and the Super Bowl drove massive sales of these devices. However, the slim design of the new screens reduced the space available for speakers, which degraded sound quality in the home environment.
Faced with this limitation, soundbars emerged as a compact solution. Their design allowed for a more precise and immersive sound experience without requiring bulky systems. The incorporation of wireless technologies like Bluetooth facilitated integration with other devices and enabled multi-speaker setups at home.
However, this advancement exposed new challenges, mainly related to interoperability between different devices and software. Technological fragmentation highlighted the need to develop new formats and codecs capable of ensuring compatibility and continuity in the listening experience. In this context, multichannel systems—especially the 5.1 model—became progressively standardized. Dolby established itself as a central player, setting its systems as industry standards in cinema and commercial audio applications.
The emergence of NGA responds to this sequence: it aims to overcome the limitations of the channel-based model by proposing new encoding and spatialization methods that consider the variability of devices and listening environments. This is not merely a technical change but a shift that affects how sound is produced, distributed, and experienced.
Approaches to Next Generation Audio (NGA)
Within the scope of Next Generation Audio (NGA), four main approaches to representing sound are recognized: channel-based audio, object-based audio, scene-based audio, and binaural audio.
Below is a summary of each of these methods, which will be explored in detail in the upcoming modules.

- Channel-Based Audio
- Object-Based Audio
- Scene-Based Audio (Ambisonics)
- Binaural Audio
Channel-Based Audio

Definition and Technical Explanation
Channel-based audio is the traditional method of sound reproduction. In this system, the premixed audio is assigned to fixed positions in space, using discrete channels that correspond to specific loudspeakers. Each channel delivers its signal to a designated speaker without further modification. Formats such as stereo, 5.1, and 7.1 are examples of this technique, which has been an industry standard for decades.
In surround sound, two main variants are distinguished: discrete channels and matrixed channels. Discrete channels transmit independent signals, as is the case in stereo with the left and right channels. Matrixed channels, on the other hand, embed additional information within a limited number of channels and use decoding techniques to recreate the illusion of greater spatiality. While matrix technology has evolved, discrete systems still offer clearer channel separation and more precise spatialization.
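The matrix idea can be sketched in a few lines of code. This is a simplified illustration only: a real Dolby Surround encoder applies a 90-degree phase shift to the surround channel, which the plain sign inversion used here merely approximates.

```python
import numpy as np

def matrix_encode(L, R, C, S):
    """Fold four channels (L, C, R, S) into a two-channel Lt/Rt pair.

    Simplified sketch: sign inversion stands in for the phase shift
    a real matrix encoder applies to the surround channel.
    """
    g = 1 / np.sqrt(2)  # -3 dB for channels shared between Lt and Rt
    Lt = L + g * C - g * S
    Rt = R + g * C + g * S
    return Lt, Rt

def matrix_decode(Lt, Rt):
    """Recover approximate centre and surround signals from Lt/Rt."""
    g = 1 / np.sqrt(2)
    C_hat = g * (Lt + Rt)  # in-phase content maps to the centre
    S_hat = g * (Rt - Lt)  # out-of-phase content maps to the surround
    return C_hat, S_hat
```

Encoding a surround-only signal and decoding it back recovers the surround content while leaving the centre silent, which is exactly the "illusion of greater spatiality" the matrix trick relies on.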
In the Audio Definition Model (ADM), this type of audio is referred to as DirectSpeakers, to avoid confusion with other uses of the term “channel” in different technical contexts.
Pros and cons
Production
- Advantage: Workflows are straightforward. Mixing tools and practices are well-established and familiar to professionals.
- Disadvantage: Flexibility is limited. Content mixed for a specific setup (e.g., 5.1) does not always adapt well to systems with different numbers or arrangements of speakers. This often requires separate mixes for each format.
Distribution
- Advantage: Compatibility is broad. This method works with most existing playback systems, from simple stereo to surround configurations.
- Disadvantage: It is inefficient for customization. Offering options such as multiple languages or additional tracks requires distributing separate full mixes, which increases bandwidth usage.
Listener experience
- Advantage: It offers a clear and consistent experience when playback occurs on a system that matches the configuration intended during mixing.
- Disadvantage: Immersion is limited compared to object- or scene-based systems. The quality of the experience depends on the user having exactly the speaker setup for which the mix was designed.
Implementation in Formats and Technologies
Channel-based audio underpins formats such as stereo, 5.1, and 7.1. Even the more recent variants—like 5.1.4 or 7.1.4, which incorporate height speakers—still assign channels to fixed positions. Codecs such as Dolby AC-3, Dolby Digital Plus, and DTS-HD Master Audio support these configurations and have been used in media like DVDs, Blu-ray discs, and streaming platforms.
Typical Workflow
Content creation based on channels starts with recording using microphones in fixed positions or editing existing signals. These signals are mixed and assigned to specific channels that correspond to speaker locations. Production is carried out under the premise that playback will occur on a system with the same speaker arrangement used during mixing.
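When the playback layout does not match the mix, a downmix is applied. As a sketch, the standard ITU-R BS.775 coefficients fold a 5.1 mix down to stereo (the LFE channel is commonly discarded in this downmix):

```python
def downmix_51_to_stereo(L, R, C, LFE, Ls, Rs):
    """Stereo downmix of a 5.1 sample frame per ITU-R BS.775.

    LFE is intentionally unused: it is commonly dropped when
    folding down to stereo.
    """
    g = 10 ** (-3 / 20)  # -3 dB, approximately 0.707
    Lo = L + g * C + g * Ls
    Ro = R + g * C + g * Rs
    return Lo, Ro
```

A centre-only signal ends up equally in both output channels at roughly -3 dB, which is why dialogue survives the fold-down but loses its discrete anchor point.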
Object-Based Audio

Definition and Technical Explanation
Object-based audio introduces a different model of sound representation. Instead of assigning sounds to fixed channels, each sound is treated as an independent object with metadata describing its position, trajectory, size, and other properties in a three-dimensional space. During production, these objects are not tied to specific speakers or channels.
The central element of this approach is the metadata. This data defines the location and movement of each object in space, as well as characteristics such as apparent size or acoustic behavior. It also enables adaptive functions, such as adjusting dialogue levels, selecting languages, or activating additional tracks. During playback, a compatible system—the renderer—interprets this metadata and positions the audio objects according to the available speaker setup.
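A minimal sketch of what such metadata can look like follows. The field names here are illustrative and are not taken from any particular format's specification:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AudioObject:
    """One sound plus the metadata a renderer would interpret.

    Field names are hypothetical, chosen only to mirror the
    properties described in the text.
    """
    name: str
    azimuth: float            # degrees; 0 = front, positive = left
    elevation: float          # degrees; 0 = ear level, positive = up
    distance: float           # metres from the listener
    size: float = 0.0         # apparent spatial extent, 0..1
    gain_db: float = 0.0      # allows e.g. dialogue-level adjustment
    language: Optional[str] = None  # allows e.g. language selection

helicopter = AudioObject("helicopter", azimuth=-90.0, elevation=45.0,
                         distance=10.0, size=0.2)
```

Because the object carries its position rather than a channel assignment, the same `helicopter` description can be rendered to a soundbar, a 7.1.4 room, or headphones.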
Pros and cons
Production
- Advantage: Object-based audio allows greater creative flexibility by positioning and moving sounds in a three-dimensional space without relying on speaker configuration. This enables the creation of more immersive and realistic sound experiences.
- Disadvantage: The workflows and tools required for production can be more complex than those used in channel-based audio. Additionally, managing multiple audio objects can be challenging.
Distribution
- Advantage: A single audio stream can adapt to various playback systems, making the distribution of personalized audio easier and more efficient.
- Disadvantage: It requires the playback system to have compatible decoders or renderers.
Listener experience
- Advantage: It offers an immersive three-dimensional listening experience, with precise sound localization and customization options such as dialogue enhancement and language selection.
- Disadvantage: The quality of the experience depends on the sophistication of the renderer and the playback system.
Implementation in Formats and Technologies
Dolby Atmos: Combines channel-based audio beds with dynamic audio objects accompanied by metadata. It can handle up to 128 audio elements across channels and objects.
DTS:X: An object-based format that flexibly adapts to different speaker configurations. It has no fixed limit on the number of objects and uses MDA (Multidimensional Audio), an open, royalty-free platform.
MPEG-H: A flexible codec that supports channel-, object-, and scene-based audio, offering advanced customization and accessibility features.
Auro-3D: A format that adds height layers to traditional surround sound, creating a more immersive experience. Strictly speaking, it is primarily a layered, channel-based format; object support is added by its AuroMax extension.
Sony 360 Reality Audio: An object-based format that places individual sound sources (voices, instruments, effects) in a 360° spherical sound field, primarily designed for streaming services and headphones, but also adaptable to speaker systems.
Practical Applications
Cinema: Enables immersive soundtracks with sounds precisely positioned and moved within a three-dimensional space.
Music: Facilitates the creation of immersive and customizable sound experiences, such as with 360 Reality Audio.
Video Games: Provides dynamic positional audio that enhances immersion and realism, adapting to player movement and the game environment.
Virtual and Augmented Reality: Delivers realistic and interactive spatial sound, perfectly synchronized with the visual elements of VR and AR environments.
Typical Workflow
Object-based audio production begins with recording the different sound elements—such as dialogue, music stems, and effects—in separate files. Then, using specialized software (like Pro Tools with Dolby Atmos Production Suite, Nuendo with MPEG-H Authoring Suite, L-ISA, GRIS, or Reaper with SPAT Revolution, among others), these sounds are positioned in a three-dimensional space and assigned metadata that determines their location, movement, size, and other characteristics. During mixing, the audio objects and, if used, the channel-based beds are integrated within the 3D environment. Finally, the complete mix is rendered into a format that incorporates both the audio essence and its associated metadata, such as the ADM file in Dolby Atmos.
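To make the renderer's role concrete, here is a minimal sketch that maps one object's azimuth metadata to a stereo speaker pair using a constant-power pan law. This is a toy reduction: a real renderer handles full 3D positions, object size, and arbitrary speaker layouts.

```python
import math

def pan_object_to_stereo(sample, azimuth_deg):
    """Constant-power pan of one object sample from its azimuth.

    Convention (an assumption of this sketch): 0 = centre,
    +90 = hard left, -90 = hard right.
    """
    az = max(-90.0, min(90.0, azimuth_deg))
    # Map azimuth to a pan angle p in [0, pi/2]; p = pi/4 is centre.
    p = (90.0 - az) / 180.0 * (math.pi / 2.0)
    left = sample * math.cos(p)
    right = sample * math.sin(p)
    return left, right
```

The constant-power law keeps `left**2 + right**2` equal to the input power for any azimuth, so an object panning across the front does not audibly dip in level.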
Scene-Based Audio (Ambisonics)

Definition and Technical Explanation
Scene-based audio, primarily represented by Ambisonics and its evolution to Higher-Order Ambisonics (HOA), captures the entire sound field at a single point in space using spherical harmonics. Unlike channel-based audio, its signals are not intended for specific speakers. Instead, they carry an abstract and speaker-independent representation of the sound field, known as B-format, which is later decoded according to the listener’s speaker layout.
The B-format signal includes an omnidirectional component (W) and three directional components (X, Y, and Z) that describe the sound pressure gradient in three dimensions. The concept of “order” in Ambisonics (first order, second order, higher order) indicates spatial resolution: the higher the order, the greater the spatial accuracy and the higher the number of channels required to encode the sound scene. There are different conventions for ordering these channels, with AmbiX and FuMa being the most widely used.
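Encoding a mono source into first-order B-format follows directly from these components. A sketch using the AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization):

```python
import math

def encode_first_order_ambix(sample, azimuth_deg, elevation_deg):
    """Encode a mono sample into first-order B-format, AmbiX style.

    Convention: azimuth 0 = front (positive = left),
    elevation 0 = ear level (positive = up).
    An order-N scene needs (N + 1) ** 2 channels; first order = 4.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    W = sample                                # omnidirectional
    Y = sample * math.sin(az) * math.cos(el)  # left-right gradient
    Z = sample * math.sin(el)                 # up-down gradient
    X = sample * math.cos(az) * math.cos(el)  # front-back gradient
    return W, Y, Z, X
```

Note that the FuMa convention would instead order the channels W, X, Y, Z and scale W by 1/√2, which is why declaring the convention is essential when exchanging B-format material.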
During playback, the B-format signal is decoded according to the specific available speaker setup or adapted for binaural reproduction, allowing an immersive experience through headphones.
Pros and cons
Production
- Advantage: It allows capturing a complete 360-degree sound field, making it ideal for immersive experiences.
- Disadvantage: It requires specific microphones and specialized post-production tools. Higher orders increase the number of required channels, which demands greater bandwidth and processing capacity.
Distribution
- Advantage: Its speaker-independent representation facilitates adaptation to different playback systems without the need to create separate mixes.
- Disadvantage: Playback on specific systems or in binaural format requires a decoding process.
Listener experience
- Advantage: It offers an immersive and natural three-dimensional audio experience, especially effective when head tracking is used in VR/AR environments.
- Disadvantage: At first order, spatial resolution is more limited compared to higher orders or object-based audio. Additionally, the “sweet spot” for accurate playback can be small, especially at lower orders.
Implementation in Formats and Technologies
MPEG-H: Supports scene-based audio through the use of Higher-Order Ambisonics (HOA), enabling immersive playback adaptable to different systems.
Ambisonics: A technology encompassing various orders and conventions, such as B-format and the FuMa and ACN channel orderings, and including HOA to achieve higher spatial resolution and a more accurate representation of the sound field.
IAMF (Immersive Audio Model and Formats): An open standard that supports scene-based audio via Ambisonics, with detailed specifications for Ambisonics configuration and modes.
Practical Applications
Virtual and Augmented Reality: Ideal for designing immersive 360° sound environments where audio responds accurately to the user’s head movements, enhancing the sense of realism.
360° Video: Provides spatial sound that boosts immersion and coherently complements the visual experience of spherical videos.
Immersive Music: Enables capturing and reproducing the spatial dimension of musical performances, enveloping the listener within the sound scene.
Typical Workflow
Scene-based audio production begins with capturing the sound field using specialized Ambisonic microphones, such as tetrahedral arrays or higher-order microphones. The signals recorded in A-format are subsequently converted to B-format, which represents the sound field independently of the speaker configuration. This B-format signal is mixed and processed within a digital audio workstation (DAW), using plugins and specific tools for Ambisonics. Finally, the resulting audio is rendered into the required delivery format: either binaural for headphones or multichannel for speaker systems.
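The final decoding step can be illustrated with a naive first-order "sampling" decoder for a horizontal square of four speakers. Production decoders add order weighting, normalization choices, and other refinements this sketch omits.

```python
import math

def decode_to_square(W, Y, Z, X):
    """Naive sampling decode of horizontal first-order B-format
    (AmbiX channel order W, Y, Z, X) to four speakers placed at
    +/-45 and +/-135 degrees azimuth. Z is unused: the layout
    has no height speakers.
    """
    speaker_azimuths = [45.0, 135.0, -135.0, -45.0]  # degrees
    feeds = []
    for az_deg in speaker_azimuths:
        az = math.radians(az_deg)
        # Sample the encoded sound field in each speaker's direction.
        g = 0.5 * (W + X * math.cos(az) + Y * math.sin(az))
        feeds.append(g)
    return feeds
```

A source encoded exactly toward one speaker produces its strongest feed there, a null at the opposite speaker, and reduced level at the two adjacent ones, which is what "speaker-independent until decoding" means in practice.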
Binaural Audio

Definition and Technical Explanation
Binaural audio is a recording and playback technique that uses two microphones positioned similarly to human ears, with the goal of simulating a three-dimensional sound experience for the headphone listener. Unlike conventional stereo audio, binaural audio replicates the time and level differences between both ears (ITD and ILD), as well as the spectral modifications caused by the head and ears (HRTF).
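The ITD cue can be approximated with the classic Woodworth spherical-head model, ITD = (a/c)(θ + sin θ), where a is the head radius, c the speed of sound, and θ the source azimuth. The default values below are common textbook choices, not universal constants.

```python
import math

def itd_seconds(azimuth_deg, head_radius_m=0.0875, c=343.0):
    """Approximate interaural time difference for a source at the
    given azimuth, using the Woodworth spherical-head model."""
    theta = math.radians(abs(azimuth_deg))
    return (head_radius_m / c) * (theta + math.sin(theta))
```

For a source at 90 degrees this gives the maximum ITD, roughly 0.65 ms, which is the largest arrival-time difference the two ears ever experience for a distant source.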
Within the scope of Next Generation Audio (NGA), binaural audio is classified as a form of spatial audio that enables an immersive experience through headphones. Although it does not fully align with the categories of channel-based, object-based, or scene-based audio, it is often considered a separate category within NGA. Binaural audio can be either pre-rendered or dynamically generated from object- or scene-based representations using HRTFs. Its ability to deliver an immersive experience with a simple playback setup—such as headphones—makes it a highly attractive option for applications like virtual reality, augmented reality, video games, and mobile device listening.
Pros and cons
Production
- Advantage: It enables the creation of immersive and realistic audio experiences using only two channels, simplifying the process compared to multichannel setups.
- Disadvantage: Creating authentic binaural recordings requires specialized microphones or advanced processing techniques to simulate human perception. Additionally, the experience can vary significantly between listeners due to individual differences in head and ear anatomy (HRTF).
Distribution
- Advantage: Binaural audio is highly efficient for distribution, as—like conventional stereo—it only requires two audio channels. This makes it compatible with most existing devices and platforms without the need for multichannel infrastructure.
- Disadvantage: Playback through stereo speakers does not allow accurate spatial perception, since the binaural effect relies on headphone listening. This can significantly compromise the quality and fidelity of the immersive experience.
Listener experience
- Advantage: Delivers a convincing and highly immersive three-dimensional audio experience through headphones, creating the sensation that sound is coming from all directions. It is particularly well-suited for applications where headphones are the primary listening medium, such as virtual reality, augmented reality, video games, and personal listening.
- Disadvantage: The quality of immersion depends on the accuracy of the HRTF used, which may not be suitable for all listeners.
Implementation in Formats and Technologies
Binaural audio can be adopted both as a final delivery format and as a rendering technique within other Next Generation Audio (NGA) formats.
Binaural audio files: Stereo audio files can contain binaural recordings captured with specialized microphones or digitally processed to simulate spatial effects typical of binaural listening.
Binaural rendering in NGA formats: Technologies such as Dolby Atmos, DTS:X, MPEG-H, and Ambisonics incorporate binaural rendering algorithms to enable immersive playback through headphones. This process uses HRTF to adapt the audio to human auditory perception when using headphones.
Practical Applications
Virtual and Augmented Reality (VR/AR): Binaural audio enables spatial sound experiences that integrate coherently with visual elements in VR and AR environments.
Video Games: Provides accurate positional information through headphones, allowing users to localize sounds within a three-dimensional space and enhancing the sense of immersion.
Music: Used to develop spatial and intimate listening experiences via headphones, offering new forms of sonic perception.
Podcasts and Audiobooks: Enhances the listening experience by introducing a spatial dimension that creates a sense of presence and environment.
Typical Workflow
Binaural content creation can be carried out either through recording with specialized microphones—embedded in dummy heads or ear-level devices—that capture sound in a way analogous to human perception, or by applying binaural processing to mono or multichannel recordings using software and plugins that employ Head-Related Transfer Functions (HRTFs). This processing allows sound sources to be positioned in a three-dimensional space and renders the result for playback through headphones.
Exercise: Spatial Audio Approaches Exploration Game
Objective:
Develop your critical listening skills while playfully exploring the different approaches used by immersive audio technologies to construct spatial sound. The challenge is to identify not only the direction of the sounds, but also how they move, what depth or height you perceive, and how each technology influences your sense of immersion.
Duration: 30 minutes
Introduction:
We live surrounded by soundscapes that envelop us in films, video games, music on platforms like Apple Music or Tidal, and virtual reality experiences. However, each technology creates that spatial sensation in a different way: some use fixed channels, others use freely moving objects, some encode entire scenes, and others adapt everything to how our ears perceive sound.
In this sound game, you’ll train your ear to recognize those differences. Remember: there is no “better” or “worse” approach — each one offers its own way of experiencing and understanding space. For example, stereo is not less valid than Ambisonics; it simply follows a different logic. As you play, ask yourself: Which sounds captivate me? Which one makes me feel more immersed? How do I imagine these sounds were created?
Instructions:
Preparation (5 minutes)
Find a quiet place with no external noise.
Use closed-back, high-quality headphones. Do not use loudspeakers for this exercise.
Click on the following link: https://surl.li/ormqlu
Note: Keep in mind that this exercise is designed to be done using headphones, so the experience will differ from what you would get in a physical space with distributed loudspeakers. All references here are specifically intended for headphone listening and do not exactly replicate the sensation of a surround system or room-based spatialization.