Module 6: Technical Principles Applied to Object-Based Audio

The Paradigm Shift to Object-Based Audio

The evolution of sound reproduction, from monophonic to stereo and surround sound, has been a continuous process aimed at increasing immersion and spatial accuracy. Traditional surround sound systems represented a technical advance, but object-based audio (OBA) constitutes a paradigm shift. This approach overcomes the limitations of fixed speaker layouts, offering a more flexible and precise method for creating and perceiving three-dimensional soundscapes.

Fundamental Principles: Channel-Based Audio vs. Object-Based Audio

The difference between channel-based audio and object-based audio defines the new technical and creative framework of professional audio.

Channel-Based Audio: Fixed Speaker Assignment

Channel-based audio is the traditional method of mixing and distributing sound. Each signal is assigned to a specific channel within a fixed speaker configuration, such as 5.1 or 7.1. Formats like Dolby Surround, Dolby Pro Logic, and DTS are examples of this paradigm.

Sound localization is achieved by adjusting the level of the signal in each speaker during mixing. This approach has limitations: the final mix depends on the exact speaker layout. If a speaker is missing or the configuration differs from the original, the sound scene becomes distorted and the immersive experience is compromised. This direct and exclusive relationship between channel and speaker is the main constraint of the channel-based model.
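To make this concrete, the short sketch below illustrates level-based localization in a fixed layout. It is only a sketch: the channel names and the constant-power pan law are illustrative choices, not tied to any particular format.

```python
import math

# Hypothetical fixed 5.1 channel order; each channel is tied to one speaker.
CHANNELS_5_1 = ["L", "R", "C", "LFE", "Ls", "Rs"]

def pan_front_lr(pan: float) -> dict[str, float]:
    """Constant-power pan between the front Left and Right channels.

    pan = -1.0 is hard left, +1.0 is hard right. The resulting per-channel
    gains are baked into the delivered channel signals, so the position
    cannot be changed or re-adapted after the mix is printed.
    """
    angle = (pan + 1.0) * math.pi / 4.0      # map [-1, 1] onto [0, pi/2]
    gains = {channel: 0.0 for channel in CHANNELS_5_1}
    gains["L"] = math.cos(angle)
    gains["R"] = math.sin(angle)
    return gains

print(pan_front_lr(-0.5))   # mostly Left, some Right, and only for this layout
```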

Object-Based Audio: Dynamic Rendering with Metadata

Object-based audio treats each sound element as an independent “object.” Each object includes metadata describing attributes such as position, size, and velocity. A rendering engine interprets this metadata in real time, adapting playback to the available speaker configuration.

This method preserves spatial perception across different environments, from multichannel systems to binaural headphones. The flexibility and spatial accuracy of OBA surpass the limitations of fixed-channel systems, providing a more reliable and consistent sense of localization.
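As a simple illustration (the field names below are hypothetical, not drawn from any specific standard), an audio object can be thought of as a mono stem plus a separate block of spatial metadata that the renderer interprets at playback time:

```python
# Hypothetical sketch of an audio object: the mono signal and its spatial
# metadata are kept separate, so the same object can be rendered on any
# speaker layout or on binaural headphones.
audio_object = {
    "signal": "helicopter_mono.wav",                  # reference to the mono stem
    "metadata": {
        "position": {"x": 0.8, "y": 0.2, "z": 0.9},   # normalized room coordinates
        "size": 0.1,                                  # spatial spread of the source
        "gain_db": -3.0,                              # level trim applied at render time
    },
}
```

The decisive point is that nothing in this description refers to a specific speaker: the rendering engine decides how to distribute the signal for whatever system is actually present.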

Hybrid Approach: The Interaction Between Beds and Objects

In practice, OBA workflows often combine beds and objects. This approach balances computational efficiency with creative freedom.

Bed: The Channel-Based Foundation of the Immersive Mix

A bed is a fixed multichannel layer, typically in a 7.1.2 configuration, used as the foundation of the mix. Its channels have predefined routes to specific speakers. Beds are used for static elements such as ambience, background music, or dialogue, and they are the only elements that can send signal to the LFE (Low Frequency Effects) channel.

Object: Dynamic Elements of the 3D Sound Field

Objects are discrete elements that can be positioned anywhere in three-dimensional space. Their location is controlled through panning metadata, allowing playback systems to adapt sound positioning to the speaker layout.

The combination of beds and objects optimizes resource usage: static elements are efficiently processed in beds, while objects are reserved for dynamic elements that require high spatial precision. This design reflects a deliberate production strategy, prioritizing perceptual quality without overloading rendering systems.
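A hybrid session can be pictured as follows. The stem names and the 7.1.2 bed label are illustrative assumptions, used only to show how static and dynamic elements are split between the two layers.

```python
# Sketch of a hybrid object-based session: static content sits in a fixed
# 7.1.2 bed (the only path to the LFE channel), while dynamic content is
# carried as objects with their own positioning metadata.
session = {
    "bed": {
        "layout": "7.1.2",                            # fixed channel-to-speaker routes
        "stems": ["ambience.wav", "score.wav", "dialogue.wav"],
    },
    "objects": [
        {"signal": "helicopter.wav",
         "metadata": {"position": {"x": 0.0, "y": 1.0, "z": 0.9}}},
        {"signal": "car_pass_by.wav",
         "metadata": {"position": {"x": -0.7, "y": 0.3, "z": 0.0}}},
    ],
}
```

Keeping the ambience and score in the bed saves rendering resources, while the moving sources retain the per-object spatial precision described above.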

The Anatomy of an Audio Object

Understanding the capabilities of object-based audio requires analyzing its fundamental unit: the audio object. This is not a conventional audio file, but a data structure that separates the sound signal from its spatial information. This separation represents a key technical innovation, enabling the flexibility and scalability of the OBA paradigm.

Audio Object Structure: Signal and Metadata

An audio object consists of two distinct layers: a primary audio signal and a metadata layer.

Primary Audio Signal

At its most basic level, an audio object contains a monophonic signal. This waveform represents the sound content—voice, instrument, or effect—without any predefined spatial characteristics. The signal is encoded with efficient, high-quality codecs, such as those based on the Modified Discrete Cosine Transform (MDCT) used in MPEG-H 3D Audio.

Unlike channel-based audio, where the sound’s location depends on how the signal is distributed across multiple speakers, an object’s signal has no predetermined position. Its location is defined solely by the accompanying metadata.

A conceptual diagram of the audio object shows it as two independent blocks:

  • Monophonic Signal: the raw sound content.

  • Metadata Layer: subcomponents such as Position (X, Y, Z), Size, and Velocity.

The combination of these two layers defines the audio object. This model clarifies that an object is not just a sound, but the combination of the sound and its dynamic spatial instructions.

Metadata Layer: Position, Size, Velocity, and Other Attributes

The metadata layer typically includes:

  • Position: X, Y, Z coordinates defining the object’s location.

  • Size: the spatial extent of the object, influencing how it is perceived.

  • Velocity: the magnitude and direction of movement within the sound field.

  • Other attributes: gain, acoustic characteristics, or material properties that the rendering engine can use for advanced processing.

This information allows the system to render sound in real time, adapting it to the available speaker configuration.
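The two-layer structure described above can be modeled in a few lines. The field names and types below are illustrative rather than taken from any delivery format.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectMetadata:
    """Spatial instructions carried alongside the mono signal."""
    position: tuple[float, float, float] = (0.0, 0.0, 0.0)   # X, Y, Z coordinates
    size: float = 0.0            # 0 = point source; larger values = wider spread
    velocity: tuple[float, float, float] = (0.0, 0.0, 0.0)   # movement vector
    gain_db: float = 0.0         # an additional attribute the renderer may use

@dataclass
class AudioObject:
    """Monophonic signal plus its metadata layer."""
    samples: list[float]                                     # raw mono waveform
    metadata: ObjectMetadata = field(default_factory=ObjectMetadata)

obj = AudioObject(samples=[0.0] * 48000,                     # one second at 48 kHz
                  metadata=ObjectMetadata(position=(0.5, 0.2, 0.8), size=0.1))
```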

Separation of Signal and Spatial Data: Core Technical Principle

In channel-based systems, the signal and its position are inseparable: sending a signal to the left channel means the sound is positioned to the left. In OBA, the signal is independent of its location; the spatial information is transmitted as a separate metadata stream.

During playback, the rendering engine interprets this metadata and distributes the signal according to the available system, whether it’s a 7.1.4 setup, a 22.2 cinema, or binaural headphones.

This independence between signal and spatial data makes an OBA master format-agnostic, adaptable to future systems, and compatible with multiple configurations without loss of quality or creative intent.
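The following toy renderer shows the principle: the same position metadata produces a different gain distribution for each layout. The speaker coordinates and the inverse-distance weighting are deliberate simplifications; production renderers use amplitude-panning laws such as VBAP together with psychoacoustic models.

```python
import math

# Hypothetical speaker layouts in normalized room coordinates
# (X: left -1 .. right +1, Y: rear -1 .. front +1, Z: floor 0 .. ceiling 1).
LAYOUTS = {
    "stereo": {
        "L": (-1.0, 1.0, 0.0), "R": (1.0, 1.0, 0.0),
    },
    "5.1.2": {
        "L": (-1.0, 1.0, 0.0), "R": (1.0, 1.0, 0.0), "C": (0.0, 1.0, 0.0),
        "Ls": (-1.0, -1.0, 0.0), "Rs": (1.0, -1.0, 0.0),
        "Ltm": (-0.5, 0.0, 1.0), "Rtm": (0.5, 0.0, 1.0),
    },
}

def render_gains(position, layout_name):
    """Toy renderer: weight each speaker by inverse distance to the object,
    then normalize so total power stays constant across layouts."""
    speakers = LAYOUTS[layout_name]
    weights = {}
    for name, speaker_pos in speakers.items():
        distance = math.dist(position, speaker_pos)
        weights[name] = 1.0 / (distance + 1e-6)     # closer speakers get more signal
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {name: w / norm for name, w in weights.items()}

# The same metadata drives any layout: only the gain distribution changes.
position = (0.6, 0.8, 0.2)                          # front right, slightly raised
print(render_gains(position, "stereo"))
print(render_gains(position, "5.1.2"))
```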

The Real-Time Renderer — Psychoacoustic Principles in Action

The real-time rendering engine is the component that transforms an audio object’s data into a coherent, spatially accurate sound experience. It is a system—software or hardware—that interprets the metadata and applies psychoacoustic principles to reproduce the sound according to the creative intent and the listening environment.
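Two of the psychoacoustic cues such an engine relies on for localization are the interaural time difference (ITD) and the interaural level difference (ILD). The sketch below approximates them for a single source azimuth; the spherical-head (Woodworth) ITD formula and the crude ILD estimate are simplifications, and real binaural renderers use measured HRTF sets instead.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s
HEAD_RADIUS = 0.0875     # m, average head radius used in simple spherical-head models

def binaural_cues(azimuth_deg: float):
    """Approximate two key localization cues for a source at a given azimuth
    (0 = straight ahead, +90 = hard right):
    - ITD via Woodworth's spherical-head formula
    - ILD via a rough, frequency-independent level offset
    """
    az = math.radians(azimuth_deg)
    itd = (HEAD_RADIUS / SPEED_OF_SOUND) * (az + math.sin(az))   # seconds
    ild_db = 10.0 * math.sin(az)                                 # crude estimate
    return itd, ild_db

itd, ild = binaural_cues(45.0)
print(f"ITD ~ {itd * 1e6:.0f} microseconds, ILD ~ {ild:.1f} dB (right ear leads)")
```

In practice, the renderer combines cues like these with the layout information described earlier to keep localization consistent with the creative intent, whatever the listening environment.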
