In the past few years a whole range of visual effects has been standardized. Future websites can render pretty much anything using bitmap canvases, display 3D content using CSS 3D Transforms or WebGL and even implement entire keyframe-based animations using nothing but CSS. Combined with specifications like the Application Cache and Local Storage, “HTML5” enables a whole new range of web-based applications.
Unfortunately, now that almost everything can be visualized on your monitor, the inability to synthesize, process, and analyse audio streams is becoming more and more obvious. While Flash provides fairly extensive APIs for working with sound, having a native (and preferably more extensive) API available to synthesize, process, and analyse any audio source is much more convenient. That’s why the W3C Audio Incubator Group was founded!
Don’t get too excited just yet: while an initial draft has been published by Google’s Chris Rogers, you shouldn’t expect the API to be finished within the year. The initial version received lots of input from six Apple engineers: Maciej Stachowiak, Eric Carlson, Chris Marrin, Jer Noble, Sam Weinig and Simon Fraser, and now frequently gets updated based on feedback received via the mailing list. The draft specifies various features for the API: spatialized audio, a convolution engine, real-time frequency analysis, biquad filters and sample-accurate scheduled sound playback. Wait, spatialized what?
The reason why it doesn’t exist already
The complexity involved in synthesizing, processing, and analysing audio is one of the key reasons why it doesn’t exist already. Most audio today has a sampling rate of just over 44 thousand samples per second; tracks on DVDs and Blu-ray discs can go as high as 192 thousand samples per second. Multiply that by the number of sound channels, add the decoding required to make sense of the file, and you can imagine the amount of work that goes into translating that MP3 file into waves our ears can interpret.
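To put rough numbers on that multiplication, here is a small sketch. The channel counts (2 for stereo, 6 for 5.1 surround) and the function name are illustrative assumptions, not figures taken from the draft.

```typescript
// Number of individual sample values that have to be handled every second.
function samplesPerSecond(sampleRate: number, channels: number): number {
  return sampleRate * channels;
}

// CD-quality stereo: 44,100 Hz on 2 channels.
const cdStereo = samplesPerSecond(44100, 2);

// High-resolution surround: 192,000 Hz on 6 channels (assumed 5.1 layout).
const hiResSurround = samplesPerSecond(192000, 6);

console.log(cdStereo);      // 88200
console.log(hiResSurround); // 1152000
```

Even the modest stereo case leaves only about 11 microseconds per sample value, before any effect has been applied.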
Of course, part of this process is handled by hardware, like converting the digital stream to an analog signal. However, applying effects to an audio stream happens entirely in software where each sample gets processed. In situations where effects are applied and the processed sound is played back almost simultaneously, you can imagine how critical things like buffering and timing are.
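As a minimal sketch of what "each sample gets processed" means, the code below applies a volume change to a stream one block at a time. The block size of 128 samples and both function names are assumptions made for illustration; the draft does not prescribe them.

```typescript
// Arbitrary block size chosen for this sketch.
const BLOCK_SIZE = 128;

// Multiply every sample in one block by a gain factor, in place.
// Samples are assumed to be floats in the range [-1, 1].
function applyGain(block: Float32Array, gain: number): void {
  for (let i = 0; i < block.length; i++) {
    block[i] *= gain;
  }
}

// Walk through a stream block by block, roughly the way a real-time engine
// has to between two refills of the output buffer.
function processStream(samples: Float32Array, gain: number): Float32Array {
  const out = new Float32Array(samples); // copy, so the input stays intact
  for (let start = 0; start < out.length; start += BLOCK_SIZE) {
    // subarray() returns a view, so applyGain writes straight into `out`.
    applyGain(out.subarray(start, start + BLOCK_SIZE), gain);
  }
  return out;
}

// At 44,100 Hz, each block has to be finished within this many milliseconds
// (roughly 2.9 ms) or playback will stutter.
const blockDeadlineMs = (BLOCK_SIZE / 44100) * 1000;
```

The deadline is the reason buffering and timing matter: a larger block gives the software more time per call, but adds delay between processing and playback.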
Native processing to the rescue: just create an API
The idea is simple: the “base” is an AudioContext interface which manages connections between the different Audio Nodes. The context contains a Destination Node by default, which represents the output device on your computer. This could be your speakers, your headphones or, perhaps in the future, even a file on your hard drive.
Of course, there have to be audio sources as well. There are various kinds of sources: MediaElementAudioSourceNode for <audio> and <video> tags and AudioBufferSourceNode for other kinds of input, like MP3 files requested via XHR. Other types are yet to be defined, but source nodes like DeviceElementSourceNode aren’t unthinkable, which could be used to process microphone input via the <device> element.
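The routing idea (a context that owns a destination, with sources connected to it) can be modelled in a few lines. This is a toy model only: every class name below is hypothetical and is not the draft's API, and the "destination" simply records the samples it is asked to play.

```typescript
// Toy model only: all names here are hypothetical, not the draft's API.
abstract class ToyNode {
  private targets: ToyNode[] = [];

  // Route this node's output into another node.
  connect(target: ToyNode): ToyNode {
    this.targets.push(target);
    return target;
  }

  // Take a block from upstream, process it, and hand it downstream.
  receive(block: Float32Array): void {
    const processed = this.process(block);
    for (const target of this.targets) {
      target.receive(processed);
    }
  }

  protected abstract process(block: Float32Array): Float32Array;
}

// Stands in for the output device; it just records what it should play.
class ToyDestination extends ToyNode {
  readonly played: number[] = [];

  protected process(block: Float32Array): Float32Array {
    block.forEach((sample) => {
      this.played.push(sample);
    });
    return block;
  }
}

// Stands in for a buffer-backed source, such as a decoded MP3.
class ToySource extends ToyNode {
  private data: Float32Array;

  constructor(data: Float32Array) {
    super();
    this.data = data;
  }

  protected process(block: Float32Array): Float32Array {
    return block; // a source passes its data through unchanged
  }

  start(): void {
    this.receive(this.data);
  }
}

// The context owns the default destination and hands out sources.
class ToyContext {
  readonly destination = new ToyDestination();

  createSource(data: Float32Array): ToySource {
    return new ToySource(data);
  }
}

const toyContext = new ToyContext();
const toySource = toyContext.createSource(new Float32Array([0, 0.5, -0.5]));
toySource.connect(toyContext.destination);
toySource.start();
// toyContext.destination.played now holds 0, 0.5, -0.5
```

Any node that changes the sound would be another ToyNode subclass that overrides process() and sits between the source and the destination.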
Between audio sources and destinations, there can be other types of nodes to perform various kinds of manipulations. The specification currently defines the following interfaces:
- AudioGainNode: changes the volume of the audio.
- AudioPannerNode: positions and spatializes audio in a 3D space.
- BiquadFilterNode: adds lowpass, highpass, and other common types of filters to the audio.
- ChorusNode: adds a chorus effect to the audio.
- ConvolverNode: adds effects to audio, such as imitating the sound of a concert hall.
- DelayNode: applies dynamically adjustable delays to an AudioNode.
- DynamicsProcessorNode: adds shaping (compressing/expanding) effects.
- WaveShaperNode: adds non-linear waveshaping effects, like distortion.
These nodes form the foundation of many of the features currently available in audio systems, but the specification is still far from finished and more types of nodes may be added. For analysis you could use a RealtimeAnalyserNode, which allows you to analyse the audio passing through it in real time. This could be used, for example, to display the tones present in a stream.
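To give an idea of what "displaying the tones" involves, the sketch below finds the strongest frequency in a short block of samples. It uses a naive discrete Fourier transform written out by hand; this is a stand-in chosen for readability, not how a RealtimeAnalyserNode is specified to work, and a real implementation would be expected to use a much faster FFT. All names are illustrative.

```typescript
// Magnitude of each frequency bin of a real signal, via a naive O(n^2) DFT.
function magnitudeSpectrum(samples: Float32Array): number[] {
  const n = samples.length;
  const bins: number[] = [];
  // Only the first half of the bins is meaningful for a real-valued signal.
  for (let k = 0; k < n / 2; k++) {
    let re = 0;
    let im = 0;
    for (let t = 0; t < n; t++) {
      const angle = (2 * Math.PI * k * t) / n;
      re += samples[t] * Math.cos(angle);
      im -= samples[t] * Math.sin(angle);
    }
    bins.push(Math.sqrt(re * re + im * im));
  }
  return bins;
}

// Frequency in Hz of the strongest bin.
function dominantFrequency(samples: Float32Array, sampleRate: number): number {
  const bins = magnitudeSpectrum(samples);
  let peak = 0;
  for (let k = 1; k < bins.length; k++) {
    if (bins[k] > bins[peak]) {
      peak = k;
    }
  }
  return (peak * sampleRate) / samples.length;
}

// A 1,000 Hz sine sampled at 8,000 Hz. With 64 samples each bin is 125 Hz
// wide, so the tone falls exactly on bin 8.
const toneSampleRate = 8000;
const tone = new Float32Array(64);
for (let t = 0; t < tone.length; t++) {
  tone[t] = Math.sin((2 * Math.PI * 1000 * t) / toneSampleRate);
}

console.log(dominantFrequency(tone, toneSampleRate)); // 1000
```

Drawing one bar per bin from magnitudeSpectrum() would give the familiar spectrum display seen in music players.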
An example: dynamically changing the language of a video
Currently there is no clean way to switch between alternative audio streams for an HTML5 <video> element. The Audio API is ideal for such a purpose. As long as you keep a few things in mind, like splitting the audio into smaller files to speed up the (initial) loading, it won’t be hard to create a language switcher:
- Create an AudioContext,
- Get the audio sources from the <video> element using a MediaElementAudioSourceNode,
- Decrease the volume of the video using an AudioGainNode,
- Get the new audio stream by requesting the MP3 via XHR and putting it in an AudioBufferSourceNode,
- Combine the two using the Dynamics Compressor (DynamicsProcessorNode),
- Play the audio stream.
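The steps above can be sketched as a single function. Because the draft's method names are not final, the sketch borrows the factory names used by the Web Audio API as it later shipped in browsers (createMediaElementSource, createGain, createBufferSource, createDynamicsCompressor); treat those names as assumptions relative to this draft. The small interfaces and the switchLanguage function itself are hypothetical, and fetching and decoding the MP3 is assumed to have happened before the call.

```typescript
// Minimal structural types covering only the calls used below. They are
// assumptions modelled on the API as later shipped, not on this draft.
interface NodeLike {
  connect(destination: NodeLike): NodeLike;
}
interface GainLike extends NodeLike {
  gain: { value: number };
}
interface BufferSourceLike extends NodeLike {
  buffer: unknown;
  start(when?: number): void;
}
interface ContextLike {
  destination: NodeLike;
  createMediaElementSource(element: unknown): NodeLike;
  createGain(): GainLike;
  createBufferSource(): BufferSourceLike;
  createDynamicsCompressor(): NodeLike;
}

// Hypothetical helper: replaces a video's audible language with a dubbed
// track. `dubBuffer` is the already decoded replacement audio.
function switchLanguage(
  ctx: ContextLike,
  video: unknown,
  dubBuffer: unknown,
  videoVolume: number
): BufferSourceLike {
  // Steps 1 and 2: the caller created the context; tap the video's audio.
  const original = ctx.createMediaElementSource(video);

  // Step 3: turn the original track down.
  const duck = ctx.createGain();
  duck.gain.value = videoVolume;

  // Step 4: wrap the decoded replacement track in a buffer source.
  const dub = ctx.createBufferSource();
  dub.buffer = dubBuffer;

  // Step 5: combine both streams in a dynamics compressor.
  const mix = ctx.createDynamicsCompressor();
  original.connect(duck).connect(mix);
  dub.connect(mix);
  mix.connect(ctx.destination);

  // Step 6: start playing the new track immediately.
  dub.start(0);
  return dub;
}
```

In a browser, a call might look like switchLanguage(new AudioContext(), videoElement, decodedBuffer, 0.1), where videoElement and decodedBuffer are placeholders for your own <video> element and decoded audio. Keeping the dubbed track in sync with the video while it seeks or pauses is left out of this sketch.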
This can be demonstrated using the following diagram:
These same techniques could be used to dynamically control background sounds for clips and create timed effects for games using an arbitrary number of output channels (which could be 2 for stereo, 6 for 5.1 surround or even more!). Of course, more everyday use cases come to mind as well: a beep when you click on a button, messages when interactive validation in a form fails or a music player featuring cross-over effects.
Thanks and credits to Chris Rogers and Koen ten Berg for their technical input and feedback!