Planar-only considered harmful #2458
Just wanted to drop a quick note here that I wholeheartedly agree with Casey on this. We have an Audio Worklet backend live in multiple products, and this was one of the things that felt wrong when I was coding it. All our internal processing (C++ compiled to WASM) uses interleaved data, and we deinterleave when filling the buffers in the worklet processor. Since all the native / low-level interfaces I've used so far expect interleaved data, this seems very counterintuitive and, as Casey said, bad for performance. So if we can't have interleaved everywhere due to some need of the graph processing system, at least a configurable option to bypass this conversion for the simple case would be great.
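To make the conversion step described above concrete, here is a minimal scalar sketch of deinterleaving into per-channel buffers. The helper name `deinterleaveInto` and the sample values are hypothetical; in a real worklet the per-channel arrays would be the output channels passed to the processor's `process()` callback, and the interleaved buffer would be a view onto the WASM heap.

```typescript
// Hypothetical helper: split an interleaved buffer (L R L R ...) into one
// Float32Array per channel, which is the planar layout the worklet expects.
function deinterleaveInto(
  interleaved: Float32Array,
  channels: Float32Array[],
): void {
  const channelCount = channels.length;
  if (channelCount === 0) return;
  const frames = channels[0].length;
  for (let ch = 0; ch < channelCount; ch++) {
    const out = channels[ch];
    // Read every channelCount-th sample, starting at this channel's offset.
    for (let i = 0, src = ch; i < frames; i++, src += channelCount) {
      out[i] = interleaved[src];
    }
  }
}

// Three stereo frames, interleaved as L0 R0 L1 R1 L2 R2 (illustrative data).
const interleavedStereo = new Float32Array([0.5, -0.5, 0.25, -0.25, 0.125, -0.125]);
const left = new Float32Array(3);
const right = new Float32Array(3);
deinterleaveInto(interleavedStereo, [left, right]);
```

This is the extra pass that an interleaved output path would make unnecessary for the simple case.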
The API fundamentally makes you deal with channels (input/output indices). https://developer.mozilla.org/en-US/docs/Web/API/AudioNode/connect At a basic level, I agree that an interleaved input on a destination node, and interleaved output from media sources, would be a fantastic addition. At the graph level, I've always wished that nodes' inputs and outputs were some kind of Signal object, rather than planar PCM. In particular, I wish the Signal object could carry planar data, fused spatial or quasi-spatial data such as ambisonic signals, or spectral data, and that the signal rate was the natural signal rate for the data. Furthermore, I'd hope that there'd be some interface to test whether signal-outs are compatible with signal-ins, and that in the case of incompatible signals, explicit adaptor nodes might be available, so that deplanarization, spectralization, rate conversion, or up and down mixing would never be heuristically applied, but explicitly supplied to support the dataflow.
The answer to most of the questions here is "for historical reasons". The Web Audio API was shipped by web browsers on the web without having been fully specified, and without enough consideration for really advanced use-cases and high performance. The alternative proposal was direct PCM playback in interleaved format, but it didn't get picked; this was about 10 years or so ago.

The Web Audio API's native nodes will never work in interleaved mode, because it's fundamentally planar (as shown in previous messages here), but this doesn't mean this problem cannot be solved, so that folks with demanding workloads can use the Web. Another API was briefly considered a few years back, but it didn't feel important enough to continue investigating, in light of the performance numbers gathered at the time. It was essentially just an audio device callback, but this is implementable today with just a single

That said, the first rule of any performance discussion is to gather performance data. In particular, are we talking here about: (a) Software that uses a hybrid of native audio nodes, and or some other setup, or maybe something hybrid?

(a) and (b) suffer from lots of copies to/from the WASM heap; (c) doesn't: a single copy from the WASM heap will happen, at the end. (a) and (b) also suffer from a minimum processing block size (referred to in the spec as a "render quantum") of (for now) 128 frames, but this will change in #2450. (c) doesn't have this issue.

For (a) and (b), the interleaving/deinterleaving operations can be folded into the copy to the WASM heap (with possible sample type conversion if e.g. the DSP works in int16 or fixed point). This lowers the real cost of the interleaving/deinterleaving operations (without eliminating it). (See links at the end for the elimination of this copy). For (c), the interleaving / deinterleaving operations will happen exactly twice (as noted): going into the

Again, what is needed first and foremost is real performance numbers.
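As an illustration of folding the deinterleave into a copy that also converts the sample type, here is a minimal sketch that reads interleaved int16 samples and writes planar float32 in one pass. The function name `int16InterleavedToPlanar`, the 1/32768 scaling convention, and the sample data are assumptions made for the example, not something taken from the spec or from this thread.

```typescript
// Hypothetical helper: one pass that both deinterleaves and converts
// interleaved int16 samples to planar float32 in the range [-1, 1).
function int16InterleavedToPlanar(
  src: Int16Array,
  channels: Float32Array[],
): void {
  const channelCount = channels.length;
  if (channelCount === 0) return;
  const frames = Math.floor(src.length / channelCount);
  const scale = 1 / 32768; // assumed convention: full-scale int16 maps to -1.0
  for (let i = 0; i < frames; i++) {
    const base = i * channelCount;
    for (let ch = 0; ch < channelCount; ch++) {
      // The read, the type conversion and the scatter all happen here, so no
      // separate deinterleave pass over the data is needed.
      channels[ch][i] = src[base + ch] * scale;
    }
  }
}

// Two stereo frames of int16 data, interleaved as L0 R0 L1 R1.
const pcm16 = new Int16Array([-32768, 16384, 0, -16384]);
const planarLeft = new Float32Array(2);
const planarRight = new Float32Array(2);
int16InterleavedToPlanar(pcm16, [planarLeft, planarRight]);
```

The data is still touched once per sample, so the cost is reduced rather than removed, which matches the point made in the comment.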
Thankfully, if there is already running code implementing the approaches above ((a), (b) and (c), possibly others), it's not particularly hard to get them, using https://blog.paul.cx/post/profiling-firefox-real-time-media-workloads/ and https://web.dev/profiling-web-audio-apps-in-chrome/. I assume the people in this discussion can skip most of the prose in both those articles, because they are familiar with real-time-safe code, and can skip to the part about getting/sharing the data. Here we're mostly interested in the difference between the total time it took to render n frames of audio (i.e. the
Some assorted links for context:
This comment was marked as off-topic.
I'm currently working on something like this for remote browser isolation audio streaming and having to resort to using a mono stream from I think the use case of real-time processing / playing-from-stream of audio is pretty important. Is there a way to do this with stereo?
This comment was marked as off-topic.
Thank you, @guest271314! This looks awesome. I might try to use your code at some point 🙂
As far as I could tell from the spec and the API, the design of the WebAudio API is such that it is always "planar" rather than interleaved, no matter what part of the pipeline is in play. While the efficiency of this design for a filter graph is a separate concern, the WebAudio design creates a more serious issue because it does not distinguish between the final output format and the graph processing format.
As WASM becomes more prevalent, more people will be writing their own audio subsystems. These subsystems will have to output to the browser at some point. At the moment, the only viable option is to use WebAudio. Because WebAudio only supports planar data formats, this means people's internal audio subsystems must output planar data.
This creates a substantial inefficiency. A large installed base of CPUs does not have planar scatter hardware. Since modern mixers must use SIMD to be fast, this means that scattering channel output 2-wide (stereo) or 8-wide (spatial) is extremely costly, as several instructions must be used to manually deinterleave and scatter each sample.
To add insult to injury, most hardware expects to receive interleaved sample data. This means that in many cases, after the WASM code has taken a large performance hit to deinterleave to planar, the browser will then turn around and take another large performance hit to reinterleave the samples, often (again) without gather hardware, meaning it will require several instructions to manually reinterleave.
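For reference, the reinterleave step described above amounts to something like the following scalar sketch. The function name `interleavePlanar` and the sample data are hypothetical, and this is not a claim about how any browser actually implements the step; a real implementation would likely be vectorized, which is where the missing gather/scatter support matters.

```typescript
// Hypothetical helper: merge planar per-channel buffers back into a single
// interleaved buffer (L R L R ...), the layout most hardware expects.
function interleavePlanar(channels: Float32Array[]): Float32Array {
  const channelCount = channels.length;
  if (channelCount === 0) return new Float32Array(0);
  const frames = channels[0].length;
  const out = new Float32Array(frames * channelCount);
  for (let i = 0; i < frames; i++) {
    for (let ch = 0; ch < channelCount; ch++) {
      // Each output frame gathers one sample from every channel buffer.
      out[i * channelCount + ch] = channels[ch][i];
    }
  }
  return out;
}

// Two planar channels of two frames each (illustrative data).
const reinterleaved = interleavePlanar([
  new Float32Array([1, 2]),
  new Float32Array([3, 4]),
]);
```

When the audio subsystem already produces interleaved samples, the deinterleave on the WASM side followed by this step is a round trip that ends with the layout the data started in.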
I would like to recommend that serious consideration be given to supporting interleaved float as an output format. It could be a separate path just for direct audio output, and does not have to be part of the graph specification, if that reduces the cost of adding it to the specification. It could even be made as part of a WASM audio output specification only, with no JavaScript support, if necessary, since it would presumably only be relevant to people writing their own audio subsystems. I believe there is already consideration of WASM-specific use cases, as I have seen mentions of the need to avoid cloning memory from the WASM memory array into the JavaScript audio worklet, etc.
If I have misunderstood the intention here in some way, I would welcome explanations as to how to avoid the substantial performance penalties inherent in the current design.
- Casey