2021-01-25 02:14:11 (edited by Ethin 2021-01-25 02:16:55)

So I'm looking at audio libraries like Miniaudio or CPAL. I've always used others like FMOD/BASS because they abstract these details away from me, so I've never really had to worry about it. However, I've always been curious how this works. In CPAL, for example, you create a host with a method like default_host(). These hosts can then give you devices through the methods defined in the Host trait. After that, you create streams with build_input_stream(), build_output_stream(), build_input_stream_raw(), or build_output_stream_raw(), as defined in the Device trait. This is where my confusion arises: each of the build stream methods takes a callback that feeds samples to the audio system. I understand how this works in principle if I want to play a single sound, or if I want to play multiple sounds mixed together. But what if I have multiple "channels"? How do modern audio systems handle that kind of problem when you usually only have a single stream (at least, I think), and the stream data callback is called at some unknown interval? Is each channel just another sound to be added to the final output signal?
I'm also curious about how DSP effects are applied when set on sounds. I imagine the data callback is called extremely frequently, so I'm guessing it contains a central algorithm that applies the DSP chain, mixes all the samples together, and then returns. But I'm honestly curious how this actually works.

"On two occasions I have been asked [by members of Parliament!]: 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out ?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."    — Charles Babbage.
My Github

2021-01-25 03:38:29

The really short answer is read Synthizer, which uses Miniaudio.

The slightly longer answer is it depends, so I will explain Synthizer:

First, you want everything to be at the same samplerate at some point.  Samplerate conversions are expensive: not so expensive that a couple matter, but enough that if you run one per sound you're going to start hurting.  So you push them to the edges.  E.g. Synthizer buffers are resampled on load.  Unfortunately, for streaming and for audio output this isn't always possible: in the streaming case you're of course limited to the samplerate of the original audio and have to insert a conversion there, and for audio output it's sometimes the case that running the library at a fixed samplerate internally is a big win.  For Synthizer that's 44100, because HRTF datasets like to be that way and resampling those at runtime is complicated.
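
To make "push the conversion to the edge" concrete, here's a rough sketch of resampling a mono buffer once at load time, using linear interpolation; the function name and signature are invented for illustration and this isn't Synthizer's actual resampler:

```cpp
#include <cstddef>
#include <vector>

// Resample a mono buffer to the fixed internal rate once, up front, so the
// mixing path never has to convert rates per block. Linear interpolation is
// the crudest option; it's here to show the shape of the idea.
std::vector<float> resampleOnLoad(const std::vector<float> &in,
                                  double srcRate, double dstRate = 44100.0) {
  if (in.empty() || srcRate == dstRate) return in;
  double ratio = srcRate / dstRate;
  std::size_t outLen = static_cast<std::size_t>(in.size() / ratio);
  std::vector<float> out(outLen);
  for (std::size_t i = 0; i < outLen; i++) {
    double pos = i * ratio;
    std::size_t idx = static_cast<std::size_t>(pos);
    float frac = static_cast<float>(pos - idx);
    float next = (idx + 1 < in.size()) ? in[idx + 1] : in[idx];
    out[i] = in[idx] + (next - in[idx]) * frac;
  }
  return out;
}
```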

The rest of this is basically summing arrays.  You take your sources of audio--Synthizer generators, for example--and sum those into sources.  Pan them.  Sum the output of those into the audio output buffers.  At each stage of this process you might have to convert between channel formats, especially mono->stereo and stereo->mono.  Synthizer does that with specialized functions.  The general case is a matrix multiplication, but stereo->mono is (l+r)/2 and mono->stereo is just copying to two arrays, so the specialization is worth it.  Also, you want to specialize the no-conversion case, either to a memcpy or to adding into an output buffer.
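
For a picture of what "summing arrays" plus channel conversion looks like in practice, here's a rough sketch; the function names are made up for illustration and aren't Synthizer's actual API:

```cpp
#include <cstddef>

// Note that every function adds into the output rather than overwriting it,
// which matters for the buffer discussion below.
void mixStereoToMono(const float *in, float *out, std::size_t frames) {
  for (std::size_t i = 0; i < frames; i++)
    out[i] += (in[2 * i] + in[2 * i + 1]) * 0.5f; // (l+r)/2
}

void mixMonoToStereo(const float *in, float *out, std::size_t frames) {
  for (std::size_t i = 0; i < frames; i++) {
    out[2 * i] += in[i];     // left
    out[2 * i + 1] += in[i]; // right
  }
}

// The no-conversion case: plain summing, sample by sample.
void mixSameFormat(const float *in, float *out, std::size_t samples) {
  for (std::size_t i = 0; i < samples; i++)
    out[i] += in[i];
}
```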

Synthizer is fast enough to run in prod in debug builds.  You shouldn't, but you can, and that's been valuable for testing.  If you optimize, you can do very demanding audio on the CPU.  Unfortunately that means running at close to maximum efficiency: Synthizer in a real-world scenario can easily approach 50 megaflops, or even a couple gigaflops or more, depending.  That's not too bad, except that you're sharing the system and it's effectively a hard realtime requirement, even more so than graphics.  To pull it off, you have to address 3 aspects.

First, anything that might block the audio generation thread cannot happen on the audio generation thread.  That's not quite a hard rule, because sometimes it's just not possible to avoid memory allocation or the like, but mutexes/locks are not your friend in any fashion.  You can make it a hard rule, but only if you add limitations like a maximum number of sources.  Synthizer basically says "here are some reasonably sized pre-allocated buffers; if you do something crazy, things might click while we grow them".
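
One common shape for "never block the audio thread" is a lock-free single-producer single-consumer queue: the control thread pushes commands or property changes, the audio callback pops them, and neither side ever takes a lock.  This is a generic sketch of that technique, not Synthizer's internals:

```cpp
#include <atomic>
#include <cstddef>

// Classic SPSC ring: one thread only ever writes head_, the other only ever
// writes tail_, so a pair of acquire/release atomics is enough.
template <typename T, std::size_t Capacity>
class SpscRing {
public:
  bool push(const T &item) { // called from the control thread
    std::size_t head = head_.load(std::memory_order_relaxed);
    std::size_t next = (head + 1) % Capacity;
    if (next == tail_.load(std::memory_order_acquire)) return false; // full
    items_[head] = item;
    head_.store(next, std::memory_order_release);
    return true;
  }

  bool pop(T &out) { // called from the audio thread
    std::size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) return false; // empty
    out = items_[tail];
    tail_.store((tail + 1) % Capacity, std::memory_order_release);
    return true;
  }

private:
  T items_[Capacity];
  std::atomic<std::size_t> head_{0};
  std::atomic<std::size_t> tail_{0};
};
```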

Second, memory bandwidth is a problem.  Any excess zeroing of buffers--any excess buffers at all, for that matter--will push things out of L1.  Your worst case is that things get pushed all the way to RAM.  So you don't want that.  Synthizer deals with this in two ways.  First, instead of making buffers per object, you can make buffers per invocation of a function and cache them.  I estimated the number of buffers a typical generator->source->context stack would need, then wrote a fixed-size cache which can hand them out on request.  Instead of putting a bunch of temporary buffers in your class for the intermediate steps, you ask the cache for buffers, and because it's a stack it's likely that the one you get is still in L1.  Second, Synthizer establishes a convention that all audio processing adds to the specified output buffer rather than just writing to it.  The naive way of doing this is one buffer per source: you fill all of them, then loop over every source and add.  But that's one buffer per source, usually a few KB each, and you just did the worst thing you could: read all of them start to finish, pushing everything else out of the cache.  Adding into the output buffer shifts this to some need to zero, but you've gone from O(n) buffers to roughly O(1) buffers.  Synthizer isn't perfect about this, but it's fast enough, and I'll probably only finish improving it when it becomes time to optimize for the Pi or something like that.
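
A stripped-down sketch of that kind of buffer cache (sizes and API invented for illustration): it's a stack of pre-allocated blocks, and the block you acquire is the one most recently released, so it's likely still hot in L1:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kBlockSamples = 512;  // samples per scratch block
constexpr std::size_t kCachedBlocks = 16;   // estimated worst-case need

class BlockCache {
public:
  BlockCache() {
    for (auto &b : storage_) free_.push_back(b.data());
  }

  float *acquire() {
    assert(!free_.empty() && "cache exhausted; a real library would grow or degrade gracefully");
    float *b = free_.back(); // most recently released block: probably still in L1
    free_.pop_back();
    return b;
  }

  void release(float *b) { free_.push_back(b); }

private:
  std::array<std::array<float, kBlockSamples>, kCachedBlocks> storage_;
  std::vector<float *> free_;
};

// Usage sketch: borrow a scratch block, fill it, add it into the shared
// output (per the add-don't-write convention), then hand it back.
```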

Also, pointer chasing is bad, so allocate buffers inline when you can, but this is already long enough.  And also, it doesn't matter how much memory is allocated but how much of it you access, so allocate all day long as long as you don't have to read all of it all the time; going into that is also probably beyond the scope of what is quickly becoming an essay.  Suffice it to say that Synthizer has lots of arrays that are waaay oversized for what they need to be, but it's fine because you only access the front.  In particular there's a hard-coded internal limit of 16 audio channels (why 16 is another topic, but I didn't pull it out of my ass and it's not related to CPU efficiency).
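
As a tiny example of "allocate inline, oversize it, only touch the front" (names and sizes made up, not Synthizer code):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kMaxChannels = 16;
constexpr std::size_t kBlockFrames = 256;

// The working buffer lives directly in the object, so there's no pointer to
// chase, and it's sized for the 16-channel worst case. Per block, only the
// first `channels` lanes are ever read or written.
struct MixerStage {
  std::size_t channels = 2;
  std::array<float, kMaxChannels * kBlockFrames> work{}; // deliberately oversized

  void clearActive() {
    // Only the channels actually in use cost memory bandwidth; the unused
    // tail of the array is never touched.
    for (std::size_t i = 0; i < channels * kBlockFrames; i++)
      work[i] = 0.0f;
  }
};
```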

Third, you have to take advantage of the CPU.  This means autovectorization or hand-vectorized code, and being friendly to the branch predictor and to compiler optimizations.  This gets a little speculative in that I haven't firmly benchmarked, and a lot of what I did in Synthizer here is based on mental heuristics, so grain of salt.

To make your code autovectorizable, prefer compile-time constants.  Synthizer does this by having a hard-coded block size and a hard-coded samplerate, which means the compiler almost always knows the exact number of iterations of a loop.  This gets you loop unrolling and such for free right away, and a lot of free autovectorization.  Also, be aware that floating point math isn't optimized the way you might think: (a+b)+(c+d) is efficient, but a+b+c+d isn't, because it has to evaluate left to right and the two aren't equivalent.  So you probably want -ffast-math, or a good intuition for when/how to write efficient floating point (Synthizer goes off good intuition, but will turn on fast math eventually).  For one concrete example, a/b in a loop is terrible, but a*(1/b), where b is a constant or (1/b) is computed before the loop, is something like 2 to 10 times faster.  You also get a lot here out of having loops that run in multiples of 4 or 16 iterations, and everything being a power of 2.
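
Here's a toy example of both points, with made-up names: the trip count is a compile-time constant so the compiler can unroll and vectorize the loop, and the per-sample division is replaced by one reciprocal multiply hoisted out of the loop:

```cpp
#include <cstddef>

constexpr std::size_t kBlockSize = 256; // fixed, known at compile time

void scaleSlow(float *block, float divisor) {
  for (std::size_t i = 0; i < kBlockSize; i++)
    block[i] = block[i] / divisor; // a division per sample: terrible
}

void scaleFast(float *block, float divisor) {
  float inv = 1.0f / divisor; // one division, outside the loop
  for (std::size_t i = 0; i < kBlockSize; i++)
    block[i] *= inv;
}
```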

x86 can do as many as 16 floating point operations per cycle, sometimes more, but only if you can hit SIMD.  Writing SIMD by hand is a pain and architecture specific; fortunately, using the stuff I'm describing here means you never have to, and there are also vectorization pragmas and things like that, some trig tricks, etc.

To make your code friendly to the branch predictor, lift if statements out.  A for loop inside an if/else is almost always better/faster than an if inside a for loop, because the former evaluates the condition once while the latter evaluates it once per iteration.  The higher you push this up, the better.  Branches are terribly expensive, especially when mispredicted; it's something like 30-50 math operations for the cost of 1 branch.  Synthizer uses the fact that C++ allows constants as template parameters, and that short-lived audio loops mean we don't care about the instruction cache for this purpose.  For example, BufferGenerator has if statements in a helper template that sit inside the loops and look terrible: they determine whether the position should reset when the buffer reaches the end due to looping.  But the condition is a constant template parameter and thus trivially optimized out by the compiler (you can reliably assume the compiler will do this; and even in debug builds, it's a perfectly predicted branch).  Higher up, an if tree determines which bools to turn on, then instantiates dedicated branchless functions off those (if you know C++, just read src/generators/buffer.cpp; the code is *much* cleaner than this explanation and it's fairly obvious why I wrote it this way).
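
Here's the shape of that template trick, loosely modeled on the idea rather than the real BufferGenerator code:

```cpp
#include <cstddef>

// The looping flag is a template parameter, so inside the hot loop the `if`
// is a compile-time constant and each instantiation is effectively branchless.
template <bool LOOPING>
void readBlock(const float *buffer, std::size_t bufferFrames,
               std::size_t &position, float *out, std::size_t frames) {
  for (std::size_t i = 0; i < frames; i++) {
    out[i] += buffer[position];
    position++;
    if (position >= bufferFrames) {
      if (LOOPING) {
        position = 0; // folded to a constant per instantiation
      } else {
        return; // a real generator would also mark itself finished here
      }
    }
  }
}

// Higher up, one runtime branch picks the dedicated instantiation for the
// whole block instead of branching once per sample.
void readBlockDispatch(bool looping, const float *buffer,
                       std::size_t bufferFrames, std::size_t &position,
                       float *out, std::size_t frames) {
  if (looping)
    readBlock<true>(buffer, bufferFrames, position, out, frames);
  else
    readBlock<false>(buffer, bufferFrames, position, out, frames);
}
```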

As a bonus, pushing if statements up makes more loops autovectorizable.  Autovectorization isn't perfect, and one of the ways it's not perfect is that the compiler can't always know when to push the if statement up itself, and SIMD has no facilities for branching (that's a lie to children--you can do it, but I'm not going to consult references I don't have memorized to show incredibly complicated examples, and doing it effectively is an entire Saturday of research for a few lines).  Suffice it to say your compiler can't be relied on for this, so move the if statement up or else.
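
A contrived before/after of what "move the if statement up" means in practice:

```cpp
#include <cstddef>

// The branch inside the first loop tends to defeat autovectorization; in the
// second version each loop body is straight-line math the compiler can
// vectorize.
void mixMaybeInvertedBefore(float *out, const float *in, std::size_t n, bool invert) {
  for (std::size_t i = 0; i < n; i++) {
    if (invert)
      out[i] -= in[i];
    else
      out[i] += in[i];
  }
}

void mixMaybeInvertedAfter(float *out, const float *in, std::size_t n, bool invert) {
  if (invert) {
    for (std::size_t i = 0; i < n; i++) out[i] -= in[i];
  } else {
    for (std::size_t i = 0; i < n; i++) out[i] += in[i];
  }
}
```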

Anyway, not sure if this answers the question or not, but there's a reason the third time's the charm with me and my audio library forays.  Getting all of the above right can be anywhere from a 2x to a 20x or more performance increase, depending on how wrong it was to start with, and a lot of the knowledge just comes from doing C/C++-level coding for a good while.

My Blog
Twitter: @ajhicks1992

2021-01-25 04:37:24

@2, wow, a lot to digest there. Autovectorization for me has always been difficult to get right; it seems like you have to write your code in a particular way. And pulling in architecture intrinsics is a pain (I still haven't figured out how to actually use SIMD properly and correctly). Your post was informative; I'll dig into Synthizer sometime and see if I can figure it out from there.

"On two occasions I have been asked [by members of Parliament!]: 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out ?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."    — Charles Babbage.
My Github

2021-01-25 05:58:23

Autovectorization is interesting in the sense that it's never guaranteed, but it's much easier to write an autovectorizable loop than it is to write SIMD intrinsics for two or more platforms at once.  The compiler recognizes all the common structures pretty reliably--e.g. adding two arrays--and that's that.  The only real trick is getting all the conditionals out of the loops, then maybe marking them with whatever your compiler's loop-hinting pragma is to say "yes, it's worth doing this here".  But I haven't had to use those even once, and even for something this performance sensitive you can just wait until it's actually too slow before worrying about it.  In general, as long as you write even halfway decent math code that's friendly to the architecture, the compiler will just do it and you can treat all of that as a black box.  Synthizer used to use the Clang vector extensions, but I removed them and they're probably not coming back.

If you want to go down the road of really understanding this stuff, honestly even more than I need, there's actually an accessible version of compiler explorer: https://godbolt.org/noscript

If you don't know what that is, you paste a C/C++/Rust/Go/a bunch of others program into it, select a compiler and some flags, and it gives you the assembly.  Supports all the common platforms/architectures and frankly I have no idea how it's free, but it is.
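
For instance, here's the kind of loop I mean by "adding two arrays"; dropping it into compiler explorer with -O3 is a quick way to watch autovectorization happen:

```cpp
#include <cstddef>

// Compiled with -O3 (or /O2 on MSVC), a loop like this typically comes out as
// SIMD adds on x86-64 and ARM alike, with no pragmas or intrinsics involved.
void addArrays(float *dst, const float *a, const float *b, std::size_t n) {
  for (std::size_t i = 0; i < n; i++)
    dst[i] = a[i] + b[i];
}
```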

My Blog
Twitter: @ajhicks1992

2021-01-25 06:05:01

@4, yeah, I've heard of that.

"On two occasions I have been asked [by members of Parliament!]: 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out ?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."    — Charles Babbage.
My Github

2021-01-25 08:59:47

really interesting stuff +1

2021-01-26 00:24:33

@Camlorn
Can't remember if you've addressed this elsewhere, so feel free to tell me to JFGI, but why did you choose to write Synthizer in C++ and not Rust? I thought you were a big fan of Rust these days?

I seem to remember you saying something about Rust not having some of the stuff you needed, but I don't remember what specifically, and I might not understand it, even if I did.

Would you ever bother converting to Rust? I know you've talked in the past about machine-specific C++ weirdness. Wouldn't Rust eliminate that?

-----
I have code on GitHub

2021-01-26 00:48:26

Rust needs to stabilize an advanced enough form of const generics as well as specialization before I'd jump at using Rust for this.  There's also the fact that Rust can't easily express object-oriented inheritance hierarchies, which turns out to be really useful for audio libraries, believe it or not.

It is possible to write safe C++ if you know what you're doing and use modern C++ features as opposed to the older stuff.  I know what I'm doing.  I think there have been fewer than 5 segfault/invalid-pointer issues.  Running under ASan and dealing with them is a lesser cost than trying to use Rust.  If I weren't experienced at C++, or if Rust had actually followed through on stabilizing things they've been promising for years, the trade-off might be different.  Admittedly, specialization and const generics are both very hard problems that are only needed by niche applications, but audio is one of said niche applications, so here we are.

As for whether it gets rid of machine-specific weirdness? It really depends.  Segfaults, yeah, maybe, until I end up using unsafe for stuff because reasons.  Or until I get the memory orderings on atomics wrong on less forgiving platforms than x86.  Rust isn't some sort of magic silver bullet; it guarantees a lack of data races and a lack of invalid pointers if and only if you never use unsafe, nothing more than that.

My Blog
Twitter: @ajhicks1992

2021-01-26 09:56:51

@8
Wow, OK, that completely shattered my misguided notion of Rust as Python for clever people haha.

Thanks for the explanation.

-----
I have code on GitHub

2021-01-26 16:53:06

No, it's not that.  If you want to learn a native systems language, Rust is probably where you want to go, and one of the positive things people say about it is "I came from Python and...".  But it's really different.  There are a lot of constraints.

For instance, no garbage collector.  Stuff at the Rust/C/C++ level typically can't afford one.  This actually includes Synthizer, because a 1ms or 2ms freeze at the wrong time is super bad.  Mind you if whatever you're doing *can* afford one, it's a good sign you should not use the native-level languages.

My Blog
Twitter: @ajhicks1992