I was thinking you'd call a function which returns a pointer to a raw audio buffer.
Then there are some functions I can call with that buffer.
Clear empties the buffer, and Append adds the char* and its length to the buffer.
I thought that Synthizer would take audio data from the buffer, if there is any, and mix it in using a thread.
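The proposal above can be sketched as a toy Python class. This is entirely hypothetical: nothing here is a real Synthizer interface, and the names (RawAudioBuffer, take) are made up for illustration.

```python
# Toy model of the proposed buffer API: Clear empties the buffer, Append
# adds raw bytes plus a length; a mixer thread would drain pending audio.
# Hypothetical sketch only -- not a real Synthizer interface.

class RawAudioBuffer:
    def __init__(self):
        self._data = bytearray()

    def clear(self):
        # Clear: empty the buffer.
        self._data = bytearray()

    def append(self, data: bytes, length: int):
        # Append: mirrors the proposed (char*, length) signature.
        self._data += data[:length]

    def take(self, nbytes: int) -> bytes:
        # What a mixer thread might call to pull whatever audio is pending.
        out = bytes(self._data[:nbytes])
        del self._data[:nbytes]
        return out

buf = RawAudioBuffer()
buf.append(b"\x01\x02\x03\x04", 4)
print(buf.take(2), len(buf._data))  # drains two bytes, two remain
```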
EDIT: BTW, your miniaudio_example.cpp file didn't compile on my end until I added a line to the beginning.
#51 (edited by keithwipf1 2020-03-04 21:28:34)
I was thinking you'd call a function which returns a pointer to a raw audio buffer.
It very much depends on what you want to do.
If you're talking about these hypothetical buffer functions being used to compute the entire thing ahead of time: SAPI is quite heavyweight, and you'll add as much latency as you were trying to get rid of by doing it that way. And, amusingly (or not so amusingly), that latency will depend on the length of the string as well.
The only way to do streaming even remotely efficiently is to either have the library call a callback you provide asking you for samples immediately, or for the library to provide a function that blocks until more samples are required. NVDA does the latter. It is not feasible to share the thread of your main game loop for this if you want low latency; even with a polling interface, it's not possible to be accurate enough to prevent audio artifacts if you're interleaved with game code and trying to share with game timers.
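As a rough illustration of the first style (the library calling your callback from its own audio thread the moment it needs samples), here's a toy pull model in Python. AudioEngine and my_source are invented names, and a real engine would be driven by the sound card rather than a fixed block count:

```python
import threading

# Hypothetical sketch of callback-style streaming: the "engine" runs on
# its own thread and calls the user's callback whenever it needs samples.
# No real library API is implied by any of these names.

BLOCK = 4  # samples per request; real engines use hundreds per block

class AudioEngine:
    """Toy engine that pulls audio from a user callback on its own thread."""
    def __init__(self, callback, blocks):
        self._callback = callback
        self._out = []
        self._thread = threading.Thread(target=self._run, args=(blocks,))

    def _run(self, blocks):
        for _ in range(blocks):
            # The engine calls *your* callback the moment it needs
            # samples, on the audio thread -- not on your game loop.
            self._out.extend(self._callback(BLOCK))

    def start(self):
        self._thread.start()

    def join(self):
        self._thread.join()
        return self._out

phase = 0
def my_source(nsamples):
    # User-provided generator: a ramp, standing in for synthesized speech.
    global phase
    block = [phase + i for i in range(nsamples)]
    phase += nsamples
    return block

engine = AudioEngine(my_source, blocks=3)
engine.start()
print(engine.join())  # 12 samples, produced on the engine's thread
```

The blocking style NVDA uses is the same loop turned inside out: instead of the library calling you, you call a function that blocks until the device wants more samples.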
It's easy enough to provide interfaces for realtime streaming, but it's quite hard, probably impossible, to make them as easy to use as you want them to be. The problem isn't even exactly performance, it's getting several things to align so that you can get the samples into the right place at the right time without glitching.
But if the right interfaces exist you can in theory even do it from Python, the only problem there being the global interpreter lock, which can sometimes get in the way. Cython might let you avoid even that, and you can embed C code directly with CFFI anyway if you want, so there are ways.
For those wondering if this is dead: it's not. As of now it's the world's lamest media player. Once this is further along there'll be a dedicated thread for it. No promises as to when that'll be, but the first hard cliff is getting multi-format support and the custom protocol stuff working--to have things like HTTP or microphone streaming or who knows what else on the table, there's rather a lot that goes in now.
I have a code line counter: getting this far is 700 lines of my own code, excluding blanks and comments. Does wav at the moment, but pluggable backends exist so that it can be extended with any other format (perhaps proprietary ones even). Ogg, Mp3, and Flac are queued but not the immediate priority--let's prove the HRTF first.
I have 2 suggestions:
1. Please, please don't use C++ virtual functions (they're a big headache to implement in Cython).
2. If you want to make a C API, using callbacks is the best bet for implementing filters. The callback can do its job on another thread, and it makes it possible for others to roll their own filter stuff (like different equalizers, or other types of reverb like Freeverb, etc.).
It's also easier to code it in Cython, write other functions for it in Cython, and do more in Python itself.
I know. It's getting a C API, eventually. That's coming. Virtual functions are fine internally as long as they're not exposed to the user, and the API outside of the internals is going to be C, so that's not a concern for anyone doing bindings. Besides, I'm probably going to end up writing/maintaining the Cython bindings myself anyway.
You probably won't be implementing custom effects in Cython. You'll probably be implementing custom effects in Lua 5.1 or EEL2. While there are lots of ways to allow Python to provide custom sources, Python's GIL prevents any real concurrency and puts unpredictable kernel calls in places we don't want unpredictable kernel calls if you try to use it for effects. Why is complicated, but the short version is that you don't want to allow the Python program to block the audio thread from the time you allocate space in the ringbuffer to the time you allow the ringbuffer to proceed forward, and effects have to run in that time window by the nature of effects.
There'll probably be custom effect C callbacks, but using them with Cython will probably not get you what you want, in other words.
Fortunately Lua 5.1 (via LuaJIT, which is no longer maintained, thus the lack of Lua 5.3) is actually about as fast as C, and EEL2 is the language from Reaper's custom effects, which happens to be open source and meets all the requirements for audio programming, so there are ways to get this that aren't super inconvenient.
Thanks for the update!
Would you consider adding SDL2 as an optional audio back-end?
It seems a waste to have an extra audio back-end shipping with some or most games and not use it.
SDL2 isn't an audio backend or even particularly low latency, and even if added, all the other backends have to be in there anyway. You don't gain anything from it. It won't make anything smaller or faster.
@55 I said that because it is hard to implement virtual functions in Cython (you have to roll a proxy class that calls a Python function from within that virtual function).
Regarding Lua or EEL: those are good languages, but when we have a language like Python which can process effects easily (via Numpy or other means), why should we use those?
I'm not discouraging their use (especially EEL, which was designed with effect processing in mind).
Regarding the audio callback, that can be executed on its own thread without blocking anything else.
Google Python global interpreter lock. You can't execute audio callbacks reliably on background threads in Python because Python has an internal mutex that prevents this. Cython has a nogil utility but then stops short of giving you atomics, so even if the gil weren't a problem you couldn't communicate with the callback in a lockfree fashion.
The impact of the GIL is that even if multiple effects are running on different threads, only one of them can be inside Python at once.
Also, the overhead of getting very small blocks of audio data into and out of Numpy for processing is massive.
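To make that round trip concrete, here's a stdlib-only sketch (using the array module to stand in for Numpy) of the two per-block copies being described; the gain effect is just a placeholder:

```python
from array import array

# Sketch of the per-block round trip described above: each block of audio
# must be copied out of the engine's byte buffer into an array for
# processing, then copied back into bytes for the engine. The copies
# themselves (not the math) are the constant overhead paid on every block.

BLOCK_SAMPLES = 256  # Synthizer's block size, per the thread

def process_block(raw: bytes) -> bytes:
    samples = array("h", raw)                          # copy 1: bytes -> array
    processed = array("h", (s // 2 for s in samples))  # the "effect": halve gain
    return processed.tobytes()                         # copy 2: array -> bytes

block = array("h", range(BLOCK_SAMPLES)).tobytes()
out = process_block(block)
result = array("h", out)
print(result[:4])  # first samples of the ramp, halved
```

At 256 samples a block there are ~172 blocks per second, so those copies (and, with real Numpy, the array-object creation and destruction) happen at that rate regardless of how cheap the math is.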
And, you have no real control over when Python allocates.
If we had just one of these concerns it could be dealt with but we have all of them, so it can't.
People think about audio programming wrong. You have to be fast not because you're using all the resources but because you have to wait until the last instant for low latency purposes. I am already using a semaphore, which has to call into the kernel, which usually means a high chance of thread preemption, and I hate it. Here's hoping that I only need one.
But finally, I take it you've never programmed an audio effect before. Numpy doesn't help you. Delay lines can't be vectorized. If you want to do a convolution or filter, you can just feed the impulse to something else; there's no need to implement it yourself. If you want to do a feedback delay network or something similar, that's already slow even in C.
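For instance, a feedback echo is recursive: output sample n depends on output sample n - delay, so there's no way to hand the whole block to a vectorized operation. A minimal sketch, with illustrative parameters:

```python
# Why an echo can't be vectorized: each output sample feeds back into the
# computation of a later output sample, so samples must be computed one at
# a time, in order. Parameters are illustrative only.

def feedback_echo(samples, delay, feedback):
    out = [0.0] * len(samples)
    for n in range(len(samples)):
        # Read our *own output* from `delay` samples ago -- this is the
        # recursion that defeats vectorization.
        echoed = out[n - delay] if n >= delay else 0.0
        out[n] = samples[n] + feedback * echoed
    return out

dry = [1.0] + [0.0] * 7  # a single impulse
print(feedback_echo(dry, delay=2, feedback=0.5))
```

The impulse repeats every `delay` samples at half the previous amplitude, and each repeat depends on the one before it.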
As I said already the callback will exist, because in non-Python scenarios it makes sense to, but you're not going to get what you want from it.
You can do custom sources in Python because you can compute the data way ahead of time. Or at least way ahead of time in audio land, which is 200 ms or so. So Synthizer can handle that internally. But audio effect callbacks can't be done ahead of time and must run only once the current blocks of audio are waiting to go out to the sound card. Practically speaking this means you have 1 or 2 ms at most, start to finish.
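A sketch of why that works: a custom source can run on an ordinary Python thread and keep a bounded queue of blocks filled roughly 200 ms ahead, so occasional GIL pauses are absorbed by the slack. All numbers are illustrative, and the queue stands in for whatever Synthizer does internally:

```python
import queue
import threading

# Why custom *sources* are fine from Python: the source thread fills a
# bounded queue well ahead of playback, so a late block doesn't glitch.
# Effects get no such slack -- they must run in the current block window.

SAMPLE_RATE = 44100
BLOCK = 256
AHEAD_MS = 200  # the "way ahead of time in audio land" figure from the post
QUEUE_BLOCKS = AHEAD_MS * SAMPLE_RATE // 1000 // BLOCK  # blocks of slack

prefetch = queue.Queue(maxsize=QUEUE_BLOCKS)

def source_thread(nblocks):
    for i in range(nblocks):
        # Slow Python code is fine here; we're ~200 ms ahead of the DAC.
        prefetch.put([float(i)] * BLOCK)

t = threading.Thread(target=source_thread, args=(5,))
t.start()
mixed = [prefetch.get() for _ in range(5)]  # consumer side (the mixer)
t.join()
print(len(mixed), QUEUE_BLOCKS)
```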
What kind of effect do you want to implement that Synthizer isn't going to provide already?
A bit off-topic, but what Cmake flags should I use to compile this for 32-bit or should I just install a 32-bit LLVM toolchain and use that? Also, the CFLAGS "-ffunction-sections -fdata-sections" reduced wav_test.exe from 596KB to <250KB, is it OK to add these to the default flags?
CMake will pick up architecture from the VS command prompt you launch it from. Install VS 2019, install Clang 9, install the most recent CMake, then you can build for 32 bit from the native tools command prompt for 32 bit, I believe. It's not as simple as changing flags, you do in fact have to change the toolchain (sort of--it's complicated. But the previous procedure should work).
You can probably get smaller binaries just by changing CMAKE_BUILD_TYPE to Release in CMakeCache.txt. On the whole the size of that executable doesn't matter, and the final DLL will be on the order of megabytes.
I had to look up those flags. They're not useful flags. The useful flag is probably -flto or some variant thereof.
Please don't send patches for CMake configuration changes: I will probably not merge them at this stage, as that's a moving target. Also if you don't know CMake well, it's easy to get it wrong in ways that seem right.
The idiomatic way to do this is to edit CMakeCache.txt and rebuild.
Ok, looking further: yes, those options are helpful now. But they won't be for the most part when this is done. At the moment there's dead code because most code is in progress. There might be a bit of dead code from e.g. WDL in the end, but I'm not overly concerned.
#63 (edited by keithwipf1 2020-03-25 21:41:20)
Thanks. I actually tried, and CFLAGS=-flto, LDFLAGS=-flto -fuse-ld=lld didn't work; I got some kind of undefined references.
I figured as much about the dead code. It actually sounded pretty funny when I played a wave file with a lower sample rate.
@59, I haven't written audio effects in Python, so I can't comment on that any further (I was just giving my idea).
What I do with audio now is process it and feed it to neural networks for data science purposes (which is beyond the scope of this library, at least).
But for coding games, I gave this idea of writing effects in Python (which SoLoud doesn't support in its Filter interface).
Wave files of different sample rates sound funny because that test doesn't currently use the resampler, because (as far as what's on GitHub goes, at any rate) there is no resampler to be used.
wav_test will shortly be rewritten to not use miniaudio directly.
Yeah, but the thing is that, for the most part, there shouldn't be anything you actually need to write custom effects for.
And the neural network stuff trades off latency for performance wherever possible, which means none of that tooling is really applicable. And yes I do include Numpy in that--Numpy is great if what you have is a Numpy array, but the overhead of creating and destroying them is terrible.
Really interested here, thanks for your work!
@65 it depends on its size.
A simple array has no performance overhead.
But for something like neural networks, my experience says you will need to allocate that on the GPU, which in turn requires CUDA and a framework other than Numpy (Numpy by itself doesn't support the GPU).
There is overhead because you need to get the bytes out of a Synthizer buffer and into the Numpy array. Then you need to get the bytes out of the output array and back into a synthizer buffer.
I have used these tools significantly. I've run Numba, I've run Theano, I've run Tensorflow. Numpy/scipy is what I reach for when I need to do math, and is what is powering the HRTF conversion scripts Synthizer itself needs. My pipe dream plan to build a speech synthesizer relies on this stuff as well.
It might be possible to use the memoryview interfaces to do the copy without round tripping into Python but you'd still have to hold the GIL for that and holding the GIL is a bad idea for audio effects. Plus the copy still happens. There is one possibility here where you write your effect in cython with nogil and Synthizer provides the missing inter-thread synchronization piece, but even if this starts working you can't just go grab the GIL and start playing with Numpy and have things be amazing because your main game threads can just decide to hold onto it and not let go for a bit.
But let's say that none of that is a problem. You've got it in Numpy via some magic. The GIL is gone, somehow, which the Python people have been trying and failing to do for literally 10 years now. Even in this hypothetical case, you still have two problems.
First is that Numpy does have overhead with small arrays, since every vectorized operation has to be arranged for by Python. Numpy asymptotically approaches C with large arrays, but for 256 samples or so it certainly doesn't. It'll be faster than Python, but saying it's zero overhead is incorrect by far. It's constant overhead, but that means that if the arrays are small the constant overhead can sure count for a lot. Rule 1 of algorithmic complexity and constant overhead and things like that is that you only get to sweep them under the rug if the problem size is large enough to hide it.
Second is that audio effects can't be vectorized. Numpy vectorizes two audio operations of interest: the FIR filter and the IIR filter. These equate to a lowpass/highpass/bandpass plugin, nothing more. You can perhaps vectorize ringmod if you're sufficiently clever. But that's it. Most audio effects are recursive, for example echoes, where you need to compute every output sample one at a time so you can feed it back into the beginning. For those that aren't, say a simple delay line, the least efficient way to do it is the obvious vectorized way: what you have to do instead is compute one sample at a time and use modulus to make indices wrap around in the buffer. That can be vectorized a little bit if and only if the delay is always longer than the number of samples the effect needs to compute at once, but that's not usually the case.
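The modulus trick mentioned above looks like this in miniature: a circular buffer with a wrapping index, processed one sample at a time (pure Python, parameters illustrative):

```python
# The non-vectorized delay line described above: a circular buffer where
# the index wraps around via modulus, one sample per iteration.

def delay_line(samples, delay):
    buf = [0.0] * delay  # circular buffer holding the delayed samples
    idx = 0
    out = []
    for s in samples:
        out.append(buf[idx])      # read the sample written `delay` steps ago
        buf[idx] = s              # overwrite that slot with the current input
        idx = (idx + 1) % delay   # wrap the index around the buffer
    return out

print(delay_line([1.0, 2.0, 3.0, 4.0, 5.0], delay=2))
```

Note each loop iteration touches a single sample: this is exactly the read-one-value, compute-in-pure-Python, write-one-value pattern that throws away Numpy's advantage.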
And what that means is that your nice numpy operations just became read one value from the numpy array, do math in pure Python, write one value to the numpy array. And at that point it's as much overhead as lists save that it's easier to copy into and out of numpy arrays from the bindings. But at that point why bother doing that? You can get zero copy if you just use the buffers directly, which is what the bindings will expose assuming that I'm right about being able to solve this in some fashion via cython nogil and Synthizer exposing some atomic lockfree operations.
When you use something like Numpy for machine learning, even on the CPU, there is overhead. But 10-20 ms of overhead when the problem you're running takes 10-20 seconds per iteration is nothing. On my machine currently, Synthizer's block size is 256 samples at a sampling rate of 44100, which is about 5.8 ms per block. You'll lose 1-2 ms of this to the GIL and 0.5 to 1 ms of this to getting data into and out of Numpy. That leaves you roughly 2 ms to play with. If you compute slower than that on more than half the blocks, eventually Synthizer runs out of data to feed to the sound card. If you do it on less than half the blocks, Synthizer can still run out of data if you miss 4 or 5 in a row, which will probably kick an as-yet-to-be-written latency increasing algorithm into gear, thus making the audio of your game progressively higher latency.
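Working through that arithmetic (the sample rate and block size are from the paragraph above; the GIL and copy costs are the post's estimates, not measurements):

```python
# Budget math for one audio block, using the figures from the post.
SAMPLE_RATE = 44100
BLOCK = 256

block_ms = BLOCK / SAMPLE_RATE * 1000  # wall-clock time one block represents
gil_ms = 2.0    # upper estimate: GIL acquire/release churn per block
copy_ms = 1.0   # upper estimate: data into and out of Numpy per block
budget_ms = block_ms - gil_ms - copy_ms  # what's left for actual effect math

print(round(block_ms, 2), round(budget_ms, 2))
```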
Now let's bring the GIL back in. You've vectorized in Numpy, but every round trip to Python is going to release/acquire the GIL. Let's say that you do that 5 times per effect. The way Python works is that other threads running pure Python bytecode will only release the GIL after running a specific number of bytecodes (though I believe on Python 3 this got switched to a timer--I'm not going to go read C source code for the sake of this post), so figure on 0.1 to 0.25 ms per GIL operation if the main game thread is trying to acquire it. And there goes your computation window without you even doing any math.
And we haven't even talked about how if you accidentally allocate objects (which Python makes easy) it might call into the OS and knock more time off the very small amount of time you have in which to do your effect computations.
Keep in mind: most effects are literally millions of math operations, just to put this in perspective--even in C++ it's hard to stay within these time windows, and Synthizer is already going nuts about lockfree data structures and such to avoid doing certain negative things that Libaudioverse did which forced Libaudioverse's block size and latency to be much larger than what Synthizer can offer you in order to hide constant overheads.
This is a hard realtime system. Python is not suitable for hard realtime systems. Hopefully the above is clear enough as to why and I won't have to keep explaining this.
Frankly, I know more Python than almost everyone on this forum and this is my third audio library, and arguably my second successful one. Libaudioverse didn't fail, it's just much wider in scope and would take too much time to finish. Point being, if I say something is a bad idea and it's to do with Python or audio, I probably know what I'm talking about, and I'd appreciate some credit as opposed to the current thing where you keep trying to tell me that I'm wrong. You probably wouldn't use this if it existed anyway, simply because 99% of projects don't need a custom effect. And as I've already said, custom sources are a different kind of thing and you'll be able to implement them in Python fine without a problem, and I'm already looking into what embeddable scripting languages meet the licensing and hard realtime requirements that need to be met for custom effects as well.
For anyone wondering if this is still seeing work: it is. As of a very long weekend I have proven that what I want to do for HRTF works in theory, and I've written all the supporting pieces to actually go ahead and implement it in practice. There's nothing like a giant pile of math to work through, but I worked through it and the prototypes of the mathematically questionable parts seem to be functioning.
I'm hoping that there'll be an HRTF demo in a week or two, but work is going into Q2 this week and I'm getting handed at least one big project right away, so no promises. A C API and such will take longer. At the moment this is a bunch of library-private classes that I'm throwing together on an as-needed basis to write test programs.
It turned out that I still had my 3D audio benchmark against OpenALSoft. After quite a bit of work I was able to get that running: without doing anything but HRTF (i.e. no streaming etc.) my computer gets 1600 sources in its most basic HRTF configuration. It's less than that if you configure the HRTF to sound better than the default. But in terms of having a target number to play the speed game with (and keeping in mind that you'd need to be running on my tower for a meaningful comparison), we have one now.
Also I think that I'll be able to write a really good buffer/sound cache in C that preloads assets for you; I ended up writing a ton of concurrency stuff for other reasons. And I figured out why Libaudioverse couldn't ever get Windows to give Libaudioverse amazingly high priority for the audio threads as well.