@12
Synthizer gets double-digit source counts in debug builds with the boring old non-optimized convolution that you're never supposed to use, and I expect upwards of 1000 in practice on a good machine in the microbenchmarks, whenever those happen. That's partly sarcasm--when weighing boring old O(n^2) convolution against the FFT version, everyone always misses the bit about the crossover point below which the direct version is actually faster, which is something I persistently see people get wrong. Sadly it sounds like the WebAudio people missed that too. Fun. Admittedly part of that is that I am very clever about packing data appropriately for SSE vectorization, but still. Every time I think I'm done being disappointed with the quality of WebAudio's implementation, I find that I was in fact mistaken.
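For anyone who wants to see the crossover for themselves, here's a rough Python/NumPy sketch (nothing to do with Synthizer's actual code; block and filter sizes are arbitrary illustration values, and the exact crossover is machine-dependent):

```python
# Rough sketch of the direct-vs-FFT convolution crossover (illustrative
# only, not Synthizer code). HRTF impulse responses are short, which is
# exactly the regime where direct O(n*m) convolution can win.
import timeit

import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
block = rng.standard_normal(256)  # one audio block of samples

for taps in (32, 128, 512, 2048):
    ir = rng.standard_normal(taps)  # stand-in impulse response
    direct = timeit.timeit(lambda: np.convolve(block, ir), number=50)
    fft = timeit.timeit(lambda: fftconvolve(block, ir), number=50)
    print(f"{taps:5d} taps: direct {direct:.4f}s  fft {fft:.4f}s")
```

On most machines the direct version wins comfortably at the short filter lengths HRTF actually uses; the FFT version only pulls ahead as the impulse response grows.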
I haven't read your code, but I'm assuming that you're doing a biquad per channel. The problem with that is that a single biquad isn't actually enough to capture HRTF. You'll permanently lose out on any vertical effects, for one thing. What I may eventually do in Synthizer is allow an additional lowpass filter for emphasis. But even that doesn't work out so well in practice: the HRTF lowpass for "behind the player" and the lowpass for "occluded by a wall" are effectively the same filter. It might be possible to construct a more accurate model with 2 or 3 biquads in series, but I haven't tried. The actual frequency response of those impulses is more like a complicated equalizer, not just a couple of filters in series. The "the head is a sphere" class of models throws out a lot of detail.
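To make that concrete, here's a hedged SciPy sketch of the gap between one biquad per channel and a short cascade (my assumptions about the setup, not the poster's code; the 5 kHz cutoff is an arbitrary illustration value). Even the cascade is a smooth curve, nowhere near the jagged, equalizer-like response of a measured HRTF:

```python
# One biquad = one second-order IIR section; a cascade of sections is
# SciPy's "sos" format. Cutoffs are arbitrary illustration values.
import numpy as np
from scipy.signal import butter, sosfilt

fs = 44100
single = butter(2, 5000, btype="low", fs=fs, output="sos")   # 1 biquad
cascade = butter(6, 5000, btype="low", fs=fs, output="sos")  # 3 biquads

stereo = np.random.default_rng(1).standard_normal((2, 4096))
# Filter each channel independently, as a biquad-per-channel setup would.
out_single = np.stack([sosfilt(single, ch) for ch in stereo])
out_cascade = np.stack([sosfilt(cascade, ch) for ch in stereo])
```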
As a reference point, check this or any of the other variants of it with different HRTF datasets. OpenALSoft is kind of the benchmark here, and it's just convolution plus reintroducing the removed interaural time difference that was extracted with magic. I can go into the specifics of the magic if you know enough math, or you can read the scripts in Synthizer's repository, where I do something in the same general ballpark. The general idea is that you convert the filters to minimum phase, then do some normalization that lets you select between flat frequency response and "this is Bob's head and only Bob's head" response, and land somewhere in the middle: enough to emphasize what most people have in common without overindividualizing it.
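The minimum-phase step, at least, is standard DSP rather than magic. A minimal real-cepstrum version looks like this (a generic textbook sketch, not Synthizer's actual scripts, and without the normalization step):

```python
# Generic real-cepstrum minimum-phase conversion (textbook method, not
# Synthizer's scripts). Keeps the magnitude response and discards the
# excess phase; the interaural time difference has to be extracted
# separately and reintroduced later.
import numpy as np

def minimum_phase(h, n_fft=4096):
    H = np.fft.fft(h, n_fft)
    # Real cepstrum of the log magnitude (clamped to avoid log(0)).
    cep = np.fft.ifft(np.log(np.maximum(np.abs(H), 1e-12))).real
    # Fold the cepstrum: keep t=0, double positive time, zero the rest.
    w = np.zeros(n_fft)
    w[0] = 1.0
    w[1:n_fft // 2] = 2.0
    w[n_fft // 2] = 1.0
    Hmin = np.exp(np.fft.fft(cep * w))
    # Full-length result; in practice you'd truncate back to len(h).
    return np.fft.ifft(Hmin).real
```

The folding trick works because the log magnitude is real and even, so zeroing the negative-time cepstrum and doubling the positive-time part preserves the magnitude while forcing all zeros inside the unit circle.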
But in general you're stuck with WebAudio and probably can't do better, at least unless you feel like getting into WebAssembly. It might be possible to write a program that determines the low-level coefficients for biquad equivalents, but I don't think WebAudio will crossfade those properly for you, and the naive approach of crossfading the coefficients one by one actually produces unstable intermediate filters.
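The usual workaround, sketched below under my own assumptions (plain NumPy/SciPy, not WebAudio; arbitrary cutoffs), is to run the old and new filters in parallel and crossfade their outputs. Each filter is stable on its own, so every intermediate mix is stable too:

```python
# Crossfade the filter *outputs*, not the coefficients: both filters
# are individually stable, so the mix never blows up.
import numpy as np
from scipy.signal import butter, lfilter

fs = 44100
b_old, a_old = butter(2, 500, btype="low", fs=fs)   # outgoing filter
b_new, a_new = butter(2, 8000, btype="low", fs=fs)  # incoming filter

x = np.random.default_rng(2).standard_normal(1024)
y_old = lfilter(b_old, a_old, x)
y_new = lfilter(b_new, a_new, x)
fade = np.linspace(0.0, 1.0, len(x))  # linear ramp over one block
y = (1.0 - fade) * y_old + fade * y_new
```

In WebAudio terms that would be two BiquadFilterNodes feeding two GainNodes whose gains you ramp in opposite directions, then tear down the old branch once the fade completes.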
One thing that does really leap out at me is that you say what you do is faster than 2 StereoPannerNodes per source. What was the motivation there? As far as I know you should only ever need one per source.
My Blog | Twitter: @ajhicks1992