All right, here goes. I'm going to show how we get from the statement "I want to make a sine wave" to "I have a buffer that contains a sine wave", including all the reasoning. Trig is not required, so long as you're willing to accept the things I say about the sine function--and then, the view of the sine function in audio is different from that in trig, anyway. This is long. I will entertain questions. There are no complete programming examples because the programming is very unimportant.
A lie about your Speaker
This is a lie about your speaker. It is true enough to be useful without actually being true. Pretend that you've got a spring with a plate on the end of it. The plate will naturally sit at some position, which we will call 0 units. You can push this plate inwards, say to -5 units. You can pull this plate outwards, say to 20 units. In our hypothetical lie-brand speaker, this spring can be contracted and expanded by an electrical signal. Say that the minimum contraction is at -5 volts and the maximum expansion is at 5 volts. By rapidly varying the voltage applied to this speaker, we can vibrate the plate in a predictable pattern.
Different speakers are driven in different ways and have different minimum and maximum voltages. Your sound card sits between your program and the speaker and deals with this for us. The minimum and maximum sample values depend on the format of the buffer--for a 16-bit int, the range is -32768 to 32767. Most audio programmers prefer to use -1.0 to 1.0, and that's what I'm going to use here. Getting from a -1.0 to 1.0 buffer to an integer format is simply a multiplication by the maximum value in the new format, and virtually everything these days will let you skip that step and set the audio format to float. I'm going to come back to this buffer at the end, but it's enough to know for now that if our function ever goes above 1.0 or below -1.0, it's bad. The minimum contraction corresponds to -1.0 and the maximum expansion to 1.0.
Continuous-time audio
Every sound in the real world is a continuous mathematical function that cannot go to infinity or -infinity. Your footstep, your car, everything. This is different from the computer's notion of audio, which is at the end of this post.
I'm not going to start with a sine wave, because it's complex. I'm going to start with a sawtooth wave because this lets me demonstrate how this works before adding the complexity that the trigonometric functions bring in. The % operator means remainder, i.e. 2%3 is 2, 4%3 is 1, 9%3 is 0, 12%3 is 0, and 13%3 is 1. Furthermore, this is defined on floating point values: 2.5%2 is .5, 4.7%2 is .7, 5.3%2 is 1.3. You can try the following mathematical function in Python if you want to see what it looks like--but I'll tell you. It goes up at about a 45-degree angle from 0 to 1, drops back down to 0 at 1, and then goes up again from 1 to 2, and so on. For the purposes of this post, t represents time in seconds and must start at 0--time is never negative.
So, that said, let's derive a sawtooth wave. The function representing a sawtooth wave is simply t%1. Since this function repeats every 1 second, we say that it has a period of 1 second.
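If you'd rather see numbers than take my word for it, here is a minimal sketch of t%1 in Python. The name saw_basic is made up for this example, and the time values are just ones that happen to be exact in floating point.

```python
def saw_basic(t):
    """Unscaled sawtooth: ramps from 0 up toward 1, once per second."""
    return t % 1

# Walk through a little more than one period in quarter-second steps.
for t in [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]:
    print(t, saw_basic(t))
```

The value climbs with t and falls back to 0 each time t reaches a whole second.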
There are two problems. I said that the minimum and maximum value need to be -1.0 and 1.0 respectively. This function has a minimum value of 0 and a maximum value of 1. Furthermore, the frequency of this function is too low--specifically, it's 1 Hz, well below the human hearing range. So let's fix them.
The first problem is easy. If I multiply the function by 2, its minimum value stays 0 and its maximum value becomes 2. To get from here to the range we want, I subtract 1. This gives:
2*(t%1)-1
Which is a 1 Hz sawtooth wave at maximum volume--that is, its minimum value is -1.0 and its maximum value is 1.0.
The next thing we want to do is to make its frequency change on demand. The period of a function is related to its frequency by frequency=(1/period). Say we want to make it play at 500 Hz. By basic algebraic manipulation, we can see that the period needs to be 1/500th of a second. We basically want t to move faster--to go through 500 periods between t=0 and t=1. We can get this effect by replacing t with 500*t.
2*((500*t)%1)-1
But something similar holds for all frequencies. Let frequency in hertz be f. How about this?
2*((f*t)%1)-1
Which is a sawtooth wave at any frequency. It will repeat f times a second instead of just once per second.
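Here is that formula as a small Python function, purely as a sketch. The name sawtooth and the argument order are my own choices for illustration; the body is the expression above, unchanged.

```python
def sawtooth(f, t):
    """Sawtooth at f hertz, scaled to run from -1.0 up toward 1.0."""
    return 2 * ((f * t) % 1) - 1

# At f=2 the wave should complete a full ramp every half second.
for t in [0.0, 0.125, 0.25, 0.375, 0.5]:
    print(t, sawtooth(2, t))
```

With f=2, the value at t=0.5 is back where it was at t=0, which is what "repeats f times a second" means.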
The Sine Wave
So let's apply this process to the sine wave. The minimum value of the sine function is already -1 and the maximum value of the sine function is 1, so we're all good there. The period of the sine function is a bit tricky, however: it's 2*pi--if you just do sin(t), it's going to repeat every 2*pi seconds.
The first thing we have to do is get rid of the 2*pi period. We want to use the above derivation, which works only on functions with a period of 1. We can just do sin(2*pi*t), which makes the period go to 1, and that's it. By what we have above, we bring in the f for frequency, and sin(2*pi*f*t) is the function representing a sine wave. It looks like a slithering snake or a bunch of half circles in a line, where every other half circle has its flat side up instead of down (it's a bit hard to describe accurately. I'm trying).
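The same idea as a Python sketch, in case you want to poke at it. The function name sine is just a label for this example; sin and pi come from the standard math module.

```python
from math import sin, pi

def sine(f, t):
    """Sine wave at f hertz, evaluated at time t in seconds."""
    return sin(2 * pi * f * t)

# With f=1 the period is 1 second: a quarter of the way in is the peak,
# three quarters of the way in is the trough.
print(sine(1, 0.25))
print(sine(1, 0.75))
# With f=2 the peak arrives twice as early.
print(sine(2, 0.125))
```

Because of floating point rounding, the printed values may be very slightly off from the exact peak and trough.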
Sine waves are interesting for a few reasons. It happens that if you introduce a concept called phase that I'm not going to try to describe here because it's unnecessary, you can represent any sound and, by extension, any real mathematical function with certain properties as a sum of sine functions. An interesting thing about audio in general is that, if you take two functions representing different sounds and add them, you get a new function that, if played, is the same as playing the sounds separately but simultaneously.
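As a rough sketch of the adding part: the function below sums two sine waves into one. The frequencies 440 and 660 are arbitrary picks of mine, and the 0.5 multiplier is my own addition so that the sum cannot leave the -1.0 to 1.0 range mentioned earlier.

```python
from math import sin, pi

def sine(f, t):
    """Sine wave at f hertz, evaluated at time t in seconds."""
    return sin(2 * pi * f * t)

def mix(t):
    """Two tones played at once: add the functions, then halve the sum.

    Each sine stays within -1..1, so their sum stays within -2..2;
    multiplying by 0.5 brings it back inside -1..1.
    """
    return 0.5 * (sine(440, t) + sine(660, t))

# Peak level over the first tenth of a second, sampled every 0.1 ms.
peak = max(abs(mix(i / 10000.0)) for i in range(1000))
print(peak)
```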
Your Computer and the Sampling Rate
So all of this is great, but I've not talked about the buffer yet.
Your computer cannot work in continuous time. It is impossible. Almost nothing can work in continuous time, save analog circuitry--as soon as the word digital comes in, the phrase sampling rate isn't far behind.
In order to overcome its limitations, your computer takes "samples" of audio. When you record, the microphone varies continuously, but the computer will only look at the value of the microphone so many times per second. If you are recording at 44100 Hz, the computer is going to ask the microphone "what is your displacement?" 44100 times a second. It then takes these displacement values and puts them in a buffer. It happens that, at least for humans, a 44100 Hz sampling rate might as well be a continuous function--when you play it back, you are literally unable to tell the difference between the original audio and the recorded audio. There are higher sampling rates and some people swear by them, but it's hard to tell any difference beyond 44.1 kHz.
So if you have the above sine function and want to make a buffer, here's how it works. Let's say that our sampling rate is 1000--it's an easier number than 44100 when you try to reason this out for the first time for the simple reason that 1/1000th of a second is 1 millisecond.
The first sample of the buffer is always at 0. The next sample is at 1/sr, the 3rd at 2/sr, the 4th at 3/sr, the 5th at 4/sr, and so on. These are in seconds. Because sr=1000, the first sample of the buffer is at 0 ms, the second at 1 ms, the 3rd at 2 ms, and so on. What you want to do is plug these time values into the sine function, and put whatever it returns at the appropriate slot in the buffer. The following Python code is very, very slow but should return a list of numbers representing the buffer for a sine wave with duration 1 second at sample rate 1000:
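from math import sin, pi

buffer = []
f = 100    # frequency in Hz; kept below sr/2, because exactly 500 here would only ever sample the zero crossings
sr = 1000  # sampling rate in samples per second
for i in range(sr):
    t = i / float(sr)  # float(sr) so this is never integer division
    buffer.append(sin(2 * pi * f * t))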
You can fold the sampling rate into the sin function, so that you have an index i instead of a time t. It would be as follows, if you did that:
sin(2*pi*f*(i/float(sr)))
Here sr is the sampling rate. Float makes it work in Python 2: without converting either i or sr to a float, it'll do integer division, which you don't want. In Python 3, / already does floating point division, so the float is harmless but not needed.
Plug in i=0 for the first sample (array index 0), i=1 for the second (array index 1), and so on. You're getting the function at time values that are a multiple of 1/sr and recording them in a buffer. A continuous time buffer would take infinite memory, if you will, and consequently can't ever work.
Some Practical notes
So, some quick things that you need to know to actually make use of this:
The minimum and maximum values of float and double format audio are always -1.0 and 1.0, but the integer buffer formats aren't. To convert a buffer of floats, you need to multiply every sample in the buffer by 2**(bits-1)-1, which is 32767 for 16 bits, and copy it over to a new buffer. Multiplying by 2**(bits-1) without the -1 would overflow for a sample of exactly 1.0. This is annoying, but true--it's easiest to request floating point when possible.
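A minimal sketch of that conversion follows. The function name float_to_int is invented for this example. It multiplies by 2**(bits-1)-1 so that 1.0 lands exactly on the largest value a signed integer of that size can hold, and it clamps out-of-range samples first; the clamping is my own assumption about what you'd want, not something any particular audio library requires.

```python
def float_to_int(samples, bits=16):
    """Convert -1.0..1.0 float samples to signed integers of the given size."""
    max_value = 2 ** (bits - 1) - 1  # 32767 when bits is 16
    out = []
    for s in samples:
        s = max(-1.0, min(1.0, s))   # assumption: clip anything out of range
        out.append(int(round(s * max_value)))
    return out

print(float_to_int([0.0, 1.0, -1.0]))
```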
Volume is a multiplier. To make a buffer quieter, multiply the entire buffer by a scalar less than 1. Louder is a scalar greater than 1. I find it easiest to make everything at max volume and then make it quieter--this gives a known range for volume without issues rather than having to guess if 2.0 is going to cause clipping. It is worth noting that human volume perception is not linear: 0.5 will not sound "half as loud" as 1. I can go into this if required, but all that's going to come out of me doing so is a magic formula. I suggest Googling the decibel and coming back if you have questions, which I will try to answer. Wikipedia will give the formulas and a pretty good description.
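For the curious, here is roughly what the magic formula looks like in Python. This is a sketch using the standard decibel relationship for amplitudes, gain = 10**(dB/20); the function names db_to_gain and apply_gain are made up for this example.

```python
def db_to_gain(db):
    """Convert a change in decibels to an amplitude multiplier."""
    return 10 ** (db / 20.0)

def apply_gain(samples, gain):
    """Return a new buffer with every sample multiplied by gain."""
    return [s * gain for s in samples]

# 0 dB leaves the buffer alone; -20 dB is a multiplier of about 0.1.
print(db_to_gain(0))
print(db_to_gain(-20))
print(apply_gain([1.0, -0.5, 0.25], 0.5))
```

Negative decibel values give multipliers below 1, so they only ever make a max-volume buffer quieter.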
If you go above the frequency sr/2, you will not get what you expect. For 44.1 kHz, this should never be an issue--sine waves above 10000 Hz quickly become inaudible, and 44100/2 is just above the upper end of the human hearing range. I do not understand the proof behind this limit, but sr/2 is called the Nyquist frequency and you never want to go over it (except for some advanced ring modulation algorithms that I don't need or understand either).
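You can see the "not what you expect" part without any proof. The sketch below samples a 900 Hz sine and a 100 Hz sine at a sampling rate of 1000, so the Nyquist frequency is 500. The helper name sample_sine and the specific numbers are mine, chosen only to keep the arithmetic easy.

```python
from math import sin, pi

def sample_sine(f, sr, count):
    """Return the first count samples of a sine at f hertz, sampled at sr."""
    return [sin(2 * pi * f * i / float(sr)) for i in range(count)]

sr = 1000
low = sample_sine(100, sr, 50)   # below sr/2
high = sample_sine(900, sr, 50)  # above sr/2

# If the 900 Hz buffer is just an upside-down 100 Hz buffer, then adding
# the two sample by sample gives (almost exactly) zero everywhere.
max_diff = max(abs(h + l) for h, l in zip(high, low))
print(max_diff)
```

The printed value is tiny, nothing but floating point rounding. In other words, the buffer you asked for at 900 Hz contains a 100 Hz wave flipped upside down, and that is what you would hear.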
And as for duration to buffer size: you need floor(duration*sr) samples in the buffer for a sound that is duration seconds long. For the above functions, it's best to stick to multiples of 1 second--it'll click if you don't because the ends won't line up when you loop it.
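Putting the buffer size rule together with the sine function gives something like the following sketch. The name render_sine and its parameter order are my own; the sample count is floor(duration*sr), exactly as described above.

```python
from math import floor, sin, pi

def render_sine(f, duration, sr):
    """Return a list of floor(duration*sr) samples of a sine at f hertz."""
    count = int(floor(duration * sr))
    return [sin(2 * pi * f * (i / float(sr))) for i in range(count)]

buf = render_sine(100, 1.5, 1000)
print(len(buf))
```

A duration of 1.5 seconds at a sampling rate of 1000 works out to 1500 samples.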
I think I've hit all the points, though digesting this might take a while.
My Blog
Twitter: @ajhicks1992