I think you're assuming that there was more designing than there actually was. :)
I probably started out with a minimal example of "I want to get audio input and output", hooked the two up to each other, added something resembling a UI to control it, and then added a layer of network IO to connect the two.
I certainly took the path of getting audio to work (nicely) first and then adding video later - and I'm *still* trying to make audio work nicely!
I'm probably missing something important, but realtime video streaming simply doesn't have a bunch of the problems that audio does. If packets don't arrive, or they arrive late or out of order, it doesn't matter: you can always just show the latest frame of video that you have and it'll look fine. With audio you very quickly start getting very noticeable artifacts if you miss any packets at all.
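A rough Python sketch of that difference, for illustration only. This is not code from the actual project: the function names, the `(sequence_number, payload)` packet format, and the `"<gap>"` placeholder are all made up here, and a real audio receiver would use a jitter buffer with playout deadlines rather than a simple lookup.

```python
def video_frame_to_show(arrivals):
    """Return the newest frame seen so far; late or missing frames are ignored."""
    latest_seq, latest_frame = -1, None
    for seq, frame in arrivals:
        # An out-of-order (older) frame is simply dropped: the newer one wins.
        if seq > latest_seq:
            latest_seq, latest_frame = seq, frame
    return latest_frame


def audio_playout(arrivals, total, silence="<gap>"):
    """Return what gets played for each slot 0..total-1, in order.

    Audio has to be played back in sequence, so a missing chunk leaves a
    hole that must be filled with something (here, a placeholder).
    """
    received = dict(arrivals)
    return [received.get(seq, silence) for seq in range(total)]


# Hypothetical arrival order: packet 1 arrives late, packet 3 never arrives.
arrivals = [(0, "a"), (2, "c"), (1, "b"), (4, "e")]

print(video_frame_to_show(arrivals))  # the newest frame is still available
print(audio_playout(arrivals, 5))     # one slot has to be filled with a gap
```

The video side never notices the lost packet, while the audio side ends up with an audible hole for every chunk that didn't make it.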
I did spend a bunch of time at one point trying to rearrange code so that it was a bit more "elegant" and intuitive, but I eventually gave up, realising that if there's a nicer design in there it'll probably become clear later on, and carried on with doing actual work.
Actually, I don't like to work without designing it first :)
I never thought that audio streaming was harder than video streaming. Wow!
Thank you, I am going to start like you then. I'm planning to add encryption, too. Wish me luck!