Socket I/O: Unix, Windows, and .NET

This is adapted from a recent email conversation I had about network I/O models for a GUI client running on .NET. I ended up on a tangent, covering APIs available for 3 major platforms. Since all this info was gathered in one place, I figured I might as well post it 🙂

There’s also some discussion about when blocking and non-blocking models are appropriate, but I don’t actually draw any conclusions. Hopefully there’s enough background here for someone to make the call on their own.

Let’s start by clarifying some terms. Blocking vs nonblocking refers to how I/O calls on a socket behave. A call on a nonblocking socket returns as soon as it has done as much as it could do immediately. For a recv(), that means it either returns data already in the network stack’s buffer, or signals that there’s nothing to get. For a send(), that means it copies data into the stack’s buffer until it’s done, or reports that the buffer filled after N bytes. On a blocking socket, recv() won’t return until something is received; send() won’t return until all of the data has been handed to the stack.
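To make the nonblocking behavior concrete, here is a minimal Python sketch. It assumes a Unix-like system and uses socket.socketpair() as a stand-in for a real network connection; the variable names are mine, not part of any API.

```python
import socket

# A connected pair of local sockets stands in for a real connection
# (assumption: Unix-like system, where the pair is AF_UNIX).
a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)

# Nothing has been sent yet, so a nonblocking recv() signals "nothing to get".
# Python surfaces EAGAIN/EWOULDBLOCK as BlockingIOError.
try:
    a.recv(4096)
    recv_would_block = False
except BlockingIOError:
    recv_would_block = True

# Once data is sitting in the stack's buffer, recv() returns it immediately.
b.send(b"hello")
data = a.recv(4096)

# A nonblocking send() accepts data until the buffer fills, then signals
# that it cannot take more right now. Nobody reads from `a`, so it must fill.
chunk = b"x" * 65536
accepted = 0
send_would_block = False
try:
    while True:
        accepted += b.send(chunk)  # may accept only part of the chunk
except BlockingIOError:
    send_would_block = True

a.close()
b.close()
```

The return value of send() is the "N bytes" mentioned above: the caller has to remember how much was accepted and retry the rest once the socket becomes writable again.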

Asynchronous I/O refers to having the entire operation in flight, and getting notification of its completion later. Unlike nonblocking I/O, which simply avoids waiting if it can’t finish the request now, an asynchronous call will complete at some point without any further API calls.

When you use nonblocking or asynchronous I/O, you eventually need to get notification of when you can do something more, which brings up event models.

Unix

For Unix, the important point to note is that “everything is a file” as far as I/O goes. Network, file, console, pipe, device — it’s all accessed through a file descriptor, so event notifications are all oriented around FDs. UI stuff follows the same path.

For nonblocking I/O, select() is the classic: it accepts 3 arrays of boolean FD flags (one each for read, write, and error) and a wait timeout. Upon return, the flags in each array are updated to show which FDs had an event of that type. In the nonblocking socket case, a “read” state means there’s incoming data to recv(), while a “write” state means the internal output buffer has room again, so it’s possible to call send(). select() has three major limitations: it can only indicate 3 types of events, it has an often-hardcoded upper limit on the number of FDs you can check, and since every call scans the whole array, it doesn’t scale for high-load servers.
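As a rough illustration of the read and write states, here is a small Python sketch using the standard select.select() wrapper. Python takes three lists of sockets rather than C's fd_set bitmaps, and the socket pair is again just a local stand-in for a network connection.

```python
import select
import socket

a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)

# Nothing has arrived yet: `a` is not readable, but it is writable because
# its output buffer has room. A timeout of 0 makes select() return at once.
readable, writable, errored = select.select([a], [a], [a], 0)
before = (a in readable, a in writable)

# After the peer sends something, `a` shows up in the read set as well.
b.send(b"ping")
readable, writable, errored = select.select([a], [a], [a], 1.0)
after = (a in readable, a in writable)

msg = a.recv(4096) if a in readable else b""

a.close()
b.close()
```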

poll() is the successor to select(). It accepts a single array of structs (FD and associated flags). Upon return, various flags are set indicating the state of the associated FD. poll() can provide notification of more types of events, and doesn’t have select()’s array size limit, but still has the scalability problem.

Those two are available pretty much everywhere. The following nonblocking I/O event types are supported by some systems but not others, and all of them have the goal of getting rid of the scalability limitation. Some of them also differ in how they indicate the events. select() and poll() provide what’s known as “level-triggered” notifications: they return if it’s possible to read or write now. If you poll() and it returns a socket in the read-ready state, and then you don’t recv() from that socket but call poll() again, it will again immediately return to tell you that socket is read-ready. The other type of event indication is “edge-triggered”: you get a single notification when the state changes (e.g. from can’t-read to read-ready), and then nothing more until you do something to cause another state change.
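The level-triggered behavior is easy to see with poll(). This is a sketch only, assuming a platform where Python's select.poll() exists (it does not on Windows):

```python
import select
import socket

a, b = socket.socketpair()
fd = a.fileno()
b.send(b"data")

poller = select.poll()
poller.register(a, select.POLLIN)  # only ask about "readable"

# poll() returns a list of (fd, event mask) pairs; timeouts are milliseconds.
first = poller.poll(1000)

# No recv() in between: a level-triggered API reports the FD as ready again.
second = poller.poll(0)

# After draining the data, the condition no longer holds and nothing is reported.
a.recv(4096)
third = poller.poll(0)

a.close()
b.close()
```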

Solaris sports /dev/poll. It works like poll(), but without the arrays: you write FDs to it, and then read FD events back out. The wait call is an ioctl() on the /dev/poll FD; it is level-triggered.

Linux has epoll. It’s the same concept as /dev/poll, just with a dedicated API (epoll_create(), epoll_ctl(), epoll_wait()) instead of a device. The wait call is epoll_wait(); it can be either edge-triggered or level-triggered.
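For contrast, here is a Linux-only sketch of edge-triggered notification using Python's select.epoll wrapper with the EPOLLET flag. The usual rule of thumb with edge-triggered events is to keep reading until the socket reports it would block, which the drain loop at the end shows.

```python
import select
import socket

a, b = socket.socketpair()
a.setblocking(False)

ep = select.epoll()  # Linux only
ep.register(a.fileno(), select.EPOLLIN | select.EPOLLET)

b.send(b"one")
first = ep.poll(1.0)   # edge: not-readable -> readable, reported once
second = ep.poll(0)    # data is still unread, but there is no new edge

b.send(b"two")         # newly arriving data counts as a new edge
third = ep.poll(1.0)

# Drain everything: with edge-triggered events nothing will remind us later.
received = b""
try:
    while True:
        piece = a.recv(4096)
        if not piece:      # peer closed the connection
            break
        received += piece
except BlockingIOError:
    pass                   # buffer is empty again

ep.close()
a.close()
b.close()
```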

Linux also has realtime signals. In this model, fcntl() is used to associate an FD with a specific signal, and that signal is raised when the FD’s status changes. Unix signals are normally delivered to the process during certain common system calls (they interrupt the syscall, call the program’s handler for the signal, and (maybe) return from the syscall afterward). If you’re using realtime signals under Linux and have nothing else to do, sigwaitinfo() is the wait call. rtsigio is edge-triggered.

The BSDs and Mac OS X have kqueue(), which is a generalized/unified event notification mechanism. Depending on the exact OS, it can handle read/write status on FDs, AIO completion (see below), file/directory changes, other process status, signal delivery, timers, and network interface status. (Before kqueue(), the non-FD notifications were handled with signals, polling, dedicated APIs, or not at all.) This is similar to the other mechanisms in that you tell it once what events you’re interested in, and then simply wait until something happens. It also supports passing around your own state pointer/object to avoid a lookup when you get an event. The wait function is kevent(); it can be edge-triggered or level-triggered.

Asynchronous I/O is not popular on Unix; there are very few suitable implementations for high-performance networking. AIO is covered by the POSIX 1003.1 standard (it arrived with the realtime extensions, 1003.1b-1993), and it normally provides completion notification via signals. Some systems can use kqueue() to deliver notification instead. It also has a wait call of its own, aio_suspend(); FreeBSD adds aio_waitcomplete(). POSIX AIO does not support asynchronous completion of connect() or accept().

Some systems have experimented with asynchronous I/O using mechanisms other than the POSIX standard as well.

Windows

Under Windows NT, everything’s an object (handle). Regrettably, the Win32 layer doesn’t agree, hence the clash with GUI work.

For nonblocking I/O, Winsock supports the classic select() for socket handles, with all the limitations of the Unix version. No support for GUI work.

WSAEventSelect() can be used to associate a socket with an event object to be signaled for various things, including routing and address changes. When the event handle is signaled, you call WSAEnumNetworkEvents() on the socket to find out what caused it. There are several possible wait calls; one is MsgWaitForMultipleObjectsEx(), which does support GUI work. Such wait functions can process many different kinds of objects, but there’s a limit of 63 handles, which makes this approach difficult to scale. WSAEventSelect() is a hybrid of level-triggered and edge-triggered behavior.

WSAAsyncSelect() provides the same notifications as WSAEventSelect(), but posts a window message instead of signaling an event object. The relevant data for the event is provided in the window message, so no other calls are needed. This approach is well integrated with GUI work, and the wait call is the standard window message loop. Note that the name refers to the “select” operation itself; this is not asynchronous I/O. WSAAsyncSelect() is a hybrid of level-triggered and edge-triggered behavior.

Asynchronous I/O under Windows is called Overlapped I/O. A wide variety of tasks can be performed in an asynchronous way, so these notification methods apply to more than just network I/O. All overlapped calls support passing arbitrary user data around. The primary Winsock functions involved for overlapped I/O are WSASend(), WSARecv(), ConnectEx(), and AcceptEx().

An overlapped operation can signal an event object upon completion; GetOverlappedResult() is called afterward to get the associated data from the operation. This method has the same usage style and limitations as WSAEventSelect().

WSASend() and WSARecv() support queuing an asynchronous procedure call to the current thread upon completion. APCs are somewhat similar to Unix signals, but instead of interrupting system calls they require the thread to be in an alertable wait state. The APC mechanism supports arbitrary user calls as well. GUI work is only supported if MsgWaitForMultipleObjectsEx() is used with MWMO_ALERTABLE.

Finally, we have I/O completion ports, which are similar in use to BSD’s kqueue: socket handles are associated with a completion port, along with a user state object if desired. Whenever an overlapped I/O operation finishes on that handle, notification will be delivered to the port. Arbitrary user notifications can also be sent. I/O completion ports are designed for multithreaded scenarios, and are integrated with the scheduler to ensure an optimal number of threads are active at any given time. The wait function is GetQueuedCompletionStatus(); GUI work is not supported.

.NET

That brings us to .NET, which is a bit simpler.

For nonblocking I/O, the only notification method is Socket.Select(). Same limitations as the classic; no support for GUI work.

For asynchronous I/O, a BeginXxx() method returns an IAsyncResult object, which is later used to get the result with an EndXxx() call. A generic version of this pattern can be applied to any delegate, but several libraries in the Framework have more specific versions for greater efficiency and flexibility, including Socket and related classes (Socket.BeginConnect(), NetworkStream.BeginRead(), etc).

IAsyncResult.AsyncWaitHandle provides a waitable object that is signaled to indicate completion. A typical wait function is WaitHandle.WaitAny() on an array of WaitHandles. However, there’s a limit of 64 handles, and each handle belongs to a single asynchronous operation (instead of an entire socket), making this approach difficult to scale. GUI work is not supported.

The BeginXxx() methods can also accept an AsyncCallback, which is then called when the operation is complete, usually on a thread from the threadpool. No wait function is involved. For GUI work, the callback can notify the main thread using AsyncOperation or SynchronizationContext.

Many people seem to have the webserver in mind as the canonical example of a concurrent server. Implementation limits on scalability aside, a thread per connection using blocking calls is suited to this type of application because of the isolated request-response nature of tasks. However, there are a lot of applications that revolve around shared state, and a thread-per-client model tends to break down from the lack of isolation in them. Such applications also tend to not be request-response, which makes blocking calls difficult since you don’t know which one is needed at a given time.

The real blocking vs nonblocking vs asynchronous choice depends entirely on the application model you use. Blocking is for single, sequential tasks. Nonblocking and asynchronous are better suited to concurrent related or non-sequential tasks, or event-driven applications. There are tradeoffs to be made in terms of complexity, scalability, vulnerability to remote attacks, etc.

Guarding against resource starvation tends to be particularly hard with blocking sockets. Typically a timeout on an aggregate task is needed, but that implies an “asynchronous” timer, something that’s hard to do when you’re stuck waiting for a single specific thing to happen. (Unix has a mechanism for this, incidentally. The alarm() function can be used to set a timer that sends a signal, which interrupts blocking calls. Use of this in network apps is not as common as it once was.)
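Here is a small sketch of the alarm() trick in Python, assuming a Unix system and code running on the main thread (Python only allows signal handlers to be installed there). The Deadline exception and on_alarm handler are names made up for this example. Python normally retries a system call that was interrupted by a signal, so the handler raises an exception to actually break out of the blocking recv().

```python
import signal
import socket

class Deadline(Exception):
    """Hypothetical exception used to unwind out of the blocked call."""

def on_alarm(signum, frame):
    # Raising here stops Python from silently retrying the interrupted recv().
    raise Deadline()

a, b = socket.socketpair()  # blocking by default; nothing is ever sent on b

old_handler = signal.signal(signal.SIGALRM, on_alarm)
signal.alarm(1)             # deliver SIGALRM in one second
try:
    a.recv(4096)            # would block forever without the alarm
    timed_out = False
except Deadline:
    timed_out = True
finally:
    signal.alarm(0)         # cancel any pending alarm
    signal.signal(signal.SIGALRM, old_handler)

a.close()
b.close()
```

Because the timer is process-wide and signal-based, it only covers one pending deadline at a time, which is part of why this approach does not stretch far beyond simple sequential programs.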

Aside from batch-style programs that do a specific thing, I don’t often see an obvious use case for blocking sockets; in my experience most nontrivial applications are either event-driven (often due to a UI) or need to do more than one thing at once. More commonly, people seem to think blocking is the only way network I/O is done, and build various hacks around that (like a separate thread per connection).

In regard to the popular notion that asynchronous operations are implemented by some library using blocking calls on threads: not likely. Computers are event-driven by nature, from the hardware all the way up the drivers and network stack to the application. A blocking operation, semantically, needs to do something like:

    waitObject = CreateEvent();
    NetworkStackStartSend(socket, buffer, waitObject, &resultBuffer);
    SleepUntilSignaled(waitObject);
    return resultBuffer;

Most network stacks are in-kernel, where they can talk to the scheduler directly, so real implementations don’t look quite like that.
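The same shape can be mimicked in user space. The following Python sketch is purely a simulation: stack_start_send() is a hypothetical stand-in for the network stack, and a timer thread plays the part of the completion interrupt. It is only meant to show that a blocking call can be layered on top of an asynchronous one, not how any real stack is written.

```python
import threading

def stack_start_send(payload, done, result):
    """Hypothetical asynchronous send: returns immediately, completes later."""
    def complete():
        # Pretend the hardware finished; record the result, then signal.
        result["bytes_sent"] = len(payload)
        done.set()
    # The timer thread stands in for the NIC interrupt arriving later.
    threading.Timer(0.05, complete).start()

def blocking_send(payload):
    """Blocking facade: start the operation, then sleep until it is signaled."""
    done = threading.Event()
    result = {}
    stack_start_send(payload, done, result)
    done.wait()
    return result["bytes_sent"]

sent = blocking_send(b"hello")
```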

The stack itself needs to add headers, checksums etc, drop the new packet in a DMA buffer, and tell the NIC to start sending from there. Then it’s done, and the kernel goes on to other scheduled work. Meanwhile the NIC does DMA stuff and I/O based on the hardware clock, and when it’s done with that buffer it fires an interrupt. The kernel stops whatever it was doing and notifies the network stack, which realizes it’s done with the application buffer and the protocol semantics are satisfied. It sets a success result code, then signals the event object, which tells the scheduler to add that thread to the ready queue. Interrupt is done, kernel goes back to whatever it was doing, and runs the thread at the next scheduled slot. Thread is done sleeping and can return to application code.

Blocking is really just a facade for programmer sanity 🙂 Concurrent work requires exposing at least some asynchrony in public APIs, so modern network stacks are quite efficient at it.

If you’re looking at creating a server for a high number of concurrent connections under Unix, you need to check out The C10K Problem. For Windows NT, see Scalable Winsock Applications.

Update: Chris Mullins and JD Conley from Conversant have posted information on scaling under .NET. See Chris’s architecture for serving 100,000 simultaneous connections with .NET, and JD’s notes on managing socket buffers.

This article is part of the GWB Archives. Original Author: Q-ologues
