Q-ologues

various ramblings

Thursday, June 14, 2007

Intricacies of the NT Filesystem

An interesting discussion began in the comments of Raymond Chen's blog entry the other day.  His post was about canonical order of entries in an ACL, but the comments drifted toward some interesting behavior of NTFS under Windows.  It was off topic there, but I think the discussion touched on some important points, so I'm continuing it here.

One of the things touched on was that file deletions are really a directory operation.  NTFS supports hard links, which means a single physical file can be referenced by multiple directory entries.  Normally you would think of a file deletion as removing the physical file, but in fact it is simply removing a link to it from a specific directory.  The physical file is only deleted when all references to it are gone.
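
The link-counting behavior is easy to see from a script.  Here's a minimal sketch in Python, assuming a filesystem that supports hard links (NTFS or most Unix filesystems); the file names and the link_counts() helper are made up for illustration.  os.link() adds a second directory entry for the same file, and os.unlink() removes one entry without touching the data as long as another entry remains.

```python
import os
import tempfile


def link_counts():
    """Create a file, add a second hard link, then remove the first name."""
    counts = []
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "first.txt")    # hypothetical names
        second = os.path.join(d, "second.txt")
        with open(first, "w") as f:
            f.write("payload")

        os.link(first, second)                  # second directory entry, same file
        counts.append(os.stat(first).st_nlink)  # link count with both names present

        os.unlink(first)                        # removes one directory entry only
        counts.append(os.stat(second).st_nlink) # link count with one name left

        with open(second) as f:
            data = f.read()                     # contents are still reachable
    return counts, data
```

The first count covers both names and the second covers the one that is left; the data only goes away once the last name is removed.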

This makes the security semantics a bit interesting.  Directories have the DELETE_CHILD permission, which grants the caller permission to remove any entry from it, regardless of permissions on the individual files those entries reference.  However, a physical file also has a DELETE permission for itself, which grants you permission to remove any directory entry that references it -- even if you do not have DELETE_CHILD permission on that directory.


Norman Diamond raised the question of how deletions of open files behave in comparison with POSIX.  Under POSIX, if an open file is deleted, the directory entry for it is immediately removed.  If no directory entries are left, the file is effectively anonymous, but still physically exists until all processes have closed it.
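
The POSIX side of this can be sketched in a few lines of Python.  This assumes a POSIX system (on Windows, removing a file that Python still has open normally fails), and read_after_unlink() is just an illustrative helper name.

```python
import os
import tempfile


def read_after_unlink():
    """Unlink a file while it is still open, then keep using the open handle."""
    d = tempfile.mkdtemp()
    path = os.path.join(d, "scratch.txt")   # hypothetical file name
    with open(path, "w+") as f:
        f.write("still here")
        f.flush()

        os.unlink(path)                     # directory entry is removed right away
        gone = not os.path.exists(path)     # the name can no longer be resolved

        f.seek(0)
        data = f.read()                     # the anonymous file is still readable
    os.rmdir(d)                             # directory is already empty
    return gone, data
```

Once the with block closes the handle, nothing references the file any more and the system reclaims it.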

Under NT, deletions are slightly different.  When an open file is deleted, the directory entry used to reference it is marked as "delete pending", but it is not removed until all open handles to the file through it are closed.  While the directory entry is in the "delete pending" state, no one can open the file using that path.  Once all processes have closed the file, if no more directory entries for it exist, the file is physically removed.

Basically, the only major difference in behavior between POSIX and NT is that NT keeps the directory entry around for naming purposes instead of making the file anonymous.


Norman also posted about an interesting issue he encountered when playing with SFU.  In short, he has two directories named "Pinball" and "pinball" that appear to go to the same place.  He assumes this is due to hard links, but NT does not actually have hard links for directories.  He also later implies that Win32 can be made case sensitive, which also isn't accurate.  I can certainly understand why he would reach these conclusions, since information on this subject is rather hard to find and understand.  So how do I explain the behavior he's seeing?

Internally NT treats everything as an object in a large directory tree, and NTFS volumes are accessed through branches on that tree.  The Object Manager is normally case sensitive, but individual API calls can request case-insensitivity.  The Win32 subsystem always requests case-insensitivity with just a few exceptions, such as CreateFile() with FILE_FLAG_POSIX_SEMANTICS.  The POSIX subsystem installed by SFU does not.

The Native API interface to the Object Manager, which both the Win32 and SFU subsystems use, has an option to apply the case-insensitivity flag to all calls through it.  When this option is turned on, no userland process is able to make case-sensitive calls.  As of XP, this option internally defaults to on, which forces even SFU to be case-insensitive.  However, it can be controlled via a registry option, which is what the SFU installer changes when you choose the case sensitivity option.

Presumably at some point shortly after Norman installed SFU, a POSIX application decided to create a directory there using a different case than the one that was already present.  Even under Win32, a directory list will show all entries in it, even if they only differ by case.  However, whenever something attempts to access an entry with the case-insensitive flag in effect, the Object Manager will choose only one of those entries.  It is simply not possible to access the other.  Thus he gets the behavior he observed on the Win32 side -- it consistently accessed a particular one.

The behavior he gets from SFU's Korn shell could also be explained by case-insensitivity, where it is only accessing one directory even though he names the other.  But this is SFU, which should be case sensitive, and obviously case sensitivity was in effect at the time one of these directories was created.  What gives?

I'm betting Norman got bit by a little accident Microsoft had with a few Windows updates, such as the one mentioned in KB929110.  In that case, a .NET Framework update mistakenly changed the registry key mentioned earlier to force case-insensitivity on all subsystems.  Whoops.

key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\kernel
value name: ObCaseInsensitive
value type: DWORD

For case-sensitivity support under SFU, that registry value must be present and set to 0 (restart is required for changes to take effect).  A while back I came across this key while trying to figure out why FILE_FLAG_POSIX_SEMANTICS was not working under Server 2003.  Upon discovering the problem I promptly added deny and audit ACEs so I get notified if anything tries to change it.  This stuff needs to be easier to find.

Posted On Thursday, June 14, 2007 3:16 AM | Feedback (0) |

Friday, June 30, 2006

Socket I/O: Unix, Windows, and .NET

This is adapted from a recent email conversation I had about network I/O models for a GUI client running on .NET.  I ended up on a tangent, covering APIs available for 3 major platforms.  Since all this info was gathered in one place, I figured I might as well post it :)

There's also some discussion about when blocking and non-blocking models are appropriate, but I don't actually draw any conclusions.  Hopefully there's enough background here for someone to make the call on their own.


Let's start by clarifying some terms.  Blocking vs non-blocking refers to I/O calls on the sockets; a nonblocking socket will return when it has done as much as it was able to do immediately.  For a recv(), that means it either returns data already in the network stack's buffer, or signals that there's nothing to get.  For a send(), that means it shoves data into the stack's buffer until it's done, or signals that the buffer filled after N bytes.  For a blocking socket, recv() won't return until something is received; send() won't return until all has been sent.
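
To make the difference concrete, here's a small Python sketch, assuming a Unix system where socket.socketpair() gives a connected pair of local sockets.  nonblocking_demo() is an illustrative name.  In Python, the "nothing to get" signal from a nonblocking recv() shows up as a BlockingIOError.

```python
import socket


def nonblocking_demo():
    """Show what a nonblocking recv() does with and without buffered data."""
    a, b = socket.socketpair()
    try:
        a.setblocking(False)

        # Nothing has been sent yet, so recv() cannot do anything right now.
        try:
            a.recv(1024)
            empty_result = "data"
        except BlockingIOError:
            empty_result = "would block"

        # Once the peer has written, the data sits in the receive buffer
        # and the same call returns it immediately.
        b.sendall(b"hello")
        ready = a.recv(1024)
        return empty_result, ready
    finally:
        a.close()
        b.close()
```

With a.setblocking(True) the first recv() would simply sit there until the peer sent something.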

Asynchronous I/O refers to having the entire operation in flight, and getting notification of its completion later.  Unlike nonblocking I/O, which simply avoids waiting if it can't finish the request now, an asynchronous call will complete at some point without any further API calls.

When you use nonblocking or asynchronous I/O, you eventually need to get notification of when you can do something more, which brings up event models.

 

Unix

For Unix, the important point to note is that "everything is a file" as far as I/O goes.  Network, file, console, pipe, device -- it's all accessed through a file descriptor, so event notifications are all oriented around FDs.  UI stuff follows the same path.

For nonblocking I/O, select() is the classic: it accepts 3 arrays of boolean FD flags (one each for read, write, and error) and a wait timeout.  Upon return, the flags in each array are changed to reflect the events that have occurred on the associated FDs.  In the nonblocking socket case, a "read" state means there's incoming data to recv(), while a "write" state means the internal output buffer has room again, so it's possible to call send() again.  select() has three major limitations: it can only indicate 3 types of events, it has an often-hardcoded upper limit on the number of FDs you can check, and since it's a scan-array-every-call implementation, it doesn't scale for high load servers.
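
A rough sketch of the select() pattern, using Python's select module (which takes lists rather than raw FD flag arrays) and a local socket pair in place of real network connections.  select_demo() is an illustrative name.

```python
import select
import socket


def select_demo():
    """Check read/write readiness before and after the peer sends data."""
    a, b = socket.socketpair()
    try:
        # Timeout 0 means "just poll".  Nothing has arrived yet, so `a` is
        # not readable, but its output buffer has room, so it is writable.
        readable, writable, _ = select.select([a], [a], [a], 0)
        before = (a in readable, a in writable)

        b.sendall(b"ping")

        # Now there is incoming data, so `a` shows up in the read set.
        readable, _, _ = select.select([a], [], [], 1.0)
        after = a in readable
        return before, after
    finally:
        a.close()
        b.close()
```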

poll() is the successor to select().  It accepts a single array of structs (FD and associated flags).  Upon return, various flags are set indicating the state of the associated FD.  poll() can provide notification of more types of events, and doesn't have select()'s array size limit, but still has the scalability problem.

Those two are available pretty much everywhere.  The following nonblocking I/O event types are supported by some systems but not others, and all of them have the goal of getting rid of the scalability limitation.  Some of them also differ in how they indicate the events.  select() and poll() provide what's known as "level-triggered" notifications: they return if it's possible to read or write now.  If you poll() and it returns a socket in the read-ready state, and then you don't recv() from that socket but call poll() again, it will again immediately return to tell you that socket is read-ready.  The other type of event indication is "edge-triggered": you get a single notification when the state changes (e.g. from can't-read to read-ready), and then nothing more until you do something to cause another state change.
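
Level-triggered behavior is easy to observe with poll().  This is a sketch using Python's select.poll wrapper, which assumes a Unix system (it isn't available on Windows); poll_twice() is an illustrative name.

```python
import select
import socket


def poll_twice():
    """poll() keeps reporting a readable socket until the data is consumed."""
    a, b = socket.socketpair()
    try:
        poller = select.poll()
        poller.register(a.fileno(), select.POLLIN)

        b.sendall(b"x")
        first = poller.poll(1000)   # timeout in milliseconds; reports `a`
        second = poller.poll(0)     # nothing was read, so it is reported again

        a.recv(1)                   # drain the buffer
        third = poller.poll(0)      # nothing left to report
        return bool(first), bool(second), bool(third)
    finally:
        a.close()
        b.close()
```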

Solaris sports /dev/poll.  It works like poll(), but without the arrays: you write FDs to it, and then read FD events back out.  The wait call is an ioctl() on the /dev/poll FD; it is level-triggered.

Linux has epoll.  It's the same concept as /dev/poll, just with a dedicated API (epoll_create(), epoll_ctl(), epoll_wait()) instead of a device.  The wait call is epoll_wait(); it can be either edge-triggered or level-triggered.
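
Since epoll supports both styles, it makes a handy way to compare them side by side.  This sketch is Linux-only and uses Python's select.epoll wrapper; edge_vs_level() is an illustrative name.  Passing select.EPOLLET asks for edge-triggered notification, and passing 0 leaves the default level-triggered mode.

```python
import select
import socket


def edge_vs_level(extra_flags):
    """Wait twice on one unread byte; return how many events each wait saw."""
    a, b = socket.socketpair()
    ep = select.epoll()
    try:
        ep.register(a.fileno(), select.EPOLLIN | extra_flags)
        b.sendall(b"x")

        first = len(ep.poll(1.0))   # timeout in seconds; the arrival is reported
        second = len(ep.poll(0))    # the byte is still unread at this point
        return first, second
    finally:
        ep.close()
        a.close()
        b.close()
```

Calling edge_vs_level(0) reports the socket both times, while edge_vs_level(select.EPOLLET) reports it once and then stays quiet until more data arrives.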

Linux also has realtime signals.  In this model, an FD is associated with a specific signal using fcntl(), which is raised when the status changes.  Unix signals are normally delivered to the process during certain common system calls (they interrupt the syscall, call the program's handler for the signal, and (maybe) return from the syscall afterward).  When using realtime signals under Linux and you have nothing else to do, sigwaitinfo() is the wait call.  rtsigio is edge-triggered.

The BSDs and Mac OS X have kqueue(), which is a generalized/unified event notification mechanism.  Depending on the exact OS, it can handle read/write status on FDs, AIO completion (see below), file/directory changes, other process status, signal delivery, timers, and network interface status.  (Before kqueue(), the non-FD notifications were handled with signals, polling, dedicated APIs, or not at all.)  This is similar to the other mechanisms in that you tell it once what events you're interested in, and then simply wait until something happens.  It also supports passing around your own state pointer/object to avoid a lookup when you get an event.  The wait function is kevent(); it can be edge-triggered or level-triggered.


Asynchronous I/O is not popular on Unix; there are very few suitable implementations for high-performance networking.  AIO is covered by the POSIX realtime extensions (IEEE 1003.1b, 1993), and it normally provides completion notification via signals.  Some systems can use kqueue() to deliver notification instead.  It also has an optional wait call of its own, aio_suspend(); FreeBSD adds aio_waitcomplete().  POSIX AIO does not support asynchronous completion of connect() or accept().

Some systems have experimented with asynchronous I/O using mechanisms other than the POSIX standard as well.

 

Windows

Under Windows NT, everything's an object (handle).  Regrettably, the Win32 layer doesn't agree, hence the clash with GUI work.

For nonblocking I/O, Winsock supports the classic select() for socket handles, with all the limitations of the Unix version.  No support for GUI work.

WSAEventSelect() can be used to associate a socket with an event object to be signaled for various things, including routing and address changes.  When the event handle is signaled, you call WSAEnumNetworkEvents() on a socket to find out what caused it.  There are several possible wait calls; one is MsgWaitForMultipleObjectsEx(), which does support GUI work.  Such wait functions can process many different kinds of objects, but there's a limit of 64 handles per call (63 for MsgWaitForMultipleObjectsEx()), which makes this approach difficult to scale.  WSAEventSelect() is a hybrid of level-triggered and edge-triggered behavior.

WSAAsyncSelect() provides the same notifications as WSAEventSelect(), but posts a window message instead of signaling an event object.  The relevant data for the event is provided in the window message, so no other calls are needed.  This approach is well integrated with GUI work, and the wait call is the standard window message loop.  Note that the name refers to the "select" operation itself; this is not asynchronous I/O.  WSAAsyncSelect() is a hybrid of level-triggered and edge-triggered behavior.


Asynchronous I/O under Windows is called Overlapped I/O.  A wide variety of tasks can be performed in an asynchronous way, so these notification methods apply to more than just network I/O.  All overlapped calls support passing arbitrary user data around.  The primary Winsock functions involved for overlapped I/O are WSASend(), WSARecv(), ConnectEx(), and AcceptEx().

An overlapped operation can signal an event object upon completion; GetOverlappedResult() is called afterward to get the associated data from the operation.  This method has the same usage style and limitations as WSAEventSelect().

WSASend() and WSARecv() support queuing an asynchronous procedure call to the current thread upon completion.  APCs are somewhat similar to Unix signals, but require the thread to be in an alertable wait state instead of interrupting system calls.  The APC mechanism supports arbitrary user calls as well.  GUI work is only supported if MsgWaitForMultipleObjectsEx() is used with MWMO_ALERTABLE.

Finally, we have I/O completion ports, which are similar in use to BSD's kqueue: socket handles are associated with a completion port, along with a user state object if desired.  Whenever an overlapped I/O operation finishes on that handle, notification will be delivered to the port.  Arbitrary user notifications can also be sent.  I/O completion ports are designed for multithreaded scenarios, and are integrated with the scheduler to ensure an optimal number of threads are active at any given time.  The wait function is GetQueuedCompletionStatus(); GUI work is not supported.

 

.NET

That brings us to .NET, which is a bit simpler.

For nonblocking I/O, the only notification method is Socket.Select().  Same limitations as the classic; no support for GUI work.

For asynchronous I/O, a BeginXxx() method returns an IAsyncResult object, which is later used to get the result with an EndXxx() call.  A generic version of this pattern can be applied to any delegate, but several libraries in the Framework have more specific versions for greater efficiency and flexibility, including Socket and related classes (Socket.BeginConnect(), NetworkStream.BeginRead(), etc).

IAsyncResult.AsyncWaitHandle provides a waitable object that is signaled to indicate completion.  A typical wait function is WaitHandle.WaitAny() on an array of WaitHandles.  However, there's a limit of 64 handles, and each handle belongs to a single asynchronous operation (instead of an entire socket), making this approach difficult to scale.  GUI work is not supported.

The BeginXxx() methods can also accept an AsyncCallback, which is then called when the operation is complete, usually on a thread from the threadpool.  No wait function is involved.  For GUI work, the callback can notify the main thread using AsyncOperation or SynchronizationContext.

 

Many people seem to have the webserver in mind as the canonical example of a concurrent server.  Implementation limits on scalability aside, a thread per connection using blocking calls is suited to this type of application because of the isolated request-response nature of tasks.  However, there are a lot of applications that revolve around shared state, and a thread-per-client model tends to break down from the lack of isolation in them.  Such applications also tend to not be request-response, which makes blocking calls difficult since you don't know which one is needed at a given time.

The real blocking vs nonblocking vs asynchronous choice depends entirely on the application model you use.  Blocking is for single, sequential tasks.  Nonblocking and asynchronous are better suited to concurrent related or non-sequential tasks, or event-driven applications.  There are tradeoffs to be made in terms of complexity, scalability, vulnerability to remote attacks, etc.

Guarding against resource starvation tends to be particularly hard with blocking sockets.  Typically a timeout on an aggregate task is needed, but that implies an "asynchronous" timer -- something that's hard to do when you're stuck waiting for a single specific thing to happen.  (Unix has a mechanism for this, incidentally.  The alarm() function can be used to set a timer that sends a signal, which interrupts blocking calls.  Use of this in network apps is not as common as it once was.)
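
For the curious, the alarm() trick looks roughly like this in Python.  It's a sketch that assumes a Unix system and that the code runs on the main thread (Python only allows signal handlers to be installed there); Timeout and recv_with_alarm() are names made up for the example.

```python
import signal
import socket


class Timeout(Exception):
    """Raised by the SIGALRM handler to break out of a blocking call."""


def _on_alarm(signum, frame):
    raise Timeout()


def recv_with_alarm(sock, seconds):
    """Blocking recv() that gives up when the alarm timer fires."""
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)                 # whole seconds only
    try:
        return sock.recv(1024)            # blocks until data or the signal
    except Timeout:
        return None
    finally:
        signal.alarm(0)                   # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

The catch is that there is only one alarm timer per process, which is part of why this doesn't fit well once an application has more than one thing going on.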

Aside from batch-style programs that do a specific thing, I don't often see an obvious use case for blocking sockets; in my experience most nontrivial applications are either event-driven (often due to a UI) or need to do more than one thing at once.  More commonly, people seem to think blocking is the only way network I/O is done, and build various hacks around that (like a separate thread per connection).

In regard to the popular notion that asynchronous operations are implemented by some library using blocking calls on threads: not likely.  Computers are event-driven by nature, from the hardware all the way up the drivers and network stack to the application.  A blocking operation, semantically, needs to do something like:

    waitObject = CreateEvent();
    NetworkStackStartSend(socket, buffer, waitObject, &resultBuffer);
    SleepUntilSignaled(waitObject);
    return resultBuffer;

Most network stacks are in-kernel, where they can talk to the scheduler directly, so real implementations don't look quite like that.

The stack itself needs to add headers, checksums etc, drop the new packet in a DMA buffer, and tell the NIC to start sending from there.  Then it's done, and the kernel goes on to other scheduled work.  Meanwhile the NIC does DMA stuff and I/O based on the hardware clock, and when it's done with that buffer it fires an interrupt.  The kernel stops whatever it was doing and notifies the network stack, which realizes it's done with the application buffer and the protocol semantics are satisfied.  It sets a success result code, then signals the event object, which tells the scheduler to add that thread to the ready queue.  Interrupt is done, kernel goes back to whatever it was doing, and runs the thread at the next scheduled slot.  Thread is done sleeping and can return to application code.

Blocking is really just a facade for programmer sanity :)  Concurrent work requires exposing at least some asynchrony in public APIs, so modern network stacks are quite efficient at it.
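
The facade idea can be sketched in Python.  Everything here is hypothetical: FakeStack stands in for a network stack, and its timer thread only plays the part of the hardware completion interrupt, since a script has no NIC to talk to.  The point is just the shape: start the operation, sleep on an event, and wake up when the completion fires.

```python
import threading


class FakeStack:
    """Hypothetical stand-in for a network stack with an asynchronous send."""

    def start_send(self, buffer, on_complete):
        # Pretend the hardware finishes a little later and "interrupts" us.
        timer = threading.Timer(0.05, on_complete, args=(len(buffer),))
        timer.start()


def blocking_send(stack, buffer):
    """Blocking wrapper: start the async send, then sleep until it completes."""
    done = threading.Event()
    result = {}

    def on_complete(nbytes):
        result["sent"] = nbytes   # record the result code
        done.set()                # wake the sleeping caller

    stack.start_send(buffer, on_complete)
    done.wait()                   # the "blocking" part is just this sleep
    return result["sent"]
```

Calling blocking_send(FakeStack(), b"hello") returns 5 after a short pause; drop the done.wait() and hand the callback to the caller, and you have the asynchronous version.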

 

If you're looking at creating a server for a high number of concurrent connections under Unix, you need to check out The C10K Problem.  For Windows NT, see Scalable Winsock Applications.

Update: Chris Mullins and JD Conley from Conversant have posted information on scaling under .NET.  See Chris's architecture for serving 100,000 simultaneous connections with .NET, and JD's notes on managing socket buffers.

Posted On Friday, June 30, 2006 5:29 PM | Feedback (4) |

Thursday, December 29, 2005

MSDN Express is Irritating

MSDN Express is the documentation package you get when you install one of the Visual Studio Express products.  It has a few behaviors that just plain irritate me.

I like having a single documentation package to reference, whether I'm working in an IDE or just playing with the command line tools.  So irritation #1: the only way to open MSDN Express, by default, is through the IDE.  This IDE launches two processes: one via a service, the other under your user account, which has the actual Document Explorer window.  If you close the IDE, that Document Explorer window goes away too.  Arrgh.  Luckily this one is an easy fix: just create a shortcut to (en-us) C:\Program Files\Common Files\Microsoft Shared\Help 8\dexplore.exe.  But I have no idea what will happen when more things start to use Help 8.  Will it show me all of them?

Because I like having a single documentation package to reference, I want it to show everything.  Everything.  Irritation #2: you can't turn off the filters.  The “.NET Framework” filter seems to show the most at once, but it won't show (for example) the C Runtime Library Reference.  That requires switching to the “Visual C++ Express” filter.  Which takes time.  And rearranges things.  Arrgh.

Not all the documentation for a product is actually available.  I once wanted to know more about these C++ attribute things.  Visual C++ Express supports them, so they must be in the documentation package, right?  Wrong.  Irritation #3: you have to go online to learn about attributes.  Arrgh.  I can almost excuse this with the reasoning that the focus is entirely on .NET these days, and the CLI actually has real cross-language attributes, so Microsoft may not want people to find these proprietary C++ ones by accident.  And I suppose I should be glad I get offline docs for the C++ compiler at all.  But damnit, C++ attributes are useful.

And finally, irritation #4: it hides documentation!  MSDN Express includes the .NET Framework SDK documentation in the package.  It's installed a little differently from what the separate Framework SDK normally does for some reason, but it is there.  I decided I wanted to look up serialization, and I remembered from way back when that the Framework SDK has these advanced topics that discuss such things, so I went looking for it.  Couldn't find it.  Finally tried “serialization” in the index, which brought up exactly the page I was looking for, even though the “sync TOC” button is greyed out.  It turns out this topic and others are linked from the main “Advanced Development Technologies” page.  These incredibly useful documents are actually installed, if you can figure out how to find them.  But are they in the Table of Contents?  Nope. Arrrghh!

Update:  After realizing the class library reference was missing all of the System.Runtime namespaces, I finally just installed the Framework SDK on top of Express.  Much better.  Fixed #4 and more tools.  Highly recommended.

Posted On Thursday, December 29, 2005 3:56 PM | Feedback (0) |

Thursday, December 22, 2005

NTFS Reparse Points and Hard Links

Eric Newton asks for symbolic and hard links in Windows Vista.  Both of these are partially implemented in Windows 2000 and later, though it's not obvious how to use them.

NTFS has supported file hard links for some time via the CreateHardLink() API, and Windows 2000 implements directory symbolic links, known as junction points, using reparse points, a filesystem behavior extension mechanism.  There's an article on CodeProject describing exactly how junction points work.  Symbolic links for files and hard links for directories are not supported.

Sysinternals has a junction utility to manage directory links, and Windows XP includes fsutil for creating file links.

Be careful when using junction points with the shell: Explorer's delete operations nuke everything inside the junction, not the junction point itself!  I came across an icon overlay shell extension that's helpful in avoiding that mistake.

Posted On Thursday, December 22, 2005 4:04 PM | Feedback (2) |
