I've spent some time over the last two days testing the resilience of a BizTalk production environment. The environment consists of a two-box BizTalk 2004 server group and a two-box (active-passive) SQL cluster. Testing primarily consisted of rebooting machines, moving SQL cluster groups and, best of all, pulling power cables out of the wall while creating large numbers of files in a drop folder. We tried several failure scenarios for each of the four machines, and checked carefully for 'lost' messages and any other problems.
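For anyone wanting to repeat the 'lost message' check, here is a minimal sketch of one way to do it. It is not the tooling we used. It assumes each test message carries a unique ID that survives as the stem of the output file name, and the names `find_missing`, `sent_ids` and `out_dir` are made up for illustration.

```python
from pathlib import Path


def find_missing(sent_ids, out_dir, pattern="*.xml"):
    """Return the sorted IDs that were sent but have no matching file in out_dir.

    Assumption: the output file name (without extension) equals the message ID.
    """
    # Collect the stems of every file that arrived in the output folder.
    received = {p.stem for p in Path(out_dir).glob(pattern)}
    # Anything sent but not received is a candidate 'lost' message.
    return sorted(set(sent_ids) - received)
```

An empty result means every message that was sent has arrived. A non-empty result is only a list of candidates to investigate, since a message may simply be suspended or still in flight.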
I'm glad to report that the testing was successful. At one point, we thought we had lost a single message, but the problem proved spurious. When we killed the process running a little file generation utility we had created, a malformed XML file was dropped into the 'in' folder and then suspended, quite correctly, by BizTalk. In total, we passed something like 100,000 messages through BizTalk while testing, and every one got through.
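The half-written file is a side effect of writing directly into the folder being watched. A common way around it is to write each file somewhere else first and then rename it into place. The sketch below shows that idea. It is not the utility we used, `drop_message` and the `staging` folder are hypothetical names, and it assumes the staging folder sits on the same volume as the 'in' folder so that the rename is a single step.

```python
import os
import tempfile
from pathlib import Path


def drop_message(in_dir, name, xml_text, staging_dir=None):
    """Write xml_text to a staging file, then rename it into in_dir in one step.

    Assumptions: in_dir already exists, and the staging folder is on the same
    volume, so the watcher never sees a partially written file.
    """
    in_dir = Path(in_dir)
    staging = Path(staging_dir) if staging_dir else in_dir.parent / "staging"
    staging.mkdir(parents=True, exist_ok=True)

    # Create the temporary file outside the watched folder.
    fd, tmp_path = tempfile.mkstemp(dir=staging, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(xml_text)
            f.flush()
            os.fsync(f.fileno())  # push the content to disk before the rename
        final_path = in_dir / name
        os.replace(tmp_path, final_path)  # the file appears in in_dir complete
    except BaseException:
        # Clean up the staging file if the write or rename did not finish.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

If the generator is killed mid-write, the damage stays in the staging folder and nothing malformed reaches the 'in' folder.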
The only problem we ran into concerned tracking records. Every time we pulled the plug on a server, we were left with a handful of spurious records that showed up in the HAT Operations/Messages view. These records were marked as 'Delivered, not consumed', but in fact the messages were consumed correctly by the service instance. The records were for messages that were being processed at the point we switched off the power. BizTalk did not lose these messages, and correctly routed them to their destination.
The documentation is a little opaque here, but I think the issue is related to TDDS (Tracking Data Decode Service, also known as the BAM Event Bus Service). TDDS is a Windows service that is responsible for transferring and decoding event data from the MessageBox to the tracking database. Microsoft states that tracking data can be "lost to the backlog of the BAM Event Bus service", and that as a result, you cannot "rely on HAT to reveal everything". These comments are associated with recovery scenarios.
BizTalk generally took up to a minute or so to recover from a failure (we didn't actually time this), although in the very last test, a couple of messages seemed to get 'stuck' for about 3-4 minutes before being routed.