A few days ago I posted an article on certain behaviours we had observed and investigated in relation to BizTalk receive pipelines. If you have been reading the feedback to that post, you will be aware that I have since made some further headway in this matter. My previous article simply described the behaviour and offered some speculation. I now know more about the causes. This new article is intended to replace the previous article, which you can still read here.
The problems we encountered resolved themselves into two distinct issues. One is an issue with the XmlDisassembler component, and the other is a ‘feature’ of BizTalk’s handling of inbound maps on receive ports. This article describes these two issues, their characteristics and causes and how to avoid them or implement workarounds. It also provides some additional background information on inbound maps and some general guidelines on designing and implementing custom pipeline components.
Inbound Maps
BizTalk allows maps to be assigned to receive ports. You can assign multiple maps to a single receive port. Maps are assigned at port level, rather than at the level of receive locations. One of the great benefits of inbound maps is that you can use them to transform messages before they hit the message box. This makes it easy to ensure that different messages, submitted to different receive locations, are transformed into canonical formats before being routed to service instances by BizTalk’s subscription mechanism.
BizTalk decides, on a per-message basis, which available inbound map, if any, to use. It does so by inspecting the context of each message, looking for a promoted property called MessageType. This property is central to a number of BizTalk functions, and specifies the type of the message. The value of the property concatenates the target namespace (if any) of the schema that describes the message type, and the name of the document (i.e., ‘root’) element. The document element name is included as a URI fragment specifier.
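To make the shape of the MessageType value concrete, here is a minimal sketch in Python. It is only an illustration of the concatenation rule described above, not BizTalk code: the function name `message_type` is hypothetical, and I am assuming that a document with no namespace yields the bare element name.

```python
import xml.etree.ElementTree as ET


def message_type(xml_text: str) -> str:
    """Build a 'namespace#rootname' string from the document element.

    Assumption: when the document element has no namespace, the value is
    just the element name with no '#' separator.
    """
    root = ET.fromstring(xml_text)
    tag = root.tag  # ElementTree renders qualified names as '{namespace}local'
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return f"{namespace}#{local}"
    return tag


print(message_type('<Order xmlns="http://example.org/orders"><Id>1</Id></Order>'))
```

The element name after the `#` is what the text above calls the URI fragment specifier.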
The MessageType property is chiefly used by the BizTalk subscription mechanism to route messages. It is marked as a promoted property within the message context in order to allow the subscription mechanism to operate on it. The property is generally set by a disassembler component, but could potentially be set by any pipeline component in another pipeline stage.
When one or more inbound maps are assigned to a receive port, BizTalk uses the MessageType property to select a map by comparing the value of the property to the source schema of each map. When a match is found, the map is executed over the content of the body part of the message, and the results are assigned back to the body part. If no match is found, no map is executed. This is not treated as an error, and the un-transformed message is delivered to the message box.
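The selection logic can be pictured as a simple lookup keyed on the message type. The following Python sketch is my own illustration of the behaviour described above, not BizTalk's implementation. The function `apply_inbound_map` and the dictionary of transforms are hypothetical names.

```python
def apply_inbound_map(message_type, body, maps):
    """Pick a map by message type and run it over the body.

    maps: dict of source message type -> callable that transforms the body.
    """
    transform = maps.get(message_type)
    if transform is None:
        # No matching map: not an error, the body is delivered untransformed.
        return body
    # A match: the result of the transform replaces the body content.
    return transform(body)


# Hypothetical map registered for one source message type.
maps = {
    "http://example.org/in#Order": lambda body: body.replace("Order", "CanonicalOrder"),
}

print(apply_inbound_map("http://example.org/in#Order", "<Order/>", maps))
print(apply_inbound_map("http://example.org/in#Invoice", "<Invoice/>", maps))
```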
Stream Processing
Maps are executed after pipeline processing has been completed. They operate on the contents of the body part of the message delivered by the pipeline back to BizTalk. This message may be different to the message initially provided to the pipeline. For example, it is sometimes necessary for pipeline components to replace an existing stream with a new stream (possibly within a new message and message part), although this should generally be avoided wherever possible.
Replacing one stream with another is often a sign of lazy pipeline component programming and tends to be inefficient because it generally involves saving away the contents of a stream in a stateful manner in order to subsequently write the content to another stream. The whole point of employing a streaming approach within pipeline processing is to promote an efficient, stateless model of message content processing. The chief philosophical difference between the EPM (messaging) and XLANG/s (orchestration) host instance sub services is the issue of statefulness. The EPM sub service, which manages BizTalk receive and send handlers, is designed to function in a stateless manner. Orchestrations are stateful. For example, EPM services do not persist state to the message box. Orchestrations, by contrast, persist state to the message box every time the process flow reaches a persistence point (denoted by bold borders in the orchestration designer). Orchestrations support dehydration and recoverability models that are not available in pipelines.
If the contents of a stream need to be amended or changed by a pipeline component, this is generally best done by wrapping the original stream in a new stream object that processes the data on the fly during read operations. In the most efficient pipeline designs, this can mean that all content processing is done just once when BizTalk reads the stream after the pipeline has completed its work. At this stage, BizTalk reads the entire stream, and each stream wrapper can perform its work. This approach is naturally stateless, although some temporary buffering is often required during a read operation.
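As an illustration of the wrapping idea, here is a minimal sketch in Python rather than .NET. The class name `UppercasingStream` and the trivial upper-casing transform are assumptions chosen only to keep the example short. A real pipeline component would derive from System.IO.Stream and do something more useful in its Read method.

```python
import io


class UppercasingStream(io.RawIOBase):
    """Forward-only wrapper that transforms bytes as they are read.

    Nothing is processed until a consumer actually reads from the wrapper.
    """

    def __init__(self, inner):
        super().__init__()
        self._inner = inner

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, buffer):
        # Pull only as much as the caller asked for from the wrapped stream.
        chunk = self._inner.read(len(buffer))
        if not chunk:
            return 0  # end of stream
        data = chunk.upper()  # the on-the-fly processing step (ASCII only here)
        buffer[:len(data)] = data
        return len(data)


wrapped = UppercasingStream(io.BytesIO(b"<a>hello</a>"))
print(wrapped.read())
```

Several such wrappers can be layered, one per component, and all of them do their work in a single pass when the final consumer reads the outermost stream.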
Designing and implementing efficient pipeline components often requires thought and hard work. Too often, it proves easier to adopt a simpler, less efficient approach. However, beware. Cutting corners in development generally does not pay in the longer term. Interestingly, the two issues we have discovered are both most likely to be encountered if you attempt to short-circuit good design.
Seekable Streams
[Update: This issue was fixed in Service Pack 1 for BizTalk 2004] The first of the two problems is that BizTalk 2004 fails to execute inbound maps over seekable streams. It is as simple, and as brutal, as that. If, after a pipeline has completed its work, it returns a message whose body part contains a seekable stream, and if BizTalk matches an inbound map to the MessageType property of the message, the map will always fail with a “The root element is missing” error. If the stream is non-seekable, and positioned at the beginning of the stream, the map will succeed (unless, of course, there is some other problem). If, for example, you create a pipeline component that replaces the body part stream with a new stream using the .NET MemoryStream class, any inbound map will fail (assuming this is the stream passed to BizTalk at the end of the pipeline). The MemoryStream class provides a seekable stream.
BizTalk calls the CanSeek property of the stream object in order to determine if the stream is seekable. This property is inherited from an abstract member of System.IO.Stream, and returns a Boolean value. If the CanSeek property returns true, BizTalk calls the Length property which should return the byte length of the stream. The Length property is not called on non-seekable streams. In either case, BizTalk then repeatedly calls the Read method until the entire contents of the stream have been read. This happens regardless of the value returned by the Length property, even if the length is reported as 0.
If no inbound map is selected, the message and its contents are delivered to the message box without any problem, regardless of the seekability of the stream. If a map is selected, the transform will only succeed if the stream is non-seekable.
In passing, please note that I did wonder if the inbound map issue might be related to stream encoding. I therefore experimented extensively with the Length property, for example by encoding the stream contents in Unicode and returning the number of characters in the stream instead of the number of bytes. This made no difference at all.
Guideline for Custom Pipeline Components
If you are creating custom pipelines using custom pipeline components, take care that the last component in the pipeline returns a body part with a non-seekable stream. More generally, I would recommend that, as a guideline, any component you create should either return the same stream it received, or wrap that stream in a non-seekable stream wrapper class. Only return seekable streams in the rare case it is truly necessary to allow some other pipeline component to randomly access the stream contents. Avoid, therefore, returning ‘raw’ MemoryStream objects. If you are tempted to return a seekable stream, ask yourself if you might better implement your functionality within an orchestration or by using functoids and/or script in a map, rather than within a pipeline component. Remember that seekability always involves a stateful approach because the contents of the stream must be written out to some backing store. This may compromise the efficiency of your pipelines.
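The guideline about not returning a ‘raw’ in-memory stream can be sketched as follows. This is a Python illustration of the idea, with `io.BytesIO` standing in for .NET’s MemoryStream and a hypothetical `ForwardOnlyStream` class standing in for a non-seekable wrapper. In .NET the equivalent wrapper would return false from CanSeek and throw from Seek, Position and Length.

```python
import io


class ForwardOnlyStream(io.RawIOBase):
    """Hide the seekability of an underlying stream from its consumer."""

    def __init__(self, inner):
        super().__init__()
        self._inner = inner

    def readable(self):
        return True

    def seekable(self):
        # The consumer is told that random access is not available.
        return False

    def readinto(self, buffer):
        chunk = self._inner.read(len(buffer))
        if not chunk:
            return 0  # end of stream
        buffer[:len(chunk)] = chunk
        return len(chunk)


in_memory = io.BytesIO(b"<root/>")      # seekable, like a MemoryStream
body_stream = ForwardOnlyStream(in_memory)
print(in_memory.seekable(), body_stream.seekable())
print(body_stream.read())
```

Note that the wrapper adds no buffering of its own. It simply passes reads through, so the only state is whatever the underlying stream already holds.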
Streams are provided by message parts. The IBaseMessagePart interface defines a method called GetOriginalDataStream, together with a property called Data. In the BizTalk documentation, there is a bold statement that says:
“In custom pipeline components, Data clones an inbound data stream, whereas BodyPart.GetOriginalDataStream returns the original inbound stream.”
As far as I can tell, this is only true if you implement IBaseMessagePart on your own custom message part class and write your own code to create a clone! BizTalk provides a message factory object (via the pipeline context) which has a CreateMessagePart() method. Although the message factory is designed for use in custom pipeline components, the message part created by this method does not clone your stream. The references of the streams returned from Data and GetOriginalDataStream() are identical (i.e., they are one and the same stream object).
As a general guideline, you should normally use the message factory to create new message objects, and then return these from your pipeline component. You may also wish to use the unsupported PipelineUtil class to copy property bags (for part properties) and clone message context from an existing message. However, you may wish to avoid using message parts created by the message factory. Instead, consider creating custom message part classes that return non-seekable streams using GetOriginalDataStream() and seekable cloned streams using the Data property.
Christof Claessens has written a first-class article on a rather different stream cloning technique for efficiently processing stream data within a pipeline component. One potential use of his approach is to allow the processing of a seekable cloned stream while still returning a non-seekable stream within the body part. Note that if the cloned stream is writeable, any changes you make to its content will not be handed on to subsequent pipeline components or to BizTalk. The cloned stream is processed on a secondary ‘worker’ thread, which avoids blocking the main thread when it reads the stream. Also note that reading the cloned stream is synchronised to the reading of the original stream. The worker thread cannot read and process stream content until that content has been read on the main thread. This approach allows you to avoid seriously compromising the efficient streaming of data through your pipeline while allowing complex processing of stream content in a multithreaded fashion.
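To give a feel for how a cloned stream can be synchronised to the reading of the original, here is a much simplified sketch in Python. It is not Christof Claessens’s implementation. The `TeeStream` class, the queue and the `collect` worker are hypothetical names of my own. The sketch only shows the general principle: the worker thread receives each chunk after the main consumer has read it.

```python
import io
import queue
import threading


class TeeStream(io.RawIOBase):
    """Forward-only wrapper that copies every chunk it serves onto a queue."""

    def __init__(self, inner, copies):
        super().__init__()
        self._inner = inner
        self._copies = copies
        self._finished = False

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, buffer):
        chunk = self._inner.read(len(buffer))
        if not chunk:
            if not self._finished:
                self._finished = True
                self._copies.put(None)  # end-of-stream marker for the worker
            return 0
        self._copies.put(bytes(chunk))  # the worker gets a copy of the chunk
        buffer[:len(chunk)] = chunk
        return len(chunk)


def collect(copies, result):
    """Worker: can only see data that the main consumer has already read."""
    while True:
        chunk = copies.get()
        if chunk is None:
            return
        result.append(chunk)


copies, seen = queue.Queue(), []
worker = threading.Thread(target=collect, args=(copies, seen))
worker.start()
body = TeeStream(io.BytesIO(b"<a>1</a>"), copies)
main_data = body.read()   # the main consumer drives the whole process
worker.join()
print(main_data == b"".join(seen))
```

Anything the worker does with its copy stays on the worker side. The bytes handed to the main consumer are unchanged.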
XML Disassembler Issues
The second of the two issues concerns the XML disassembler component. This component creates and returns messages with non-seekable, read-only streams for the body part content. It does not provide clones via the Data property of the message part object.
The XML disassembler has several uses. Perhaps its most basic function is to inspect the document element of inbound XML messages to determine their message type, and to promote the MessageType property within the message context. The component may also be used to validate XML and to disassemble XML into multiple messages.
The XML disassembler uses a stream wrapper class called XmlDasmStreamWrapper to wrap the body part stream. This, in turn, utilises an instance of the XmlDasmReader which is a specialised XmlReader object. The Read method of the XmlDasmReader class maintains an internal flag to indicate if the reader has detected a new ‘document’ or not. By ‘document’ we mean an XML node whose content is to be returned as a new message by the GetNext method of the XML disassembler. The new document flag is exposed via the internal IsNewDoc property of XmlDasmReader. This property is read/write.
The XmlDasmReader instance is referenced directly by the XML disassembler component, and therefore, because both classes are in the same assembly, the XML disassembler has access to the IsNewDoc property. The XML disassembler uses this property directly in its GetNext method. The GetNext method of a disassembler returns the next disassembled message, and is called repeatedly by BizTalk until it returns a null value. In this way, a disassembler can convert a single inbound message into multiple messages. The implementation of GetNext (the relevant code is actually in a private method called GetNext2) tests the IsNewDoc property. If the value is true, the code constructs and returns a new message containing the new document content. If the value is false, the code tests a status flag to ensure that all processing is complete and then returns null. This in turn causes BizTalk to stop re-executing the GetNext method.
Unfortunately, the GetNext method fails to use the read/write IsNewDoc property to reset the new document flag to false. Instead, the flag is reset within the Read method of the XmlDasmReader class. The code in this method then goes on to check if there is a new document, and changes the flag back to true if necessary.
This constitutes a very common type of logical error (I have made similar mistakes countless times myself). For GetNext to work correctly, it is entirely reliant on the Read method of XmlDasmReader being called at least once. This, of course, happens when the Read method of the XmlDasmStreamWrapper is called, typically either by a subsequent pipeline component, or by BizTalk at the end of pipeline processing. The implicit assumption is that, after calling GetNext, the stream is always read before BizTalk makes a subsequent call to the GetNext method.
This assumption is not valid. Consider a scenario where you use a disassembler to disassemble an XML message into many messages, and then build a custom validation component. Your validator tests, say, some context property of your message, and based on its value, decides to either discard or keep the disassembled message. If the message is discarded, it may be replaced by a different message. Because the validator makes its decision based on contextual data, it has no reason to read the body part stream. Perhaps it returns null, instead of a message, or perhaps it creates a new stream and assigns this to the body part. Because the original stream provided by the XML disassembler is never read, the new document flag is never reset to false. The next time BizTalk calls the GetNext method, the XML disassembler creates and returns a new message, whether or not there is genuinely a new document. Because the stream has not been read, the new message is given exactly the same body part content as the previous message. This effectively sets up a never-ending loop in which BizTalk hogs the processor cycles by repeatedly calling GetNext and generating a never-ending series of identical messages. Everything else slows to a crawl. Your only hope is to disable the BTSNTSvc service and kill the process. You can then create work-around code in your custom validator component, recompile and deploy, and re-enable and start the BTSNTSvc service.
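The logic error is easier to see in a stripped-down model. The following Python sketch is my own simplification, not the disassembler’s actual code. `SketchReader`, `SketchDisassembler`, `Message` and `count_messages` are all hypothetical names. The `limit` parameter exists only so that the demonstration terminates. In the scenario described above there is no such limit.

```python
class SketchReader:
    """Stand-in for the reader: the new-document flag is reset only in read()."""

    def __init__(self, documents):
        self._pending = list(documents)
        self.current = self._pending.pop(0) if self._pending else None
        self.is_new_doc = self.current is not None

    def read(self):
        body = self.current
        self.is_new_doc = False  # the only place the flag is ever cleared
        self.current = self._pending.pop(0) if self._pending else None
        if self.current is not None:
            self.is_new_doc = True  # another document follows
        return body


class Message:
    """A disassembled message whose body is read lazily from the reader."""

    def __init__(self, reader):
        self._reader = reader

    def read_body(self):
        return self._reader.read()


class SketchDisassembler:
    def __init__(self, reader):
        self._reader = reader

    def get_next(self):
        # The flag is tested here but never reset here: this is the flaw.
        if self._reader.is_new_doc:
            return Message(self._reader)
        return None


def count_messages(documents, read_bodies, limit=10):
    """Call get_next until it returns None, or until the safety limit is hit."""
    disassembler = SketchDisassembler(SketchReader(documents))
    produced = 0
    while produced < limit:
        message = disassembler.get_next()
        if message is None:
            break
        produced += 1
        if read_bodies:
            message.read_body()
    return produced


print(count_messages(["doc1", "doc2"], read_bodies=True))
print(count_messages(["doc1", "doc2"], read_bodies=False))
```

When every body is read, the two documents produce two messages. When the bodies are never read, the loop stops only because of the artificial limit.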
The workaround is simple. Before discarding the body part stream of a message provided by the XML disassembler, always read the stream. Now you can happily discard the stream without going into a never-ending loop. Another possibility is to create a custom disassembler that inherits, or wraps, the XmlDasmComp class. You could then write your own code to reset the new document flag. One problem with this approach is that the instance of XmlDasmReader is held in a private field of the XmlDisassembler. You would have to use reflection to get at this instance in order to set its IsNewDoc property.
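Reading the stream before discarding it amounts to nothing more than a drain loop. Here is a minimal sketch in Python. The function name `drain` and the chunk size are my own choices. In a .NET component, the equivalent is a loop that calls Read on the body part stream into a scratch buffer until it returns 0.

```python
import io


def drain(stream, chunk_size=4096):
    """Read and throw away everything left in a stream.

    Returns the number of bytes consumed.
    """
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return total
        total += len(chunk)


unwanted_body = io.BytesIO(b"<discarded>content</discarded>")
print(drain(unwanted_body))
```

The content is never held in memory beyond one chunk, so draining a large body remains cheap in terms of state.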
Conclusions
It doesn’t pay to take shortcuts in designing and implementing pipeline components. Implementing good pipeline component design can actually be quite difficult, so consider using orchestrations or maps instead, especially where you must process message content in a non-streaming fashion. If you do create custom pipelines, design them to be as efficient and stateless as possible, and follow the guidelines outlined above. In particular, take account of the two issues described in this article. They are generally easy to avoid, once you know they are there, but can cause hours of headaches if you are unaware of them.