Dylan Smith

ALM / Architecture / TFS


Sunday, May 19, 2013 #

The past few weeks I’ve been helping a client come up with an Enterprise Architecture, and I realized that I seem to have zeroed in on an EA that I would probably use at most places.

First off, what do I mean by Enterprise Architecture?  Lots of people use the term to mean different things; for this post I’m using it to describe how the various applications and systems in an Enterprise interconnect and integrate with each other (where necessary). Effective Enterprise Architecture should enable powerful integration scenarios and application re-use, while encouraging loose coupling to minimize the cost of change and its impact on other systems.

The EA described below relies on Web Services and a SOA approach, while also leveraging the PubSub (Publish/Subscribe) pattern common in Message-Based Architecture.

While this Enterprise Architecture describes the external interfaces each system exposes, and how data flows between them, it does not describe the inner workings of each specific System (that’s the Application Architecture). Having a well-thought-out Enterprise Architecture enables flexibility in choosing Application Architectures. Within each system you can choose to use a different Application Architecture, or even change a System’s Application Architecture in the future with minimal impact on other systems.

 

Integration at the Database

Most teams I encounter do integration exclusively by reading/writing directly from each System’s database.  Although the majority of software teams out there probably do integration by database, in my experience the majority of those teams also deeply regret this decision.  Integrating at the DB level tightly couples Applications to the database schema design, making it risky to ever change that design.  It also rules out reuse of application logic, limiting you to the reuse of data only.

Service Oriented Architecture

Enterprise Architecture consists of breaking down the software ecosystem into independent Systems, and defining a well-known way for those Systems to integrate and/or exchange data. A common approach to manage this integration is to take a SOA approach, and wrap each system in a (Web) Service with a well-defined Service Contract. This moves the inter-system dependencies to the Service Layer rather than the Database layer. At first glance this would seem to simply move the coupling from the DB to the Service Contract, making it risky to ever change the Service Contract. However, it does enable re-use of application logic in addition to data. But more importantly there are well-known techniques to evolve Service Contracts while maintaining compatibility.

The most common approach to Service evolution is to expose separate End-Points for each version of the Service Contract. This way if you wish to modify the Service Contract, you publish a new End-Point with the new Service Contract while leaving the End-Point with the previous Service Contract active. Then implement a compatibility layer that translates service calls from the old Service Contract to the new Service Contract. This way the core System logic needs only support the most recent version of the Service Contract. It also provides a convenient place to introduce logging to understand which Systems are still depending on the old Service Contracts and plan the upgrade work required to move them to the new Service Contract, enabling the eventual retirement of older Service Contracts/End-Points.
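Here’s a rough sketch of that compatibility layer in Python. The contract shapes (a single `name` field in V1 versus `first_name`/`last_name` in V2), the function names, and the caller identifier are all hypothetical and invented for illustration; a real Service would sit behind actual End-Points rather than plain functions.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("customer_service")

# Hypothetical V2 contract: the current one, and the only one the core logic knows.
@dataclass
class CreateCustomerV2:
    first_name: str
    last_name: str

# Hypothetical V1 contract: the older shape still used by some callers.
@dataclass
class CreateCustomerV1:
    name: str

v1_callers = set()  # which Systems still depend on the old contract

def create_customer_v2(request: CreateCustomerV2) -> dict:
    """Core System logic; supports only the most recent contract."""
    return {"first_name": request.first_name, "last_name": request.last_name}

def create_customer_v1(request: CreateCustomerV1, caller: str) -> dict:
    """Old End-Point: record the caller, translate to V2, then delegate."""
    v1_callers.add(caller)
    logger.info("V1 contract still used by %s", caller)
    first, _, last = request.name.partition(" ")
    return create_customer_v2(CreateCustomerV2(first_name=first, last_name=last))
```

The V1 function contains no business logic of its own; it only translates and logs, which is what makes it cheap to retire once `v1_callers` stays empty.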

PubSub Pattern

Introducing an SOA approach would be a marked improvement over integration at the DB level; however, there are still challenges with a Service-only approach to integration:

  • Availability - Service-level integration introduces availability concerns. Imagine a scenario where System A depends on Systems B, C, and D via their Service Contracts. System A either needs to retrieve data from B/C/D to perform some work, or some operation in System A needs to request B/C/D to perform an operation as part of the System A operation (often both). If any of the B/C/D systems becomes unavailable, any System A operations that interact with it will also fail. System A’s availability is now tied to B, C, and D’s availability.

  • Coupling - Let’s imagine that we’re System A developers, and when some important operation in System A occurs, Systems B, C and D need to be notified so they can perform some related action. There are really two options. Either B, C and D can poll System A’s Service, constantly querying to see when the relevant data in A has changed; this can have a significant performance impact on System A. Or System A can explicitly call some Service method in B/C/D when the relevant operation happens. This not only incurs the availability concerns noted above, but what happens when another team develops System E that also wishes to be notified? Does the System E development team now need to ask System A’s development team to make changes to System A in order for System E to work? This is not a good situation to be in.

My approach is that in addition to wrapping every System with a Web Service, we also borrow from Message-Based Architecture and use a PubSub (Publish/Subscribe) pattern. In this pattern each System would publish “Domain Events” when things of interest occur within the System (ideally a System would publish an event every time any data owned by that System changes). Using one of the readily available messaging frameworks, it is easy for any System to subscribe to Events from any other System. Whenever anything of note happens in System A it simply publishes an event with the related data, and any other interested systems can subscribe to that Event and react accordingly. This way System A (from the above example) has no knowledge of, or dependency on, the other Systems that may subscribe to its events (System E developers can create their System without having to ask System A developers to make any changes). If System A goes down it will not affect Systems B/C/D, and if any of B/C/D goes down it will not affect System A.
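To make the shape of this concrete, here’s a minimal in-memory sketch in Python. The `EventBus` class and the `OrderCreated` event name are made up for illustration; it runs synchronously in one process, so it shows only the decoupling between publisher and subscribers, not the durability or availability you would get from a real messaging framework.

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """In-memory stand-in for a messaging framework (no durability)."""
    def __init__(self) -> None:
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, data: dict) -> None:
        # The publisher never sees who (if anyone) is listening.
        for handler in list(self._subscribers[event_type]):
            handler(data)

bus = EventBus()
received_by_b = []
received_by_e = []

# System B subscribes; System E is added later without touching System A.
bus.subscribe("OrderCreated", received_by_b.append)
bus.subscribe("OrderCreated", received_by_e.append)

# System A publishes a Domain Event when something of interest happens.
bus.publish("OrderCreated", {"order_id": 1, "total": 250})
```

Notice that adding System E was a single `subscribe` call on E’s side; nothing in the publishing code changed.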

The other scenario is if System A requires some data from Systems B/C/D to perform an operation in System A. Rather than calling Service methods on B/C/D when that data is required, instead System A can subscribe to the relevant events in System B/C/D when that data changes and System A can maintain its own data cache of data “owned” by B/C/D updating it when the relevant B/C/D events are received. This way the System A operation can complete, even if B/C/D are all unavailable at the time. The important principle to keep in mind with this approach is that a given piece of data can only be “owned” by a single system. System A is free to cache data from B/C/D, but if System A wishes to change any data owned by B/C/D it must “ask” those systems to change it via the B/C/D Web Service. We also need to ensure that the messaging infrastructure we put in place has guaranteed delivery; meaning if System A happens to be down, when it comes back online it will still receive any Events that occurred while it was offline (modern messaging frameworks mostly handle this for us).
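As a sketch of that local cache (in Python, with a hypothetical `CustomerCache` class and a hypothetical credit-limit event payload), System A might keep something like the following and feed it from its event subscriptions:

```python
class CustomerCache:
    """System A's local copy of customer data that is owned by System B."""
    def __init__(self) -> None:
        self._credit_limits = {}

    def on_credit_limit_changed(self, event: dict) -> None:
        # Called for each (hypothetical) credit-limit-changed event from System B.
        self._credit_limits[event["customer_id"]] = event["credit_limit"]

    def credit_limit(self, customer_id: int):
        # Served locally, so this works even while System B is down.
        return self._credit_limits.get(customer_id)

cache = CustomerCache()
cache.on_credit_limit_changed({"customer_id": 7, "credit_limit": 1000})
cache.on_credit_limit_changed({"customer_id": 7, "credit_limit": 1500})
```

The cache only ever reads what System B publishes. There is deliberately no method for changing a credit limit here, because that data is owned by System B and changes have to go through its Web Service.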

Each System under this Enterprise Architecture should look like the following:

[Diagram: Custom System]

In this case the “Core Domain” is the actual implementation of that system (we’re assuming in the above diagram that there is some Domain DB contained within it, but that’s not necessary). The actual Application Architecture contained within the Core Domain is irrelevant to the Enterprise Architecture. The “Core Domain” in this case may even be a Commercial Software package such as AX or SAP.

In the case of a Commercial Software system, it often won’t support the Enterprise Architecture proposed here. In that case we need to wrap it with the appropriate integration layer to support the Enterprise Architecture. Consider the below example of a Dynamics AX System:

[Diagram: AX System]

In the case of AX I believe it already exposes a Web Service, however, in the case that it didn’t and AX Integration was performed some other way (e.g. DB integration, copying files in a specific format to a specific directory, etc) we would create a custom Web Service that exposed those integration methods over a Service Contract (we don’t want any non-AX Systems talking directly to the AX Database except AX itself, and any Integration Layer we would write). Likewise, AX doesn’t publish Domain Events (and even if it did it wouldn’t do so using the Messaging framework we chose), again we can write some plumbing code to add support for this. In the above diagram we are writing a custom component “AX Event Generator” that would poll the AX Database looking for interesting changes in data and would raise the appropriate Domain Events that other Systems could subscribe to (some COTS Systems may have some notification system or way to “hook” system events eliminating the need to poll the DB). If we wanted AX to respond to Domain Events from other Systems, we would write a simple Event Consumer component that subscribed to Domain Events from other systems and executed the appropriate action in AX.
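A polling Event Generator along those lines could be sketched like this in Python. Everything here is hypothetical: `read_rows` stands in for whatever query reads the relevant AX data, `publish` stands in for the messaging framework, and the `ItemCreated`/`ItemChanged` event names are invented. It detects new and changed rows by comparing against the previous poll, and does not handle deletions.

```python
from typing import Callable

class PollingEventGenerator:
    """Polls a data source and publishes an event for each new or changed row."""
    def __init__(self, read_rows: Callable[[], dict],
                 publish: Callable[[str, dict], None]) -> None:
        self._read_rows = read_rows   # hypothetical: returns {row_id: row_dict}
        self._publish = publish       # hypothetical: hands the event to the message bus
        self._last_seen = {}

    def poll_once(self) -> None:
        current = self._read_rows()
        for row_id, row in current.items():
            if row_id not in self._last_seen:
                self._publish("ItemCreated", {"id": row_id, **row})
            elif row != self._last_seen[row_id]:
                self._publish("ItemChanged", {"id": row_id, **row})
        # Copy rows so later in-place edits by the source are still detected.
        self._last_seen = {row_id: dict(row) for row_id, row in current.items()}

# Usage with a fake "database" table and a list standing in for the message bus.
table = {1: {"price": 10}}
published = []
generator = PollingEventGenerator(lambda: table,
                                  lambda name, data: published.append((name, data)))
generator.poll_once()        # first poll sees row 1 as new
table[1]["price"] = 12
generator.poll_once()        # second poll sees the price change
generator.poll_once()        # nothing changed, nothing published
```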

Using this approach, other Systems that integrate with AX no longer need to understand the AX database schema or any unusual integration mechanism that AX may use. They only need to understand the Web Service Contract and the Domain Events raised by the AX System (just like every other System).

 

Typically System-To-System Communication is done primarily via Domain Events. Clients (i.e. UIs) primarily communicate via the Web Service(s).


Sunday, May 5, 2013 #

In the last couple of posts I talked about how larger aggregates make enforcing invariants easier, but smaller aggregates reduce concurrency conflicts.  You need to use domain knowledge to choose aggregate boundaries that minimize the chances of invariants spanning aggregates, and minimize the chances that multiple users will be editing the same aggregate simultaneously.

 

In this post I want to cover how I enforce the invariants (hopefully few) that do need to span aggregate boundaries.  As I see it there are basically two choices:

  1. Multi-Aggregate Locking
  2. Minimize-And-React Approach

If you recall, the problem with enforcing these invariants is that you need to acquire a lock on multiple aggregates in order to prevent a race condition.  Let’s look at the example of preventing customer orders that exceed that customer’s credit limit.  In order to enforce that you’d have some code that checked the invariant when creating new orders.  It would essentially do something like:

if (SUM(All-Outstanding-Orders) + NewOrder.Total > Customer.Credit_Limit)
    then Reject-Order
    else Save-Order

 

In order to avoid race conditions we need to guarantee that none of the data in the if statement changes between the time that the condition is evaluated and the time the order is saved.  In this case that means none of the existing Outstanding orders can change, the customer’s credit limit can’t change, and no new outstanding orders can be created.  If we assume that Customer and Order are separate aggregates, that means that we need to lock the Customer aggregate, and each aggregate corresponding to an Outstanding Order.  The tricky part can be that we also need to ensure no new Order Aggregates for that customer are created.
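Here’s the same check as a small Python sketch, along with a simulation of the race. The `Order` class, the `can_accept` function and the numbers are all made up for illustration. The two checks are run back to back against the same snapshot, which is what effectively happens when two requests interleave with no locking.

```python
from dataclasses import dataclass

@dataclass
class Order:
    customer_id: int
    total: float
    outstanding: bool = True

def can_accept(new_order: Order, existing_orders: list, credit_limit: float) -> bool:
    """Return True if accepting new_order keeps the customer within the limit."""
    outstanding_total = sum(
        o.total for o in existing_orders
        if o.customer_id == new_order.customer_id and o.outstanding
    )
    return outstanding_total + new_order.total <= credit_limit

orders = [Order(1, 400.0), Order(1, 300.0), Order(2, 900.0)]
first = Order(1, 250.0)
second = Order(1, 200.0)

# Both requests read the same snapshot before either one is saved...
first_ok = can_accept(first, orders, credit_limit=1000.0)    # 700 + 250 <= 1000
second_ok = can_accept(second, orders, credit_limit=1000.0)  # 700 + 200 <= 1000
# ...so both are saved, and the invariant is now violated (1150 > 1000).
orders += [first, second]
total_outstanding = sum(o.total for o in orders if o.customer_id == 1)
```

Each check was correct against the data it saw; the violation comes purely from the gap between checking and saving.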

Most people I talk to about this are surprised at the complexity I seem to be talking about.  They think that they have been writing applications for many years and never had to worry about this consistency stuff.  Indeed, I used to think the same way.  However, it turns out most applications have subtle race conditions such as the above and nobody even realizes it or cares.  And this is perfectly acceptable!  Rare race conditions may not be worth the development effort to eliminate and/or handle.  However, as an application architect, at a minimum I like to know that these race conditions exist so I can make a conscious decision whether I want to invest time dealing with them or not.  Even if the decision is that it’s not worth worrying about, I want to make sure that these are explicit decisions, rather than unexpected surprises when they arise later.

 

So having said that, we know we have an invariant that crosses aggregate boundaries, so there is a race condition.  What are our options?  Of course we can just ignore it. A race condition will rarely occur, and the impact of it occurring (e.g. a customer getting an order approved that exceeds their credit limit) may be acceptable.  Especially in applications that measure their users in dozens, race conditions should be extremely rare.  However, if you measure your users in thousands (or more), the rare race condition may actually occur quite frequently.

 

The first option from above is to implement some form of multi-aggregate locking.  For the Customer Credit Limit example, we would need a way to lock the customer aggregate (which contains the credit limit data), all active orders, and also prevent new orders being created for that customer.  There are a few options for doing this at the application level, and even at the database level (perhaps using transaction isolation levels).  However, I typically try to avoid the complexity involved in this approach.

 

The 2nd option is what I’m calling “Minimize-and-React”.  Rather than trying to prevent the race condition (via locking), try to minimize it (by doing the check at the last possible moment – as we probably are already doing), then put in place a mechanism to detect when the race condition has occurred and react appropriately.  In a lot of cases the “react” portion should probably just be sending an email to a human to investigate.  When using an architecture that uses “Domain Events” you can create what some people call a “Saga” (although not an entirely accurate term).  In this case you would create a Saga for each cross-aggregate invariant you wish to enforce, and have it subscribe to the appropriate events to detect when the invariant has been violated.  Then take the appropriate actions (e.g. send an email to notify somebody, or possibly execute compensating actions).

 

In the Customer Credit Limit example, I could create a Saga that subscribed to the events: CustomerCreditLimitChanged, OrderCreated (and probably other events such as CustomerOrderChanged, OrderCancelled, etc).  Basically, any events which could impact the evaluation of the invariant.  Since the Saga subscribes to Events, which by definition represent actions that have already occurred, it can detect violations of the invariant without being subject to the race conditions present in the Domain Model (Aggregates).  So in the Saga I would subscribe to the various events, and in each handler call the code to check whether the invariant has been violated.  Then take the appropriate action in response, typically either sending a notification to somebody, or taking some correcting/compensating action.
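A minimal sketch of such a Saga in Python might look like the following. The class, the handler names and the `notify` callback are hypothetical; in a real system the handlers would be wired up to the messaging framework’s subscriptions, and `notify` might send an email.

```python
class CreditLimitSaga:
    """Re-checks the credit-limit invariant after each relevant event."""
    def __init__(self, notify) -> None:
        self._notify = notify        # hypothetical callback, e.g. sends an email
        self._limits = {}            # customer_id -> credit limit
        self._outstanding = {}       # customer_id -> {order_id: total}

    def on_credit_limit_changed(self, customer_id, credit_limit) -> None:
        self._limits[customer_id] = credit_limit
        self._check(customer_id)

    def on_order_created(self, customer_id, order_id, total) -> None:
        self._outstanding.setdefault(customer_id, {})[order_id] = total
        self._check(customer_id)

    def on_order_cancelled(self, customer_id, order_id) -> None:
        self._outstanding.get(customer_id, {}).pop(order_id, None)
        self._check(customer_id)

    def _check(self, customer_id) -> None:
        limit = self._limits.get(customer_id)
        total = sum(self._outstanding.get(customer_id, {}).values())
        if limit is not None and total > limit:
            self._notify(f"Customer {customer_id}: outstanding {total} exceeds limit {limit}")

alerts = []
saga = CreditLimitSaga(alerts.append)
saga.on_credit_limit_changed(1, 1000)
saga.on_order_created(1, "A", 700)
saga.on_order_created(1, "B", 450)   # 1150 > 1000, so a notification is raised
saga.on_order_cancelled(1, "B")      # back to 700, no further notification
```

Every handler funnels into the same `_check`, so adding another event that affects the invariant means adding one more handler rather than touching the Aggregates.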


Monday, April 15, 2013 #

In the last post we looked at how aggregate boundaries affect our ability to provide consistency guarantees and enforce invariants across our domain model.  What we said is that enforcing an invariant within an aggregate boundary – rather than invariants that span aggregates – is much easier to do.  So based on that we would want to design our software with very large aggregates.  Taken to the extreme we could have the entire domain model within a single aggregate.  This would allow us to easily enforce any invariant without ever needing to worry about consistency across aggregate boundaries.

The downside to having excessively large aggregates is the impact it has on scalability.  I’m not talking about scalability in terms of adding more servers and hardware to increase throughput, but rather scaling the number of users using the system.  When you have large aggregates, that also means that when you “lock” your data to provide consistency guarantees you are locking large amounts of data at once.  In the extreme example of having the entire domain inside one aggregate, you will be locking the entire domain model.  If your system is only ever used by a single user at a time, then that is actually perfectly reasonable.  However, most systems we build are used by multiple users at the same time.  If we had one giant aggregate, then any time anybody changed any data it would increment the version number, and any other edits in progress would get a concurrency exception when they tried to save (when the concurrency check looks at the version # and sees somebody else has changed it in the middle of that user editing it).

If we start to split up our domain model into smaller aggregates, it reduces the likelihood of concurrency exceptions happening.  If we have each Customer as an Aggregate (containing the Orders), then you will only get concurrency exceptions if two users are trying to edit Orders for the same customer.  If you make Customer and Order separate aggregates you only get concurrency exceptions if two users are trying to edit the same order at the same time.

So now we have two competing desires: larger aggregates give us flexibility for enforcing invariants, smaller aggregates give us less chance of concurrency exceptions.  We have to make a tradeoff between these two properties.  We have a little more flexibility than just the size of the aggregate though; we can strategically choose where to place those boundaries.  You can have two similarly sized aggregates that encompass different sets of entities; one of those aggregate boundaries may be better than the other.

What I try to do is choose aggregate boundaries such that most of my system’s invariants will not have to span aggregates, but also try to choose them such that there is a small likelihood that multiple users will be simultaneously updating the same aggregate.  Ultimately this all comes down to examining your business domain, and expected usage patterns of your application in order to make the best decision here.  Let’s look at a couple of examples.

 

Customer / Orders Example

In the Customer/Order example, we’ve talked about 3 different possibilities for Aggregate boundaries:

  1. Single aggregate encompassing the entire domain model
  2. Customer aggregate that contains Order entities
  3. Separate aggregates for Customer and Order

Assuming we have a system that is used by many users simultaneously, we can probably rule out option #1 pretty easily.  In order to decide between #2 and #3 I’d have a discussion with the domain experts, and try to get a feel for the usage patterns for creating/maintaining the Orders data.  Do they have account managers that are responsible for specific customers?  If so it’s unlikely that multiple users will be editing Orders belonging to the same customer, so I would likely go with option #2 because it makes enforcing invariants across orders easier.  If it was more of a call-center type business where anybody can enter orders for any customer I might start considering option #3.  However, I might also start asking about their typical scenarios.  If we’re talking about the system that takes online delivery orders for Pizza Hut, it’s pretty unlikely that multiple orders for the same customer are going to be undergoing changes at the same time (by multiple users).  In fact, I’m having a hard time coming up with any example system that takes customer orders that would commonly have multiple users editing orders for the same customer at the same time.  That would lead me towards option #2 from above.  But the key point I’m trying to make is that the decision should be driven by business/domain knowledge, and take into account the consistency vs scalability tradeoffs.

 

Poker Example

Let’s look at another example, my software to manage a weekly poker league.  In this case I could see a couple of obvious choices:

  1. Single aggregate encompassing the entire domain model
  2. Each Game is an aggregate

If we remember the sample invariants from the last blog post, the examples I used were:

  1. For a completed poker game, total pay-in must equal total pay-out
  2. There can be only one poker game for each week

The first invariant can be enforced easily enough with either choice of aggregate boundaries (all the data involved is contained in a single Game), but the 2nd invariant would span aggregates if we had an aggregate for each game (we need to look at the set of all games in order to validate the invariant).  So there’s a clear consistency advantage for option #1.

If we look at it from a scalability perspective, let’s consider whether we are likely to have multiple users editing data at the same time.  In this example scenario it’s actually pretty reasonable to have the entire domain model as a single aggregate.  The only significant data updates are somebody entering the results of a new game (once a week), or maybe fixing some past mistakes.  Regardless, it’s unlikely there will be multiple people editing data at the same time, so in this case I would introduce a new entity called League (we need something to act as the aggregate root), and have it contain a collection of all Games.
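As a rough Python sketch of what that League aggregate could look like (the class shapes, field names and the `record_completed_game` method are my own invention for illustration), both sample invariants can be checked in one place because all the data lives inside the aggregate:

```python
from dataclasses import dataclass, field

@dataclass
class Game:
    week: int
    pay_ins: dict = field(default_factory=dict)    # player -> amount paid in
    pay_outs: dict = field(default_factory=dict)   # player -> amount paid out

class League:
    """Aggregate root holding every Game, so both invariants stay inside it."""
    def __init__(self) -> None:
        self.games = []
        self.version = 1

    def record_completed_game(self, game: Game) -> None:
        if any(g.week == game.week for g in self.games):
            raise ValueError(f"week {game.week} already has a game")
        if sum(game.pay_ins.values()) != sum(game.pay_outs.values()):
            raise ValueError("total pay-in must equal total pay-out")
        self.games.append(game)
        self.version += 1   # one version # covers the whole league

league = League()
league.record_completed_game(Game(1, {"ann": 20, "bob": 20}, {"ann": 40}))
errors = []
for bad in (Game(1, {"ann": 20}, {"ann": 20}),              # duplicate week
            Game(2, {"ann": 20, "bob": 20}, {"bob": 30})):   # 40 in, 30 out
    try:
        league.record_completed_game(bad)
    except ValueError as exc:
        errors.append(str(exc))
```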

If we take this example a little further, let’s imagine we want to offer our poker league manager as SaaS.  Now we have many leagues stored in our domain model.  In that case it doesn’t seem reasonable to have somebody editing one league’s data lock *all* leagues’ data (as it would if the entire domain model was still a single aggregate).  In that case it would seem to make sense to have each separate League be its own Aggregate.  This also appears to work well as it’s unlikely we would have any invariants that span Leagues.

 

 

In the next post we’ll take a look at what I do when I realize I need an invariant that spans Aggregate boundaries (hint: it’s much more painful than invariants within an Aggregate boundary).


Sunday, April 7, 2013 #

Those who know me know I’m a pretty big fan of the CQRS set of design patterns.  CQRS style architectures typically borrow / build-upon the DDD (Domain Driven Design) set of patterns (in fact before Greg Young coined the term CQRS he was calling it DDDD [Distributed DDD]).  One pattern that’s pretty central in DDD is the concept of Aggregates.  This is the practice of splitting your domain model up into pieces, and these pieces are what we call Aggregates.  Each aggregate may contain several “Entities”, but must contain a specific Entity that is designated as the Aggregate Root.  Examples of Aggregates could be Customer, Product, Order, etc.

 

A lot of people, even people that claim to be doing DDD, will just naturally make almost every entity into its own Aggregate.  They are missing an important design decision around scoping your Aggregate boundaries appropriately. As per Evans’ DDD book, Aggregate Roots are intended to define the consistency and transactional boundaries of your system.  This has some really significant implications that make it important to choose your Aggregate boundaries with care.  There’s a bunch of literature providing guidance around choosing your Aggregate boundaries, but in this blog post I want to talk a little bit about what I think about when I do this, and provide some examples.

 

Consistency

When designing software you need to understand what consistency guarantees you have (and probably more importantly the guarantees you don’t have).  I see too many intermediate/advanced software developers take on the task of designing/architecting important software, without properly understanding the consistency aspects of the system and the tradeoffs involved.

 

Consistency is being able to guarantee that a given set of data is all from a specific identical point in time (I’m sure there’s a better official definition, but that’s how I think about it).  This is important because most software has a set of invariants (fancy word for “rules”) that you want to enforce across the domain model.  A few examples of invariants might be (I’m in the middle of building some software to manage our weekly poker league, so I hope you like poker related examples):

  • Total value of unpaid orders for a customer must not exceed that customer’s credit limit
  • For a completed poker game, total pay-in must equal total pay-out
  • Username must be unique
  • There can be only one poker game results for each week (weekly league)

 

These are rules that our software system is expected to enforce.  If a customer tries to place an order that would exceed their credit limit the system should reject it.  Likewise, if somebody tries to enter a username that’s already taken the system should reject that to ensure the invariant is kept intact.

   

What might not be immediately obvious is that you need to have some consistency guarantees in order to enforce every single one of those invariants.  “Locking” goes hand in hand with consistency, as that’s typically how you achieve consistency guarantees.  So for the first example (orders + credit limit), in order to enforce that invariant you need to have a consistent data set representing all of that customer’s unpaid orders, *and* you need to be able to acquire some kind of lock, so you can ensure that nobody writes a new order in between the time you do the invariant check (sum(orders) + new_order_cost <= customer.credit_limit) and the time you save the new order.  If you can’t lock that data, you end up with a race condition that could result in the invariant being violated.

   

Most software I encounter uses optimistic locking to achieve this.  Usually this means adding a version # to your entities/aggregates, then checking when saving that it hasn’t changed since you retrieved it.  For example, if the user is editing the customer information, the software will keep track of the customer version # that was retrieved when the user started editing, then when they hit save the system will check that the version # in the database hasn’t changed before it writes the updates (if it has changed it will reject the update with some kind of concurrency exception).  You also need some way to “lock” the Customer aggregate/entity to prevent race conditions between the time we check the version # and the time we actually write the updates.  For a typical system that uses a Relational DB (e.g. SQL Server), you might be able to rely on DB features to enforce the locking and prevent race conditions.  If you’re doing something like Event Sourcing you will need to implement your own or use a 3rd Party Framework that does this for you.
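Here’s a small Python sketch of the version # approach. The `CustomerStore` class and `ConcurrencyError` exception are hypothetical, and the in-process `threading.Lock` is only a stand-in for whatever the database (or your framework) does to make the version check and the write atomic.

```python
import threading

class ConcurrencyError(Exception):
    """Raised when the stored version differs from the one the editor loaded."""

class CustomerStore:
    def __init__(self) -> None:
        self._rows = {}                 # customer_id -> (version, data)
        self._lock = threading.Lock()   # stands in for the database's own locking

    def insert(self, customer_id, data) -> None:
        with self._lock:
            self._rows[customer_id] = (1, dict(data))

    def load(self, customer_id):
        with self._lock:
            version, data = self._rows[customer_id]
            return version, dict(data)

    def save(self, customer_id, expected_version, data) -> int:
        # Check-and-write happens under one lock so nothing can slip in between.
        with self._lock:
            current_version, _ = self._rows[customer_id]
            if current_version != expected_version:
                raise ConcurrencyError(
                    f"expected v{expected_version}, found v{current_version}")
            self._rows[customer_id] = (current_version + 1, dict(data))
            return current_version + 1

store = CustomerStore()
store.insert(1, {"name": "Acme"})
version_a, data_a = store.load(1)   # user A starts editing
version_b, data_b = store.load(1)   # user B starts editing the same customer
new_version = store.save(1, version_a, {"name": "Acme Ltd"})   # A saves first
try:
    store.save(1, version_b, {"name": "ACME"})                 # B's version is stale
    b_rejected = False
except ConcurrencyError:
    b_rejected = True
```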

   

If we come back to the original topic – aggregate boundaries – these come into play because it turns out it’s pretty straightforward to enforce invariants within an aggregate, but if you have an invariant that spans multiple aggregates, it becomes significantly harder.

 

 

Back to the Customer/Orders example.  If we assume that Customer and Order are separate Aggregates, then they will each have their own version #’s.  In order to enforce the credit limit invariant we need to get all unpaid orders for that customer, sum up the order totals, and compare with the customer’s credit limit.  To do this properly we need to make sure that the data doesn’t change out from under us while we’re checking the invariant, meaning we would in theory need to lock the customer and every order that we’re looking at.  But we would also need to ensure that no new orders for that customer are created.  With the simple version # per aggregate implementation, that is simply not supported (at least not without a lot of added complexity).

   

What if we were to change our aggregate boundaries?  Let’s say that Order isn’t a separate aggregate but we have a collection of Order entities contained within the Customer aggregate (Customer entity is the aggregate root).  Now enforcing the invariant is easy, because all the data necessary is contained within a single Aggregate.  We can easily lock the Customer aggregate (using the single version # we have) and enforce our invariant.
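In code, that boundary might look something like this Python sketch. The `Customer` class, its `place_order` method and the exception name are made up for illustration, and orders are reduced to their totals to keep it short. The version # would be compared against the stored one when the aggregate is saved; that persistence step isn’t shown here.

```python
class CreditLimitExceeded(Exception):
    pass

class Customer:
    """Aggregate root; Orders live inside it and share its single version #."""
    def __init__(self, credit_limit: float) -> None:
        self.credit_limit = credit_limit
        self.unpaid_orders = []   # order totals, kept simple for the sketch
        self.version = 1

    def place_order(self, total: float) -> None:
        # All the data the invariant needs is inside this one aggregate.
        if sum(self.unpaid_orders) + total > self.credit_limit:
            raise CreditLimitExceeded(total)
        self.unpaid_orders.append(total)
        self.version += 1         # any change bumps the one aggregate version

customer = Customer(credit_limit=1000.0)
customer.place_order(700.0)
try:
    customer.place_order(450.0)
    rejected = False
except CreditLimitExceeded:
    rejected = True
```

Because a new order changes the Customer’s version #, two concurrent orders for the same customer can’t both save against the same version, which closes the “no new orders created” gap.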

   

There are certainly techniques for enforcing invariants that cross aggregate boundaries, but they definitely add complexity (more on this later).

   

If we only consider consistency guarantees when designing our aggregate boundaries, then we would want to make our aggregates as large as possible.  The bigger the aggregate, the more power and flexibility we have to easily enforce invariants.  If we take it to the extreme, we could make our entire domain model a single aggregate, with one version # for the entire domain.  However, consistency isn’t the only consideration.  We need to make a tradeoff between Consistency and Availability/Scalability.

 

In the next post I’ll take a look at how Availability / Scalability comes into play when choosing Aggregate boundaries, and take a look at options for enforcing invariants that span aggregate boundaries.


Sunday, March 24, 2013 #

I’ve been working with a lot of clients over the past couple years helping them adopt TFS Lab Management.  One discussion that always comes up is how to architect the infrastructure required to run TFS Lab.  I’m going to try and put down in writing the advice I usually give so I have somewhere to point people to in the future.

There are 3 main components in TFS Lab:

  • Hyper-V Host(s) – A server to host the running Virtual Machines (and yes, it must be Hyper-V)
  • Library Server(s) – A place to store the VM Templates and stored VMs.  This is essentially just a network file share.
  • SCVMM Server – The centralized server that manages all this infrastructure.

The Hyper-V host must be a physical server (no you can’t create a VMWare VM and run Hyper-V inside of that – well, my co-worker actually had a client that got that to work, but the performance was horrendous so don’t do that).  This means that setting up TFS Lab for the first time will require you to purchase/acquire at least one physical server.

If your datacenter is run off VMWare, like so many of my clients’ are, it’s usually not a big deal to purchase a server specifically for TFS Lab that sits off in a corner running Hyper-V.  In fact, even if you run your datacenter on Hyper-V already, I’d still recommend isolating your TFS Lab from your main virtualization infrastructure (i.e. set up a new SCVMM and hosts, don’t try to reuse your existing ones).

 

Single-Server Deployment

Most of my clients start by adopting TFS Lab for one specific project, with the intention that if they like what they see they will scale up its use across the rest of their projects in the future.  What I usually recommend to get started is to purchase a single beefy server (more below on typical server specs and price), and run all components off that one server.

[Diagram: Single Server]

For a single team project this works great.  One nice thing is that because everything is on a single server, your network infrastructure won’t be a factor in performance.

Note: SCVMM requires a SQL Server to store its configuration data.  This is not pictured here, but I will almost always use the same SQL Server that TFS uses for its configuration/collection databases (i.e. not any of the servers pictured here).

Notice I host the SCVMM instance inside of a Virtual Machine (and SCVMM manages the host that it is actually running in; sounds kind of wacky but it works fine).  This is contrary to the Microsoft guidance.  My co-workers and I have set up TFS Lab for many clients, typically putting SCVMM inside of a VM, and have had no issues.  In fact there are some important benefits you gain by doing this.  Most importantly, it becomes easier to move the SCVMM server off to a different physical host down the road (as we will do in the below examples).  If you install the SCVMM server on the physical machine (like so many people tend to do), when it comes time to scale out your Lab Infrastructure it is much harder to re-locate that SCVMM server elsewhere.

 

 

Physical Hardware Advice

For teams starting with TFS Lab, the typical hardware I recommend is a server costing around $15,000.  Nowadays, that should translate to a server with around 16 physical cores (32 logical with HyperThreading), 128 GB of RAM, and 6x 2TB 7.2k RPM SATA drives (2TB RAID 1 Array for native OS + Library, 4TB RAID 10 Array for Host).  Obviously, this depends on the size of the team, and the complexity/size of the environment required for your application.  But probably 90% of the teams I help set up Lab for the first time end up going with around a $15k server to start.  Note: There is Microsoft guidance somewhere that recommends against using Hosts with more than 48 GB of RAM; that guidance is outdated and misleading IMO, and I suggest you disregard it.

See my post on Lab Capacity Planning for a more detailed approach to determining hardware requirements.

I usually create Lab VM’s with 1 CPU + 4 GB RAM, and typically budget about 100GB per VM for the VHD + Snapshots.  Leaving some resources for the host OS, the above machine specs would allow you to create ~30 VM’s.
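As a rough sketch of that arithmetic, the snippet below works out how many VMs each resource allows and which one runs out first.  The server and per-VM numbers are the ones quoted above; the amounts held back for the host OS (`RESERVED_CORES`, `RESERVED_RAM_GB`) are assumptions picked purely for illustration, so adjust them to whatever your host actually needs.

```python
# Rough VM capacity estimate for a single TFS Lab host.
# Server and per-VM figures come from the text; the resources held back
# for the host OS are assumed values chosen for illustration only.

HOST_LOGICAL_CORES = 32      # 16 physical cores with HyperThreading
HOST_RAM_GB = 128
HOST_VM_STORAGE_GB = 4000    # the 4TB RAID 10 array used by the Host

VM_CPUS = 1
VM_RAM_GB = 4
VM_DISK_GB = 100             # VHD + snapshots

# Assumptions (not from the post): what the host OS keeps for itself.
RESERVED_CORES = 1
RESERVED_RAM_GB = 8


def max_vms():
    """Return the per-resource VM limits, the bottleneck, and the overall limit."""
    limits = {
        "cpu": (HOST_LOGICAL_CORES - RESERVED_CORES) // VM_CPUS,
        "ram": (HOST_RAM_GB - RESERVED_RAM_GB) // VM_RAM_GB,
        "disk": HOST_VM_STORAGE_GB // VM_DISK_GB,
    }
    bottleneck = min(limits, key=limits.get)
    return limits, bottleneck, limits[bottleneck]


limits, bottleneck, capacity = max_vms()
print(limits)
print(bottleneck, capacity)
```

Under these assumptions memory is the first thing to run out, which lines up with the ~30 VM estimate.  One CPU per VM is treated here as one logical core, which is a simplification.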

Note: TFS Lab performance is very dependent on the disk subsystem.  You want to maximize the number of spindles to increase parallelism, so I usually advise going with many cheaper 7.2k RPM drives to maximize spindles and data density.  For ultimate performance SSD is an obvious choice, but TFS Lab still requires a large amount of storage, making SSD too expensive for most teams (hopefully that will change in the next couple of years).  There are two scenarios where performance can be an issue: the large transfers from Library->Host that occur when you deploy a new Environment, and the operation of an existing environment.  I tend to focus on the latter, which means paying close attention to the disks used by the Host(s).  I spent a bunch of time recently working with a client to diagnose performance issues; to benchmark disk performance the SQLIO tool and this article are priceless.  Also be careful when it comes to SAN.  SAN storage tends to be much more expensive than locally attached disk, especially in the capacities we’re usually dealing with for TFS Lab (watching the face of a SAN Engineer when requesting 10 TB of storage is a fun exercise regardless =)), and there are many more moving parts in a SAN, which means more potential bottlenecks.

 

 

Scaling To a 2nd Team

Let’s imagine that we’ve got a single-server TFS Lab set up and running, the team using it loves it, and a 2nd team (separate TFS Team Project) wants to start using it.

Sure, you could share the same single-server setup across multiple team projects.  But unless money/hardware is extremely tight I wouldn’t recommend doing that.  The problem is that both teams will be sharing the same finite set of hardware resources (CPU, Memory, Disk), and there usually isn’t much visibility across teams.  What can happen is Team A spins up a bunch of Lab Environments in the morning, then when Team B tries to spin some up in the afternoon they get errors about No Suitable Host Available, because Team A has already used up the host’s available resources.  You can combat this by either over-provisioning the hardware so that it’s unlikely it will get maxed out, or by ensuring that the teams sharing the same Host(s) communicate well to avoid stepping on each other’s toes.

Instead, what I prefer to do is have dedicated hardware for each Team Project.  Specifically dedicated Host(s) and Library for each Team Project.  The SCVMM instance will still be shared between all Team Projects. If we take the above Single-Server/Single-Team-Project architecture, and scale it out for a 2nd Team Project it might look like below:

Two Team Projects

In this scenario, what I’ve done is dedicate the original server (#1) to Team Project A, and bought 2 new servers: another $15k server for Team Project B, and a smaller/cheaper server (~$2500) dedicated to run SCVMM. 

I moved the SCVMM Virtual Machine off of Server #1 onto the new Server #3.  Because SCVMM was in a VM, this migration is extremely simple.  SCVMM is a shared resource across all Team Projects so I don’t want it to reside on any hardware dedicated to a specific Team Project.  In this scenario the server that hosts the SCVMM Virtual Machine doesn’t even need to be Hyper-V; this is the one case where I don’t mind hosting the VM in the organization’s primary virtualization infrastructure (even if that’s VMware).

Also of note is that I configure multiple Libraries (one per Team Project), and for this scenario where each Team Project only has a single host server, I place the library on the same physical server as the host.  This has the benefit that the large Library->Host and Host->Library transfers never need to hit the network.

 

 

Scaling a Single Large Team Project

The other important scenario is when you have a single Team Project that outgrows the single-server deployment.  They simply need more resources (# of VM’s, CPU, RAM, Disk) than the original single-server can provide.  In this case I aim for an architecture something like this:

One Big Team Project

As in the previous example I’ve moved the SCVMM VM off to its own host.  Since the Library is now shared between several hosts I also move that off to some centralized location.  In the image above I have it located on the physical server that hosts the SCVMM VM; however you could also place it on its own dedicated physical server (if I had multiple Team Projects I would definitely do this, as you’ll see below), or you could place it in a VM hosted on the same host (Server #3) or elsewhere.  You want to pay attention to the network routing between your Library and Hosts.  There will be large transfers happening (potentially hundreds of gigabytes at once), so you want the network between them to be fast and short.  Typically you want to ensure that the hosts and Library are connected to the same physical network switch (I’m starting to see more people putting a 10GigE switch in place just for TFS Lab even if the rest of their network is still slower).

 

 

Mature TFS Lab Infrastructure

The final example is combining these various scenarios together into an organization that has many Team Projects, some large enough to require multiple hosts, and some where a single-server will suffice:

Multiple Team Projects

 

 

Configuring TFS to Dedicate Hosts/Libraries to Team Projects

Something to note is that the TFS Admin Console only allows you to assign Host Groups and Libraries to Team Project Collections, not to individual Team Projects.  It’s still possible, but you have to use the command-line rather than the TFS Admin GUI.  You first have to assign your Host Group(s) and Libraries to the Team Project Collection in the GUI (make sure to turn Auto-Provision off).  Then you have to run the following commands to assign the various host groups and libraries to specific Team Projects:

TFSLabConfig CreateTeamProjectHostGroup

TFSLabConfig CreateTeamProjectLibraryShare


Sunday, March 10, 2013 #

I’ve spent a bunch of time lately with clients helping them understand why their applications are so slow and how to improve performance.  This often comes down to their use (or misuse) of ORM frameworks such as nHibernate and/or Entity Framework.  I think this probably stems from the fact that ORM’s have gone mainstream somewhat recently, and most developer teams realize they should be using one, but they have never really learned the intricacies of how to use one properly.

 

The first thing I do is pull out SQL Profiler and run through some common scenarios in their application and just get a rough count of how many DB queries happen in each application scenario.  A lot of teams are surprised when they see hundreds or thousands of queries being executed as a result of a single button click in their application.

 

In my experience teams seem to be suffering from one of two problems, either loading too much data at once (eager loading), or loading too little (lazy loading).  The lazy loading problem is probably more common, but the eager loading scenario is easier to explain so I’ll start with that.

 

 

Eager Loading

I’ve run into a few code-bases where they have explicitly turned off lazy-loading in nHibernate (lazy loading is the default behavior).  Unless you explicitly partition your domain model (e.g. using Aggregate boundaries like DDD proposes), not using lazy loading can result in massive amounts of data being retrieved from the DB for seemingly simple scenarios.  Think of your Domain Model as a giant object graph, where you have many types of objects, most with links to other objects.  When you ask nHibernate for any object, it will automatically retrieve the object you asked for from the DB, *plus* any linked objects, and any objects linked from those, and on and on, until it has populated an entire object graph into memory for you.  When you have any non-trivial domain model, this can be a huge amount of data.  Let’s look at an example:

Object Graph

If we have lazy loading turned off and do a simple operation like asking nHibernate to give us an Invoice with a specific Id, nHibernate will go and retrieve that row from the Invoices table, but it will also get the related InvoiceBatch object, and all of the InvoiceItems, and for each InvoiceItem it will retrieve the Shipment object, and the Product object, and for each Product it will get the Product Group, and so on.

 

It can get really bad if you have circular references in your domain model, which is fairly common because it is so convenient for writing business logic (e.g. the Invoice object has a collection of InvoiceItems, and the InvoiceItem also contains an Invoice object).  In our example, let’s assume that InvoiceBatch contains a collection of child Invoice objects, and each Invoice contains an InvoiceBatch object.  When we ask nHibernate for a single Invoice, it will populate the InvoiceBatch object, which will in turn populate the Invoices collection and all objects related to every Invoice in that collection.  As another example, imagine we have an Employee object that has a property referencing the Manager (also an Employee object), and also has a collection of Employees representing the Subordinates.  When you retrieve any Employee it will also retrieve the Manager Employee object, then his Manager, and his Manager, and so on until you get up to the top (CEO), then it will get all of the CEO’s subordinates, and all of their subordinates, and so on.  Ultimately, this means anytime you ask nHibernate to get a single Employee it is actually retrieving *all* employees, along with any other related objects.
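To make the Employee case concrete, here is a small sketch in Python (not nHibernate code) that follows every association transitively, which is effectively what happens when nothing is lazy.  The org chart and the names in it are made up for illustration.

```python
from collections import deque

# Hypothetical org chart: employee name -> manager name (None for the CEO).
MANAGER_OF = {
    "ceo": None,
    "vp_sales": "ceo",
    "vp_eng": "ceo",
    "rep_1": "vp_sales",
    "rep_2": "vp_sales",
    "dev_1": "vp_eng",
    "dev_2": "vp_eng",
}


def linked_employees(name):
    """Employees directly referenced by `name`: its subordinates and its manager."""
    links = [emp for emp, mgr in MANAGER_OF.items() if mgr == name]
    if MANAGER_OF[name] is not None:
        links.append(MANAGER_OF[name])
    return links


def eager_load(start):
    """Follow every association transitively, as loading without lazy proxies would."""
    loaded = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for other in linked_employees(current):
            if other not in loaded:
                loaded.add(other)
                queue.append(other)
    return loaded


loaded = eager_load("dev_1")
print(len(loaded), "of", len(MANAGER_OF), "employees loaded")
```

Asking for one developer walks up to the CEO and back down every branch, so the whole company ends up in memory.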

 

 

Lazy Loading

The solution to this is to either partition your domain model in some way (e.g. Aggregates as per DDD), or use Lazy Loading (the default in nHibernate).  Lazy Loading works by only retrieving the Invoice object from the DB, then loading any sub-objects only if and when you attempt to access them (aka lazily).  This ensures that only the minimal set of data you need to do your work is retrieved from the database.  nHibernate does its lazy loading in a way that is mostly transparent to the developer: when you ask nHibernate for an Invoice object, it actually generates a dynamic proxy object that looks like an Invoice (it inherits from Invoice), but has some hooks in there to allow nHibernate to intercept any property access so it can lazy load them as needed.

 

However, lazy loading has its own problems, and these are probably more common due to the fact that lazy loading is on by default.  These problems are commonly called the Select N+1 Problem.  Let’s say I had a screen with a grid displaying a list of Invoices, and one of the fields in that grid is Invoice.Customer.Offices[0].Address.City.  What will happen is nHibernate will execute a single query to retrieve all the Invoices I ask for, but then when I try to render it into the grid I’ll have to loop through each Invoice and access the Customer property (which will trigger nHibernate to fire off a SQL query), then access the Customer.Offices collection (another query), then the Office.Address (another query), and finally retrieve the City for display.  These queries will happen separately for every Invoice displayed in the grid.  So if I have 30 invoices displayed in the grid, I could potentially have 91 SQL queries executed (1 for the list of invoices, plus 3 for each of the 30 rows).  And that is a relatively simple scenario; in a more complex (realistic) application this problem can become a serious performance concern.
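The query count is easy to reproduce with a toy model.  The Python sketch below is not nHibernate; `QueryCounter` and `Lazy` are stand-ins invented for this illustration, and it assumes the worst case where every invoice has a different customer, so nothing is ever found already loaded.

```python
class QueryCounter:
    """Stands in for the database: it only counts the queries it is asked to run."""

    def __init__(self):
        self.count = 0

    def run(self, description):
        self.count += 1


class Lazy:
    """A lazily loaded association: one query on first access, cached afterwards."""

    def __init__(self, db, description, loader):
        self._db = db
        self._description = description
        self._loader = loader
        self._loaded = False
        self._value = None

    def get(self):
        if not self._loaded:
            self._db.run(self._description)
            self._value = self._loader()
            self._loaded = True
        return self._value


def load_invoices(db, n):
    """One up-front query for the invoice list; every association stays lazy."""
    db.run("select all invoices")
    invoices = []
    for i in range(n):
        address = Lazy(db, "address", lambda i=i: {"city": f"City {i}"})
        offices = Lazy(db, "offices", lambda address=address: [{"address": address}])
        customer = Lazy(db, "customer", lambda offices=offices: {"offices": offices})
        invoices.append({"id": i, "customer": customer})
    return invoices


def render_grid(invoices):
    """Read Invoice.Customer.Offices[0].Address.City for every row."""
    return [
        inv["customer"].get()["offices"].get()[0]["address"].get()["city"]
        for inv in invoices
    ]


db = QueryCounter()
invoices = load_invoices(db, 30)
cities = render_grid(invoices)
print(db.count)
```

Each row costs three extra queries (Customer, Offices, Address) on top of the single query for the list, which is where the 91 comes from.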

 

What we need is a middle-ground between the first scenario (load everything), and the 2nd scenario (load minimal, and lazy load everything else).  Most modern ORM frameworks will have support for programmatic “eager loading”.  Usually you will have some kind of Repository layer/class in your application.  This is where you want to put this code.  You’ll still leave lazy loading turned on in all your nHibernate mappings, but then in your repository functions you can tell it specifically how much of the object graph should be loaded up-front (and the rest will still be lazy loaded if/when accessed).  With nHibernate this is done via the Fetch/FetchMany/ThenFetch/ThenFetchMany methods.

 

Let’s take the previous example where we want to display Invoices in a grid and include a column that displays Invoice.Customer.Offices[0].Address.City.  What we’d like to have happen is for nHibernate to load up this data for all 30 invoices in as few queries as possible (ideally one).  Previously we would have retrieved the list of invoices by doing a simple session.Query (assuming our grid is displaying all Invoices).  Now we might have code in our Repository that looks something like this:

Original GetAllInvoices

I’m going to modify this method to let nHibernate know that I want all Invoices, but I also want it to go ahead and populate the Customer, Offices, and Address objects at the same time.

Eager GetAllInvoices

I’ll let you look up how the Fetch functions work yourself. This should result in a single SQL query that loads all the necessary data (instead of the previous 91 queries).  You may have many areas of your app that require a list of invoices, and some require more or less of the object graph to be loaded.  Often you will see several methods in the InvoiceRepository that return a list of all Invoices, but the different methods will eager load different subsets of the object graph for different uses.

 

When I’m trying to optimize the lazy/eager loading behavior of my code, I’ll find myself spending a bunch of time going through in the debugger with SQL Profiler open, and seeing what code is triggering lazy-loading queries, and starting to build up a list of all the pieces of the object graph I might want to eager load.

 

 

nHibernate Gotcha – Don’t Do Multiple FetchMany

There is a major gotcha to be aware of (at least with the nHibernate eager loading).  If you try to eager load more than one collection in a single query the results won’t be what you expect.  Put another way, never use more than one FetchMany/ThenFetchMany in a single query.  Let’s look at an example: say we wanted to load all Invoices, and also eager load all Offices and Contacts for the related Customers.  We might try writing code like this:

Broken GetAllInvoices

If you look at the SQL being executed it’s actually doing something like this:

SELECT ...
FROM Invoices LEFT OUTER JOIN Customers ON Invoices.CustomerId = Customers.CustomerId
              LEFT OUTER JOIN Offices ON Offices.CustomerId = Customers.CustomerId
              LEFT OUTER JOIN Contacts ON Contacts.OfficeId = Offices.OfficeId

What happens is that as you add more relationships that are collections, the result set grows large since SQL is doing a Cartesian product of all the collections (so if you have 10 invoices, with 10 customers, and each customer has 4 offices, and each office has 7 contacts, you get a result set of 280 rows).  nHibernate doesn’t deal with this well (it won’t complain, it will just return an incorrect object graph, which makes the problem even worse IMO).  I believe that in this example, if you examined the resulting object graph, each Customer would show itself as having 28 offices (when in fact it should have 4 offices, with 7 contacts in each).
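The row counts can be checked with a throwaway database.  This sketch uses Python’s built-in sqlite3 module rather than SQL Server, with a minimal schema assumed for illustration (just the key columns the join needs), and runs the same shape of query as above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerId INTEGER PRIMARY KEY);
    CREATE TABLE Invoices  (InvoiceId  INTEGER PRIMARY KEY, CustomerId INTEGER);
    CREATE TABLE Offices   (OfficeId   INTEGER PRIMARY KEY, CustomerId INTEGER);
    CREATE TABLE Contacts  (ContactId  INTEGER PRIMARY KEY, OfficeId   INTEGER);
""")

for customer_id in range(1, 11):                 # 10 customers, one invoice each
    conn.execute("INSERT INTO Customers VALUES (?)", (customer_id,))
    conn.execute("INSERT INTO Invoices (CustomerId) VALUES (?)", (customer_id,))
    for _ in range(4):                           # 4 offices per customer
        office_id = conn.execute(
            "INSERT INTO Offices (CustomerId) VALUES (?)", (customer_id,)
        ).lastrowid
        conn.executemany(                        # 7 contacts per office
            "INSERT INTO Contacts (OfficeId) VALUES (?)", [(office_id,)] * 7
        )

rows = conn.execute("""
    SELECT Invoices.InvoiceId, Customers.CustomerId, Offices.OfficeId, Contacts.ContactId
    FROM Invoices LEFT OUTER JOIN Customers ON Invoices.CustomerId = Customers.CustomerId
                  LEFT OUTER JOIN Offices ON Offices.CustomerId = Customers.CustomerId
                  LEFT OUTER JOIN Contacts ON Contacts.OfficeId = Offices.OfficeId
""").fetchall()

rows_for_first_customer = [r for r in rows if r[1] == 1]
distinct_offices = {r[2] for r in rows_for_first_customer}

print(len(rows))                     # rows returned by the join
print(len(rows_for_first_customer))  # rows that mention customer 1
print(len(distinct_offices))         # offices customer 1 really has
```

The join returns 280 rows, and 28 of them mention any given customer even though that customer has only 4 distinct offices.  A mapper that adds one collection entry per row would presumably arrive at the inflated count described above.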

 

Luckily, there is a solution.  nHibernate essentially has its own in-memory cache scoped to the session.  When it goes to lazy load something, it will first look in this cache to see if the object has already been loaded, and if so it can skip querying the database.  (Note: I’m not sure if this is exactly how nHibernate works under the covers, but this is how I conceptualize it).  What we can do is give nHibernate a few queries, and tell it to use them to pre-populate its internal cache.  nHibernate is even smart enough to execute all these queries within a single round-trip to the database.  Now whenever we find ourselves wanting to use multiple FetchMany calls, we can just break that down into multiple queries that nHibernate will use to populate its cache.  Here’s the previous example re-written to actually work:

ToFuture GetAllInvoices

In this case I’m executing the first query which will retrieve all Office objects (related to Customers that are linked to our Invoices), and it will eager load each Office’s Contacts collection.  Then I do a separate query to retrieve all Invoices, their related Customer object, and each Customer’s Offices collection.  Both of these SQL queries will be executed as part of a single round-trip (assuming your DB supports that; SQL Server does).  The Contacts will already be present in nHibernate’s cache, so no lazy-loading is required to access them.

 

If you have a significant portion of the object-graph that you want to eager load, the “fetch code” can get a little complex.  The silver lining is that even if you get it wrong you’re not going to break anything; things will just be inefficiently lazy loaded when they should have been eager loaded, and your application behavior should still be correct, just slow (so long as you obey the single FetchMany per query rule).

 

To finish this post off, here’s what code might look like if you wanted to eager load the entire object graph from the first graphic (note: this code is not tested):

Complete GetAllInvoices


Sunday, February 24, 2013 #

Just sitting in the Seattle airport finally returning home from my first MVP Summit (well in truth I’m flying directly to my next client, no home till next weekend).

As I said this was my first time attending an MVP Summit, so I didn’t know exactly what to expect.  It turned out to be an incredible week, and gives me a new appreciation for the term “drinking from a firehose”.  I’m told that your experience can be very different depending on what Product Group you are associated with.  I’m lucky to be with the Visual Studio ALM group which I’m told is one of the most involved and open of them all.

The week was split up like so:

Mon-Wed – Scheduled sessions/presentations/discussion with the Product Group.

Thu – MVP-2-MVP Day

Fri – Office Hours with the ALM Product Group

And lots and lots of parties in between!

 

The Product Group sessions were 3 jam-packed days where first Brian Harry, then various feature teams got up in front of the room and filled us in on the vision of the product(s) going forward.  I think everything presented/discussed was all new non-public information about feature sets coming in vNext and even vNext+1.  Most of the features being discussed were so early that there is no working code to demo, the discussions revolved around powerpoint slides and storyboards (and sometimes we were discussing features so far in the future that storyboards don’t even exist yet).

These weren’t your typical conference sessions though.  There was lots of interaction, probably half the sessions took the format where the product group just put up a topic for discussion and let the audience drive the discussion around what we felt was needed to solve whatever problem was under discussion (or if the problem even existed in the first place).  We did a bunch of live polls, where various teams would give us a bunch of potential features and get us to rank which ones were most important to us.  And just in general, the audience was very actively involved throughout every day (I swear there were some sessions where the audience did more talking than the presenter(s)).

 

On Thursday we did the MVP2MVP day organized by Neno Loje.  This was a day with back to back 20 min sessions from 9am – 5pm (with no breaks!).  Whoever thought this up originally, kudos to them.  The great thing about this, is it gives the ALM MVP’s a chance to present various topics of interest to other ALM MVP’s.  Unlike conference sessions, where you can’t assume deep knowledge, these sessions can cut away all the fluff because everybody in the audience is already an expert, so you get to focus on just the interesting stuff.  A lot of these sessions were ALM/TFS related projects that various MVP’s are working on.  Some examples off the top of my head:

Friday was scheduled basically over the course of the week thanks to Chuck.  He rounded up the various product owners that we MVP’s said we’d like to sit down and chat with.  These were informal sessions, no powerpoints or demos.  Just frank discussion and the opportunity for us to ask questions or give feedback.

 

And of course there is a massive amount of “networking” (read: partying) that goes on:

Sunday was the get-together of all the Canadian MVP’s.  Us Canadians know how to “network”!

IMG_0284

Monday was the MVP Welcome Party at the Hyatt, followed by a minor house party at Ted Neward’s house with 120 of his closest friends (thanks Ted!).  Ted piled up a mountain of tech books he no longer wanted at the door for anybody to take:

IMG_0310

 

Tuesday had no official events planned, so the Imaginet Crew decided to head out for a quiet night, with some Indoor Sky Diving:

 

Wednesday night was the official MVP Party, where MS rented out the entire Seahawks stadium for the night:

IMG_0338

 

And of course this included Karaoke by Darcy:

and by James Chambers:

IMG_0354

 

And to carry-on a tradition started last year, we ended the night with some Lobster:

IMG_0385

 

I decided to end my weekend with a trip to Crystal Mountain for some snowboarding, which turned out to be a great choice as there was a big dump of snow on Friday night:

IMG_0425


Sunday, February 10, 2013 #

There’s been some chatter lately about an old debate between Feature Branches vs Feature Toggles.  I used to be firmly in the Feature Branches camp, but about a year ago (at the ALM Summit) I became convinced that Feature Toggles are a better choice in a lot of cases.

Feature branches are fairly common.  It is the practice of creating separate branches for each major feature, or perhaps choosing a group of features for each “feature branch”.

Feature Branching

Teams usually adopt the practice of feature branching because of the increased flexibility it gives them for doing releases (often fixed date releases vs fixed scope releases).  By creating say 5 feature branches for the 5 major features the team is working on, if only 3 of them are ready for release when the target release date rolls around, then only those 3 feature branches get merged into MAIN, and the other 2 feature branches continue development and will be merged into MAIN when complete for inclusion in some future release.

 

 

A team I used to work with would not have fixed date releases, but would let the completed features sort of “pile up” until the powers that be decided there were enough completed features to justify a release (or until some really valuable feature was completed perhaps).  Because we were using feature branches, we had the flexibility to release at any point in time, since the half-finished features were isolated off in their own branches.

 

 

The problem with feature branches is there can be a lot of pain related to merges.  Ideally, the developers will frequently merge from MAIN->FeatureBranch (maybe they do this every day if there are any changes to merge in), this way whenever a developer/team completes their feature and merges it into MAIN, all other feature branches will get those changes ASAP.  However, the whole idea behind feature branches, is that the code for each feature is isolated in its own branch until it’s ready to be released, at which point it is merged into MAIN.  This results in one big changeset in MAIN that represents all the code for that feature (possibly weeks of work).  As soon as that feature branch is merged in, all other feature branches will have to merge that potentially massive changeset into their feature branches, resulting in a lot of potential merge pain.

 

 

Now, most of the people in the Pro Feature Branch camp will tell you that you deal with this by keeping your feature branches small, ideally a day or two.  But in reality that is going to be difficult for most teams.  While they might like to do that, they are a long way from being mature enough to do that.  And I don’t buy that it’s entirely a team maturity issue; some features are simply large, and the teams don’t want to have to break them down into tiny releasable features.  Take for example the recent Git changes the TFS team implemented.  They have been working on that feature for many months.  Even if they could break it down into many 1-2 day features, I doubt that would have been desirable for them.

 

 

Not only that, but there is a subtle backwards incentive system at play when it comes to feature branches.  If I make my feature branch really large and long-lived, it’s not *me* that feels the pain.  It’s everybody else when I finally merge it to MAIN.

 

 

So what ends up happening is that feature branches are a mechanism that teams use to explicitly defer integrating/merging their changes together (on purpose!).  This is completely opposite to the practice of *continuous* integration that most of us would probably say is desirable (even if some of us don’t really understand what it truly means).  Fowler wrote about this in detail some time ago.  By isolating various teams’/developers’ changes off in feature branches, we are explicitly deferring integration.

 

 

So what’s the other option?  We said at the start that teams use feature branches to achieve flexibility around releasing.  If we just go back to having everybody working in one branch, sure we will have achieved continuous integration, but we’re back with the release problems that we tried to solve with Feature Branches.

 

 

This is where Feature Toggles come into play.  What if we did all work in the same branch, but used some mechanism to turn features on/off (say via flags in a config file)?  Now when I first heard this suggested, I instinctively told the guy he was a goofball: it sounds like this will result in a giant spaghetti mess of code, with if statements surrounding all kinds of crap, leading to a maintenance nightmare.  However, this doesn’t need to be the case.  For one, the “toggle” is only in place while the feature is under development.  Once the feature is complete all toggle code and any config file entries are removed (in fact, when I break down the User Story into tasks, I’ll make sure the last task on the list is “Remove Toggle”).  It does require some conscious thinking about how to develop the feature so that it can be hidden from the user and turned on/off.  In my experience, this turns out to be easier than you may think.  The vast majority of features can be hidden behind a feature flag if you give it a little thought; and for those few that can’t, you can still always create a feature branch if you wish.
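A config-driven toggle can be very small.  The following is a minimal sketch in Python, assuming a JSON config with a `features` section; the flag name `new_invoice_grid` and the helper functions are hypothetical, not taken from any particular toggle library.

```python
import json

# Hypothetical config file contents; the flag name is invented for this sketch.
CONFIG_TEXT = '{"features": {"new_invoice_grid": false}}'


def load_toggles(config_text):
    """Parse the toggle section of the config into a plain dict of booleans."""
    return json.loads(config_text).get("features", {})


def is_enabled(toggles, name):
    """Unknown or missing toggles count as off, so unfinished work stays hidden."""
    return bool(toggles.get(name, False))


def render_invoice_page(toggles):
    # Keep the check at a single entry point rather than scattering it around.
    if is_enabled(toggles, "new_invoice_grid"):
        return "new grid (still under development)"
    return "existing grid"


toggles = load_toggles(CONFIG_TEXT)
print(render_invoice_page(toggles))
```

Keeping the check at one entry point per feature is what makes the final “Remove Toggle” task cheap: delete the flag, the `if`, and the old code path.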

 

 

Some people worry about the implications of including code for unfinished features in released software (even if hidden behind a toggle).  I don’t think this is as big of a deal as some people think.  My release process will typically involve creating a Release Candidate branch, and performing pre-release testing on that build of the software to ensure it meets the release quality standards.  If there are any ill-effects resulting from including unfinished features behind a toggle in these Release Candidates they should hopefully be discovered during this testing phase.

 

 

This gives you the benefits of *true* continuous integration.  If somebody writes some code that breaks another developer’s code, the guy that wrote the breaking code will find out immediately, and he will be the one responsible for fixing it.  Contrast that with feature branches, where if I write some code that breaks another developer’s feature-under-development, I don’t know because the other dev’s code is off in its own branch.  I merge my code to MAIN, the other dev merges my code into his branch, and his branch becomes broken, only now it’s his job to fix it, not mine, even though I’m the one that wrote the problem code.

 

 

There are some other benefits that you get along with feature toggles.  For one, you have a super-easy rollback plan if deploying a new feature goes south: simply turn off the feature toggle (assuming you leave the toggle in place until after the first release).  You can also do phased roll-outs: how about turning on major new features for just a subset of your users, to get some feedback before turning them on for everybody?  The TFS team at Microsoft uses feature toggles extensively.  They turn on a lot of features in advance (for the TF Service cloud offering) for MVP’s to try out and provide feedback on.
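One common way to do a phased roll-out is to hash each user into a stable bucket and enable the feature for buckets below a chosen percentage, with an explicit allow-list for early-access users.  The Python sketch below illustrates that general idea only; it is not a description of how the TFS team does it, and the function and feature names are hypothetical.

```python
import hashlib


def rollout_bucket(feature, user_id):
    """Map a (feature, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled_for(feature, user_id, percent, early_access=frozenset()):
    """On for early-access users, plus the given percentage of everyone else."""
    if user_id in early_access:
        return True
    return rollout_bucket(feature, user_id) < percent


users = [f"user{i}" for i in range(1000)]
enabled = [u for u in users if is_enabled_for("new_feature", u, 10)]
print(len(enabled), "of", len(users), "users see the feature")
```

Because the bucket depends only on the feature and user names, a given user gets a consistent experience between requests.  Setting the percentage to 0 is the rollback plan, and raising it to 100 turns the feature on for everybody.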

 

 

Conclusion

To summarize, teams usually adopt feature branches to provide release flexibility.  But they have the undesired effect of deferring integration, and causing lots of merge pain.  By using feature toggles you can have continuous integration, while still retaining the release flexibility of feature branches.  Some care must be taken to implement features in a way to support toggles, and you must be disciplined about removing toggles once features are complete.  You get the added benefits of being able to easily do phased roll-outs on a feature by feature basis too!


Sunday, January 27, 2013 #

For probably over a year now I’ve been hearing lots of hype around package managers and NuGet in particular.  I’ve never really “got it” – that is until last week.  So what, NuGet will download the nHibernate assemblies for me.  I can do that myself easily enough, why on earth do I need a specialized tool to do that for me?!  But it will download not only nHibernate, but all of nHibernate’s dependencies too!  Big deal, that’s never been an issue for me before, usually these 3rd party packages come with all the necessary dependencies (nHibernate includes ANTLR, Moq includes Castle, etc).  I challenged a couple people I respect to convince me that I need NuGet, and despite their best efforts I was never convinced.

Last week I was working with a client who has multiple teams, working on multiple components/products in parallel, but they all get released at once as part of one big release.  All these teams are writing code with dependencies on other teams’ components.  Each component is evolving independently of the others, but eventually they all need to come up with a final version that works with all the other final versions that will make up the release.  I like to compare it to the TFS team at Microsoft.  When they were developing TFS 2012, they wanted it to work with SQL 2012, Visual Studio 2012, and .NET 4.5, none of which actually existed at the time TFS 2012 dev was underway (they were all under development also).  Trying to develop against a dependency that is also a moving target presents a number of problems, and it turns out NuGet can be a tremendous help here.

The problem is that there are many projects (let’s call them packages) that have interdependencies between them, and are potentially developed by different teams and on different release cycles. The challenge is that we want to ship an updated product that contains updated versions of many of the packages, and we need to be confident that they all work well together.

The common approach to handling this is to treat each package as a separate project, and to treat any dependency on another project as an external dependency (similar to 3rd party dependencies like Log4Net). This is usually handled by having a lib folder within your source tree, and checking in the binaries for the external dependencies. This allows each project team to make an explicit decision about which version of their external dependency they are going to develop against, and to choose if and when to update to the latest version, which reduces disruption to their development cycle.

Another important practice: if a team is developing a package which other teams depend upon, they often want to control which versions of their package are available for other teams to consume. They don’t want every check-in to produce a build that other teams can potentially consume, because often these builds will have half-finished features in them. Typically a team will want to set a higher quality bar for the builds they share with the rest of the world. This is usually handled by using a DEV branch and a MAIN branch. The quality standard for code to get into the MAIN branch is higher (no half-finished features), and every MAIN build is available for other teams to consume. Typically a team will update MAIN at the end of each Sprint.

There is another, more subtle problem that becomes more significant as the number of packages grows and the dependency graph between them gets more complex. Let’s look at an example:

[Figure: Dependency Example 1. A depends on B and C; B and C each depend on D.]

Imagine that all 4 packages are at 1.0 to start with. We belong to the dev team for A. We are working towards shipping 2.0 of our product, which will include updated versions of all 4 packages, but there are 4 teams each working on a different package.

Our source tree for A contains a lib folder with sub-folders for B and C. And since B and C each depend on D we can either have a separate sub-folder for D and reference it directly from A, or we can include copies of the D binaries in the lib sub-folders for both B and C. Let’s assume that we do the latter, and both the B and the C sub-folders contain the binaries for D.

Team D finishes its work and makes 2.0 available in its MAIN branch. Team B immediately updates their project to pull in the D-2.0 binaries and updates their code to work with it (B-2.0). However, Team C has not yet pulled in the updated D-2.0 binaries (and possibly won’t for a while still). As Team A developers we want to pull in the updated B package so we can recompile against it and ensure our A code still works properly. However, we also depend on C, which is still 1.0 and depends on D-1.0. So which version of the D binaries should we use? We can’t have both D-1.0 and D-2.0. If we have D-1.0 it will potentially break B; if we have D-2.0 it will potentially break C. What are we to do? It’s versioning hell!

The way teams deal with this is to use versioning policies (either explicitly or implicitly). Rather than saying B depends on D-2.0 and C depends on D-1.0, what you need are versioning policies that say B depends on D-2.0 *and up*, and C depends on D-1.0 and up. This way the above scenario can be made to work by using D-2.0. In order to implement this “versioning policy” there are a few options:

1. Update the B and C csproj files so they don’t demand a specific version and don’t reference a strong-name for the D assembly. Then include the D-2.0 binaries in your A source tree.

2. Leave the csproj files alone, and instead introduce a “binding redirect” that indicates that anything that references D-1.0 should instead use D-2.0. This can be configured in the app.config/web.config.

Using a package manager such as NuGet can automate a lot of this work for you. Instead of having to manually walk the dependency graph to figure out which version of D binaries you require (transitive dependencies), NuGet will do this for you and automatically download all the necessary binaries for both direct and transitive dependencies according to the various versioning policies specified.  Then you simply check them in to your lib folder (usually called packages when using NuGet instead of lib).  NuGet not only walks the dependency graph to figure out the appropriate versions of all binaries you require, but it will create the appropriate binding redirects for you in your config file also.
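
To illustrate the kind of graph walk involved, here is a minimal sketch in Python (chosen only for brevity). It is not NuGet's actual algorithm: it simply picks the lowest version of each package that satisfies every "X and up" policy it encounters. The PACKAGES table and the resolve function are hypothetical, and the sketch assumes every minimum version named in a policy has actually been published.

```python
# Hypothetical package metadata: (name, version) -> {dependency: minimum version}.
PACKAGES = {
    ("A", (2, 0)): {"B": (2, 0), "C": (1, 0)},
    ("B", (2, 0)): {"D": (2, 0)},   # B needs D-2.0 and up
    ("C", (1, 0)): {"D": (1, 0)},   # C needs D-1.0 and up
    ("D", (1, 0)): {},
    ("D", (2, 0)): {},
}


def resolve(root, version, packages=PACKAGES):
    """Walk the dependency graph from `root` and choose, for every package,
    the lowest version that satisfies all the minimum-version policies seen."""
    chosen = {root: version}
    pending = [(root, version)]
    while pending:
        name, ver = pending.pop()
        for dep, minimum in packages[(name, ver)].items():
            # Raise the chosen version whenever a stricter policy turns up.
            if dep not in chosen or minimum > chosen[dep]:
                chosen[dep] = minimum
                pending.append((dep, minimum))  # walk its dependencies too
    return chosen
```

For the example above, resolve("A", (2, 0)) settles on D-2.0, because that is the lowest version that satisfies both B's and C's policies.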

All you need to do is create some automated builds on the various MAIN branches that will publish each package to your own private NuGet server.  Now when a team wishes to update their dependencies they simply update the versioning policy to specify a newer version number, then let NuGet do the necessary work, and check in any updated binaries.

Note: Everything above assumes that there are no cycles in your dependency graph. If there are, you’re basically screwed.  Refactor to eliminate cycles immediately.
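
If you are not sure whether your dependency graph contains a cycle, a depth-first search will find one. Here is a minimal sketch in Python; the find_cycle function and the shape of the graph dictionary are assumptions made for illustration, not part of any NuGet tooling.

```python
def find_cycle(graph):
    """Return one dependency cycle as a list of package names, or None.

    `graph` maps each package name to the packages it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    state = {node: WHITE for node in graph}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for dep in graph.get(node, ()):
            if state.get(dep, WHITE) == GREY:
                # `dep` is already on the current path, so we have a cycle.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in graph:
        if state[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

The diamond from the example (A depends on B and C, which both depend on D) has no cycle, so find_cycle returns None for it.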


Sunday, January 13, 2013 #

One of the most awkward things to deal with in code is Temporal Coupling.  It leads to messy, fragile code that is difficult to maintain.  What do I mean by Temporal Coupling?  It’s when you consume a class/component in your code and it requires you to do certain actions in a specific order.  Let’s look at a couple of examples where I commonly see this:

  1. When using a File class, you must first call Open(), then perform one or more actions with the File class (i.e. reading/writing), then you must remember to call Close().
  2. Custom data access classes that return DataReader objects.  Because a DataReader requires an active connection to the database, it becomes the caller’s responsibility to close that connection once it’s done with the DataReader.

When I’m faced with code that has temporal coupling – either code I wrote, or external code that I depend on – I almost always use lambdas to isolate and encapsulate this temporal coupling.  The easiest way to explain this pattern is to show a few examples.  So let’s go through some code examples of the 2 scenarios listed above.

First an example of what the code might look like with the temporal coupling in place:

image

For the purposes of this sample let’s assume that FileSample is a class provided by an external library (maybe the .NET Framework, or maybe a 3rd party library).  In this example we have a method that has the FileSample instance injected into it (presumably already set up to point at the correct file).  But the FileSample class has a temporal coupling problem, forcing the consumer to ensure they call Open first, do their work, then call Close to ensure that the resources are properly cleaned up.

To improve this by using lambdas I want to isolate the temporal coupling into one place, so that all consumers of the FileSample class no longer need to worry about it.  If FileSample was code that I controlled, I could do this directly inside the FileSample class.  In this example I’ll assume that FileSample comes from an external library, and I’ll create a FileWrapper class that encapsulates the temporal coupling issues.

image

image
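
As a rough illustration of the wrapper pattern described above, here is a minimal sketch in Python (chosen only for brevity). The FileSample stand-in, its method names, and the use method on FileWrapper are assumptions made for this sketch.

```python
class FileSample:
    """Stand-in for the external class: callers must open, use, then close."""

    def __init__(self, lines):
        self._lines = list(lines)
        self.is_open = False

    def open(self):
        self.is_open = True

    def read_line(self):
        if not self.is_open:
            raise RuntimeError("open() must be called before read_line()")
        return self._lines.pop(0) if self._lines else None

    def close(self):
        self.is_open = False


class FileWrapper:
    """Owns the open/close ordering so that consumers never see it."""

    def __init__(self, file_sample):
        self._file = file_sample

    def use(self, action):
        self._file.open()
        try:
            # The caller's lambda runs while the file is guaranteed open.
            return action(self._file)
        finally:
            # Always runs, even if the lambda raises.
            self._file.close()


sample = FileSample(["first line", "second line"])
first = FileWrapper(sample).use(lambda f: f.read_line())
```

The consumer only supplies the work to be done as a lambda; the open-then-close ordering lives in exactly one place.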

Let’s look at one more example:

image

Here we have a DataAccessLayer (that we have the code for) that returns a DataReader object.  The problem is that DataReaders require an open DB connection, so the caller of GetDataReader must always remember to close the DB connection when they are finished.

In this case we control the code for DataAccessLayer, so instead of creating a wrapper we can add a new method directly to the class that doesn’t expose temporal coupling issues.

image

And finally the new method inside the DataAccessLayer class.

image
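
The same idea applied to data access can be sketched as follows, again in Python and using the standard sqlite3 module as a stand-in database. The read method, the connect_demo helper, and the players table are all invented for this sketch.

```python
import sqlite3


class DataAccessLayer:
    def __init__(self, connect):
        self._connect = connect   # callable that returns a new DB connection

    def read(self, query, action):
        """Open a connection, hand a cursor to `action`, then always clean up."""
        connection = self._connect()
        try:
            cursor = connection.execute(query)
            return action(cursor)
        finally:
            connection.close()    # the caller never has to remember this


def connect_demo():
    # Hypothetical in-memory database seeded with two rows for the demo.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE players (name TEXT)")
    connection.executemany("INSERT INTO players VALUES (?)", [("Ann",), ("Bob",)])
    return connection


dal = DataAccessLayer(connect_demo)
names = dal.read(
    "SELECT name FROM players ORDER BY name",
    lambda reader: [row[0] for row in reader],
)
```

One thing to keep in mind with this design: the lambda has to finish reading before it returns, because the connection is closed as soon as it does.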

Next time you find yourself having to deal with classes/components that expose temporal coupling issues, consider using lambdas to isolate and encapsulate that temporal coupling.


Sunday, December 30, 2012 #

One of my favorite and most underused features introduced in VS 2010 was Layer Diagrams.  It’s a really simple tool to learn and use, but amazingly powerful.

It’s a diagramming tool that allows you to draw a diagram consisting of boxes and arrows, where the boxes are meant to represent your layers/components, and the arrows represent dependencies.  If you’ve ever been asked to whiteboard out the architecture/layers of your application, you probably got up to the whiteboard and drew some diagram with maybe a dozen boxes and some arrows.  This is essentially what the layer diagram you create should look like.  The real power comes from what you do after this.  You can then drop actual code artifacts (assemblies, namespaces, classes, etc.) onto the boxes, and Visual Studio will automatically validate the diagram by inspecting your code and comparing the actual dependencies in the code to the ones in the layer diagram.  If your code takes a dependency that is not represented in the layer diagram (e.g. the UI talks directly to the database instead of going through the middle-tier) Visual Studio will throw a validation error.  You can easily include this in your TFS Build to fail the build when this happens.  Let’s take a look at an actual example from my on-going hobby project.

I’m using a CQRS & Event Sourcing style architecture for this system.  If I were to sketch it on a whiteboard (or in this case paper) it might look something like this:

Sketch

To create a Visual Studio Layer Diagram you create a new Modeling Project, then add a Layer Diagram to it (Note: I believe this is a Visual Studio Ultimate only feature):

New Project

New Item

Then you simply pull up the toolbox and start drawing boxes and arrows (aka Layers and Dependencies)

Toolbox

 

This is the Layer Diagram I’ve been using that roughly matches the sketch at the start of this post:

Layer Diagram

You can choose any color you wish for the boxes.  In this case I chose red to represent things that map to assemblies, yellow to represent databases (this is for illustration purposes only, there’s no code artifact you can drop in the yellow database boxes that you can validate dependencies against), and blue to represent the smaller pieces that may live inside of assemblies.

The last step is to associate code artifacts with the layers in the diagram.  In the example above, notice the little numbers in the upper right hand corner of each layer; they indicate how many code artifacts have been associated with each layer (you can view the details in the Layer Explorer window).  To associate code artifacts with the layers you simply drag-drop them from either Solution Explorer or Architecture Explorer onto the layers.

You can try to validate the diagram by right-clicking on the diagram canvas and choosing Validate Architecture.  If you get validation errors they might look something like this:

Validation Errors

In the above case I purposefully broke the diagram by deleting the dependency (arrow) between the Domain Layer and Events Shared Components.  The selected error is telling me that the Game.AddPlayer method depends on the BaseEvent.AggregateId property, but that dependency is not represented in my diagram.  The 2nd line of the errors points out specifically which layers are involved, the Domain Layer.Domain (which is the sub-layer – blue box – called Domain within the bigger red box Domain Layer) and the Event Shared Components.Events.

There is also a convenient feature where you can simply draw your layers, associate them with code artifacts, then right-click and select Generate Dependencies, and it will inspect the code and automatically draw the arrows/dependencies onto the diagram for you so that it matches the actual dependency structure of the code.

I verified that this does in fact validate and then updated my TFS Build to automatically validate on every build by adding an MSBuild argument (the documented property for this is /p:ValidateArchitecture=true) in the Build Definition:

Build Definition

 

For the above Layer Diagram some of the “rules” that it enforces could be stated like so:

  • UI shouldn't know about Events
  • Query Side shouldn't know about Commands
  • Domain shouldn't know about persistence (i.e. Event Repository)
  • Event Repository shouldn't know about the Domain
  • Events and Commands should be standalone classes with the only dependency on the Common assembly
  • Common assembly should not depend on anything else

If any of these invariants are ever broken we’ll be made aware immediately by the broken build, and we can decide whether to update the Layer Diagram to represent the new architecture, or we can fix our changes to properly adhere to the chosen architectural rules.
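
Visual Studio does this check by inspecting the compiled code, but the underlying idea is simple enough to sketch. The following Python snippet (Python chosen only for brevity) is an illustration of the concept, not how the tool is implemented. The ALLOWED table is a hypothetical reading of the rules listed above, and the actual dependencies are supplied by hand rather than discovered from code.

```python
# Hypothetical layer rules: layer -> layers it is allowed to depend on.
ALLOWED = {
    "UI": {"Commands", "Query Side", "Common"},
    "Query Side": {"Events", "Common"},
    "Domain": {"Events", "Commands", "Common"},
    "Event Repository": {"Events", "Common"},
    "Events": {"Common"},
    "Commands": {"Common"},
    "Common": set(),
}


def validate(actual, allowed=ALLOWED):
    """Return the (source, target) dependencies that exist in the code
    but are not represented by an arrow in the diagram."""
    return sorted(
        (source, target)
        for source, targets in actual.items()
        for target in targets
        if target != source and target not in allowed.get(source, set())
    )
```

A build step could then fail whenever validate returns a non-empty list, which is the same decision point described above: either fix the code or update the diagram.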

 


Sunday, December 16, 2012 #

I’ve been working on a hobby project for the past little while, and I wanted to blog my way through it as I go.  My intent is to use a lot of new technologies, agile best practices, and some trendy architecture patterns, and use the project as an example of how these various technologies and practices can be used to good effect.

The project itself is creating an application to allow my group of friends to manage our weekly Poker League.  I’ve mentioned this before on my blog many years ago; it’s a project I’ve always wanted to do but never was able to find the time. My group of friends have held a weekly poker game for something like 8 years running now.  We play once a week and one of my buddies keeps track of all the stats and publishes them on a website: St James Poker League.  The goal is that this application would replace that website.  It would also solve a couple of real problems that we have today:

  1. During the game we record the results by somebody writing them down on whatever scrap of paper is lying around (this is often a paper towel, or a random used envelope).  Then we hope that it makes it to our stat keeper in one piece, and doesn’t get lost between the time of the game and whenever the results are entered into the website.
  2. I’m not sure exactly how our stat keeper updates the website, but I think it’s some combination of Excel and manually editing HTML files.  And he can be pretty tardy in updating the site, sometimes a couple of months go by before we get updated stats (although when he wins the stats seem to get updated pretty quick).

In theory this application should solve both these problems by allowing any of our regulars to enter the game results live as the game is happening, or just as it finishes (there’s always a computer lying around somewhere).  This eliminates the paper-towel results tracking, and gives us instant stat updates.

But let’s get to the part that might have some interest for my readers.  I’m sure you probably don’t care all that much how I track my weekly poker stats.  I plan to implement this project using a lot of the agile practices, ALM tooling, and architecture patterns that I commonly recommend to my clients.  And I plan to blog about the interesting bits as I go along.  Specifically some of the things I plan to use and possibly blog about include:

  • Using the new Agile Backlog Management tooling in TFS 2012 to build up a project backlog, plan our sprints, and track progress
  • Using the Story Mapping technique to help organize, visualize, and plan the backlog (this is an awesomely powerful technique that doesn’t get enough attention, and has poor tooling support today)
  • Writing Effective Automated Tests – specifically I plan to show some examples of code I “cheated” on and did Test-After and ended up with some very ugly tests.  Then I refactored the code to be more test-friendly.
  • Using CodedUI test framework in VS 2012
  • Setting up a TFS Build to help drive quality into your code
  • Using Visual Studio Layer Diagrams to manage dependencies
  • Choosing appropriate aggregate boundaries when using DDD in your domain
  • How to use CQRS pattern (and Event Sourcing) while still keeping things simple
  • Taking a typical Web Services app and deploying it to Azure (this is my first time doing this – exciting!)
  • Using TFS Lab Management to automate your build-deploy-test process
  • Using Microsoft Test Manager (MTM 2012) to manage your manual test process
  • Publishing an app to the Windows Store
  • Rewriting a WPF App into Metro

What I’m hoping will be the cool part of this series is that I’ll be able to discuss all of these topics in the context of an actual project and application under development (and I’ll make all the source available).

Stay Tuned!


Sunday, December 9, 2012 #

The Winnipeg .NET User Group hosted a VS 2012 Launch Event at the IMAX in Winnipeg on Thursday, Dec 6.  Doing presentations on the giant IMAX screen is always fun, and I did the first 2 sessions on:

  • End-To-End Application Lifecycle Management with TFS 2012
  • Improving Developer Productivity with Visual Studio 2012

Thanks to everybody that came out, and if anybody is interested my slide decks can be downloaded here:

Also the Virtual Machine that I used to do my demos can be downloaded from Brian Keller’s blog here:

 


Thursday, November 29, 2012 #

I’ve been an Agile Coach at a lot of different clients over the years, and I want to share an approach I use to help them adopt and mature over time.

It’s important to realize that “Agile” is not a black/white yes/no thing. Teams can be varying degrees of agile. I think of this as their agile maturity level. When I coach teams I want them to start out being a little agile, and get more agile as they mature. The approach I teach them is to use the definition of done as a technique to continuously improve their agile maturity over time.

We’re probably all familiar with the concept of “Done Done”, which represents what *actually* being done with a feature means. Not just when a developer says he’s done right after he writes that last line of code that makes the feature kind-of work. Done Done means the coding is done, it’s been tested, installers and deployment packages have been created, user manuals have been updated, architecture docs have been updated, etc. To enable teams to internalize the concept of “Done Done”, they usually get together and come up with their Definition of Done (DoD) that defines all the activities that need to be completed before a feature is considered Done Done.

The Done Done technique typically is applied only to features (aka User Stories). What I do is extend this to apply to several concepts such as User Stories, Sprints, Releases (and sometimes Check-Ins). During project kick-off I’ll usually sit down with the team and go through an exercise of creating DoDs for each of these concepts (Stories/Sprints/Releases). We’ll usually start by just brainstorming a bunch of activities that could end up in these various DoDs. Here are some examples:

  • Code Reviews
  • StyleCop
  • FxCop
  • User Manuals Updated
  • Architecture Docs Updated
  • Tested by QA
  • Tested by UAT
  • Installers Created
  • Support Knowledge Base Updated
  • Deployment Instructions (for Ops) written
  • Automated Unit Tests Run
  • Automated Integration Tests Run

Then we start by arranging these activities into the place they occur today (e.g. Do you do UAT testing only once per release? every sprint? every feature?). If the team was previously Waterfall most of these activities probably end up in the Release DoD. An extremely mature agile team would probably have most of these activities in the DoD for the User Stories (because an extremely mature agile team will probably do continuous deployment and release every story). So what we need to do as a team, is work to move these activities from their current home (Release DoD) down into the Sprint DoD and eventually into the User Story DoD (and maybe into the lower-level Check-In DoD if we decide to use that).

We don’t have to move them all down to User Story immediately, but as a team we figure out what we think we’re capable of moving down to the Sprint cycle and Story cycle immediately, and that becomes our starting DoDs. Over time the team makes an effort to continue moving activities down from Release->Sprint->Story as they become more agile and more mature. I try to encourage them to envision a world in which they deploy to production as each User Story is completed. They would need to be updating User Manuals, creating installers, doing UAT testing (typical Release cycle activities) on every single User Story. They may never actually reach that point, but they should envision that, and strive to keep driving the activities down closer to the User Story cycle as they mature.

This is a great technique to give a team an easy-to-follow roadmap to mature their agile practices over time. Sure, there are other aspects to maturity outside of this, but it’s a great technique, that’s easy to visualize, to drive agility into the team. Just keep moving those activities (aka “gates”) down the board from Release->Sprint->Story.

I’ll try to give an example of what a recent client of mine had for their DoD’s (this is from memory, so probably not 100% accurate):

Release

  • Create/Update deployment Instructions For Ops
  • Instructional Videos Updated
  • Run manual regression test suite
  • UAT Testing
    • In this case that meant deploying to an environment shared across the enterprise that mirrored production and asking other business groups to test their own apps to ensure we didn’t break anything outside our system

Sprint

  • Deploy to UAT Environment
    • But not necessarily actually request UAT testing occur
  • User Guides updated
  • Sprint Features Video Created
    • In this case we decided to create a video each sprint showing off the progress (video version of Sprint Demo)

User Story

  • Manual Test scripts developed and run
  • Tested by BA
  • Deployed in shared QA environment
    • Using automated deployment process
  • Peer Code Review

Code Check-In

  • Compiled (warning-free)
  • Passes StyleCop
  • Passes FxCop
  • Create installer packages
  • Run Automated Tests
  • Run Automated Integration Tests

PS – One of my clients had a great question when we went through this activity. They said that if a Sprint is by definition done when the end-date rolls around (time-boxed), isn’t a DoD on a sprint meaningless – it’s done on the end-date regardless of whether those other activities are complete or not? My answer is that while that statement is true – the sprint is done regardless when the end date rolls around – if the DoD activities haven’t been completed I would consider the Sprint a failure (similar to not completing what was committed/planned – failure may be too strong a word but you get the idea). In the Retrospective that will become an agenda item to discuss and understand why we weren’t able to complete the activities we agreed would need to be completed each Sprint.


Tuesday, November 6, 2012 #

I’ve been meaning to blog about a great experience I had earlier in the year at Prairie Dev Con Calgary.  Steve Rogalsky and I did a session that we called “Agilist, Heal Thyself!”.  We used a format that was new to me, but that Steve had seen used at another conference.  What we did was start by asking the audience to give us a list of challenges they had had when adopting agile.  We wrote them all down, then had everybody vote on the most interesting ones.  Then we split into two groups, and each group was assigned one of the agile challenges.  We had 20 minutes to discuss the challenge, and suggest solutions or approaches to improve things.  At the end of the 20 minutes, each of the groups gave a brief summary of their discussion and learnings, then we mixed up the groups and repeated with another 2 challenges.

The 2 groups I was part of had some really interesting discussions, and suggestions:


Unfinished Stories at the end of Sprints

The first agile challenge we tackled was something that every single Scrum team I have worked with has struggled with.  What happens when you get to the end of a Sprint, and there are some stories that are only partially completed?  The team in question was getting very demoralized as they felt that every Sprint was a failure, since they never had a set of fully completed stories. How do you avoid this? And what do you do when it happens?

There were 2 pieces of advice that were well received:

1. Try to bring stories to completion before starting new ones.  This is advice I give all my Scrum teams.  If you have a 3-week sprint, what happens all too often is you get to the end of week 2, and a lot of stories are almost done, but almost none are completely done.  This is a Bad Thing.  I encourage the teams I work with to only start a new story as a very last resort.  If you finish your task, look at the stories in progress and see if there’s anything you can do to help before moving onto a new story.  In the daily standup, put a focus on seeing what stories got completed yesterday.  If a few days go by with none getting completed, be sure this fact is visible to the team and do something about it.

Something I’ve been doing recently is introducing WIP (Work In Progress) limits while using Scrum.  My current team has 2-week sprints, and we usually have about a dozen or so stories in a sprint.  We instituted a WIP limit of 4 stories.  If 4 stories have been started but not finished then nobody is allowed to start new stories.  This made it obvious very quickly that our QA tasks were our bottleneck (we have 4 devs, but only 1.5 testers).  The WIP limit forced the developers to start to pick up QA tasks before moving onto the next dev tasks, and we ended our sprints with many more stories completely finished than we did before introducing WIP limits.


2. Rather than using time-boxed sprints, why not just do away with them altogether and go to a continuous flow type approach like Kanban?  Limit WIP to keep things under control, but don’t have a fixed time box at the end of which all tasks are supposed to be done.  This eliminates the problem almost entirely.  At some points in the project (releases) you need to be able to burn down all the half finished stories to get a stable release build, but this probably occurs less often than every sprint, and there are alternative approaches to achieve it using branching strategies rather than forcing your team to try to get to Zero WIP every 2 weeks (e.g. when you are ready for a release, create a new branch for any new stories, but finish all existing stories in the current branch and release it).
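
The WIP limit rule mentioned in both pieces of advice is mechanical enough to sketch in a few lines. This is a minimal illustration in Python; the Board class and its method names are invented for the sketch, and in practice the rule usually lives on a physical or TFS task board rather than in code.

```python
class Board:
    """Tracks stories in progress and refuses to exceed the WIP limit."""

    def __init__(self, wip_limit=4):
        self.wip_limit = wip_limit
        self.in_progress = set()
        self.done = set()

    def start(self, story):
        if len(self.in_progress) >= self.wip_limit:
            raise RuntimeError(
                f"WIP limit of {self.wip_limit} reached; help finish a story first"
            )
        self.in_progress.add(story)

    def finish(self, story):
        self.in_progress.remove(story)
        self.done.add(story)
```

With a limit of 4, a fifth call to start fails until somebody calls finish, which is exactly the nudge that pushed our developers towards picking up QA tasks.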


Trying to Introduce Agile into a team with previous Bad Agile Experiences

One of the agile adoption challenges came from somebody (let’s call him Dave) who was in a leadership role on a team he had recently joined.  This team was currently very waterfall in their ALM process, but they were about to start on a new green-field project.  Dave wanted to use this new project as an opportunity to do things the “right way”, using an Agile methodology like Scrum, adopting TDD, automated builds, proper branching strategies, etc.  The problem he was facing was that everybody else on the team had previously gone through an “Agile Adoption” that was a horrible failure.  Dave blamed this failure on the consultant brought in previously to lead this agile transition, but regardless of the reason, the team had very negative feelings towards agile, and was very resistant to trying it out again.  Dave possibly had the authority to try to force the team to adopt Agile practices, but we all know that doesn’t work very well.  What was Dave to do?

Ultimately, the best advice was to question *why* Dave wanted to adopt all these various practices.  Rather than trying to convince his team that these were the “right way” to run a dev project, and trying a Big Bang approach to introducing change, he would be better served by identifying problems the team currently faces, having a discussion with the team to get everybody to agree that those specific problems exist, then having an open discussion about ways to address them.  This way Dave could incrementally introduce agile practices, and he doesn’t even need to identify them as “agile” practices if he doesn’t want to.  For example, when we discussed this with Dave, he said probably the team’s biggest problem was long periods without feedback from users, then finding out too late that the software is not going to meet their needs.  Rather than Dave jumping right to introducing Scrum and all it entails, it would be easier to get buy-in from the team if he framed it as a discussion of existing problems and a brainstorm of possible solutions.  And possibly most importantly, don’t try to make massive changes all at once with a team that has not bought into those changes.  Taking an incremental approach has a greater chance of success.

I see something similar in my day job all the time too: clients who for one reason or another claim to not be fans of agile (or not ready for agile yet), but who then go on to ask me to help them get shorter feedback cycles, quicker delivery cycles, iterative development processes, etc.  It’s kind of funny at times; sometimes you just need to phrase the suggestions in the terms they are using and avoid the word “agile”.


PS – I haven’t blogged all that much over the past couple of years, but in an attempt to motivate myself, a few of us have accepted a blogger challenge.  There are 6 of us who have all put some money into a pool, and the agreement is that we each need to blog at least once every 2 weeks.  The first 2-week period that we miss, we’re eliminated.  Last person standing gets the money.  So expect at least one blog post every couple of weeks for the near future (I hope!).  And check out the blogs of the other 5 people in this blogger challenge:

Steve Rogalsky: http://winnipegagilist.blogspot.ca

Aaron Kowall: http://www.geekswithblogs.net/caffeinatedgeek

Tyler Doerksen: http://blog.tylerdoerksen.com

David Alpert: http://www.spinthemoose.com

Dave White: http://www.agileramblings.com (note: site not available yet.  should be shortly or he owes me some money!)