I have been meaning to blog about this topic for a while. It's an area that I suspect will come up in the future as a governance challenge for organizations using the cloud. Let's consider the problem from the pre-cloud days in an enterprise scenario.
The Challenge in the past
Imagine that I am writing a WCF service which runs on premise to integrate with a line of business application. The code uses .NET 4.0, is developed with Visual Studio 2010, and references a 3rd party library which we will say for argument's sake is log4net. I wrote this WCF service in December 2011 using version 1.2.10 of log4net, which was the current version at the time. I completed my development and deployment at the end of December, everything was successful, and since then the service has been happily running on a Windows Server 2008 R2 server. The organization has had no functional requirement to change that component since its first release. Other than the usual server patches, which are fairly low risk, we should be pretty comfortable not releasing a new version of that component (unless a functional change is required) until we start to get close to the support end dates for its key dependencies. These are listed below:
- Visual Studio 2010 mainstream support ends 14th July 2015
- Windows Server 2008 R2 mainstream support ends 13th January 2015
- .NET 4.0 mainstream support is in line with the operating system it's running on
Between this time and now there have been some updates to log4net with the following releases:
- V1.2.11 = February 16th 2012
- V1.2.12 = September 19th 2013
- V1.2.13 = November 23rd 2013
Although these updates to log4net have been released, from a risk management perspective the component is running on an unchanged platform, so there is currently no reason for me to consider changing it and I can be pretty confident it will keep working for a while yet.
What about the Cloud though?
Well, this brings me on to the thing I have been wondering about. At the same time as the WCF service I have just mentioned was developed, we were also developing some components which touch the cloud. I believe the things I am about to say are relevant to any component that interacts with or depends on stuff in the cloud, but in this article I will talk through a specific example from a past project.
In the example here in addition to the WCF service we also have another service which uses the Windows Azure Service Bus. This component is an on premise component which is a listener to the Windows Azure Service Bus Relay which will receive messages and forward them to other on premise WCF services using the WCF Routing Service capability. This component was developed and released back at the end of December 2011 also and used v1.6.0 of the Windows Azure Service Bus SDK which was the current one at that time.
In this particular project it is also the case that there has been no business functional reason to change this component either and this component is also sitting on an on premise Windows 2008 R2 server listening for messages and gets its usual server patches applied following the enterprise standard.
The big difference between these projects is that the on premise WCF service has had completely static dependencies since the end of December 2011, so we can be highly confident that it will just work. The Azure Service Bus listening component, on the other hand, depends on the Windows Azure Service Bus SDK and the Windows Azure Service Bus platform itself, both of which have changed quite a lot since the end of December 2011! Please bear in mind that I am only using Service Bus as the example; I think this applies to all cloud dependencies which can change outside of your control.
If we take a look at the change list for the Windows Azure Service Bus SDK during this time we have a list like the below:
- v1.6.0 = Dec 06th 2011
- v1.7.0 = Jun 07th 2012
- v1.8.0 = Oct 26th 2012
- v2.0.0 = April 30th 2013
- v2.0.1 = April 30th 2013
- v2.1.0 = May 22nd 2013
- v2.1.1 = July 31st 2013
- v2.1.2 = July 31st 2013
- v2.1.3 = Sept 11th 2013
- v2.1.4 = Oct 19th 2013
- v2.2.0 = Oct 22nd 2013
- v2.2.1 = Oct 23rd 2013
- v188.8.131.52 = Nov 6th 2013
In the 2 years since we released this component there have been 13 releases of the Windows Azure Service Bus SDK. The SDK has a direct dependency on the Azure Service Bus platform, whose rate of change I have no control over, so from a risk management perspective I am in a difficult position: I need to think about how to protect myself from changes which may cause me problems. Fortunately for us, and hats off to the Service Bus team, our component is still running v1.6.0 in production and it still works absolutely fine. However, I do know that if we were to upgrade to the newer versions we would need to change the WCF configuration, as some of the token provider configuration has changed a little. While that is specific to this example, you get the point: change does happen, and change introduces some risk even if it's small.
If you consider other projects where your usage of the cloud is much higher perhaps with more components touching or hosted in the cloud then your risk profile is going to be higher and more spread out.
Ok so what does this mean?
In the real world today most companies are already using the cloud or are seriously considering using it. In my opinion, based on the conversations I have had with people in the industry, this particular challenge is not something that people are really thinking about yet. Typically enterprise level organizations are slow to move and only change what needs to be changed (often because the business wants it changed, not because IT does), so that leaves an obvious conflict where you have one thing changing quite fast and something else that trails behind.
My gut feeling is that with cloud governance being immature at present things like this will result in cases where solutions could become broken because organizations aren't keeping their applications up to date with the platform, particularly when those applications are in maintenance mode.
This also leaves the challenge about wanting to deploy non-functional changes to components. In many organizations that I have seen the common response to a desire to add a non-functional change to a component that is not broken and has no business change required and has no key performance benefits is to request that a business case is put forward for the change. The organization often doesn't understand the dependencies between their systems and thinks if they change one small part they need to retest entire systems so a small change can become a huge retesting effort and large cost. That is often the reason why small technical changes aren't done in the enterprise as soon as they should be or are rolled up into a release driven by functional change requirements.
What can we do about this?
The first and most important thing is that if you are going to adopt the cloud then you need to accept this is a challenge you will face, and you may need to up your game to be able to deal with it. When it comes down to it, it's really a case of being able to manage the risk and having a good application life cycle management process so that you are capable of deploying new versions of components in a lightweight and inexpensive way. Some practical tips that I think would be good as a guide are discussed below.
In the architecture area it's important to have a good understanding of your solution dependencies and to actively monitor your cloud platform provider to understand what things in their pipeline may have an effect on you. Arranging for R&D to be done with beta versions of SDKs and new features can help you to mitigate this risk, and also to find new opportunities which you can benefit from.
Having a standard for how far behind the latest version of a component you will allow your applications to be is a good thing, and if you don't have one you should. You should also consider that it may need to change for the cloud and its more regular release cadence. Often I hear architects define their standard as N-1 or N-2 (whether they actually adhere to it or not). In the above example, after 2 years with log4net we were 3 revisions behind the latest build, which is probably quite safe, whereas with the Azure Service Bus SDK we were 1 major and 2 minor releases behind the latest. Without knowing more detail on what has changed, it's more difficult to guess how risky that might actually be. I would assume that being 1 major release behind carries a small amount of risk and that a few minor releases behind is probably OK, but it really comes down to how well you understand the likelihood of breaking changes from the vendor and what their major and minor versions actually mean. Also consider that this is just the SDK: for a platform as a service (PAAS) offering there may have been an unknown amount of change happening behind the scenes on the platform that we are not aware of. Certainly at a glance being 13 releases behind the latest seems more risky than 3. The key thing is that you need to understand the technical detail of what's changing, and that's the job of the technical architects, probably combined with some of your support team.
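To make a "how far behind" standard something you can check rather than just talk about, a rough Python sketch might look like the following. The function names and the thresholds in `within_standard` are made up for illustration, and the version lists are the log4net releases and the Service Bus SDK releases up to v2.2.1 listed earlier. It only counts releases and major versions; it says nothing about what actually changed in them.

```python
from typing import List, NamedTuple, Tuple


class Lag(NamedTuple):
    releases_behind: int  # published releases newer than the deployed one
    major_behind: int     # gap in the first version number to the newest release


def parse(version: str) -> Tuple[int, ...]:
    """Turn 'v1.2.10' or '1.2.10' into (1, 2, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def version_lag(deployed: str, published: List[str]) -> Lag:
    """Measure how far the deployed version trails the published releases."""
    current = parse(deployed)
    newer = sorted(v for v in map(parse, published) if v > current)
    if not newer:
        return Lag(0, 0)
    return Lag(len(newer), newer[-1][0] - current[0])


def within_standard(lag: Lag, max_releases: int = 3, max_majors: int = 0) -> bool:
    """Hypothetical policy: at most 3 releases behind and no major version behind."""
    return lag.releases_behind <= max_releases and lag.major_behind <= max_majors


log4net = version_lag("1.2.10", ["1.2.11", "1.2.12", "1.2.13"])
service_bus = version_lag(
    "1.6.0",
    ["1.7.0", "1.8.0", "2.0.0", "2.0.1", "2.1.0", "2.1.1",
     "2.1.2", "2.1.3", "2.1.4", "2.2.0", "2.2.1"],
)
print("log4net:", log4net, "within standard:", within_standard(log4net))
print("service bus:", service_bus, "within standard:", within_standard(service_bus))
```

With these assumed thresholds the log4net dependency passes and the Service Bus SDK dependency gets flagged for a closer look, which is the point: the number is a prompt for the architects to go and read the release notes, not a verdict.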
In development this is where some organizations really need to up their game. If we develop solutions in a componentized and stable way, where we can replace or update one component without making the whole solution fragile, then we can handle this quite well. From a development perspective one of the secrets is how you do your testing. I'm a huge fan of behavior driven development (BDD) and test driven development (TDD) approaches. If you are developing a component that is well unit tested and has good BDD tests, you should be able to make even quite large changes to it and still be very confident that when you take it outside of the development environment, as long as it's deployed and configured correctly, it will just work. Although this is the place where I like my teams to be, it still surprises me how many organizations I come across where a deployment to test is made and they have no idea if the solution is going to work or not. The key point here is that good development testing is the biggest way to mitigate risk.
I also find that teams who build integration components using a BDD style often have a better understanding of the things that depend on the component. This happens when the tests are written from the perspective of the things that will use the component and validate the behavior they expect. This helps developers to understand these dependencies as well as test them in the development arena by testing against stubs.
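As a minimal sketch of what a consumer-perspective test against a stub can look like, here is a Python example using the standard `unittest` module. `RelayForwarder` and `StubBackendService` are hypothetical names invented for this illustration; they stand in for a listener that forwards messages to on premise services and are not the real component or any Service Bus API.

```python
import unittest


class StubBackendService:
    """Stand-in for an on premise service; it only records what it receives."""

    def __init__(self):
        self.received = []

    def submit(self, message: dict) -> str:
        self.received.append(message)
        return "accepted"


class RelayForwarder:
    """Hypothetical listener: routes each incoming message to a backend by its action."""

    def __init__(self, routes: dict):
        self.routes = routes

    def on_message(self, message: dict) -> str:
        backend = self.routes.get(message.get("action"))
        if backend is None:
            return "rejected"
        return backend.submit(message)


class WhenACallerSendsAMessage(unittest.TestCase):
    """Tests are named after the behavior the calling system expects."""

    def setUp(self):
        self.orders = StubBackendService()
        self.forwarder = RelayForwarder({"CreateOrder": self.orders})

    def test_a_known_action_reaches_the_order_service_unchanged(self):
        message = {"action": "CreateOrder", "body": "<order id='42'/>"}
        self.assertEqual(self.forwarder.on_message(message), "accepted")
        self.assertEqual(self.orders.received, [message])

    def test_an_unknown_action_is_rejected_and_nothing_is_forwarded(self):
        result = self.forwarder.on_message({"action": "DeleteEverything"})
        self.assertEqual(result, "rejected")
        self.assertEqual(self.orders.received, [])
```

Saved as a module, this can be run with `python -m unittest`. Because the tests describe what the caller expects rather than how the forwarder is built, they should keep passing (or fail loudly) when a dependency underneath the component is upgraded.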
In addition to testing, development teams need to ensure they are using continuous integration approaches. Building your codebase every day and executing some tests is a good way to ensure it's still working and supportable. The last thing you want is to get the latest code after not using it for a while and find out it just doesn't build, and then when you fix that, half of your tests don't work. In the example above we mitigated a lot of risk by having the continuous integration server execute some tests which flexed the component by making calls via Windows Azure Service Bus. If a change was introduced which broke something that we hadn't seen, this would be one of the first places to detect it.
In the testing area we need to be able to take a risk based approach to testing. In large enterprises it's often the case that a small change results in a test team believing they need to retest huge portions of the system. Taking the time to understand the change and the associated dependencies can help you to identify the areas which do need testing and those which may only have a small amount of testing or none at all. In a solid componentized solution with a change like this I would hope we only need to do basic regression testing to be confident about changes that are related to issues like I'm talking about in this article.
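One way to make "understand the change and the associated dependencies" concrete is to write the dependencies down and walk them. The Python sketch below assumes you maintain a simple map of which component depends on what; the component names are invented for the example. It returns the changed item plus everything that depends on it directly or indirectly, which is the candidate scope for regression testing.

```python
from collections import deque
from typing import Dict, Set


def components_to_retest(changed: str, depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return the changed item plus everything that depends on it, directly or not."""
    # Invert the map: for each dependency, who depends on it?
    dependents: Dict[str, Set[str]] = {}
    for component, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(component)

    # Breadth-first walk outwards from the changed item.
    affected = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical dependency map for illustration only.
depends_on = {
    "relay-listener": {"servicebus-sdk"},
    "partner-portal": {"relay-listener"},
    "order-service": {"log4net"},
    "reporting": {"order-service"},
}

print(sorted(components_to_retest("servicebus-sdk", depends_on)))
```

In this made-up map, an SDK upgrade touches the listener and the portal that calls through it, while the order service and reporting are outside the blast radius and arguably need little or no testing for this change.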
In addition to a risk based approach, automation of testing is also a good thing to have. If you can click a button and have a whole bank of automated tests executed, flexing parts of your solutions, then this will really help you to deal with these things. The overall key to testing for these types of change is that we want to be in a position where we can do a small amount of relatively inexpensive testing and get the change shipped.
In the release area, having a release and deployment process which is simple yet effective and offers a strong roll back approach is the best way to help you be effective. If you are able to reliably deploy your components without huge effort then you have mitigated a lot of risk and potentially saved a lot of cost, which are two of the key things that make enterprise organizations reluctant to do these types of changes.
One of the best examples of what good looks like in this area involved the CA Lisa Release Automation product (formerly Nolio), which was implemented at an organization I have worked with and gave us a superb deployment story. In addition to the physical deployment process we also had a governance model to help us know what version is deployed where, when and by whom, and to be able to schedule deployments or roll backs. This was implemented by a good friend of mine, Mo Uppal, who I think is a great thought leader in this space (unfortunately he doesn't blog about his experiences, but he has presented at the ALM Summit in the past, which went down very well), and by Sharma Kirloskar, who was a key member of our automation team.
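To illustrate the kind of information such a governance model needs to hold, here is a small Python sketch of a deployment ledger. It is purely illustrative: the class and field names are invented, and it is not how CA Lisa Release Automation or any other product models its data. It records what version went where, when and by whom, and works out which version a roll back would return to.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Deployment:
    component: str
    version: str
    environment: str
    deployed_by: str
    deployed_at: datetime


class DeploymentLedger:
    """Hypothetical append-only record of deployments, oldest first."""

    def __init__(self) -> None:
        self._history: List[Deployment] = []

    def record(self, component: str, version: str, environment: str,
               deployed_by: str, deployed_at: Optional[datetime] = None) -> Deployment:
        entry = Deployment(component, version, environment, deployed_by,
                           deployed_at or datetime.now(timezone.utc))
        self._history.append(entry)
        return entry

    def _entries(self, component: str, environment: str) -> List[Deployment]:
        return [d for d in self._history
                if d.component == component and d.environment == environment]

    def current(self, component: str, environment: str) -> Optional[Deployment]:
        """What is deployed there right now, when, and by whom."""
        entries = self._entries(component, environment)
        return entries[-1] if entries else None

    def rollback_target(self, component: str, environment: str) -> Optional[str]:
        """Most recent earlier version that differs from the current one."""
        entries = self._entries(component, environment)
        if not entries:
            return None
        current_version = entries[-1].version
        for entry in reversed(entries[:-1]):
            if entry.version != current_version:
                return entry.version
        return None


ledger = DeploymentLedger()
ledger.record("relay-listener", "1.0.0", "production", "build-server")
ledger.record("relay-listener", "1.1.0", "production", "build-server")
print(ledger.current("relay-listener", "production"))
print(ledger.rollback_target("relay-listener", "production"))
```

Even a record this simple answers the questions that make managers nervous about small changes: what is live, who put it there, and what we go back to if it goes wrong.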
If we accept that we are more likely to need to patch components outside of functional releases, we now need to change the process for interacting with the business to be able to accept and process the fact that IT needs to do these changes. To some degree you may consider that this is a tradeoff for the benefits you get with using the cloud but we just need to help make the enterprise capable of making changes and doing them in a way that doesn't make the business get really worried because IT wants to change something.
If you take the Azure Service Bus SDK example from above, from a technical perspective I have a very high level of confidence in making the change. I would update the component with the latest version of the Service Bus SDK and follow our development process: build it locally and let the build run the tests, then check it in and let the build server do the same to produce our build package. The output would be a release package which we could deploy with confidence in 5 minutes, and if there was a problem we could roll it back in 5 minutes. Given the nature of the change, if it works on the build server with our development tests it's very unlikely the system test team would find any issues with it, but they could run their automated tests anyway to keep everyone happy. In theory we could do this from start to finish in very little time (a few hours). The challenge, though, is that senior managers and business stakeholders in enterprise IT are typically used to large, complex and fragile systems and general pain, so they have a very pessimistic attitude to risk even when the organization has small pockets of development with a really strong and reliable process. This makes getting the business and management to approve these trivial changes a non-trivial task.
So far I have talked about this challenge in terms of a component with a software dependency in the PAAS space, but a similar challenge is probably true in the IAAS space too. With enterprise applications changing less frequently than cloud providers do, IAAS is one of those areas which your organization might consider safer because the rate of change will be slower. This may be the case, but I would bet that there is still a need for the enterprise to be capable of speeding up. Let's take the average enterprise. I would bet most of them still have some Windows 2003 servers somewhere in their data center. In the real world, enterprise data centers run servers, operating systems and applications that are sometimes outside of mainstream support and sometimes even outside of extended support. The attitude is often "if it's not broke, why change it". In the IAAS space you need to think about the fact that you could be forced to change it, or at least discouraged from not changing it.
Let's take the example where an average organization moves a large portion of its testing infrastructure to Windows Azure (or Amazon, it doesn't really matter). Going up there today, the gallery offers you the chance to build servers with Windows 2008 R2 SP1 onwards. So you might already have a problem: your servers are Windows 2008 and you need to upgrade. This difference between your production and test kit introduces some risk that you need to evaluate and manage. But say you're on Windows 2008 R2 SP1 and you set up your whole environment. What happens in 1 year's time when Windows 2008 R2 SP1 ends mainstream support? Will you still be able to get this image from the gallery to create a new server to add to your test environment? Will there be a new service pack? If there is, that introduces the risk factor again.
Maybe the answer to that is to manage your own images and upload them yourself, giving you more control of the virtual machine and what's on it. This might work, but again it could introduce some risk. If it were on premise you would be in control of the version of VMware or Hyper-V that you're running and you would know that your guest operating system was compatible. In the cloud your provider will be continually patching and updating the underlying virtual machine host platform, and maybe it will not be compatible with, or supported for, your guest.
The key thing is that infrastructure is in the same boat. While infrastructure departments are typically better (in my opinion) at understanding their system dependencies and are also used to managing the roll out of patches to servers, the cloud brings this idea of keeping up to date to a whole new level.
In conclusion, I hope this article doesn't come across as a doom and gloom story, as it's not intended to be. What I want to articulate is that for the enterprise, the cloud brings in a new way of thinking. While the organization may have significant short term success with the cloud, you need to do some thinking about how you will manage and leverage this investment in the long term. This new way of thinking means that there are also new challenges you will need to handle, otherwise you could feel some pain in the future.
If I could give one piece of advice on the most important thing organizations need to do to embrace the cloud for the long term, it is to accept that the cloud is constantly changing and that you will need to invest in your people, giving them ongoing support and training to help them keep up to speed with these changes and get the benefits in the long term. Let's face it, the enterprise is typically not great at investing in developer and IT pro training, but it needs to change from a once per year one week course to an ongoing thing because the cloud is changing so regularly. You should probably also consider mentoring from experts with strong cloud experience.
As a dirty plug at the end of my article, I would say that to address the training gap the best tip is to buy subscriptions to Pluralsight for your architecture, development, integration and support teams. This will give them access to training on many of the technologies associated with the cloud and let your staff train on a continuous basis to keep pace with the rapidly changing cloud space, rather than sending them away for a week's training at a significantly higher cost. Just to declare my bias: I have authored courses for Pluralsight, but I also still use their training a lot.