I’ve heard (and used) YAGNI (You Ain’t Gonna Need It) quite often in my software development career. It’s a battle cry for shipping a minimum viable product and letting real-world usage dictate which new features and improvements are really needed. Generally speaking, I think this ruthless minimalism is a good thing. We’ve all fallen into “pie in the sky” thinking about adding lots of bells and whistles to whatever feature we’re working on. I, for one, also know the feeling of spending a lot of time on one aspect of a new feature only to discover later that no one really uses it. I like to think that, over time, I’ve developed some sense of when a given feature is likely to be useful and when I should YAGNI it out of my task list, but then again I also feel like the more I know, the less I know. Lately I’m finding that when I’m in doubt it’s best to err on the side of doing less and keeping things as simple as possible.
That said, there is a certain class of feature that I sometimes regret omitting in the name of YAGNI. I recently read a fantastic post on Oren Eini’s blog titled “On Professional Code”. I think this quote from that post sums it up quite nicely:
> …a professional system is one that can be supported in production easily. About the most unprofessional thing you can say is: “I have no idea what is going on”.
This post really hit home for me because I’ve recently transitioned into a support role (or devops, if you like) at work. One of the things I’m now responsible for is assisting our support team with troubleshooting difficult issues and figuring out what’s going on when the phone is ringing off the hook with end users complaining that the “system is slow” and they can’t get their work done. While it’s a relatively rare occurrence, I absolutely hate having to say, “I don’t know what’s going on”. When an odd error is being thrown at a user who is trying to do something in our system, it’s terribly frustrating to have to crack open the source code to figure out why it’s being thrown. The error messages that are logged are often loaded with developer-speak or things like, “this shouldn’t happen”. I’m one of the more tenured developers on our team, so I know that if I have difficulty understanding why a given error is being raised, our support staff has almost zero chance of figuring it out on their own. When they can’t figure it out, they have to ask for help, and when they ask for help, I or one of my co-workers has to stop working on something else (usually an improvement to our infrastructure, an internal tool that will make our lives easier, or some performance profiling/tuning) to help them and make sure our customers can get their work done. I feel very comfortable saying that these kinds of poorly documented and poorly understood error conditions can be tremendously costly to any company. If you have to read the source code of your application to fix an issue that doesn’t actually require changing the source code (i.e. it’s not a bug in the system), then you’re failing to write “professional code” as Oren defines it in that post.
As a developer I know that I’m as guilty as anyone of writing sub-par error handling code and leaving cryptic error messages in the log. I can’t speak for anyone but myself, but I think that this happens for one of two different reasons:
- Mistake: Sometimes I simply didn’t think that a particular piece of input would ever end up being passed into that routine that I wrote. I think I’m usually a pretty good practitioner of defensive programming, but sometimes I make a mistake.
- YAGNI: Other times I might have made a conscious decision to check for invalid input or assert that data was in the state that I expected it to be in, but figured that a terse, off-the-cuff error message would be adequate for troubleshooting an issue that “shouldn’t really ever happen anyway”. I have more important user-facing aspects of the feature to complete, so surely I can apply YAGNI to this error message and move on.
As long as software is written by human beings, we’ll always make mistakes. I’m not interested in exploring how to mitigate the damage caused by human error in this post because that’s a topic that’s been explored in depth by folks much smarter and more experienced than I am. Instead I’d like to focus on issue #2: the application of YAGNI to error-handling features. I’ve come to the realization lately that there is no such thing as ‘YAGNI’ when it comes to exposing information about what’s going on in a live production application. In fact, I’ve started using another five-letter acronym to describe the development of features like this: YCNHE, or “You Can Never Have Enough”.
In my opinion, the cost-benefit ratio for adding useful information to error handling code is such that you can almost never have enough. Any time I add a message to some code that will end up being logged, I ask myself the following question: “Who will read this message and what action will they need to take?” Almost no one can see a message like “Unexpected value for parameter InputType” and know what needs to happen to fix it. It’s just an informative statement with no imperative command or pointer to additional information. Granted, there’s probably a ton of contextual information captured along with the message, like the timestamp from the server, a stack trace showing what routine or line of code was being executed, and the username of the person who was running the application when the error occurred. That type of contextual information can be very useful to a developer trying to reproduce or fix a defect, but it’s nearly useless to a non-developer trying to resolve an issue.
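To make that question concrete, here’s a minimal sketch in Python (the order, the discount code, and the log wording are all invented for illustration) contrasting the terse message with one that tells a non-developer which record was involved, what the bad input was, and what to do about it:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orders")

# Hypothetical lookup table standing in for a real data store.
DISCOUNT_CODES = {"SPRING10": 0.10}

def apply_discount(order_id: int, discount_code: str) -> bool:
    discount = DISCOUNT_CODES.get(discount_code)
    if discount is None:
        # The vague version, "Unexpected value for parameter discount_code",
        # tells support staff nothing about who is affected or what to do.
        # This version names the record, the bad input, and the next step.
        logger.error(
            "Could not apply discount to order %s: code %r does not exist. "
            "The order itself is unaffected; confirm the code with the customer "
            "or add it with the discount admin tool before retrying.",
            order_id, discount_code,
        )
        return False
    logger.info("Applied %.0f%% discount to order %s", discount * 100, order_id)
    return True

apply_discount(42, "SPRNG10")  # the typo triggers the actionable error message
```

The mechanics don’t matter much; what matters is that the message itself answers the “who reads this and what do they do next?” question instead of leaving that to whoever is on call.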
So what can you do to help alleviate this issue? Here are a few thoughts I’ve been formulating and will be experimenting with in the coming months:
- Encourage all developers to ask themselves, “Who will read this message and what action will they need to take?” whenever they are logging a message. Asking yourself this question leads to better error messages. Some quick examples of easily added information that can be very useful to non-developer staff:
  - Does this message represent something that needs to be addressed immediately, or is it just some information being captured for future reference?
  - What is the primary key / identity of the records from the database that were involved in the error?
  - What were the specific input values that caused the overflow, out-of-range, divide-by-zero, or other run-time error?
- Consider using numeric error codes as a point of reference in all logged errors. Often a logged error message can’t (and shouldn’t) contain all of the information that might be needed to correct the issue. By using error codes you can easily set up external documentation for each unique error code that serves as a place to capture additional troubleshooting steps, support staff notes, and all of the other bits of information that can be useful when troubleshooting an issue in production. Error codes can also help you establish the relative severity of your errors. For example, you could say that codes in the 50000+ range represent potentially fatal errors that need to trigger e-mail/pager notifications immediately, while stuff in the 10000 range might just need to be logged for future reference if needed. (There’s a rough sketch of this idea after this list.)
- Create “dry run” and/or “debug” modes for complicated procedures/algorithms. If support staff can do a “dry run” of a complicated procedure a customer is attempting, one that produces a detailed, debug-level log of everything that happened, they can see how the code breaks the problem down into steps and on which step things fell apart. This is the type of detail a developer might need an interactive debugger for, but if a particular process in your system is complicated enough that you need to step through it a lot, why not let all members of the team (developer or otherwise) get that same step-through experience? (The second sketch after this list shows one way to do this.)
- Make it easy for all members of the team to build and access support tools. The people that support your application in production always need new tools. They need a way to unlock that bit of data for an end user who mistakenly locked it too soon. They need to be able to execute a query to see how many customers make use of a configurable feature. They need to be able to insert new rows into lookup tables that don’t change frequently enough to warrant building a user-facing interface. Sometimes it’s faster and easier to simply do these one-off tasks manually on the spot than to build a tool, but doing things manually simply doesn’t scale as your team and customer base grow. If you can lower the overhead involved in building these kinds of tools, you’re far more likely to actually build them. Command line applications are great for this because you don’t have to get bogged down in creating UIs for tools that will only be used by internal staff. (The last sketch after this list shows how small one of these can be.)
- Have a resource/team dedicated to improving the situation. Our support/devops team is still new and getting ramped up, but I’ve already found it immensely helpful to be able to improve our error messages and internal documentation on the spot whenever I need to answer a question for a member of our support team. The developers that build customer-facing features won’t always have the foresight to build them in a way that makes them easy to troubleshoot in production, so having a team dedicated to that job will help ensure that it gets done.
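To illustrate the error code idea, here’s a minimal sketch in Python. The code ranges, the `page_on_call` hook, and the `ERR-` prefix are all assumptions standing in for whatever conventions and notification mechanisms your team actually uses:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("app")

# Hypothetical convention: codes in the 10000s are informational,
# anything at or above 50000 is potentially fatal and should page someone.
PAGER_THRESHOLD = 50000

def page_on_call(detail: str) -> None:
    # Placeholder for whatever e-mail/pager notification mechanism you use.
    print("PAGE:", detail)

def log_coded_error(code: int, message: str, **context) -> None:
    # The "ERR-10042" style code gives support staff something stable to look
    # up in external documentation (troubleshooting steps, notes, history).
    detail = f"[ERR-{code}] {message} | " + ", ".join(
        f"{key}={value!r}" for key, value in context.items()
    )
    if code >= PAGER_THRESHOLD:
        logger.critical(detail)
        page_on_call(detail)
    else:
        logger.warning(detail)

log_coded_error(10042, "Nightly import skipped a malformed row", row_id=918)
log_coded_error(50007, "Payment gateway unreachable", gateway="primary")
```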
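For the dry-run/debug mode idea, one simple pattern (again just a sketch; the billing steps are invented) is to have a complicated procedure narrate each step at debug level and skip the side effects when a flag is set:

```python
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("billing.close")

def close_billing_period(period_id: int, dry_run: bool = False) -> None:
    # Each step is a (name, action) pair so the procedure can announce itself.
    steps = [
        ("load open invoices", lambda: [101, 102, 107]),
        ("apply late fees", lambda: 2),
        ("post totals to ledger", lambda: "ok"),
    ]
    for name, action in steps:
        log.debug("period %s: starting step %r (dry_run=%s)", period_id, name, dry_run)
        if dry_run:
            continue  # report what would happen without actually doing it
        result = action()
        log.debug("period %s: step %r finished with result %r", period_id, name, result)

# Support staff can safely re-run the customer's scenario and read the log
# to see how the code breaks the work down into steps.
close_billing_period(2024, dry_run=True)
```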
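And for the support tools bullet, a command line script can be only a dozen lines. This sketch assumes a sqlite3 database and an invented `work_items` table purely for illustration; the point is how little ceremony is involved:

```python
import argparse
import sqlite3

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Unlock a record that an end user locked too soon."
    )
    parser.add_argument("record_id", type=int, help="primary key of the locked record")
    parser.add_argument("--db", default="app.db", help="path to the database file")
    args = parser.parse_args()

    # One UPDATE, no UI: exactly the kind of one-off task support staff need.
    with sqlite3.connect(args.db) as conn:
        updated = conn.execute(
            "UPDATE work_items SET locked_by = NULL WHERE id = ?",
            (args.record_id,),
        ).rowcount
    print(f"Unlocked {updated} record(s) with id {args.record_id}")

if __name__ == "__main__":
    main()
```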
I think it’s pretty commonly accepted in the software development community that “good” code is easily readable and maintainable by other developers, but I think that notion needs to be expanded: good code should also be easily supportable by people who aren’t developers. Next time you write a new feature, take some time to think about how it might need to be supported in production. Are there things that would be relatively easy to add that might pay big dividends down the road by improving the supportability of the system in production?