Saturday, November 27, 2010
Managing Software Development Flow

This is the third post in a series about software development flow, which I'm describing as the conversion of customer requests (both for new features as well as bug reports) into working software. In the first post on this topic, we talked about how a software development organization can be viewed as a request processing engine, and how queuing theory (and Little's Law in particular) can be applied to optimize overall throughput and time-to-market (also called "cycle time"). In the second article, we revisited these same concepts with a more intuitive explanation and started to identify the management tradeoffs that come into play. This article will focus mostly on this final area: what metrics are important for management to understand, and what are some mechanisms/levers they can apply to try to optimize throughput?

I am being intentionally vague about particular processes here, since these principles apply regardless of the particular process you follow. I will also not talk about particular functional specialties like UX, Development, QA, or Ops; whether you have a staged waterfall process or a fully cross-functional agile team, the underlying theory still applies. Now, the adjustments we talk about here will have the greatest effect when applied to the greatest organizational scope (for example, including everything from customer request intake all the way to delivering working software), but Little's Law says they can also be applied to individual subsystems (for example, perhaps just Dev/QA/Ops taken together, or even just Ops). Of course, the more you focus on individual parts of the system, the more likely you are to locally optimize, perhaps to the detriment of the system as a whole.

Managing Queuing Delay

As we've seen previously, at least some of the time a customer request is moving through our organization, it isn't being actively worked on; this time is known as queuing delay. There are different potential causes of queuing delay, including:

  • batching: if we group individual, independent features together into batches (like sprints or releases), then some of the time an individual feature will either be waiting for its turn to get worked on or it will be waiting for the other features in the batch to get finished
  • multitasking: if people can have more than one task assigned, they can still only work on one thing at a time, so their other tasks will be in a wait state
  • backlogs: these are explicit queues where features wait their turn for implementation
  • etc.

The simplest way to observe queuing delay is to measure it directly: what percentage of my in-flight items don't have someone actively working on them? If your process is visualized, perhaps with a kanban board, and you use avatars for your people to show what they are working on, then this is no harder than counting how many in-flight items don't have an avatar on them.

[ Side note: if you have also measured your overall delivery throughput X, Little's Law says:

N_Q / N = (X R_Q) / (X R) = R_Q / R

In other words, your queuing delay R_Q is the same percentage of your overall cycle time R as the number of queued items N_Q is to the overall number of in-flight items N. So you can actually measure your queuing delay pretty easily this way.]
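To make that concrete with some made-up numbers: suppose a board count finds 30 items in flight, 18 of them sitting idle, and we've measured an average end-to-end cycle time of 20 days. A quick sketch of the arithmetic (illustrative only; a spreadsheet works just as well):

    // Hedged sketch: estimating queuing delay from a quick board count using
    // the Little's Law ratio N_Q / N = R_Q / R. All numbers are invented.
    public class QueuingDelayEstimate {
        public static void main(String[] args) {
            int itemsInFlight = 30;          // N: everything between intake and "shipped"
            int idleItems = 18;              // N_Q: in-flight items with no avatar on them
            double avgCycleTimeDays = 20.0;  // R: measured from request intake to delivery

            double queuedFraction = (double) idleItems / itemsInFlight;   // N_Q / N
            double queuingDelayDays = queuedFraction * avgCycleTimeDays;  // R_Q = (N_Q / N) * R

            System.out.printf("%.0f%% of cycle time (%.1f of %.1f days) is queuing delay%n",
                    queuedFraction * 100, queuingDelayDays, avgCycleTimeDays);
        }
    }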

The primary mechanism, then, for reducing queuing delay is to reduce the number of in-flight items allowed in the system. One simple approach is a "one-in-one-out" policy that admits a new feature request only when a previous feature has been delivered; this puts a cap on the number of in-flight items N. We can then periodically (perhaps once a week, or once an iteration) reduce N by taking a slot out: in essence, when we finish one feature request, we don't admit a new one, thus reducing the number of overall requests in-flight.

Undoubtedly there will come a time when a high-priority request shows up, and the opportunity cost of waiting for something else to finish before it can be inserted would be too great. One possibility here is to flag this request as an emergency, perhaps by attaching a red flag to it on a kanban board to note its priority, and temporarily allow that new request in, with the understanding that we will not admit a new feature request once the emergency feature finishes.
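If it helps to see the policy spelled out, here is a minimal sketch of the slot bookkeeping (the class and method names are my own invention; in practice this is usually just a WIP limit on a kanban board, not code):

    // Hedged sketch of "one-in-one-out" admission control with a WIP cap,
    // periodic slot retirement, and an emergency bypass. Illustrative only.
    public class DeliverySlots {
        private int wipLimit;  // the cap on items in flight (N)
        private int inFlight;  // items currently admitted

        public DeliverySlots(int wipLimit) {
            this.wipLimit = wipLimit;
        }

        // Admit a new request only if a slot is free.
        public synchronized boolean tryAdmit() {
            if (inFlight < wipLimit) {
                inFlight++;
                return true;
            }
            return false;  // otherwise it waits until something ships
        }

        // Red-flagged emergency: let it in now. Because we are temporarily over
        // the limit, nothing new gets admitted until we are back under it.
        public synchronized void admitEmergency() {
            inFlight++;
        }

        // A feature shipped; its slot frees up (unless we are still over the limit).
        public synchronized void deliver() {
            inFlight--;
        }

        // Periodically retire a slot to drain queuing delay out of the system.
        public synchronized void retireSlot() {
            wipLimit--;
        }
    }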

Managing Failure Demand

Recall that failure demand consists of requests we have to deal with because we didn't deliver something quite right the first time--think production outages or user bug reports. Failure demand can be quite expensive: according to one estimate, fixing a bug in production can be more than 15 times as expensive as correcting it during development. In other words, having work show up as failure demand is probably the most expensive possible way to get that work done. Cost aside, however, any amount of failure demand that shows up detracts from our ability to service value demand--the new features our customers want.

From a monitoring and metrics perspective, we simply compute the percentage of all in-flight requests that are failure demand. Now the question is how to manage that percentage downward so that we aren't paying as much of a failure demand tax on new development.

To get rid of existing failure demand, we need to address the root causes for these issues. Ideally, we would want to do a root cause analysis and fix for every incident (this is the long-term cheapest way to deal with the problems), but for organizations already experiencing high failure demand, this might temporarily drag new feature development to a halt. An alternative is to have a single "root cause fix" token that circulates through the organization: if it is not being used, then the next incident to arrive gets the token assigned. We do a root cause analysis and fix for that issue, and when we've finished, the token frees up and we look for the next issue to fix. This approach caps the labor investment in root-cause analysis and fixing, and will, probabilistically, end up fixing the most common issues first. Over time, this will gradually wear away at the causes of existing failure demand. It's worth noting that you may not have to go to the uber root cause to have a positive effect--just fixing the issue in a way that makes it less likely to occur again will ultimately reduce failure demand.
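A sketch of the circulating token, with invented names, just to show how little machinery it takes:

    import java.util.concurrent.atomic.AtomicBoolean;

    // Hedged sketch of the single "root cause fix" token: at most one incident
    // at a time gets the deeper root-cause treatment; the rest are triaged normally.
    public class RootCauseToken {
        private final AtomicBoolean inUse = new AtomicBoolean(false);

        // Called as each new incident arrives.
        public void onIncident(String incidentId) {
            if (inUse.compareAndSet(false, true)) {
                System.out.println("Token assigned: root-cause fix for " + incidentId);
            } else {
                System.out.println("Token busy: triage " + incidentId + " as usual");
            }
        }

        // Called when the root-cause fix ships; the token frees up for the next incident.
        public void onRootCauseFixed() {
            inUse.set(false);
        }
    }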

However, we haven't addressed the upstream sources of failure demand yet; if we chip away at existing failure demand but continue to pile more on via new feature development, we'll ultimately lose ground. The primary cause of new failure demand is trying to hit an aggressive deadline with a fixed scope--something has to give here, and what usually gives is quality. There may well be reasons that this is the right tradeoff to make; perhaps there are marketing campaigns scheduled to start or contractual obligations that must be met (we'll save a discussion of how those dates got planned for another time). At any rate, management needs to understand the tradeoffs that are being made, and needs to be given the readouts to responsibly govern the process. "Percent failure demand" turns out to be a pretty simple and informative metric.

Managing Cycle Time

Draining queuing delay and tackling failure demand are pretty much no-brainers: they are easy to track, and there are easy-to-understand ways to reduce both. However, once we've gotten all the gains we can out of those two prongs of attack, all that is left is trying to further reduce cycle time (and hence raise throughput) via process change. This is much harder--there are no silver bullets here. Although there are any number of folks who will claim to "know" the process changes that are needed here, ranging from Agile consultants, to other managers, to the folks working on the software itself, the reality is that these ideas aren't really guaranteed solutions. They are, however, a really good source of process experiments to run.

Measuring cycle time is important, because thanks to queuing theory and Little's Law, it directly corresponds to throughput in a system with a fixed set of work in-flight. Furthermore, it is very easy to measure average cycle time; the data can be collected by hand and run through a spreadsheet with little real effort. This makes it an ideal metric for evaluating a process change experiment:

  • if cycle time decreases, keep the process change as an improvement
  • if cycle time increases, revert back to the old process and try something different
  • if cycle time is not affected, you might as well keep the change but still look for improvement somewhere else

Keeping "no-effect" process changes in place sets the stage for a culture of continual process improvement; it encourages experimentation if nothing else (and the cycle time measurements have indicated it hasn't hurt). Now, regardless of the experiment, it's important to set a timebox around the experiment so that we can evaluate it: "let's try it this way for a month and see what happens". New processes take time to sink in, so it's important not to run experiments that are too short--we want to give the new process a chance to shake out and see what it can really do. It's also worth noting here that managers should expect some of the experiments to "fail" with increased cycle time or to have no appreciable effect. This is unfortunately the nature of the scientific method--if we could be prescient we'd just jump straight to the optimized process--but this is a tried and true method for learning.

Now, process change requires effort to roll out, so a good question to ask here is how to find the time/people to carry this out. There's a related performance tuning concept here known as the Theory of Constraints, which I'll just paraphrase as "there's always a bottleneck somewhere." If we keep reducing work in-flight, and we have the end-to-end process visualized somewhere, we should be able to see where the bottleneck in the process is. The Theory of Constraints also says that you don't need to take on any more work than the bottleneck can process, which means, depending on your process and organizational structure, that we may find that we can apply folks both "upstream" and "downstream" of the bottleneck to a process change experiment without actually decreasing overall throughput. Furthermore, by identifying the bottleneck, we have a good starting point for selecting an experiment to run: let's try something that will alleviate the bottleneck (or, as the Theory of Constraints says, just move it elsewhere).

Conclusion

In this article, we've seen that managers really only need a few easy-to-collect metrics on an end-to-end software delivery flow to enable them to optimize throughput:

  • total number of items in-flight
  • number of "idle" in-flight items (not actively being worked)
  • number of in-flight items that are failure demand
  • end-to-end average cycle time

We've also identified several mechanisms, ranging from reducing work-in-progress to root cause fixes of failure demand, that can enable managers to perform optimizations on their process at a pace that suits the business. This is the classic empirical process control ("inspect and adapt") model that has been demonstrated to work effectively time and again in many settings, from the shop floor of Toyota factories to the team rooms of agile development organizations.

Thursday, November 25, 2010
Intuitions about Software Development Flow

In a previous post, I described the underlying theory behind optimizing the throughput of a software development organization, which consists of a three-pronged attack:

  1. remove queuing delay by limiting the number of features in-flight
  2. remove failure demand by building in quality up front and fixing root causes of problems
  3. reduce average cycle time by experimenting with process improvements

In this article, I'd like to provide an alternative visualization to help motivate these changes. Let's start with some idealized flow, where we have sufficient throughput to deal with all of our incoming customer requests. Or, if we prefer, the rate at which our business stakeholders inject requests for new features is matched to the rate at which we can deliver them.

Queuing Delay

Now let's add some queuing delay, in the form of some extra water sitting in the sink:

If we leave the faucet of customer requests running at the same rate that the development organization can "drain" them out into working software, we can understand that the level of water in the sink will stay constant. Compared to our original diagram, features are still getting shipped at the same rate they were before; the only difference is that now for any particular feature, it takes longer to get out the other side, because it has to spend some time sitting around in the pool of queuing delay.

Getting rid of queuing delay is as simple as turning the faucet down slightly so that the pool can start draining; once we've drained all the queuing delay out, we can turn the faucet back up again, with no net change other than improved time-to-market (cycle time). There's a management investment tradeoff here; the more we turn the faucet down, the faster the pool drains and the sooner we can turn the faucet back up to full speed at a faster cycle time. On the other hand, that requires (temporarily) slowing down feature development to let currently in-flight items "drain" a bit. Fortunately, this is something that can be done completely flexibly as business situations dictate--simply turn the knob on the faucet as desired, and adjust it as many times as needed.

Failure Demand

We can model failure demand as a tube that siphons some of the organization's throughput off and runs it back into the sink in the form of bug reports and production incidents:

Our sink intuition tells us that we'll have to turn the faucet down--even if only slightly--if we don't want queuing delay to start backing up in the system (otherwise we're adding new requests plus the bug fixing to the sink at a rate faster than the drain will accommodate). Now, every time we ship new features that have bugs or aren't robust to failure conditions (particularly common when rushing to hit a deadline), it's like making the failure demand siphon wider; ultimately we're stealing from our future throughput. When we fix the root cause of an issue, it's like making the failure demand siphon narrower, and we not only get happier customers, but we reclaim some of our overall throughput.

Again, there are management tradeoffs to be made here: fixing the root cause of an issue may take longer than just triaging it, but it is ultimately an investment in higher throughput. Similarly, rushing not-quite-solid software out the door is ultimately borrowing against future throughput. However, it's not hard to see that if we never invest in paying down the failure demand, eventually it will consume all of our throughput and severely reduce our ability to ship new features. This is why it is important for management to have a clear view of failure demand in comparison to overall throughput so that these tradeoffs can be managed responsibly.

Process Change

The final thing we can do is to improve our process, which is roughly like taking all the metal of the drain pipe (corresponding loosely to the people in our organization) and reconfiguring it into a shorter, fatter pipe:

This shows the intuition that if we focus on cycle time (length of the pipe) for our process change experiments, it will essentially free up people (metal) to work on more things (pipe width) at a time, thus improving throughput. There is likewise a management tradeoff to make here: process change takes time and investment, and we'll need to back off feature development for a while to enable that. On the other hand, there's simply no way to improve throughput without changing your process somehow; underinvestment here compared to our competitors means eventually we'll get left in the dust, just as surely as failing to invest cash financially will eventually lead to an erosion of purchasing power due to inflation.

Summary

Hopefully, we've given some intuitive descriptions of the ways to improve time-to-market and throughput for a software development organization to complement the theory presented in the first post on this topic. We've also touched on some of the management tradeoffs these changes entail and some of the information management will need to guide things responsibly.


Credits: Sink diagrams are available under a Creative Commons Attribution-ShareAlike 2.0 Generic license and were created using photos by tudor and doortoriver.


Friday, November 19, 2010
How to Go Faster

Ok, I'm going to tell you how to make your software development organization go faster. I'm going to tell you how to get more done without adding people while improving your time to market and increasing your quality. And I'm going to back it all up with queuing theory. [ By actually explaining the relevant concepts of queuing theory, not just by ending sentences with "...which is obvious from queuing theory", which is usually a good bluff in a technical argument being had over beers. Generally a slam dunk in mixed technical/non-technical company. But I digress. ]

An Important Perspective

It's worth saying that this article assumes you've figured out how to deliver software incrementally somehow, even if that's just by doing Scrumfall. The point is that you are familiar with breaking your overall feature set down into discretely deliverable minimum marketable features (MMFs), user stories, epics, tasks, and the like. If you have any customers, you are probably also familiar with production incidents and bugs, which are also discrete chunks of work to do. Now, here's the important perspective:

Your software development organization is a request processing system.

In this case, the requests come from customers or their proxies (product managers, etc.), and the organization processes the request by delivering the requested change as working software. This could end with a deployment to a live website, publishing an update to an app store, or just plain cutting a release and posting it somewhere for your customers to download and use. At any rate, the requests come into your organization, the software gets delivered, and then the request is essentially forgotten (closed out). Now, looking at your organization this way is important, because it means you can understand your capacity for delivery in terms borrowed from tuning other request processing systems (like websites, for example) for performance and scale. Most important of all, though, this mysterious branch of mathematics called queuing theory applies to your organization (just as it applies to any request processing system).

A Little Light Queuing Theory

One of the basic principles in queuing theory is Little's Law, which says:

N = XR

where N is the average number of requests currently being processed by the system, X is the transaction rate (requests processed per unit time), and R is the average response time (how long it takes to process one request). In a software development setting, R is sometimes called cycle time.

To put this in more familiar terms, suppose we have a walk-in bank with a number of tellers on staff. If customers arrive at an average rate of one person per minute (X) and it takes a teller an average of 2 minutes to serve a customer (R), then Little's Law says we'll have XR = 1(2) = 2 tellers busy (N) at any given point in time, on average. We can similarly flip this around: if we have 3 tellers on staff, what's the maximum average customer arrival rate we can handle?

X = N/R = 3/2 = 1.5 customers per minute

Ok, the last thing we need to talk about is: what happens if we suddenly get a rush of customers coming in? Anyone who has entered a Starbucks or visited Disneyland knows the answer to this: a line forms. (The time a customer spends waiting in line is known as "queuing delay" if you want to get theoretical about it.) Let's go back to our bank. Suppose we just have 5 people suddenly walk in all at once, in addition to our regular arrival of one person per minute. What happens? Well, we get a line that is 5 people long. But if we only have 2 tellers on staff, then people come off the line at exactly the same rate that new people are entering from the back, which means: the line never goes away and always stays 5 people long.

What does this look like from the customers' point of view? Well, we know they'll spend 2 minutes with the teller once they get up to the front of the line, and we know that it will take 5 minutes to get to the front of the line, so the average response time is:

R = R_V + R_Q = 2 + 5 = 7

where RV is the "value added time" where the request (customer) is actually getting worked on/for, and RQ is the amount of time spent waiting in line (queuing delay). Now we can see that on average, we'll have:

N = XR = X(R_V + R_Q) = 1(2 + 5) = 7

people in the bank on average. Two people at the tellers, and five people waiting in line. We all know how frustrating an experience that is from the customer's point of view. Now, let me summarize this section (if you didn't follow all the math, don't worry, the important thing is that you understand these implications):

  1. If you try to put more requests into a system than it can handle, lines start forming somewhere in the system.
  2. If the request rate never falls below the system's max capacity, the lines never go away.
  3. Time spent waiting in a line doesn't really serve much useful purpose from the customer's point of view.

Software development as customer request processing

If your experience is anything like mine, there is an infinite supply of things the business stakeholders would like the software to do, which means the transaction rate X can be as high as we actually have capacity for. This means one of the primary goals of the organization is figuring out how to get X as high as possible so we can ship more stuff. At the same time, we're also concerned with getting R as low as possible, since this represents our time-to-market and can be a major competitive advantage. If we can ship a feature in a week but it takes our competitors a month to get features through their system, who's more reactive? Every time the competition throws up a compelling feature, we can match them in a week. Every time we ship a compelling feature, it takes them a month to catch up. Who's going to win that battle?

Now, one of the tricky things here is that software development is often far more complicated than our example bank with tellers, since we tend to staff folks with different skillsets. If I have a team of one graphic designer, three developers, a tester, and a sysadmin, it's really hard to predict how long it will take that team to ship a feature, because they will have to collaborate. If I want to hire someone to help them, is it better to hire another tester or another designer? Probably I can't tell a priori, because it depends on the nature of the features being worked on, and it's really hard to measure things like "this user story was 10% design, 25% development, 50% testing, and 15% operations." Nonetheless, we can look at this from another point of view: I have a fixed number of people in the organization, each person can only be working on one thing at a time (just as a teller can only actively serve one person at a time), and people are probably (hopefully) collaborating on the things in flight.

This means the maximum number of things you can realistically be actively working on is less than the number of people in the organization.

If we have more things in flight than that, we know at least some of the time those things are going to be sitting around waiting for someone to work on them (queuing delay). Perhaps they are sitting on a product backlog. Perhaps they are simply marked "Not Started" on a sprint taskboard. Perhaps they are marked "Done" on a sprint taskboard but they have to wait for a release to be rolled at the end of the sprint to move onwards towards production or QA. As we saw above, this queuing delay doesn't increase throughput, it just hurts our time-to-market. Why would we want that?

First optimization: get rid of queuing delay

Ok, as we saw above, we know that the total response time R consists of two parts: actual value-adding work (R_V) and queuing delay (R_Q). Typically, it's really hard and time-consuming to try to measure these two pieces separately without having lots of annoying people running around with stopwatches and taking furious notes. Fortunately, we don't have to resort to that. It is really easy to measure R overall for a feature/story: mark down when the request came in (e.g. got added to a backlog) and then mark down when it shipped. Simple.

Now, let's think back to our bank example where we had a line of people. Most software development organizations have too much in flight, and they have lines all over the place inside, many of which aren't even readily apparent because that's just "the way we do things around here." Lines are bad. Now, we know the only way to drain these queues is if the incoming feature request rate is less than the rate at which we ship them. Sometimes we can try hiring more "tellers", but in a recession that's not always an option. Instead, for many organizations, the best option is admission control, which is to say that we don't take on a new request until we've shipped one out the other side. You can think of this as having a certain number of feature delivery "slots" available, and you can't start something new until you've freed up a slot. This at least prevents you from having your lines get any bigger.

In order to drain the lines out of the system, the easiest thing to do is to periodically retire a slot after the item occupying it ships. In other words, don't let something new in, just that once. This will reduce the overall number of things in flight, and since presumably everyone is still working hard, what we've just gotten rid of must be queuing delay. Magic! So we can just keep doing this, draining queuing delay out of the system and improving our time to market, without necessarily having to change anything else about the way we do things. When do we stop? We stop once we have people standing around not doing anything. At that point, all the queuing delay is out of the system (for now), and we know that we're at a level where all of our "tellers" are busy. To summarize:

  1. We can remove queuing delay from our delivery process simply by limiting and reducing the amount of work in-flight; this improves time-to-market without having to change anything else.
  2. We can keep doing this until people run out of things to work on; at that point we've squeezed all the queuing delay out.

Second optimization: reduce failure demand

The next thing to realize is that the N things we have in flight actually come in two flavors: value demand and failure demand. In our case, value demand consists of requests that create value for the customer: i.e. new and enhanced features. Failure demand, on the other hand, consists of requests that come from not doing something right previously. These are primarily things like website outages (production incidents), bug reports from users, or even support calls from users asking if you've fixed the problem they previously reported. If you have someone collecting these, then these are requests that your organization as a whole has to deal with. On the other hand, for each request of failure demand, someone is busy triaging or fixing it when they could be creating new value. In other words:

N = N_V + N_F

where N_V is value demand and N_F is failure demand. Or, if we look at things this way:

X = N/R = (N_V + N_F)/R = N_V/R + N_F/R

we can see that the failure demand is stealing a portion (N_F/R) of our organization's throughput! This is, incidentally, why spending extra energy on quality up front results in lower overall costs (as Toyota showed); failure demand essentially requires rework.
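For example, with made-up numbers: if we have N = 10 requests in flight, N_F = 3 of them are failure demand, and our average cycle time is R = 5 days, then

X = N/R = 10/5 = 2 requests per day

but only N_V/R = 7/5 = 1.4 requests per day of that is new value; the remaining N_F/R = 3/5 = 0.6 requests per day of our capacity is going to rework.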

This means that another way to improve overall throughput of the organization is to reduce failure demand, reclaiming that portion of your throughput that's getting siphoned off. One way to do this involves figuring out how to "build quality in" on new development, but since software development is a creative process (different every time for every feature), it's not possible to actually completely prevent bugs. That said, there are many techniques like test-driven development and user experience testing that can help improve quality. The other way to reduce failure demand involves vigorously fixing root causes of failure as we experience them. In other words, when we fix a problem for a customer, we should fix it in a way that prevents that type of problem from ever occurring again, for any customer. This keeps overall failure demand down by preventing certain classes of it, thereby reserving that precious organizational throughput for delivering new value. To summarize this section:

  1. Improve value delivery capacity by reducing failure demand (production incidents and bug reports).
  2. The cheapest way to reduce failure demand is by building in quality up-front.
  3. When serving a failure demand request, we can reduce overall failure demand by also fixing the root cause of the problem.

Final optimization: cycle time reduction

Ok, now we've gotten to the point where R_Q = 0 (or near zero), so R = R_V. At this point, let's look back at Little's Law:

N = XR

We've already established, by draining out our queuing delay in the first phase, what our target N is (the number of requests in-flight). But we still want to ship more with the same number of people; we want X to go up. Recall that:

X = N/R

If our N is fixed due to the number of people we have on staff, then the only way to increase throughput is to reduce R. Now is where we start to look at process changes and automation. How do we make it so that it takes people less time to handle a request? Focusing on this improves not only time to market but also overall throughput. And furthermore, if we are measuring R over time, we have an easy way to do this: change the process in a way you think will help, and then measure if R went down or not. If it didn't help, try something else. If it made things worse, go back to the old way. Rinse, repeat. The things to try are going to be different for every organization, and one of the best sources of ideas will be the folks actually doing the work. But this doesn't require any kind of high-tech tracking software -- post-it notes on walls with the start and end dates written on them are more than sufficient to measure R and carry these experiments out.

  1. As failure demand and queuing delay are squeezed out of the system, the only way to improve throughput is by reducing response time.
  2. Response time can only be reduced by process changes.
  3. By measuring response time, we have a convenient experimental lab to understand if process changes help or not.

Say, haven't I heard this all before?

Well, yes. You may have heard pieces of this from all sorts of places. The feature "slots" we were talking about before are a means to limit "work-in-progress" (WIP) and are often called kanban. The notion of continually adapting your process to improve it is a tenet of Scrum. Test-driven development and pair programming are methods from Extreme Programming (XP) for building in quality up front. Failure demand is sometimes called out as a form of technical debt, and the list goes on and on.

Hopefully what I've done here, though, without putting a name on any kind of methodology, is explain why all these things are good ideas (or are good ideas to try). Ultimately, practices won't help unless they do one of three things:

  1. drive out queuing delay (R_Q);
  2. reduce value-adding response time (R_V); OR
  3. reduce failure demand (N_F/R)

In general, the easiest way to do these for an organization is:

  1. reduce the number of things in-flight
  2. aggressively beat back failure demand by fixing root causes and building in quality up-front
  3. measure response (cycle) time and improve via process experimentation

Fortunately, all of those things are very, very easy to measure. If you can mark a request as either value or failure demand, if you can count the number of things in-flight, and if you can measure the time between starting something and shipping it, that's all you need.

Update: See the next post on this topic for a more intuitive motivation of the theory presented in this article.

Friday, October 29, 2010
Tales of Test-Driven Development

Inspired by a talk about Clojure given by Rich Hickey at Philly Emerging Tech earlier this year, I've been toying with building a Java library of pure (immutable) data structures, starting with the Map implementation based on Phil Bagwell's Hash Tries. Yes, I know I could probably just figure out how to use them straight out of Clojure by interoperating, but that would deprive me of an interesting coding exercise.

At any rate, I hadn't really gotten around to doing this in any great detail yet, and as it turns out, I'm glad I didn't. I was fortunate to be able to attend a refactoring and test-driven development (TDD) class taught by Bob Martin this week, and one of the code examples we ran through was the "Bowling Game" of writing an algorithm to score ten frames of bowling. Prior to developing this with a TDD approach, we identified a pretty simple object-oriented design including things like Games, Frames, Rolls, TenthFrames, etc. Yet when we actually got down to it, it turned out we just needed a pretty simple algorithm built into the single Game class--ultimately a much simpler design.

Rewinding to the pure hashmaps: I had previously been thinking about how to decompose the internal data structure for the hash tries into classes like InternalNodes and LeafNodes that would implement a common TrieNode interface, and then mark all of those things as package private so I could hide all the messy implementation details from the client behind a PureHashMap facade. Hoo boy.

Instead, over the course of a very few hours today, I took the following approach: first, I used TDD to develop a brain-dead simple implementation of a PureHashMap using real HashMaps as the backing store, but cloning them on modification. Didn't even make an attempt at the Hash Trie implementation. Wouldn't work well at all on large data sets, but it did allow me to develop a set of unit tests that documented the required functional behavior of a PureHashMap, and it didn't really take long at all.
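For concreteness, here is roughly what that throwaway first implementation might have looked like (this is my reconstruction for illustration, not the original code, and the class and method names are guesses):

    import java.util.HashMap;
    import java.util.Map;

    // Hedged sketch of a "brain-dead simple" persistent map: every modification
    // clones the backing HashMap, so existing references never see a change.
    // Fine for pinning down behavior in tests; hopeless for large maps.
    public final class SimplePureHashMap<K, V> {
        private final Map<K, V> backing;

        public SimplePureHashMap() {
            this.backing = new HashMap<>();
        }

        private SimplePureHashMap(Map<K, V> backing) {
            this.backing = backing;
        }

        public V get(K key) {
            return backing.get(key);
        }

        public int size() {
            return backing.size();
        }

        // Returns a new map containing the mapping; this map is untouched.
        public SimplePureHashMap<K, V> with(K key, V value) {
            Map<K, V> copy = new HashMap<>(backing);
            copy.put(key, value);
            return new SimplePureHashMap<>(copy);
        }

        // Returns a new map without the key; this map is untouched.
        public SimplePureHashMap<K, V> without(K key) {
            Map<K, V> copy = new HashMap<>(backing);
            copy.remove(key);
            return new SimplePureHashMap<>(copy);
        }
    }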

Next, I did something crazy: I completely threw away the simple implementation of PureHashMap, breaking all the tests. Then I started hacking the hash trie implementation in there until I could get all the tests to pass one by one. Now, this was ugly, cut-n-pasted, massively high cyclomatic complexity code--I shudder to think of it. But it really didn't take too long to get that working either, with the tests as a guide. Finally, as any TDD aficionado would know, once I had all my tests working again, I was able to easily but mercilessly refactor until it was all cleaned up.

The end result was something far better than I could have imagined: a relatively understandable PureHashMap hash trie implementation in a single class file with 100% unit test coverage, built in less than an afternoon. That's powerful stuff. Thanks, Uncle Bob.

Sunday, September 26, 2010
The Power of Visualizing Iterative Waterfall

We're going through a process mapping exercise at work just to try to understand how we get things done. Now, we are running what I would describe as "scrumfall": doing Scrum for development but having that sit inside a traditional waterfall process. The waterfall is run iteratively and pipelined, although the degree of true pipelining isn't what most people think it is, due to developers having to support upstream and downstream activities like backlog grooming and addressing bugs in QA. I thought I would work through the exercise to try to define the value stream that our features actually experience.

  • Recorded: user story appears on a backlog
  • Defined: user story has acceptance criteria and estimate
  • Prioritized: user story has been assigned a priority/rank
  • Committed: user story has been pulled into a sprint
  • Coded: user story has been marked 'complete' in the sprint
  • Accepted: user story has been shown and accepted in a sprint review
  • Released: user story has been included in a versioned release
  • Tested: enclosing release has achieved an acceptable quality level
  • Approved: enclosing release has been approved for launch (go/no-go)
  • Deployed: enclosing release has been deployed to production

Several of Scrum's standard meetings (plus some other common ones) show up here: backlog grooming moves stories from "recorded" to "defined", sprint planning moves stories from "prioritized" to "committed", daily scrum moves stories from "committed" to "coded", and the sprint review moves stories from "coded" to "accepted".

Just having laid it out and thinking about it, some observations:

  1. we ask our product owners and other stakeholders to sign off on a particular user story twice, once at the sprint review, and once at the go/no-go meeting.
  2. user stories that may well be production-ready upon reaching "coded" get batched and bound to surrounding stories and thus become deployment dependent on them afterwards, even though they may not be functionally dependent on them
  3. interestingly, even though good user stories are supposed to be independent of one another (the "I" in INVEST), we nonetheless batch them together into sprints and treat them as a unit
  4. we don't have a good way to understand what happens to stories that don't get completed in a sprint, or bugs that are deemed non-launch blockers, or production incidents

Another thing to consider as we think about batch size is whether a two-week sprint iteration is actually tied to any relevant process capability metrics. For example, some interesting metrics to consider here are per-batch (per sprint or release); these are things that take roughly the same amount of time whether the batch contains one story or one hundred:

  • how long does it take us to produce a release?
  • how long does it take us to deploy a release?
  • how long does it take us to run a full regression test?
  • what is the lead time for scheduling a meeting with all the necessary folks in it?

And then there are some activities that are dependent on the complexity of a particular story:
  • how long it takes to define acceptance criteria for the story
  • how long it takes to code the story
  • how long it takes to define and update test cases for the story
  • how long it takes to discuss the story and determine if it was acceptably implemented

Batching makes sense if the organization's overall throughput bottleneck is on a batch-size-independent step, in which case, sizing the batch so that it runs in cadence with the cycle time of the bottleneck will maximize throughput. To make that more concrete, let's say we only have certain deployment windows available and can only do a deployment once a week; if this is the slowest part of our process, then we should take batches of work in a way so that upstream steps produce a deployable release once a week. Or, if the slowest part is running a full regression of manual tests over three days, then again, we should take batches that can be finished in three days. Perhaps the product owner is only available once a month to carry out sprint planning or sprint review; then we should batch at a month.

It might seem weird that the calendar of your product manager might be the bottleneck in your software development process, or that it makes sense to roll a release of completed work every three days, but that's queuing theory for you. Optimizing an overall system's throughput means organizing the work according to the current bottleneck's constraints (even if that means non-bottleneck parts might not be locally optimized) and/or moving the constraint elsewhere in the system (Theory of Constraints).

Interestingly, putting the entire workflow up on a kanban board would make a lot of this very obvious, even if all we did was put up WIP limits corresponding to obvious limitations (I can only deploy one release at a time, and I can only test as many releases as I have QA environments, etc.). The great thing about kanban-style development is that you don't have to change your process to start using it; you just model your current one, visualize it, and then watch what happens. You probably have all the information needed to track the metrics that matter, although you may have to start writing down the times at which various status emails pass through your system (the release went out, the new build got deployed to QA, etc.).

However, to me, the most powerful reason to start visualizing the flow is that it shows you exactly what parts of your process you should change, and when. There's nothing like being able to show a product manager that their availability is driving overall throughput to encourage spending more time with the team. Or being able to show a development manager that the amount of time being spent doing bugfixing rework in QA is the bottleneck--encouraging practices like TDD. In other words, being able to make an empirical case for the potential use of Agile practices that aren't currently in place, and then being able to show that they worked. This is a good way to bring about an Agile evolution grounded in facts relevant to the current organization and not just based on opinion or philosophy.

Tuesday, August 31, 2010
Testable System Architecture

At work we were having a discussion about how we wanted to do SSL termination for a particular web service. We had narrowed the possibilities down to doing hardware SSL termination in our load balancer or doing software SSL termination in an Apache layer sitting in front of our web apps.

During the course of the conversation, we talked about factors like performance (would there be a noticeable effect on latency), capacity (were we already CPU bound on the servers that would run the Apaches), maintainability (is it easier to update configs on a single load balancer or to script config changes across a cluster with 40+ servers), cost (how much does the SSL card cost), and scalability (will we be able to expand the solution out to higher traffic levels easily).

I think this was a pretty typical example of taking a reasoned approach to system design and trying to cover all the potential points of view. However, it ended up that we left a big one off: testability.

The business rules about which URLs need to be SSL terminated and which ones don't (or shouldn't) have to be encoded somewhere, and since we'd already ruled out doing the SSL termination in the application itself for other reasons, that means they'd be encoded in either a load balancer config or an Apache config. Which one of these is easier to get under automated test on a developer workstation? For an agile shop where quality and time-to-market are of primary importance, this is a question we can't forget to ask when designing our system architecture.

Friday, August 13, 2010
RESTful Refactor: Combine Resources

I've been spending a lot of time thinking about RESTful web services, particularly hypermedia APIs, and I've started to discover several design patterns as I've begun to play around with these in code. Today, I want to talk about the granularity of resources, which is roughly "how much stuff shows up at a single resource". Generally speaking, RESTful architectures work better with coarser-grained resources, i.e., transferring more stuff in one response, and I'll walk through an example of that in this article.

Now, in my previous article, I suggested taking each domain object (or collection of domain objects) and making it a resource with an assigned URL. While following this path (along with the other guidelines mentioned) does get you to a RESTful architecture, it may not always be an optimal one, and you may want to refactor your API to improve it.

Let's take, for example, the canonical and oversimplified "list of favorite things" web service. There are potentially two resource types:

  • a favorite thing (/favorites/{id})
  • a list of favorite things (/favorites)

All well and good, and I can model all sorts of actions here:

  • adding a new favorite: POST to /favorites
  • removing a favorite: DELETE to the specific /favorites/{id}
  • editing a favorite: PUT to the specific /favorites/{id}
  • getting the full list: GET to /favorites

Fully RESTful, great. However, let's think about cache semantics, particularly the cache semantics we should assign to the GET to /favorites. This is probably the most common request we'd have to serve, and in fact it ought to be quite cacheable, as in practice (as with a lot of user-maintained preferences or data) there are going to be lots of read accesses between writes.

There's a problem here, though: some of the actions that would cause an update to the list don't operate on the list's URL (namely, editing a single entry or deleting an entry). This means an intermediary HTTP cache won't invalidate the cache entry for the list when those updates happen. If we want a subsequent fetch of the list by a user to reflect an immediate update, we either have to put 'Cache-Control: max-age=0' on the list and require validation on each access, or we need the client to remember to send 'Cache-Control: no-cache' when fetching a list after an update.

Putting 'Cache-Control: max-age=0' on the list resource really seems a shame; most RESTful APIs are set up to cross WAN links, and so you may be paying most of the latency of a full fetch that returned a 200 OK even if you are getting a 304 Not Modified response, especially if you have fine-grained resources that don't have a lot of data (and a textual list of 10 or so favorite items isn't a lot of data!).

Requiring the client to send 'Cache-Control: no-cache' is also problematic: the cache semantics of the resources are really supposed to be the server's concern, yet we are relying on the client to understand something extra about the relationship between various resources and their caching semantics. This is a road that leads to tight coupling between client and server, thus throwing away one of the really useful properties of a REST architecture: allowing the server and client to evolve largely independently.

Instead, let me offer the following rule of thumb: if a change to one resource should cause a cache invalidation of another resource, maybe they shouldn't be separate resources. I'll call this a "RESTful refactoring": Combining Resources.

In our case, I would suggest that we only need one resource:

  • the list of favorites

We can still model all of our actions:

  • adding a new favorite: PUT to /favorites a list containing the new item
  • removing a favorite: PUT to /favorites a new list with the offending item removed
  • editing a favorite: PUT to /favorites a list containing an updated item
  • getting the full list: GET to /favorites

But now, I can put a much longer cache timeout on the /favorites resource, because if a client does something to change its state, it will do a PUT to /favorites, invalidating its own cache (assuming the client has its own non-shared/private cache). If the resource represents a user-specific list, then I can probably set the cache timeout considering:

  • how long am I willing to wait for another user to see the results of this user's updates?
  • if the same user accesses the resource from a different computer, how long am I willing to allow those two views to stay out of sync (bearing in mind that the user can usually, and pretty intuitively, hit refresh on a browser page that looks out of date)?

Probably these values are a lot larger than the zero seconds we were using via 'Cache-Control: max-age=0'. When you can figure out how to assign longer expiration times to your responses, you can get a much bigger win for performance and scale. While revalidating a cached response is probably faster than fetching the resource anew, not having to send a request at all to the origin is waaaaaaay better.
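To make the combined-resource version concrete, here is a hedged sketch of the single /favorites resource using JAX-RS (the framework choice and all the names are my own illustration; the post doesn't prescribe an implementation):

    import java.util.List;

    import javax.ws.rs.Consumes;
    import javax.ws.rs.GET;
    import javax.ws.rs.PUT;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.CacheControl;
    import javax.ws.rs.core.Response;

    // Hedged sketch: one coarse-grained resource for the whole favorites list.
    // Reads get a generous max-age; a write is a PUT of the complete new list
    // to the same URL, so caches along the write path invalidate naturally.
    @Path("/favorites")
    public class FavoritesResource {

        private final FavoritesStore store; // hypothetical backing store

        public FavoritesResource(FavoritesStore store) {
            this.store = store;
        }

        @GET
        @Produces("application/json")
        public Response getFavorites() {
            CacheControl cc = new CacheControl();
            cc.setMaxAge(300); // five minutes: tune to how stale a cross-device view may be
            return Response.ok(store.load()).cacheControl(cc).build();
        }

        @PUT
        @Consumes("application/json")
        public Response replaceFavorites(List<String> newList) {
            store.save(newList);
            return Response.noContent().build();
        }

        // Hypothetical store interface, just to keep the sketch self-contained.
        public interface FavoritesStore {
            List<String> load();
            void save(List<String> favorites);
        }
    }

The point is simply that the only write path is a PUT to the same URL the reads come from, so standard HTTP cache invalidation does the right thing without any special knowledge on the client's part.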

The extreme case, here, of course, would be a web service where a user could just get all their "stuff" in one big blob with one request (as we modelled above). There are many domains where this is quite possible, and when you factor in gzip encoding, you can start to contemplate pushing around quite verbose documents, which can be a big win assuming your server can render the response reasonably quickly.

Wednesday, August 11, 2010
Thoughts on Hypermedia APIs

The REST architectural style is defined in Roy Fielding's thesis, primarily chapter 5, where the style is described as a set of architectural constraints. A quick summary of these constraints is:

  • client-server: The system is divided into client and server portions.
  • stateless: Each request from client to server must contain all of the information necessary to understand the request.
  • cache: Response data is implicitly or explicitly marked as cacheable or non-cacheable.
  • uniform interface: All interactions through the system happen via a standard, common interface. This is achieved by adhering to four sub-constraints:
      • identification of resources: Domain objects are assigned resource identifiers (e.g. URIs).
      • manipulation via representations: Actions occur by exchanging representations of current or intended resource state.
      • self-descriptive messages: Messages include control data (e.g. cache-related), resource metadata (e.g. alternates), and representation metadata (e.g. media type) in addition to a representation itself.
      • hypermedia as the engine of application state: Clients move from one state to the next by selecting and following state transitions described in the current set of representations.
  • layered system: Components can only "see" the component with which they are directly interacting.
  • code-on-demand (optional): Clients can be dynamically extended by downloading and running code.

Achieving a RESTful architecture with XHTML

Mike Amundsen proposed using XHTML as a media-type of choice for web APIs rather than the ubiquitous Atom (or other application-specific XML) or JSON representations commonly seen. By using XHTML profiles, we are able to define the semantics of the data contained within a particular document, as well as the semantics of contained link relations and form types.

Now, let's throw a few simple rules into the system:

  1. all domain objects (including collections of domain objects) are resources and get assigned a URL
  2. beyond an HTTP GET to the API's "home page", a client simply follows standard XHTML semantics from returned documents; namely, doing a GET to follow a link, and constructing a GET or POST request by filling out and submitting a form.
  3. retrieval (read) of resource state should be accomplished by GET, and modification of resource state should happen with POST (via a form).

Interestingly, this means that in addition to programmatic clients being able to parse XHTML (as a subset of XML) and apply standard XHTML semantics for interactions, it is possible for a human to use a browser to interact with the resources (or, as my colleague Karl Martino put it, "you can surf an API!").

Evaluation

So how well does this match up against the REST constraints? By leveraging HTTP directly as an application protocol, we can get a lot of constraints for free, namely: client-server, statelessness, caching, layered system, and self-descriptive messages.

Now, we also get a uniform interface, because all of our domain objects are modelled as resources with identifiers, reads are accomplished by retrieving XHTML documents as representations, and writes are accomplished by sending form-encoded inputs as representations. Finally, because a client accomplishes its goals by "clicking links and submitting forms", the hypermedia features of XHTML let us model the available state transitions to the client, who can then select what to do next and know how to follow one of the available transitions. Also, because an update to a resource is modelled as a POST to the same URL we would use to GET its state, this plays nicely and naturally with standard HTTP/1.1 cache semantics (invalidation on write-through).

Finally, we're not using code-on-demand, in our case, although we could include Javascript with our XHTML representations to provide additional functionality for that human "surfing" our API, even if a programmatic client would ignore the Javascript. However, code-on-demand is listed as an optional constraint anyway.

Coming soon...

This is an intentionally high-level post that I'm intending will be the first in a series of posts that go over specific examples and examine some practical considerations and implementation patterns that are useful. Hopefully, we'll also be able to illustrate some of the architectural strengths and weaknesses that the REST architectural style is purported to have. Stay tuned!

Friday, March 26, 2010
Why Lab Week is So...Awesome

At work, once a quarter, we have "lab week": folks are allowed to form groups and work on self-directed projects. We usually finish up at the end of the week with a "science fair" where folks set up posters and demos of what they've worked on for the week (we even had cookies and lemonade at today's science fair!). I am always amazed at the amount of innovation and progress that comes out of these weeks; in some ways it outshines what our organization manages to do over the rest of the quarter. In this post I'd like to reflect a bit on what makes lab week so awesome.

Merit-based Project Ideas

Where we work, you have to actually recruit for lab week. This means that if you have an idea, you typically have to pitch it to enough people to get them to join your team. We actually set up special lunchtime meetings just for "Pitch Day". Ultimately this means that the projects that get worked on are the most innovative--because those are the most exciting ones and the easiest to recruit for. This is wisdom of the crowds at its finest; rather than having a select and small group of management identify a roadmap, it's a free-for-all where anyone can submit an idea and the best ones attract teams and get worked on. [Ed: this is not to say that management isn't needed to guide the execution and selection of ideas, just that they need not be the only source for idea generation]

Self-Selecting Teams

Again, due to the recruiting nature of lab week, groups are self-forming. As I think about it, it's amazing how well the group dynamics and team composition work out. I just re-read the section of Malcolm Gladwell's Tipping Point where he talks about Dunbar's number of 150: in a group smaller than that, you know your relationship to everyone as well as everyone's relationship to each other. In practice, this means when you are recruiting for lab week, you are consciously and unconsciously choosing folks that will bring the needed skills and experience to your project in a way that's compatible with the rest of the group.

When I look at it this way, it's not surprising that lab week teams gel very quickly and immediately start working together well. When you have responsibility for deciding who you work with, you end up wanting to work with your team. The group dynamics just sort themselves out effortlessly. Apparently Gore (makers of Gore-Tex) organize their high-tech development in this fashion.

Ownership and Buy-in

With a self-selected team and a self-selected project, folks on a lab week team are implicitly engaged in what they do, because it is their work. They own it, front to back, and pour their effort into it. You can see the pride in the demos, in the cute homemade posters (it does bear a striking resemblance to a stereotypical grade school science fair!).

On the well-known Gallup Q12 survey for how engaged your employees are, lab week covers: knowing what's expected of you (set by you!), having the needed raw supplies (often because the work is chosen with an eye to the possible), having an opportunity to do what you do best (self-selected projects), having your opinions count (recruiting, self-organizing teams), having dedicated teammates (self-selecting teams with ownership), having opportunities to learn (the whole point of lab week). That's six of the twelve right there. No wonder people willingly put in overtime on their lab week projects.

Timeboxed Exploration

Lab week lasts exactly a week. You don't have time to fully productionize what you do, and you have to focus. Ultimately this forces you to separate out all the chaff and focus on the real core of your idea and just deliver that, because that's all the time you have. This is the XP notion of Design Spikes, but in reality it's the Pareto Principle in full effect: do the 20% of the work that has the 80% impact.

Looked at another way, because the work effort is limited to a week, it is a way for the company to do rapid exploration with minimal risk or expense. I'd wager the company gets way more value in terms of idea creation during lab week than it misses out on from not doing "normal" work. It's clear the executives agree, because after each lab week, they agree to have the next one!

Thinking Outside the Box

Anything goes during lab week; this is a chance for folks to play with new technologies (it seems there is never a shortage of new technologies or of engineers who want to tinker with them) or practices. Our group used CRC Design Cards and did full-on TDD for our project, a first for many of us, and while I think we built a pretty cool project, the main benefit (echoed by group members) was what we learned about development this week. We ran a Cobertura report at the end of lab week and discovered we had 91% branch coverage in our code and an average cyclomatic complexity of 3 (without measuring during the week or even shooting for particular measures here). On a lab week project, no less. Wild.

In many cases, I think the science fair shouldn't be "What we built during lab week" so much as a presentation of "What we learned during lab week", which I suspect is actually the majority of the value offered to the company and the participants.

Permission to Experiment

Finally, it is clear that lab week is a no-pressure situation. There are no contracts to fulfill, no launch deadlines to meet (other than the science fair, I guess!), and you don't have to get approval from any more people than it takes to form a team. Almost anything goes (I have heard that underwater basketweaving is off limits, but not much else). If it doesn't work out, no problem, you go back to your "day job" next week, until the next lab week rolls around. There is nothing like this kind of no-risk environment (even the name "lab week" is suggestive) to foster creativity.

In Summary

Someone remarked at how much positive energy permeated the science fair. It was fun, and there were a lot of really cool ideas. People get into lab week where I work, and it makes the experience...awesome.

Friday, March 19, 2010
Agile Architecture Kanban

We've recently spun up a new software architecture group at work, and at least some of what the architects are expected to do is provide "consulting" services: providing feedback on technical designs and approaches, doing technical research, providing technical opinions to product managers, etc. Since many of these are similarly sized, and "cycle time" for getting a response to our clients is an important metric, we opted to manage this work using a kanban system.

After a month-long iteration, we stopped to take a look at some of the data we had collected. We were able to produce a statistical process control chart, indicating our cycle time in business days (measuring the time between when a customer asked for something to be added to our consulting backlog and the time when we finished it), something like this one:

This shows our average cycle time was around 6 days, and that our process was under statistical control; all samples were less than the upper control limit (red line) at 11 days (3 standard deviations above the average). This means that we had a relatively predictable process. Now, at the same time, we were able to produce a cumulative flow diagram, like this one:

which showed the number of consulting "stories" in each state of the workflow. One of the things we were able to derive is the average arrival rate for the stories, by finding the slope of the line between the starting and ending points on the "ready" line. We were also able to find our average throughput by finding the similar slope between the starting and ending points of the "done" line. What we found (and which you can see on the graph), was that the request rate was higher than our throughput (by about 0.2 stories per day), which resulted in a slowly but persistently growing backlog. Now, we happened to measure our average cycle time about halfway through the month, and found that it was 4.5 instead of 6 back then. In the ten business days between measurements, our average cycle time went up by around the amount our backlog length grew, as predicted by the difference between our customers' request rate and our service rate.
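For illustration, here's the arithmetic from the paragraph above in code form. The 0.2 stories/day gap, the ten business days, and the roughly 1.5-day cycle time increase come from the measurements described; the absolute arrival and completion rates are my guesses, picked only so the numbers line up:

    // Hedged sketch of the backlog-drift arithmetic; the absolute rates are invented.
    public class BacklogDrift {
        public static void main(String[] args) {
            double arrivalRatePerDay = 1.5;   // slope of the "ready" line (stories/day) - assumed
            double throughputPerDay = 1.3;    // slope of the "done" line (stories/day) - assumed
            int businessDays = 10;            // time between the two cycle-time measurements

            double backlogGrowth = (arrivalRatePerDay - throughputPerDay) * businessDays; // extra queued stories
            double cycleTimeIncrease = backlogGrowth / throughputPerDay;                  // Little's Law: R = N/X

            System.out.printf("backlog grows by about %.1f stories; cycle time rises by about %.1f days%n",
                    backlogGrowth, cycleTimeIncrease);
        }
    }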

It would appear even architects are subject to queuing theory.

Going forward, in order to remain responsive to our clients (many of our engineering teams run two-week sprints, so we wanted to shoot for an average cycle time of 3 days), we realized we were going to have to limit the size of our backlog. In other words, we were going to have to essentially issue a 503 (Service Unavailable) response to some of our clients: simply not take their request onto our backlog and ask them to come back later, so as to remain responsive to our other customers. Just like we'd do in a web application server that was overloaded. Perhaps we'll even develop a cute picture of a flying aquatic mammal to try to soften the "not yets" we'll have to start handing out.