Posted in May 2012

Why are projects late?

It’s this time of the week again, time for another episode of (drumroll) SD Action!

Last time I introduced a basic project management model; this time, let’s look at what this baby can do.

Let our base project be a project with 100 tasks. The team size is 200 people, each of whom can accomplish 0.005 tasks per week. This leads to… Oh, I don’t know. Here’s a graph:

Yup, the amount of work to be done (see the previous post for the model framework) goes down at a steady rate and the project is done by the one hundredth week. Nice. I can hear the more experienced project managers go “yeah, right!” Nothing ever goes that smoothly, people make mistakes! You’re supposed to add buffers and such; 30% is the standard practice.

Hm, let’s see what happens if we allow people to make mistakes. In the model, this amounts to a 20% chance that a task needs to be re-done, with the rework generation and discovery flows kicking in. Given the one-in-five chance of a mistake, how much should we add to the project duration? 20%, right? Not exactly. You see, you might make mistakes in the bug fixes as well… You guessed it, here’s a graph:

What kind of sorcery is this? The project duration did not grow by 20%, nor even by 30%. It grew by 110%! Blimey, we just missed our deadline.
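To see why 20% of mistakes does not mean 20% of delay, here is a minimal discrete-time sketch of the loop. To be clear, this is my simplification, not the actual model from the previous post, and the 12-week discovery delay is an assumption; but it shows the compounding: mistakes made while fixing mistakes push the finish line out much further than the raw error rate suggests.

```python
# A minimal sketch of the rework loop -- not the real model.
# The 12-week rework discovery delay is an assumed parameter.

def project_weeks(total_tasks=100.0, rate=1.0, error_rate=0.2,
                  discovery_delay=12):
    """Weeks until (almost) no work remains, with a rework loop."""
    work_to_do = total_tasks
    pipeline = [0.0] * discovery_delay  # flawed work awaiting discovery
    week = 0
    while work_to_do + sum(pipeline) > 0.01 and week < 10_000:
        week += 1
        completed = min(rate, work_to_do)        # 200 people x 0.005 tasks/week
        work_to_do -= completed
        pipeline.append(completed * error_rate)  # mistakes, incl. on rework itself
        work_to_do += pipeline.pop(0)            # rework discovered this week
    return week
```

With a zero error rate this returns exactly 100 weeks; with the 20% error rate, the loop runs well past the naive 120-week estimate, because every round of rework spawns a (smaller) round of its own.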

Oh well. Sure. Mistakes happen. But what if the mistakes are costly, generating more work to remove the previously done stuff? Remember the example of having to chip out old concrete before pouring new. Here we go:

Yes, this added another 55 weeks to the project. That’s one year, just from allowing mistakes to cause additional work. Of course, the relationships are more subtle, but they are way too geeky to explain here. The deconstruction rate depends on how much of the project is done: it is 0 for about the first 50% and grows to 1 as the project progresses (in the later phases, as much effort goes into deconstruction as into rework). These assumptions are probably different in your field, but in my world, one year got added to the project by making the fairly reasonable assumption that mistakes cost effort.
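The shape of that deconstruction ramp fits in a couple of lines. The linear ramp below is my guess at a curve matching the description — zero until the project is half done, rising to one at completion — and the real model may well use a different shape:

```python
def deconstruction_fraction(fraction_done):
    """Share of rework effort spent tearing out old work.

    Zero for the first half of the project, then an assumed linear
    ramp up to 1.0 at completion (as much deconstruction as rework).
    """
    if fraction_done <= 0.5:
        return 0.0
    return min(1.0, (fraction_done - 0.5) / 0.5)
```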

As said earlier, the team size is 200 people. Given that at this point we are looking at a five-year project, it would be reasonable to assume there is employee churn. Of course, newcomers must learn the ropes before they can be productive and, in fact, the entire team starts out this way, with about half the usual productivity. Let’s assume 10% annual employee churn, that hiring starts immediately to replace the leavers (6 weeks to fill a position on average) and that it takes four weeks to get acquainted with the project.

This is actually not half bad, we lose only 5 weeks or so. It turns out that 10% churn in a 200-person team is not a big deal. What is curious, though, is that most of the lag is caused by the fact that the team size actually goes down. How come? You see, given the parameters, attrition turns out to be faster than hiring. People leave until the annual churn drops to the same level as hiring, and the model stabilizes there. In our case, this means there are 195 productive people, 3 people constantly in incubation and 2 simply lost. This is where system dynamics modeling excels: solving this symbolically would have involved constructing and solving a system of differential equations, but I just drew a couple of boxes and pressed a button.
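The equilibrium is easy to check with a tiny stock-and-flow sketch of the staffing loop. The three stocks below (productive people, rookies in incubation, open vacancies) follow the description, but this is my simplification, so it settles near — not exactly at — the 195/3/2 split quoted above:

```python
# A minimal staffing sketch with assumed first-order delays -- not the
# real model, which settles at slightly different numbers.

def steady_state_team(weeks=520, size=200, annual_churn=0.10,
                      hire_delay=6.0, onboard_delay=4.0):
    """Return (productive, rookies, vacancies) after `weeks` weeks."""
    churn_weekly = annual_churn / 52.0
    productive, rookies, vacancies = float(size), 0.0, 0.0
    for _ in range(weeks):
        leaving = productive * churn_weekly     # experienced people quit
        hiring = vacancies / hire_delay         # ~6 weeks to fill a position
        onboarding = rookies / onboard_delay    # ~4 weeks of incubation
        productive += onboarding - leaving
        rookies += hiring - onboarding
        vacancies += leaving - hiring
    return productive, rookies, vacancies
```

Run it for ten years and the headcount settles a handful of people below 200: the vacancy and incubation stocks are never empty, so a few positions are permanently “lost” to the pipeline.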

How many of you have spotted a fatal flaw in the model? You did? Right! Give the gal a cookie!

Let’s give others a moment, shall we…

Yes, right. The thing is that the current model assumes testing starts immediately. The moment anybody writes a line of code or draws a line, it gets tested and, after a while, possible mistakes end up back in the work queue. Unfortunately, this is not how things happen in many cases.

Let’s take construction. Firstly, the architect dreams up a house. Then a bunch of engineers figure out the structure of the thing. Then people come and work on pipes, ventilation and drains. And finally somebody devises a loom of electric wires. And then people go and start building it, only to discover that a ventilation duct must pass directly through a structural beam. And a cable ladder crosses a flight of stairs. At about chest height. Bummer. With the way construction is done in this country, I’m assured, there are very few ways to discover such mistakes before construction actually begins. In our model, I’ve made it so that there is no rework discovery until about a third of the way through the project; then everything proceeds normally. This is how it goes:

Sweet mother of baby Jesus! 80 weeks! Of course I’m overdoing things a bit. Some testing does happen earlier. True. But the current model does not account for any customer spec changes or for any risks materializing so, broadly speaking, the order of magnitude – about 30% – should be in the ballpark. What is worse, though, is this:

The graph shows the ratio of the percentage of work actually done to the percentage of work believed to be done. In all the other cases it peaks pretty early on and starts declining nicely, but with late testing it remains very high until very late. For a project manager this means they have no idea whatsoever how the project is progressing. Which is a Bad Thing™.
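Both effects – the longer schedule and the blind spot – show up even in a crude sketch where rework discovery is simply switched off until a third of the work is believed done. Again, this is my simplification (first-order discovery with an assumed 12-week delay), not the real model, so the exact numbers differ:

```python
# A crude late-testing sketch with assumed parameters -- not the real model.

def late_testing(total=100.0, rate=1.0, error_rate=0.2,
                 discovery_delay=12.0, testing_starts=0.0):
    """Return (weeks_to_finish, peak_undiscovered_rework).

    testing_starts: fraction of work believed done before testing begins.
    """
    work_to_do, undiscovered, believed_done = total, 0.0, 0.0
    peak, week = 0.0, 0
    while work_to_do + undiscovered > 0.01 and week < 10_000:
        week += 1
        completed = min(rate, work_to_do)
        work_to_do -= completed
        believed_done += completed              # the PM thinks this is done
        undiscovered += completed * error_rate
        peak = max(peak, undiscovered)
        if believed_done >= testing_starts * total:
            found = undiscovered / discovery_delay   # first-order discovery
            undiscovered -= found
            work_to_do += found
    return week, peak
```

In this sketch, gating discovery until a third of the way in more than doubles the peak stock of undiscovered rework – which is precisely the project manager’s blind spot – and can only stretch the schedule.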

Let’s recap. By adding only four simple aspects of project behavior, our project has grown 350% in the worst case and about 250% with sensible testing behavior. And we still have not talked about risks or awkward acceptance tests or multiple contractors or, or… Oh God.

See, this is why projects are late. Project managers are faced with dynamically complex systems that can go off on wild tangents for any reason, and usually they only have their gut to rely on. Of course, being under deadline pressure and lacking concrete evidence, they give in and promise these 100 weeks or possibly 150. Well, they should go and simulate their project model and see what comes out the other end. In short, they should observe System Dynamics in Action!


On managing projects

Projects go wrong. They often do. They tend to go wrong inexplicably, when everything was just about done. They go wrong by orders of magnitude; we hear about a massive project costing billions being closed every other week. What the hell? How come? I mean, these people get paid, and get paid well, and they still can’t manage a project to be on time, on budget and bang on functionally?

Well, I guess they just can’t help it. The reason, as pointed out previously, is that humans are notoriously bad at predicting the behavior of even simple dynamic systems, let alone a billion-dollar 3-year project involving thousands of people in tens of companies.

And you all know what’s coming now. SD to the rescue! Simulate!

This and, I suspect, a couple of the following posts will be on project management, so it makes sense to establish the basics before plunging into modeling details. This is the basic model structure we’ll be using:

The model assumes that there is a set of work to do and that the work is divided into tasks. The tasks might be writing code or digging holes, it doesn’t matter. The main thing is that work flows out of the “Work to do” box towards two others: “Work done” and “Undiscovered rework”. You see, when you do something, it might be OK or it might need changing later. Because you messed up, because somebody else messed up, it does not matter. The main thing is that you don’t know in advance whether your work is indeed correctly done or needs to be re-done. That undiscovered rework flows back into work to do via the process of rework discovery. Which, for us software folks, is simply called testing. We go “oh, dang” and more work appears on the todo list. Finally, there is a stream flowing into undiscovered rework called “Deconstruction work”. This one accounts for the need to demolish the incorrectly done work. When you pour 200 square feet of concrete incorrectly, you need to bang it into tiny pieces with hammers before new concrete can be poured. That sort of thing.
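To make the plumbing concrete, here is what one time step of that structure looks like in code. The stock names follow the description above; the flow values (completion, discovery, deconstruction) are plain parameters here because, in the full model, they would be computed from staffing, testing policy and so on. A sketch of the bookkeeping only:

```python
# One Euler step of the stock-and-flow skeleton described above.
# Flow rates are placeholders; the full model computes them dynamically.

def step(stocks, completion, error_fraction, discovery, deconstruction, dt=1.0):
    """Advance the three stocks by one time step of `dt` weeks."""
    correct = completion * (1.0 - error_fraction)   # work done right
    flawed = completion * error_fraction            # will need re-doing
    stocks["work_to_do"] += (discovery - completion) * dt
    stocks["work_done"] += correct * dt
    stocks["undiscovered_rework"] += (flawed + deconstruction - discovery) * dt
    return stocks
```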

Of course, the model as depicted is just a scaffolding. The whole model (based on schoolwork in certain MIT courses but heavily modified) is too complex to go into in detail here, but the surrounding details can be roughly divided into the following parts:

  • Scope changes like scope creep, the customer changing their mind etc. These mainly influence the “Work to do” box
  • Personnel issues like employee turnover, staffing decisions and such. These impact the rate at which work flows out of the “Work to do” box. In trade magazines, this is called “productivity”
  • Rework discovery and impact. When and how testing happens and what the nature of the discovered bugs is, including the amount, extent and dynamics of deconstruction work

At this point you should be going “dang, this is complex”. You are? Good. Because projects can be incredibly complex. For one, it is not trivial to estimate the actual amount of time a project will consume. Even if you know the estimates for the individual factors, the math is non-trivial and you’d be unlikely to be right on the money based on gut feeling alone.

We’re going to go into some more detail in the following posts, but here are some things this sort of modeling can do for you:

  • Deadline and resource estimates for large projects. Given the process model of your project, the conditions, the project size etc., what are the work estimates and load dynamics going to be?
  • Process optimization. What happens if we change our development process? What happens if we start testing earlier? What happens if we start doing regular instead of continuous deliveries? Changes in staffing policies?
  • What-if analysis. Given our current project management framework, what happens if half of the team leaves? Customer adds a ton of new requests?
  • Root cause analysis. Our project went like so. OH GOD! WHAT HAPPENED? Model your process, make the result match your project and see if the results improve if you change the policies

Project management is one of the areas where system dynamics has the most immediate and tangible practical application, so the next couple of weeks are going to be interesting!

Alright, that’s all for now. Take care and observe System Dynamics in Action!


Viral growth and population size

I did it again. Oops.

Previously I made a statement about how infection spreads faster in a small community. It turns out that I was wrong. I ran the simulation with the estimated total number of internet users in the world and in Estonia and, blimey, the numbers came out the same – in both cases, the growth rate is the same up to the point of explosion. When that happens, the larger population naturally grows faster and flatlines at a higher absolute number (but, curiously, at the same relative percentage). This comes down to the fact that while a smaller population increases the probability of a contact being with an infected person, it also decreases the contact frequency by the same fraction.
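You can check this with a few lines of Euler integration over the infection loop. This is my own quick sketch, reusing the kitty-video numbers from the previous post (20 contacts a day, 2% infectivity, 3-day infectious period); the population figures in the test are round guesses, not measured values:

```python
# Quick Euler sketch of the infection loop: recovered people become
# susceptible again. Parameter values are assumptions, not measurements.

def final_infected_fraction(population, days=800, dt=0.25,
                            contacts=20.0, infectivity=0.02,
                            infectious_days=3.0):
    """Fraction of the population infected at the end of the run."""
    infected = 1.0
    for _ in range(int(days / dt)):
        susceptible = population - infected
        catching = contacts * infectivity * infected * susceptible / population
        recovering = infected / infectious_days
        infected += (catching - recovering) * dt
    return infected / population
```

Run it for an Estonia-sized and a world-sized population and both flatline at the same roughly one-in-six fraction; only the absolute numbers differ.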

And yet, I can’t help but feel that population size should matter. Our social networks are not uniform; there are clear clusters of people. There is also clear empirical evidence, with Facebook growing in small communities first and obscure countries like Estonia being over-represented in places like Orkut. But the model presented in the previous post is pretty damn irrefutable: anything I can think of is basically an extension of the same model and thus likely to exhibit the same behavior. The reason, I believe, is the infectivity rate. I think a network can go exponential locally. If the particular way a network implements infectivity matches the particular mode of social interaction within a community, the infectivity rate between the members of that community goes up and the network gains speed. And this, of course, is easier to do locally than globally. One might have a very particular idea about what exactly Estonian teenagers are thinking about right now, but it is much harder to devise something that appeals to Estonian, British and US teenagers as well as Chinese PhD students. Of course, just supporting Estonian teenagers is not enough, and copying the system to Latvia is not going to work either: you’d lose all the mass pull effect of the established community while also risking getting the infectivity parameters wrong.

So there you go, one of the answers to MJ’s question “how do we detect things that are about to explode before they do, so we can get filthy rich” is as follows. You look for things that cater to a local target group very well, in a way that is scalable without copying. Colleges, kindergartens, office environments. Things like that.

One other thing that is clear from the model presented previously, but which I did not point out, is the absence of the concept of an “active user”. Active users do not matter. You might have a hundred million people coming to your site to look at stuff, but unless they actually create something provoking a comment, begging for a “like” or asking for support in a mob war, there is not going to be another hundred mill.

So there’s a second answer to MJ’s question. Look for companies that do not track active users but track contributing users instead. Look for ways a company makes it easy to get infected without triggering the subject to reach for a face mask (very infectious diseases fail to spread because they become known, and the number of contacts goes down because of public awareness). Look for companies that understand that distinction.

Alright, that’s all for now. Take care and observe System Dynamics in Action!

On virality. And kitties.

It’s Friday again and thus time for a new post. I promised to discuss how to recognize apps that are about to go exponential, and some of that thinking will fix the issues that have been pointed out with my FB model (users never churning but rather becoming less active over time, for example). But before we can get to that, let’s talk about what viral growth is. To do this, I’ll resort to a home assignment from last summer’s ESD.74 class (taught by professor J. Bradley Morrison at MIT), simply because it covers the very topic. Viruses. I’m using this model mainly because whatever I’d come up with would look exactly like it anyway, simply because of me having taken that course. There you go.

The model looks like so:

The three main stocks are pretty self-explanatory, the only tricksy part being that the “Recovery Rate” is both the speed at which people recover from SARS and the speed at which the population of susceptible people grows. Simply put, people who recover become susceptible again; there is no immunity. Otherwise the model is based on the idea that infection happens when an infectious person has any sort of contact with a healthy person, and the number of such events depends on the likelihood of the person being contacted being healthy and on the number of contacts people have.

The model has the following key variables:

  • Infectivity, that defines the likelihood of a contact leading to infection
  • Total population defining the scale of the problem we are observing
  • Contact frequency determining the number of contacts each person has on average per unit of time
  • Recovery rate that defines the speed at which people stop being infectious and, inversely, the average time during which people can infect others

In order to put all this in a better context, let’s use an example. An example, hm. Wait a minute, a friend just sent me a note and there’s a link in there. Click. Oh, a KITTY! OMG!1!!! Hilarious! Gotta share it…. Nope, back to blogposting. And there’s our example.

  • Infectivity now means the probability that whoever sees the video I shared is going to re-share it
  • Total population is the size of my world, i.e. the number of people on the Internet that are in my n-th degree friend cloud, with n being reasonably small. This, today, is effectively the entire internet population
  • Contact frequency is the number of times a person on average sees something shared by their friends in any medium
  • Recovery rate is slightly more complex. It is not simply the period during which I send out the links; it is the period during which people are likely to see your link. In the FB context, this depends on how long your link remains visible in the news feed before it gets buried under new stuff, as well as on how long you keep sharing the video

Alrighty, let’s take it away.

Let our base case be 2% infectivity (a surprisingly small percentage of people finds kitties amusing. I don’t get these people), everybody seeing an average of 20 kitty videos a day and let the video be “infectious” for 3 days. All fairly reasonable numbers.

What is remarkable is that for the first 150 days nothing really happens. But during the next 150 all hell breaks loose and we hit 90 million cases. This is one of the reasons these things come as a surprise: the explosion takes time to ramp up and people tend to forget about a kitty video pretty fast.

Let’s now consider what would happen if our video was really, really cool. What’s cooler than a kitty? Two kitties. Fighting. To the music of Burzum. This would amount to an infectivity of, oh, I don’t know, 3 per cent? Sounds reasonable? A graph showing the current number of infected people rather than the total tells the story:

Oh, bummer. Not only does the explosion happen five times sooner, it is also much quicker and results in three times the infected population!

This is scary, right? As an antidote, let’s see what happens should infectivity be only slightly lower, 1.8% instead of 2. As a result of, say, the subject of your share failing to mention kitties. Or Burzum.

I actually had to resort to Excel to get this one to even show properly. Yes, the vertical axis is now a log scale, and while our base case dealt with millions of people, the video with a slightly dull title deals with thousands. A difference of 0.2 percentage points in infectivity yields three orders of magnitude worse performance. By the way, isn’t it cool how the log scale makes the curves straight, outlining the exponential nature of the phenomenon we are observing?

I won’t burden you with another graph, but the effects of playing with the infectious period are almost as severe: 2.8 days instead of 3 gives about 31 thousand instead of 3.6 million users within our period.
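Both sensitivities come down to the same thing: early on, the infected count grows at a net rate of contacts × infectivity − 1/duration per day, and that rate sits in an exponent, so tiny changes compound. A quick Euler sketch — my own simplification of the model, with assumed round numbers, so the exact figures differ from the graphs — makes the gap visible:

```python
# My own Euler sketch of the model with assumed round numbers.

def infected_after(days, contacts=20.0, infectivity=0.02,
                   infectious_days=3.0, population=1e9, dt=0.25):
    """People currently 'infected' after a given number of days."""
    infected = 1.0
    for _ in range(int(days / dt)):
        susceptible = population - infected
        catching = contacts * infectivity * infected * susceptible / population
        infected += (catching - infected / infectious_days) * dt
    return infected

base = infected_after(300)                        # 2.0% infectivity, 3 days
dull = infected_after(300, infectivity=0.018)     # slightly duller title
short = infected_after(300, infectious_days=2.8)  # buried a bit sooner
```

After 300 days, the base case is saturating in the hundred-million range while the 1.8% variant is still stuck in the thousands – a similar orders-of-magnitude gap to the one on the log-scale graph.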

Obviously my goal was not to educate you guys on kitties; I had a couple of points to make. And lest my storytelling skills foil this mission, let’s lay them out nice and straight:

  • Exponents are weird! It is really hard to grasp how exponents work, especially in complex arrangements. The human mind is just not built for this.
  • Small lapses have a big impact. Even minor changes in how your viral campaign is set up will have a massive impact on whether you’ll reach millions or hundreds of users. Note that the differences I’ve played with here are almost within the statistical margin of error. Hence, even if you do everything perfectly but your estimate of how long your video will be visible is just a wee bit off, the campaign is going to be a big flop. Or, a moderately well-executed campaign might explode simply because the stars were aligned just right

Alright, that’s enough SD Action!™ for today. See you next week!