Thursday, September 29, 2011

small data and root cause analysis

Sometimes it's not the "big data" that you need to solve the problem at hand. Sometimes you need small, focused, local data and a whole lot of smarts about the analysis.

Once upon a time, in a galaxy far away, I was responsible for corporate quality. Metrics were popular at the time, but they were somewhat abstracted from the nitty-gritty of actual customer problems. Well, we had a very unhappy customer. "Wicked unhappy," as we say in the depths of New England. They kept telling us our system was buggy and unreliable. Everybody thought it was a bunch of different software problems. Engineering and support management fixated on responding to bug reports quickly as they came in. Timely fixes are good, for sure, but there seemed to be something else going on.

I collected the raw data, that is, all the bug reports we'd gotten from that customer. I assembled the multitudes (support, quality, software engineering, hardware engineering). I drew a bunch of buckets on the whiteboard and we sorted each bug report into a bucket according to its root cause. Sometimes we had to do a root cause analysis to determine which bucket a report belonged in. Surprisingly, many seemingly unrelated bug reports stemmed from the same root cause. We counted how many were in each bucket. Suddenly we knew where to look.
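We did this with a whiteboard and markers, but the same bucket-and-count exercise can be sketched in a few lines of Python. This is only an illustration: the report IDs and every bucket name other than the disk-mirroring one are made up, and `tally_by_root_cause` is a hypothetical helper, not anything we actually ran.

```python
from collections import Counter

# Hypothetical data: each bug report paired with the root-cause bucket
# the review group assigned to it. IDs and most bucket names are invented.
reports = [
    ("BR-101", "disk mirroring"),
    ("BR-102", "installer"),
    ("BR-103", "disk mirroring"),
    ("BR-104", "network config"),
    ("BR-105", "disk mirroring"),
    ("BR-106", "network config"),
    ("BR-107", "disk mirroring"),
    ("BR-108", "user interface"),
]


def tally_by_root_cause(reports):
    """Count reports per bucket, largest bucket first."""
    counts = Counter(bucket for _report_id, bucket in reports)
    return counts.most_common()


tally = tally_by_root_cause(reports)
for bucket, count in tally:
    print(f"{bucket}: {count}")
```

The point of sorting by count is the same as on the whiteboard: the biggest bucket tells you where to look first.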

The numbers pointed to an area we hadn't considered to be an issue: the disk-mirroring hardware. Digging a little deeper into the data, there seemed to be a correlation with a particular supplier and with systems built during a certain window of time. Turns out we'd gotten a batch of defective controllers from our supplier. Their testing hadn't caught it, nor had our hardware testing. The problem only showed up when running the complex software on top of it all.
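The "digging a little deeper" step amounts to grouping systems by supplier and build window and comparing how often the fault shows up in each group. Here is a minimal sketch, assuming each system record carries a supplier, a build period, and a flag for whether it reported a mirroring fault. All of the field names and values are invented for illustration.

```python
from collections import Counter

# Hypothetical inventory records; suppliers, build periods, and fault
# flags are invented to illustrate the grouping, not real data.
systems = [
    {"supplier": "A", "built": "Q1", "mirror_fault": False},
    {"supplier": "A", "built": "Q1", "mirror_fault": False},
    {"supplier": "A", "built": "Q2", "mirror_fault": True},
    {"supplier": "A", "built": "Q2", "mirror_fault": True},
    {"supplier": "A", "built": "Q2", "mirror_fault": True},
    {"supplier": "B", "built": "Q2", "mirror_fault": False},
    {"supplier": "B", "built": "Q2", "mirror_fault": True},
]


def fault_rate_by_group(systems):
    """Return the fraction of faulty systems per (supplier, build period)."""
    totals = Counter()
    faults = Counter()
    for system in systems:
        key = (system["supplier"], system["built"])
        totals[key] += 1
        if system["mirror_fault"]:
            faults[key] += 1
    return {key: faults[key] / totals[key] for key in totals}


rates = fault_rate_by_group(systems)
# List the groups with the highest fault rate first.
for key, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(key, f"{rate:.0%}")
```

A group whose rate stands well above the others, like one supplier in one build window, is a candidate for a bad batch. It is still only a correlation until someone pulls the hardware and checks.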

Hardware replaced. Customer satisfied. Lesson learned. Sometimes data is just what you need to lead you in the direction of the cause.

1 comment:

  1. And sometimes, the least likely causes - or, as in your case, set of causes - are the culprits. More than that, it just goes to show that the best way to finally uproot a problem is to go beyond theorizing and actually explore ALL possibilities.

    Rigoberto Stokes