Thursday, September 29, 2011
small data and root cause analysis
Once upon a time, in a galaxy far away, I was responsible for corporate quality. Metrics were popular at the time, but they were somewhat abstracted from the nitty-gritty of actual customer problems. Well, we had a very unhappy customer. "Wicked unhappy," as we say in the depths of New England. They kept telling us our system was buggy and unreliable. Everybody thought it was a bunch of different software problems. Engineering and support management fixated on responding quickly to bug reports as they came in. Timely fixes are good for sure, but there seemed to be something else going on.
I collected the raw data, that is, all the bug reports we'd gotten from that customer. I assembled the multitudes (support, quality, software engineering, hardware engineering). I drew a bunch of buckets on the whiteboard, one per root cause, and we sorted each bug report into the bucket that matched its root cause. Sometimes we had to do a root cause analysis just to determine which bucket a report belonged in. Surprisingly, many seemingly unrelated bug reports stemmed from the same root cause. We counted how many reports were in each bucket. Suddenly we knew where to look.
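For anyone who'd rather not use a whiteboard, the bucket-and-count step can be sketched in a few lines of Python. This is only an illustration: the report IDs, the bucket names, and the `bucket_counts` helper are all made up for the example, not the actual data or tooling from this story.

```python
from collections import Counter

# Hypothetical bug reports: (report id, root cause assigned by the group).
# Both the ids and the bucket names are invented for illustration.
reports = [
    ("BR-101", "disk mirroring"),
    ("BR-102", "installer"),
    ("BR-103", "disk mirroring"),
    ("BR-104", "network driver"),
    ("BR-105", "disk mirroring"),
    ("BR-106", "installer"),
    ("BR-107", "disk mirroring"),
]


def bucket_counts(reports):
    """Count reports per root-cause bucket, largest bucket first."""
    return Counter(cause for _, cause in reports).most_common()


# Print the buckets from biggest to smallest.
for cause, count in bucket_counts(reports):
    print(f"{cause:<16} {count}")
```

The hard part is still the human one: deciding which root cause each report really belongs to. The counting is trivial once that's done.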
The numbers pointed to an area we hadn't considered to be an issue: the disk-mirroring hardware. Digging a little deeper into the data, there seemed to be a correlation with a particular supplier and with systems built during a certain window of time. Turns out we'd gotten a batch of defective controllers from our supplier. Their testing hadn't caught it, nor had our hardware testing. The problem only showed up when running the complex software on top of it all.
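The "digging a little deeper" step amounts to a cross-tabulation: take the reports in the suspicious bucket and count them by supplier and by build window. Here is a minimal sketch, assuming each report can be traced back to a controller supplier and a build month. The supplier names, the dates, and the `cross_tab` and `top_cell` helpers are hypothetical.

```python
from collections import Counter

# Hypothetical records for reports in the disk-mirroring bucket:
# (controller supplier, month the affected system was built).
mirroring_reports = [
    ("Supplier A", "2009-03"),
    ("Supplier A", "2009-03"),
    ("Supplier A", "2009-04"),
    ("Supplier B", "2009-01"),
    ("Supplier A", "2009-03"),
]


def cross_tab(records):
    """Count reports for each (supplier, build month) combination."""
    return Counter(records)


def top_cell(records):
    """Return the (supplier, build month) pair with the most reports, and its count."""
    return cross_tab(records).most_common(1)[0]


# List every combination, most reports first.
for (supplier, month), count in cross_tab(mirroring_reports).most_common():
    print(f"{supplier:<12} {month}  {count}")
```

Raw counts like these are only a hint. A supplier that shipped most of the controllers will naturally show up in most of the reports, so it helps to compare against how many systems were built with each supplier's parts in each window.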
Hardware replaced. Customer satisfied. Lesson learned. Sometimes data is just what you need to lead you in the direction of the cause.
Posted by Janet Egan at 3:24 PM