There’s a cool paper on a tool to do semi-automatic debugging: Triage: diagnosing production run failures at the user’s site. While Triage was designed to diagnose bugs at a customer site (where the software developers don’t have access to either the configuration or the data), I think a similar tool would be very valuable even for debugging in-house.
They use a number of different techniques to debug C++ code.
- Checkpoint the code at a number of steps.
- Attempt to reproduce the bug. This tells whether it is deterministic or not.
- Analyzes the memory by walking the heap and stack to find possible corruptions.
- Roll back to previous checkpoints and rerun, looking for buffer overflows, dangling pointers, double frees, data races, semantic bugs, etc.
- Fuzz the inputs: intentionally vary the inputs, thread scheduling, memory layouts, signal delivery, and even control flows and memory states to narrow the conditions that trigger the failure for easy reproduction
- Compare the code paths from failing replays and non-failing replays to determine what code was involved in that failure.
- Generate a report. This gives information on the failure and a suggestion of which lines to look at to fix it.
They did a user study and found that programmers took 45% less time to debug when they used Triage than when they didn’t for “real” bugs, and 18% for “toy” bugs. (“…although Triage still helped, the effect was not as large since the toy bugs are very simple and straightforward to diagnose even without Triage.”)
It looks like the subjects were given the Triage bug reports before they started work, so the time that it takes to run Triage wasn’t factored into the time it took. The time it took Triage to run was significant (up to 64 min for one of the bugs), but presumably the Triage run would be done in background. I could set up Triage to run while I went to lunch, for example.
This looks cool.