Metrics: Definitions

Code Collaborator collects a variety of raw metrics automatically.  This section defines these metrics; a later section discusses what these metrics can tell us.

Lines of Code

The most obvious raw metric is "number of lines of source code," meaning "lines" in the text-file sense.  This is commonly abbreviated "LOC."

Code Collaborator does not distinguish between different kinds of lines.  For example, it does not separately track source lines versus comment lines versus whitespace lines.

For code review metrics, you usually want to use total lines of code rather than breaking the count down by type.  The code comments are often just as much a part of the review as the code itself -- reviewers check them for consistency and make sure that other developers will be able to understand what is happening and why.
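As a rough illustration (not the product's actual implementation), the sketch below counts every added or removed line in a unified diff once, regardless of whether it is code, a comment, or whitespace -- the same sense in which the raw LOC metric treats all lines equally.  The function and the sample diff are hypothetical.

    def total_loc(diff_lines):
        """Count changed lines in a unified diff without classifying them.

        Source, comment, and whitespace lines all count equally, mirroring
        the way the raw LOC metric treats every line the same.
        """
        return sum(
            1
            for line in diff_lines
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
        )

    example_diff = [
        "--- a/foo.py",
        "+++ b/foo.py",
        "+x = 1        # a code line",
        "+# a comment line",
        "+",
        "-y = 2",
    ]
    print(total_loc(example_diff))  # 4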

Time in Review

How much time (person-hours) did each person spend doing the review?  Code Collaborator computes this automatically.  This raw metric is useful in several other contexts, usually when compared to the amount of code reviewed.

Developers (rightly) hate using stopwatches to track their activity, but how can Code Collaborator -- a web server -- automatically compute this number properly?

Our technique for accurately computing person-hours came from an empirical study we did at a mid-sized customer site. The goal was to create a heuristic for predicting on-task person-hours from detailed web logs alone.

We gave all review authors and reviewers physical stopwatches and had them carefully time their use of the tool: start the stopwatch when beginning a review, pause it when breaking for any reason -- email, instant messenger, a trip to the bathroom.  The times were recorded with each review and collected in a spreadsheet.

At the same time, we collected detailed logs of web server activity: who accessed which pages, when, and so on.  The log data could easily be correlated with reviews and people, so we could "line up" this server data with the empirical stopwatch times.

Then we sat down to see if we could make a heuristic.  We determined two interesting things:

First, a formula did appear.  It goes along these lines: if a person hits a web page and then hits another page 7 seconds later, it's clear that the person was on-task on the review for those 7 seconds.  If a person hits a page and the next hit comes 4 hours later, it's clear that the person wasn't doing the review for the vast majority of that time.  By experimenting with various threshold values for these gaps, we arrived at a formula that worked very well -- error on the order of 15%.  (A simplified sketch of this idle-threshold idea appears at the end of this discussion.)

Second, it turns out that humans are awful at collecting timing metrics.  The stopwatch numbers were all over the map.  People constantly forgot to start them and to stop them.  Then they would make up numbers that "felt right," but it was clear upon close inspection that their guesses were wrong.  Some people intentionally submitted different numbers, thinking this would make them look good (i.e. "Look how fast I am at reviewing!").

So the bottom line is: Our automated technique is not only accurate, it's more accurate than actually having developers use stopwatches.  The intrinsic error of the prediction heuristic is less than the error humans introduce when asked to do this themselves.
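To make the idle-threshold idea concrete, here is a minimal sketch in Python.  The threshold value, function name, and timestamps are hypothetical; the real formula was tuned against the stopwatch data and involves more than a single cutoff.

    from datetime import datetime, timedelta

    # Hypothetical idle threshold; the real thresholds were tuned empirically.
    IDLE_THRESHOLD = timedelta(minutes=15)

    def on_task_time(page_hits):
        """Estimate on-task review time from sorted page-hit timestamps.

        Gaps shorter than the idle threshold count as review time; longer
        gaps are assumed to be time spent away from the review.
        """
        total = timedelta()
        for earlier, later in zip(page_hits, page_hits[1:]):
            gap = later - earlier
            if gap <= IDLE_THRESHOLD:
                total += gap
        return total

    hits = [
        datetime(2010, 3, 1, 9, 0, 0),
        datetime(2010, 3, 1, 9, 0, 7),   # 7-second gap: counted
        datetime(2010, 3, 1, 13, 0, 7),  # 4-hour gap: discarded
        datetime(2010, 3, 1, 13, 2, 7),  # 2-minute gap: counted
    ]
    print(on_task_time(hits))  # 0:02:07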

Defect Count

How many defects did we find during this review?  Because reviewers explicitly create defects during reviews, it's easy for the server to maintain a count of how many defects were found.

Furthermore, the system administrator can establish any number of custom fields for each defect, usually in the form of a drop-down list.  This can be used to subdivide defects by severity, type, phase-injected, and so on.
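As a sketch of what this enables, tallying defects and breaking them down by a custom field is just a count plus a grouping.  The field names and values below are hypothetical examples, not built-in fields.

    from collections import Counter

    # Hypothetical defect records; in practice these come from the review
    # server, with the custom fields defined by the administrator.
    defects = [
        {"severity": "Major", "type": "Logic"},
        {"severity": "Minor", "type": "Style"},
        {"severity": "Major", "type": "Documentation"},
    ]

    defect_count = len(defects)                            # raw defect count
    by_severity = Counter(d["severity"] for d in defects)  # grouped by custom field

    print(defect_count)       # 3
    print(dict(by_severity))  # {'Major': 2, 'Minor': 1}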

File Count

How many files did we review?  Usually the LOC metric is a better measure of "how much did we review," but sometimes it is helpful to have both LOC and the number of files.

For example, a review of 100 files, each with a one-line change, is quite different from a review of one file with 100 lines changed.  In the former case this might be a relatively simple refactoring; with tool support this might require only a brief scan by a human.  In the latter case several methods might have been added or rewritten; this would require much more attention from a reviewer.
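The sketch below (with hypothetical file names and line counts) shows why the two metrics are worth reading side by side: both reviews contain 100 changed lines, but the file counts tell very different stories.

    # Hypothetical per-file changed-line counts for two reviews.
    review_a = {f"file_{i:03}.py": 1 for i in range(100)}  # 100 files, 1 line each
    review_b = {"big_module.py": 100}                      # 1 file, 100 lines

    for name, review in [("A", review_a), ("B", review_b)]:
        loc = sum(review.values())
        files = len(review)
        print(f"Review {name}: {loc} LOC across {files} file(s)")

    # Review A: 100 LOC across 100 file(s)
    # Review B: 100 LOC across 1 file(s)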