Friday, June 24, 2011

Testing Boundaries

In Programming Pearls, Jon Bentley uses binary search as a case study of how difficult it can be to implement a seemingly simple algorithm (emphasis added).

I've assigned this problem in courses for professional programmers. The students had a couple of hours to convert the description above into a program in the language of their choice […] We would then [validate with] test cases. In several classes and with over a hundred programmers, the results varied little: ninety percent of the programmers found bugs in their programs

Bentley then goes on to develop an implementation using formal verification techniques: identifying an “invariant” condition, and proving that that invariant in fact does not vary during execution of the algorithm. Unfortunately, his code has a bug, one that found its way into the JDK.

While one can point to this bug as a refutation of formal correctness proofs,* I'm more interested in whether testing is a valid alternative. Bentley clearly didn't think so: “I wasn't always convinced of the correctness of the code in which no bugs were found.” His concern is reasonable: tests can only verify that a specific input leads to a specific output. It may be sheer coincidence that the code under test produces this output, and it may give incorrect results for different input.

Which means that a simple test, one that creates an array and searches for values in that array, isn't enough. It could be that your code works for an array that has an even number of elements, but fails for one that has an odd number. Or perhaps it works fine for all arrays with more than one element, but throws an exception if passed an empty array. And how do you know that you're really doing a binary search?** One obviously infeasible approach is to do exhaustive testing of all possible arrays.

An alternative is to adopt the “spirit of testivus,” let a tool generate random testcases for you, and assume that a sufficiently high number of tests means everything works. But from the perspective of “this is correct,” however, random tests are the same thing as arbitrary tests: they may find a bug, they may not; you'll never know if others exist.

A better approach is to adopt the core of Bentley's argument, which is to understand the inflection points of the algorithm: those places where the behavior potentially changes. For binary search, I started with the following:

  • 0: every method that involves indexing should have a test for a zero-length object (and in the case of methods using Strings, a test that passes null).
  • 1: This is the end-point of a binary search loop: you have a single element in your array, and either it's the correct value or it isn't.
  • 2, 3: These are the points at which a binary search divides the array and recurses (or loops).
  • 4: This is a superfluous test. I wrote it because I wanted to verify two passes through the divide-and-conquer logic, but then realized there was no point. Still, every test is sacred, so I checked it in.

If you remember high school algebra, you're realize this is proof by induction. If you can demonstrate that you have exhaustively tested every array size through all inflection points, then you can reasonably say that it will return the expected value for larger sizes. Add some hook points that record each pass through the loop (this could be a Comparator that counts its invocations), and you can verify that it's an O(logN) search.

Except … my code doesn't actually test all inflection points. There's another inflection point at Integer.MAX_VALUE / 2 + 2: the possibility of integer overflow. And unless you have a 64-bit JVM and over 4 Gb of physical memory, you have no way to test this case.

I don't have the answer. Tests can develop a high level of confidence, but they can't make guarantees. And to develop that level of confidence, you essentially create a formal proof. But compared to the situation with no tests, that level of confidence may be enough.


* Something that I am all too willing to do: I have a degree in History rather than Computer Science because at nineteen years old I realized that there was a big difference between proof on paper and bytes in memory. But that's a topic for another posting.

** I believe that binary search, like quicksort, is an algorithm that cannot be implemented following the strict rules of test-driven-development. Fixing your tests by doing “the simplest thing that could possibly work” will give you a linear search.

See www.youtube.com (NSFW if you don't have headphones, potentially offensive if you do).

Actually, you could test a binary search that works on arbitrary indexable objects rather than Java arrays: create an implementation that stores its data on disk. Although when Java's search methods were being developed (circa 1999), a typical workstation disk wouldn't have had the needed capacity.

Monday, June 20, 2011

Testing the Complete API

In my article on code coverage I was careful to stress that a coverage tool can only tell you what you haven't tested. It can't tell you if you're missing mainline code. Which made a recent bugfix to one of my open-source libraries even more embarrassing: the class had 100% coverage but was missing 7 characters of mainline code. And those seven characters were the difference between correct behavior and a bug that would manifest as non-deterministic behavior in consuming code.

When I have a bug, my first task is to write a unit test that exposes it (and demonstrates that I fixed it). Then I look in all similar code for the same bug (and in this case I found it). Finally, I think about how the bug came into existence, and more importantly, how come I didn't uncover it when I was first implementing the class and writing tests.

In this case, the answer was simple: the bug was in the single-byte read method of an InputStream implementation, and since I rarely use that method I got sloppy when writing tests for it. I wrote “happy path” tests that validated correct behavior, but did not try to trick the class into exhibiting incorrect behavior.

This experience has pushed me to think, again, about the role of unit tests. And it reminded me that there's a big difference between the way that a software developer thinks and the way that a good QA tester thinks. The former wants the code to work, the latter wants it to fail. Both mindsets are needed, but I'm not sure that any one person can hold both at the same time. And that's what's needed for unit tests to truly exercise code.

And I also wonder how such tests fit into the mindset of test-driven design. In my experience, TDD is very much focused on the happy path. Each test describes a small piece of the overall functionality, a “specification in code.” Can detailed edge-case tests contribute to that specification without cluttering it? And if yes, how does a developer switch mindsets to come up with them?