A different kind of testing for genetic programming systems

Draft of 2018.05.24

May include: GPtesting&c.

Over the last week, I was helping run the 2018 Genetic Programming Theory and Practice workshops in Ann Arbor. Quite a bit of very good research and engineering was reported, and a few themes started to emerge that haven’t been quite so prominent in earlier years of the workshop.

The one that stood out was the remarkable (to me) diversity of attitudes among my peers towards “testing”: what the notion means, when we’re building software that is supposed to itself produce novel working software, how to go about it, what “white box” and “benchmark” and other words familiar from the software engineering literature have come to mean in this subtly different context.

First, a bit of framing.

Recall that when we design and implement genetic programming systems—which, rightly, several people at the workshops have taken to calling “automatic programming” systems, since we’re not always using “genetic” operations to explore the set of “programs”, and which I still prefer to call “generative programming”—we are manually constructing a software system which, when run, we want to be able to produce new software that fits a given specification. Since most people who think of genetic programming imagine symbolic regression as the use case, let’s start there.

In symbolic regression, the use case is: Given some arbitrary suite of training data, each being an observation of the many “input” or “independent” variables and possibly many “output” or “dependent” variables, we want our built system to be able to eventually produce an equation, over the independent variables, which “models” the output variables. Usually this equation includes the standard mathematical functions you’d expect to see in any science or engineering lab report—arithmetic, transcendental, logarithmic, possibly some piecewise “conditional” stuff, and of course a collection of multiplicative and additive constants scattered throughout.

The “program” in symbolic regression is typically therefore an equation, and the task the GP software is given is to be able to build, evaluate—that is, “test”—and modify a potentially large set of these equations, searching for those that have low error with respect to the training data.

But, it turns out, reducing error of the models discovered with regards to the training and test data is necessary but not sufficient. In many of the talks and discussions this week, we were reminded that automatically-discovered symbolic models are often expected to seem “reasonable” to practitioners in the domain of interest. For example, Michael Korns pointed out several times that financial models used in trading are expected to be “white box” models, not just in the sense of being deterministic and traceable (“White box” in terms of source code), but also what we might call “extra white”, in the sense that they make sense and reveal intention to what you might call “a practitioner skilled in the art”.

Michael’s examples were about the business risks of using these models in a real trading situation, when the external dynamical regime changes. If an “opaque” model, like a large neural network, is in use—because it’s doing very well and making loads of money—there are no obvious problems while the world in which it is “good” still applies. But (to use Michael’s example), if there is a tacit assumption, because of the training data used, that the price of oil is and always will be $80 per barrel or higher, then there is a matching tacit risk that the models will suffer from what you might call “catastrophic generalization error”. That is, they might suck totally, and lose loads of money, if the world produces cheap oil.

Michael’s argument for what I’m calling “extra white box” models isn’t that one needs to plan for every possible Bad Thing beforehand, but rather that when some inevitable Bad Thing crops up, there needs to be a way of quickly adapting the model.

Remarkably, I think Michael’s really saying that these automatically built models need to be more agile.

That is, a neural network is in a real way a blob of canonical spaghetti code: It’s a sequential multiplication of several arbitrary-looking large matrices, interspersed by a bunch of nonlinear compression functions. Numbers go in, and numbers come out the other end, but there is no easily labeled “string” connecting the price of oil in dollars per barrel to the behavior out the other end. Though of course one could find such a “string”, by prodding and probing the network with various artificial “probe” values of salient inputs and watching what happens out the other end.

But that knowledge wouldn’t be prescriptive in any sense. Even when we find the connectionist chain that links some salient input to one or more output behaviors, there are no clear-cut remediations we can apply to make the thing not be so damned stupid. We can measure its stupidity, but not advise on how to smarten it up without retraining on new data. In other words, we can’t necessarily repair it when the world drifts outside of our optimal dynamical regime.

Michael’s argument for “extra white box” models, then, is something about repair in the face of these changes. Consider a model like the ones I’ve been writing about here, a Push-like “tree” function that encodes a function mapping twelve “input” variables onto a single scalar value \(\mathbb{R}^{12} \rightarrow \mathbb{R}^{1}\). You can imagine that there are many ways, with just a few higher-order polynomials and nested transcendental functions, we could make a twelve-dimensional equation into a nightmare of uninterpretable nonlinear opacity. For Michael’s approach to work, he tries to focus on solutions that avoid the “really weird” stuff, so that—his argument goes—when the world strays outside of the tacit assumptions built into the model, there is a potential route to quickly and easily compensate by “looking at the code”.

That’s the argument, at least.

When there are “weird” chunks of model, subject to extreme nonlinearities, like \(sin(k_{22} x_3^7 - \frac{cos(x_{11}+k_3)}{(k_4 sin(k_9 x_{2}^3)+k_{7})})\), you will be challenged to identify what to do when \(x_2\) is suddenly doubled or halved, or even if it starts falling slightly outside the range used in training. Michael’s approach—one shared by many symbolic regression practitioners—is to focus on some sense of “parsimony” as well as accuracy. Not because the world is itself parsimonious (in any “real” way, at least), but because such models provide a better chance of being maintainable in practice.

This should be starting to sound familiar to my agile software development friends.

And, just to nail the theme before I dive in here, Michael’s not the only one of us who thinks this way. Trent McConaghy reminded us of his FFX algorithm, which he first presented in 2011 at our workshop series. This is a fast and intrinsically parsiomonious algorithm, only a few dozen lines of Python long, which does something like symbolic regression on large numerical data sets without all that rambling “exploration” part. The algorithm’s used in circuit design software his company produced, which in turn is used by nineteen of the twenty largest chip manufacturers in the world, and therefore has almost certainly been used to optimize design parameters of almost all the chips in the room where you’re sitting right now. In the device you’re using right now.

FFX avoids “weird shit” happening by focusing on the smaller power-sets of simple mathematical functions. It limits nesting, it limits polynomial exponents, and as a result it sacrifices some amount of the resulting models’ numerical accuracy in exchange for algorithmic speed and the simplicity of the resulting expressions. It doesn’t necessarily work for arbitrary datasets (for some definition of “work”; see below), but it does the job Trent needed it to do.

Where Michael Korns and Trent McConaghy use a sort of order-driven parsimony in their work—searching for little expressions before they search for big ones—the alternative is some kind of ParetoGP approach, like the one Katya Vladislavleva uses in the DataStories system for identifying KPIs in arbitrary data sets. I’ve talked about this before, and recommended that everybody who does symbolic regression should be using some variant. Here, we can let the evolutionary search get “weird”, and explore the potential payoffs of nonlinear and larger expressions, but we simultaneously select for both accuracy and parsimony. That is, given two models with similar accuracy, we prefer the simpler one; given two models with similar complexity, we prefer the more accurate one.

Unlike the Korns and McConaghy approaches, where the result of symbolic regression is expected to be a single “winner” model, the result of ParetoGP search tends to be a larger collection of models, none of which is both simpler and more accurate than the others. This suite may seem like a hedge, but the point then is not to force a user to select one arbitrarily, but rather to examine them as a suite of alternative versions of the world. Collectively they can be seen to form “opinions” about what variables are salient, and in what nonlinear relations to one another, and can surface details of dynamics in a way that a single model may not communicate.

Here too, we end up collecting models that include “sensible” small, not-very-nonlinear chunks. And it seems as though there is an argument to be made—at least in the context of this sort of “extra white box” work—that these little sub-expressions are easier to understand than a big be-all end-all spaghetti model. That, insofar as we can see them at all against the background of the models they’re part of, they’re something like “modules” or “pieces of intent”.

What does it mean for an automatically-generated model to display “intent”?

A litany of tests

So let me step away from this backgrounder, so I can point out what it was I started to sense at this workshop.

Let’s talk about the tests we do when we make and use a GP system.

There are two phases where “testing” might get confused. One manually writes the software that implements GP, and like all programs a human runs, that software needs to “work” and “be done”.

But when we use the working software that we’ve written manually (and tested), its “work” in the sense of “use case” is to produce more software: equations, functions, data structures, complex code in one or more representation languages that does what was asked by the user of the GP software. That generated software structure itself needs to “be done”, but it is “tested” according to some set of criteria qualitatively different from the criteria we used for the GP system itself.

So: if we are good software developers, we unit test our GP system. That is, we have confidence that the parts we have written manually do the things they are intended to do.

Further, we will do some kind of “acceptance testing” of our GP systems. That is, we will run them through their paces, and expect the stuff they generate to fit some project-specific envelope. Beyond the bare functionality of things like, “When an error occurs, the right exceptions are thrown and handled,” we will often have in hand a set of benchmarks or toy problems: These acceptance tests of the GP system tend to sound like, “It can solve a difficult quartic polynomial at least \(p%\) of the time, in time \(t\) or less,” or, “It can balance a pole on a cart.”

Now lately, there’s also been a lot of talk about “benchmarking” for our GP systems. This is, in my experience, a sort of weak and flexible sort of “benchmarking”. What tends to happen is that people propose and use “difficult” problems to get some sense of the capabilities of their system, and as these propagate through the citation networks, they’ve become a sort of de facto “benchmark” used to compare between some systems, at least those with overlapping capabilities. This has led to some early work on compiling collections, but in practice these problems often suffer from representational specificity, or get set aside as novel search operators and search dynamics are implemented. In other words, they almost always play the role of “acceptance tests”, despite the sense that people are “comparing” old and new GP systems.

People use these “benchmarks” to signal that their new GP system is “done”; that it performs approximately “as well as” the older version of the same system, or better by some standard. But rarely is there even a feasible mechanism to compare qualitatively different GP systems to one another: it’s apples and oranges, and there’s scant reason to expect a pure symbolic regression problem to even be capable of handling a software synthesis or molecular design task.

And that’s about the extent of what people have called “testing”. There’s unit testing, which checks the quality of the code that does GP. There’s acceptance testing, which checks the functionality of the code that does GP. And there’s “benchmarking”, which is often reducible to acceptance testing in practice, which checks the functionality of the code that does GP against other (possibly earlier versions of the same) code, using defined standard problems and performance metrics.

But that’s not all.

Say what you mean

Lately I’ve been evolving FizzBuzz programs with various GP systems, and looking at how they work. You might think this is a kind of “benchmarking” as well, but really I’m not asking with these tests whether the GP systems can “solve” the various FizzBuzz problems. Of course they can “solve” such a trivial little problem. Rather, I’m looking at how they end up actually solving it, and considering what that code tells me about the systems’ internal dynamics, and capabilities.

On the one hand, we sometimes want the systems we’ve built to poop out what I’m calling “extra white box” solutions. That is, given a huge combinatorial space of possible solutions, we want—in these situations—for it to “find” the one we think of as “right”. In the worst case, this can seem to be an amazing needle-in-a-haystack notion of success, and when it does work we can be very impressed.

But on the other hand, there are at least as many times when we use GP in search of novelty. That is, we definitely do not want the boring old answer that makes us comfortable, but rather something weird and insightful and which we can (for example) file a patent for. I’m thinking here of applying GP to unsolved problems in mathematics, or to look for non-incremental optical or circuit designs, or in almost any aesthetically-driven application. There is no “right” answer here, and further we don’t want GP to keep harping on the same new answers whenever we do the search. We want, in other words, novelty as such.

What is it to be novel? That’s a matter for another day. But it seems to me that we want to be able to say to any given GP system, on one day: Give me an answer that’s not too weird, and on some other day say to the same system: Delight me.

I’ve been trying to frame these capabilities over the last week or so.

In a little lightning talk I gave the other day, I called them “Exploration testing” and “Exploitation testing”. The first is the capacity (when asked) to be delightful, and the latter to be reliable.

When I described the things I was thinking about to Barbara the other day, she mentioned that it sounded like I was asking for the system to pass its comprehensive and oral exams for a Masters degree. On the comprehensive written exam, you usually need to demonstrate familiarity with the core concepts and tenets of your discipline, and report best practices and what’s essentially expected in a given stylized situation. In the oral exam, you usually need to demonstrate the ability to concoct novel solutions to unforeseen situations, and to show how it is you’re thinking.

Hang on. That bit right there seems to be common, doesn’t it?

In one case, you’re expected to show you can think in the conventional way, if asked. And in the other case, you’re expected to show you can think in an unconventional way, when asked. But in both, you are on the spot to demonstrate your intention along the way. Not only to walk the path from problem to solution, but to narrate that path along its course.

This is more than just “competence” or “mastery”, I think now. But what is it?

The “Intention” of Automatic Systems

The other day (a few weeks back, now) Chet Hendrickson and Ron and I were chatting about related stuff. Chet noticed that part of what he was hearing in my complaints and musing reminded him of Kent Beck’s four rules of simple design.

There’s a lot to think about in that simple observation.

Let’s look at the list, which is (at least partially) ordered:

  1. Tests pass
  2. Expresses intent
  3. No duplication (DRY)
  4. Small

Suppose we think about what we’re asking our GP systems to do in these terms.

The first one, “tests pass”, translates to something like “error over training cases is minimized”, right? We seem to focus on that one.

The third and fourth ones, we seem to have well in hand, too. That is, it’s possible with some judicious applications of parsimony pressure and/or refactoring in the course of—or after—a GP run, to minimize the amount of duplication and useless code. We can, if we put our minds to it, evolve things that work and which are relatively succinct.

But that second one, huh?

What is the “intention” of a portion of code that’s been “written” by an automated system? Maybe this axis of exploration/exploitation can help clarify.

Let’s think about FizzBuzz. In the experiments I’ve done, I’ve observed an interesting set of stuff:

  • When I ask GP to solve “classic” FizzBuzz—that is, to output the one long string that lists the integers 1–100, and for each produce a line that may have “fizz” and/or “buzz” appended, where those indicate divisibility by 3 and 5 specifically—it’s super hard and annoying. This is essentially asking GP to return a single string constant every time it’s run.
  • When I ask GP to solve a “simpler” FizzBuzz task, where I give it an arbitrary integer argument, and it only needs to return a single string with that value, appended with “fizz” and “buzz” appropriately if the argument is divisible by 3 or 5, it typically can solve it, but what I get is often a very complicated algorithm indeed. More often than not, the constants 3 and 5 don’t even appear in the programs; instead, they’re “stored” by counting the number of things in some stack, or by incrementing or decrementing the number of characters in the constant “buzz”, or by some weird-ass roundabout “trick” that isn’t even clear enough to be called “clever”.
  • Then when I ask GP to solve a “harder” version of that “simplified” FizzBuzz task—one where I give it an argument integer \(n\), plus divisor arguments \(i\) and \(j\), and ask it to print the \(n\) and follow it with “fizz” when \(n\) divides \(i\) evenly, and follow that with “buzz” when \(n\) divides \(j\)—I have noticed something interesting. The problem is solvable, though as one might expect it takes a bit longer (in some sense), but the solutions I get almost always include a section of code that looks like it is intended to determine if one integer divides into another without a remainder.

That last one is particularly interesting. By asking for a more general solution, I think what I’m seeing in these preliminary explorations is a more specific and modular approach developing. Somehow, by surfacing the integer arguments instead of implicitly asking GP to “discover” or “infer” them along the way, the “notion” of divisibility-by-an-arbitrary-integer becomes clearer in the evolved code, than when divisibility-by-three and divisibility-by-five are both left implicit.

I wonder what that means….