Overlapping “answers”

Draft of 2018.04.27

May include: GP ↘ Push ↗ &c.

In the last two parts of this arc, I’ve spent some time looking at the ways our representations of arithmetic expressions—and, by extension, more general “computer programs”—can affect the ways in which we assume they “run”. Among the genetic programming community over the last two or three decades, quite a few structurally and dynamically distinct representation schemes for even “simple” programs have arisen. I’ve been comparing the older and more broadly used “tree” format for expressions to the Reverse Polish Notation approach used by languages designed more recently for genetic programming use, like Lee Spector’s Push.

Specifically, I’ve been highlighting some of the consequences of the RPN approach’s evaluation dynamics. Looking (for the moment) just at simple arithmetic “expressions” composed of the ten simple tokens (k_1, x, k_2, /, -, k_3, y, cos, *, +), last time I looked at the way each item that is pushed to the numeric stack in the course of evaluation “is” a “result”, and how the execution of arbitrary code in this style can result in multiple items being “left” on the stack over time.

For example, here’s a manual trace of a short snipped of RPN “code” using those ten tokens, showing the program being executed token by token (on the right), and the number stack on the left, where I’ve also explicitly replaced each value on that stack with the specific RPN code that would be required to calculate it:

STACK                                                    PROGRAM
[]                                                (k_2 y k_1 / k_3 / * k_1 k_2 * cos k_3 x *)
[(k_2)]                                           (y k_1 / k_3 / * k_1 k_2 * cos k_3 x *)
[(y) (k_2)]                                       (k_1 / k_3 / * k_1 k_2 * cos k_3 x *)
[(k_1) (y) (k_2)]                                 (/ k_3 / * k_1 k_2 * cos k_3 x *)
[(y k_1 /) (k_2)]                                 (k_3 / * k_1 k_2 * cos k_3 x *)
[(k_3) (y k_1 /) (k_2)]                           (/ * k_1 k_2 * cos k_3 x *)
[(y k_1 / k_3 /) (k_2)]                           (* k_1 k_2 * cos k_3 x *)
[(k_2 y k_1 / k_3 / *)]                           (k_1 k_2 * cos k_3 x *)
[(k_1) (k_2 y k_1 / k_3 / *)]                     (k_2 * cos k_3 x *)
[(k_2) (k_1) (k_2 y k_1 / k_3 / *)]               (k_2 * cos k_3 x *)
[(k_1 k_2 *) (k_2 y k_1 / k_3 / *)]               (cos k_3 x *)
[(k_1 k_2 * cos) (k_2 y k_1 / k_3 / *)]           (k_3 x *)
[(k_3) (k_1 k_2 * cos) (k_2 y k_1 / k_3 / *)]     (x *)
[(x) (k_3) (k_1 k_2 * cos) (k_2 y k_1 / k_3 / *)] (*)
[(k_3 x *) (k_1 k_2 * cos) (k_2 y k_1 / k_3 / *)] ()

In each line of this trace, the top (left, for simplicity) token is popped from the program and evaluated. If it’s a constant like k_2, that is pushed as a single new (parenthesized) “self-contained item” onto the stack. The parentheses here are simply used to delineate the bounds of what would, if I were evaluating the numbers put onto the stack, numerical values. In a standard interpreter for these sorts of languages, I would of course put the value of constant k_2 onto the stack in the first step, not the little expression that represents that value.

Similarly, when I process a variable reference token in the program, like y in the next step in this example, I’m moving it over to the stack as the self-contained expression (y). But in practice I would of course replace this with the current value bound to y.

Finally, when I process a unary operator token like cos, or a binary operator like +, I obtain their arguments from the stack, and push the result of applying the operator back onto that stack. In this example, the first operator I encounter is the / in the fourth step; that consumes the top two items from the stack when it’s processed—which are (k_1) and (y), here—and produces their quotient: (y k_1 /).¹ That is, the numerical value \(\frac{y}{k_1}\).

But the thing I want to focus on here today is how much stuff is happening along the way, as I process that short snippet of fourteen tokens, (k_2 y k_1 / k_3 / * k_1 k_2 * cos k_3 x *). By the time we’ve consumed them all, there are three items on the numerical stack: (k_3 x *), (k_1 k_2 * cos), and (k_2 y k_1 / k_3 / *), representing the values \((k_3 x)\), \((\cos(k_1 k_2) )\), and \((k_2 \times ((\frac{y}{k_1})\div k_3))\), respectively.

That there can be three different things left on the stack when we run out of tokens in our “program” is perhaps the strangest aspect of these Push-like RPN representations, for people who are used to tree-based S-expressions. The definition of a tree entails its single root: If we write that RPN “item” (k_2 y k_1 / k_3 / *) out in the equivalent prefix form (where each operator is a parenthesized block starting with the operator followed by its arguments), we get something like (* k_2 (/ k_3 (/ y k_1))). Looking at this single semantically-sound tree—as long as we know that it’s in correct prefix form now—we can be confident that the root operation is the * operator, and that a single number will be the result.

Indeed, that highlights one of the “tricks” in what I’ve been doing here: Whenever I show you these parenthesized “self-contained items” on the stack in these traces, there is an implication that it “is” a number, and that therefore it’s exactly equivalent to one prefix-formatted tree. When I’ve asked you to think about “how many” items are on the stack after you run a large block of random RPN code, or about “how big” those items will be (in terms of the number of tokens they contain), there is in some real sense an equivalent question I could be asking: How many trees are there in an arbitrary RPN token stream?

Is that your final answer?

In response to this strange-feeling “looseness” in RPN representations—the fact that there can be multiple items left on the execution stack when we’re “done” running them, unlike self-contained S-expressions—those of us who work with Push in genetic programming have adopted a simple-feeling approach to cope. We use the top value left on the stack when we’ve finished evaluating the code. That is, if you asked most people who use Push-like representations, if we asked them to evaluate the program (k_2 y k_1 / k_3 / * k_1 k_2 * cos k_3 x *), most would say the result is (k_3 x *).

And this makes perfect sense for a number of reasons. When we evaluate a traditional tree-based program, we also use stacks (though they feel like a slightly different sort) to manage intermediate calculations and the values returned by sub-expressions along the way. There, too, we will end up with “the answer” on top of some stack.

In many Lisp-like programming languages—Clojure, for instance—there are explicit and implicit “the last thing you do is what counts” rules for execution. For example, the Clojure code (do 2 5) produces the value 5, because the do executes every Clojure form in its argument list, sequentially, returning the result of the last (and 5 is the last, here). But in much of Clojure’s core syntax, for example the definition of local vars by let, or functions by fn, the special forms include an implicit do. You can sequentially execute a series of expressions in these forms—which can, usefully, have side effects—but can be sure that the last one is the only one that actually returns a value.

In other words: “keep the last item in the sequence of forms” is a lot like “return the top item on the stack”. It’s perfectly sensible.

And yet, here we are.

Feeling a bit bloated

A long time ago, in the days when almost everybody used tree-based representations all the time, there was a problem plaguing the genetic programming world. Whole tracks were devoted to this problem at all the international workshops, and a lot of intellectual effort was expended by a lot of very smart people in combatting this problem and its consequences. A lot of very cunning non-tree representation schemes arose in that period (around 2001–2005, I’d say), and also a lot of smart search operators designed to limit bloat, and also a lot of multiobjective schemes for handling bloat at the same time as model error.²

That problem was called bloat.

To glibly summarize several decades of excellent work in a quick off-the-cuff paragraph: “Bloat” in tree-based GP is the remarkable tendency of evolved trees to include big chunks (sub-trees) that don’t do jack shit. This arises due to a very interesting rabbit-hole of feedback between (1) the way one function or value can be represented by many different, often ornate or baroque, sub-trees; and (2) the crossover dynamics of tree-based GP, in which sub-trees are snipped off and exchanged.

This is a good place for an example: Suppose we’re evolving an arithmetic function, like the ones I’ve been talking about in this arc so far. But with trees. Let’s use the same ten tokens I’ve been playing with already, too: (k_1, x, k_2, /, -, k_3, y, cos, *, +).

Arithmetic is an interesting thing, when you get down into it. We tend to assume that simpler forms are better, but genetic programming and randomly cutting-and-pasting expressions doesn’t know about Occam’s Razor. And so in the course of randomly generating code, or of randomly snipping and swapping sub-trees between two expressions in this code, you tend to get things like this:

(/ k_2 k_2)
(- y y)
(- (+ x x) x)
(- (/ (* k_1 (- (+ x x) x)) k_1) x)
(- (/ (* k_1 (- (+ x x) x)) k_1) (/ (* k_1 (- (+ x x) x)) k_1))

and so on.³ Those are, in case you don’t want to squint and cipher too much, exactly equivalent (modulo some care with dividing by zero) 1, 0, x, 0 and 0, respectively.

So that, in some sense, is bloat. You may be thinking that no sane person would write those expressions, but you would be forgetting the context: no person at all is writing them. They are being evolved, by rearranging the parts of randomly-generated code, and evaluated on the basis of their accuracy. And of course, all of them are perfectly accurate, if in context your expression “wants” a 1 or 0 or x.

Bloated expressions like these, as I said, arise in many cases because so many cut-and-pasted trees tend to evaluate to the same thing. You might also be tempted to add a little step in here somewhere—if you were evolving arithmetic that you wanted to fit some data—that applied some algebraic simplification to replace sub-trees like these with their smaller equivalents. But the risk (and it’s a real risk) is that when you over-simplify the expressions, you are reducing the potential for more complicated and accurate expressions to arise later on.

The problem with bloat isn’t, in practice, that people don’t like to see that sort of stupid math junk in their final answers. The practical problem was (and is) that—even given some kind of algebraic simplification—you still need to crank the computational handle and determine explicitly that (- (/ (* k_1 (- (+ x x) x)) k_1) (/ (* k_1 (- (+ x x) x)) k_1)) is exactly 0. It costs time and attention and produces a lot of CPU heat to do that work.

And so people have waged a decades-long war on bloat.

Partial credit

The problem of bloat in tree-based GP therefore revolves around something that should remind you of my questions in this arc about the number, size and diversity of “things on the stack” when we evaluate arbitrary RPN programs.

Bloat is bad not because it interferes with the accuracy of results—at least not directly—but because it delays those results, and makes the process of calculating them much less efficient. It’s a lot of work done for little apparent return.

Realize that a substantial part of this comes when we’re generating and then evolving programs made out of random tokens—whether they’re arranged in a single tree, or in a stream of RPN “code”—because we don’t know yet that the “right answer” will be in the top of the stack at the same point in execution when we exhaust all the tokens. In a tree of 100 tokens (operators and leaves, arranged in a syntactically-correct position in some tree), any one of the sub-trees might be a “better” answer to the problem at hand than the root tree itself. And in a stream of 100 RPN tokens (operators with positive arity that consume-and-the-produce stack items, and simpler tokens that just produce new ones), any one of those steps through the program might be a “better” answer than the thing on top of the stack after the hundredth token has been processed.

Certainly this is true in the early stages of our searches, when we have no confidence that any of the random guesses we’ve built produce even an approximation of our target function.

So even before we’ve started evolving proper, we’re doing just the same sort of “extra work” that made people worry about bloat: If we’re building random trees to fit data that (ideally) looks like (+ (* x k_1) (* k_2 (* x x))), none of these trees will look very good, when compared to the data:

(/ (+ (* x k_1) (* k_2 (* x x))) k_3)
(* (+ (* x k_1) (* k_2 (* x x))) k_3)
(cos (cos( cos (+ (* x k_1) (* k_2 (* x x))))))

But each one of them contains a subtree that’s a very good answer, and in the process of evaluating each one of them, the “right answer” will have appeared on some stack.

We justify this, in our field, by saying that “evolution will find the right chunks”, but to be honest this is unlikely. Here, evolution’s first concern is to get close to the target data with something out of a pile of random guesses, and so it will almost certainly ignore these “masked” solutions in favor of a little constant it stumbles across that happens to be close to the average value. Recombination is a harsh tool.

Similarly, suppose our target data are derived from (k_1 x * x x * k_2 * +) (which I hope is the same function, in RPN form). None of these RPN expressions will leave the “right thing” on the top of the stack when executed, even though they will stumble across it along the way:

(k_1 x * x x * k_2 * + k_3)
(x k_1 x * x x * k_2 * + /)
(k_1 x * x x * k_2 * + cos)

It would be nice—whether we’re talking about bloat in trees, or intermediate calculations in RPN expressions—to be able to give partial credit. At least in the stages of search where we’re exploring a large space of possible structures. Even if the “whole” solution doesn’t appear explicitly in the execution trace, we calculate (in both cases) about a hundred potentially different functions, in the course of evaluating only one hundred-token tree or code block. Admittedly (and as I showed last time, when we counted the number of different items on the final execution stack) we will have a lot of intermediate trees or self-contained blocks that look like (x) and (k_2). But we will also run across all the chunks of the “final answer”, whether that’s the root of the tree or the last self-contained block on top of the stack.

Ideally, we want our search to return a perfect solution in the first step. This is, of course, ridiculous. But we do prefer (1) more accurate models, and at the same time (2) models that avoid doing extra work, and which therefore have short execution traces (or smaller trees). If we have two models of equal accuracy, but one is more complicated than the other, we prefer the simpler; if we have two models of equal complexity, but one is more accurate than the other, we prefer the more accurate one.

For example, which is “better”?

(k_1 x * x x * k_2 * +)

final stack: [\((k_2 x^2 + k_1 x)\)]
(k_1 k_1 x * x x * k_2 * +)

final stack: [\((k_2 x^2 + k_1 x)\) \((k_1)\)]

The second one does an extra step of work, leaving a stack that looks just like the one in the first case, but with an extra (k_1) entry below the shared value of (x * x x * k_2 * +).

Which is better?

(k_1 x * x x * k_2 * +) (\(f(x) = k_2 x^2 + k_1 x\))
(k_1 x * x x * k_2 * -) (\(f(x) = k_2 x^2 - k_1 x\))

The second one includes the same (and correct) \(k_2 x^2\) term, but inaccurate because of the second term’s constant, which misses the target by a factor of \(2 k_1\). Neither one is more “complicated” than the other (by most measures of functional complexity), it’s just that one is more correct.

Items made, items left, and intermediate steps

Along the course of this story arc, I’ve smeared some definitions and notions.

In the course of executing an RPN program of the sort I’ve been talking about, we end each step in which a token has been been processed with some number on the stack. That is, as long as we have enough items on the stack to satisfy the constraint that operators need the right number of items as arguments.

We sidestep that constraint either by (1) using Push’s “ignore on failure” rule for execution, in which we do nothing if there aren’t enough arguments; or (2) by providing the sort of “pre-image” I’ve used in several of the examples. The outcomes are different, though: In the former case, we are essentially redefining the behavior of all operators in the language, to avoid “crashes” if the stack runs out. In the latter case, we’re redefining what it means to “execute” a block of code, by constructing a sort of abstract context in which “stuff has happened before” to produce all necessary arguments on demand.

When we execute the RPN program (+ * + *) in these two very different regimes, we will get an empty stack—no answer, in other words—if we use the Push rule, or we’ll get something like (p_5 p_4 p_3 p_2 p_1 + * + *), where the p values come from the magic place we all reach for arbitrary constants.⁴

But set those two concerns aside until next time, and let me hand-wave in the direction of the “pre-image” one by saying that I’m concerned here mainly with very large blocks of RPN code, and how they end up working.

So: When we run a block of code, we end up with a series of values on top of the stack. Subsequent operators may consume items from the top of the stack, but every value present is, at some point, right there on top. If we were to stop calculation at that time step (and ignore any unexecuted code), that item on top of the stack at that instant would be “the answer”.

In this example (copied here from above)

STACK                                                    PROGRAM
[]                                                (k_2 y k_1 / k_3 / * k_1 k_2 * cos k_3 x *)
[(k_2)]                                           (y k_1 / k_3 / * k_1 k_2 * cos k_3 x *)
[(y) (k_2)]                                       (k_1 / k_3 / * k_1 k_2 * cos k_3 x *)
[(k_1) (y) (k_2)]                                 (/ k_3 / * k_1 k_2 * cos k_3 x *)
[(y k_1 /) (k_2)]                                 (k_3 / * k_1 k_2 * cos k_3 x *)
[(k_3) (y k_1 /) (k_2)]                           (/ * k_1 k_2 * cos k_3 x *)
[(y k_1 / k_3 /) (k_2)]                           (* k_1 k_2 * cos k_3 x *)
[(k_2 y k_1 / k_3 / *)]                           (k_1 k_2 * cos k_3 x *)
[(k_1) (k_2 y k_1 / k_3 / *)]                     (k_2 * cos k_3 x *)
[(k_2) (k_1) (k_2 y k_1 / k_3 / *)]               (k_2 * cos k_3 x *)
[(k_1 k_2 *) (k_2 y k_1 / k_3 / *)]               (cos k_3 x *)
[(k_1 k_2 * cos) (k_2 y k_1 / k_3 / *)]           (k_3 x *)
[(k_3) (k_1 k_2 * cos) (k_2 y k_1 / k_3 / *)]     (x *)
[(x) (k_3) (k_1 k_2 * cos) (k_2 y k_1 / k_3 / *)] (*)
[(k_3 x *) (k_1 k_2 * cos) (k_2 y k_1 / k_3 / *)] ()

…the series of items on top of the stack after each token is handled are:

(k_2)
(y)
(k_1)
(y k_1 /)
(k_3)
(y k_1 / k_3 /)
(k_2 y k_1 / k_3 / *)
(k_1)
(k_2)
(k_1 k_2 *)
(k_1 k_2 * cos)
(k_3)
(x)
(k_3 x *)

Note that this is different from the pile of stuff left on the stack at the end of execution. In top-to-bottom order, that collection is

(k_3 x *)
(k_1 k_2 * cos)
(k_2 y k_1 / k_3 / *)

All of those appear, at some point in execution, on top of the stack. There’s a complicated relation between what’s left and what’s appeared on top of the stack, but it certainly seems—at least for this small suite of operators—that what’s left will always be a subset of what’s appeared.

I can say the same sort of thing about the intermediate “steps” of evaluating a tree, like (+ (* x k_1) (* k_2 (* x x))), as well. As I’ve mentioned, there are several possible algorithmic approaches for traversing the whole tree to get all the arguments for the root operation, but I might be tempted to follow a depth-first traversal⁵ first. If we did that, I think we’d get a “series of top items” (which are really “intermediate values”) something like:

(x)
(k_1)
(* x k_1)
(k_2)
(x)
(x)
(* x x)
(* k_2 (* x x))
(+ (* x k_1) (* k_2 (* x x)))

An alternative—which may give a different series of steps?—would be to translate the tree into an RPN form, and then execute that as before. But in either case, we have a series of “things on top of the stack” to look at. If we did the same with the “almost right” trees I showed above, we’d see the “right” expression appear at some point, but then it’s superseded by a subsequent operation.

But note that there are nine tokens in the tree (+ (* x k_1) (* k_2 (* x x))), and nine intermediate values on this nominal “stack”. If we undertook the same series of “intermediate steps” to calculate a (ridiculous!) thousand-token tree, tracing the tree in a depth-first order and jotting down all the intermediate sub-trees we encountered along the way, then we’d have a thousand items on that list. One step, one token, one new thing on the stack.

You therefore may be wondering: What’s the practical difference between the stack-based and tree-based representation schemes? One thing you may have noticed, indeed right here in this very section, is that I didn’t speak of the “tree-based equivalent” of the RPN example I used: (k_2 y k_1 / k_3 / * k_1 k_2 * cos k_3 x *). That’s because there is no single “tree-based” version of that code block… at least not unless we use something a lot like Clojure’s do function.⁶ When I worked through that code block above, even though it used no “pre-image” items, recall that it ended up with three separate items on the execution stack when it was “done”.

You may be starting to see where I’m going. I’m about to argue (again, since I’ve written about it before) that we should treat every item that ever appears on top of the execution stack as a sort of “guess”. As I said above, we want these guesses to be accurate (compared to the training data we’re using), and also want them to be prompt (early in the series of values). If we look at the hundred guesses a 100-token program produces, we can whittle those down to the ones that are simultaneously most accurate and prompt.

To do that, we’ll have to talk about how to compare two programs with one another, when each one may have more or less accurate guesses, made at earlier and later stages of execution.

Let’s look at some actual code… but next time.

In a lot of genetic programming work, you’ll encounter what’s called “protected division”, a specialized function which (unlike real division) returns a constant like zero whenever the numerator is zero, rather than being undefined like real division is. I tend not to use “protected division” in my work, and instead prefer to define the operation in the more traditional mathematical way: you can assume that this division operator returns a runtime error if the numerator is zero. What happens to that runtime error is for another day, but in my Push interpreters what I do is maintain a stack of type :error, and push it onto that stack instead of the :number stack. I suppose the result is just a different kind of “protected division”, which returns one of two types of result when called, but it just feels a bit closer to the standard mathematical understanding, for me. ↩
Including the excellent Pareto GP algorithms, which anybody doing symbolic regression with trees should always be using. ↩
It can be worse in more expressive GP contexts. Imagine having not just these arithmetic operators, but also boolean logical values and operators. Then you’ve got the potential for lovely little tricky constructs just like the bad code you and I write ourselves, like (if (< x y) (+ x 0) (* 1 x)). Note that the same result arises in both cases, but that we’ve spread them apart in a conditional tree. Now imagine that conditional tree was a lot harder to read, and make the results look more like the ones in my list above, plus maybe throw in some more conditionals just like these in those results…. ↩
This is more like currying the code block as if it were a function, I suppose. The other (the Push rule) is more like “failing silently with a warning”, as long as you record a little warning message if it happens. ↩
I suppose you might want to try a breadth-first traversal too, just to see if it’s different. Surely things happen in a different order, but does it have the same chunks on the list? I suspect there may be trees where they’re different, but I haven’t checked. ↩
If that were the case, I guess the equivalent “prefix tree” would be something like (do (* (/ k_3 (/ k_1 y)) k_2) (cos (* k_2 k_1)) (* x k_3)), as long as do accepted an arbitrary number of arguments…. ↩