More gentle comparisons between Push programs

Draft of 2016.07.15

May include: Push ↘ GP ↗ learning in public ↖ &c.

There are a few deep differences between the way a Push program “works” and the way we’ve come to expect “human-written” code to behave. A lot of the difference involves the dynamical structure. Push is often harder to understand than the worst “spaghetti code” you’ve ever seen a human being write. But some of the most interesting things an evolutionary search can build demands a lot of very weird self-rewriting spaghetti code.

But no matter how a Push program does it, we are usually able to ask it for an answer of some sort. What I’ll call the “traditional” approach (in Clojush for example) is to examine the top item on the appropriate stack when we stop running the program. If we want an :integer as an answer, then we look at the :integer stack and pop its top value and compare that with the desired output to calculate an error value. If we’re want a particular :string as an answer, we pop the top :string item and measure its rewrite distance (or something) compared to the desired :string value. And so on.

Today I want to spend some time justifying and exploring a very different approach, one that—I think—can “help” the search algorithms we use tackle much more complex problems. Tomorrow I’ll go into a lot more technical detail, and do some experiments; this is a backgrounder of sorts.

Push “return” values

Because you might have arrived here from anywhere, let’s review quickly.

Push is an expressive, robust language for evolutionary computation. By “robust” I mean something very particular to the genetic programming use case: it’s designed so that arbitrary chunks of code can be chopped off and rearranged, and still do something. Push has a “syntax” and “semantics” that permit arbitrary rearrangement of programs at the level of individual tokens. Whether we’re using it in a “traditional” genetic programming setting or some more general software synthesis work, we’ll be making a lot of Push programs out of more-or-less random parts, running those programs, and trying to evaluate how closely they conform to our desired outcome.

Push’s unique structure permits that sort of man-handling. Any arrangement of arbitrary tokens (from the language) will do something, and without causing a syntax error.

Most software synthesis projects are focused on the discovery of functions. That is, code that takes one or more user-specified input values (“arguments”), and produces one or more user-desired output values. The classic example is of course symbolic regression, in which we have some collection of quantitative data and are searching for some function that fits it. But the goal could as easily be the discovery of a database search algorithm that efficiently produces a correct subset of records, or an image synthesis program that produce the right arrangement of colored pixels, or a robotic controller that moves an actual robot around in a desired path in the real world.

All doable.

But. The same characteristics of the Push language that make it possible for us to chop up programs and rearrange them at a very fine level, also means we have no equivalent to the traditional “return value” from human-writeable languages. Push code simply moves things around and transforms them in an abstract collection of stacks. We know our “answer” is on one of the stacks (or possibly, in the Klapaucius interpreter, in one of the bindings), because there is no other persistent state besides those stacks and bindings.

And it’s worse. There’s no particular time when a Push program is “done”. By convention we assume that when there are no more program steps waiting to be processed, it’s “done”… but as with most programming languages, that doesn’t necessarily happen at any particular time. There are loops and recursion in Push, and just like every component those can get rearranged and restructured and produce arbitrary behavior. And in dynamical settings—for instance where the Push program running an agent is communicating with some other Push programs controlling different agents—it’s entirely possible that new code can be received and executed by a Push interpreter we would think of as being “done” and “halted” if it were isolated.

As a result, the Push research community has made some design decisions about when to call a program “done”, and what to examine as if it were an “output value”. The most common approach is to let the Push program run—in isolation, with access to the input values we want it to respond appropriately to—until it runs out of code to run. But if it doesn’t run out (for instance if it’s in a loop, or generating code some other way faster than it’s consuming it) by some fixed number of steps, we’ll set a time limit at which we halt execution by fiat.

When it does stop, and we examine the interpreter, we will find zero or more items on the stack we’re interested in. If we’re interested in numerical outputs, we can examine the :scalar stack in Klapaucius (this would be :float or :integer in Clojush).

Here’s an example of what happens to the :scalar stack, at each step of execution, as I ran a Push program just now:

Each of those 442 lines is a snapshot of what the :scalars stack looks like, at that step of running an (arbitrary) Push program in Klapaucius. The stack is empty sometimes, it grows and shrinks in other periods, it has new items added and consumed, things are even moved around here and there. The top of the stack is shown on the left: when the interpreter pushes a new value onto the stack, it will shift the rest of the items towards the right, and when the interpreter consumes items as arguments, they will be taken from the left end, shifting the remaining items leftwards again.

So in the traditional approach to evaluating a Push program’s behavior, we’d run this particular program all the way to the end, look at the stack, and state the “answer this program produced” was -1. But take a moment to scan back over the history of the the stack, and think about where that number came from in the course of executing this program. Was it calculated? Was it actually a constant that had been sitting around at the tail end of the program? It’s always hard to tell.

But in this case I can explain it, and it’s worth walking through it briefly. Back in the 76th step of running this particular program, a :snapshot was created and stored. If you scroll back to around then in the gist, you’ll see the -1 sitting there at the top of the :scalar stack.

Now a :snapshot is a sort of saved state in Push, and by convention, when an interpreter runs out of code to run on its :exec stack, it will pop the saved states from the :snapshot stack and run that. In this particular program running with these inputs, the -1 arose back around step 70, was stored away for the next 350 steps while we ran a “side process”… and then when the “side process” was finished the execution stack was popped and the stored values magically reappeared.

I think this should be bothering you.

Most of the effort expended by this program in the course of its run was in the Push equivalent of a “separate process”. All of it, that whole 350 steps, produced a :string that looks like “å|NW|i”. That was the only thing returned from the subprocess; all the :scalar numbers it had made and fiddled with along the way were discarded when the process stack was popped.

But we are using Push to run arbitrary code. Who knows what it’s doing? What if this particular program hadn’t finished by the time I got bored? In my testing I see about 10% of random Push programs (like the one that produced the stacks above) running for 3000 steps or more, typically in long loops of some sort. Suppose I had stopped the run at step 400? The answer would’ve been… 180206783 instead of -1. What if I’d stopped at 300? 767389167M. What about 299? 52.9453125

And so on.

Asking for a lot

When we run Push programs with a goal of finding a function, we often think of the task we’ve given the search process as being, “For this input, give me this output.” We assign an error measure, and blithely take the top :scalar (for example) as signifying “what the program says”. But the Push language itself assigns no special significance to that stack or its top value.

In effect, we’re not asking for “the answer”, because that question and how we approach has literally no meaning to the Push interpreter or the search algorithm we apply to a problem. We’re asking very different questions:

tell me what number I’m thinking of
tell me when I say “stop!”
write your guess in exactly this position
don’t do any weird extra shit

Downhill from here

One of the key insights from the initial successes in the 1970s of both neural networks and evolutionary computation is the core notion of gradient following.

In neural networks—even Deep Learning ones—our search algorithms approach the problems in a way that uses error measurements made in the course of learning to direct subsequent steps. The metaphor (a problematic one, as it turns out, but that’s for another day) is that any given starting condition of the neural network search process is sitting somewhere on an abstract “landscape”. The error measurement assigned to that current (bad) solution is the “height” of that particular point, and the error measure we would assign to closely-neighboring also-bad solutions is like a rolling or ragged mountain. Our goal is to minimize the error as much as possibly by changing the state, and so in this analogous landscape our goal is to get down as low in elevation (error) by changing position. A small step in the landscape is a small change in the constants of the neural network, and if you think about it as walking, you can easily imagine taking small local steps to move yourself (your neural network’s constant values) downhill from where you are now.

A similar landscape metaphor can be useful in evolutionary algorithms. Any possible solution can be evaluated and assigned an error score, and any possible solution can be changed into “neighboring” solutions. The ideas of “neighboring” are a bit more complicated here, because we have a population of different solutions all at once, and “neighboring” means something much more complicated because we often mix the solutions’ parts in arbitrary ways. But awkward as it is, the metaphor is a favorite in the field: In an evolutionary search, we expect that a population of solutions can learn the structure of the landscape of a given problem, and move itself “downhill” to new locations over time, until the error is as low as possible.

As with neural learning and most other machine learning algorithms, in evolutionary algorithms we should be very concerned with the ruggedness of that metaphorical landscape. Insofar as we think our algorithm can work at all, we do so because we believe it can use local information to move reasonably around in that landscape. That “local information” isn’t some glib folk use of the term: think of it as actual mutual information our algorithm needs to extract about the “past”—what it has seen before in the error landscape of its search up to now—and leverage to produce a productive outcome in the “future”—the prospective solutions this algorithm will explore, conditioned by the “past” it’s seen.

Because of the No Free Lunch Theorem for Search, we can say that the worst possible algorithm for moving around on a landscape is random guessing. That is, picking new prospective solutions from the set of all possible ones, without paying any attention to the structure or performance of any prior solutions. Eventually you will find a very good solution, or even the best, but even if you’re careful not to resample the same solutions more than once, “eventually” will be a very long time indeed.

The fear of “rugged landscapes” we have in machine learning should—at least in a broad sense—be clear now.

Let me push us into the analogy all the way, and say “you” instead of “the algorithm” for a while: Imagine you are on a landscape, and your goal is to move downhill… but you’re in a place you literally can’t see anything besides the point you’re standing on.

So you take a small step¹ and measure a new point. Now you have more information. If your old position’s altitude is lower (better), then you have stepped uphill, and maybe you should try a different direction next time, or even go back where you were and go the other way. If the opposite direction is also uphill, then maybe try a third direction, and so forth. Eventually, with enough little steps and a good enough memory, you might be able to make it down off the mountain in the dark.²

“Ruggedness” is (speaking metaphorically still) how hard it is to infer one step from the prior ones. In “smooth” landscapes, when I take a step in one direction and my error goes down by 0.001, I can expect another step in the same direction to have about the same effect. The smoother the landscape, the more consistent, over time, this prediction will be. The more rugged the landscape, the less certain I am that what I did just now will still be useful very soon. In other words, ruggedness captures (factoring in the algorithm itself and fundamental information about the problem instance being searched and some other stuff too) the mutual information we can possibly gather—if we are perfect and optimal and do everything exactly right—from our past samples of the landscape. If there is more ruggedless, there is literally less information we can use which no matter what we do bears on our future samples of the landscape.

Back to Push

Push is a language designed for evolutionary search settings. Like I said above, it permits programs that do all kinds of computational work to be chopped up and recombined more-or-less arbitrarily, and still do something. Unlike most neural network models, the resulting algorithms can manipulate all kinds of abstract structures: numerical models, text strings, geometry, images, algebraic expressions, music, robots, bridges, and even the algorithms that search for new Push programs themselves. But this expressivity comes at a cost: it inevitably makes the landscapes for these search problems more rugged.

As I said, Push’s structure permits more-or-less arbitrary changes to programs, and the results will still run. But those changed programs may behave in wildly different ways. This is true of almost any complex system, but it becomes an immediate concern when we start poking round inside full-fledged computer programs.

Look at the collection of :scalar stack records I showed you, above. As I said, that -1 “answer” at the top of the very last step was hidden away because somewhere back around the 76th step it was stored and we launched a sub-process. Most of the rest of the program’s effort was spent inside that isolated sub-process.

Here’s the program, and the inputs I gave it:

Now suppose I delete one of the items in the :program, and run that variant program. What will happen to the way the program behaves? If there are (I’m guessing) 1000 items in this program, and make 1000 variations which each lack exactly one token, I will have 1000 variant programs to run, each containing 999 items. Those are all “small steps” in the world of genetic programming, but will they be smooth steps, or rugged ones?

It turns out it’ll be a bit of both. Much of the program consists of literals that get pushed to stacks and then ignored (even by later instructions), and instructions that don’t have arguments and thus fail silently. If I delete a literal that isn’t used, there will be a minor change in the :log stack by the time I get to the end, but that’s all. Same thing if I delete an instruction that doesn’t have arguments.

But in the variation where I delete the :snapshot-begin instruction that saved away our -1 around step 76, that number won’t appear at the end of the run at all. If I delete the :scalars-flush instruction that clears the stack around 3/4 of the way through the run, stuff will be left on the stack that would have been deleted, and (possibly) other instructions after that point which lacked arguments will now have those arguments, and other stuff will happen as consequences… and so on.

I think it’s a lot like an avalanche on a sand pile. When you add a grain of sand to the top of a sandpile, most times there will be no resulting avalanche. Sometimes (but less frequently) there will be small avalanches. Fewer times there will be medium-sized avalanches, and so on, until very very rarely you have huge, world-shaking earthquakes. Change at all scales. I think “producing results by executing code” is a sufficiently complex system that we might best model its internal complexity in the same metaphorical ways we model the hidden internal structure that holds up a sand pile.

Whenever there’s a big avalanche, or a major change in the observed behavior of a program, our “small change” has produced a “large result”. The change might have been beneficial, but (back in our walking-on-a-landscape metaphor), we have just fallen off a cliff. We’re in a better place now, but the cliff provides very little information for the next step we are about to take.

Mutual information of this sort is a lot like Goldilocks: If you are on a completely flat landscape looking for a tiny tiny hole somewhere out there, the local landscape never gives you any information about which direction that hole might be. If you’re on a completely random landscape that goes up and down randomly no matter which direction you move (or how far), the location of the lowest possible point is also not going to be obvious. Mutual information between locations (and throughout the space) is maximized when there is—in our analogy, at least—a gentle slope from anywhere on the landscape down towards the lowest point.

Like I said, there are lots of ways this landscape analogy is misleading, and don’t take it too much to heart. One trick I should point out, though: we may speak of our task as “finding the lowest point anywhere in the error landscape”, but there is an implicit sub-task hidden inside that. Suppose, because of the way our algorithm is “built” (even that word is tricky), we can pick any of four cardinal directions to make our next step. We take a little step in the N, S, E or W direction, and think about what we’ve seen so far, and make our next decision where to step. But suppose that along the way we start to notice that the E and W steps tend be a bit bigger, either up or down, than the other directions. The E/W direction is somehow more variable than the N/S direction.

A classic machine learning question to consider is: Which way will you tend to walk, once you notice that? The classic (and perhaps also the contemporary, since I haven’t paid much attention in the last 20 years) answer to that question is, as you might expect: Well, see, it’s complicated.

Simpler tasks

Whenever a task is too hard to work out in full, I know I do better in the long term when I break it down into components and tackle those incrementally. I’m about to suggest this same approach for evolutionary search for Push programs.

So to review:

A Push program does a whole bunch of stuff when you run it. The program listed above “does” 441 things, in the sense that it makes 441 steps when I run it with the specified inputs. A lot of that stuff is moving things from one stack (for instance the :exec stack, where the running program is stored) to another stack (for instance the :scalar stack). Some of that stuff is the execution of instructions, which tend to consume items from one or more stacks, do some work, and then put new things onto those or different stacks. Some instructions just rearrange things that are already on the stacks.
Except by convention, there is no privilege associated with the interpreter state at some “last” time point, especially when a program runs for a long time. We often stop execution arbitrarily, and often at an externally imposed time-limit, because there are always random Push programs that will run forever otherwise.
Because Push is an expressive, many-typed language, arbitrary Push programs can do all kinds of things we associate with “algorithms”. But because of the inherent inhomogeneity of “algorithms” and the parts they’re built from, some parts will have more or less influence than others in the overall outcome of execution. There is a lot we can delete from most evolved Push programs, without affecting the observed outcome of execution… but we permit that extra material because we think of it as “raw material” for innovative evolutionary change in the future.

At any rate, changing any given token in a Push program may have no effect or an arbitrarily large one on the outcome, and it is also prone to nonlocal effects so even we can’t say with certainty whether a part is important without running the program. Deleting the instruction that puts that right answer in the right place seems like it will “break” a program, but you can almost never be sure unless you check, especially if there is code being run after that deleted step.³
Except by convention, there is no privilege associated with the “top” item on any given stack. We often examine the top items—at an appropriate time step—as our prospective “output” value, but only because we think no value is any better than any other, and at least if there is any value on a stack, there will be a top one. But by that same argument: if there is any value on a stack, there will also be a bottom one, and we don’t, as a rule, take the bottom value as an “output”.

This last one is especially interesting to me.

Just in the course of running one Push program, like the one above, you can see all the stuff that happens on just one of the several stacks the interpreter uses. A lot of the action happens at the top of the stack, just by virtue of the way Push instructions collect arguments and push results, but the presence of a value at the top of a stack at a given time point is no guarantee that the value is “fresh”. Look at the :scalar stack in the last few steps of that trace I showed above:

...421 earlier steps...
()
()
(201.18359375)
(201.18359375)
(201.18359375)
(55445132 201.18359375)
(153861628M 55445132 201.18359375)
(153861628M 55445132 201.18359375)
(153861628M 55445132 201.18359375)
(153861628M 55445132 201.18359375)
(55445132 201.18359375)
(55445132 201.18359375)
(55445132 201.18359375)
(953358360M 55445132 201.18359375)
(953358360M 55445132 201.18359375)
(953358360M 55445132 201.18359375)
(953358360M 55445132 201.18359375)
(953358360M 55445132 201.18359375)
(953358360M 55445132 201.18359375)
(953358360M 55445132 201.18359375)
(-1 963186484M 0 369.453125 351.43359375 802246192M 899300736)

It’s empty at step 422, but there have been all kinds of comings and going in the preceding 421 steps. Then out of somewhere (possibly a literal value in the program?) 201.18359375 appears, and it sticks around on the bottom of the stack until—almost—the end. And 153861628M wanders on-stage, and leaves again, and 953358360M comes by instead for the remainder of the show… until the surprise ending. That’s when every number on the stack is discarded, and old numbers that were socked away in cold storage are trotted out instead.

What is our machine learning algorithm, evolutionary or otherwise, to do in this situation? Like I said, all those final values were stored back in step 76 or so. They are not “new”. What if 953358360M is a better answer than -1? What if 55445132 is even better, and to improve this program, we should delete the last half-dozen steps so it stays on top?

More generally, every Push program does lots of stuff. When we ask a search process to find some particular program that does our stuff, the only information that search process has on hand about the landscape of our problem comes from the information it gleans from the stuff “bad programs” are already doing.

We tell pretty stories, in the evolutionary computation field, about how “smart” evolution is. We especially like the way we imagine recombination “combines good parts” of multiple partial solutions, producing a hybrid that ends up doing even better. But in reality there’s nothing inherently better about recombination than there is in random guessing or making tiny incremental changes to the numbers, except in the context of a given algorithm being applied to a given problem.

All of our algorithms, whether they use linear algebra or biologically-inspired metaphors, are doing the same thing: They’re gathering information about the “landscape”, and trying to bring that information to bear in making further moves. The information we give them is the error score we assign them, given some input configuration and some desired output. We often call this “fitness”, but it’s really everything the algorithm knows, except what it is doing internally. Communicated error measurements are the only channel by which we can speak to a running algorithm.

So now think of a problem—just a training case, really, a single input:output combination—in which we have set up a program and run it and look to see if it has told us the number we want. Suppose it’s the program above, and we want the number 777 as a correct answer.

The traditional approach to scoring a Push program would be to run the program for all 442 steps, look at the :scalar stack, see the answer is -1, and assign the error accordingly. If the program had a better answer earlier (and it did!), we don’t care; the dynamics are not our concern, we seem to think. We don’t even care to note the fact that there are 7 numbers on the stack at step 442, and they all appeared there on that last step!

We simply score the program, for this training case, as having an error score of 778. Our search algorithm has only two sources of information in its world, and thus it “knows” exactly two things, and can only use those two things in its search:

It has a detailed history of this particular program’s behavior, involving literally thousands of bits of dynamical history. It’s running on a computer! It was there the whole time! Computers are very good at paying attention to this sort of details.
It has the score we give it, and so it also knows that as a whole that huge pile of dynamics it just executed—about which it has perfect information—was “off by 778”.

It doesn’t “know” that we looked at the top number on the :scalar stack. It doesn’t “know” anything about partial solutions it may have come close to, earlier in the process of running the program. It only “knows” that “when all of this happens, I get a score of 778.”

Walk a mile in my CPU

Suppose I asked you to write a program for me, and I did it by giving you an input like this one, which is the same one the program above was given:

    '{:OUTPUT (),
      
      :e (((false #push.type.definitions.interval.Interval{:min 122/75, :max 303.24609375, :min-open? true, :max-open? false} 250.546875 [0.13671875 0.0234375 0.11328125 0.1171875]) :scalars-do*each false false [" ÑËB°ª" "G¢âãÈÜ.-â&­fl}" "a¯F'tq5ÓX<"] :complex-conjugate :code-append :set-flush false \Z)),
      
      :g ((:interval-include? :booleans-length #{116.04296875 :boolean-equal? 766975258M 592141459 :booleans-remove 62/83 "s«gFX²¾.=" [true false false false true false true false]} :string-liftstack [\Ð \< \@ \ ] :exec-s 40164437M :tagspace-save :tagspace-min :generator-reset)),
      
      :c ((:intervals-new 672897750 #{:chars-replace #push.type.definitions.complex.Complex{:re 969631604, :im 199.3515625} :char-return 647395322 true 959383724M [\space] :scalars-savestack} :booleans-portion [0.01953125 0.03125 0.03125 0.0625 0.04296875] :refs-replace :string-tag :ref-equal? :set-echo #push.type.definitions.interval.Interval{:min 188/59, :max 290.7421875, :min-open? false, :max-open? false})),
      
      :j ((#push.type.definitions.complex.Complex{:re 234693697, :im 132.46484375} 767389167M 101883890M :ref-new "Ð¸¶" :string-cycler [0.04296875] :set-dup)), 

      :h ((100462571M "}>ÄÍÚÀÆª­zÜMVá·Y" :ref-conj-set (true true :char-liftstack :string-tag ["IrK¨»" "»²¤Ú!" "ZÜY" "Ìi£;C¨Îª­" "®p¬`x" "8X±ÞSRZYW" "$Ø§Q;ÀÈ"]) :push-bindingcount :interval-overlap? #push.type.definitions.complex.Complex{:re 336985437, :im 228.4921875} :float-uniform [false true false false true true true true true] [true false true true true])),

      :b ((#{:interval-liftstack :exec-storestack :vector-equal? :ref-peek ["¿È4`µÙ" "Í³" "h'gUÝb" "­" "Í2ÎÑ&S¹8­b¸Å"] false ([0.1171875 0.15234375 0.04296875 0.09765625 0.1015625 0.07421875] #{:booleans-build "Fr\"Û®g 6\\3" :string-swap #push.type.definitions.interval.Interval{:min 67/78, :max 209.1953125, :min-open? true, :max-open? true} [\y \Æ \p \© \) \, \ \Ô] :complex-multiply :complexes-swap [4138 4807 4251 4002 3626 864 3739 790]} :complex-stackdepth 462748995M \4) :interval-return} :code-map 963186484M [\k \ \å \Æ \â \Ä] :strings-indexof :vector-pt-crossover [] :exec->string :scalarsign->boolean [3586 2368 4517])),

      :d ((#push.type.definitions.interval.Interval{:min 43/27, :max 284.16015625, :min-open? true, :max-open? true} :booleans-shatter :string-notequal? \\ :intervals-contains? "QzS*|S:b¾àÛ·ÌIDeå" :refs-generalize :char-storestack [\« \])),

      :f ((:vector-tagstack :scalar-ceiling :boolean-later :interval-parts :booleans-empty? :code-position true 529460031M 369886695M :vector-rerunall)),

      :i ((#push.type.definitions.interval.Interval{:min 16/5, :max 279.73828125, :min-open? true, :max-open? false} ["¯Ó¬ÈÌ1{hgV£{A]Út" "mÂz¯¢ `¶NXx½"] :booleans-butlast :vector-tag 497278896 \i "=7j¢Â1Ø¬;0hXh" :boolean-3bittable :vector-fillvector)), 

      :a (([true] :ref-storestack :generator-cutflip [false true false true false false true] "ÊE6F­Ê" #push.type.definitions.interval.Interval{:min 48/25, :max 45.97265625, :min-open? true, :max-open? false} :interval-recenter [false false true false false false true] [0.08203125 0.1328125 0.09375 0.0390625 0.08984375 0.078125 0.1171875 0.09375] :string-replacefirstchar))}

That’s just 11 input variables, labeled :a through :j plus an empty one called :OUTPUT, and their assigned values. Sure, they’re complicated, but they’re just arbitrary stuff. Code is data, dude! I still tell you to guess the right answer, and I look at your work, and say your score is 778. What is the next thing you should do?

I think your next move might be any of several different ones, and that a lot of them involve flipping the table over. But I bet it would absolutely not be excitedly saying “Oooh! Oooh! I know the right answer!”… and then telling me the right answer.

We’re in this together. It’s not a programming interview with a whiteboard coding task, it’s a collaborative search for new and innovative algorithms. I want to help you help me. So I want to give you more information than just “all that stuff you just did produced an error of 778”. I can’t really give you more information about what you’ve done, because you’re a computer and you have a very good memory for details like that.

But I can give you more information about how well you did on the tasks I have in mind. There are several tasks, really, like I said above. We often say it’s “one task”, but as soon as we get a literal answer for our “one task” the back-pedaling starts and the other tacit and implicit tasks start to appear.

To review, those tasks are:

give me the right value
give me that value when I ask you
put the value in the place I look
don’t do a lot of weird extra shit

My feedback “778” has only provided partial information, and it may only apply to a few of these tasks. I haven’t even told you which ones I used to base my score on!

More information

What do I really want? In this particular case?

Give me the right value: I want the number 777.
Give me that value when I ask you: I want it on the earlier of (1) the program’s last step, or (2) the 1000th step.
Put the value in the place I look: I want it to be on top of the :scalar stack.
Don’t do all that weird extra shit: Ideally, I want you to give me the right answer on the first step of execution, and then stop doing things.

I can see a few ways to provide the search algorithm with more useful information about “partial success”. We already give it information about how “wrong” it is, in the form of the error score. But we might also give it information about how mis-timed the “answer” was, or how misplaced the “answer” was, or how much extra chaff we think there is.

When

If I wanted to talk about time—and particularly if I wanted to communicate new information to the algorithm that might help minimize error with respect to timing—perhaps I would look at the top item on the :scalar stack at every step of execution. Maybe the poor program did what we wanted—or more saliently, something closer to what we wanted—back at step 78, and then moved on. Here is the “top” item observed on the :scalar item for each step, of execution of this particular program; “nothing was on the stack” is indicated by ???:

Now I said very carefully that I want this answer on top of the stack at the last step of execution. But there were (some) other prospective answers there as well, at earlier steps. I wonder if there is something useful in that?

Where

If I wanted to talk about position on the stack—and particularly if I wanted to communicate new information to the algorithm that might help minimize error with respect to positioning the answer in the stack—perhaps I could just look at the entire :scalar stack at the last step of execution. There are several numbers there. For this particular program running in these particular conditions, the :scalar stack at the time I wanted an answer was:

(-1 963186484M 0 369.453125 351.43359375 802246192M 899300736)

Now I said very carefully that I want this answer on top of the stack. But there are other prospective answers there as well, deeper in the stack. I wonder if there is something useful in that?

Effort

If I wanted to talk about efficiency, I might mean any of several things. It might be that I want to minimize the number of steps it took to execute the code. It might be that I want to minimize the number of things pushed onto all the stacks (the :string and :boolean and :vector and :tagspace items), throughout the entire execution.

There are other things I can imagine. This one feels complicated, and in a way it’s more of a reaction to seeing a good answer that lacks “secondary” characteristics I desire. But I can see some ways to frame at least some of those desires—those tasks—as minimization goals.

Multiple Criteria

Those proposals are only slightly more formalized statements of the four (or more) tasks I’m asking the search algorithm to do for me. I don’t believe I can say there’s been a “full success” until all of them are complete. Typically when I see people doing machine learning projects—and these four (or more) tasks apply in all the machine learning projects I’ve ever seen—they start with one, usually the first, “error”. Then they get some low-error solution, and they start to think of the others. They start to say, “Well, fine, but it’s too slow,” or “Well, fine, but it’s not parsimonious….”

Put yourself in the place of the search algorithm right now. Think about how passive-aggressive that would feel. You try your best. You have done exactly what was asked, all along, and minimized error, and now that you have almost succeeded, that’s when they start to ask about “parsimony” and “efficiency”?

I suspect I’d cry.

It would be nicer, I think, if we let our search algorithms know at least a bit more about those other goals. I don’t want to aggregate them together into a single score, because that just makes the subjective experience of doing a search even more confounding from the algorithm’s standpoint.

Instead, I’m going to suggest we give the algorithm information on several scores. One for each task.

Suppose that instead of constraining the “returned guess” our program provides to be the top number on the stack at the last step, we treat every number ever placed on that stack as a prospective “guess”. Any program that runs gets as many “guesses” as the number of things ever present on the stack we are interested in.

We still can compare these guesses to one another, and compare two programs to one another on the basis of the guesses they make, using the concept of Pareto efficiency to compare the vectors of scores we assign them.

Here’s an analogy for you. Suppose you are considering buying a car, and you want one that is all of these:

cheap (min cost)
least dangerous (min accidents per year reported)
fuel-efficient (min gallons per 10000 miles)

When you look at the cars on the market, you find

;; cost       accidents       GP10k      
[  20000       190             200]
[  30000       290             800]
[  60000       160             500]
[  30000       100             300]

Which is “best”?

Well, the second car is worse than the first on all three criteria. The third car is much more expensive than the first, and less fuel-efficient, but appears to be safer (I imagine it’s an SUV). The fourth car is a cheaper, safer and more efficient than the third car. But it also is more expensive and less efficient than the first.

It feels like the first and fourth cars are “best”, in the sense that no other cars are clearly better at everything. The second and third ones are not “best”, in the sense that there are other alternatives that are clearly better at everything. I may not be able to pick which of #1 and #4 is “winner” based on these criteria alone… but then again, if two people get exactly the same maximum number of points in a single-objective competition, I’m happy enough to say they have “tied for first place”.

So both the first and fourth cars “win”, and the other two “do not win.”

Notice that I’m not aggregating these scores into some single score and ranking the choices by the sum or weighted average scores. Instead, I’m treating them as vectors, and comparing them in terms of which dominates the others.

In multiobjective comparisons, we formally say that one vector dominates another when it is at least as good at all criteria, and better at at least one. In this case, where I’m carefully minimizing over all objectives, we can simplify that slightly by saying:

One vector dominates another if its elements are at least as small as the second one’s elements, AND at least one element is smaller

You’ll probably need to think about it. It sounds roundabout, but there’s no simpler way to say it, and it actually makes sense. The “AND” is also very important.

Here are some examples, again presented under the assumption that lower numbers are “better” in all cases:

[1 2 3] dominates [2 3 4]
  - all its elements are at least as small
  - at least one element is smaller

[0 2 3] dominates [1 2 3]
  - all its elements are at least as small
  - at least one element is smaller

[1 2 3] does not dominate [1 2 3]
  - all its elements are at least as small
  - NO element is smaller

[2 3 4] does not dominate [1 2 3]
  - NOT ALL of its elements are at least as small
  - NO element is smaller

[1 2 3] does not dominate [3 2 1]
  - NOT ALL of its elements are at least as small
  - at least one element is smaller

Multiobjective comparisons are relatively simple once you get the hang of them. They’re just a helpful generalization of familiar scalar comparisons you’re used to. Think of “regular numbers” as “vectors of one element”: [8] is dominated by [3] because all the elements of [3] are at least as small as the elements of [8], and at least one is smaller.

Winners and losers

Now let me quickly revisit the car comparison example, ad finish it up.

Here are our four contenders, again, labeled this time

A  [  20000       190             200]
B  [  30000       290             800]
C  [  60000       160             500]
D  [  30000       100             300]

and here’s a matrix in which I’ve labeled which of the four dominates the other four:

        (does column dominate row?)
             A     B     C     D
        A    -     -     -     -
        B    Y     -     -     Y
        C    -     -     -     Y
        D    -     -     -     -

Neither of the cars I’ve labeled A or D is dominated by any of the others. So as far as these three criteria are concerned, I can at least say that I am making a poor choice if I pick either of B or C, because there is at least one other car in each case that is just as good for all criteria, and better for at least one.

And we’ve just performed multiobjective selection on these cars.

Selection

As I said, the scores we assign are essentially the only information we provide to our search algorithm.

But barring some rare exceptions, over the last decade-and-a-half of scoring Push programs (and by extension all the other dynamic genetic programming representations that “do work” or “are executed” to determine an answer, like Cartesian GP, Artificial Chemistries, Linear Genetic Programming, or anything even slightly more stateful than a simple symbolic regression S-expression), folks have used error as a selection criterion.

In fact a lot of young researchers I’ve known have only used error as their search criterion, been disappointed in the speed or complexity of the results they obtain, and then spent months or years trying to “fix” the algorithm or representation they’ve built so that those other tasks are also handled. It makes me sad.

So now I can sketch what I’m suggesting we try instead.

Take that program I’ve been mentioning all day. This is (repeating the gist from many thousands of words ago) the collection of all the :scalar values it has, at every time step. Each line is a step of the program, top-to-bottom. Each number is an item on the stack, top-at-the-left, bottom-at-the-right.

I’m going to call this program Ashley, or “A” for short.

Now suppose we are trying to select which of two different programs to breed, in some next generation. There’s some other program around, too, which is a relative of Ashley but not identical. Call that one Bobbie, or “B” for short. Here’s what Bobbie has on its :scalar stack, when I run it:

Both Ashley and Bobbie run about the same number of steps. If you scroll through their traces, you’ll see that they follow very similar patterns of… well, number-moving. That’s all we can actually see, looking at the time-series of :scalar stack lists like this, but it’s helpful enough.

Which is better, Ashley or Bobbie?

There is a lot more information here than there was in the example with the cars I gave above. But this is the raw data of their behavior, not a measurement of their performance. It’s as though I had driven each of the four cars in my example several hundred times, and each time jotted down how far my trip was and how much gas I bought.

What I want to do is boil down this huge, complex dataset into a measurement that represents Ashley’s or Bobbie’s performance with regards to each of my four criteria. Ashley has guessed 3486 times (over the entire run), and Bobbie has guessed 3749 times. Some of those numbers are the same, some are missing, some fall in different orders.

What are my criteria?

Error, first. So I want to know how different each guess is, compared to the value I want. If I say that value is 777 for these inputs, I can most easily assign an error to each “guess” which is simply the absolute distance from the desired value, or \(\vert x_{target} - x_{guess} \vert\).

What about time? I want the answer I see at the earliest of the 1000th time step, or the last one executed. Let’s say that the “timeliness” of a guess is how far back from the last recorded time step it appears. Thus, the “timeliness” of each guess that appears in the last stack is 0, the “timeliness” of the guesses in the next-but-last stack is 1, and so forth.

What about position? Well, I want the answer to appear at the top of this stack. So the “depth” of a guess is its order within the stack (at the particular time point) in which it appears. For Abbey this means the seven guesses range in depth from 0 to 6 in the last time point, for example. And for cases where there’s no guess at all, we don’t have to worry about assigning a value!

And finally, what about “doing all that extra shit”? I’m really talking here about parsimony, and the fact that we all (in theory) desire an ideal program that gives us the right answer immediately, as the first thing it does, and then does nothing else. But that concept is a bit hard to minimize over, so for consistency let me perhaps call this “extravagance”, and say it’s the total number of other guesses made.

That one might make you squirm a bit, especially if you’ve played with evolutionary search and the trickiness of selection before. It feels a bit indirect, and we have all of us in the field seen examples of evolution finding tricky work-arounds for our well-intentioned goals and constraints. But I’m sketching here, and perfectly willing to let things change if we see something untoward in a bit.

Here, then, is a sketch of the vector I want to calculate for each guess:

error: The absolute difference between this number and 777
timeliness: The number of steps appearing after the one where the guess was made
depth: The number of things appearing above the guess in its particular stack
extravagance: The total number of other guesses, before or after

The last one is constant for all the guesses one program makes in a given context. But recall that we’re not comparing guesses within the ones given by one program. We’re trying to compare between the behavior of two programs running with identical training cases.

I’m not going to share the embarrassingly-convoluted Clojure code I just kludged together to calculate these vectors. But I will share what they seem to look like.

Here is what I get for Ashley and Bobbie:

That seems like a huge amount of work for a conjecture, doesn’t it? And there’s more information here than there was when I started munging these numbers around into vectors. How can this be “better”, or even more “informative”?

Ashley returned a “traditional answer”, at the top of the stack in the last step, with a measured error of 778. Bobbie returned one with an error of 953357583. Clearly Ashley is better than Bobbie.

But look more closely: The 4-element vector “scores” for just the last step of each program are

A: ([778          1 0 3486]
    [963185707M   1 1 3486] 
    [777          1 2 3486] 
    [407.546875   1 3 3486] 
    [425.56640625 1 4 3486] 
    [802245415M   1 5 3486] 
    [899299959    1 6 3486])

B: ([953357583M   1 0 3749]
    [55444355     1 1 3749] 
    [95.81640625  1 2 3749] 
    [873585394M   1 3 3749] 
    [202088330    1 4 3749])

There’s more going on here. Bobbie has a much lower-error guess than Ashley provides in the last step. What happens if we look at all the steps?

What are the non-dominated guesses Ashley and Bobbie have made? That is, within their own pile of several thousand guesses, which are such that there is none clearly better?

Ashley:
   [407.546875     1 3 3486]
   [407.546875   341 2 3486] 
   [407.546875   346 1 3486]
   [407.546875   351 0 3486]
   [499.52734375 218 0 3486]
   [575.81640625   2 2 3486]
   [575.81640625   9 1 3486]
   [575.81640625  17 0 3486]
   [777            1 2 3486]
   [778            1 0 3486]

Bobbie: 
   [0            178 0 3749]
   [95.81640625    1 2 3749]
   [95.81640625    7 1 3749] 
   [95.81640625   15 0 3749] 
   [55444355       1 1 3749] 
   [55444355       7 0 3749] 
   [953357583M     1 0 3749]

Wow. A huge proportion of all the “guesses” are dominated—even within one program’s execution—by other guesses. Oh, and Bobbie has actually guessed exactly 777, back some time (178 steps before terminating), but on top of the requisite stack. Does that mean that Bobbie has a better chance at solving the problem?

Well, remember that this is only a single comparison, with a single input. Odds are this is just an explicit constant value appearing in Bobbie’s program, and doesn’t reflect a response to the input values. If we were looking at a different training case, this might be a bad thing to do.

What happens when we now compare the “best guesses” of each program to the other’s?

overall "bests"
     error       timeliness  depth  extravagance     who?
    [0              178        0       3749]        Bobbie
    [95.81640625      1        2       3749]        Bobbie
    [95.81640625      7        1       3749]        Bobbie
    [95.81640625     15        0       3749]        Bobbie
    [407.546875       1        3       3486]        Ashley
    [407.546875     341        2       3486]        Ashley
    [407.546875     346        1       3486]        Ashley
    [407.546875     351        0       3486]        Ashley
    [499.52734375   218        0       3486]        Ashley
    [575.81640625     2        2       3486]        Ashley
    [575.81640625     9        1       3486]        Ashley
    [575.81640625    17        0       3486]        Ashley
    [777              1        2       3486]        Ashley
    [778              1        0       3486]        Ashley

There’s a lot going on here. Just counting who has contributed these “best” guesses, it looks like Ashley is the winner, having provided more non-dominated guesses overall.

But it’s not that simple, and I’ll go into a lot more useful detail in the next episode. For now, let me point out what happens when we drop that rather complex-feeling extravagance metric, and focus instead on who guessed the closest answer nearest the end of the run and nearest the top of the stack:

     error       timeliness  depth   who?
    [0              178        0]   Bobbie
    [95.81640625      1        2]   Bobbie
    [95.81640625      7        1]   Bobbie
    [95.81640625     15        0]   Bobbie
    [778              1        0]   Ashley

Bobbie has many more “best guesses” here, compared to Ashley. Should we take that as an indication that Bobbie is intrinsically better? It feels more, especially when I look over the trace of stacks itself, that this just indicates that Bobbie had a good number on the stack for some time. Is that better?

It depends, I think, on how we are planning to use this extra information in our search algorithms.

We’ll definitely have some things to explore in more detail next time.

Just to give you something to think about: If the answer we were looking for had been -9.2 instead of 777 for this input, not only would Ashley have returned exactly one non-dominated “best guess”, but that single guess would also dominate every guess Bobbie gave. With or without the extravagance measure.

Next time, I’ll greatly simplify this scheme, and try to make it seem much more useful than it probably does now….

By definition, all searches are “local” in this abstract space, which is one of the places where the landscape metaphor becomes most problematic. You don’t ever get to peek, and you can only measure a new solution by making that solution, so “movement” defines what is close or far away. If you think making small numerical changes to a neural network is “small” change, but modifying the way it’s wired together is “big” change, that’s a matter of habit and convention. If I permitted you to use a search algorithm for neural models that changed constants and modified architecture, “you” (the algorithm) would treat all those changes as being equally “close”, but would (perhaps) realize that some of them tend to be “more rugged” and others “smoother”. But they’re all equally local from your position in the invisible landscape. ↩
Another place where this analogy breaks down: If you stumble off a cliff, that’s awesome, because you’ve hugely reduced your error with a single step! ↩
This feature is one of the most exciting things about evolved Push code and the long-term possibilities of this approach. Like biological systems, many evolved Push programs use ad hoc approaches to problem-solving that permit back-up plans and alternative designs to coexist in a single “algorithm”. The resulting code is often organized much more like a biological system than anything an organized software writer would execute. You may think that’s a problem… but there are situations where it’s very interesting and useful. ↩