On the Automatic Generation of Other People’s Code

Draft of 2018.02.18

In the last episode I spent some time understanding how Lee Spector’s Clojush package works. My goal, then and today, is to use it to evolve FizzBuzz programs in the Push language.

For the most part, I spent last time describing how I gradually came to understand what a Clojush user is supposed to do to set up a new problem, mainly by looking over the included code in the repository. I ran the demo simple_regression.clj, and then spent a while trying to understand the apparently-necessary error-function blob that is present in that demo file. I needed to understand it enough to either change it, or write a new one that would work with my FizzBuzz problem.

All I was able to do in one day, it turned out, was run the demo from the off-the-shelf freshly-downloaded repository, and then look at the code for a number of problem definitions, in hopes of extrapolating. It was—for me, and maybe for some of you—a very interesting day. Along the way, I found myself doing some of these things:

trying to figure out the general shape (and a few details) of the huge pile of text Clojush dumps to STDOUT when you run it;
working through an evolved simple_regression Push program by hand, trying to make sense of it;
making a mistake while I was working through the evolved Push program (and its simplified version, too); this led me to believe there was a bug in the auto-simplification process Clojush uses;
later, realizing that the mistake I had made was in misunderstanding how the demo problem was explicitly set up in the configuration file (the input value was being pushed to two different stacks, where I assumed it was pushed to only the obvious one); as a result what was really happening was something more like: the unsimplified evolved program was robust to a missing “second” argument and so it did what I literally asked it to do and also what I had assumed I had asked it to do, but the “simplified” program was sensitive to a missing second argument, and only worked in the way I had literally told it to work…. (we’ll come back to this one);
feeling more confident about the evolved code, I looked at the configuration file, and couldn’t really understand how it was structured;
spending a while (most of an hour, wall-clock time) “mentally refactoring” the Clojure (not Push!) code used to specify the problem I’d been running, so I could better form my conception of what the various parts do;
being reminded that, of the hundreds of configuration variables shown in the info-dump by Clojush at the start of a run, the problem-specific file only set a few;
thinking that maybe the config file simple_regression.clj didn’t even “set” all of those variables it mentions; one was explicitly set to an empty vector [], which implies that either it was being intentionally un-set, or that somebody had previously set it to a value (now vanished) and then deleted the value and left the framework in place as a sort of “trace fossil” of its editing history.

A lot happened, though on the face of it not much seemed to be in service of what I’d planned at the beginning. Then again, it was a pretty normal day—in my experience—in cases where I’m using other people’s software.

And I learned a lot. Probably as much as I would have learned if I’d been forced to write my own software. We tend to imagine that discovering how to “use” something is a subjectively different experience from learning how to “make” something.

I’m not sure that’s true.

If I came across as overly-critical of Lee’s codebase, or of the people who’ve contributed and maintained it, or maybe of the Computer Science community as a whole… or of academic coding practices more broadly?

No. That was just normal criticism.

Keep in mind what I was able to do, within a very few hours:

I was able to automatically generate runnable code that did what I asked it to, just by giving it examples in the correct format. I mean really; think about what that sentence implies. Sure, simple_regression.clj might not have been my problem of interest, but it was close enough to be analogous to my problem of interest.

I took the time to learn what the Clojush codebase code was saying, both in the case of the problem definitions, and also the code it produced. But think carefully: I have to take the time to learn what any codebase is “saying”, and what the appropriate magical incantations are to use it on the command line, or the appropriate magical gestures in the UI. I have to do that whether I am learning to set up a virtual machine on a new cloud system, or process video with ffmpeg, or when I try to draw a vector diagram in an unfamiliar Adobe product that’s been updated three versions since the last time I touched it.

To do my work, I have to try to put myself in a place where I can understand the context of the author(s) of each section of source code I am reading. Only that way can I hope to understand not just the code’s syntax, not just its semantics, but to also get a feel for its pragmatics. What is this code “for”? in other words. Code is intentional, and almost by definition the intention of an author cannot be identical with the desires or intentions of a reader.

I walked through an evolved solution for the problem in simple_regression.clj last time, too. It processed a numerical input and “returned” \(4318=(17^3 - 2(17^2) - 17)\). It was very convoluted code. I’m not sure I could explain what it was doing, or “why”.

At the end of the walkthrough I asked, “How do you feel?”

I need exactly the same critical skills and techniques, whether I’m trying to understand Lee’s (and other colleagues’) Clojure code, or the automatically-generated Push code that Clojush produced. I need to know the syntax of the languages. I need a sense of the semantics of the languages. But most importantly, I need to ask What is this for? and What does this mean? and Where is the author going with this piece? Those aren’t matters of syntax or semantics. They are the same questions any created work makes us ask—at least if we’re paying attention.

So I want to make the case that every computer program does work beyond merely compiling, and running, and producing “answers”. A computer program does social work. Code is never intended to be written once and never read—even if it’s “research code”. Yes, code must be run, but then it also must be read by human beings.

What code communicates, when read, can never just be “How to do this thing”. There is no absolute Platonic ideal for “this thing”, that is not also contingent on some historical decision like “how we represent floating-point numbers” or “how we represent alphabetic characters”.

Code says much more about “How the author of the code understood the problem, and set about trying to address it”. This is true even of the most canonically “rational” and deterministic “code”: mathematical proofs. How many proofs of the Pythagorean Theorem do you think there are, anyway?

So. If I asked you to write a FizzBuzz program, with no additional information beyond some bare examples that say “take these numbers and print these strings”, what program would you write?

You would, with intentionality—and in some real way acting outside the specification I’ve given you—pick a language in which to write it, and pick a computer (and operating system) on which it would run.

You would interpret what “printing a string” meant—writing to a file, putting the string into a return value from a function, generating light on a screen, etc.

Maybe, in your experience, there’s a different FizzBuzz specification, and you don’t notice that I haven’t mentioned printing the number I give you in addition to the strings. So maybe you print the number and the string—which is the way FizzBuzz is described more often. I didn’t ask you for that, though.

But you do that. You write FizzBuzz. All the tests I gave you pass.

Thus, you give me what I asked for. And maybe a little bit more. Actually, what you’ve done by that point is already far more contingent and contextual than what most people imagine “programming a computer” to be. But those people don’t usually think about the meaning of the situation you (the author) or I (the user) find ourselves in. Nor its context. Nor its consequences.

What a thing means often reaches out far beyond what we typically bother to mention. I assumed you knew that I only want the string but not the number. I assume you will not do something “weird”. That what you give me will “generalize”.

Generalize to what? you might ask. What an annoying question, I would say.

And your solution (whether you are producing an app for a store, or a neural network, or a Push program based on my training data) is remarkably contingent. It’s shaped by and filled with decisions made by you (and for you) along the way. There are externalities and assumptions—and even mistakes you will have made—that may not affect the explicit validity of your work, but which will almost certainly affect my response to it anyway.

I might think some of them are bugs.

Maybe (like Joel Grus imagines) you decided to write FizzBuzz in a TensorFlow. It is exactly what I asked for, if it satisfies the criteria I laid out. What should I do, when I see that you’ve made that decision? What if you use an unexpected language? BASIC or Forth, maybe? What if you use an object-oriented approach, where I was expecting (but didn’t suggest) you write it in a functional language? What if you use Microsoft Word as your text editor?

How do I “read” that extra stuff you did? I never mentioned it. It never occurred to me, frankly. It’s pretty weird. I will be confused. I didn’t ask or expect to be startled or confused.

And yet I probably will be, no matter what you do.

Forging ahead by backing up

Last time, I was trying to understand the specification of an error-function in simple_regression.clj. I manually refactored it (not even running it, just trying to identify its parts mentally), and ended up moving from the original structure, which looks like this:

(def argmap
  {:error-function (fn [individual]
                     (assoc individual
                            :errors
                            (doall
                             (for [input (range 10)]
                               (let [state (run-push (:program individual)
                                                     (push-item input :input
                                                                (push-item input :integer
                                                                           (make-push-state))))
                                     top-int (top-item :integer state)]
                                 (if (number? top-int)
                                   (abs (- top-int
                                           (- (* input input input)
                                              (* 2 input input) input)))
                                   1000))))))
   :atom-generators (list (fn [] (lrand-int 10))
                          'in1
                          'integer_div
                          'integer_mult
                          'integer_add
                          'integer_sub)
   :epigenetic-markers []
   :parent-selection :tournament
   :genetic-operator-probabilities {:alternation 0.5
                                    :uniform-mutation 0.5}
   })

towards something that looked more like this, which (I think) does the same thing:

(defn run-with-input
  [individual input]
  (run-push
    (:program individual)
    (->> (make-push-state)
      (push-item input :integer ,)
      (push-item input :input ,)
      )))

(defn target-function
  "returns x^3 - 2x^2 - x"
  [x]
  (- (* x x x)
     (* 2 x x)
     x
     ))

(defn absolute-difference
  "returns the absolute difference between two numbers"
  [v1 v2]
  (abs (- v1 v2)))

(defn absolute-error-for-training-case
  "given an individual and an x value, runs the program with that x loaded as an input and an integer, and "
  [individual input]
  (let [program (:program individual)
        state   (run-with-input program input)
        top-int (top-item :integer state)
        expected (target-function input)]
    (if (number? top-int)
      (absolute-difference top-int expected)
      1000
      )))

(def argmap
  {:error-function
    (fn [individual]
      (assoc individual
             :errors
             (doall
               (for [input (range 10)]
                  (absolute-error-for-training-case individual input)
                  ))))
  ;;...
})

What was I doing, in undertaking this extra work?

When I looked at the Big Pile of Code in this function, I started by identifying segments of what seemed to be shared intention. Then I extracted those as isolated functions, which were connected back into the original structure via function calls.

This is basically a Refactoring 101 maneuver, right? It’s Extract Method, which (as Martin says right there on the page) is Turn the fragment into a method whose name explains the purpose of the method.

There are no names in the original Clojush codebase for these different facets of intention. But clearly—and I can say this because I have written enough blobby code like this myself—there are nested chunks of purpose in this piled-up function.

There was something in the pile that’s “for” calculating the target value for a given input.
There was something “for” quantifying the error between the result of running a Push program, and the desired result.
There was something “for” building a Push interpreter state, and loading the program of interest into that, and also setting up the input values in the correct place and order.
And of course, there is the root of this code’s intention, which is to build an ordered collection of things called :errors for each of 10 specified input values.

In a real way, I’m lucky that all the stuff this code is “for” is included right there in one relatively small pile. When I start trying to understand configuration settings in a moment, I will be searching the entire codebase looking for their meaning. But here, I just had to tease these chunks apart, after seeing the seams between facets of intention, and then work through what the author really meant for them to do.

But I am also lucky, because I know this code “works as intended”. That is, I happen to know that the authors run this code a lot, and they aren’t often finding bugs as such.

In some sense, from their perspective we can think of “every” part of this code as being “for” evolving solutions to simple_regression.clj problems. There isn’t “really” any idealized blob of intention in any particular part of this code. Why should I worry at all about how this codebase is factored, as long as it works?

Maybe I am rewriting the code as an act of criticism. To understand it better in a way I can personally use more effectively.

back to work

I think I understand that the :error-function in simple_regression.clj is intended to be an abstract function. It takes an individual (which contains a :program and an :errors collection… at least). But the problem specification also includes a number of other things:

(def argmap
{;;...:error-function here
:atom-generators (list (fn [] (lrand-int 10))
                       'in1
                       'integer_div
                       'integer_mult
                       'integer_add
                       'integer_sub)
:epigenetic-markers []
:parent-selection :tournament
:genetic-operator-probabilities {:alternation 0.5
                                 :uniform-mutation 0.5}
})

As I said yesterday, it’s only from experience with this package (and with writing a bunch of GP systems myself) that I understand that :atom-generators seems to be a list of things that can appear in a Push program. Things “appear” in Push programs when we’re building a random “guess” for the initial population, and also when we’re mutating one later on.

I see in that collection an abstract function (fn [] (lrand-int 10)) which poops out a random integer value when called. But I also see some quoted Clojure symbols, like 'in1 and 'integer_mult, and I will assume those aren’t going to be “called” in quite the same way the function is.

More importantly, I am sure that for my FizzBuzz problem definition, I will want my evolved code to include all kinds of other Push instructions, not just these four that do simple arithmetic. I want string ones, and boolean ones, and (because I’m intimately familiar with Push) probably exec and code and char and who knows what else. My tendency in these matters is to say, “Make me one with everything.”

But I don’t know how to do that. I guess I’ll have to look around and think some more.

Then, finally, there are the most problematic settings in simple_regression.clj. Because they feel important.

:epigenetic-markers []
:parent-selection :tournament
:genetic-operator-probabilities {:alternation 0.5
                                 :uniform-mutation 0.5}

These are the things that I (personally) know way more about than you do. And so it makes me sad that I literally don’t know what these mean in the context of Clojush. That is, I don’t have enough information here to understand their intent. All I can infer is something like

:epigenetic-markers wants to be present, but doesn’t need to be set?
or maybe :epigenetic-markers needs to be overridden to work here?
parent-selection can be :tournament, but I already know I want to not use the tournament selection algorithm (from experience); I want to use the lexicase selection algorithm, and I have no information here about the other options I could use;
:genetic-operator-probabilities seems to be a sort of ad hoc probability distribution, with half the time being :alternation (which is what?) and half being :uniform-mutation, which (again because of experience) I know means “changing random shit to other random shit”.

I don’t want most of those. I do want some other ones. How do I get more information?

I could read the docs… but they are not extant. There are docstrings on many functions in the Clojush codebase, but those don’t so much communicate situational knowledge about pragmatic uses of the functions—when to use what—as they do semantic and syntactic information about what types they take as arguments, and what they do to the state of the system when called. I know how to call them, but not why. And not why to use one and not the other.

Maybe I need another training case.

More information, not all useful

Last time, I mentioned in passing that I recognized the names of some subdirectories in the Clojush /problems directory from some of Tom Helmuth’s work. Specifically, I think I should look at the problems like replace_space_with_newline, which is Tom and Lee’s go-to problem for software synthesis in Push.

Here’s the top of that file:

;; replace_space_with_newline.clj
;; Tom Helmuth, thelmuth@cs.umass.edu
;;
;; Problem Source: iJava (http://ijava.cs.umass.edu/)
;;
;; Given a string input, print the string, replacing spaces with newlines.
;; The input string will not have tabs or newlines, but may have multiple spaces
;; in a row. It will have maximum length of 20 characters. Also, the program
;; should return the integer count of the non-whitespace characters.
;;
;; input stack has the input string

OK, so in this problem, we are using a single input that is a string, and looking to evolve a function that returns an integer (and “prints” something too). That seems a bit like what I want for FizzBuzz, if backwards. I want to give an integer and produce a string.

I know already that I want to use some of the newfangled search operators and stuff, and I know that these software synthesis problems are complicated enough that their :atom-generator definitions (what an awful name) will include most if not all of the Push language, and maybe some other interesting stuff.

Let me start by glancing at the :atom-generators definitions for this problem:

; Atom generators
(def replace-space-with-newline-atom-generators
  (concat (list
            \space
            \newline
            ;;; end constants
            (fn [] (lrand-nth (concat [\newline \tab] (map char (range 32 127))))) ;Visible character ERC
            (fn [] (replace-space-with-newline-input (lrand-int 21))) ;String ERC
            ;;; end ERCs
            (tag-instruction-erc [:exec :integer :boolean :string :char] 1000)
            (tagged-instruction-erc 1000)
            ;;; end tag ERCs
            'in1
            ;;; end input instructions
            )
          (registered-for-stacks [:integer :boolean :string :char :exec :print])))

This actually helps a lot. I also like the way it communicates the sub-structure of the work it does, compared with what we saw in the :atom-generator inline definition from simple_regression.clj.

This is still a Clojure list, which makes me feel like I’m understanding the desired pattern the code wants. And the list includes some literals, some Clojure symbols again (like 'in1), and also some pure functions that seem to produce… stuff. Literals, again.

Indeed, there are some comments in there, and they inform and confirm my suspicions. I see “;;; end constants” to indicate that \space and \newline characters are explicit Clojure char items. I can imagine that if I want to give explicit strings like "fizz" and "buzz" to my FizzBuzz program, I would put them in an analogous place here, as explicit strings.

Then we see two ERC functions, which produce what we call in the GP field “ephemeral random constants”. That is, these pepper random code with specific constants, but save the trouble of listing those here in the list explicitly:

(fn [] (lrand-nth (concat [\newline \tab] (map char (range 32 127))))) ;Visible character ERC
(fn [] (replace-space-with-newline-input (lrand-int 21))) ;String ERC
;;; end ERCs

The first produces one random (visible) ASCII character each time it’s called. The second produces…. Aha, the second calls a function, also defined in this file, called replace-space-with-newline-input, with an argument that is a random number of some sort. Not too sure about that one, but I think I get it. Cross-referencing this function, it’s clear what it does is produce “random strings” of a specified length. So our ERC function, here in :tom-generators, peppers the code with random strings (of ASCII characters) of length 20 or less. That is, when a piece of code is generated, and this item is invoked in the :atom-generators list, what will be inserted will be a random string.

That makes some sense, in context. I don’t think I’ll need that sort of thing for FizzBuzz. But it’s good to see the function extracted like that. I more quickly understood the intent.

Next we see

(tag-instruction-erc [:exec :integer :boolean :string :char] 1000)
(tagged-instruction-erc 1000)
;;; end tag ERCs

I don’t know what the means, honestly. There’s something about “tagging” going on here, but… is it a thing we need to know? It seems to be unique to the Clojush system, or perhaps to this problem’s representation, or something. I don’t actually use “tagging” in my own work, though I do use the Push language. I guess I’ll have to look it up to suss out the intention.

Or not.

Because I realize suddenly that the simple_regression.clj file didn’t have it. Therefore, it sounds like it’s an option of some sort which is included here for something not related to what simple_regression.clj was intended to show. And also therefore, I’ll assume it’s either for research goals I don’t share with the authors, or that maybe it’s intended to do something that a more complicated and interesting problem needs.

And I would feel more comfortable if the results of some code I were running would tell me that it wants tagging. It doesn’t feel as if I need to start with it. Otherwise… why wouldn’t it be turned on or set up by default?

Next in :atom-generators I see

'in1
;;; end input instructions

and yay! I know what that means. That will place our input “variable name” (which is in Clojush just another Push instruction) into the programs we generate. What good could an evolved program be, if it lacked any reference to the inputs arguments? We need this.

Finally, there’s this interesting thing

; Atom generators
(def replace-space-with-newline-atom-generators
  (concat (list
          ;;... the list we went over above
            )
          (registered-for-stacks [:integer :boolean :string :char :exec :print])))

The :atom-generators list isn’t just a series of explicitly defined constants and constant-generating functions, but rather it seems to be concatenated with the results of the registered-for-stacks function.

That’s not defined here in this file, but I do find it by searching the entire repository for "(defn register-for-stacks".

(defn registered-for-stacks
  "Takes a list of stacks and returns all instructions that have all
   of their stack requirements fulfilled. This won't include random instructions
   unless :random is in the types list. This won't include parenthesis-altering
   instructions unless :parentheses is in the types list."
  [types-list]
  (doseq [[instr instr-fn] @instruction-table]
    (assert (:stack-types (meta instr-fn)) (format "Instruction %s does not have :stack-types defined in metadata." (name instr))))
  (map first
       (filter (fn [[instr instr-fn]]
                 (and (:stack-types (meta instr-fn))
                      (clojure.set/subset? (set (:stack-types (meta instr-fn))) (set types-list))))
               @instruction-table)))

I confess I don’t know what “parenthesis-altering instructions” are, but in general this sounds as if this produces a list of all the instructions in the Push language that “use” some specified stacks. So the (registered-for-stacks [:integer :boolean :string :char :exec :print]) call we see in the problem file I’m working through must produce a list of Push instructions that are “involved in” integer math, boolean logic, string handling, character stuff, execution flow control and… “printing”.

“Printing” seems like what FizzBuzz is actually about. Maybe I should re-think what I want to use “printing” instead of just producing a string on top of a stack? I don’t know if it’s idiomatic one way or the other. I don’t even know what a “Push idiom” is supposed to be, since Push is a machine-written programming language, as a rule….

But I’ll gloss over that for now, and make a note to think about this later. I don’t actually quite understand what “printing” is, in this context, but I do understand how to look at the top of a stack for a value.

And that’s the end of the :atom-generators structure.

I think I’ve understood its parts well enough to cobble something together for my FizzBuzz problem. And the registered-for-stacks function specifically solves the problem I identified yesterday when I was looking at the short list of instructions explicitly defined in simple_regression.clj: With this, I can specify a bunch of stuff I want to include in my random code.

Now let me deal with :error-function in replace_space_with_newline.clj, and see if there’s anything I can learn.

Reading specifications

I scroll down the page, looking over quite a number of what look like helper functions (yay!). And there I find the :error-function for this problem:

(defn replace-space-with-newline-error-function
  "The error function for Replace Space With Newline. Takes an individual as input,
   and returns that individual with :errors and :behaviors set."
  ([individual]
   (replace-space-with-newline-error-function individual :train))
  ([individual data-cases] ;; data-cases should be :train or :test
   (let [cases (case data-cases
                 :train (first replace-space-with-newline-train-and-test-cases)
                 :test (second replace-space-with-newline-train-and-test-cases)
                 [])
         behaviors (replace-space-with-newline-evaluate-program-for-behaviors (:program individual)
                                                                 cases)
         errors (replace-space-with-newline-errors-from-behaviors behaviors cases)]
     (cond
       (= data-cases :train) (assoc individual :behaviors behaviors :errors errors)
       (= data-cases :test) (assoc individual :test-errors errors)))))

Well, first off there’s a bit more going on here than there was in :simple_regression.clj. Something about different runtime conditions, called :test and :train, which is certainly making me feel better about the validation and verification practices used for this problem.

And as in simple_regression.clj, there is an :errors collection being made…. But there’s also now a :behaviors collection. I don’t know what that is.

Apparently :behaviors captures something useful for an intermediate step. First we associated that collection with the individual by calling the function replace-space-with-newline-evaluate-program-for-behaviors, and then we pass that stuff into replace-space-with-newline-errors-from-behaviors as one of its arguments.

I suppose I need to follow the call chain.

Looking up at replace-space-with-newline-evaluate-program-for-behaviors, it does familiar work I learned from simple_regression. I see it doing the work of setting up the interpreter, and then running the program extracted from the individual, and then producing an ordered tuple of two values: something from the :output stack, and also something from the :integer stack.

Maybe the :output stack is what “printing” uses. At least I suspect this to be the case.

Actually, this function and its pairs of values makes a lot of sense.

I am reminded that replace_space_with_newline is described in its docstring as looking for Push algorithms that will (1) “print the string, replacing spaces with newlines”, and also (2) “return the integer count of the non-whitespace characters”. So there are, by definition, two things we want these programs to do. This function is just salient collecting what “actually happened” when the program was run, so it can calculate both of those two scores.

Then, following the call chain onwards to where the :errors collection is built, the replace-space-with-newline-errors-from-behaviors function is measuring the levenshtein-distance between the desired string and the one produced by the program.

That’s good! That’s what I want to do, too, because I’ll be comparing the string I find on the stack with the desired "fizz" or "buzz" or whatever.

And, as I suspected, because there are two goals for this problem, this function is calculating two error scores for each case. One measures a “numeric” error, of the same kind we saw in simple_regression.clj: the absolute distance between the target and output numbers, or 1000 if there’s nothing produced at all. The other measures the “string” error, between the “printed” output string and what we want.

I think I’m getting it. This is also making it easier to understand the patterns I started to sense from simple_regression.clj.

Now, based on what was being done in simple_regression.clj—and on the assumption that Clojush works more or less consistently in the broad sense—I think all of this is in service of setting up the map called argmap. I should look at the argmap in this problem, too.

Hmm….

There seems to be a lot of extra stuff I didn’t see in simple_regression.clj:

; Define the argmap
(def argmap
  {:error-function replace-space-with-newline-error-function
   :atom-generators replace-space-with-newline-atom-generators
   :max-points 3200
   :max-genome-size-in-initial-program 400
   :evalpush-limit 1600
   :population-size 1000
   :max-generations 300
   :parent-selection :lexicase
   :genetic-operator-probabilities {:alternation 0.2
                                    :uniform-mutation 0.2
                                    :uniform-close-mutation 0.1
                                    [:alternation :uniform-mutation] 0.5
                                    }
   :alternation-rate 0.01
   :alignment-deviation 10
   :uniform-mutation-rate 0.01
   :problem-specific-report replace-space-with-newline-report
   :problem-specific-initial-report replace-space-with-newline-initial-report
   :report-simplifications 0
   :final-report-simplifications 5000
   :max-error 5000
   })

Good news first. I see the line :parent-selection :lexicase, which as I’ve mentioned before is the selection method I want to use for my FizzBuzz problem.¹

But then there’s an awful lot of text devoted to :genetic-operator-probabilities, and subsequent mentions of stuff like :alternation-rate 0.01. Apparently I need to (1) do something with the probabilities in that hash, but also (2) set some parameters regarding what happens if one of those things is chosen at random?

And there are some other things floating around here that I can’t parse. What’s :max-error? Doesn’t the calculation of errors in the functions themselves handle this? Why is it being set to that number in particular?

I see that :problem-specific-report is calling a function defined above in the same file, which puts even more stuff out to STDOUT. But that didn’t exist in simple_regression.clj, so I suppose I don’t need that.

Then there is the obvious intentionality hinted at by the carefully-numbered probability distribution in :genetic-operator-probabilities. Both of two things possible, with equal probabilities. Is the fact of the equal probability important, or is this demonstrating that there’s a reasonable prior assumption that “well, do them all about the same amount”? And the fact that there are these two operators listed, and no others: does that mean that these are sufficient or necessary? What would I do if I wanted to add more? What if I wanted to use fewer? What other ones could I add? Are the ones that aren’t mentioned (because I remember looking at the preliminary list of settings in STDOUT printed when I ran simple_regression.clj) still active, because I believe the default values I saw there were 0.01 not 0.0.

Oh no. Then there are more mentions of the same functions, below, outside the probability distribution itself. Are the specific decimal values associated with :alternation-rate 0.01 and :alignment-deviation 10 and :uniform-mutation-rate 0.01 telling me….

Nope. Stop it, Tozier. I don’t know what I’m doing, here.

What values should I use? What are these settings for? The words are familiar (because I do this for a living), but the particulars of having things appear in two places, and possibly having invisible default values, and also having apparent specific values here…. It feels like work in progress. This part (especially) feels like fossils of ideas that haven’t yet been communicated well enough to their authors, let alone to me.

It’s all very confusing. There seem to be a lot of options, and I don’t know what a lot of them do. I feel like I know better than to poke around in stuff I don’t understand—especially if it’s not covered by automated tests—but I don’t want to bother the authors asking them to fix their code just for me.

One is tempted to just cut-and-paste some stuff from what’s here, and see what happens. Thank goodness I am so experienced and skilled, and that I know better than to do that.

In which we cut-and-paste some stuff and see what happens

NARRATOR (V.O.)

Dear reader, now we draw a discreet veil over what happened in the next few hours. He looked at other problem settings files. He tried to read the documentation. He tried to build a problem specification for FizzBuzz from scratch. He walked around the block. Then he tried to build one by making incremental changes in another file he’d found.

In the end, he just cut-and-paste some stuff. And that worked.

Boy, that was something, huh? I didn’t expect the pit filled with lava-snakes at all.

But here’s what finally worked, with all of its parts shown together:

;; fizz_buzz.clj
;; the silly interview question, because why not?
;; Bill Tozier (Feb 9, 2018)

(ns clojush.problems.software.fizz-buzz
  (:use clojush.pushgp.pushgp
        [clojush pushstate interpreter random util globals]
        clojush.instructions.tag
        clojure.math.numeric-tower
        ))

;;;;;;;;;;;;
;; Given an :integer input argument, produce the :string "fizz" if the input is
;; divisible by 3, and (:string) "buzz" if it is divisible by 5. In cases
;; where it is divisible by both, produce "fizzbuzz". Otherwise produce
;; an empty string.

(def fizz-buzz-atom-generators
  "Collection of items which will be used to construct random programs: the strings 'fizz' and 'buzz' are given, since we don't really want to evolve them from scratch (though we could). Also some integer constants, boolean constants, the input and various standard Push instructions."
  (concat (list
            "fizz"
            "buzz"
            ;;; end constants
            (fn [] (- (lrand-int 21) 10)) ;Integer ERC [-10,10]
            (fn [] (- (lrand-int 201) 100)) ;Integer ERC [-100,100]
            (fn [] (lrand-nth (list true false)))
            ;;; end ERCs
            'in1
            ;;; end input instructions
            )
          (registered-for-stacks [:integer :boolean :string :exec])
          ))

(defn fizz
  "given a number, return 'fizz' if the number's divisible by 3, or an empty string otherwise"
  [numerator]
  (if (zero? (mod numerator 3))
    "fizz"
    ""))

(defn buzz
  "given a number, return 'buzz' if the number's divisible by 5, or an empty string otherwise"
  [numerator]
  (if (zero? (mod numerator 5))
    "buzz"
    ""))


(defn fizz-buzz-string
  "given a number, return a single string concatenating its fizz & buzz strings"
  [n]
  (str (fizz n) (buzz n)))


; Helper function for error function
(defn fizz-buzz-training-cases
  "Takes a the integers in the range [1,limit], and gives IO test cases of the form `[input output]`."
  [limit]
  (let [simple (range 1 50)]
    (map #(vector % (fizz-buzz-string %)) simple)
    ))


(defn fizz-buzz-error
  "Based on some undocumented stuff I found in another problem, and I have no real idea why things are named these things or what said things do. Except `errors`, which uses the Levenshtein distance between the expected string, and the string obtained from the top of the stack at the end of the run."
  [individual cases]
  (let [behaviors (vec (for [input (map first cases)]
                          (->> (make-push-state)
                               (push-item input :input)
                               (run-push (:program individual))
                               (top-item :string)
                               )))
        errors (mapv (fn [behavior case]
                        (if (string? behavior)
                          (levenshtein-distance behavior (second case))
                          1000
                          ))
                      behaviors
                      cases
                      )]
    (assoc individual
      :behaviors behaviors
      :errors errors
      )))


(def argmap
  {:error-function (fn [individual]
                      (fizz-buzz-error individual
                        (fizz-buzz-training-cases 50)
                        ))
   :atom-generators fizz-buzz-atom-generators
   :max-points 1000
   :max-genome-size-in-initial-program 1000
   :evalpush-limit 1000
   :population-size 1000
   :max-generations 1000
   :parent-selection :lexicase
   :genetic-operator-probabilities {:uniform-addition-and-deletion 0.5
                                    [:alternation :uniform-mutation] 0.3
                                    :two-point-crossover 0.2
                                    }
   :uniform-mutation-rate 0.1
   :uniform-addition-rate 0.2
   :uniform-deletion-rate 0.1
   :report-simplifications 0
   :final-report-simplifications 5000
   })

Let me review what’s happening in this specification file.

The list fizz-buzz-atom-generators includes the string literals "fizz" and "buzz" (because I don’t really want to force them to be evolved from raw characters); individual random integers in the range \([-10,10]\), and also (duplicating some) \([-100,100]\); the boolean constants true and false; the input argument 'in1; and every Clojush-flavored Push instruction we get from (registered-for-stacks [:integer :boolean :string :exec]).

The training cases set an integer in the range \([1,50]\) as a single input, and expect the top of the :string stack to be "" (the empty string), "fizz" if the input is divisible only by 3, "buzz" if the input is divisible only by 5, or "fizzbuzz" if it’s divisible by both 3 and 5. I decided not to go with the :output stack that Tom used in replace_space_with_newline.clj, mainly because I don’t quite know yet what it means.

I spent some time trying various settings for :max-points and :max-genome-size-in-initial-program and so on, but really I just picked numbers that… “sounded about right”. I mean, there are examples of various problem configurations that use these values, but I don’t really know why.

As I said before, I can into this knowing that I wanted to use :parent-selection :lexicase and not :parent-selection :tournament, so I just… changed it, and hoped it worked? You don’t need to know why, for now. I’ll explain another day.

That final business about :genetic-operator-probabilities, :uniform-mutation-rate, :uniform-addition-rate and :uniform-deletion-rate are just guesses. They just seem to be “enough like” numbers I see in other problem definitions.

I saved this file as Clojush/src/problems/software/fizz-buzz in a git branch of the original Clojush repository.

Now it’s time to “push the Big Green Button”, by which I mean I type this incantation, based directly what Lee said I should type to run simple_regression.clj:

lein run clojush.problems.demos.fizz-buzz

And stuff starts happening.

What happened then…

As with simple_regression.clj, a huge spew of text feedback gets printed to STDOUT. A lot of it is about the setup of various aspects of the Clojush system, and it seems to include a bunch of default values that aren’t mentioned in my configuration file. But I do see that the configured values I passed in are present and accounted for against the background of other stuff, so that’s encouraging.

Then we start evolving code.

A lot happens, and it takes a while. On the order of one or two hours, on my laptop, with the fan running like crazy and the memory use pegged to “max”. Unlike simple_regression.clj, where the task was literally a “toy” problem involving only a half-dozen instructions, a few training cases, and a relatively small population size, here I’ve told it to build 1000 random programs from dozens (hundreds?) of Push instructions. Each of those Push programs can be several hundred items long. And then the system needs to initialize and run every one of those, for each of 50 training cases.

That’s just what happens in just the first generation. Clojush builds the random genomes, which are translated into Push programs, and then those are run (concurrently, thank goodness) using my 50 input:output pairs as training cases. The error (which is the Levenshtein distance between the observed string and the desired string, or a “penalty” score of 1000 if no string is produced) is measured for each.

Once all that work is done, lexicase selection (and some other search operators) are used to pick which members of that first generation will be used to produce the members of the second generation. I won’t go into the details here—because there are an awful lot of them—but from the 1000 individuals in the population, 2000 “parents” (or so) are chosen (with replacement, of course) to breed the 1000 new members of the next generation. “Breeding” means chopping up the parents’ genomes and recombining them to form new programs. Oh, and also we fuck around with them via mutation, and we insert new code and delete some of the old code.

All of this is happening at random, in the sense of being a stochastic process. There is no gradient descent, it’s just that programs with detectably better performance are selected more often to be used as “parents” than those with worse (higher) errors.

And in each generation, the 1000 “children” are essentially brand new programs. So we have to start the process again. Every one of those is run against all 50 training cases, and when the scores are collected we do the same thing again to get the grand-children of the starting guesses. And so on.

It really is a miracle it works.

And yet it does.

As time progresses, I watch the info-dump in STDOUT and see the error measures being reported are dropping over time. More or less, of course, because this is a stochastic search. There is no guarantee that these operations I’ve configured it to use will produce programs with strictly “better” behavior.

Maybe by coincidence, my problem definition file works the first time. After something on the order of 4,350,000 program evaluations, a winner is discovered which produces the correct string for every training case.

It’s a pretty big Push program, but it gets auto-simplified into this one here:

(in1
  exec_do*while
  (exec_do*count
    (exec_do*count
      (in1
        "buzz"
        exec_yankdup)
      integer_div
      exec_when
      "fizz"
      string_take
      integer_yankdup
      boolean_frominteger
      string_take
      string_dup
      integer_stackdepth
      exec_yankdup
      exec_y
      integer_sub
      string_take
      exec_do*times
      (integer_lte
        exec_while
        ("fizz"
          "buzz"
          string_concat
          exec_flush)))))

How it works

The astute reader may notice this Push program isn’t the same one I reported last time, when I showed a program I “quite liked”. That’s because I don’t “quite like” this one.

It’s interesting, and absolutely informative as an object lesson. But I think you’ll understand shortly why I don’t “like” it at all, as a solution to the FizzBuzz problem. I won’t even walk through it with you, except to show you a little snippet where it gets pathological.

First, some necessary background.

There are a few fundamental things you need to understand about the Push language, and how the Clojush interpreter runs it.

All Push code (for our purposes, here) is composed of literals (numbers, strings, booleans, etc), variable names, instructions, or code blocks (lists) of those things. By convention, when an interpreter is set up to run a program, the program is pushed (as a code block) onto the :exec stack.

In each step of the interpreter, the top item is popped off of :exec and examined.

If the item is a literal, then it is pushed onto the stack of the appropriate type. An integer literal like 9 or -818281 will be pushed onto the :integer stack, a string literal like "fizz" or "bar\ttab" will be pushed onto the :string stack, and so on.

If the item is a variable name (like 'in1, in these Clojush examples), then the value bound to it is pushed onto the :exec stack. So for instance if we set 'in1 to have the bound value 17, whenever the interpreter encounters 'in1 it will produce 17 (and then next time, 17 will go to :integer because it’s a literal).

If the item is a code block, then the items inside it are pushed back onto the :exec stack. This may seem like a useless step, but it’s not. As you’ll see below, some instructions are defined to use an “item” from :exec stack as their arguments, and manipulating whole blocks of code (for instance, for iteration) is very useful. So we want code blocks in our Push code so instructions can grab them as a group, not so they can “get unwrapped” in the course of execution.

Finally, if the item is an instruction—and if that instruction’s necessary arguments are present on the stacks—then it is immediately executed. Most instructions pop their arguments off the appropriate stacks, and then push the result(s) of the function they implement onto the appropriate stacks. To revisit my example from last time: the 'integer_add instruction is defined so that it immediately pops the top two :integer items from the :integer stack (if there are at least two there), adds them, and then pushes their sum back onto :integer.

We run a Push program until the :exec stack is empty, or until a step limit is reached, whichever happens first.

You may wonder why we need a step limit. It’s because there are instructions in the language that are defined in such a way that they can put more stuff onto the :exec stack than they took off. For instance, this is how iteration happens: An iteration instruction pops its argument(s) off the :exec stack, and a counter from the :integer stack, and then pushes those arguments back along with a new copy of the iteration instruction onto the :exec stack again. There’s no guarantee at all that the :exec stack will be emptied at all.

It’s the Halting Problem. So Push interpreters, as a rule, use a step counter to make sure all programs eventually “halt”, even if there is unexecuted code remaining at the deadline.

That should be enough to understand what’s about to happen. At least in terms of Push’s quirks.

Now, glance up at that “not happy” program I evolved, above. You will see several instructions that start with "exec…". As I mentioned, these tend to pop their arguments off of the :exec stack, and also push their results onto :exec as well. For example, 'exec_do*while and 'exec_do*count are both iteration instructions of that sort. As you look over the code (see how I’ve indented it), you’ll notice the first few items include the sequence "… exec_do*while (exec_do*count (exec_do*count …".

Nested loops! We’ll come back to that. At least we know it works… right?

If you’re from a Computer Science or functional programming background, the other instruction in the listing that may catch your eye is 'exec_y. That’s the Y Combinator. Not the one that indoctrinates smart young people into a terrible culture of greed and waste, but rather the abstract algorithmic combinator function that boils down to “do this thing forever”.

That should worry you.

I said I wouldn’t bore you with the details, so I won’t. But I want to focus on my experience of looking over at this program, because it’s been very enlightening.

Yes, this program does in fact work. For every one of the argument numbers from 1 to 50, it runs and finishes and produces the correct :string answer on top of the :string stack, right where I told the :error-function to look.

Because it’s the first Push program I evolved to solve FizzBuzz, I planned on walking through it in detail with you. Of course this was when I looked at what the interpreter was doing in the course of the algorithm.

That’s when I stopped “liking” it.

Here’s what the interpreter state looks like after a few dozen steps, with input of 7:

:exec = (exec_y exec_y (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) (exec_y (exec_y exec_y)) integer_sub string_take exec_do*times (integer_lte exec_while ("fizz" "buzz" string_concat exec_flush)) (1 6 exec_do*range (exec_do*count (in1 "buzz" exec_yankdup) integer_div exec_when "fizz" string_take integer_yankdup boolean_frominteger string_take string_dup integer_stackdepth exec_yankdup exec_y integer_sub string_take exec_do*times (integer_lte exec_while ("fizz" "buzz" string_concat exec_flush)))) exec_while (exec_do*count (exec_do*count (in1 "buzz" exec_yankdup) integer_div exec_when "fizz" string_take integer_yankdup boolean_frominteger string_take string_dup integer_stackdepth exec_yankdup exec_y integer_sub string_take exec_do*times (integer_lte exec_while ("fizz" "buzz" string_concat exec_flush)))))
:string = ("" "" "buzz" "buzz")
:input = (7)

You’ll notice that by this point, the :exec stack is full of a lot of copies of the code block '(exec_y (exec_y exec_y)). The right answer string is on top of the :string stack (the empty string, "", because 7 is not divisible by either 3 or 5). But there’s no sign this expanding wall of '(exec_y (exec_y exec_y)) is ever going to go away.

And it doesn’t. It grows, forever. Running this code for a thousand steps (as specified in my problem config file, using :evalpush-limit 1000) produces a huge clump of junk.

But even though it builds a clump of junk, it doesn’t ever change the :string stack from here on. And so, because the answer is present, every time this program is run, it will first get the right answer, and then it will bide its time running out the clock with this infinite loop.

In some sense, it just so happens that the correct answer is left sitting on the :string stack for that whole time it’s futzing around with 'exec_y. By every definition I can think of, this program “works”. For every argument I tested (the numbers 1 through 50), it puts the correct string on top of the :string stack. And then it makes a mess, but the mess isn’t something I said I care about.

So what’s my beef? You should be wondering why this is even an issue. The correct string is present on top of the stack for all the input cases, right? The error values are all zero?

My attention was drawn by that pile of junk accumulating in the trace. But when I had finally worked out where it had come from, I also noticed that the algorithm that evolved (the stuff that happens before the infinite loop kicks in) couldn’t “get the right string” before 1000 interpreter steps. At least not for larger arguments.

That business at the front of the program, before 'exec_y takes over, includes three nested loop-like things. Seeing the infinite loop, I began to worry: How long would three nested loops take for bigger arguments?

So I tried a larger argument. 4545 is a nice, moderate-sized number that is definitely divisible by 15, so I tried that. And I was right: Those three nested loops can’t manage to produce the correct string ("fizzbuzz") on top of the :string stack before a thousand steps have been taken, when you give it 4545. I think it may take more like 98 million steps, actually….

But nonetheless, this program “works”. It’s just that I didn’t ask it to “do” 4545 when I specified the problem.

So that’s interesting

Take a moment here and consider what just happened to me.

I did a lot of the work I needed to do, in order to understand enough of what Clojush wants. Just enough to set up my FizzBuzz problem file.

When I felt I had mastered the prerequisite structures, I did what any normal programmer would do. I picked the obvious test cases: the numbers from 1 to 50. OK, maybe you’d go up to 100. Whatever.

And then this is a testament to how good Clojush is: It worked the first damned time I ran it. That is to say, after a while I obtained a Push program that solves every one of the training cases I provided, with perfect accuracy.

There is no approximation here. This algorithm does in fact put the correct FizzBuzz string on top of the :string stack, within 1000 steps, for every training case I provided.

But then.

In order to explain it—to me, and also to you—I looked at how this evolved program was working. I examined a detailed step-by-step trace.

Something caught my eye.

That pile of exec_y, infinitely accumulating on the :exec stack, was a red flag for me. It smelled like a program stuck in an infinite loop. It was doing something extra. Maybe I shouldn’t mind. But it got my attention.

“End up stuck in an infinite loop” has no direct bearing on the demanded “quality” of the program, at least not as I’d initially specified. The fact that the algorithm hares off and produces an vast pile of useless exec_y instructions after it gets the right answer—instead of putting its pencil down and sitting quietly at its desk until the rest of the students have finished? That doesn’t signify, does it?

And yet. And yet.

It was enough to make me wonder: What would this program do with 4545 as an argument?

So I tried it. And it doesn’t work at all.

It doesn’t “fail” because the algorithm is intrinsically flawed, in itself. If I had waited a long, long time, this program would in fact end up with "fizzbuzz" on top of the :string stack, given an input of 4545. But there would have been a huge number of steps before that—I’m guessing on the order of 98 billion, but I’m not going to run it to see. I traced it for 1000 steps, and it was nowhere near “done” yet. At that point, the :string stack had 124 copies of "buzz" on it. And "buzz" is not the correct answer, for 4545.

To be honest, the limit of 1000 steps wasn’t even something I realized I’d set as an endpoint. I just copied and pasted some code from some other config file, and so I ended up with an extra “semi-implicit” criterion in my configuration file: ":evalpush-limit 1000". Until now, I didn’t even notice that line was in there.

By every explicit criterion, this program works. It was only luck, and the force of human habit, that I had provided training cases where this limit permits the algorithm to work within the prescribed deadline. The imperative :evalpush-limit 1000 doesn’t mean “do it as fast as possible”, after all; it just means “do it before this (generous) clock runs out”.

This is what gets me: There was no valid reason for me to suspect this particular program was out of control. It worked perfectly for every single case. It was only when the trace exhibited a completely-unrelated behavior—that infinite loop—that even bothered to glance at the nested loops that actually do the work.

I just didn’t like the way it smelled.

NARRATOR (V.O.)

This will turn out to be another reason for him to talk about “The Mangle of Practice”. He just doesn’t want to say that out loud. But you know he’s going there. He always does. If not here, then next time.

Working through some evolved programs

Now let me walk through the program I “liked”.

To evolve this one, I made a small change in my configuration file. Instead of using the integers \((1 \leq n \leq 50)\) for my training cases, I used the numbers \((21981 \leq n \leq 21981+50)\).

Why \(21981\)? No reason, except it’s fucking huge. That’s me saying to Clojush (or maybe to Evolution, more generally), “Hey by the way don’t fuck around trying to count on your fingers this time.”

Now not every run of Clojush, even with my slightly revised configuration file, produces a valid working Push FizzBuzz program. I see a success in about half of them. So I’m going to skip over the times it didn’t discover a perfect solution to the FizzBuzz problem, and I’ll skip some of the longer, not-especially-interesting ones too. Maybe later I will do some classification on how these evolved FizzBuzz programs “approach” the task. Some of them wrote "fizz" and then changed the characters to "buzz" when needed. Others worked out that you should say "" most often, and sometimes "fizz", and more rarely "buzz", and very rarely "fizzbuzz", and did some fancy footwork with conditional code. Stuff like that.

But stop and hear what I said: I am seeing about 50% success, give or take, with just typing in my training cases and pushing “go”. That is, when you think about it, pretty damned impressive.

Good job, Clojush.

Here’s that evolved FizzBuzz program that I do like. I call it “The Card Trick”:

(exec_stackdepth
  integer_mod
  exec_s
  string_swap
  in1
  integer_mult
  exec_stackdepth
  integer_mod
  "fizz"
  exec_do*range
  (string_parse_to_chars
    string_rest
    "fizz"
    string_shove)
  in1
  boolean_xor
  exec_swap
  (string_dup
    "buzz"
    string_concat
    exec_stackdepth
    integer_mod
    string_empty
    boolean_dup
    string_shove
    boolean_and)
  false)

Let me work through it, and show you why I came to call it by that name.

First, I traced the execution of the program with an argument of 15. Along the way, I had also discovered a very handy built-in Clojush capability for generating a stepwise trace of a Push program. In the listings below, I’ve edited those steps a bit, to remove extra junk and stacks we don’t use. I’ll also insert comments into the trace to explain what’s happening in more detail than the automated function provides.

Here’s the trace with input 15:

Even in the first “real” step of the program, there’s a very interesting thing happening—something I’ve seen crop up in quite a few other Push programs through the years. The instruction exec_stackdepth appears. That counts how many items are on the :exec stack at that moment, including code block lists as one item each. There are… aha, there are 15 items there, at this point.

Isn’t that interesting? That number is remarkably salient, for FizzBuzz.

There are a also number of other instructions in this program, including the next integer_mod, which fail when called because there are missing arguments. But—also interesting, and part of that “design pattern” that exec_stackdepth depends on—these instructions can’t be deleted from the program without breaking it. The first salient step of the program counts how many items there are in a stack, and then the subsequent behavior of the program depends on that number being exactly 15.

In other words, this code stores a constant it needs for core calculations, in a distributed way. The program has to be 15 steps long. Or at least, it can’t be 14 steps long, at that step of execution, and still work in the same way.

There are a number of other things that happen, after this constant is produced. One interesting one is that the input value (also 15) is multiplied by the “stored constant” 15. Then, later an integer_mod instruction calculates the remainder when that product (225) when 9—another “secret” constant—is divided into it. I have a suspicion that this will be important somehow in figuring out which string to leave on top of the :string stack.

The rest of the program is, as we say, non-obvious. I certainly don’t quite understand what it’s doing, just from this single trace. At one point, it takes apart the string "fizz" into its component characters. Then it erases one of those characters, producing an empty string. And also it concatenates "fizz" and "buzz" to produce the correct string result for our input of 15.

What I don’t understand, yet, is how it avoids doing that for other inputs like 1 or 3. Then again, maybe it doesn’t “avoid” making "fizzbuzz", but rather does… something else, too?

Let’s look and see.

Here’s what happens when the input is 14 instead of 15:

As I annotate this trace, I start to understand more of how the algorithm is working for the case of 15 as well.

We start, as before, by counting how many items are on the :exec stack. It’s still 15, of course; this is the same program after all.

The first place we diverge noticeably is when we multiply that constant by the argument to get 210 instead of 225. This value is divided by the second “implicit constant” 9 a couple of steps later to produce a remainder of 3 instead of the remainder 0 we got last time.

Also as before, we then “disassemble” the string "fizz" and push a fresh copy on top, to produce a :string stack that looks like this:

:string = ("fizz" "" "i" "z" "z")

Then the real divergence happens: The algorithm uses the 'string_shove instruction to move the top item on the :string stack to a new position specified by the top item on the :integer stack. Here, the top :integer is the value 3, not 0 as it was when we worked with input 15. When we have input 14, therefore, the algorithm shoves the top "fizz" down into the stack, leaving the empty string on top. When we have input 15, which leaves a remainder of 0, the algorithm “shoves” the top string to position 0, which is still the top of the stack; in this case, "fizz" remains on top.

It’s as if a stack of cards was dealt, in the order ("fizz" "" "i" "z" "z"), and then depending on the input value the top card was cut down into the deck. Like the setup for a card trick.

Farther along, in both cases we then put the string literal "buzz" on top of the :string stack. And concatenate that string with the item beneath it. In Push, this means we prepend the empty string "" to the "buzz", to get… "buzz" again. But if we had input 15, we would prepend "fizz" to "buzz", producing "fizzbuzz".

Note also that by concatenating “fizz” and "buzz", we consume the string "buzz". That’s important, too, as you’ll see.

In the example I worked through last time, of simple_regression solutions doing arithmetic, you may have noticed something a bit quirky about the order of arguments in some Push instructions. Specifically, for order-dependent binary operations like subtraction or division, the convention in Push languages is for the second item on the stack to be the “first” argument, and the top item on the stack to be the “second” argument. This is due to the way Push programs are processed, by first being pushed onto a stack.

Think of it this way: We are comfortable reading a program (- 3 7) (as in Lisp) as “three minus seven”. But in Push, the instruction comes afterwards, not before, so we might more appropriately write (3 7 -) to mean “three minus seven”. In Push languages, we first push the arguments onto the appropriate stack (during execution), and only then apply the instruction’s behavior to them. I know I always get confused here, at least when I start writing a Push interpreter from scratch. But it’s just the way it is. Subtraction, division, concatenation, things like that take their last arguments first—from the top of the stack—and their first argument last from the stack.

Anyway, notice what happens here. Because of how the “deck of cards” in the :string stack has been so… “cunningly?” arranged, in the case of an input wanting "fizzbuzz" as its output, we have "buzz" followed by "fizz" here. Here, with an input of 14, we have "buzz" followed by "", so we end up with "buzz"…. but that’s not the output for this input. Dang.

Aha! But wait!

Next, the program goes through a third process of extracting a “magic constant”, in this case the number 5. Can you guess what we’re going to do with it?

That’s right! We take (mod 14 5) for this input value, to produce a result of 4. For an input value of 15, (mod 15 5) is 0.

And then, in a couple of steps, we shove the top :string item “down to” the specified position in its stack.

In the case with 14 as input, we take "buzz" and shove it down to position 4 in the stack. In the case with 15 as input, we take "fizzbuzz" and shove it down… to position 0, leaving it where it started.

In the end, the :string stack for an input of 14 looks like this:

:string = ("" "i" "z" "fizz" "buzz" "z")

and the same stack, for an input of 15, looks like this:

:string = ("fizzbuzz" "fizz" "" "i" "z" "z")

You may be wondering why the stacks look so different from one another. Why is there no "buzz" in the second stack at all? Ah, remember: when we constructed the string "fizzbuzz", we consumed the substring "buzz" from the stack.

You can imagine, at least in a vague way, what happens for the case where the argument value is divisible by 3 and not 5, or vice versa: The magic constants work out so that we end up with remainders of 0 at the right time to leave "fizz" or "buzz" on top of the :string stack, instead of shoving it away. And we end up concatenating empty strings to those, perhaps.

If it were a card trick, I think you’d be able to master it with a little practice. It’s clever, in a way that is simultaneously both admirable and suspicious.

What does a program do?

These last two pieces have been two long, wordy, subjective stories. But they are parts of the same story, really.

In the first, I spent my time looking at the Clojush codebase, in an effort to try to understand what it was doing—what the intention of the authors was at each point in one of the example problem files. The pragmatics of that code didn’t feel to me as if it communicated to me what I was “supposed to do”, so I spent some time walking through what “felt like” an important part (the :error-function) trying to figure out what it “was doing”.

But notice: My expressed frustration is not a kind of complaint of the code itself, or of the programming or design skills of its authors. Knowing (as I do) that the code is used intensely in the research projects of friends and colleagues, it’s already be clear to me that it works as intended. It does what was asked. I had no reason to expect it to be buggy.

Maybe I am complaining about having to learn something, when I didn’t expect to.

And indeed: When I imagined I had discovered a “bug” in the auto-simplification code, because the results didn’t match what I had expected, it turned out there was no bug.

Not in the code. I had simply missed the fact—it was right there in the problem specification I was reading—that the interpreter setup pushed two copies of the input argument (one to :input and one to :integer), not just one as I had assumed. When I worked through the resulting evolved Push code by hand, on the wrong assumption that there was only one copy of the argument on the stacks, of course the result I got was different.

What is interesting is that for the unsimplified evolved program, it didn’t matter that I had left out the second copy of that argument value. For the unsimplified program, things just “worked out” to give me the right answer anyway. For the simplified version of the same program, they did not work out. Here, the second copy of the argument was necessary to get the right answer in the right place.

I didn’t “ask for” that behavior. But the specification file simple_regression.clj that I was running did, in fact, explicitly provide two copies of the argument. I just hadn’t noticed.

At the moment, Wikipedia starts off its definition by saying “A software bug is an error, flaw, failure or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or to behave in unintended ways.” This was certainly unexpected. Therefore, there was a bug—in the sense that the results I obtained were not what I expected.

So where was the bug?

When I explain the Push language to people in classes or pairing sessions, and we start to write interpreters, we almost always start with arithmetic. There are historical reasons for this; people still think of Symbolic Regression when they think about genetic programming, and so they want to “do math”. But for me, a pedagogue trying to communicate something important to them about how Push code in particular works, starting with arithmetic means I can visit a landmark where we learn something important about Push “syntax”.

We start, always, by implementing addition and multiplication. So we can write and run simple Push programs like (7 3 +) and expect 10 to be the “result”, or (1 2 3 4 5 + * + * + + +) and expect 29 to be the result. That’s always how it goes. By writing the interpreter code that pushes and pops items from stacks, and which does arithmetic enough to implement these instructions, we learn about pushing and popping arguments. And we get a reasonable understanding of what to do when Push arguments are missing.

I make sure we always do (4 9 -) next.

Everybody inevitably gets this wrong. I get it wrong myself, half the time. As I said above, the result here is \((9-4)\), not \((4-9)\).

But people always imagine this is a bug. Because in their heads (and often in the code they’ve already written) they have implemented addition and multiplication as though they were popping the top stack item as the first argument, and then adding to it the second item. That’s the obvious way to do it, after all.

But of course, it’s wrong. Not in itself. It’s fine in itself, because addition and multiplication happen to both be commutative operations. But it’s also a bug, once we enter the realm of abstract binary operations of the sort \((a \odot b)\), where “\(\odot\)” might be addition or multiplication (which are commutative), or it might be modulo or string concatenation (which are not commutative).

Where is the bug? Their code works fine for the first two cases we did. But it doesn’t generalize. Well, OK, in one way, it does generalize: they can take any example, use any of the possible arguments for addition and multiplication, and it still produce the right sum or product. But the same structure of code will not work for non-commutative operations.

There was no bug, until I told them that I wanted subtraction to work this way. That information wasn’t salient for writing addition and multiplication functions. But every damned time, with near 100% certainty, we need to suddenly pay new attention to what we thought we understood when we wrote addition and subtraction.

Should I tell them they should never write addition and multiplication algorithms “wrong”, from the start? Should I say, when we are starting, “Now, you must always remember to reverse the order of arguments when you add them in Push!” What a ridiculous thing that would be. Who would even say that? What possible reason would there be for me to demand such a ridiculous thing as to “always reverse the arguments of addition and multiplication”? It would be a red flag for sure. They would suspect I had a bug—in my head.

And so every time we write a Push interpreter together, I know before we start that we will be confused as soon as we get to subtraction and concatenation. Every time.

In this second part of my account of evolving FizzBuzz programs in Clojush, I focused on my attempts to understand how to build a problem specification file. By looking at examples, I had learned enough to understand (more or less) how to describe my :atom-generators list. And based on a few more examples, I worked out what seems to be the minimum amount of structure needed for my :error-function, comparing strings with their targets using the Levenshtein distance.

But even now, I’m not sure what the various search operators implemented in Clojush are doing. As somebody who’s worked for three decades with evolutionary algorithms, I know what the words mean, or at least what they imply. I just don’t know why I would pick one and not another. Or why I would use these specific proportions, or these particular settings.

Again, this shouldn’t be misread as a complaint of Clojush itself, or even of its documentation. There is no “right answer”. Nobody “knows” which search operators to use in metaheuristics like evolutionary algorithms. That’s one of the things that differentiates Machine Learning from its Cold War era ancestor Mathematical Programming: there is no optimal approach.

You have to try and see.

In the end, I was able to copy and paste what seemed like “reasonable” things into my specification. I suspect that if I had changed some numerical values to “wrong ones”, I might have made the program crash as it was setting things up. For example, I noticed that the :genetic-operator-probabilities hash has values that add to 1.0, so I was able to infer (correctly) that it’s some kind of discrete probability density definition. So when I made my own :genetic-operator-probabilities hash, I was careful to always make them add up to 1.0 as well.

I just checked (for the first time), and if they don’t add up right, the program does indeed throw an exception. But the exception’s error message is very well worded, and points out what caused the problem.

Except, of course, it doesn’t explain “why”, not in every Aristotelian sense of cause and effect. The error message tells me there’s a bug in my specification because the numbers don’t add up to 1.0. I still don’t know why they have to, or even why they should. It was just, you know, something I was able to infer from the fact that they seem to already add up.

I can imagine several ways of taking arbitrary values and making them work out, through some sort of proportional renormalization. That way, people could put whatever numbers they wanted into the values of their :genetic-operator-probabilities hashes.

Preferring something to work in a different way isn’t a “bug”, though, is it? I personally prefer to be permitted to put whatever numbers I want into a list, and to have them be normalized so that they represent fair proportions. If I were to implement that algorithm, though, I’d need to test they were all strictly positive values, right? It’d be bad to include a negative weight for an item in a probability distribution.

Seeing the numbers used in a way unlike my preference makes me want to ask “Why did you do it that way?” But I can’t say that the current setup works “in a way unexpected”. Hell, I specifically told you I noticed that the numbers add to 1.0, and I was right.

“Why did you do it that way?” has an obvious answer: Because that’s the way the rest of the program expected it to be. The cause is contingent on the other decisions made. All kinds of other important stuff might not work, if these numbers were outside of their expected envelope.

All code is contingent on other code.

When I first evolved a successful Push program that solved the FizzBuzz problem—which I trained on the inputs \([1,50]\)—every training case I gave it worked perfectly. It was only when I began trying to explain how it worked that I noticed the fact that it ended up in an infinite exec_y “waiting loop” at the end of calculating the right answer.

That’s not a “bug” at all, in context. The Push interpreter assumes that if a program isn’t finished (by running out of :exec items) by the deadline, it will be force it to halt. But it got my attention. It made me wonder whether my training cases were the best choices. When I tried an input of 4545, the answer I got was wrong.

It wasn’t “wrong” because the algorithm didn’t work. It was “wrong” because the algorithm’s inefficiency meant that for larger input values, it hadn’t finished before the 1000-step deadline.

If I had changed the deadline to a much larger number of steps, it would have “worked”. Even if I had removed the offending exec_y instruction and any other code that was involved in the infinite loop, the program would not have “worked” for large inputs before the deadline (except, I suppose, by coincidence).

But I hadn’t “asked” it to finish up all its work before the deadline. It was just that the code I used to specify the problem had said, without me noticing, that we would use a thousand steps and no more. If I had not specified any deadline, then some default deadline (it turns out to be 150 steps in Clojush) would have applied.

Where is the bug? Is it in the evolved code? I’ve checked, and the algorithm would eventually give the correct answer for any positive input value.

And yet it was unexpected.

Above, I walked through the “Card Trick” algorithm that I evolved to do FizzBuzz. This uses a number of things I’ve come to think of as “idiomatic Push”: counting the number of items on a stack as a “secret constant”, and the use of stack-reordering combinator functions like :string_shove to change the output (item left on top of the stack), conditioned on the input values.

I pointed out that there were several steps that did nothing to help the program produce the correct output, except that they filled it up so that the “secret constants” would work out right. If you deleted, for example, the 'boolean_and instruction from the end of the program, several of the crucial constants would change. But if you deleted one the first 'integer_mod instruction, which seems like it has a use in this context, but which is really there so that the number 15 can be produced by the prior 'exec_stackdepth, something else would change.

What is 'integer_mod for, in a Push program?

It can be used “correctly”, as it is several times in this algorithm, to make a number that is an integer. In this context, I suppose it “intends” to refer to clock arithmetic or counting out a particular item from a reduced list of options. But clearly, in Push (and any other language capable of introspection) it can also be used to fill a space, to count in itself as an object.

Why does one seem more reasonable than the other? Why does one seem better?

Often when I describe genetic programming to programmers, the first thing I hear is some variant of: Can it write code in my favorite language?

The simple answer is “yes”. Of course. It’s just a suite of tools for the abstract manipulation of symbols. Almost certainly the people who invented and expanded your favorite language have produced a formal grammar that describes its syntax, and maybe they have a formal type system written down somewhere, and even if it’s a compiled language there is probably a linter out there somewhere for it.

That’s how Grammatical Evolution works, and it’s also the core of most approaches to Genetic Improvement. Even if you can’t imagine writing valid expressions in Python using “random tokens”, surely you can imagine filling in the blanks in the formal grammar. You can imagine starting with a line of that like stmt: simple_stmt | compound_stmt, and maybe picking to build a simple_stmt this time, and then looking down at the definition simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE, and deciding (at random?) you’ll fill that in with just one small_stmt and a NEWLINE, and then looking down at the definition of small_stmt and picking expr_stmt at random, and so on and so on. The resulting code would be syntactically correct, of course. So you could try to run it. And maybe it would run, and do some stuff.

But what do you want it to do?

You’d want the right answers, given the training cases, of course. But it’s also code in a language you already “know”. I don’t think folks realize how much of the code they read—other people’s code, or their own—is idiomatic enough to communicate intention, without that ever being explicitly stated.

You can write syntactically- and semantically-correct Python code that stores numbers in base-55 notation as the last 33 digits of a string. You can write Python that creates and then immediately closes blocks, or which nests blocks sixteen layers deep (for no apparent reason), or which uses the id() function to get an arbitrary random-seeming number that’s changed every time the program is run.

Genetic programming will do all those things, if it can, and if it finds a reason to do so. It doesn’t “know” what you do about what these parts of a language are “really for”. Evolution is remarkably capable of doing what you ask it to do, but not in the way you expect.

You will find it remarkably difficult to ask it to “write good code”.

I know this to be true, not just from experience with GP, but also because it’s remarkably difficult to ask human beings—including you, and including me—to “write good code”. Even the code we write ourselves eventually ends up looking just like “other people’s code”.

What was I thinking here? What is this for?

So the reason why I shy away from evolving code in Python or Java or whatever is that a new language, like Push, puts us all at the same disadvantage. If I showed you evolved Python code (something I hope to do), you would look at it and say This doesn’t work!, or There must be a bug!, or What total crap!

But it’s doing exactly what you asked. Where is the bug?

When I evolved a FizzBuzz program that entered an infinite loop after it solved the problem, I was wrong to think it was a bug, wasn’t I? It did exactly what I’d asked.

It can be a pain to read a manual. You have to know the correct terms to look up. You need to trust that the authors have arranged things in a reasonable structure. You have to hope that they actually included what you want to know. Sometimes these things aren’t true.

It can be a pain to try to suss out what’s intended by looking at a program’s overall structure and architecture. Sometimes structure and architecture are just many stratified layers of disconnected diffs, fossilized on top of one another, remnants of what the code used to do. Sometimes you’ll see a hint of something somebody thought the code might do some day. Most often, structure and architecture and even idiom end being a habitual echo of all the other projects “like this one” that the programmer worked on before.

It can be useful to work through a tutorial. But almost any tutorial, for any really interesting tool, can ignore the special cases you have in mind right now. Or those will be left to the end of the course and marked “advanced”. And if they’re marked “advanced”, you can bet the documentation will be poorly-justified, and may still miss crucial information. Because marking something “advanced” is a signal that the author doesn’t think you should be doing it—at least not yet.

I would be very tempted to say that all these are examples of code (and documentation of code) “working in an unexpected way”. So they’re bugs.

But where does that bug live?

Other people’s code.

In the end, we’re left doing basic science on it. Or basic sociology, at least.

This is also true for code written by GP. A well-written and effective GP system (like Clojush) will do just what you ask it, at least some of the time. It will produce code that maps your training cases from input to expected output, as well as it can.

If it doesn’t, though, is that a bug?

Here’s something interesting: You will recall that several times I’ve commented on how much text Clojush poops out into the STDOUT stream when I’ve run it. It dumps a record of all its numerous settings, and then it starts building and scoring Push programs, and along the way it gives a wordy wordy report of what it’s found over time. This ends up being, for the FizzBuzz problem I’ve been looking at, about 4 megs of text, over the course of several hours spent searching for a Push program that gets all my cases right.

In a way, I find that irritating. I would rather something less wordy, which sums up what’s happening while also communicating that it’s still… I don’t know, “going OK” I guess. I even looked, for a while, for a way to turn it off.

But genetic programming is intrinsically stochastic. It’s just a finicky sort of random search. It’s never guaranteed to produce a perfect-scoring solution every time. And that may be down to something odd and nonlinear in your choice of settings, or just-plain random chance. You won’t know.

So right now, Clojush is doing another search for FizzBuzz programs in the background on the laptop I’m writing on. Here’s the program it thinks of as “best” so far:

'(boolean_and \space string_reverse integer_mod boolean_dup_times string_fromchar string_replacefirst char_pop 44 in1 "*uzz" string_concat string_nth "fizz" exec_do*while (char_stackdepth) 46 integer_inc integer_fromstring exec_noop "buzz" char_frominteger string_dup exec_dup (string_setchar integer_inc integer_inc) string_split exec_do*times (boolean_dup integer_lte string_split integer_inc) string_split exec_do*times (in1) string_rest exec_swap () (boolean_dup_times string_nth integer_fromchar exec_do*count (boolean_or string_rest integer_lte) boolean_empty string_conjchar) in1 boolean_invert_first_then_and integer_lt exec_do*count (integer_stackdepth) exec_do*count (char_allfromstring))

That program solves all the training cases perfectly, except the ones divisible by 15. That is to say, it mostly works: for 47 out of 50 training cases, it’s fine. I don’t much care how it does so, but it does.

It’s tempting to look at it, and ask: Where is the bug?

I know already that there are many things I might do to “fix” this code. I could probably streamline it a lot be deleting unused sections. I could simplify the “flow” a bit, probably, by isolating chunks of functionality—maybe by minimizing some kind of programming complexity thing, by shortening the time between the step when an item is pushed, and the time when it’s consumed again as the argument to some other instruction.

Or maybe, if I were able to pause and re-start a Clojush search, I could somehow “reach in” to the evolutionary process, and make sure it has access to "fizzbuzz", which seems to be the missing string.

But nope, it just didn’t work out. As I wrote this section just now, the last generation of search completed, and Clojush reported FAILURE as the last line it printed to STDOUT.

Poor thing. It did its best. Other times, it’s managed to do fine.

If only things had gone slightly differently.

Now I find I prefer the idea of having the steady feedback of that spew of text. I don’t want to turn it off any more.

Full circle

I started from an amusing Twitter conversation, where somebody deeply invested in neural networks research made me think he wasn’t aware of several decades of work in genetic programming. From a joke to a misunderstanding, and from there to an experiment or two.

What do people want? I doubt it’s 20000 words about an obscure GP package.

No, seriously: Why do people imagine that they can automate program-writing? And why do they imagine they want code written by computers?

I think, increasingly, that it’s because they rarely think about the social work that code does. People seem to imagine—in much the same way they imagine that neural networks can drive a car “in the same way” that a human being might do so—that it’s a matter of taking the right inputs and producing the right outputs. What they don’t understand is the tacit cultural and social baggage that all code carries.

As the man said: fish don’t know they’re in water.

It’s relatively easy to evolve assembly code to do whatever you want. There are a number of commercial products that do just that. Hell, why do you even bother writing more programs, dude? Why don’t you just use the assembly code we can evolve to do those things? If you can specify input:output pairs, we can produce assembly code that will do that exact thing.

It’s a solved problem, right?

I would rather use GP to evolve working code that is utterly unlike what I might write. I want to be irritated and confused. I want it to find bugs in my specification, in my mental state, in my assumptions about the world. I want it to show me things that have to be refactored, that I have to do real thoughtful first-principles analysis to understand. Code that drives me to cover a bulletin board with yarn and photos of its victims, to lose sleep. Code that makes me wonder what I would normally do in the same situation, whether it makes me understand all the more why I should keep doing it my way, or change my ways and adopt this miraculous new invention it’s shown me.

And I want it to do these things in its own idiomatic way, just outside the comfort zone I’ve stopped examining over the years.

I want it to be the critic of my assumptions about the world, to say “This is water.”

Turns out, it does that remarkably well.

I don’t want to bore you with an explanation of why I think I should use lexicase selection. Maybe another time. ↩