The very idea of numbers
Draft of 2015.07.24
For the last couple-three months I’ve been working—talking with Tom Helmuth and Nic McPhee—towards an attempt to refactor the Clojush package for genetic programming. *Working towards* because I started knowing no Clojure, and also because the code is rather complicated, and of that sort I tend to call “born legacy”.
Over the last few weeks, though, I’ve done some design spikes, Nic and I have wired in Brian Marick’s excellent Midje testing framework so we can prep for surgery, and I’ve made an effort to put the system through its paces by playing the role of a new user. “New user” is a pretty easy role for me to play in some senses, given the above, but harder to claim since I’ve been writing Push language interpreters and GP systems for a while now.
The field of GP tends to approach itself on the scale of problems—a habit I suspect is maladaptive but can’t be helped—which is to say as a field we tend to frame our “challenges” and “examples” and “tutorials” around how “well” GP “solves” a particular benchmark problem. More on why this is maladaptive on another day; I am here to play the role of new user, so let me play along. Clojush is bundled with the requisite pile of problems: demos that surface aspects of “problem setup” and reporting, and which try to present target contexts of interest without driving too far off into the inevitable thicket of difficulties that arise whenever you try to aim a big, complex, supposedly “self-adaptive” system at a big, complicated, and supposedly “interesting” problem.
Since the most direct modality for using Clojush seems to be “add and then run your problem”, that’s what I’ve been doing in this most recent exercise. I’ve collected a bunch of these problems through the years, which I mainly use when teaching introductory tutorials or making arguments in person, and I’ve picked a couple to build out in Clojush. They’re there now, and I should talk about them both.
One Hand Tied
All that said, it should be clear by now that I’m a bad actor when it comes to these sorts of “demo and pedagogic problems”. While most folks like to show little satisfying case studies that play out in predictable and timely ways, most of my didactic thrust is a litany of *it’s more complicated than you think*.
When I teach classes and workshops, especially those where the participants are building a GP system themselves, one of the first cases we work on is always a symbolic regression problem to fit \(y=x+6\). It’s not an especially hard problem to understand or solve, but it can be especially informative for that very reason: knowing the answer and being able to plot it on a Cartesian coordinate in real time lets one see the many missteps, and feel the struggle of this ridiculous little mess of an “artificial intelligence” as it learns first that what you want is for it to return a number, then learns that the number should change, then learns that the numbers depend on the input, and so forth until it finally discovers some basic arithmetic structure that works out to mean \(y=x+k\)… but it has no inbuilt capacity to adjust \(k\). It’s a visceral shock, seeing the forward progress of our struggling wee system grind to a halt when, armed with wide-ranging chaos and rigorous ontological order, it tries (and often fails) to see the salient difference between 5.881 and 6.0.
The provocation of \(y=x+6\) is supposed, in my classes, to be a sense that we as “users” don’t understand what we’re asking our systems to really do, and as a result we often equip them badly for even the most trivial tasks. Do we really want every system to be its own MacGyver, or are we willing to give it, ab initio, a way of quickly coping with the simple stuff?
As you may imagine, I side with the angels; I tend to argue we don’t want to watch the poor little duckling struggle to get up the street curb. The simple stuff, things like knowing the difference between two floating point numbers, isn’t our point at all; we’re much more interested in the miraculous discovery of pattern from data, not the fiddly details.
Sometimes this point doesn’t sink in, even though it’s such an important point to me. So that’s when I trot out the One Hand Tied problems.
The first of these is the Birthday Polynomial. Like \(y=x+6\), this is a symbolic regression task: I usually ask folks to build a quadratic target which is based on the date of their birthday. I was born on September 11 in 1964, so my Birthday Polynomial is \(y=1964 + 9x + 11 x^2\).
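For concreteness, here’s the target as a function—a minimal sketch in Clojure; the name `birthday-polynomial` matches the helper the error function later in this post calls, but this particular definition is my own reconstruction:

```clojure
;; A minimal sketch of the target function. The signature here is my
;; own reconstruction of the helper used in the error function below.
(defn birthday-polynomial
  "y = year + month * x + day * x^2"
  [x year month day]
  (+ year (* month x) (* day x x)))
```

So for my birthday, `(birthday-polynomial 2 1964 9 11)` gives `2026`.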
Simple enough, right? Should work almost exactly like \(y=x+6\), since it’s just some addition and some multiplication and a couple-three constants. Assuming \(y=x+6\) was eventually solved by the system (a necessary precondition, I have to stress), it’s only a matter of scale, right?
But then we also make a minor adjustment to the way the GP problem is set up. See, there is an interesting design decision implicit in every GP system’s setup, which is the suite of tools we include when establishing the prospective space of solutions we’re interested in: within the context of a particular problem, we may want to use or disallow certain primitive components as part of the constraint on our exploration. If we’re interested in human-readable, “familiar” mathematical models of pattern in data, we might insist that the system only use addition, subtraction, multiplication, division, maybe some exponentiation or logarithms. Or as Ying Becker has said many times in her work, when you’re finding models to show to investors who handle literally billions of dollars, you can’t show them models that include the sine of a price, because that isn’t something they already do: your goal is the provision of models in a social context, not just whatever works. Every decent GP package, as a result, surfaces this important decision as part of its process. You “phrase” a problem not merely in terms of the training and test data you’ll use to inform it, but also in terms of the subset of all possible computations you are willing to let it consider as possible solutions. Eureqa starts a problem setup by giving you a little checklist of mathematical functions, and further admonishes you not to pick too many.
What the Birthday Polynomial With One Hand Tied does in my workshops is turn off constants. That is, we use whatever functions and other tokens our language contains, but “uncheck” literal values of any other sort. “Random programs” can do math, and they can have the input value, but they have to invent a way of talking about the necessary numerical constants. A “solution” in this setting literally cannot contain “1964” or “9” or “11”… because I am a mean daddy and I say so.
What Happens
The first reaction is often that this is “impossible”, to which I usually find myself pointing out that we are in the room together right now on the premise that asking a pseudorandom number generator and a big pile of badly-written guesses to explain the fucking universe for us is a reasonable plan, so “impossible” may not be the word they’re looking for.
But I also point out that there are numbers available: there’s \(x\), which is usually little but at least it’s a fixed point of some sort. And there are, tucked away in there, the ideas of \(1\) and \(0\): because \(x-x=0\) and \(x \div x = 1\) (for nonzero \(x\), anyway). And there’s a whole pile of functions that make numbers into other numbers: multiplication and addition, incrementing and modulo and all those others too. We throw all of those in, plus whatever stuff there is in the GP’s language already that might produce numbers: Push for instance can count how many `boolean` values are on that stack and have that number to play with, or it can (potentially) make a big string somehow and convert that to an integer (I guess). There are loads of numbers in there.
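To make that concrete, here’s a minimal sketch—my own, not Clojush code—of how a program could bootstrap constants from nothing but \(x\) and arithmetic:

```clojure
;; A minimal sketch -- my own illustration, not Clojush code -- of
;; bootstrapping numeric constants from nothing but the input x.
(defn derive-constants
  "Build 0, 1 and 2 without a single numeric literal in sight."
  [x]
  (let [zero (- x x)        ; x - x = 0, for any x
        one  (quot x x)     ; x / x = 1, for any nonzero x
        two  (+ one one)]   ; and addition gets us everything else
    {:zero zero :one one :two two}))
```

So `(derive-constants 7)` gives `{:zero 0, :one 1, :two 2}`, and from there it’s “just” a long climb up to 1964.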
This is not to say it’s not a “hard problem”. It’d be a crappy thing to do to an enemy, frankly, and here we are, supposedly fostering the next wave of artificial intelligence…. Which is my point, of course: I want people to really see the casual cruelty of our advantaged and elite position, as supposed “thinkers” who take for granted our inbuilt experiences, our accumulation of instinct, common sense, and habit, but who nonetheless want some pseudorandom number generator and a pile of badly-written guesses to be intelligent in a surprising way.
Because you put up with so much from me, I will tell you the secret lesson again and maybe this time it might stick: Don’t be a mean daddy to your code. Don’t be afraid to reward partial success, don’t withhold crucial raw materials, don’t assume that your own brain works on anything but received wisdom and bad heuristics, and above all don’t tease the robots.
The point is that there is no Prime Directive. What really happens, I hope, is that in undertaking this surprisingly painful process, and in watching the little duckling try and fail to get up the curb, the participants in my workshop begin to question the very structure of what they consider “doing GP”. That they give a moment’s thought to the premise that it’s somehow a good idea to create a complex system that’s literally supposed to be able to “think”, but then immediately lock it into an isolated Skinner Box as soon as it’s “born”, letting it die in there without any help from the outside world. Hundreds or thousands of times in a row.
My Little Duckling
I was supposed to be talking about me playing the role of New User, so let me get back to that.
The process of setting up a “problem” in Clojush isn’t particularly well documented at the moment, but it’s not especially painful either. There are plenty of examples from which one can copy details.
The core, I suppose, is the error function (more typically in the literature, “fitness function”) we use to point towards our goal:
```clojure
(defn missing-numbers-error-function
  "Returns the absolute error."
  [number-test-cases]
  (fn [program]
    (doall
      (for [input (range 0 number-test-cases)]
        (let [final-state   (run-push program
                                      (push-item input :input (make-push-state)))
              result-output (top-item :integer final-state)]
          (if (and (number? result-output))
            (abs (- result-output
                    (birthday-polynomial input 1964 9 11))) ;; edit this so it's your birthday
            1000000000))))))
```
Basically this goes through some Clojush-specific gyrations to say “Run the program that needs to be scored, given the value of input \(x\) has been pushed to the `input` stack. When it’s finished running, the ‘error’ is defined as the absolute difference between the top `integer` on that stack and the desired target value, or a default ‘no answer’ value of `1000000000` if no `integer` was returned at all.”
That last clause is salient but may be unfamiliar, so let me focus on it. Push is an interesting language with very few rules: if a token in the running program is a literal, then it gets pushed onto the stack of the literal’s type; if a token is an instruction, then it takes its arguments off the stacks of the appropriate type(s), and pushes the result(s) onto the requisite stacks. So for example the Push program `(1 2 integer_add)` first pushes `1` onto the `integer` stack, then `2` onto the same stack above `1`, and then `integer_add` will be executed by taking its two arguments off the `integer` stack, adding them, and pushing `3` back onto `integer`. All Push instructions, with few exceptions or other tricks, work this same way; the only other important thing you would need to know is that if any arguments are missing when an instruction is applied, then the instruction just disappears silently without changing the interpreter’s state. So by extension you now know that running the Push program `(1 integer_add)` will end up with `1` on the `integer` stack, and that’s it; the other argument for `integer_add` is missing, so nothing bad happens.
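Those two rules are small enough to sketch. Here’s a toy interpreter—my own illustration, emphatically not Clojush’s actual implementation—that handles only integer literals and `integer_add`, including the silent no-op on missing arguments:

```clojure
;; A toy sketch of the two Push rules just described -- my own
;; illustration, NOT Clojush's interpreter.
(defn step
  "Apply one program token to the map of stacks."
  [stacks token]
  (cond
    ;; rule 1: a literal is pushed onto the stack of its type
    (integer? token)
    (update stacks :integer conj token)

    ;; rule 2: an instruction pops its arguments and pushes its result...
    (= token 'integer_add)
    (let [[a b & more] (:integer stacks)]
      (if (and a b)
        (assoc stacks :integer (conj more (+ a b)))
        stacks))   ; ...unless arguments are missing: silent no-op

    :else stacks))

(defn run-toy-push
  "Run a program (a list of tokens) against empty stacks."
  [program]
  (reduce step {:integer '()} program))
```

So `(run-toy-push '(1 2 integer_add))` leaves `(3)` on the `:integer` stack, while `(run-toy-push '(1 integer_add))` leaves just `(1)`.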
But there are loads of Push instructions, and they can (and should) fail silently whenever they’re missing arguments of the required type, and also we’re talking about random programs, so there may not even be any arguments. So consider for example a complete Push program like `(boolean_or float_add)`: there are no literals in the program, so nothing on any of the type stacks, and so neither the `boolean` nor `float` function will do anything, and we end up with all empty stacks.
Nonetheless, this is GP: `(boolean_or float_add)` is a perfectly viable candidate solution to the Birthday Problem. It’s not good, but it’s in the space of permitted solutions, and so we are forced to give it a score. And in the function I’ve sketched above, that score will be a billion “error” points for every case it is exposed to but fails to respond to.
There are innumerable ways for Push programs to not work at all. A program may not produce a result of the correct type because it doesn’t contain any instructions that return that type. Or it might produce the correct type of result, but eat it later in some ill-advised predicate operation, converting the one `integer` it cobbled together into a `float`. Or it might invoke one of the several `flush` instructions, which immediately empty a given stack of all values. And so on.
Notice though that these are not bugs in the Push language, or even in the particular program we’re exercising, but rather “missteps” in the context of our target problem. Any Push program will run, and as a result it does (zero or more) things as the tokens that compose it are examined and trigger changes in the interpreter’s state. It’s on us to call them “better” or “worse”, and to score them in some quantitative way for the task we want to coax out of a sort of badly-hummed, out-of-tune memory of evolutionary search.
The error function, as I said above, is only part of the definition of the problem. The rest is this little pile:
```clojure
;; Atom generators
(def missing-numbers-atom-generators
  (cons 'in1
        (registered-for-stacks [:integer :code :boolean :exec :char :string :float])))
```
This is Clojush’s (rather awkward) way of saying, “Use the input value \(x\), and all instructions in the entire language which refer to the `integer`, `code`, `boolean`, `exec`, `char`, `string` and `float` types¹, and nothing else.” In other words, this is where the hand has been tied behind the system’s back, because what is not listed is the ability to introduce `boolean`, `integer`, `float` and other constants in the programs it explores.
Sadly, while that defines the “problem” it isn’t enough to run the system; there’s also a deal of bookkeeping to be done. That happens in this block of code:
```clojure
;; Define the argmap
(def argmap
  {:error-function                     (missing-numbers-error-function 20)
   :atom-generators                    missing-numbers-atom-generators
   :max-points                         500
   :max-genome-size-in-initial-program 300
   :evalpush-limit                     1000
   :population-size                    1000
   :max-generations                    300
   :parent-selection                   :lexicase
   :final-report-simplifications       1000
   :genetic-operator-probabilities     {:alternation      0.5
                                        :uniform-mutation 0.5}})
```
Clojush uses a single big `argmap` structure to manage the many parameters that affect the Push language (like `:error-function` and `:atom-generators` above), and also the GP search (all the rest, and a few dozen more that take their default values). Teasing those apart from one another is, actually, the point of the refactoring project I’m undertaking: both the interpreter you would use to run a Push program and the evolutionary search you would use to find a Push program use the same huge kitchen-sink argument map. This is the minimum pile of changes I think I can get away with, for the sake of making progress in going from an idea for a project to seeing it run on my laptop.
- `:max-points` is the maximum number of tokens (more or less) present in the programs we’re evolving; because search operators like crossover can produce new programs that are longer than their “parents”, there needs to be a limit somewhere which keeps them from growing without bound. This limits Push programs to (more or less) 500 tokens; bigger ones are thrown away.
- `:max-genome-size-in-initial-program` is a sort of starting point for our search. The Push programs in Clojush are not represented directly for the purposes of evolutionary search, but rather there’s an indirect representation which can be snipped up and recombined more easily than Push proper can. This starts us with fewer than 300 chunks.
- `:evalpush-limit` sets the patience with which we evaluate Push programs when we exercise them. Push program termination is undecidable, since there are a number of ways programs can enter infinite loops or recurse without termination, but also a number of ways they can rewrite themselves out of those loops. They can run a long time, in any case. This sets a line in the sand and stops them dead after 1000 “steps”.
- `:population-size` is how many random Push programs we start with, and sets how many “children” are made in each generation of search.
- `:max-generations` is a deadline for search; we start with 1000 random genomes (and thus 1000 random Push programs), and breed those with one another, and those children among themselves, and so on through 300 iterations, hoping that we’ll get to a “winner” by that time. We can’t restart afterwards, so it’s best to have this set to a “reasonable” time frame. That said, on my pretty good laptop, 300 generations of 1000 individuals takes a good four or five hours to run, so we’ll have lunch at least before we get there.
- `:parent-selection` is set to lexicase selection, an algorithm invented by Lee, and which I find especially interesting for a number of reasons I’ll go on about interminably, but later. Not now.
- `:final-report-simplifications` is an interesting affordance. Push programs, as you might have worked out from my example above and my hint that there are well over a hundred instructions in play, do all kinds of stuff. But our goal task—returning an `integer`—just involves the last thing the Push program did that produced an `integer` result. So (at least in some cases) there’s a lot of extra stuff that might not be salient to the outcome. This argument controls a “simplify” functionality that takes a given program, runs it, deletes an arbitrary token, and keeps checking to ensure that it gets no worse, for up to 1000 attempts at deleting randomly-selected tokens. If a deletion makes things worse, we backtrack; if a deletion does no harm, we move forward. Thus, the resulting program might be exactly as long as it started “pre-simplification” (if no single instruction can be taken out), or it might be almost entirely gone if a succinct kernel of value is tucked away inside it. Usually the simplified result is somewhere in between, of course.
- `:genetic-operator-probabilities` are controls for how GP works. It’s complicated. These are copied from another example.
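That “keep deleting tokens while it gets no worse” simplification loop can be sketched as a tiny hill-climber. This is my own toy version, not Clojush’s actual code; `error-fn` here stands in for whatever scoring function the problem uses:

```clojure
;; A toy sketch of the simplification idea -- my own version, not
;; Clojush's implementation. Repeatedly try deleting one random token;
;; keep the deletion only if the error gets no worse.
(defn simplify
  [program error-fn attempts]
  (reduce
    (fn [prog _]
      (if (empty? prog)
        prog
        (let [i         (rand-int (count prog))
              candidate (concat (take i prog) (drop (inc i) prog))]
          (if (<= (error-fn candidate) (error-fn prog))
            candidate    ; no worse: keep the shorter program
            prog))))     ; worse: backtrack
    program
    (range attempts)))
```

By construction the result can never score worse than the program you started with; how much shorter it gets depends on how much dead weight was in there.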
So, long story short: it’s hard on this little duckling to solve this particular problem. In fact, I ran it a couple of times with these settings, then moved things over to the Big Desktop computer and let it run longer and with more “oomph” in the revised settings, and nope, I see something like this almost every time:
```text
Current time: 1437726655671 milliseconds
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
FAILURE
```
Being more accommodating
Well, technically I see that message above at the end of about 3 GB of preceding text. Clojush is nothing if not verbose: I get reports of the best in each generation, by three or four criteria for “best”; I get detailed listings of programs and genomes (which are different); I get error measurements; and so on and so forth. There are 1000 individuals in each of 1000 generations, and each one gets run over 20 input:output pairs for evaluation, so there are twenty million “runs” of Push programs involved, each of which can involve up to 1000 steps. No wonder my laptop gets hot.
And yet we don’t get a good enough answer, meaning we don’t get one that’s able to “understand” that \(y=1964 + 9x + 11 x^2\).

1. The `:exec` thing is harder to explain than I want to take time for here, but I should say that the `:exec` stack is the stack on which the Push program is actually running. So when I said a Push program “is” `(boolean_or float_add)` in my example above, that is a cartoon of the `:exec` stack in the interpreter at the beginning of execution. The interpreter pops items from the `:exec` stack as it runs the program, and it’s gradually exhausted as the program executes one token at a time. It gets complicated, and worth more time than you have right now, because Push has a number of important instructions that put new things back onto the `:exec` stack. Save that for another day. ↩