Draft

One helpful way to think about GP

Draft of 2016.08.29

May include: data science, GP, &c.

The other day I mentioned I’d promised Brian Clark at OSU some advice on what to do when “GP doesn’t work” in a data analysis situation. In that introductory essay, I tried to make the case that unlike the tools we’re all comfortable calling “Machine Learning”—most neural models, support vector machines, decision trees, nearest-neighbor classifiers, and so on—symbolic regression specifically, and genetic programming more broadly, are useful because they don’t immediately show you the answer you expect and want. They are, in other words, tools better used for seeing what’s “in” your data, and they really shouldn’t be expected to work in the reliable and composable way that those other tools do.

Most published genetic programming research papers are about fixing things that “break”, in this sense. In the earliest days, when everybody thought GP was an “automatic programming method”, people found “pathologies” in the algorithms’ behavior when applied to “benchmark” problems with known solutions. Whenever one of us invents some new scheme for representing solutions, or for manipulating those solutions in the process of search, or for describing and handling the goals of search in some new way, the complex nonlinear dynamical system we’re fucking around with inevitably finds a way around our well-intentioned plans. While we might fix “bloat” (for example), we might find we do so at the expense of “interpretability” or “diversity” or some other thing we wanted but forgot to mention.

You’d think that after a few years of working hands-on with the complex interaction between population diversity, search operators, objective definitions, parsimony, indirect representations, performance measures, benchmark problems, and all the rest, some consensus would develop. But to be honest, the result is much more like the schools of martial arts one sees in a Shaw Brothers film.

My Tiger Stance is unbeatable!

It might be better to speak in terms of an Alexandrian pattern language for situations in GP projects. I haven’t done that exhaustively (because I suspect by definition one can’t), but I do have a few that I tend to talk about when I teach workshops or classes, and when I work with consulting clients.

Because the focus of the CHEAPR workshop was the application of GP to explore symbolic regression and classification models, I’ll speak in what follows mainly about symbolic regression problems and classifiers that involve algebraic expressions. But realize that for GP in the larger sense, any computational scheme at all can be used in one’s representation and search space. You might encounter these same situations, and apply some of these same design patterns, in projects where you’re evolving robots or bridge designs, circuit layouts or artworks. The nature of the mathematical and algorithmic functions and the ways in which evaluation is performed (the “representation”, though it’s more complicated than that) can almost always give rise to the problems I’ll mention here.

That said, I’m about to spend this entire chunk of exposition defining terms. Isn’t that always the case, for technical subjects? Hey, it’s either that, or you’d need to live through an exhausting “learning by example” session. Mine has extended more than 20 years so far, and I don’t know how to abbreviate it much. So let me, just this once, try to define some terms first. I did some things, so you don’t have to.

Some definitions

These differ in almost all cases from those you’ll find in the literature. That’s how it works when I talk about it.

Answers

I prefer to use the word answer to describe the data structure we agree might be a “solution” to our problem of interest. In the case of symbolic regression, a single answer may be a tree describing an equation, or an array containing a number of weighted linear trees over input variables which are added to one another during evaluation, or a program in some weird representational language which will be interpreted as a solution, and so on. In other words, let’s define it as being any and all the stuff you would need to say with confidence, “This here is a prospective solution to the question I’m asking.”

It might be a totally stupid solution. But even if it’s stupid, when you have an answer in hand, and when you have decided on an algorithmic and reproducible way to exercise and evaluate any arbitrary answer, then you are in a position to compare one against another.
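
To make that concrete, here’s a minimal sketch in Python (with names I’ve made up purely for illustration) of the kind of thing I mean by “an answer”: just a data structure, plus our agreement that it could be exercised and scored.

```python
# A hypothetical "answer" for a symbolic regression problem: an expression tree
# written as nested tuples, standing in for x1 * 2 + 1.2 * x3. Nothing here
# evaluates or scores it; it's just the structure we agree to treat as a
# prospective solution.

answer = ("+", ("*", "x1", 2.0), ("*", 1.2, "x3"))

# A different but equally legitimate representation: a plain string in some
# little language your interpreter understands.
another_answer = "x1 * 2 + 1.2 * x3"
```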

Scoring an answer

In supervised learning projects—including almost all symbolic regression and classification problems—we usually have some finite suite of data with associated expected “results”.

I’ll refer to these known facts we’re trying to fit as training cases. What I’m about to say is also true of test and validation cases, if you decide to use those in your project.

Let me start by walking through the process by which I might score a given answer “on” one training case. I’ll use an example that’s common in symbolic regression problems, to begin with… just to play to your assumptions. Then I’ll generalize it to more reasonable real-life approaches, and hopefully you’ll see why the relatively complicated abstract structure I’m laying out is so useful.

Suppose my goal is to fit this data, using symbolic regression:

x1   x2   x3    y1
------------------
 3    7   2.0   9
-2    7   2.0   99
 1    8   2.0   999
 5    9   2.0   99
 4   -1   9.7   9
...

I’ll call each row here a training case, in keeping with the statistical tradition. My goal is (for now) to find answers which are equations over x1, x2, and x3. I’m going to assume (for now) that the value of y1 is simply what I “get” when I plug specific assignments of those variables into an answer equation, and simplify with normal mathematical rules.

The context of a training case includes, in this example, one specific assignment of values. The context of the first training case might be written as {x1: 3, x2: 7, x3: 2.0}.

I exercise a given answer in a particular context, in this example, by making that context’s assignments in the answer, and evaluating the resulting mathematical expression with simplification. For example, if the answer I’m exercising is x1 * 2 + 1.2 x3 - sin(x2)/(x1-x3), when I exercise that answer in the context of the third training case, the result of simplifying the expression will be 5.389358247 (I think). I want to call that number the final state of the answer when exercised in this context.

Note that exercising an answer in a particular context does not define a measure of “error”. In this particular example, I can interpret the final state as being 5.389358247, but it’s really important to realize that I have not yet specified a mechanism for scoring that value.
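
Here’s a minimal sketch of that situation in Python, with hypothetical names of my own choosing. The answer is stood in for by a plain function over the input variables; exercising it in the third training case’s context produces a final state, and nothing more:

```python
import math

# Hypothetical names throughout; the "answer" here is a plain Python function
# standing in for whatever evolved representation your system actually uses.

def answer(x1, x2, x3):
    return x1 * 2 + 1.2 * x3 - math.sin(x2) / (x1 - x3)

def exercise(answer, context):
    """Evaluate an answer in a context. What comes back is its final state,
    not an error score; no rubric has been applied yet."""
    return answer(**context)

final_state = exercise(answer, {"x1": 1, "x2": 8, "x3": 2.0})
print(final_state)   # roughly 5.389358247
```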

I will call that crucial component a rubric. In this example, I would be comfortable defining a single rubric for each training case, plus one for the aggregate score I suspect you want to see me use. Think of each rubric as a function over the set of answers, which given an answer will return a single scalar error score for that answer.

For each training case in this example, let’s define a simple rubric. I tend to like to think of error measures in terms of absolute deviation of expected and observed results, so let’s use that here.

The rubric for the first training case can be described as: “The error score is the absolute difference between the number 9 and the final state reached when I evaluate the answer in the context {x1: 3, x2: 7, x3: 2.0}, simplifying the equation using standard arithmetic rules.” Similarly, we can imagine there is a different rubric for each row in the training set.

However, if we’re following the conventional approach of comparing our answers on the basis of an aggregate statistic over training cases, I will also want to define a secondary rubric, which would read something more like this: “The error score is the arithmetic mean of the error scores of all simple rubrics over training cases.”
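
In code, that pair of definitions might look like the following sketch (hypothetical names again, and assuming answers are callable as in the earlier snippet):

```python
# One simple rubric per training case, plus a single aggregate rubric over all
# of them. Answers are assumed to be callables, as in the earlier sketch.

training_cases = [
    ({"x1": 3,  "x2": 7, "x3": 2.0},   9),
    ({"x1": -2, "x2": 7, "x3": 2.0},  99),
    ({"x1": 1,  "x2": 8, "x3": 2.0}, 999),
    # ...one entry per row of the table above
]

def make_case_rubric(context, expected):
    """Rubric for a single training case: the absolute difference between the
    expected value and the final state reached in that case's context."""
    def rubric(answer):
        return abs(expected - answer(**context))
    return rubric

case_rubrics = [make_case_rubric(ctx, y) for ctx, y in training_cases]

def aggregate_rubric(answer):
    """Secondary rubric: the arithmetic mean of the simple rubrics' scores."""
    scores = [rubric(answer) for rubric in case_rubrics]
    return sum(scores) / len(scores)
```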

That’s crazy complicated, Bill

I bet you’re thinking this feels weird and convoluted. That maybe I’m making something trivially simple—as easy as any high school student’s homework problem—into something way more ornate than it needs to be just to score an equation fitting some data.

That’s good. This is Back-of-the-Book “advanced” stuff when it comes to evolutionary algorithms generally, and GP specifically. Thing is, I’m trying to save you 20 years’ worth of being frustrated. So bear with me.

Let’s get a little bit fancy, and hopefully you’ll see very quickly where I’m going. A “little bit fancy” here means no fancier than you should expect for any real-world problem.

Context

If your project is one where you’re using symbolic regression, you may imagine that what I’ve called the context of a training case will “just” be some specific assignment of numerical and binary values to defined independent variables.

But in fact, the context includes a bit more, even without you knowing it. Even in simple arithmetic symbolic regression projects. What happens when an answer includes an expression that evaluates in such a way that division by zero is attempted? What happens when a result is Infinity, or you add Infinity to a number, or divide by Infinity? What happens when you add Infinity to -Infinity?

Those are also part of the context.
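
To make that explicit: part of the job of defining a context is writing down, somewhere, exactly what those edge cases do. Here’s a minimal sketch of the sort of decisions involved; the particular choices are mine, purely for illustration:

```python
import math

def protected_div(numerator, denominator):
    """An explicit decree about division by zero. I return 0.0 here, to match
    the protected-division example later in this piece; other systems use 1.0."""
    return numerator / denominator if denominator != 0 else 0.0

def settle_nonfinite(value, stand_in=0.0):
    """Another explicit decree: what an infinite or NaN final state counts as."""
    return value if math.isfinite(value) else stand_in

print(protected_div(1.0, 0.0))                         # 0.0, because we said so
print(settle_nonfinite(float("inf") + float("-inf")))  # inf + -inf is NaN; report 0.0
```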

Now suppose your simple symbolic regression project permits random variables and functions in your answers. Suppose I have included (for perfectly sensible reasons) a function unif[min,max], which “returns” a uniform random number in the range between min and max. When that appears in an answer, it represents an entire probability distribution. I may want to do something like Monte Carlo sampling or higher-order stochastic algebra to evaluate an answer: that is also part of the context.

In the case of classification problems, you might see this even more easily. Sure, you can define the space of possible answers as those which always return either true or false, but to be honest that seems kind of limiting. We can more easily (and productively) define a classifier as an arithmetic expression plus a threshold function: the arithmetic expression returns some scalar value, and that is passed through a threshold function to produce a “result” of true or false. Or we could be searching for an ensemble classifier, and have a dozen or a hundred results, which must be aggregated. Or we could be looking for an integrating classifier, for which we’d define the “result” as the integral of some variable’s value over time or some range of conditions. And so forth.
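
Here’s a minimal sketch of that “arithmetic expression plus threshold” idea, with hypothetical names; note that the threshold lives in the context, not in the evolved answer itself:

```python
# Wrap a scalar-valued expression so its final state gets turned into a
# true/false "result" by a threshold that lives in the context, not the answer.

def threshold_classifier(expression, cutoff=0.0):
    def classify(**context):
        return expression(**context) > cutoff
    return classify

score = lambda x1, x2: 0.3 * x1 - x2        # some evolved scalar expression
classify = threshold_classifier(score, cutoff=1.0)

print(classify(x1=10, x2=1.5))              # True, since 0.3*10 - 1.5 = 1.5 > 1.0
```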

Finally, in “real” GP projects, realize that the representation of an answer can be any arbitrary structure, including a collection of stuff, including iteration and recursion, including Turing-complete bullshit that will blow you away when you see it. In those cases, the “context” can include not just a complex, full-featured representation language, plus an interpreter for “running” answers, but also a hardware platform on which answers are executed for evaluation in a complex context.

How else you gonna evolve something that runs in a real robot?

Final state ≠ “result”

I’ve drawn a very sharp line between the final state of exercising an answer, and the “result”.

This can be especially confusing for people used to thinking of simple symbolic regression projects. But even there, there are important and very useful cases where you will need to make just that distinction.

Most obviously, suppose that in your project you are interested in finding answers which exhibit good scores simultaneously over several response variables. For example, suppose you’re really interested in the mean and the variance of responses, or you want a classification decision and a confidence measure on that decision. I’m sure you can see that exercising a given answer once in a given context will produce both values. The question is then what you “do” with that information.

The temptation among many people trained in technical fields is to immediately aggregate these two nominally independent “scores” into a weighted scalar.

I’ll just save us time right now by saying: Don’t do that unless you really know what you’re doing.
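
If you want something concrete to hold onto instead: keep the scores separate, as a vector of rubric scores, and only ever compare answers by dominance. Here’s a minimal sketch (my own naming, and a preview of the Pareto-GP material I’ll get to below):

```python
# Keep the scores as a vector, one entry per rubric (error, variance, length,
# whatever), and compare answers only by Pareto dominance.

def dominates(scores_a, scores_b):
    """True when A is at least as good on every rubric and strictly better on
    at least one (assuming lower scores are better)."""
    return (all(a <= b for a, b in zip(scores_a, scores_b))
            and any(a < b for a, b in zip(scores_a, scores_b)))

print(dominates((0.1, 3.0), (0.2, 3.0)))   # True: better error, same variance
print(dominates((0.1, 9.0), (0.2, 3.0)))   # False: neither answer dominates
```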

So, assuming you’re willing to believe me and do what I say for a while, we can talk about one of the most prevalent modes in which GP can “resist” your expectations. This is also one of the most insidious, annoying, and argued-about problems in the field itself. That is bloat.

Which model is better?

x1 * 2 + 1.2 x3 - sin(x2)/(x1-x3)

x1 * 8 + x1 * 9 - x1 * 15 + 
   (3.6 * x1 * x3)/(3 * x1) - cos(0)*sin(x2)/(x1-x3)

You’ll want to say it’s the first one, I hope. I may have made a mistake in making the second one baroque, but my intention is that they be the same expression once simplified. You want to see only the first one, given the choice.
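
If you want to check that claim for yourself, one quick (purely illustrative) way is to hand both expressions to a computer algebra system. Here I use sympy, with exact rationals standing in for the decimal coefficients so floating-point fuzz doesn’t muddy the comparison:

```python
import sympy as sp

x1, x2, x3 = sp.symbols("x1 x2 x3")
r = sp.Rational   # exact 12/10 and 36/10 instead of 1.2 and 3.6

simple  = x1 * 2 + r(12, 10) * x3 - sp.sin(x2) / (x1 - x3)
bloated = (x1 * 8 + x1 * 9 - x1 * 15
           + (r(36, 10) * x1 * x3) / (3 * x1)
           - sp.cos(0) * sp.sin(x2) / (x1 - x3))

print(sp.simplify(bloated - simple))   # 0 -- the two agree everywhere except
                                       # x1 == 0, where the bloated form divides by zero
```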

Some designers of GP systems will do this algebraic simplification for you, either when you view the intermediate results of an ongoing search, or as part of the search process itself. But what you see isn’t really the point here. Consider that if the second example is in a population of evolving answers, more computational effort will have to be expended to evaluate it. Either I will simplify the expression into a more parsimonious form and then evaluate that, or I’ll spend more cycles working through the explicit form regardless of what “cancels out”. We call this “extra stuff” bloat, and in what I’m going to be rude and call “thoughtless” GP systems, it can be a serious problem.

As a result, for decades GP practitioners have tried to eliminate bloat. If you glance over the literature, you’ll see hundreds of papers proposing schemes for reducing bloat in benchmark problems. You should be forgiving: it’s the natural first impulse, when you see something unexpected, to try to squash it out.

But there’s a problem with that approach. When you focus too much on the simplest answers, you reduce the ability of the search process itself to discover innovative answers later on. We can call this “premature optimization”, in the sense of you (the designer or user of a GP system) trying to “optimize” a parameter or algorithm so soon that useful results are prevented from being produced. Because “bloat” is not just “in the way”, it’s also the raw material for new modules and whole new approaches to arise within your population of answers.

Using the common search operators of mutation and crossover, it’s much easier to make a small change to a large, bloated answer and get something qualitatively different from the original, than it is to make a large change to a parsimonious, succinct answer and expect it to be both good and different. So practitioners tend to say “some bloat” is good, because of the prospects for innovation, but “too much” bloat is bad, because one spends time evaluating expressions that have no impact on performance.

The correct approach, I will just say, is to let GP itself do the work. I’ll talk about that below, when I talk about Pareto-GP. But for now, here’s the reason for this side-trip:

Suppose you had those two answers above, the simple one and the complicated one that does the same thing. If you treated the numerical result of evaluating each one as its score, they would have exactly the same error scores. Right? When I substitute specific values of x1, x2 and x3, I get the same simplified numerical final state.

But I can also define a separate rubric on the same training case. That rubric, and the context I define for it, can be this instead:

For a given answer, the number of characters in the string representation of the equation is its score.

For this rubric I am defining a completely orthogonal context for evaluation. I’m not even substituting the arguments into it. I’m simply saying that the score is how long it is when I print it. By separating the ideas of context from rubric, I have permitted us to make a more valid comparison between the two individuals above.
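
Here’s a sketch of that second rubric, mostly to drive home how little machinery it needs (hypothetical names, as usual):

```python
# The parsimony rubric: no context, no substitution, no arithmetic at all.
# The score is simply the printed length of the answer.

def length_rubric(answer_as_string):
    return len(answer_as_string)

simple  = "x1 * 2 + 1.2 x3 - sin(x2)/(x1-x3)"
bloated = ("x1 * 8 + x1 * 9 - x1 * 15 + "
           "(3.6 * x1 * x3)/(3 * x1) - cos(0)*sin(x2)/(x1-x3)")

print(length_rubric(simple), length_rubric(bloated))
# identical error scores on every training case; very different scores here
```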

I’ve surfaced, in other words, my “real goals”. I want an accurate model, which is also parsimonious. That will be the most important and simplest way for you to accommodate unexpected results GP presents to you: revise what you’ve told it you “want”, so that it includes what you really wanted all along.

Simple rubrics vs Aggregate rubrics

Another point I’ve made in this framework I’m describing will seem weird if you’re only used to the statisticians’ and engineers’ habit of relying on aggregate statistics for optimization goals. Why bother specifying one rubric for every training case’s error, and then another rubric for the sum or the average or the max of all those? Aren’t we minimizing the aggregate errors every damned time?

Quick answer: no. If you start from that assumption, you’ll be worse off when you have to change it. And almost every accommodation to a “problem” GP has with your project will involve this very thing. As I said above, you already really want more than one thing, but your habit has been to only think about that “secondary” thing of not-being-too-convoluted-and-stupid as a sort of after-the-fact “check” on low-error solutions. That’s how statisticians (and by extension and association, Machine Learning practitioners) frame these things. Optimize for this one number, then later we’ll check the others.

The more salient and practical reason, though, is that your dataset is not as simple as you imagine it to be. I’ll try to spend a long time going over Pareto-GP approaches of the sort Katya Vladislavleva and Guido Smits have pioneered. But generally the way to think about it is this: Recall that I said GP’s utility lies mostly in showing you the sculptures hidden inside the marble block of a dataset? One of the most important and useful things you can discover, even if your goal is the construction of some sort of automated global model, is patterns in your data.

More on this—over and over and over again—later.

Representation, evaluation, and context

Often the most important, but most overlooked, facet of this scheme I’ve laid out is the clear distinction between the representation of an answer, and the way it is exercised. This is honestly one of the reasons I try to avoid focusing on Symbolic Regression problems in classes I teach, because the near-identity between “algebraic function” and “mathematical evaluation” is so hard for people to get away from. We’re so used to thinking immediately that (x + x + 2 * x)/x “is” 4.0, because we’ve been taught arithmetic simplification for so long, that we just have trouble thinking otherwise. As a result, edge cases can easily trip us up.

Quick question: What is the result of that expression when x is 0.0?

Followup question: If you’ve done some GP, and you say “well, because of protected division, it would be 0.0”, great! Now what’s the first derivative of that function around the origin? Which is where you say: Whoopsie.
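
If you’ve never actually watched this happen, a tiny sketch makes the problem obvious; I’m using the return-0.0 protected-division convention just mentioned:

```python
# Under the return-0.0 convention, (x + x + 2*x)/x is 4.0 everywhere... except
# at exactly x = 0.0, where it abruptly becomes 0.0.

def protected_div(numerator, denominator):
    return numerator / denominator if denominator != 0 else 0.0

def f(x):
    return protected_div(x + x + 2 * x, x)

for x in (-0.01, -1e-9, 0.0, 1e-9, 0.01):
    print(x, f(x))   # prints 4.0 for every value except x = 0.0, where it prints 0.0
```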

So establishing a formal context that is not just the tacit application of “regular math rules” is actually important. Computers and floating-point arithmetic have been designed for maximum applicability, but there are always weird edge cases that we get away with because humans don’t often work with extreme values… when they write their code by hand. GP systems have no such habit, and it’s easy and fun to find cases where infinite or undefined values will arise in any numerical modeling system.

And this is where you say, “Hey! I know all about arbitrary precision numbers!” Which is where I say: great! What’s the arbitrary precision system gonna do when you divide 1/3?

Trust me. I do this a lot.

But the most important reason why I insist we differentiate carefully between representations and the context we build to exercise answers in a given representation? Because you need to be ready and willing to alter both. And they’re not related to one another in any neat, pretty, simple-to-think-about way.

Here are two cases I always use when I teach introductory GP classes. I stole one of them from Nic McPhee, and I think he might’ve picked it up from Riccardo Poli or Wolfgang Banzhaf. The other one, that’s mine; it’s there to make a related point.

This is the stolen one: Take your symbolic regression system, and build a training dataset of \(y=\sin(x)\). Now start a symbolic regression run in which the answers are arithmetic functions which include only the functions \(\{+,\times,-,\div\}\). Start that thing “learning” to fit \(y=\sin(x)\) with arithmetic alone.

What do you think you’ll see?

OK, so we’ve established you’re a nerd already. You’re saying, “It’ll discover the Taylor series expansion for \(\sin(x)\)!” and you probably think that would be pretty cool. Sure, if you’re in an introductory GP class and you’re still being convinced that this weird-ass random process can do anything at all.

But you’re trying to do Data Science, not reinvent the fundamentals of first-year calculus class. Imagine that instead of being impressed, and wondering at the amazing flexibility of this immensely powerful example of real-life artificial intelligence, you saw this happening in your actual experimental application of symbolic regression to real data. Say you thought, “Hey, I don’t want things to be very complicated in the models I look at, so I will just use arithmetic and that will make the models GP evolves super duper easy to understand!” And then you run GP with trivial arithmetic functions on your actually-quite-complex empirical dataset, and “it doesn’t work”. It does all kinds of weird-ass things where it’s adding more and more higher-order polynomial terms to the expressions, and it’s doing some funky rational function stuff, and it’s not really getting anything like a parsimonious “elegant” answer at all.

Do you flip your laptop and say, “This is bullshit, man!”? Or do you notice that, hey… maybe that series of \(x^2\) and \(x^3\) and \(x^4\) and so on are some kind of Taylor expansion of some stuff that involves transcendental functions? I will vote for the latter; flip the laptop later on.

The second example, same subject, is another object lesson. This is called “The Idea of Numbers”, and it’s so misleading-but-obvious that you should wonder why I’m even going to make you do it.

But do it anyway. Write down the year, month number, and day number of your birthday. Mine is 1964,9,11, but yours will possibly be different.

Set up your same symbolic regression system, the one you downloaded or you wrote yourself. Train it on data from your birthday polynomial: \(year + month * x^2 + day * x\). So you have this set up to use \(\{+,\times,-,\div\}\) again, right? And you’ve got a few dozen or a hundred or whatever training cases with the birthday polynomial, right?

This should be so easy, you think. Just random guessing will get this to work, and then it’s just adjusting the numerical coefficients.

Oh, right—about those numerical coefficients. I want you to remove all constants from the system. That is, there should be no ephemeral random constants (if your GP system supports those, which it really should or you should immediately stop using it), and there should be no explicitly defined constants permitted. So in other words, the representation will permit x + x/(x-x*x + x + x + x + x), but it will not permit 2x/(5x-x^2).

Now run that symbolic regression problem, with that constraint.

As in the case where you fit \(y=\sin(x)\) using arithmetic, consider: What are you “asking it to do”, really? It’s relatively easy to construct algebraic expressions that evaluate to 1 (for example, \((x/x)\), sometimes) and 0 (for example, \((x-x)\)). But we need to get from there, somehow, up to 1964, in my personal birthday function. You may well be younger than I am, and have to get all the way to 2000 or something.
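
To get a feel for the scale of the problem, here’s a deliberately naive illustration (hypothetical, and not at all what GP would actually evolve): building an integer constant out of nothing but copies of \((x/x)\).

```python
# The dumbest possible route to a constant when no constants are allowed:
# add together n copies of (x/x), each of which evaluates to 1 (when x != 0).

ONE = "(x/x)"

def integer_expression(n):
    return " + ".join([ONE] * n)

print(len(integer_expression(1964)))   # many thousands of characters, for one coefficient
```

GP will find cleverer routes than this one, of course, but every one of them has to be built from the same sort of scaffolding, and every one of them will look an awful lot like bloat.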

Again, these both feel like artificially tying one hand behind your back. But they’re object lessons intended to surface exactly the sorts of things you may be asking your system to do, without realizing it.

The appropriate response to seeing them is to adjust or extend the representation language you’re using. GP, when you see it running, can tell you that you forgot to include an important function; it can tell you that you neglected an important component of numerical modeling (the numbers, thus the name).

It can also tell you that something is there that shouldn’t be. Look back at the straw man problem where we try to fit \(y=\sin(x)\) with only \(\{+,\times,-,\div\}\). Suppose I tell you to add \(\sin\) to that set. How does that make you feel?

A related behavior arises often in GP projects: an over-eager and over-simple solution can arise which makes you realize that the question you’re asking, the problem you’re attempting to explore, is not as simple as you imagined. If GP tells you that it’s trivial to solve the question you’ve posed with some easy-to-reach tool, you should consider whether that suffices.

Examples next time

This has turned out to be more of a description of a philosophical framework and architecture than a list of design patterns. Rather than walking through cause-and-effect “if you see X, in this context you may want to try Y” design patterns, I’ve jumped ahead and just dropped the overarching model here on the page.

But these are the terms I’ll be using as I actually talk about design patterns. There will be answers, not “individuals” or “genomes”, because those are both loaded concepts that mislead as often as they inspire. There will be many rubrics instead of one “objective” or “fitness”, because 90% of the time there are, or should be, multiple objectives, and “fitness” is another badly overloaded metaphoric term you should avoid. Also, I haven’t even mentioned lexicase selection yet, or why that might be useful. Anyway, finally: there is a sense of the state of evaluation, which I’m insisting should be different from “result” or “output”, because sometimes you want to consider aspects of an answer that are not “what result it gave”. The answer may not even be a function in that sense.

And as I finish up, let me reiterate what I tried to say the other day: Don’t consider GP a “tool” in the same way that Machine Learning systems are tools. When you undertake any interesting or useful GP project—even if you’re using symbolic regression—you’re asking the system to argue with you, to surprise you in some way. The tools of Machine Learning are broken when they do that.

The benefits of using Genetic Programming depend entirely on the fact that this happens so often. The abstract structure I’ve spelled out here is in support of that distinction.