Considering lexicase selection algorithms in SQL (and NoSQL)

Draft of 2018.02.07 ☛ 2018.02.08

I’m near the the end of a several-weeks’ holiday break in the continued fractions story as I prep the next research projects I intend to discuss with colleagues at a workshop. The continued fractions thing is part, but implementation has its own noteworthy aspects.

Specifically, I find myself wondering (for at least the second time in three years) whether it’s reasonable to implement the lexicase selection operator in SQL. Or, if not, in something like a NoSQL persistent store.

I specifically mean implementing the selection algorithm in query language here, not just implementing the algorithm with regards to data stored in a database. I know how to ask the database to give me the whole data table, and then work in memory with that in whatever language I like that day to apply lexicase selection. What many database vendors (especially the good SQL ones) tout is the strength and optimization of their query preprocessors. And as anybody knows who’s looked into implementing some simple-sounding algorithms in SQL (for example, finding the median), computational complexity and algorithm speed are especially salient in that setting. Especially in these modern times of million-row, thousand-column tables.

Lexicase itself

Let me abstract Lee Spector’s and Tom Helmuth’s lexicase algorithm a bit, to simplify. There are at this point more than a dozen variants (many due to William La Cava and Armand Burks, among others), all making the operator a bit different in implementation details, if not in principle. So I’ll stick to basic principles here.

First, as with many machine learning setups, this algorithm is aimed at supervised learning of models, based on how well they approximate some set of observed data. The data is a set of training cases, each with some “inputs” and an expected “output”. Each model under consideration is a function mapping the space of inputs onto the space of outputs.

We want to preferentially keep models where the mapping is closer to the data.

Think of linear regression here. Say at some point in finding a model, you have two different linear models: \(y = 0.5 x + 8.2\) and \(y = -1.9 x + 17002\), and a data set of observations of \(y_i\) for several different \(x_i\) observations. You’ve been taught to calculate the summed square errors for linear models like these, given the data, and to prefer the one with the lower SSE score.

To calculate the summed squared error (or residual sum of squares), you “run” each model \(m\) with each \(x_i\) value, and determine its predicted \(f_m(x_i)\) output. Then you compare that with the observed \(y_i\) (usually by taking the difference) and maybe square it, and then add all those together. That is, for a given model \(f_m(x)\) and a set of data in the form \((x_i,y_i)\), you calculate

\[\sum_{i}{(f_m(x_i)-y_i)^2}\]

Not so in lexicase selection.

Lexicase selection algorithms (all of them) aim to disaggregate the error measure, and in an odd way you may not have considered before. Lexicase selection makes you choose one model over another by sequentially comparing those models’ behavior (error) on the basis of each training case in some (typically random) order.

This disaggregation might make little sense in a setting where, for example, you were optimizing the parameters of a single model by gradient descent. It’s for use in population-based metaheuristics, like genetic programming (or genetic algorithms). While there may be reasons to consider lexicase selection in the context of one-off statistical comparisons between models, the inherent randomness of lexicase selection methods helps mainly when we are sampling many times from the set of permutations over the training cases.

Rather than immediately summing the deviations of our model \(f_m\) over every training case \((x_i,y_i)\), let’s pause just after evaluating all these deviations, but before summing them all together as we would in an SSE calculations.

Say there are three models and four test cases, and here are their measured outputs and observed outputs as well:

PREDICTIONS:

          case_1     case_2     case_3     case_4  
  x_i:     0.3        22.1       -3         111.2
  y_i:     0.4        -17         2          17.9
        -----------------------------------------
model_1    1.4        -12        99          13
model_2     -9         10         8          18.8
model_3    812        -22       -52          17

In the predictions table, I’m showing each model’s prediction for the output \(f_m(x_i)\), given each observed input \(x_i\), for three different models I’ve just made up. In the top headers I’ve shown the training cases themselves for reference. Now let me calculate the errors for each prediction. But instead of using the square of the difference, I’ll use the absolute value, because why not?

ABSOLUTE ERRORS:

          case_1     case_2     case_3     case_4  
  x_i:     0.3        22.1       -3         111.2
  y_i:     0.4        -17         2          17.9      SSE
        -----------------------------------------------------
model_1     1.0       5.0       97.0          4.9     9459.01
model_2     9.4      27.0        6.0          0.9      854.17
model_3   811.6       5.0       54.0          0.9   661636.37

In the absolute errors table, I’ve gone ahead and taken the absolute value of the difference between the predictions \(\lvert f_m(x_i)-y_i \rvert\). We could use the squared differences if we want, but it gains us nothing at this point and this is a bit easier to see to be honest.

I’ve also added a new column at the right, containing the traditional summed squared error for each model over all four training cases. None of these three models seems very good, as you can see, on that basis of summed squared errors. But if you were looking for the “best” classical least summed-squared-errors model from this collection, it’d be model_2.

The fundamental goal of lexicase selection algorithms is to select a diverse collection of models, each of which may be better at some subset of tasks, in order to foster partial solutions in modular problems. The keen insight for lexicase selection is that the training data itself contains information about the modular structure of the problem: each training case may be, potentially, a good example for learning a qualitatively different “task” from what’s shown in the others.¹

This is of course a simple example, but it’s good enough to explain the loop we undertake for every lexicase selection event:

Start with the complete table of all models and their case-wise scores. That’s the second table I’ve shown above, if we’re using absolute error as score.
Permute the columns of interest (once, at the beginning of the selection process). In this example, I would pick some random permutation of the set {case_1, case_2, case_3, case_4}.
Select and retain only the subset of rows which meet or exceed some cutoff score (best or above-median, for example) in the column that appears first in your random permutation.
Recursively, which is to say without replacement, select and retain only the subset of rows remaining in what you selected last time, which meet or exceeds the appropriate cutoff applied to the next remaining column in your initial permutation.
Quit when you’ve reduced the set of models for each item in the permuted set of cases (or when your set of models contains only one).

If you run out of training cases (columns or permutation items) before reducing the table to only one row, you’re on your own. Which is to say: probably you can pick one of the remaining rows at random and be fine. But in general it’s cleaner to think of “lexicase selection” as just this recursive reduction of the rows in the table, and treat cases where multiple rows remain after a full series of recursive reductions as “ties”.

Let’s walk through the example.

Say I first randomly select the permutation [case_2, case_1, case_4, case_3], and also I’ve decided to select models with the best (lowest) absolute error.

I start with the full set of models: {model_1, model_2, model_3}.

In the first stage, I will remove all models not best-scoring on case_2. Their scores are, respectively, [5.0, 27.0, 5.0], so I remove model_2 and retain {model_1, model_3}.

Then, recursively, I remove all models from this subset not best-scoring on case_1, which is next in my permutation. The two models’ scores on that column are [1.0, 811.6], and so when I remove the not-best ones I am left with {model_1}.

At this point, because there’s only one model remaining, I could quit. But in situations where many scores are identical, or where for example I am using “better than median error” rather than “best error” as a selection criterion in each stage, I might have multiple models all the way to the end of the permutation.

But for completeness, let me continue. I have the reduced subset {model_1} now, and so I remove all the not-best-scoring models from that, based on case_4 this time. Of course I keep {model_1}, as you’d expect, since it is in fact best-scoring in its field of one.

And finally, given {model_1}, I throw away the not-best-scoring on the basis of case_3. Hey look, {model_1} is still there.

So I’ve selected model_1 by lexicase selection, given this randomly-chosen permutation of columns.

Every time I apply lexicase selection to a table like this, I begin with a new randomly-chosen permutation. There are \(4!=24\) permutations of four columns, but as you can imagine for any sufficiently interesting set of training data, there could be hundreds (or millions) of “columns” here, so it pays to treat each application of lexicase selection as using a random permutation, because combinatorics makes these numbers very large, very quickly.

But in this case, I could work out the lexicase winner for every permutation:

            1         2          3            4
model_1     1.0       5.0       97.0          4.9
model_2     9.4      27.0        6.0          0.9
model_3   811.6       5.0       54.0          0.9


order  winner    stage
-----  ------    -----
 model_1   1
 model_1   1
 model_1   1
 model_1   1
 model_1   1
 model_1   1
 model_1   2
 model_1   2
 model_3   2
 model_3   2
 model_3   2
 model_3   2
 model_2   1
 model_2   1
 model_2   1
 model_2   1
 model_2   1
 model_2   1
 model_2   2
 model_2   2
 model_3   2
 model_3   2
 model_2   2
 model_2   2

I’ve added a column here labeled stage, where I’ve made a note of where in the course of the four permutated cases we first reduce the set of models to one. In this particular example, and using the “keep best-scoring” criterion, it turns out that half the time you’ll need to do two comparisons, and the rest of the time the “winner” is obvious in the first step. But if there were more ties in the score table, or there were more cases, or if the criterion I was using was “softer”, then the number of recursive steps needed to winnow the table down to one item would probably be larger.

In many real-world cases, for example where most models are all very similar and we’re “tuning” some details regarding only a few test cases, the number of cases needed to reduce the collection to a single model may actually be very large. The same thing might happen—recalling that this is not simply numerical regression but open-ended genetic programming—if some of these errors are measured for boolean classifications, or the distances between discrete numeric values.

Similarly, for this example and selection criterion, over all the permutations possible you’d select model_1 eight times, model_2 ten times, and model_3 six times. You might be tempted to say “well then, model_2 wins!” but that would be a bad idea. Remember: the point here is to sample from this distribution many times, stochastically, because we’re attempting to build a population of potentially somewhat-better models which we will subsequently improve or modify by crossover or mutation or something fancier than those. Read this is a probability density distribution over models, not a “score”.

One more caveat for the statisticians in the audience: Suppose we were applying lexicase selection to a population of a thousand models, and each of those were scored on four cases as we have here. You might be tempted (so many of us are) to model these 4000 scores (four scores for each model) as i.i.d. random variables, for the sake of tractability. But realize that these numbers in practice cannot be i.i.d., because the models at any particular generation are themselves the products of earlier rounds of lexicase selection and subsequent search operators like crossover and mutation. As an evolving population is selected over time to be “better” than earlier generations, these numbers will converge and therefore become correlated in interesting and complicated ways you can’t model with a simple mixing model. So don’t do that.

But that caveat points out an important point: For reasonable, practical population sizes in genetic programming (and probably other metaheuristics as well), you’re going to be running this algorithm—pick a permutation and winnow the whole table down to a few rows—a whole pantload of times. If there is a population of 1000 individuals, and you want to select 2000 parents for crossover, and you want to see what happens after a hundred generations, you’ll be running this algorithm something like 200000 times. And there aren’t, as far as I know, any obvious shortcuts for you to take.

It’s computationally burdensome. Assuming we know what we’re talking about in saying we want to do it anyway, the question arises: Where should the code live that bears this burden?

Who will do the work?

Now then. That’s the question that I’ve been musing about lately.

Many of my colleagues in the field of genetic programming are comfortable working without a data store like a database. Indeed, I’ve often wondered whether the common notion in GP of a “population” is a design decision that’s fundamentally driven by the long-standing habit in Computer Science of avoiding persistent data stores because they’re hard and slow.

But of course, modern database servers aren’t slow any more. And modern SQL servers have many decades of optimizations and cunning pre-compilation stuff cooked in, so that when you present one with a stupid SQL query they can intervene and shuffle things around internally to improve on time and spatial computational complexity stuff you literally can’t fathom. There’s a lot going on inside PostgreSQL, for example, and at this point you should probably be willing to believe these people know what they’re doing.

So it makes sense, now, to wonder: Given that I am storing my GP population and scores in a table like the ones I’ve shown here, in a SQL-driven database, and given that modern SQL servers have some awesome smart algorithms inside them, and given the frustrating brute-force alternative of storing these numbers in the table and then reading the whole damned table back into memory to “run lexicase selection” on them… maybe the server itself could do this?

For example, recursive SELECT functions are part of every modern database server (including SQLite!). That whole thing in the lexicase algorithm about “reduce the table recursively by considering columns in some order” sure sounds like it would be feasible. It may not be efficient, especially if the order of columns changes in each application of the recursive SELECT statement… but then again, what do I know?

Narrator: “He doesn’t really know much.”

And as I work on a scalable, distributed cloud-based GP system, I find myself drawing a little circle-of-responsibility around this “selection” role a lot of the time. So there wants to be an agent or process or daemon or lambda-function out there whose sole responsibility when activated is to look at the table and do the lexicase.

For the moment, disregard the likelihood that the table of models and scores may be updated frequently and asynchronously. And disregard my implication above that sometimes you want to use more complicated selection criteria like “better than median” instead of “best score”. Hell, to make things even easier, don’t even worry (for now) about how one might randomize the order of columns: suppose we’re given a fixed order over a set of maybe a dozen columns.

How would one write a SQL query to do lexicase selection, as I’ve sketched it above?

I’m thinking maybe a recursive CTE of some sort, possibly with INTERSECT. Maybe. But I don’t know that much about SQL.

This may be too hard. It may be too much work for a SQL system. It certainly seems like the kind of thing Apache Spark would be good at, and so if it is too hard for SQL then I know something about architecture I didn’t know before, and now is a good time to decide these things.

Or maybe it’s just better to do this in memory, as long as one avoids re-loading and re-writing big blocks of stuff. But when I say it that way, I keep coming back to the suspicion that SQL database people have spent decades making their servers smart enough to avoid re-loading and re-writing big blocks of stuff. And so inevitably I wonder whether that work is already done, at least in part.

Or—and this is a real possibility—maybe there is an equivalent representation of the models and cases and scores that would make this easier for SQL but harder to explain. For example, most of my recent implementations of distributed GP systems don’t use this

         case_1    case_2     case_3       case_4
model_1     1.0       5.0       97.0          4.9
model_2     9.4      27.0        6.0          0.9
model_3   811.6       5.0       54.0          0.9

but rather use this

model  case  score
    1     1.0
    2     5.0
    1     9.4
    2     5.0
    4     0.9
...

That structure, with its foreign key index over case and model, makes my other code simpler, and certainly makes it faster to write to my databases. But how do I know? Maybe there’s some smart way to load that big vertical pile in and do some fancy temporary-column stuff to cobble together derived values that would be useful for a lexicase selection algorithm to use without re-wrapping the table.

I don’t know! And that’s interesting. So if you have any ideas, pass them along.

The most obvious case of a “modular” problem in this sense Lee means would be, for example, something like \(f(x)=(x \text{ if } x > 3; sin(x) \text{ if } x \leq 3)\), where observations in one part of the range \(x\) give you useful information about one “task”, while in the other region the other “task” holds sway. ↩