Draft

What does a program do? (part 1)

Draft of 2019.02.09

May include: GP&c.

My premise in this piece is that the paradigmatic system we imagine when we talk about a “computer program” involves quite a few assumptions and constraints. And, it follows: Hey maybe if we pay attention to those features and how we refer to them, some interesting things may happen.

Whether these assumptions and constraints arose from von Neumann’s original architectural decisions, when he sketched the architecture used by all modern computers, or they are reflections of deeper assumptions we make about the nature of computational algorithms and mathematical functions, or they’re lodged somewhere down deep in our collective tacit philosophy of modeling in general, I just don’t know.

Wherever the weirdness comes from, for several years now I’ve thought that asking “What does a program do?” might be a good project for some professional historian, philosopher or sociologist of science to look into. Because we seem to talk around most of the actual things programs “do” in most of our work as… you know, programmers.

For now, lacking such a smart historical and philosophical approach to the question, I’m gonna just crawl down in there and root around with a flashlight for a while.

An invocation

A few days ago, my wife Barbara and I spent some entertaining minutes watching this old AT&T video that explains—to a contemporary, early 1980s audience—what Unix is like and how it changed the way computer programming was being done and used at Bell Labs.

“UNIX: Making Computers Easier To Use”

(AT&T Archives film from 1982, Bell Laboratories)

If you watch it, you’ll come across a bit where Dennis Ritchie describes how he “writes” a spell-checker using command-line Unix utilities strung together. In the context of the 1980s, this was probably miraculous-seeming. Based on the comments on reddit where Barbara found the video, I’d say that many modern folks who “program” still think of this approach as remarkable.

One of the steps of Ritchie’s spell-check “program” is the invocation of a command-line script he calls makewords, which (I’m inferring from the video description here, and the fact that it’s not a standard Unix utility) consumes a string as its argument, and produces a new string in which the “words” are each on their own line.

That is, it replaces whitespace in the original string with newlines. Stick a pin in that description, because we’ll come back to it: “replacing space with newlines” has an interesting history in the field of GP research, where it’s one of Tom Helmuth’s benchmarks for program synthesis. Let me call it “RSWN” instead of makewords here, like Tom does in his work.

Anyway, and since this AT&T video is fresh in my mind, I am moved to write a little function or “program” or “utility” of my own that does the same thing: It replaces all the spaces with newlines in a given argument string. I can do it in Ruby most easily at the moment; maybe afterwards I will also do it in Clojure or Haskell or R or some other language, just to see what generalizes.

What does RSWN do?

First, recall that a lot of what was custom-written in the 1980s is probably built into any modern language, which will have all those handy libraries we take for granted. There are far more high-level string-handling functions available in Ruby than I expect Ritchie had in his C (and sh) setup.

So you won’t be surprised when I say that I can see a few different ways to make an RSWN program. Some are ridiculous and ornate and make me think of Carl Sagan talking about apple pie, while others are simpler and more elegant.

One “elegant”—or perhaps “very parsimonious”, and maybe “glib”—approach in Ruby is simply to invoke the built-in methods that are intended to do just this sort of thing.

The String#gsub method replaces every occurrence of a matched regular expression with some other string, and so I can just “do that” and write a Ruby method that jams its argument into the appropriate built-in function and returns the result:

def replace_space_with_newline(arg)
  return arg.gsub(/\s+/,"\n")
end

It’s not much to look at. But this method does exactly what one should expect from RSWN, as long as one is comfortable reading regular expressions like “/\s+/”. That is, it greedily replaces every substring composed of one or more “whitespace” characters in the argument with a single newline character.

Let me exercise it a bit, just to see that it does what I expect.

Using the irb Ruby interpreter in my shell, we see below that when I invoke it, it replaces a big long series of eight characters of “whitespace chaff” (spaces, tabs, newlines) between the characters "a" and "b" in the argument with a single newline, so that "  \t\t\n\t  " becomes one "\n".

irb(main):001:0> def replace_space_with_newline(arg)
irb(main):002:1>   return arg.gsub(/\s+/,"\n")
irb(main):003:1> end
=> :replace_space_with_newline
irb(main):004:0> replace_space_with_newline("hi there how are you")
=> "hi\nthere\nhow\nare\nyou"
irb(main):005:0> replace_space_with_newline("a  \t\t\n\t  b\nc")
=> "a\nb\nc"

Well, that was easy.

But this version feels like I’m passing the buck. I haven’t “written a program” that replaces whitespace with newlines, I’ve invoked a “program” (instance method, but whatevs) that somebody else wrote—and which also does something far more general and complicated than I really wanted—to do this one task.

In almost all modern languages, this kind of library will be available for a huge swathe of tasks that used to be (and which are honestly still) hard or complicated: opening ports, rendering images, reading from and writing to databases. Just calling this library function isn’t helping me understand much about what an RSWN program “does”, since it’s not actually doing anything beyond speaking as a middleman between me and the core libraries it invokes.

Show your work

Let me try again, and this time I will lean towards explicitness.

Here’s a new version of RSWN. This one explicitly iterates over all the characters in the string, without invoking the regular expression library, and as it goes it copies the characters over into an output buffer string. Along the way, it uses some extra-seeming conditional logic to detect “runs” of whitespace, which it “skips” (doesn’t copy):

def rswn2(arg)
  result = ""
  ws = false
  arg.each_char do |char|
    ws = [' ', "\t", "\n"].include?(char)
    result << (ws ? "\n" : char) unless (ws && result[-1] == "\n")
  end
  return result
end

test_string = "hi there \n\n\n\n   \t\t  boyo"
puts test_string.inspect
puts rswn2(test_string).inspect

Notice that (for once) in this essay I won’t be talking about how I came to obtain this code. Let me just present it.

Here I am exercising rswn2:

irb(main):001:0> def rswn2(arg)
irb(main):002:1>   result = ""
irb(main):003:1>   ws = false
irb(main):004:1>   arg.each_char do |char|
irb(main):005:2*     ws = [' ', "\t", "\n"].include?(char)
irb(main):006:2>     result << (ws ? "\n" : char) unless (ws && result[-1] == "\n")
irb(main):007:2>   end
irb(main):008:1>   return result
irb(main):009:1> end
=> :rswn2
irb(main):010:0> test_string = "hi there \n\n\n\n   \t\t  boyo"
=> "hi there \n\n\n\n   \t\t  boyo"
irb(main):011:0> puts test_string.inspect
"hi there \n\n\n\n   \t\t  boyo"
=> nil
irb(main):012:0> puts rswn2(test_string).inspect
"hi\nthere\nboyo"
=> nil

Now that looks like a “real computer program”, right? There’s a lot more punctuation than I used before, and also a good deal of stuff seems to, you know, be doing something. All kinds of gibberish like “<<” and “(ws ? "\n" : char)” all over the place.

And before somebody comes along and tells me how “bad” this code is: Yes, of course. I know. It’s not refactored, I didn’t test it, it’s just a blob of almost-obfuscatory complexity with loads of edge-cases that are mainly unhandled, and it does something that gsub already does well.

Recall that we’re here because (1) I want to think about what is happening inside an algorithm, and (2) I am going to end up talking about working code that no human being ever wrote. So relax, Strawman. Anyway…. Oh, right.

But this is important: Please also notice I have not reached a point where “all” of what this Ruby method “does” is contained within the bounds of the method’s source code. Disregarding the business of what a “string” is, and also ignoring all of the extensive Ruby libraries that handle the String class, and objects in general, even in this code I am incidentally doing stuff that you might otherwise miss:

  1. constructing an Enumerator over the characters in the argument string;
  2. iterating over those characters in a loop;
  3. using the Enumerable#include? method to determine whether a character appears anywhere in the specified array of “space” things;
  4. doing all this messy business: result << (ws ? "\n" : char) unless (ws && result[-1] == "\n")
    1. is ws true?
    2. is the last character of the partially-constructed result "\n"?
    3. if both of those are true, skip this character entirely; otherwise, append either a newline or the character to the result, depending (again!) on whether ws is true

I’ve intentionally compressed a lot into that single blob of

result << (ws ? "\n" : char) unless (ws && result[-1] == "\n")

for rhetorical reasons. I could spend some time talking about how to break it down into more “traditional” conditional logic, but if I did we’d end up talking all day and making diagrams with straws. So let me draw a line in the sand and say this code “does” what I asked it to. In a useful sense it “does” it in exactly the same way my first example did. It “takes” a specified string argument and it “returns” a new string that differs from it in the intended way.

But realize that I could repeat this “explicitization” process to produce an even more convoluted rswn3, then rswn4, and so on. This code, in rswn2, is also doing a lot of other stuff using the libraries and core language syntax that it invokes.

There are implicit variables somehow being used by the Enumerator to know when we’ve reached the last character of the argument string. There are ephemeral calculations being done whenever we evaluate a complex predicate like “(ws && result[-1] == "\n")”, not just because of “&&” but because “result[-1]” involves some kind of examination of the result data structure. That convenient reference to result[-1] means that something (in the interpreter) somewhere is “aware of” or “discovers” the length of the collection called result, and in doing so it accesses that specific element of the collection at the end (which it has found along the way), when I ask.

If I kept down this path, I might eventually end up having to define core language concepts explicitly, right? How do we check if two characters are identical? How do we address an item in an array? How do we ensure that an iterator/enumerator progresses through all items in a collection but not past the end? At some point, I would be writing a good part of some new language interpreter in my code itself, and even farther down the road I might be invoking hardware-level opcodes.

It seems to me that one of the important things a program does is the part where it hides all that complexity from me. This isn’t surprising. It’s just worth remembering for a minute. In this little snippet of straightforward language like Ruby, a program that “does” something like this task is also “doing” a lot of other work along the way. I don’t only mean this in terms of traditional measures of computational complexity, either.

Lots of stuff is getting written to and read from memory, all over the place, in the course of executing my code. It’s cleaned up afterwards, all this “scratch pad” stuff, but it is work “done” when the code runs.

But there is still one awkward question I need to ask, which we have ignored so far in all this talk about what happens in memory before a method is “done”: What does “return” do?

Give me that

As programmers we’ve become used to speaking of “messages being sent”. At least among the object-oriented crowd, we might even be imagining that an actual “packet” containing the message is moving from the “sender” to the “recipient”. When we know a bit about how interpreters and compilers work we realize that there are probably “addresses” involved, but is it happening with some kind of network socket/client thing? Or is it more like that old radio tower logo in the RKO Radio Pictures splash screen?

But I think we both know that “return” is not literally “sending” some kind of self-contained package through space or memory.

A returned value (whatever that is) is something that has been “set up” by the algorithm’s other dynamics by the time an explicit or implied return gets called. Whatever is done, it directs the attention of some listening (“calling”) process to that specific data structure, located in a very particular location in memory, where the agreed-upon value has been built to specification. Some makewords or rswn has been told where to look for its argument string, and when it’s done working on the project it’s “for”, it “returns” a new location where a “new” string exists.

This can seem a lot like a magic trick, really. Depending on how the function has been written, a lot happens along the way. Ephemeral structures are constructed, new “versions” and representations of necessary things like counters and comparators and enumerators are built and left sitting around (in some sense). But then it all “comes together” somehow to point to the result: this one perfect thing among the pile of other stuff.

Seriously. Envision a stage magician who is producing a rose from—apparently—nowhere. She is a living human being, and so she’s breathing and grinding up ATP and shuffling calcium and potassium around in her head and slowly burning sugar down into carbon dioxide, all that stuff we do. She’s digesting food and moving oxygen around, her circulatory and immune systems are perfusing her body with all kinds of complicated cells and compounds. She’s also developing a cold, and she has a splitting headache, and oh she’s standing on her mark on the stage and not falling over in a heap, she’s remembering (somehow) the right way to be moving her arms, and she’s wiggling her eyebrows and controlling hundreds of facial muscles because she’s got a mental model of the audience watching her that implies that if she smiles now she can quickly flip the thing in her arm or wherever and then move her arm like this and the rose that was over there a moment ago has simply moved a few centimeters but she is moving it so that now you can see it.

She presents the rose, and in that moment the audience looks only at the subset of the molecules in the room which she has indicated. And only then do they see the rose. They don’t see the bacteria in her gut, or the color on her fingernails, the calcium and magnesium her neurons and muscles have moved around in her tissues, or the shimmer of stage light on the black of her tuxedo coat. The magician and all she has “done” is at that moment almost entirely gone, and instead the rose has been returned by the multitude of other processes she embodies.

My Ruby script rswn2 does much the same kind of thing. It isn’t salient which Ruby release is being used, or what operating system it’s running on. It isn’t salient whether this is a 16- or 32- or 64-bit computer, or whether there is Unicode in the argument. It isn’t salient in the moment whether I have completely rewritten an entire Forth interpreter in the method I’ve written, and nobody would care if what was “really happening” was that a Forth program running in that embedded interpreter is actually doing “the work”, and that it in turn is running on a Ruby interpreter, which in turn is compiled C or Java or something, which in turn is running bytecode which manipulates native assembly language on a chip which I can’t even begin to understand, which itself has hard-coded optimization algorithms running….

Yet we are all quite comfortable saying that my Ruby method “returns the correct answer”. When it’s done, it has indicated an ontologically-sound data structure in memory that matches our agreed-upon expectations for “the return string”.

Code found and made

You know already that I’m here to talk about code that people didn’t write or plan, except for deciding what the appropriate “return” values are for a given “argument”. It’s pleasant and informative for us to look at code that people did write, whether they are Dennis Ritchie inventing the core of modern computing, or poor old me just slapping some Ruby together. It’s how we learn.

But my overarching goal in all this work is to think about how we can “write” a piece of “code” that (to use a handy example) replaces the spaces in a string with newlines, when we aren’t allowed to compose that code ourselves.

I’ve found that it’s an informative project to ask about all that other stuff that a program constructs along the way, when any code is run. What are the things that have been built in memory, that will soon be garbage-collected or deallocated? Because when we are “searching” for code, we are inevitably starting from some initial guesses and working “toward” our explicit goal (scare quotes intentional). Those first—and probably most later—guesses will not behave in anything like the desired way.

As I’ve written elsewhere, I think maybe we have adopted too many of the habits we use in our code-writing world, as we try to move over into a world of code-finding. In the rest of this piece, I want to explore some troubling variants of some basic and tacit Computer Science and mathematical definitions of “functions” and “algorithms”, which I think it will be worth noticing.

Background (a targeted aside)

As you know, some people I work with seem to think that we can use algorithms inspired by biological evolution to “automatically” construct working software. The field is called Genetic Programming, but over the last decade or so I’ve been leaning forcefully away from the badly-drawn metaphors that name implies. I try to use “Generative Programming” for the same stuff. But we can call both of them—the “evolutionary” one that so many of my colleagues imagine they’re doing, and also the slightly different one that I think I’m doing—by the same name: “GP”.

All flavors of GP rely on the same underlying philosophical framework:

  1. We make a bunch of “random” (but syntactically correct) code that does stuff.
  2. We look at what all that code did, given certain training cases as “arguments”, and then compare the observed and expected behavior to calculate a “score” of some sort for each chunk of random code. Better scores mean, somehow, “closer to what we want in the end”.
  3. We take some of the better-scoring chunks of code, and use their structures (and the structure of the bad ones, too, if we want) to randomly—but with a useful bias—produce new and hopefully better code chunks.
  4. Everybody go to 2.
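
Spelled out as code, that loop is almost embarrassingly short. Here’s a minimal Ruby sketch of it; random_program, run, compare, and vary are placeholders standing in for whatever representation, scoring, and variation operators a particular GP system actually uses:

# A minimal sketch of the four-step loop above. Every helper here
# (random_program, run, compare, vary) is a placeholder, not a real library call.
def gp_loop(training_cases, generations: 100, population_size: 500)
  population = Array.new(population_size) { random_program }   # 1. random but syntactically valid code
  generations.times do
    scored = population.map do |program|
      score = training_cases.sum do |args, expected|
        compare(run(program, args), expected)                  # 2. observed vs. expected behavior
      end
      [program, score]                                         #    lower score = "closer to what we want"
    end
    parents = scored.sort_by { |_, score| score }.first(population_size / 2)
    population = Array.new(population_size) do
      vary(parents.sample.first, parents.sample.first)         # 3. new code, biased toward better-scoring chunks
    end
  end                                                          # 4. everybody go to 2
  population
end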

A lot of folks tell me that this sounds “just like” most other Machine Learning algorithms. But no. It rhymes in a lot of ways, but the key difference lives in that word “code”.

In almost all contemporary Machine Learning projects one is looking for a good set of parameters: Given some fixed-shape data structure—for example, a bunch of matrices or tensors of a given size and shape—in an ML project we try to “learn” the full set of numbers to fill into the “blanks” in those matrices. This is not a trivial process at all, but realize that in almost every case a human being is deciding beforehand on the “form” of the answer, and the process of learning is that of filling in the correct constants within that form.

Think here of building a deep neural network and training it. First a human person draws out a series of big rectangular matrix shapes and sets some hyperparameters a priori, and only then the ML algorithm is invoked. It makes some guesses for all of the constant values in that large equation, then it compares the function return values1 generated by those guesses (in the context of the matrix form) against the training data, and then it makes adjustments of the constants—in a very smart way—in order to reduce the difference between the observed return values and the desired ones.

But in most ML projects, the “thing itself” is not under consideration by the search algorithm. The parameters are; the structure into which the parameters “fit” is not.
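
To make that concrete with a deliberately tiny toy of my own: suppose the human-chosen “form” is just \(y = ax + b\). Everything the “learning” ever touches is the pair of constants; the form itself is never up for grabs.

# A toy sketch of parameter-fitting-within-a-fixed-form: plain gradient descent
# on y = a*x + b. The structure never changes; only the constants a and b do.
def fit_line(points, steps: 10_000, rate: 0.001)
  a = b = 0.0
  steps.times do
    grad_a = points.sum { |x, y| 2.0 * ((a * x + b) - y) * x } / points.size
    grad_b = points.sum { |x, y| 2.0 * ((a * x + b) - y) }     / points.size
    a -= rate * grad_a
    b -= rate * grad_b
  end
  [a, b]
end

fit_line([[0, 1], [1, 3], [2, 5]])  #=> approximately [2.0, 1.0]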


One more familiar example, in more detail, for comparison.

Suppose the “input” to a neural network is a picture, and the desired “output” is a collection of English words describing objects that appear in that picture. A human designer has already established certain rules, and done certain mathematical transforms “upstream” of the network to produce a (large) set of numerical input values rather than something closer to a “picture file”. Those numbers (not the “picture file”) are fed into a series of matrix math and nonlinear functions, producing (for example) a vector of exactly 1000 numerical values, which represent the weight or “likelihood” or “confidence” of the network that each of a set of 1000 prospective vocabulary words “appears” in the “picture file”.

The “learning work” done by this ML project completely ignores all of that conversion of the weird binary picture files into suitable numbers, probably using a library2 somewhere off-stage. It also ignores all the “ontological work” of producing a mutually agreed-upon “return value”. That is done downstream, when the vector of return value3 term weights is converted back into a collection of high-confidence “words”.

The crucial difference between “traditional” ML and GP lies in that separation of concerns.

In GP projects, we don’t do as much of that work ahead of time. We search over the set of frameworks themselves, usually using some higher-order representational language. But also at the same time we try to find the best constant parametric values to fill those structures in with.

Let me work out a (mostly imaginary) GP analogy to the image-tagging neural network project I outlined above.

Here’s the sort of thing a GP project might do:

  • The training cases would be actual image data structures, not pre-converted large matrices of numerical values.
  • The search itself would produce functions that in turn invoke a broad collection of standard image-processing library calls. Some of these would accept arguments that are images or numbers or colors or kernel-defining structures, and would produce some combination of those as their results. If we throw in a pantload of mathematical, string-handling, control structure and other abstract data structures, we can say that the GP search is looking for algorithms that are written in that (domain-specific) language we’ve built “upstream”.
  • When we execute random “guesses” in this domain-specific language, passing in the training images as “arguments”, we will (if we’re very smart) get a collection of words directly out the other end.4
  • Then, just as in ML, we can assign scores to our various random bits of code. The code that generates the most and best words gets the better score, and (in a smart way) we stochastically combine parts of the better-scoring code fragments to form new candidate code, which we evaluate again, and so on.

The distinction I’m trying to emphasize here is between (ML) looking for the right numbers, and (GP) looking for the right numbers and the round-trip functions handling input and output and transformations applied to them.

We are asking a GP process to develop (or “discover” or “construct”) code and data structures we are more likely to understand, as humans, by using programming concepts that are something more like those we might use ourselves if we wrote that same code. That’s different from ML projects, where we are asking the process to find the right values to use in a given structure.

But—and a lot of people say this—what makes us think we even want to “ask code” to “write code”? People write code all the time, and then they use it and read it and if there are problems or shortcomings then the folks who wrote it can fix those. Sometimes these people are smart enough to even invent new kinds of code. Why try to automate that?

If we can get this process to produce running code that does what we want, that’s awesome! But it seems unlikely, and it’s not ever going to be very efficient. But! Even if we never end up with automatically-generated code worth running, the process of looking can (might) give us some new ideas.

Because the “generative” (or “genetic”) process is not subject to our own habits and biases. It works blind. It works like an alien. It only ever writes somebody else’s code. And in the few places where we want to read somebody else’s code—for example, in the places where nobody has ever written any successful good code—that’s where GP has a use.


Even the code we write ourselves does a lot of stuff besides building the return value. We throw all that other stuff away afterwards, like the magician who ignores her headache, or the audience who ignores her nail polish, but it happens anyway. That’s the fundamental challenge that GP has, which is not present in ML. In an ML project, you know what to do upstream and downstream, and you also know all the transformations along the way. In GP, you are asking multiple overlapping “tasks” to simultaneously be completed, concurrently, in tandem, all over the place, with no plan: noticing patterns in data, discovering features, constructing general algorithms out of arbitrary components… all while handling edge cases and constraints.

How do you represent a “partial solution” to a problem that may not be numerical at all?

Think again about the RSWN problem. A string goes in, a string comes out. All kinds of other things probably have to happen in between: The features we’re concerned with need to be identified. The completeness of the processing needs to be established. Side-effects need to be avoided.

If I start with a “random” piece of code, and (assuming it runs at all) look at all that it does, how do I know a priori which bit is the one I should think of as “the return value”? Which parts are “scratch pad” that should be thrown away?

What is the figure, what is the ground?

Representations

The history of GP research is punctuated by a series of scholarly battles about (and maybe between) syntactic consistency and type safety. The foundational work in GP (from Koza) limited the earliest projects to strictly arithmetic S-expressions. These are not Turing complete, but they always halt, and they always return an “answer”.

There’s a mind-blowing innovation hidden in there. By focusing on “programs” that look like (+ (* x 9) (- x (* x 2))), and by focusing on “cut-and-swap” recombination operators that exchange syntactically valid sub-trees, Koza made a lot of headway towards more general “automated programming”.

I want to use these arithmetic expressions, and things very like them, for an example later. So let’s go over how they work.


The simplest expressions are just naked literals, I suppose. Something like “99” is a perfectly valid “expression” for a function of some variable x. We can interpret it as \(y=99\). Similarly, “x” means \(y=x\), also a perfectly good function.

Syntactically, we wrap anything more complicated in a pair of parentheses, and write it function-first. For example, to represent \(y=(x+9)\) we would write “(+ x 9)”. Similarly, multiplication and subtraction and division work the same way: “(* x 9)”, “(- x 9)” and “(/ x 9)”. This prefix form is just like that of the Lisp languages.

Any “naked” term (a constant or an input argument) is an expression in itself. Any of these parenthesized functions where both arguments are also expressions is also an expression. In other words, we can nest them, and build “trees” that way. We can build (+ (* x 9) (- x (* x 2))) by ensuring that every +, * and - has exactly two expressions as its arguments, and that every expression is itself complete (a naked term, or a complete sub-tree).

As long as those rules are obeyed, the resulting trees always represent valid arithmetic expressions. There’s something slightly scary in cases where we might end up dividing by zero, but those are runtime problems and let’s ignore them for now.

And as long as we respect those rules, we can fiddle around with them too. Here are some examples where I’ve taken a relatively simple expression in a Lisp-like prefix form like Koza used, and deleted-and-replaced an arbitrary but semantically complete expression with the constant 999. I’ve broken this into two stages, for clarity; first I replace a chunk with some bullets, and then I replace the bullets with 999.

;; inserting 999

(+ (* x 9) (- x (* x 2)))
(+ ••••••• (- x (* x 2)))
(+ 999     (- x (* x 2)))

(+ (* x 9)   (- x (* x 2)))
(+ (* x •)   (- x (* x 2)))
(+ (* x 999) (- x (* x 2)))

(+ (* x 9) (- x (* x 2)))
(+ (* x 9) •••••••••••••)
(+ (* x 9)           999)

(+ (* x 9) (- x (* x 2)))
(+ (* x 9) (- x •••••••))
(+ (* x 9) (- x     999))

... and so on

You can see that the resulting expressions, with 999 in them, are still perfectly good arithmetic “code”.
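
If I wanted to automate that fiddling, a rough Ruby sketch might look like the one below, with nested arrays standing in for the S-expressions (position 0 holds the function, the rest are its arguments). This is just my own illustration of the move, not anybody’s actual recombination operator:

# A rough sketch of the delete-and-replace move: walk down the tree at random
# and swap one complete sub-expression for the constant 999.
def replace_random_subtree(expr, replacement = 999)
  return replacement unless expr.is_a?(Array)          # a naked term just gets replaced outright
  slot = rand(1...expr.size)                           # pick one of the argument positions
  copy = expr.dup
  copy[slot] =
    if rand < 0.5 || !copy[slot].is_a?(Array)
      replacement                                      # replace this whole sub-expression...
    else
      replace_random_subtree(copy[slot], replacement)  # ...or keep walking deeper
    end
  copy
end

replace_random_subtree([:+, [:*, :x, 9], [:-, :x, [:*, :x, 2]]])
#=> for example, [:+, [:*, :x, 999], [:-, :x, [:*, :x, 2]]]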

Koza’s innovation was that all of these functions are “related” to one another in a reasonable and interesting way. From this seed grew the powerful sub-field of GP called Symbolic Regression, which focuses on just this sort of mathematical equation for data modeling. It’s an extremely powerful technique. But it’s not “computer programming” in the sense people had already come to expect by the 1980s.

And so, very early on as GP grew as a field, new representation schemes were designed. Of course folks threw in all the necessities one imagines one needs to write “real” code: control structures, looping and recursion, subroutines, that standard core library of types, sometimes even higher-order functions. The dream has been, of course, to “evolve” code that doesn’t just return numbers. Modern GP representation schemes like Cartesian GP and Push have extended the reach of GP’s guess-and-refine metaheuristics to produce code that’s realistic-feeling in quite a few different ways.

But these “modern” schemes that contain all those extra bits are now Turing-complete languages. We are working now in a world that bumps against the Halting Problem.

A question is begged: What good is a program that doesn’t halt?

Well. It depends on what else it does, doesn’t it?

Back to our roots: arithmetic

Let me work a bit more with that Lisp-like arithmetic scheme, despite the fact that most modern GP representation schemes are structurally quite different.

It’s a tree of expressions. Naked “leaf” expressions like variables and numerical constants, and nested “function” expressions like addition, subtraction, multiplication, division, sine, cosine, square root, all that junk you are used to in basic mathematics.

In this representation scheme, we can express all kinds of lovely ornate functional equations. For instance, I can write

(- (+ x 2) (* (* x 2) (+ 3 x)))

which “means” this:

\[\begin{align*} f_1(x) &= (x+2)-2x\times (x+3) \\ &= -2x^2 - 5x + 2 \end{align*}\]

Or I can write

(* (sin x) (/ x (+ 9 (cos x))))

which “means” this:

\[f_2(x) = \frac{x \times \sin(x)}{9 + \cos(x)}\]

We’re comfortable with this, right?


I’ve shown what they “mean” in mathematical notation, but how do we evaluate these expressions? There is the way we do it “as people”, and the way we might write an algorithm to do it.

Let me work out (- (+ x 2) (* (* x 2) (+ 3 x))). Suppose we have already assigned x=99.

(- (+ x 2)  (* (* x 2)  (+ 3 x)))
(- (+ 99 2) (* (* x 2)  (+ 3 x)))
(- 101      (* (* x 2)  (+ 3 x)))
(- 101      (* (* 99 2) (+ 3 x)))
(- 101      (* 198      (+ 3 x)))
(- 101      (* 198      (+ 3 99)))
(- 101      (* 198      102))
(- 101      20196)
-20095

In each line, I’ve walked down into the argument sub-expressions of each expression, trying to find a numerical value to bring up as an argument for the higher-level ones. So for example, I can’t evaluate the initial minus, because I don’t actually know its arguments yet. So I walk down into the first argument, (+ x 2), and start working on that. I don’t know its numerical values yet either, since I have x and I want to know the number, so I replace x with its bound value, 99. Now, I have a function and both its arguments in hand, and so I can add them to get 101. But I still don’t have both arguments of the “root” minus, so I start the same process of traversing the other argument branch.

And so on.
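
That “walk down until you have numbers in hand” procedure is mechanical enough to write out. Here’s a minimal Ruby sketch of it, with nested arrays standing in for the S-expressions and the symbol :x standing in for the input variable:

# A minimal sketch of the "human reader" evaluation: recurse into every
# argument until it's a number, then apply the function on the way back up.
def evaluate(expr, x)
  case expr
  when :x      then x                                 # the bound input value
  when Numeric then expr                              # a naked literal
  when Array                                          # [function, arg1, arg2]
    op, *args = expr
    args.map { |arg| evaluate(arg, x) }.reduce(op)
  end
end

# (- (+ x 2) (* (* x 2) (+ 3 x))) with x bound to 99
evaluate([:-, [:+, :x, 2], [:*, [:*, :x, 2], [:+, 3, :x]]], 99)  #=> -20095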

Another useful way of thinking of this process might be as a collection of values and “unfinished business” on pushdown stacks. I find 99, then I add 2, and so on.

Let me walk through the same expression again, pushing the items I encounter in my depth-first traversal of the tree onto a stack. (I’m using the left end of the stack as its “top” here). In each step, if the expression is a “function” I move the function over to the stack and “unwrap” its arguments. If it’s a naked item, I move that over to the stack as-is. I’ll count x as a “function” of no arguments, that returns the bound value.

[(- (+ x 2) (* (* x 2) (+ 3 x)))] :: []
[(+ x 2) (* (* x 2) (+ 3 x))]     :: [-]
[x 2 (* (* x 2) (+ 3 x))]         :: [+ -]
[2 (* (* x 2) (+ 3 x))]           :: [x + -]
[(* (* x 2) (+ 3 x))]             :: [2 x + -]
[(* x 2) (+ 3 x)]                 :: [* 2 x + -]
[x 2 (+ 3 x)]                     :: [* * 2 x + -]
[2 (+ 3 x)]                       :: [x * * 2 x + -]
[(+ 3 x)]                         :: [2 x * * 2 x + -]
[3 x]                             :: [+ 2 x * * 2 x + -]
[x]                               :: [3 + 2 x * * 2 x + -]
[]                                :: [x 3 + 2 x * * 2 x + -]

At this stage, I’ve moved a collection of tokens from the original tree over into this new stack on the right, one at a time. Along the way, I threw away all those parentheses. Now let me “unroll” that right-hand stack, by pushing the literal values onto the first stack, pushing the value of x when I have an x, and immediately applying the functions I encounter to the top two items on that stack as I go:

;; this is where the token-moving above finished, and where the "unrolling" begins
[]           :: [x 3 + 2 x * * 2 x + -]
[99]         :: [3 + 2 x * * 2 x + -]
[3 99]       :: [+ 2 x * * 2 x + -]
[102]        :: [2 x * * 2 x + -]
[2 102]      :: [x * * 2 x + -]
[99 2 102]   :: [* * 2 x + -]
[198 102]    :: [* 2 x + -]
[20196]      :: [2 x + -]
[2 20196]    :: [x + -]
[99 2 20196] :: [+ -]
[101 20196]  :: [-]
[-20095]     :: []

The single item on the stack at the end of this arduous process is “the answer”. It’s not as tidy-feeling to do it this way, compared to the “human reader” way, but it does work.

That last phase, where I used the “top item from the right-hand stack”, is also the way many modern GP representation schemes—like Push—work. Indeed, what I’ve done here is something like a “translation” of an S-expression tree into an equivalent Push stack-based expression. The Push-like “program” “[x 3 + 2 x * * 2 x + -]” is the “same thing” as the nested S-expression “(- (+ x 2) (* (* x 2) (+ 3 x)))”.
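
And that last phase is just as mechanical. Here’s a small Ruby sketch of it: walk the right-hand stack’s tokens in order, pushing values onto the left-hand stack and applying each function to the top two items as we go. (Same symbol-and-number stand-ins as before; this is my sketch of the idea, not actual Push.)

# A small sketch of the "unrolling" phase above. Index 0 of `values` is the
# stack's "top", matching the left-end-is-top convention in the trace.
def unroll(tokens, x)
  values = []
  tokens.each do |token|
    case token
    when :x      then values.unshift(x)               # x is a "function" of no arguments
    when Numeric then values.unshift(token)           # literals move straight over
    else                                               # :+, :-, :*, :/ consume the top two values
      a, b = values.shift(2)
      values.unshift(a.to_f.send(token, b))
    end
  end
  values.first                                         # the single remaining item is "the answer"
end

unroll([:x, 3, :+, 2, :x, :*, :*, 2, :x, :+, :-], 99)  #=> -20095.0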


Now let me add one more function to this simple language. This new one is perfectly well-defined in mathematics, but it might also push us out over a computational cliff.

I’ll call this new function “recurse”, although if I was a Computer Scientist I might be more tempted to call it “Y”.

This recurse function takes a single argument, which (because we’re still syntactically pure here) will obviously be an expression. And as I’ve already established above, every expression in this language can be interpreted as a function of x, even if it doesn’t contain the x token. But unlike the other functions we’ve used so far, the recurse function returns an expression rather than a completely-evaluated numerical result.

If the argument expression does not depend (mathematically) on x—that is, it does not include in any subtree an explicit reference to x—then recurse returns that argument unchanged. So (recurse 9) is 9, and (recurse (+ 9 (* 2 -11))) is (+ 9 (* 2 -11)).

But suppose an x does appear in the argument expression. The result is a substitution of the entire (recurse expr) into every position where an x appears within the argument. That is, it’s the result of “applying” the function represented by the expression to itself.

For example, (recurse (+ x 2)) returns the expression (+ (recurse (+ x 2)) 2). We’ve replaced one expression with a slightly different one (which is why this type of function is called a “combinator” more formally). The next time we reach a point where we’re evaluating another (possibly new) recurse function in this result, we do it all again, and so on. Eventually, depending on how we are actually doing the evaluation, the expression might “unroll” like this:

(recurse (+ x 2))
(+ (recurse (+ x 2)) 2)
(+ (+ (recurse (+ x 2)) 2) 2)
(+ (+ (+ (recurse (+ x 2)) 2) 2) 2)
(+ (+ (+ (+ (recurse (+ x 2)) 2) 2) 2) 2)
...

You can glance at this (or maybe squint) and realize that eventually in the limit it will become \(y=x+2+2+2+\dots=x+\infty\).
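
In the nested-array stand-in I’ve been sketching with, that substitution rule is only a few lines. Something like this:

# A sketch of one recurse rewrite step: substitute the whole (recurse expr)
# into every position where :x appears inside expr. If there is no :x, the
# argument comes back unchanged, exactly as the rule above says.
def recurse_step(expr)
  whole = [:recurse, expr]
  rewrite = lambda do |e|
    case e
    when :x    then whole
    when Array then [e.first] + e.drop(1).map { |arg| rewrite.call(arg) }
    else            e
    end
  end
  rewrite.call(expr)
end

recurse_step([:+, :x, 2])           #=> [:+, [:recurse, [:+, :x, 2]], 2]
recurse_step([:+, 9, [:*, 2, -11]]) #=> [:+, 9, [:*, 2, -11]] (unchanged)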

Notice that I said “every position” where an x appears. So (recurse (/ x (- x 9))) goes like this:

(recurse (/ x (- x 9)))
(/ (recurse (/ x (- x 9))) (- (recurse (/ x (- x 9))) 9))
(/ (/ (recurse (/ x (- x 9))) (- (recurse (/ x (- x 9))) 9)) (- (/ (recurse (/ x (- x 9))) (- (recurse (/ x (- x 9))) 9)) 9))
...

Those are perfectly good mathematical expressions, even in the second case. Here, look at how they start to unfold and see if you can work out what happens over time:

\[\begin{align*} f_0(x) &= x \\[1ex] f_1(x) &= \frac{x}{x - 9} \\[1ex] f_2(x) &= \frac{\frac{x}{x - 9}}{\frac{x}{x - 9} - 9} \\[1ex] f_3(x) &= \frac{\frac{\frac{x}{x - 9}}{\frac{x}{x - 9} - 9}}{\frac{\frac{x}{x - 9}}{\frac{x}{x - 9} - 9} - 9} \\[0.5ex] \dots \end{align*}\]

Because I was curious, I asked Wolfram Alpha about this second one, and it suggests that the recurrence relation here solves to something like this:

\[f_n = \frac{90 x}{(-9)^n (x-10) + 9 x}\]

I’ll leave it to you to decide what happens, mathematically, as \(n \to \infty\).

Because hey, hey, hey! don’t get side-tracked! Recall that these aren’t supposed to really be mathematical expressions. This is a “program”. These are supposedly representations of algorithms we’re developing, and when I say “recurse returns” something, I mean it returns it right then and there, not as some kind of algebraic limit.


So. How do we evaluate such an expression as code?

You have the details of how any individual evaluation step unfolds: for naked symbols, for functional expressions, and even for the recurse function. In any step of execution of a program like “(recurse (+ x 2))”, you know what to do.

But what is the “return value” of evaluating (recurse (+ x 2))? I mean, we both know that after a loooong time it results in something like POSITIVE_INFINITY, but as I have already said the immediate result is simply a new expression that’s only slightly more complicated than the one we executed. The process of “running” this code won’t ever “build” an infinite value in memory. What it’s doing as we execute it is building a ridiculously large tree in memory, and never getting to the meat of the arithmetic it contains. In terms of stepwise evaluation, we’re not even really asking our program to count to \(\infty\) by twos, though that’s what we can interpret this code to “mean”.

What it “means” in the context of stepwise execution—standard computational serial execution—is more like “you will never even see the first argument of this arithmetic, you know….”

This may feel like a trap or a trick. It’s not. This has all been talked about since before there even were physical computing machines.

Almost all of the general-purpose modern GP representation schemes include code for iterating over arbitrary ranges, and for recursion, and for all those other control structures you’re used to messing up when you write your own code. Most are Turing-complete. Arbitrary syntactically-correct code written in these languages might terminate and return a value, or it might be like the arithmetic code I’m describing, with recurse in it.

But recursion isn’t even necessary. The practical difference between a literal infinity and a Very Large Number isn’t salient. Recall that GP executes random code, with random constants in it. You can walk to almost-infinity instead of recursing your way there, if you like, for example by building a loop that starts counting from \(-889.3\) to \(971^{73}\) by increments of \(\frac{\pi}{1000}\).

In human-written languages, that would just be some kind of for...next...by loop with some poorly-chosen numbers filled in. You might be smart enough, if you were the author, not to let those particular numbers show up in that particular context, but this is the sort of thing that happens all the time when we build and run random code.
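
Written by hand, that loop is nothing exotic. In Ruby it’s a couple of lines that will, in practice, never finish:

# The almost-infinite loop described above: counting from -889.3 toward 971**73
# in steps of pi/1000. Perfectly legal; practically endless.
x = -889.3
x += Math::PI / 1000 while x < 971**73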

What will you do? A fragment of code that’s counting to a Very Large Number in Very Small Increments is indistinguishable from one that’s recursing infinitely.

Maybe we can handle this weird recurse in some kind of “special way” when it pops up?

I might notice that the recurse function has a limited “scope”. I can keep using the standard arithmetic evaluation rules for all the other arithmetic, but decide to do something qualitatively different whenever a recurse is handled. For instance, if I “notice” something that looks like (recurse (+ x 2)) in my code, I can “substitute” a value of POSITIVE_INFINITY (or whatever I’d prefer that means the same thing).

That feels much more complicated than my tree-evaluation sketches up above, doesn’t it? Look at the first few steps here, as I stepwise evaluate this familiar example code:

[(- (recurse (+ x 2)) (* (* x 2) (+ 3 x)))] :: []
[(recurse (+ x 2)) (* (* x 2) (+ 3 x))] :: [-]  <<--- WEIRD THING ALERT
[(* (* x 2) (+ 3 x))] :: [POSITIVE_INFINITY -]
...

That feels wrong somehow. Inelegant.

For most of the steps, I can look at the first token being processed and in simple steps move it over to the other side. That conversion of a big sub-expression into a limit result feels like some kind of jarring ad hoc thing that I wouldn’t be able to sustain, one that involves a lot of complicated pattern-recognition infrastructure I don’t feel up to writing by myself. Not for all the possible edge cases, surely.

An alternative approach would be to just plow ahead, stepwise, and execute the recurse using the handy rule I’ve already defined: it returns an expression. That would be more like the first evaluation process walk-through I did, the one I described as more “human-like”:

(- (recurse (+ x 2))  (* (* x 2)  (+ 3 x)))
(- (+ (recurse (+ x 2)) 2)  (* (* x 2)  (+ 3 x)))
(- (+ (+ (recurse (+ x 2)) 2) 2)  (* (* x 2)  (+ 3 x)))
(- (+ (+ (+ (recurse (+ x 2)) 2) 2) 2)  (* (* x 2)  (+ 3 x)))
(- (+ (+ (+ (+ (recurse (+ x 2)) 2) 2) 2) 2)  (* (* x 2)  (+ 3 x)))
...

That’s even less satisfying. I’ll never even get to the first x substitution this way. I’m almost immediately stuck in an infinite loop.

But that’s how many of our modern GP representation schemes handle this sort of thing. In Push, for example, this is very close to the way recursion and iteration work. A single step of rewriting occurs in every execution step, and as a result a new iteration/recursion instruction is produced. The loop unrolls a step at a time, so we can keep the “interpreter” running, but it will keep “running” this incredibly boring code forever unless we intervene.

You can start to see why people in the early days were so urgently concerned with non-terminating programs. So concerned that they started working with them and trying to prove things about them before there were even real digital computers in the world.

But here’s my point: Something’s not right about our complaint. Sure, this recurse example feels like we have a problem, and coupled with the sentiments aroused by the Halting Problem we tend to back away from this kind of dynamical black hole very quickly.

And yet, I will be furious if my laptop’s OS “halts”.

My actual real computer, and most of the processes I run on it, are not anything like RSWN or an arithmetic-evaluator. They’re more like this one:

That animation doesn’t ever “end”. It’s designed to unfold a dynamical process, step by step, forever. Until you kill it, by closing the browser window.

Nobody worries about that halting. It’s not even supposed to.

What kind of a hypocritical monster am I, to spend so much time worrying about philosophical matters of arithmetic that never “halts”, but then turn around and intentionally write this horror that never ends?

And next time:

We are not done here.

What is “done”, anyway? I’ll be “done” when I’m dead, and maybe not even then.

We’ve gone about half way so far, I think. We’ve talked about computer programs, and GP, and we looked at a simple simple simple arithmetic representation scheme, one that provided a lot of very useful and familiar dynamics, but one which is also not very good at the Complete Turing Things.

Turing didn’t say things had to halt, you know. But that’s exactly what a lot of people I know seem to imagine: that programs must end, that they must return values, that the return of a value is a privileged event (compared to all the other preceding and apparently incidental events), that there is a beginning and a middle and an end, and that all of those are done in their turn and not at other times.

This is, just looking down at this computer on which an OS has been running non-stop for 13 days, thinking for a moment about the two years I’ve been poking at this essay (now a pair at least), thinking about the magician and her nervous system and her mental model of the audience and her endless sessions of practice, by which mastery was achieved…. wrong.

Things of the world don’t have clean “starts”, or “middles”, or “ends”. Their boundaries are poorly defined; they leak in all sorts of ontologically interesting ways. The pragmatics of computing is insufficient.

And there is, I think, a remarkably useful place to stand. There may even, eventually, come to be a lever long enough to do the work.


Here is what might be in the next part of this piece. These are my notes. This is a map of where I’m dragging us all:

  • what about cellular automata?
  • there is no “final state”, just an endless cyclic attractor somewhere far away
  • there is a Y-ness to it, to the whole thing
  • if we were writing a CA-based “function”, we’d design it so the “final state” is the value we “want”
  • cite Melanie & Jim’s work on CA classifiers
  • but they were looking, not designing; “looking” in the way we do GP
  • and so they thought they had to set a deadline
  • they halted the CA “program” at N steps, and said that last state was the “return value”
  • it was a good idea, though it was a forcing of an arbitrarily big and flexible CA into a traditionally algorithm-shaped lock
  • they could hope to find a program that was “mostly” done by N steps, and then could check if it converged; that would’ve been good
  • at best, they would’ve found a program that had already found a solution early, then cycled on just that solution
  • at worst, they might’ve found a program that was “right” only some of the time in its final cycle; or one that was never all the way “right”
  • but they set limits, and walked back to “return value”
  • could we do the same with our recursive arithmetic? sure!
  • we could do depth-first evaluation and just stumble into a loop when it occurs; as a result we might not terminate
  • what if we set a deadline?
  • if we’re in a loop and we are getting closer to a target, we might notice; if we’re already there, even better; if our state is flickering…well, not so good
  • note that many algorithms, even arithmetic, oscillate wildly as they recurse
  • but also
  • we don’t know where in a “tree” our best guess will occur
  • all the numbers ever seen, maybe even the first one, could be better models
  • this was [somebody’s] insight for speeding up Symbolic Regression: cache all intermediate chunks of tree; there was at least one paper at GPTP that did this, and I recall several over the years
  • instead of an arithmetic program being “just the final answer”, it can be all the answers along the way
  • now (and): systems and code
  • Dave Ackley’s rant video
  • “systems thinking” vs “computational thinking”
  • what is a thing that just works
  • the problem of getting Avida to do anything useful….
  • and also! back to machine learning:
  • recurrent neural networks
  • do you want a tool? or do you want a conversation?
  • to think of an algorithm as something that gives you the answer is one approach; to think of a computational process as one with which you are having a conversation is a very different one
  • an evolved RSWN
  • an evolved FizzBuzz or two
  • what does a program mean, if you didn’t write it (if nobody wrote it)?
  • the problem of pragmatics
  • pareidolia and co-created meaning

Next time we’ll see what happens. O what will come of it all, when it ends?

  1. Yeah, this is in boldface type for a reason. Can you guess the reason? 

  2. Yup, this is in bold on purpose. 

  3. And this too. 

  4. Or not. I’d say that the vast majority of research work done in representation schemes for GP over the last several decades has focused on ways to force “random code” to also “use the arguments and return the right type”. Hell, this very essay is about that, in a roundabout way.