Among the Big Scientists

Draft of 2016.08.26

May include: academia, culture, GP, &c.

In the couple of hours we had free this morning before other obligations arose, my wife and I walked along the Black Pond Woods nature trail, so she could finish the AADL Summer Game. As we walked along the path in search of the final code—following what I have to say are the sketchiest and most uninformative directions for finding codes yet—I was reminded of the workshop I was fortunate enough to attend this week.

Explaining that connection will take about 3700 words, I guess.

Workshopping it

Some months ago, I met Kai Staats when several colleagues suggested we invite him to our invitation-only annual genetic programming workshop in Ann Arbor. Kai is a great guy, and super enthusiastic about fostering collaborations and using machine learning approaches to address the wealth of data that’s accumulating in certain Big Science projects. He had just been working on a project applying machine learning to radio astronomy data, and his results drew the attention of folks working in various Astrophysics and Astronomy projects you’ve probably heard about.

(Because you’re a nerd. I know you’re a nerd because you’re reading this.)

So at about the time Kai arrived in Ann Arbor for our GPTP workshop, he heard from the Center for Cosmology and AstroParticle Physics that they would also like to run a workshop. In this case, it was just the sort of working workshop I’ve always touted among the Computer Scientists when I am around them—and which they have inevitably resisted—but this one was born ab initio as an explicit practicum.

Kai invited all of us (at the GPTP workshop) to join them in Columbus for an exploration of the ways in which GP and the broader toolkit of machine learning might help with the Data Panic that’s soon to envelop Astrophysics. There is, after all, an acknowledged surplus of data in High Energy Physics already; all the big colliders have been producing streams of data for decades, and the degree of instrumentation and sensitivity of sensors have only been improving. The same applies to the newer Big Science projects to measure gravitational waves, detect exotic high-energy neutrino emissions, and collect increasingly dense deep-sky imagery: There’s a shit-ton of stuff out there, and there are plenty of sensors down here filling up way too many hard drives….

Bill Worzel, Erik Hemberg and I managed to find time to attend, although Worzel had to cancel due to other obligations, and I was forced to leave early because of a miscommunication (more on that another day). Luckily, Una-May O’Reilly was able to provide an excellent overview of the work her team has been doing over the last several years at ALFA, remotely joining us through a remarkably performant Skype connection.1

On the Enthusiasms of Early Successes

Genetic Programming is often mistaken for a Machine Learning technique. I’ve given up trying to make the distinction every time anybody falls into the error around me, because (1) a lot of people around me talk about Genetic Programming a lot, and (2) a lot of those people have only ever seen or considered GP the way it was done 20 years ago. Back in the day, GP was essentially a fancy optimization methodology, especially in the world of symbolic regression problems.

And also: (3) maybe GP is a sort of Machine Learning. Because nobody else thinks otherwise. So caveat lector, folks. As always. As ever. As you no doubt do whenever you read anything… right?

Be that rare smart nerd who really questions everything.

To review (and utterly ignore the awful Wikipedia page I’m almost certain will be there if I peek): Symbolic regression is the application of GP (and other tools) for the simultaneous search for numerical models and the numerical parameters of those models. Traditional statistical modeling depends on Right Thinking and First Principles and Good Old Occam to design the structure of a model to fit one’s data, and then the application of numerical methods to fit the model to the data. For example, if it has been decided that the model to explore is of the form \(y=\beta_0 + \beta_1 x + \beta_2 x^2 + \ldots + \beta_n x^n\), then you “simply” need to find the “best” values of \(\beta_0\) and \(\beta_1\) and so on. Now a professional statistician will point out that then—after the numerical fitting is complete—you should immediately examine the resulting model and determine whether your choice was reasonable after the fact.

But for the most part, enthusiastic scientists and engineers don’t really want to do all that extra work, and are willing to just report best-fitting constants and p-values and call it a day.
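To make the fixed-structure workflow concrete, here is a minimal sketch in plain Python. It assumes the structure has already been chosen (a polynomial of a given degree) and only fits the \(\beta\) values by ordinary least squares. The function names and the toy data are mine, invented for illustration; nothing here comes from a particular statistics package.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares fit of y = b0 + b1*x + ... + bn*x^n via the normal equations."""
    n = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y] for the monomial basis.
    rows = [[sum(x ** (i + j) for x in xs) for j in range(n)]
            + [sum(y * x ** i for x, y in zip(xs, ys))]
            for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]


def predict(betas, x):
    """Evaluate the fitted polynomial at x."""
    return sum(b * x ** k for k, b in enumerate(betas))


def sum_squared_error(betas, xs, ys):
    """The after-the-fact check: how well does the chosen structure fit?"""
    return sum((predict(betas, x) - y) ** 2 for x, y in zip(xs, ys))


# Made-up data generated from a known quadratic, so the answer is checkable.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 0.5 * x ** 2 for x in xs]

betas = fit_polynomial(xs, ys, degree=2)   # recovers roughly 1, 2, 0.5
line = fit_polynomial(xs, ys, degree=1)    # wrong structure, best constants
print(betas, sum_squared_error(betas, xs, ys))
print(line, sum_squared_error(line, xs, ys))
```

The point of the second fit is that the numerical machinery will happily hand you "best" constants for the wrong structure. Deciding on the structure is a separate job, and in this workflow it is yours.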

In Symbolic Regression, on the other hand, what we do is search simultaneously for the structure of the models and the constants with which they are applied. By any of innumerable techniques (which I have discussed elsewhere and which really aren’t salient here), many structurally distinct models are produced, and for each of those many numerical assignments of coefficients are examined and rated. From that potentially large collection—that combinatorially unbounded search space, actually—“best” models are discovered, which possess desirable structural and numerical traits.
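A toy sketch of that simultaneous search might look like the following. To be loud about the assumptions: this is plain random search, not a real GP system (there is no population, no crossover, no selection pressure), and the operator set, tuple representation, parameter values, and target function are all made up for illustration. It generates structurally distinct expressions at random, jiggles the constants of each one a few times, and remembers every improvement along the way.

```python
import math
import random

# Hypothetical operator set: name -> (arity, function).
OPS = {"add": (2, lambda a, b: a + b),
       "mul": (2, lambda a, b: a * b),
       "sin": (1, math.sin)}


def random_expr(rng, depth):
    """Build a random expression tree as nested tuples."""
    if depth == 0 or rng.random() < 0.3:
        return ("x",) if rng.random() < 0.5 else ("const", rng.uniform(-3, 3))
    name = rng.choice(sorted(OPS))
    arity = OPS[name][0]
    return (name,) + tuple(random_expr(rng, depth - 1) for _ in range(arity))


def evaluate(expr, x):
    if expr[0] == "x":
        return x
    if expr[0] == "const":
        return expr[1]
    return OPS[expr[0]][1](*(evaluate(arg, x) for arg in expr[1:]))


def error(expr, data):
    return sum((evaluate(expr, x) - y) ** 2 for x, y in data)


def perturb_constants(expr, rng, scale=0.3):
    """Same structure, slightly different numerical constants."""
    if expr[0] == "const":
        return ("const", expr[1] + rng.gauss(0, scale))
    if expr[0] == "x":
        return expr
    return (expr[0],) + tuple(perturb_constants(a, rng, scale) for a in expr[1:])


def search(data, rng, n_structures=300, n_tweaks=20):
    best, best_err, history = None, float("inf"), []
    for _ in range(n_structures):
        candidate = random_expr(rng, depth=3)          # a new structure
        cand_err = error(candidate, data)
        for _ in range(n_tweaks):                      # polish its constants
            tweaked = perturb_constants(candidate, rng)
            tweaked_err = error(tweaked, data)
            if tweaked_err < cand_err:
                candidate, cand_err = tweaked, tweaked_err
        if cand_err < best_err:                        # record each improvement
            best, best_err = candidate, cand_err
            history.append((best_err, best))
    return best, best_err, history


# Made-up target: y = 2x + sin(x), sampled on [-2, 2].
data = [(i / 4.0, 2.0 * (i / 4.0) + math.sin(i / 4.0)) for i in range(-8, 9)]
best, best_err, history = search(data, random.Random(0))
for err, expr in history:
    print(round(err, 4), expr)
```

Reading the printed history top to bottom is the interesting part: each line is a structurally different model that beat everything seen before it, which is a different experience from watching one set of coefficients converge.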

It’s miraculous, to be frank. It’s amazing what you can find, if you look everywhere….

If it works the first time, you did something wrong

But Genetic Programming is not just Symbolic Regression, and—except in certain very limiting circumstances—neither one is Machine Learning. The toolkits are misleadingly similar, and the technical details and terms of art overlap extensively… but Symbolic Regression and Genetic Programming more broadly are for something else.

In the same way I don’t really care what Wikipedia says about Symbolic Regression, I don’t really care what it says about Machine Learning. I’m talking here about how it is used, and how its users perceive and realize value in its use. Symbolic Regression can be used as a sort of Machine Learning tool in the same way you can press the gas pedal of your car with an umbrella if your foot is in a cast. The umbrella isn’t for driving, but it can be useful, lacking a more appropriate tool.

Machine Learning is a collection of algorithmic tools for reducing the dimensionality of datasets to the level of interpretable models. You don’t “do an ML project” (unless you’re in the business of writing new algorithms), you use Machine Learning to solve problems. If you want to build recognizers, perhaps you use neural networks or decision trees. If you want to build controllers, then you use neural nets or SVMs. If you want to classify new data without having to do basic science every damned time a new measurement arrives, then you use a classifier. If you want to make an automated car follow a road, if you want to plan a route on a map, if you want to maximize profit in a portfolio, if you want… and so forth.

Symbolic Regression is, in many ways, almost a machine learning technique. Especially in the early heady days around 2000, the emphasis was totally on the “automatic” part of the algorithm. It was presented and understood as if it were the miraculous black box of the Bat Computer: feed in the alphabet soup noodles (easy to digest), and out will come the answer. And there is no doubt that for many datasets, Symbolic Regression is able to find a structural model and a set of coefficients that together can do an amazing job of fitting the data.

Those folks who have Just Enough Cynical Statistics Experience will probably be muttering something like: the inevitable result of such a process is overfitting. If they’re really fancy, they’ll probably think I’m about to suggest some sort of stringent cross-validation scheme.

And they would be wrong, more or less at the point where they opened their mouths to mutter.

Recall that Symbolic Regression does not simply fit a single model, but rather explores the entire space of models. If applied over the set of arithmetic operations, it has the capacity to explore all the arithmetic expressions. If applied over the set of trigonometric relations and polynomials and matrix algebra, it has the capacity to explore all the models one could build out of numbers and trigonometry and imaginary numbers and matrices. And so on.
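A back-of-the-envelope count gives a feel for how fast that space grows. The sketch below assumes a deliberately tiny setup of my own choosing: two kinds of leaf (the variable `x` and a single constant placeholder), and operators described only by their arity. It counts syntactically distinct expression trees up to a given depth; it says nothing about how many of those are mathematically equivalent, and once constants take real values the space is unbounded anyway.

```python
def count_expressions(n_terminals, arities, max_depth):
    """Count distinct expression trees of depth <= max_depth."""
    count = n_terminals  # depth 0: a bare terminal
    for _ in range(max_depth):
        # A tree is either a terminal, or one operator applied to
        # (arity) subtrees, each drawn from the shallower trees.
        count = n_terminals + sum(count ** arity for arity in arities)
    return count


arithmetic = [2, 2]      # two binary operators, say add and mul
with_trig = [2, 2, 1]    # the same, plus one unary operator, say sin

for depth in range(4):
    print(depth,
          count_expressions(2, arithmetic, depth),
          count_expressions(2, with_trig, depth))
# prints:
# 0 2 2
# 1 10 12
# 2 202 302
# 3 81610 182712
```

Adding a single unary operator more than doubles the count at depth 3, and every extra level roughly squares it.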

It’s not “overfitting” if you’re not just fitting coefficients. It’s exploration.

That’s the wonder of Symbolic Regression, when you first see the results: If you’re seeing the results unfold as search is underway, you can watch a linear model get close, then a polynomial appear that can get closer, then some more complex model that’s even closer. You can watch “discovery” happening, not just numerical optimization.

That’s why I draw a strong line between Machine Learning and Symbolic Regression, and even a line between Symbolic Regression and GP more broadly: Symbolic Regression is simply the accelerated exploration of alternative models, with a bit of “polishing” thrown in to get some of the coefficients to fit. The result is not a fitted model, and you shouldn’t expect it to be. The “result” is the extension of your ability to consider alternative possibilities quickly, and without the burden of your habits of thought.

GP is there to help you see the statue inside the marble. If you’re good at it, and mindful enough, it will let you see a dozen or a hundred alternative statues in the same block of marble.

But you should immediately start to worry if you ever find yourself using GP to carve the marble for you. You use GP to see, to show you models that apply in some sense to your data. You use it to surprise you, to inform you, and (most productively) to resist your plans.

Aside: “I should’ve seen that”

There were quite a few excellent talks at the workshop I mentioned. Excellent domain knowledge and new science, and also excellent work with ML applications, and (at least while I was there) one outstanding GP talk by Sam Stafford. Sam wrote his own GP system, using a very conservative linear-combination-of-multiplicative-terms approach, and applied it to the problem of identifying noise and interesting signals in a large dataset they’ve been working with at CCAPP that involves radio emissions from cosmic rays, detected by one of several Antarctic balloon-borne sensor arrays. I think.2

It was a nice project because (1) as I said, he wrote his own GP system, which helps people understand the approach far more subtly, (2) he was parsimonious with the features, (3) he had a particular project in mind, and most especially (4) the thing spit out an obvious answer he had totally missed. I won’t go into the details (though I hope to be able to link to it some time soon), but in broad terms Sam provided a toolkit of operators, and the GP system produced a classifier that did what he had asked. When he interrogated that classifier, though, it basically said “noise usually doesn’t have a peak on this time-scale” (as I recall).

Of course noise doesn’t usually have a peak. He knew that… once he saw the result and thought about it. That’s why I love the result so damned much. That’s what GP is for.

When you use it correctly, that use provides a mechanism by which the world can remind you of something you’ve been missing.

What could go wrong?

When you use a Machine Learning algorithm to perform an analysis, you don’t want to know what’s going on inside its little stupid head. Except, perhaps, for debugging purposes; if it does something dumb, you want to know where to whack it. When you undertake a Genetic Programming project, you’re not just “looking into” the data, but inevitably interrogating your own relationship with the data, the algorithm, and your own plans.

This is a far more agentic notion than most people are comfortable with, especially when they have only ever encountered or internalized GP “as” a Machine Learning tool. This is, as I’ve said elsewhere, a lovely example of Pickering’s Mangle of Practice in action. You undertake a project with some model of the world in mind, and as soon as you begin to implement that model (in a tool, in a program, in an equation, in a machine, in a work of art) the thing you’ve built resists you. The thing you’ve built is—if you’re willing to grant, even for rhetorical and self-serving reasons, some agency to nonliving things out there in the world—telling you something about what the world thinks of your mental model.

If you’re using a mere tool and it resists you, you pitch the tool and grab the right one. That’s Machine Learning: You don’t want your decision tree to “tell you” there’s something wrong with your notion of how the data are labeled. It just didn’t work.

But when you use GP to explore your data, you’re inviting it to tell you far more, and more subtly, than any “mere tool” would be expected to do. You are embarking on a project. When you start watching the GP system “consider your data”, you need to realize you’re part of the system under test. If as it works you see it considering increasingly complex models, and see the associated errors gradually reducing over time, and eventually it gives you an “answer” you like very much, well then you might not realize you’re part of that system. But when (as it almost certainly will do, for any interesting dataset) it settles for something trivial and easy and not at all what you’re interested in, it is not “failing to work”.

It is arguing against the completeness of your model of the world.

Might it be that your understanding of the GP code you’re running is incomplete or perhaps wrong? Might it be that the data you’re using actually is best described by simpler models than you expected? Might it be that the parameters you’ve used to set up and run the GP system are driving the dynamics towards this particular attractor, and that changing those parameters might move you towards a new and more interesting region of the vast solution space?

OK then. Have you run the automated unit and acceptance tests on the GP system and made sure it’s actually implemented correctly? Have you looked at the data and decided there might be some features that might be more informative, some details you might have missed? Have you examined the detailed dynamics of the search itself, and tried to understand what might want to be tuned to get to a different “better” result? Or is there, maybe, something fundamentally strange about the world, that your explorations have uncovered? Like, for example, “noise doesn’t look like signal because signals have peaks.”

Those are accommodations, to use Pickering’s term. You had a plan (“This will find the model!”), and when the thing you’ve built and the project you’ve undertaken don’t unfold according to that plan (and add no extra information), they’re resisting. Some of the paths available for you to accommodate that resistance are right there in the sentence I used. Change one of the words in, “This will find the model!” and see what options present themselves: Really “this”? What is “find” exactly? Do you really mean “the” model, or could it be some models?

Another impressive OSU scientist I met this week was Brian Clark. Brian strikes me as a skilled programmer, but also an able philosopher-in-practice… because he was already willing and able to ask me several times what to do when GP doesn’t work.

That’s an excellent question, and it’s framed exactly right. The quick answer I gave him was off-the-cuff, but in hindsight I still approve of it. I said, “You do basic science on the thing you think you’re using to do basic science.”

Sadly, I didn’t have the time while I was in Columbus to sit and explain to him everything that’s entailed, but this brief essay is at least an introduction to that much larger answer.

Which is: You do what you do every day… because you’re already a scientist.

This is especially easy to explain to Brian (I hope) because he mentioned in passing that he was soldering a lot of equipment in the lab for the big experiments they’re working on.

Somebody who solders. The mind boggles. More GP theorists should learn to solder.3 Hell, I wish I knew how to solder.

What do you do when you have a device, and you’re building it right there on your lab bench, and it resists you? It isn’t fitting, it isn’t making the right noise, it doesn’t smell right, it isn’t lining up, it isn’t going into the slot… it’s hovering unexpectedly five inches in the air above the bench… &c &c &c. You don’t say “it didn’t work” and walk away—at least not if you think of it as “your project”. You listen to it, and you consider how to accommodate this resistance.

If you’re patient (and human) enough, you already know that “not working” or even “being odd” isn’t a personal affront to your supreme intelligence and the validity of your original plan for the project. It’s not telling you you’re stupid, it’s telling you something about itself, or your other tools, or the tasks you’re asking it to do.

Maybe, if you’re very patient and listen very closely, it’s telling you something new about the universe that didn’t appear in your original plan or worldview.

In other words: you undertake a GP project in order that the system you’re building can surprise you.

I’ll write more specifically about what I mean, in the context of the CHEAPR workshop itself. I promised folks a presentation if we could find time to run one, but there was no time while I was there. So I’ll pass along what I would have said, in a follow-on to this introduction.

But I started off by talking about walking in the woods, didn’t I?

Nothing like you’ve seen on TV

Nobody in “my line of work” (whatever that is, though I’m told it might well be “Annoying Pissant” lately) would argue with the premise that Machine Learning is a powerful toolkit.

What I’m doing here is making that distinction firmer. Rigid, even. I want you to think of Machine Learning tools (neural networks, decision trees, nearest-neighbors models, clustering, reinforcement learning, all that jazz) as being exactly and no more than tools. Some of them are new, and they’re still being “invented” and “explored” and “polished” and “specialized” by the folks who work in that field. But the premise, the stance if you will, is that Machine Learning provides tools for you to use for particular tasks, and that eventually you will be able to pick up and use one as easily as you pick up and use a screwdriver, or a power drill, or a band saw. Some are big, some are little, some are general, some are specifically tailored to a tiny subset of projects.

You sure as hell don’t want to question your model of reality every time you pick one up.

And on the other side of that firmed-up line I’m placing Genetic Programming. Somewhere very close to the line sits Symbolic Regression, a subset of GP, but something you can sometimes get away with calling a “tool”. Is your computer a “tool”? I suppose that depends on how you use it, doesn’t it? But is your car a “tool”? How about a library? How about a University? How about an ecosystem?

We walked in the Black Pond Nature Area this morning because there were some codes there for the Ann Arbor Summer Game. For those of you from outside our island of abnormality, the Ann Arbor District Library runs an amazing game for patrons all summer every year, in which certain “codes” are hidden in the library branches, in the various parks and publicly-accessible streets around town, written on the tops of certain manhole covers, visible only from the top of a certain parking structure, hidden in the number of posts on a bridge at the north end of a park with an entrance at its south end, and so on. You redeem these codes for points and (virtual) badges, and if you collect a lot of points you can also redeem them for some trivial NPR supporter-grade things like picnic blankets and tote bags and knitted winter caps.

So the last code Barbara hadn’t collected yet this year involved the task of counting the posts on a boardwalk across a hard-to-reach lake in a city park. This morning we trudged into the park.

In the moment, I have to say it was a super fucking annoying trudge. It’s Summer as I write this, and it was a dewy morn (but not the poetical kind, just the soaking damp kind), we hadn’t worn long pants, and the first leg of the hike was through a meadow with many long scratchy buggy plants overhanging the path. I expect we got loads of mosquito bites, and there were bees and spider webs and you know… stuff you usually expect to encounter in the woods.

Oh and also Black Pond (in case you were paralyzed by sheer curiosity and couldn’t look it up on your own) is a mud puddle in a glacial moraine dell. No, I mean literally it’s a mud puddle this time of year: it’s fed only by rainfall, since there are no sources of water from springs or streams. It’s the least pond-like pond you could imagine. It’s not picturesque (by most standards, unless you’re into natural history), it’s not especially informative (unless you’re into local geology), and it’s not even secluded and sublime (because there was a puling troupe of abjectly miserable kindergartners there when we arrived).

Compared to the idea of nature, it was utter crap. I loved it. At least for now, I loved it.

It talked to me about my expectations. It told me something about my model of the world. It wasn’t TV nature program bullshit, with lions catching gazelles in slow motion and then tastefully gnawing dainty bits of them off-camera. It wasn’t Jacques Cousteau discovering a new red fish looming out of the darkness. It wasn’t the mulched, car-width paths so many of our other city parks provide. You could hear the traffic, and see the adjacent golf course through the gaps in the trees. It was utterly unmajestic.

And so, in the process of so emphatically resisting my model of the world of nature parks, it revealed to me something hidden right down the street.

So that’s why I wanted to talk about GP when I got home, and after I had finished my other obligations for the day. Because GP is a buggy, rambling, exhausting walk on a hot summer day in a noisy, child-infested park.

It’s not there for you. Sure, maybe it’s there because of you… but it’s not there to be your tool.

So for Brian, who asked what to do when it doesn’t work: When it doesn’t work, it is because it gave you exactly what you wanted the very first time. It didn’t work because it didn’t—contrary to your expectations, at least if you’re paying attention at all to what I’ve been trying to say—confound you in some way.

Everything you can possibly do will take the form of a question you ask yourself.

  1. No seriously, giving a talk via a well-equipped MacBook is apparently totally feasible nowadays. 

  2. I Am Not An AstroParticle Physicist™ 

  3. Interestingly, it’s also the relation most people who invented Cybernetics, Artificial Life, and to some extent old-school Computer Science (the real stuff, not the wan modern echo) had to their projects, and that’s why my favorite book about the history of that stuff is Andrew Pickering’s The Cybernetic Brain. It’s Mangles all the way down.