Draft

A self-rewriting system

Draft of 2018.11.30

May include: recreation, programming, &c.

This is gonna take the form of a series of leading questions, and glib explorations without a lot of detailed scholarly reference to existing work. That work does exist, and I’ll refer to as much as I’ve found at the end or in a later episode, but these are intended as steps towards somewhere that is not on the map drawn in the bibliography.

Admittedly, it might not even be leading anywhere interesting, but I’ve been coming back a lot lately to what feels like the connection-points between these steps, in my sketching for the ReQ language and its behavior.

Feel free to muse along. There are some simple-sounding things here that will at least polish your discrete math reflexes….

Shortest string containing all substrings of a given length

This is a famous and challenging and also very well-studied problem, but here I’m going to use it as a starting point. The phrasing I’ll use here is also a bit odd… but for reasons.

What’s the length of the shortest string of digits that contains every possible substring of \(k\) digits in it? That is, if \(k=3\) what’s the shortest string of digits with all 1000 three-digit triples in it?

Personally, I can’t confidently answer this question about triples (or even pairs) without thinking a bit, and trying things first. And then I can’t answer anyway….

Probably like you, I’m tempted to start off by thinking of the trivial \(k=1\) “sequences” of one digit each. Pretty obviously, the shortest string of digits containing all ten of those digits is—whoa, what are the odds??—any permutation of "0123456789". Right? All ten digits, all sitting there in plain sight.

That’s encouraging, right?

As to pairs: There are 100 pairs of digits (as I figure it), ranging from 00 through 99. Many years of rolling percentile dice in games helped me out here.

Maybe I should establish a worst-case scenario first. An absolutely terrible way to write a string containing at least one copy of all 100 pairs would be to just concatenate them all to produce a string 200 characters long. Something like this, where I’ve introduced spaces to show the pairs:

0001020304050607080910111213...979899
00
  01
    02
      03
        ...                   ...
                                 98
                                   99

But looking at that, it strikes me that we could shrink the string by getting rid of some of the “extra” pairs created when we concatenate the “necessary” pairs. For instance, it feels as if I can remove some characters so that all of the occurrences of "00" after the first one disappear, and then do the same for the later copies of "10", and so on.

Except… when I start deleting characters from a string, I’m not only removing adjacent pairs, but also possibly creating new ones. How do I avoid undoing work I did in an earlier pass?

So maybe it’s not as obvious as I thought, and maybe I should try it. But first, it makes me update my expectations a bit. The order of pairs seems very important, especially when I start thinking about overlaps. Two-digit patterns can overlap in a string: "123" covers both "12" and "23" in just three characters. So I can do the same work (adding two new pairs to the string) that concatenating two whole pairs would have done, using fewer characters.

This makes me realize that there’s a structural barrier to just how compressed our string might be: In any string of \(n\) characters, there can only be \(n-1\) adjacent pairs. That’s a given. We just can’t get around it, without redefining what “adjacent pair” actually means, right? So for 100 distinct pairs we need \(n-1\) to be at least 100, which means we couldn’t possibly force the necessary individual characters into a space smaller than \(n=101\) characters, since every digit has at most two neighbors in a string.

So maybe I’ve established some bounds. It certainly seems that the shortest feasible arrangement of digits which contains all 100 pairs of digits must be somewhere between 101 and 200 characters long. It feels like there are a lot of opportunities for overlaps, so I’m willing to imagine that the optimal answer is down near the low end of the range.

But I hesitate to say it’s possible to jam them all into exactly 101 consecutive digits, because I’m worried about the way overlaps can get “used up” in one highly-compressed part of the string, where they might be helpful for shrinking some other part. For example, if I simplify things and try the analogous problem with just the digits [0 1 2], trying to fit all nine pairs [00, 01, 02, 10, 11, 12, 20, 21, 22] into a ten-character string manually… I start to get frustrated.

Finally I just try the simple heuristic approach of deleting the last character of every pattern already present in the raw concatenation of all the pairs. So for example I see the first 00 (which is the first pair in the string), and just delete all “second zeros” later on, wherever they appear.

000102101112202122
00
00102101112202122
 01
0010210112202122
  10
001021112202122
   02
00102111220122
    21
00102111220122
     11
0010211220122
      12
00102112201
       22
00102112201
        20
00102112201
         **
0010211220

Huh. This actually seems to work. I eventually winnow things down to a 10-character string that has all nine pairs… but am I sure about how this approach might generalize? Not at all, especially because of that last step. Somehow in the process of deleting stuff, I’ve reintroduced a new "01" that wasn’t there earlier (that’s the pair marked with asterisks), and I have to make one more deletion to clear it out. It just doesn’t leave me thinking this would be a good idea for more difficult problems; I get the sense that those “creating new substrings” steps will be more common as combinatorics kicks in when I’m working with more digits and longer substrings. There are a lot of things to check.
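(A quick aside for the code-inclined: here’s a rough Ruby sketch of a more cautious variant of that pass. Rather than trusting my by-hand deletions, it tries removing each character in turn, and keeps a deletion only if every pair is still covered afterward. The greedy strategy and the names are my own inventions, so treat this as a sketch, not a serious solver.)

ALPHABET = %w[0 1 2]
TARGETS  = ALPHABET.product(ALPHABET).map(&:join)

# true if every possible pair over the alphabet appears somewhere in s
def covers_all?(s)
  pairs = (0...s.length - 1).map { |i| s[i, 2] }
  (TARGETS - pairs).empty?
end

# greedily delete any single character whose removal keeps all pairs covered
def shrink(s)
  loop do
    shorter = (0...s.length)
              .map { |i| s[0...i] + s[i + 1..-1] }
              .find { |t| covers_all?(t) }
    break unless shorter
    s = shorter
  end
  s
end

puts shrink("000102101112202122")
# starts from the 18-character concatenation; greedy deletion lands near the
# 10-character optimum, though it might stall at a local minimum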

In fact, it’s troubling enough to make me look things up online. I find that (1) Martin Gardner once asked a similar-sounding question about beads in a bracelet, and (2) the most productive approaches to doing things like what I’m asking for use combinatorial algorithms and intermediate representations involving de Bruijn diagrams. Along either of these paths, I’d also need to make some adjustments to the definition of “adjacent” (making the strings I asked about “cyclic” to permit “overlaps” at the ends), or to ask a question I didn’t ask (“Does a sequence exist which contains exactly one copy of each pair?”). I mean strings, not circles. I mean shortest strings with at least one copy, not strings with exactly one copy. Those are different questions.

There are a lot of fruitful and open questions down that other path. Whether you take the puzzle route, or the de Bruijn diagram approach, you will run across any number of eminently practical matters in combinatorics, compression, cryptography, bioinformatics, number theory, and even optimal experimental design.

But for today, I’m not going to redefine what a string is—it will still have a definite start and end for us here—and I’m also not going to be all picky and ask you to produce a string that has only one copy of a pattern. You can be sloppy, and you’re allowed to take a more approximate route to shrinking the string.

Maybe now you want to write some code that tries to make the shortest possible string of digits—in other words, go back to [0,1,2,3,4,5,6,7,8,9]—that has all three-digit combinations at least once. If you’re feeling fancy, you might be able to use a de Bruijn approach, or you might be able to do it heuristically using an approach like the one I did by hand for the easy problem above. Maybe you could try a metaheuristic: something like simulated annealing or genetic algorithms, if those strike your fancy.
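For what it’s worth, here’s one cheap route in Ruby: a sketch of the greedy “prefer the largest digit” construction (often credited to M. H. Martin), which repeatedly appends the biggest digit that doesn’t recreate a triple we’ve already collected. I believe this greedy version is known to bottom out at the optimal length, but the code below is my illustration of the idea, not a proof.

DIGITS = ("0".."9").to_a

def greedy_triples
  s = "000".dup
  seen = { "000" => true }
  loop do
    suffix = s[-2, 2]
    digit = DIGITS.reverse.find { |d| !seen[suffix + d] }
    break unless digit                 # stuck: no digit gives a new triple
    s << digit
    seen[suffix + digit] = true
  end
  s
end

s = greedy_triples
puts s.length                                         # 1002, if all goes well
puts (0..s.length - 3).map { |i| s[i, 3] }.uniq.size  # 1000 distinct triples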

In any case, the simplest “workable” approach for stuffing all thousand three-digit substrings into one string is just the concatenation of their digits, right? Like this:

000001002003004...998999
000
   001
      002
         003
            004
               ...
                  998
                     999

And the best possible—but perhaps not feasible—arrangement would be one where every triple shares two characters with both of its neighbors. In the same way that "...011223..." includes "011" and "112" and "122" and "223", using all four triples in the six-character string for different substrings. And you might be able to keep that kind of smushing going for a long time. But… for how long? Can we smush all 1000 three-digit substrings into 1002 characters?1

Meanwhile, let me plant a new flag, over in what might feel like unrelated territory….

Context-sensitive character insertion

Now let me describe a very simple context-sensitive Lindenmayer system for you: a specialized sort I’m calling an “insertion system”.

There are a lot of useful and lovely things tucked into the relatively obscure chunk of theoretical computer science that includes Lindenmayer systems. I encourage you to poke around, and if I have the time I’ll add some bibliographic junk at the end of this. But for now let me just spell out the thing that brought me here.


Suppose we have a string of characters. For the sake of foreshadowing, let me say the characters are digits, and the string is like the ones we talked about above.

We also need a set of substitution rules. The ones we need today are a very particular kind of substitution rule. You may have noticed me talking about substitution rules a lot lately, and you probably know that there are all kinds of ways of defining such rules, but here I am talking about the simple sort: “whenever you see this pattern of characters, replace it with this other pattern of characters.” That is, they take the form "xyz -> abc". Wherever the string contains a “match” to the left hand side, "xyz", I’ll replace those characters with the stuff on the right hand side, "abc".

If you’ve ever poked at formal languages or automata theory, or even programmed using regular expression libraries, you’ll have come across this sort of thing. But unlike most of those more familiar systems, in a Lindenmayer system we will apply all matching rules in parallel during each “step”. That’s not the only thing that sets Lindenmayer string-rewriting apart from the other kinds, but it’s probably the most obvious and dramatic.

Let’s look.


Here are some characters for us to play with, as our initial string: "10210".

And here are some substitution rules. Bear in mind I’m picking these more or less at random:

00 -> 010
01 -> 021
02 -> 022
10 -> 100
11 -> 101
12 -> 112
20 -> 210
21 -> 201
22 -> 222

Notice that in constructing these substitutions, I’ve done some very particular things:

First, I’ve covered all possible pairs of adjacent digits from [0 1 2]. I didn’t have to, but I did… because why not?

And—this one is crucial—I’ve tried to be careful on the right hand sides to build rules that only insert a new digit between the original pair matched. So for example if my string contains "00", the rule specifies that I will “replace” that substring with "010", which is the same thing as saying I’ll “insert” a "1" between a pair "00".

Because I’ve been so careful to preserve context in these potentially overlapping pairs, we can apply these rules to every match in the entire string simultaneously.
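Before we crank anything by hand, let me pin the mechanism down as a tiny Ruby sketch. I’m assuming the rules live in a Hash from each two-character pair to its full replacement, and that unmatched pairs simply pass through untouched; the function name is my own.

# one parallel step: keep every original character, and drop each rule's
# inserted middle characters in between the matched pairs
def insertion_step(s, rules)
  out = s[0, 1]
  (0...s.length - 1).each do |i|
    rhs = rules.fetch(s[i, 2], s[i, 2])   # no rule? the pair passes through
    out << rhs[1..-2] << s[i + 1, 1]      # rhs minus its two "context" ends
  end
  out
end

rules = { "00" => "010", "01" => "021", "02" => "022",
          "10" => "100", "11" => "101", "12" => "112",
          "20" => "210", "21" => "201", "22" => "222" }

s = "10210"
5.times { puts s; s = insertion_step(s, rules) }

Running it prints the same series of strings we’re about to work out below.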

Let’s crank the handle a few notches, and see what we get. I’m starting from "10210" and inserting the rule-indicated character between each adjacent pair—in parallel—for a few steps:

10210
100220100
10010222210210010
100102100222222220100220100102100
10010210022010010222222222222222210210010222210210010210022010010
...

That’s a bit dense for my eyes, though like me you might be able to see some interesting patterns forming already. Here, let me insert some helpful visual spaces between the characters I’ve used as “context” in each step, so that things line up to show where I’ve inserted new characters in each parallel step. Notice that the characters in each string are retained in all subsequent strings in this process:

1               0               2               1               0
1       0       0       2       2       0       1       0       0
1   0   0   1   0   2   2   2   2   1   0   2   1   0   0   1   0
1 0 0 1 0 2 1 0 0 2 2 2 2 2 2 2 2 0 1 0 0 2 2 0 1 0 0 1 0 2 1 0 0
10010210022010010222222222222222210210010222210210010210022010010
...

A few things strike me as I work this all out by hand (and as always, fair warning that I did do it by hand, so there might be mistakes). One is that the strings seem to all be starting with the same sequence, something like "1001021002...". Let me check by re-deriving that bit specifically:

1 0 0 1 0 2 1 0 0 2...
1001021002201001022...
...

Yup. When I run these particular rules on the sequence "100102..." I get a new string that includes the same sequence. I can’t say whether this will eventually become some even-longer fixed point in subsequent iterations, but it feels like it might.

There’s also a big visible blob of "...222222..." growing in the middle. I didn’t plan that; it seems to just fall out of the choices I made in constructing this set of rules and applying them to this string. I mean sure… now that I look, the rule "22 -> 222" seems to be responsible, but I’m not convinced that it was enough to make one big block of "2" digits, instead of some other pattern that had many blocks of "222" between other digits.

Like I said, this is the first time you and I are seeing this process unfold. None of it is planned or cooked intentionally into the choices I’ve made.

More subtle—and something you might have reasonably missed if you didn’t re-type this by hand—is that there’s not a single copy of the pair "12" anywhere in any of these steps. I suppose if the initial string had had one, we’d possibly keep it over subsequent generations. But for this particular path, it sure looks like none will ever arise. A certain kind of enthusiast might be tempted to prove that, but not me.

But I am curious. Let me stick a "12" in there and see what happens….

00 -> 010
01 -> 021
02 -> 022
10 -> 100
11 -> 101
12 -> 112
20 -> 210
21 -> 201
22 -> 222

1   0   1...2   1   0
1 0 0 2 1 1.2 0 1 0 0
100102201011210210010

Interestingly, the "12" is retained (I’ve thrown in some dots to highlight it), but it also doesn’t “propagate”. There’s still only the one copy after a couple of steps, and I doubt any new ones can appear: the rule "12 -> 112" hands each copy of "12" forward into the next string, but no other rule in this set ever creates one. Huh.

The things you find when you play around.


As I said up above, there are all kinds of interesting and useful objects and dynamics in the field of Lindenmayer systems. You may already see the intimate relation between these parallel rewriting rules and the dynamics of cellular automata. And if you’ve got any background in formal languages, you might be eyeing that sequence of characters and wondering which words might appear over time, and which words might never appear given these insertion rules, and stuff about whether this is a kind of language which “accepts” certain words or not, and so forth. And when you glance at even the scant Wikipedia page, you’ll probably be quite taken with the pictures of branching plant structures, too.

But for now, let me focus on this, and call this particular specialized sort of Lindenmayer system an insertion system. By that I mean: The rules involve pairs of adjacent characters in a string for the left-hand-sides, and we retain those characters on all of the right-hand sides, while inserting zero or more new characters between them in the subsequent step.

Just to be thorough and explicit: There’s a lot of leeway here for how the matches and insertions work. These could be more complicated, and I’d still call them “insertion systems”.

  • I might have defined some rules that insert no characters, like "12 -> 12"
  • I might define rules that insert multiple characters, like "12 -> 12122"
  • I might have multiple rules matching the same left hand sides, though in that case I would also need to add a rule for disambiguating them. That “extra” rule might be “take the first in the list”, or “take the last”, or “pick one at random”.
  • I could define rules that insert values that don’t appear in the left-hand-sides of the matchers; for example, "01 -> 0-1". Maybe the hyphen indicates one string becoming two.
  • I could have skipped some patterns in my collection of left-hand-side matchers, on the assumption that if there’s nothing inserted, there’s also nothing deleted

Let me work through that last one. Above, I included a substitution rule for each of the nine possible pairs of digits, from "00" through "22". If I had skipped some, things would’ve still been fine. The lack of a rule doesn’t imply that unmatched characters are deleted, just that nothing is inserted between them.

Here’s an example of an abbreviated version of the same rules I used above. I’ve just thrown three away, and kept the other six:

00 -> 010
01 -> 021
10 -> 100
11 -> 101
12 -> 112
21 -> 201

When I apply this abbreviated set of matching rules to the same starting string, obviously I’m not going to have as much growth, because there are fewer inserted characters in each step now. But it turns out that I also get a qualitatively different string:

1                             0 2         1                           0
1               0             0 2 0       1               0           0
1       0       0     1       0 2 0 2     1       0       0   1       0
1   0   0   1   0 2   1   0   0 2 0 2 0   1   0   0   1   0 2 1   0   0
1 0 0 1 0 2 1 0 0 2 0 1 0 0 1 0 2 0 2 0 2 1 0 0 1 0 2 1 0 0 2 1 0 0 1 0
100102100 2010010 2 02100102100 2 0 2 0 20100102100 2010010 20100102100
...

There’s no growing blob of "...222..." here, and there’s something weird happening now with the sequence ...02020... that I don’t recall seeing before. But you can see that it’s still “fine” without all the rules.

Heck, I imagine you’re already thinking of how you might write a program to do this sort of thing, right? Or maybe you’re wondering about those “blocks” of patterns in the unfolding strings—there’s something interesting in the way the simple rules produce them without being very explicit. If you wanted to say “emergence” quietly to yourself, I’d be fine with that.

Those are excellent and interesting things to be thinking about. I expect we’ll explore them together shortly.

But please notice one thing: In all of this so far, I am using a single set of insertion rules to rewrite consecutive “steps” of the string. I’ve changed the rules a couple of times between examples, but they’ve stayed the same between steps.

Not for long!

Self-rewriting strings

No doubt you’ve noticed some similarities between these two things I’ve discussed. I first asked “What’s the shortest string of digits that has all adjacent pairs in it?” And then I showed you a little bit about a tiny slice of the interesting world of L-systems, specifically about what I’m calling “insertion systems”, where we use a collection of matching rules to rewrite a string of digits in parallel.

Let’s see how they taste together.

First let me define a way of interpreting a string as a set of overlapping insertion rules. For any string of characters, let me break it into all of its (overlapping) triplets, and then convert them into insertion rules using a rule like this: “Triple XYZ means "XZ -> XYZ".”

Let me work through the string "102101210002" and find the rules it encodes, using this approach.

102101210002
102          : 12 -> 102
 021         : 01 -> 021
  210        : 20 -> 210
   101       : 11 -> 101
    012      : 02 -> 012
     121     : 11 -> 121
      210    : 20 -> 210
       100   : 10 -> 100
        000  : 00 -> 000
         002 : 02 -> 002

Unsurprisingly, scanning through this 12-character string and gathering all the 3-character substrings produces \(12-2=10\) “matching rules”. To the right I’ve translated the triples into the equivalent insertion rules. For example, "021" means "01 -> 021", or “insert 2 between each adjacent pair "01"”.

Also unsurprisingly (for a string of this length), there are a few cases where the same left-hand side appears in multiple rules. As I mentioned above, we’ll need to think about how to deal with these extraneous (and sometimes contradicting) rules. For now, let me try what feels like the simplest approach: Use the first one that appears, and ignore any later ones with the same left-hand side.

So in the example, I’ll use "11 -> 101", and not "11 -> 121". Again, there are a wide variety of ways to cope with this, and in fact we’ll come back later and explore those. But for now, let me try “do the first one, ignore any subsequent ones”.
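In Ruby, that first-wins scan is a tiny sketch (the function name is mine):

# every overlapping triple XYZ encodes "XZ -> XYZ"; the ||= keeps the first
# rule defined for each left-hand side and quietly drops the later ones
def rules_from(s)
  rules = {}
  s.chars.each_cons(3) { |x, y, z| rules[x + z] ||= x + y + z }
  rules
end

p rules_from("102101210002")
# => {"12"=>"102", "01"=>"021", "20"=>"210", "11"=>"101",
#     "02"=>"012", "10"=>"100", "00"=>"000"}

That output is exactly the simplified table we’re about to draw.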

That means I can simplify the diagram a bit by skipping the triples and rules that are already set:

102101210002
102          : 12 -> 102
 021         : 01 -> 021
  210        : 20 -> 210
   101       : 11 -> 101
    012      : 02 -> 012
       100   : 10 -> 100
        000  : 00 -> 000

Well then. We’re ready. "102101210002" encodes those seven rules.

What happens when I apply them to the string from which we derived them? Let me insert some spaces between each pair, and just show the insertions one at a time, moving left to right. I could do them all at once, but this extra expansion might help explain what’s happening in a more visual way.

102101210002
102          : 12 -> 102
 021         : 01 -> 021
  210        : 20 -> 210
   101       : 11 -> 101
    012      : 02 -> 012
       100   : 10 -> 100
        000  : 00 -> 000

1 0 2 1 0 1 2 1 0 0 0 2
100 2 1 0 1 2 1 0 0 0 2
10012 1 0 1 2 1 0 0 0 2
10012 1 0 1 2 1 0 0 0 2
10012 100 1 2 1 0 0 0 2
10012 10021 2 1 0 0 0 2
10012 1002102 1 0 0 0 2
10012 1002102 1 0 0 0 2
10012 1002102 100 0 0 2
10012 1002102 10000 0 2
10012 1002102 1000000 2
10012 1002102 100000012

100121002102100000012

It turns out there wasn’t any rule in the table to handle the pair "21", so when I was done substituting I just snugged them back up. Which you’ll recall is the default behavior I set for missing rules.

Can we do it again? Of course! I’ll just go ahead and skip the lines that try to re-assign the same pair on the left-hand sides of the rules I find, and then dive right into the insertion with those new rules:

100121002102100000012
100                   : 10 -> 100
 001                  : 01 -> 001
  012                 : 02 -> 012
   121                : 11 -> 121
    210               : 20 -> 210
         102          : 12 -> 102
             000      : 00 -> 000

1 0 0 1 2 1 0 0 2 1 0 2 1 0 0 0 0 0 0 1 2
100 0 1 2 1 0 0 2 1 0 2 1 0 0 0 0 0 0 1 2
10000 1 2 1 0 0 2 1 0 2 1 0 0 0 0 0 0 1 2
1000001 2 1 0 0 2 1 0 2 1 0 0 0 0 0 0 1 2
100000102 1 0 0 2 1 0 2 1 0 0 0 0 0 0 1 2
100000102 1 0 0 2 1 0 2 1 0 0 0 0 0 0 1 2
100000102 100 0 2 1 0 2 1 0 0 0 0 0 0 1 2
100000102 10000 2 1 0 2 1 0 0 0 0 0 0 1 2
100000102 1000012 1 0 2 1 0 0 0 0 0 0 1 2
100000102 1000012 1 0 2 1 0 0 0 0 0 0 1 2
100000102 1000012 100 2 1 0 0 0 0 0 0 1 2
100000102 1000012 10012 1 0 0 0 0 0 0 1 2
100000102 1000012 10012 1 0 0 0 0 0 0 1 2
100000102 1000012 10012 100 0 0 0 0 0 1 2
100000102 1000012 10012 10000 0 0 0 0 1 2
100000102 1000012 10012 1000000 0 0 0 1 2
100000102 1000012 10012 100000000 0 0 1 2
100000102 1000012 10012 10000000000 0 1 2
100000102 1000012 10012 1000000000000 1 2
100000102 1000012 10012 100000000000001 2
100000102 1000012 10012 10000000000000102

Hmm… it seems like we have a pattern developing here, doesn’t it? It’s also interesting to compare the rule tables as we go: the table derived from "102101210002" held seven active rules, and the one derived from "100121002102100000012" holds seven as well.

Also interesting is the fact that not all the rules are quite the same:

102101210002
102                   : 12 -> 102 *
 021                  : 01 -> 021
  210                 : 20 -> 210 *
   101                : 11 -> 101
    012               : 02 -> 012 *
       100            : 10 -> 100 *
        000           : 00 -> 000 *

100121002102100000012
100                   : 10 -> 100 *
 001                  : 01 -> 001
  012                 : 02 -> 012 *
   121                : 11 -> 121
    210               : 20 -> 210 *
         102          : 12 -> 102 *
             000      : 00 -> 000 *

Honestly, if you’d asked me, I’d have guessed that we’d have a completely new set of rules, not… (let me count) five starred rules in common between these dramatic rewritings of the underlying string.

Now I’m curious.

10000010210000121001210000000000000102
100                                    : 10 -> 100 *
 000                                   : 00 -> 000 *
    001                                : 01 -> 001 *
      102                               : 12 -> 102 *
        210                             : 20 -> 210 *
             012                        : 02 -> 012 *
              121                       : 11 -> 121 *

Whoa. The next step after that also has the same seven rules as the prior step. Isn’t that odd? It feels odd. I’d have expected something to change, with all those opportunities for fiddling. I mean, the fronts of the strings—where I’d expect the rules to “stack”—don’t even look much alike to my eye. And what’s weirder is that the rules don’t appear in the same order in the strings. Do you think the next (and subsequent) steps have the same table? Will it ever change?


I don’t know much more about that. You should now have a sense of the direction it’s going, though. Let me think out loud for a while about what’s happening here… and also what’s feasible.

small things

First, let’s think about which strings are “active” in the sense of (1) encoding any rules at all, and (2) being capable of rewriting themselves.

For teeny-weeny strings of zero or one character, there are no pairs, and therefore no places to apply insertions anyway. I mention them only for completeness. But if we were to wave this process at those baby strings, they’d just stay the same, right? There are no rules, and no pairs to match. The result of iterating the process over "9" is… "9" because we don’t discard characters, we only add new ones if we can. So "9" stays "9" forever.

And then there are two-character strings that have one pair, but can’t encode even one insertion rule. So applying the process to "20" also doesn’t do anything. No rules encoded by the string, so the default applies to the one pair present: "20 -> 20". Boom we’re done, and like "9" it’s what you might call a “stable attractor”.

But notice that there are also some middling-sized strings—big enough to encode insertion rules—where we might also get an interesting dud. Remember the example I used when I first showed you an insertion system, above? "10210", I think it was.

10210
102   : 12 -> 102
 021  : 01 -> 021
  210 : 20 -> 210

1 0 2 1 0
1 0 2 1 0
1-0-2 1 0
1-0-2-1 0
1-0-2-1-0

10210

This string encodes three rules, but none of those matched patterns actually appear in the string; I’ve jammed hyphens in to show the pairs I’ve checked as I “process” it. So "10210" is another stable attractor under this self-rewriting process, though for a different reason than the tiny strings like "20" and "9".

But I see there’s something else interesting about "10210": it encodes the maximum possible number of rules for its length. There are three different insertion rules encoded in five characters. Some other five-character string, for example "00000", might encode fewer rules because of the disambiguation rule. We only have one, then: "00 -> 000".

If we think about strings of moderate size like this, it feels like we could fit a whole bunch of insertion rules into one, doesn’t it? Now fitting the maximum number of these three-letter rules into one string is slightly different from my earlier question about fitting triples in, right? Because the rule I’ve specified for disambiguating them is to use the first rule that matches on the left-hand side.

Let me try to phrase a new question, then:

How long a string can we build, where none of the rules encoded in it matches any pair in that string? Obviously it will depend on the alphabet we’re using, so let’s limit ourselves to ten digits. What’s the longest string we can build from digits, which contains no insertion rules that match the encoding string itself?

That’s something close to my question about the shortest string that holds all three-character substrings, isn’t it? It’s not that we want to jam as many different triples into one string as possible, or even to be minimally long; now we want triples whose first-and-third characters all differ from one another’s, and where none of those first-and-third pairs shows up anywhere in the string as an adjacent pair. (That distinctness condition matters, by the way: without it the “longest” question degenerates, since a string like "010101…" only ever encodes "00 -> 010" and "11 -> 101", neither of which matches its adjacent pairs "01" and "10", and so it can be extended forever.)

Sounds like constraint programming to me.
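Though before reaching for a constraint solver, a dumb depth-first search is enough to poke at the small case. Here’s a Ruby sketch over the alphabet [0 1 2] that bakes in both conditions (no encoded rule may match an adjacent pair, and no left-hand side may repeat), since, per the note above, the distinctness condition is what keeps the search finite. The names are mine:

# a string is "self-inert" if none of its encoded left-hand sides occurs as
# an adjacent pair, and no left-hand side is encoded twice
def self_inert?(s)
  lhs   = (0..s.length - 3).map { |i| s[i, 1] + s[i + 2, 1] }
  pairs = (0...s.length - 1).map { |i| s[i, 2] }
  (lhs & pairs).empty? && lhs.uniq.length == lhs.length
end

# once a candidate fails, every extension of it fails too, so we can prune
best = ""
extend_search = lambda do |s|
  best = s if s.length > best.length
  %w[0 1 2].each do |c|
    t = s + c
    extend_search.call(t) if self_inert?(t)
  end
end
extend_search.call("")
puts best.length
puts best   # the longest self-inert string this search happens to find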

growing things

Now let’s think about that weird observation that the example I picked above, "102101210002", seemed to so quickly “stabilize” on a single fixed-feeling rule set during iteration. I mean to say: I don’t know that the rules were “fixed”, not least because they seem to be changing order in each iteration. But the fact that nothing new arose in several steps seems noteworthy. Maybe it keeps on doing that? ¯\_(ツ)_/¯

Could this be due somehow to the limited alphabet of three characters I started with? Or maybe it’s the relatively small number of rules possible in that alphabet? If I’d worked out an example with all ten digits, I suppose we could find anywhere up to 100 rules in a single long-enough string. Though it’s not clear to me from the previous section how likely it is for an arbitrary string of digits to have a target for a rule in it.

Actually, let me box that one up, too:

For a string of \(n\) characters, sampled uniformly from the digits "0" through "9", what’s the likelihood that the string will encode a rule that matches a pair anywhere inside it?
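That one, at least, is easy to poke at numerically. Here’s a little Monte Carlo sketch (my naming, and rough trial counts) that estimates the likelihood for a few values of \(n\):

# does s encode at least one rule whose left-hand side occurs as an
# adjacent pair somewhere in s itself?
def self_matching?(s)
  lhs   = (0..s.length - 3).map { |i| s[i, 1] + s[i + 2, 1] }
  pairs = (0...s.length - 1).map { |i| s[i, 2] }
  !(lhs & pairs).empty?
end

trials = 10_000
[5, 10, 20, 50].each do |n|
  hits = trials.times.count { self_matching?(Array.new(n) { rand(10).to_s }.join) }
  puts "n=#{n}: #{hits.to_f / trials}"
end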

The string "102101210002" encodes eight insertion rules, but it’s certainly odd to me that the iterative application of those rules to the string produces a new result that’s longer, and which has fewer rules in it. I’ve got to wonder how long that keeps up.

Obviously we also can’t ever gain new characters (new symbols, I mean) through iteration and insertion. The rules are encoded explicitly in these strings, so the only source for an inserted character is a copy of one from elsewhere in the string. Indeed, there’s an interesting diagram one could draw, that links each character in a string to the position in its “ancestor” where it was copied from.

But it’s clearly possible to encounter new rules, using the existing characters. "102101210002" contained the rule "01 -> 021", while the next string in the series "100121002102100000012" contained "01 -> 001" instead. Regardless of the alphabet size \(a\), there are only \(a^3\) possible rules, so in the case of three characters we only have 27 possible rules. But for decimal numbers, there are 1000.

One wonders how many of those alternative rules a given string might visit, as we translate it and build the next tables. Because of the disambiguation rule I’ve chosen (take the first example of a left-hand side only), the maximum number of rules that a string—as long as it’s long enough for all of them to fit—can have active is only \(a^2\). For a three-character alphabet, we can sample only nine of these 27 rules before disambiguation kicks in. In a single decimal-alphabet string, we’d encode a maximum of a hundred rules before we started to revisit left-hand match patterns.

So I poked at a terminal irb session a little (no need to inflict the code I kludged together on you), just to have a look. I made a random 100-character string of digits:

"3629424442008315423594250724507295647577632898186048388790102585677370034179354191660842207527543565"

That encodes these 62 unique rules. It’s not 98, because I’ve only shown the first one defined for each adjacent pair (so 36 of them were too late in the string to count):

{"32"=>"362", "69"=>"629", "24"=>"294", "92"=>"942", "44"=>"424", "42"=>"442", "40"=>"420", "20"=>"200", "08"=>"008", "03"=>"083", "81"=>"831", "35"=>"315", "14"=>"154", "52"=>"542", "43"=>"423", "25"=>"235", "39"=>"359", "54"=>"594", "45"=>"425", "57"=>"507", "02"=>"072", "74"=>"724", "79"=>"729", "96"=>"956", "67"=>"647", "77"=>"757", "76"=>"776", "73"=>"763", "62"=>"632", "38"=>"328", "29"=>"289", "88"=>"898", "91"=>"981", "16"=>"186", "80"=>"860", "64"=>"604", "87"=>"887", "89"=>"879", "70"=>"790", "00"=>"010", "12"=>"102", "05"=>"025", "28"=>"258", "55"=>"585", "86"=>"856", "30"=>"370", "04"=>"034", "31"=>"341", "47"=>"417", "19"=>"179", "95"=>"935", "34"=>"354", "51"=>"541", "49"=>"419", "11"=>"191", "60"=>"660", "68"=>"608", "82"=>"842", "27"=>"207", "72"=>"752", "53"=>"543", "36"=>"356"}

And of those, only 44 of the left-hand sides appear in the string at all. Meaning: we’d insert a new character at each matching pair (44 of them, if each of those pairs occurs just once), and end up with a new string around 144 characters long. I haven’t bothered fiddling the kludgy code I’ve written to see what happens after that, but this feels like it’s the direction I’d go to explore the questions I’ve posed above.
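If you want to reproduce that sort of poking without my kludge, the two little functions sketched earlier (rules_from and insertion_step) are all it takes. This is my reconstruction, not the code I actually ran:

# derive a first-wins table from a random 100-digit string, then apply the
# table to the very string that encoded it
s = Array.new(100) { rand(10).to_s }.join
rules = rules_from(s)
matched = (0...s.length - 1).count { |i| rules.key?(s[i, 2]) }
puts rules.size                        # around 60 unique rules, give or take
puts matched                           # one insertion per matched pair
puts insertion_step(s, rules).length   # 100 plus that many new characters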

I wonder how long a random decimal string would need to be in order to encode 100 rules? That’s not a hard-sounding question in probability theory (it has a coupon-collector feel to it), but it feels salient.

But still, there are 1000 possible decimal rules overall. My random pick above only manages to encode 62 of them (the string’s other 36 triples being masked by earlier left-hand sides). This next (final?) question really depends on the ways in which repeated applications of these rules to the strings that encode them plays out. Some strings might grow in a way that expands out into the space of rules. Other strings (like the example I used above) might stagnate or putter around in a small attractor of some sort. But I wonder what we can see if we look for the most expansive ones of all?

Are there strings which, when decoded into insertion rules and applied to themselves, manage to visit all the possible (single-character) insertion rules for the alphabet they contain?

That is: If the alphabet is three characters, are there strings (of any length) that manage to visit all 27 possible rules? There can’t be strings that contain all 27 rules in “active” form, because only the first rule defined for each of the nine possible left-hand sides in any single string is ever active. So clearly I’m asking for a dynamic solution: a string that becomes a string that becomes a string, and so on, where all 27 possible rules end up being active.

And are there, by extension, decimal strings that eventually visit all thousand possible rules? That sounds pretty doubtful, doesn’t it? But maybe there are “tricks” somewhere in all these dynamics.

Probably I could write a real program and take a look, huh?
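Such a program needn’t be much more than a loop around those same two sketches. Something like this, say, which tallies every rule that ever becomes active as a seed string rewrites itself:

# iterate a seed, accumulating every "LHS -> RHS" rule that is ever active,
# to see how much of the possible rule space one seed visits over time
seen = {}
s = "102101210002"
8.times do
  rules = rules_from(s)
  rules.each { |lhs, rhs| seen["#{lhs} -> #{rhs}"] = true }
  s = insertion_step(s, rules)
end
puts seen.size   # out of the 27 possible rules for the alphabet [0 1 2]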

further and more

I’ve made some arbitrary decisions here. We might want to explore alternatives, or just think about some of those alternatives until we better understand the ones we’ve got here.

finding rules

I’ve only described one kind of “interpreter rule” for deriving a set of insertion rules from strings. Here I’ve used “XYZ means insert Y between XZ”, but in an earlier exploration I tried “XYZ means insert Z between XY”. That was… well, definitely different. You might poke at it and try to decide why I changed to the current one.

But there are numerous other reasonable-seeming approaches. One that might be interesting would be to have “skip triplets”, as in “X#Y#Z means insert Y between XZ, where # is any single character.” In this version, the specific characters used to derive a rule aren’t adjacent, but one apart. That means there are fewer rules, but they also have a different relation to one another:

102101210002
1 2 0        : 10 -> 120
 0 1 1       : 01 -> 011
  2 0 2      : 22 -> 202
   1 1 1     : 11 -> 111
    0 2 0    : 00 -> 020
     1 1 0   
      2 0 0  : 20 -> 200
       1 0 2 : 12 -> 102

1   0 2 1   0   1   2 1   0   0   0 2
1 2 0 2 1 2 0 1 1 0 2 1 2 0 2 0 2 0 2
1202120110212020202
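As code, that variant is a two-line change to the rules_from sketch from earlier (my naming again):

# skip-triple version: the characters at positions i, i+2, and i+4 encode a
# rule, still first-wins on the left-hand side
def skip_rules_from(s)
  rules = {}
  (0..s.length - 5).each do |i|
    x, y, z = s[i, 1], s[i + 2, 1], s[i + 4, 1]
    rules[x + z] ||= x + y + z
  end
  rules
end

p skip_rules_from("102101210002")
# => {"10"=>"120", "01"=>"011", "22"=>"202", "11"=>"111",
#     "00"=>"020", "20"=>"200", "12"=>"102"}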

Or we could break the string into non-overlapping chunks, but still apply our insertions to the overlapping pairs. Or or or or….

picking rules

I’ve picked one arbitrary disambiguation rule, but there’s no way it’s the “right” one. We could’ve picked the last defined one for a given left-hand side. We could’ve permitted stochastic expansion, picking one of the successful matches with uniform probability. We could’ve been really mean, and decided that only rules that match unique left-hand sides can work, and as soon as a second triple comes along that shares the first and last characters, we destroy all of them for the rest of the process.

They could be arbitrarily baroque, in other words.

Why did I pick these at all? Because it was something I could easily draw in text diagrams.

simple rules

Why am I only inserting a single character? No reason I can think of, except that “every rule is expressed as a triple” was easy enough.

There might be an even easier one: Express each rule as a pair, and replace the left-hand side with the right-hand side. But we don’t end up with a lot of rules that way, do we? And the strings wouldn’t expand. I like that they expand.

rewriting itself

We don’t need to rewrite every occurrence of a match at the same time. We might pick a random pair of characters from somewhere in the string, then pick a random match from the set of all encoded rules, and replace it, and repeat… but that wouldn’t have these nice “steps”, would it. Or we could do something like a Cyclic Tag-system, where we cycle through the set of rewriting rules, and apply each to the first place it applies, and then move on.
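That stochastic variant is easy enough to sketch, too: one random matching site, one insertion, repeat (my naming, as usual):

# pick one random matching pair and apply a single insertion there, instead
# of rewriting every match in parallel
def random_single_step(s, rules)
  sites = (0...s.length - 1).select { |i| rules.key?(s[i, 2]) }
  return s if sites.empty?
  i = sites.sample
  rhs = rules[s[i, 2]]
  s[0..i] + rhs[1..-2] + s[i + 1..-1]
end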

And so on.

The penumbra of a problem

All of these “or or or” alternatives might be better or more interesting or more practical or more fun than the one I’ve spelled out here. But the point is not the maximization of utility, it’s the intentional step out into the lesser-known algorithmic world. As I said, a lot of this stuff is in fact useful in modern computational science, not to mention all the business applications from bioinformatics to bitcoin.

My point in setting up these exercises is that they are poised on the edge of big sprawling lightly-settled chunks of territory. I mean look: Do you recall maybe five hours or so ago, when you were reading along earlier in this same essay, and I mentioned rules that insert a character that appears on neither side of the match, like "01 -> 0-1", where “maybe the hyphen indicates one string becoming two”?

Did you notice that?

Now think about that throw-away notion for a moment, please. I’m here talking about treating a single string as a source of rules to re-write the string. But what if the string contains a separator, like that hyphen (or a space)? What is that, then? Well, it’s kind of two separate strings, isn’t it? We might write them out in one set of quotation marks like "abc def", but if we treat the delimited chunks as “separate”… well, are their rewriting rules also separate then?

Or if we decide to treat it as a single string for translation, do we now get a new rewriting rule that inserts a space, and then (potentially) we’ll have two, and then four, and so on? We’ll have a propagating collection of chunks separated by spaces… a string ensemble if you like.

What characteristics will that ensemble have?

What might we do with it?

And that’s the point. We do this, so we can also see the closed door standing mysteriously out there in the middle of that empty field. It can’t be opened until it’s been seen.

  1. Actually, yes! That’s exactly what a de Bruijn sequence gives you: take the 1000-character cyclic sequence in which every triple appears exactly once, cut it open anywhere, and repeat the first two characters at the end. All 1000 triples, in 1002 characters.