Looking for pattern-avoiding sets of numbers

Draft of 2019.04.07

May include: mathematical recreations ↘ Number Theory ↗ &c.

It’s seems remarkably easy to find problems and patterns in Number Theory that “feel” simple, but turn out to be tucked down low near the base of a big overbalanced stack of advanced and esoteric mathematics. In another piece, I recently wandered around over near Pell’s Equation, and discussed its use as a new “GP benchmark”.

An offhand passage from some other source—which I have now totally lost among my own overbalanced mental stacks—has made me think of a new one. Whoever it was that inspired this bit, either on Twitter or somewhere in a library book¹ had been talking about taking integers and… probably multiplying them? And then looking at the resulting product (?) value and noticing that it didn’t contain a certain… digit?

¯\_(ツ)_/¯

As I write this, I’m starting to be more confident that it was something about digit-avoiding. Maybe “the product doesn’t have a 9 in it,” or something along those lines.

At any rate, that was enough to expand into some awful flower of Rather Quite Difficult problem-posing in my head. Here’s where I’ve ended up:

Pattern-avoiding sets of integers

Suppose I give you a set \(N\) containing integers (in base-10 notation, though that’s one of those parameters we can twiddle later if you want). Say, for example, I give you the set of integers \(0 < n < 1000\). Here I’ve specified a range of values, but \(N\) could be any set of positive integers (possibly including zero), and of any size.

I also specify a forbidden pattern, also in the form of a positive “integer”… though we won’t actually treat it quite that way in practice. Call this pattern \(p\).

The score of the set \(N\) is the sum of:

the number of items of \(N\) which contain at least one contiguous occurrence of the digits of \(p\), plus
the number of items in the set of pairwise sums of \(N\) which contain at least one contiguous occurrence of the digits of \(p\), plus
the number of items in the set of pairwise products of \(N\) which contain at least one contiguous occurrence of the digits of \(p\).

So for example, say I specify

\[N = \{152, 231, 276, 427, 440, 706, 715, 741, 745, 756\} \\ p = 11\]

Turns out there are no occurrences of the string pattern "11" in any of those ten numbers. Here though are the pairwise sums:


[383, 428, 579, 592, 858, 867, 893, 897, 908, 507, 658, 671, 937, 946, 972, 976, 987, 703, 716, 982, 991, 1017, 1021, 1032, 867, 1133, 1142, 1168, 1172, 1183, 1146, 1155, 1181, 1185, 1196, 1421, 1447, 1451, 1462, 1456, 1460, 1471, 1486, 1497, 1501]

If you look through the list, you’ll see 1133, 1142. 1168, 1172, 1183, 1146, 1155, 1181, 1185, 1196. That’s ten copies of “11” in the sums.

And here are the pairwise products:


[35112, 41952, 64904, 66880, 107312, 108680, 112632, 113240, 114912, 63756, 98637, 101640, 163086, 165165, 171171, 172095, 174636, 117852, 121440, 194856, 197340, 204516, 205620, 208656, 187880, 301462, 305305, 316407, 318115, 322812, 310640, 314600, 326040, 327800, 332640, 504790, 523146, 525970, 533736, 529815, 532675, 540540, 552045, 560196, 563220]

Here we have more digits, and more “tries”. But interestingly. there are only seven copies of "11" present, in 35112, 112632, 113240, 114912, 171171, 117852, and 318115.

So the score of the set {152, 231, 276, 427, 440, 706, 715, 741, 745, 756} is \((0+10+7) = 17\).

Compare that with this starting point:

\[N = \{67, 101, 122, 370, 375, 384, 393, 445, 486, 838\} \\ p = 11\]

Here, there are no copies of the pattern "11" in the values I give you, and also no copies in the set of pairwise sums or products. The score of this set on this pattern is zero.

Some unsolved (by me) problems

Now those two little examples are just random sets of ten integers I found by sampling with some Ruby code (down at the bottom of this essay) and poking around. But when I tug on this little tidbit, a lot of interesting things feel like they’re connected. Here are the ones that have come to mind for me, so far.

The largest pattern-avoiding subset

Suppose I give you some \(N\) of integers, and ask you to remove items from it until the values remaining, and their pairwise sums, and their pairwise products all lack the bad pattern \(p\). What is the largest subset of \(N\) that avoids \(p\)?

For example, if I give you \(N\) as the integers \(i, 1 \leq i \leq 999\), and pattern \(p = 11\), you’d obviously need to remove all values of the form “11*” and “*11” to avoid the simplest violations. That is, you have to pitch 110 and 111 and 119, and also 711 and 311 and certainly 111.

But then given the values you have remaining, you’ll almost certainly have pairs that sum or multiply to produce results containing “11”. And obviously you can remove every remaining value that participates in all those sums and products and be absolutely certain no violating values remain.

But that seems over-enthusiastic. It seems to me that if you remove certain “key” items, perhaps those that participate in multiple “violating” interactions, you could remove far fewer items to satisfy the specified constraints.

So what’s the largest subset we can retain?

How does one go about finding that subset? The “greedy” heuristic I’ve already spelled out—“Throw any value away that participates in any sum or product resulting in a violation”—is a bit heavy-handed. But as we step “away” from it, and try to be a little more thoughtful… one doesn’t immediately see what to look for. That is, what are the features of the numbers that we should pay attention to, besides “produce a sum or product”?

What does it even mean, mathematically or computationally, for a certain number to “cause” a pattern appear in a sum or product? Indeed, is this a property of any single number, or is it something to do with pairs of numbers? And if so, where in the pair does the “stuff” live that has a role in it?

And in my head, I’m hearing the word “hypergraph” bubbling up. That may not be the case for you… but it is for me.

The hard-and-easy pattern problem

In the examples above, I used the avoided pattern “11”. What happens if I specify \(p = 9\), or \(p=883\), without changing the initial set of integers \(N\)?

When \(p=9\), it certainly feels as if the problem has gotten “harder”. That is, there are more opportunities for the single digit “9” to appear in a given set \(N\) of integers, and in their pairwise sums and products.

Similarly, when \(p=883\), it feels pretty unlikely. There’s only one number below 1000 that contains that pattern, and not a lot in the pairwise sums, either.

But there are subtler reasons why the score of two patterns might differ. Consider the case of \(p=11\) vs \(p=12\). Here, at least for numbers of relatively few digits (say up to four), there’s actually a real difference in the number of “copies” of the pattern we can fit into any given integer. You can have two copies of "11" in a three-digit integer, but you can’t have two copies of "12"!

And on the easy-peasy end of the spectrum, if I specify \(N\) as the integers between \(1\) and \(1000\), and say that \(p=12345678\), well… I think we can agree that the score is easy to calculate. There will be no copies of that pattern of digits anywhere in \(N\), nor in its pairwise sums and products. It just doesn’t fit.

Thus, for any set \(N\) and pattern \(p\), there will be some particular, measurable score. We can look and see, if nothing else.

So here are some more open questions:

For a given set \(N\) of integers, what are the worst-scoring patterns \(p\)? That is, which patterns produce the highest score?
For a single set \(N\), and two patterns \(p_1\) and \(p_2\), can one predict which of the patterns will have the better or worse score, without doing the explicit counting?

This latter one especially feels interesting. Obviously there are heuristics: if only one pattern is literally unable to fit in the set or the derived sets, it’s more likely that the other one will have a higher score. But not given.

Similarly, there are some odd-feeling things happening when I speak of pairwise sums vs pairwise products. Those are qualitatively different patches of number theory, algorithmically….

There are also questions here about algorithms. Which, for me, are kinda sorta the point.

The self-avoiding set problem

Suppose I give you a set \(N\) of integers, as before. But instead of specifying a particular \(p\), I tell you to select any one of the elements of \(N\) as \(p\).

Obviously, there will already be one copy of \(p\) already present, so the best possible score the set can receive for any one of its own elements will be \(1\). But that’s only possible when there are also no pairwise sums or products that happen to “duplicate” \(p \in N\) in their representations.

Are there any sets of integers \(N\), with two or more elements, that have a score of \(\|N\|\)? That is, where the only occurrence of any element is in the set, not in the sets of sums or products? Sure! If I specify \(N=\{3, 4\}\) then the pairwise sums and products are \(\{7\}\) and \(\{12\}\), respectively, and there are no copies of either \(p=3\) or \(p=4\) in those.

But sets of two items are boring.

What’s the largest set of integers we can construct such that the pairwise sums and products of all its elements avoid containing any (additional) copies of any element?

The open-ended do-it-forever problem

You knew this was coming, surely.

In the play above, I’ve talked about taking set \(N\), producing the set of pairwise sums (call this \(N_{+}\)) and pairwise products (call this \(N_{\times}\)). The score is defined as the number of occurrences of the pattern \(p\) in all those.

Call that “all those” set of values, sums and products \(N' = (N \cup N_{+} \cup N_{\times})\).

Now so far, I’ve been counting the contributions of \(N_{+}\) and \(N_{\times}\) separately, and under certain circumstances it’s entirely possible that a value might appear in two of the three sets; for example, what if \(0 \in N\)? When we add up the three contributions to get a score, I may therefore be “double counting”.

I’m good with that, but I want to spell it out before this next step.

If I define \(N' = (N \cup N_{+} \cup N_{\times})\), we can obviously continue this process. That is I could produce \(N'' = (N' \cup N'_{+} \cup N'_{\times})\), and so on.

Those sets \(N'\) and \(N''\) and so forth are getting very large very quickly. And their elements are also getting very large—with more digits—as well. At some point, there are going to be some very long strings of digits in play. It feels almost inevitable that for some pattern \(p\) we will stumble across it in \(N'\) (the “basic” problem outlined above), but if not then maybe in \(N''\), or certainly somewhere not much farther along.

Are there sets \(N\) and patterns \(p\) for which we will never encounter \(p\), anywhere in the infinite series of recursively-applied expansion?

Even if there aren’t such things, it seems as though there will (probably?) be a few “better” choices of \(N\) and \(p\), where the number of iterations we need to stumble over the first \(p\) is large. What are those like?

What do you mean “pairwise”?

Another problem-generating heuristic I’ve avoided so far should also be bubbling up in your mind, if you’re like me: Why did I say “pairwise” sums and “pairwise” products? What about sums and products of triples? Or addition and multiplication mapped onto all subsets of \(N\)?

Why not try those, Tozier?

Well, first, sure. OK, I will call you on that and happily heat up my laptop looking around. But bear in mind that the scores I’ve defined here still apply, and so yes there are probably some very cunning and subtle choices to be made for \(N\) and \(p\) that give wildly differing scores.

In fact, it feels as though this is probably the most interesting-but-difficult way to think about these problems. We’re quickly going to launch ourselves into the parts of Number Space where things need to be theoretical to be tractable. It’s hard for me to, for example, calculate the whole collection of subset-wise sums and products for a 1000-element set, for example.

Go ahead, work out how many subsets of a 100-element set there are. Got it? Now you see that there are \(2^{100} - 1\) of those, at least if you avoid including an empty set. So by all means start your laptop working out all the sums and products over all of those.

I’m not even joking. While you’re waiting, I suspect something else will present itself. If not to you, then to your descendants.

For instance, given a set \(N=\{2,4,6, 20, 40, 60\}\), and pattern \(p=7\), what do you think this “super-score” of sums and products over all subsets of \(N\) will be? Will it be large? Small? Compared to what?

And yet for \(N=\{1,2,3,10,20,30\}\) and \(p=10\) I think it might be a higher score. I could be wrong. Something interesting in that.

Now: How far can you push it?

Why?

These are all perfectly reasonable mathematical explorations, but as I mentioned above, my interest is even more esoteric. I’m gathering examples of problems for which genetic programming might turn out to be useful. There are simple abstract data structures here, and there are well-known operators and groups involved in the Number Theory and algorithms people might bring to bear. So the big question for me is: Given an unsolved problem that’s this easy to specify, and a toolkit that includes all the parts and patterns that real hyoo-man mathematicians might invoke, what can GP come up with?

Can these be solved with arithmetic and a little conditional logic? With set theory? Compass-and-straightedge constructions? Gears and cogs? Are there approximations along the way, and are there insights to be gleaned from seeing the places where automated search finds a foothold?

Yes, of course. For that, though, we’ll all have to wait a little while and see.

The Ruby code I used to poke around

def blocked?(number,pat_string)
  number.to_s.include?(pat_string)
end

def remove_pattern(set,p)
  pat = p.to_s
  return set.reject {|n| blocked?(n,pat)}
end

def count_pattern(set,p)
  pat = p.to_s
  return set.find_all {|n| blocked?(n,pat)}.count
end

def pairwise_sums(set)
  return set.combination(2).collect {|p| p[0]+p[1]}.uniq
end

def pairwise_products(set)
  return set.combination(2).collect {|p| p[0]*p[1]}.uniq
end

def random_subset(set,size)
  return set.sample(size)
end

range = 1000
s = (1..range)
p = 11

1000.times do |i|
  size = 10
  subset = s.to_a.sample(size)
  bad = count_pattern(subset,p)
  bad_sum = count_pattern(pairwise_sums(subset),p)
  bad_prod = count_pattern(pairwise_products(subset),p)
  puts "#{i}, #{subset.sort}, #{bad}, #{bad_sum}, #{bad_prod}, #{bad_sum+bad_prod}"
end

You start to see the problem, I suppose…. ↩