Permutation Puzzle: “send+more=money”

John M. Dlugosz

May 18, 2018

CPOL


“Know your libraries!” You may already have the code you need.


I saw this puzzle on Bartosz Milewski’s blog, with an entry on using monads in C++. I’d like to hold it up as an example of a completely different lesson for C++ programmers: know your libraries.

I recall it was in the early ’90s, when the STL for C++ was submitted as a proposal for inclusion in the upcoming standard. I noticed there were algorithms called next_permutation and prev_permutation, and wondered how they work: how do you order the permutations and rearrange your collection into the next one without keeping any auxiliary state? Then I wondered what I would ever use such a thing for. Well, nearly 25 years later, I found out!

You should look through the list of algorithms every once in a while just to see what’s there. Otherwise, you might only know about the common ones that you use. Consider the analogy with a tool (say, a special bent screwdriver only used to attach doorknobs) that you know is in the very back of the drawer, though you may need to rummage around to locate it. Remembering you have that tool makes for a quick job. Having to rig up something from common parts instead (to continue the analogy, use locking pliers to grab a short screwdriver bit from the side) is not as good, and certainly more work.

So… 8 nested for loops followed by 7 if statements containing 28 conditions? Get real! If you have a line that reads });});});});});});});}); then the code review will be a sight indeed.

Solution in C++ with the Standard Library

Here’s the meat of my solution:

using cellT = int8_t;

cellT A[10] {0,1,2,3,4,5,6,7,8,9};

void solve1()
{
    do {
        ++iteration_count;
        auto [ig1,ig2, s,e,n,d,m,o,r,y] {A};
        int send= decode({s,e,n,d});
        int more= decode({m,o,r,e});
        int money= decode({m,o,n,e,y});
        if(send+more==money) {
            ++solution_count;
            solutions.push_back({send,more,money});
        }
    } while (std::next_permutation(std::begin(A), std::end(A)));
}

You’ll notice that besides the uniform initialization syntax introduced in C++11, this uses something you might not have seen before (if you’re reading this in 2017). Hot off the press in C++17 is structured bindings.

	auto [ig1,ig2, s,e,n,d,m,o,r,y] {A};

This defines 10 variables and assigns all the elements of the array to them. The first two are ignored so I used scratch names, and the others are simply the names of the letters in the puzzle.
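As a small illustration of what that declaration does, here is a sketch that is separate from the solver (the function name binding_demo is made up for this example). Binding an array by value copies it, and each name then refers to one element of the copy:

```cpp
// Hypothetical demo, not part of the solver.
int binding_demo()
{
    int arr[4] {10, 20, 30, 40};
    auto [a, b, c, d] {arr};   // copies arr; a..d name the elements of that copy
    arr[0] = 99;               // changing the original afterwards leaves 'a' at 10
    return a + b + c + d;      // 10 + 20 + 30 + 40
}
```

Because the bindings refer to a copy, the solver can keep permuting A without the names in the current iteration changing underneath it.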

One thing I have to point out from Milewski’s listing is the call to turn a word into a numeric value. He writes:

int send = asNumber(vector{s, e, n, d});

This creates a std::vector on every use. Let me elaborate: it allocates a block of memory from the heap (vectors can’t use a small-vector optimization). Then after the call returns, it is deallocated. Then the next two lines do the same thing. And that happens on every iteration.

The constructor for std::vector takes this handy literal list. Now in C++, these containers are not special language features, but are ordinary libraries. It should be clear that anything they do, cool syntax or unusual ability, you can do yourself in your own code! My version of the same construct does not create a vector, doesn’t require more words to make the call, and most importantly does not have any overhead.

int send = decode({s,e,n,d});

And here is the function that takes such an initializer list:

int decode (std::initializer_list<cellT> lst)
{
	int total= 0;
	for (auto digit : lst)
		total= total*10+digit;
	return total;
}

The solving function doesn’t print the results because I want to time just the solving logic. So the solutions are pushed onto a vector, and the caller prints them after stopping the clock. In a real program, this might be an object (not globals) and the results available in the object afterwards, or as the return value from the solving function. In another post, I’ll make it lazy.
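The timing arrangement described above could be sketched roughly as follows. This is an assumption-laden illustration rather than my actual test harness: the names Solution, SolveResult, decode_digits, solve_all, and time_ms are invented here, plain int is used for the digits, and the results are returned from the function instead of living in globals.

```cpp
#include <algorithm>
#include <chrono>
#include <initializer_list>
#include <iterator>
#include <vector>

struct Solution { int send, more, money; };

struct SolveResult {
    long iterations = 0;
    std::vector<Solution> solutions;
};

// Same idea as decode: fold a list of digits into one number.
int decode_digits(std::initializer_list<int> lst)
{
    int total = 0;
    for (int digit : lst)
        total = total * 10 + digit;
    return total;
}

// Brute force over every permutation of the ten digits; no printing here.
SolveResult solve_all()
{
    int A[10] {0,1,2,3,4,5,6,7,8,9};
    SolveResult result;
    do {
        ++result.iterations;
        auto [ig1, ig2, s, e, n, d, m, o, r, y] {A};
        int send  = decode_digits({s, e, n, d});
        int more  = decode_digits({m, o, r, e});
        int money = decode_digits({m, o, n, e, y});
        if (send + more == money)
            result.solutions.push_back({send, more, money});
    } while (std::next_permutation(std::begin(A), std::end(A)));
    return result;
}

// Runs the callable once and returns the elapsed wall time in milliseconds.
template <typename F>
double time_ms(F&& work)
{
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A caller would write something like `SolveResult r; double ms = time_ms([&] { r = solve_all(); });` and only then print r.solutions, so the output cost stays outside the measurement.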

Make It Faster

This simple function found 50 solutions, one of which doesn’t have a leading zero. It ran in 39.6 milliseconds, trying all 3,628,800 permutations. That’s 92 million iterations per second.

The value of 10 factorial is an interesting number here. Donald Knuth, in The Art of Computer Programming, wrote that this is about the size that separates things that may be done by brute force from things for which it is impractical to simply try all possibilities. Volume 3 was published in 1973. I’d hazard a guess that computers now are about 2³⁰ (or about a billion) times the power of the machines that were around when he wrote that. A billion times 40 milliseconds is about 460 days. So, I revise that to more like ten million times the speed, if I postulate that he could have run this program to completion in a matter of days.

Anyway, to make it faster, I need to skip over permutations that are “the same” as one I already rejected. The order of the two ignored digits doesn’t change the solution, so if I declare one order canonical and skip the whole lot whenever the other order is detected, that would cut the number of iterations in half.

So how do you skip over states in the next_permutation algorithm? I looked it up: a StackOverflow post described it well and also pointed to a Wikipedia page on it. The states are generated in lexicographical order, so just before a given digit changes, everything to its right is in reverse sorted order. The digit then “rolls over”: it advances to the next available value, and everything to its right is put in forward sorted order, which is the lowest value of that substring.

So, when I identify a digit value that I know will not be a solution no matter what the other digits are, I can skip to when that digit is advanced again by reversing everything to the right of it.

void skipfrom (int pos)
{
    std::reverse (std::begin(A)+pos, std::end(A));
}

    ⋮
    do {
	++iteration_count;
	auto [ig1,ig2, s,m,e,d,y,n,o,r] {A};
	if(ig1 > ig2) {
	    skipfrom(2);
	    continue;
	}
    ⋮

Indeed, it still found 50 solutions but the iteration_count showed half the number: only 1.8 million times through the loop. However, the time only went down to 26ms — about two thirds the time, not half.
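To see the skipping trick in isolation, here is a minimal sketch on a five-element array (the names SkipStats and skip_demo are made up for this illustration). It keeps only permutations whose first element is smaller than the second and skips the rest in one step each. It relies on the fact that the first permutation seen with a given prefix has its tail in ascending order, so one reverse turns it into the last permutation with that prefix.

```cpp
#include <algorithm>
#include <iterator>

struct SkipStats { int iterations; int kept; };

// Hypothetical demo: walk the permutations of {0,1,2,3,4}, skipping every
// block whose first two elements are out of order.
SkipStats skip_demo()
{
    int a[5] {0, 1, 2, 3, 4};
    SkipStats stats {0, 0};
    do {
        ++stats.iterations;
        if (a[0] > a[1]) {
            // First visit with this prefix, so a[2..4] is ascending here.
            // Reversing makes it descending: the last permutation with this prefix.
            std::reverse(std::begin(a) + 2, std::end(a));
            continue;   // in a do-while, continue jumps to the condition below
        }
        ++stats.kept;
    } while (std::next_permutation(std::begin(a), std::end(a)));
    return stats;
}
```

Of the 120 permutations, 60 have the first two elements in order. The loop body runs 70 times: once for each of those 60, plus once for each of the 10 rejected prefixes.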

We also don’t want solutions with a leading zero, so filter those out too. Notice in the listing above I changed the order of declaring the digit variables. It doesn’t matter to the solution algorithm, but putting s and m farther to the left means I can skip over more.

        ⋮
	if(s==0) {
	    skipfrom(3);
	    continue;
	}
	if(m==0) {
	    skipfrom(4);
	    continue;
	}
        ⋮

That didn’t save much though: 1.45 million iterations in 22 ms.

Another big constraint can be found on the right side of the puzzle. I would think that parsing the arrangement of digits into ints would be slow, seeing as that involves multiplying by 10 and adding each digit. Looking at the rightmost (units) digit only, the puzzle has d+e=y with a possible carry. Test that before parsing the int values, and furthermore skip ahead until one of those three digits changes again. To that end, the declaration order has d, e, and y next after the previous items we wanted on the left. This leaves only 3 letters to the right, so each time the units digits don’t work, it can skip 6 iterations.

I added a counter to that, and see that it happened 219,360 times. The loop only executed 355,053 times now, taking a mere 4.7 ms.
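Since the listing for the units-digit test is not shown above, here is a self-contained reconstruction of the whole pruned loop. Treat it as a sketch under assumptions rather than the exact code I timed: the names Found, to_number, and solve_pruned are invented, the digits are plain int, results are returned instead of stored in globals, and the units test is written as (d + e) % 10 != y.

```cpp
#include <algorithm>
#include <initializer_list>
#include <iterator>
#include <vector>

struct Found { int send, more, money; };

int to_number(std::initializer_list<int> digits)
{
    int total = 0;
    for (int digit : digits)
        total = total * 10 + digit;
    return total;
}

std::vector<Found> solve_pruned()
{
    int A[10] {0,1,2,3,4,5,6,7,8,9};
    std::vector<Found> found;
    // Same idea as skipfrom: make the tail descending so that the next
    // permutation changes the digit just left of pos.
    auto skipfrom = [&A](int pos) {
        std::reverse(std::begin(A) + pos, std::end(A));
    };
    do {
        auto [ig1, ig2, s, m, e, d, y, n, o, r] {A};
        if (ig1 > ig2)         { skipfrom(2); continue; }  // one canonical order for the unused digits
        if (s == 0)            { skipfrom(3); continue; }  // no leading zero in SEND
        if (m == 0)            { skipfrom(4); continue; }  // no leading zero in MORE or MONEY
        if ((d + e) % 10 != y) { skipfrom(7); continue; }  // units column must agree, ignoring the carry out
        int send  = to_number({s, e, n, d});
        int more  = to_number({m, o, r, e});
        int money = to_number({m, o, n, e, y});
        if (send + more == money)
            found.push_back({send, more, money});
    } while (std::next_permutation(std::begin(A), std::end(A)));
    return found;
}
```

With all four filters in place, the only result left is 9567 + 1085 = 10652. The checks run from the leftmost position to the rightmost, so each skip discards the largest block it can.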

Faster yet?!

Notice in the listing that I declared a type cellT for the array of digits and anything that holds a digit. My thought was that keeping the array small would save in parameter passing to decode. Keeping the size of the data small is generally good for memory caching, but it probably doesn’t matter here because I don’t have enough data to exceed the L1 cache anyway.

But, I can change the type in one place in the code and it affects everything in the program. I changed it from int8_t to plain int (int is 32 bits here), and… the time was down to 3.4 ms!

64-bit was the same. 16-bit values were the slowest at 5.2 ms. So, the CPU is inefficient at loading 8-bit values and worse with 16. I don’t know if the actual load is slower, or if the compiler generated somewhat different code rather than just using MOVSX instead of MOV. This was a difference of a quarter of the execution time over the entire algorithm, and loading values is only a part of what it does, with values kept in registers once loaded.