For my graphics research, I end up encoding and decoding 3D coordinates to and from Morton order (Z-order) a lot. For a detailed explanation of what Morton codes are and how I use them, check this blog post, which sparked an interesting discussion on implementation details and turned out to be a popular entry point for many people working with Morton-ordered coordinates.
Since then, I have bundled the methods I use for Morton encoding/decoding into a library called (oh, the originality) libmorton on Github. I've followed up on some of the advice and ideas from the discussion under the previous blog post: there have been a lot of bugfixes, and there is now a method for efficient decoding as well. Here's a reddit thread about it, too.
- Header-only library: just include morton.h and go
- 2D and 3D, 32-bit and 64-bit Morton order decoding/encoding
- Available methods:
- LUT-based (both pre-shifted and non-pre-shifted tables, early termination optional). This is the fastest and default method.
- Magicbits method
- For-loop based method
- GPL v3 licensed; send me an e-mail if you use this in a project
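To illustrate what all these methods compute, the for-loop approach can be sketched in a few lines (my own minimal version, not libmorton's exact code): bit i of each coordinate moves to bit 3i, 3i+1 or 3i+2 of the result.

```cpp
#include <cstdint>

// Minimal for-loop Morton encoder (sketch, not libmorton's exact code):
// interleave the bits of x, y and z. 21 bits per coordinate fit in the
// 63 usable bits of a 64-bit Morton code.
uint64_t morton3D_encode_for(uint32_t x, uint32_t y, uint32_t z) {
    uint64_t result = 0;
    for (unsigned i = 0; i < 21; ++i) {
        result |= ((uint64_t)(x >> i) & 1) << (3 * i);      // bit i of x -> bit 3i
        result |= ((uint64_t)(y >> i) & 1) << (3 * i + 1);  // bit i of y -> bit 3i+1
        result |= ((uint64_t)(z >> i) & 1) << (3 * i + 2);  // bit i of z -> bit 3i+2
    }
    return result;
}
```

Simple and portable, but it does three shift-mask-shift-or sequences per bit, which is exactly why it loses badly to the LUT and magic-bits methods below.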
Also, I'm working on separate benchmarks and tests, which check correctness and measure performance when encoding/decoding an increasing number of Morton codes, both in linear order ( (0,0,0), (1,0,0), (2,0,0) ) and in random order. As fenbf correctly noted, testing in a purely linear fashion might favor some methods, because recently used results (LUT table entries) might still reside in the CPU's L1/L2 caches. These benchmarks can be found in the test folder of the repository, along with a Visual Studio 2015 solution. By default, they check the methods for correctness and then time an increasing number of encoding/decoding steps.
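The correctness side can be illustrated with a reference for-loop decoder: every third bit of the code is gathered back into x, y and z, and any faster method can be checked against it (a minimal sketch, not the repository's actual test code).

```cpp
#include <cstdint>

// Reference for-loop decoder (sketch): bit 3i of the Morton code returns
// to bit i of x, bit 3i+1 to y, bit 3i+2 to z. A faster decoder can be
// verified against this, or round-tripped through the matching encoder.
void morton3D_decode_for(uint64_t m, uint32_t& x, uint32_t& y, uint32_t& z) {
    x = y = z = 0;
    for (unsigned i = 0; i < 21; ++i) {  // 21 bits per coordinate in 64 bits
        x |= (uint32_t)((m >> (3 * i)) & 1) << i;
        y |= (uint32_t)((m >> (3 * i + 1)) & 1) << i;
        z |= (uint32_t)((m >> (3 * i + 2)) & 1) << i;
    }
}
```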
Here are some (early) results. These were all obtained on an Intel i5-2500 CPU clocked at 3.9 GHz. The tests were compiled with MSVC 1400, single-threaded, with all optimizations turned off, in x86 mode. Each figure is an average over 10 runs.
As you can see (and as expected), for encoding, the simple lookup-table-based method is fastest. I use tables of pre-split values for byte-sized inputs: each pre-split is 32 bits wide, there are 256 entries per table, and there is one table each for x, y and z. That's 32 × 256 × 3 bits = 24576 bits = 3 kB of extra table data that has to be baked into your executable. You can also generate the tables on the fly at runtime, if necessary.
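Generating such a table at runtime is straightforward. Here is a sketch of the idea (my reconstruction, not libmorton's actual tables): a single non-pre-shifted table, with the y and z lookups shifted at encode time. The pre-shifted variant instead stores three tables with the shifts baked in, trading table size for fewer runtime shifts.

```cpp
#include <cstdint>

// One 256-entry table: for every byte value, the bits spread out to every
// third position (bit i -> bit 3i). Sketch of the LUT idea, not libmorton's
// actual table layout.
static uint32_t splitLUT[256];

void build_lut() {
    for (uint32_t v = 0; v < 256; ++v) {
        uint32_t split = 0;
        for (unsigned i = 0; i < 8; ++i)
            split |= ((v >> i) & 1) << (3 * i);
        splitLUT[v] = split;
    }
}

// Encode byte by byte: each input byte covers 24 bits of the Morton code.
// Assumes coordinates fit in 21 bits (so the top byte only uses 5 bits).
uint64_t morton3D_encode_lut(uint32_t x, uint32_t y, uint32_t z) {
    uint64_t m = 0;
    for (unsigned b = 0; b < 3; ++b) {
        m |= (uint64_t)(splitLUT[(x >> (8 * b)) & 0xFF]      |
                        splitLUT[(y >> (8 * b)) & 0xFF] << 1 |
                        splitLUT[(z >> (8 * b)) & 0xFF] << 2) << (24 * b);
    }
    return m;
}
```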
A small price to pay for a method which, at 256^3 coordinates (that's about 16 million), is twice as fast as the magic-bits-shift method and twenty-five times as fast as the for-loop method. It is implemented using standard table lookups and bit shifts, which should be available on every target platform (like CUDA; more on that later).
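For reference, the magic-bits method spreads the bits of each coordinate using a fixed sequence of shifts and masks instead of table lookups. A sketch using the well-known 64-bit constants (not libmorton's exact code):

```cpp
#include <cstdint>

// Magic-bits split (sketch): spread the lower 21 bits of v so each bit
// lands at every third position, by repeatedly shifting and masking.
uint64_t split_by_3(uint32_t v) {
    uint64_t x = v & 0x1FFFFF;                 // keep 21 bits
    x = (x | x << 32) & 0x1F00000000FFFF;
    x = (x | x << 16) & 0x1F0000FF0000FF;
    x = (x | x << 8)  & 0x100F00F00F00F00F;
    x = (x | x << 4)  & 0x10C30C30C30C30C3;
    x = (x | x << 2)  & 0x1249249249249249;    // bits now at positions 0,3,6,...
    return x;
}

uint64_t morton3D_encode_magic(uint32_t x, uint32_t y, uint32_t z) {
    return split_by_3(x) | split_by_3(y) << 1 | split_by_3(z) << 2;
}
```

A fixed number of register operations and no memory traffic, which is why it sits between the LUT and for-loop methods in the benchmarks.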
When it comes to the difference between linear and random encoding, we see that there could indeed be a cache effect at play when the encoding input is non-linear.
This is partly explained by the extra time needed to fetch the random coordinates, which are generated in bulk and stored in a big array before the test runs. Calling random() inside the timed test itself causes a big overhead, and the alternative of starting/stopping the performance timer around every call to exclude it skews the results as well. Sample-based profiling makes clear that at least 5% of the total running time is spent on these lookups (jumping around in a 192 MB array). Still, this way of testing doesn't lend itself to a firm conclusion, since the remaining time difference for the LUT method is still too large. I'm working on a better way to test this.
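Pre-generating the random inputs in bulk looks roughly like this (a sketch with hypothetical names and sizes, not the actual benchmark code); a fixed seed keeps runs comparable:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Fill one big array of random coordinates up front, so the timed loop
// only reads from the array instead of calling the RNG for every sample.
std::vector<uint32_t> make_random_coords(std::size_t count, uint32_t max_coord,
                                         uint32_t seed) {
    std::mt19937 rng(seed);  // fixed seed: reproducible across runs
    std::uniform_int_distribution<uint32_t> dist(0, max_coord - 1);
    std::vector<uint32_t> coords(count * 3);  // x, y, z interleaved
    for (auto& c : coords) c = dist(rng);
    return coords;
}
```

The downside, as noted above, is that the timed loop now also measures the cost of streaming through this array, which for 16 million triples is on the order of 192 MB of data.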
As for the 3 kB of LUT tables I'm using, I would expect them to fully reside in cache. On my processor, the Level 1 data cache is 4 x 32 kB and the Level 2 cache is 4 x 256 kB. I would find it very strange if the tables left the L1/L2 caches at any point, since they are constantly being used and don't come close to either cache's size. Then again, I'm not well versed in the details of CPU cache replacement policies (LRU? Hit counts?) or in how much overhead the OS adds (all the Windows DLLs jumping around in there).
To be continued!