Structure From Motion

For our Capita Selecta project – a freeform 2-semester project with prof. P. Dutré and prof. E. Duval – we decided to build something around Structure From Motion. This means starting from photographs of an object or environment and ending up with a fully textured 3D mesh – a nontrivial task. Tech babble and more results behind the cut, cool videos here.

We (that’s me and Steven Van Acker, also a CS student) based our research on the paper Reconstructing Building Interiors from Images. This features the following pipeline to build a working solution:

The Structure From Motion block is implemented using Noah Snavely’s Bundler software. This is the research software that came out of the Photo Tourism project, which is also at the foundation of Microsoft’s Photosynth technology. It combines (bundles) a few necessary steps in the process:

  • Estimating camera parameters (using EXIF data)
  • Radial undistortion of the images
  • Finding keypoints in every image (using SIFT)
  • Pairwise matching of these keypoints
  • Outputting these results to a file
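
To make the EXIF step a bit more concrete: EXIF stores the focal length in millimetres, while the reconstruction needs it in pixels. Below is a minimal sketch of the usual conversion, assuming the sensor width of the camera is known. The function name and the example camera values are made up for illustration and are not taken from Bundler itself.

```python
def focal_length_in_pixels(focal_mm, sensor_width_mm, image_width_px):
    """Convert an EXIF focal length (mm) to pixels.

    Assumes the sensor width (mm) is known, e.g. from a camera database,
    and that the image has not been cropped.
    """
    if sensor_width_mm <= 0:
        raise ValueError("sensor width must be positive")
    return image_width_px * focal_mm / sensor_width_mm


# Hypothetical compact camera: 5.4 mm lens, 5.27 mm wide sensor,
# image resized to 1600 pixels wide.
print(focal_length_in_pixels(5.4, 5.27, 1600))
```

This value is only a starting guess; the bundle adjustment refines it later on.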

As said, this is research software, so some tweaking and scripting was required to get it working smoothly.

The next step, which takes the camera parameters and the source images as input, is Patch-based Multi-view Stereo. This step, PMVS for short, creates a 3D point cloud. It exploits the known camera positions to recover depth, much like a pair of human eyes does. For the final step (Manhattan-World Stereo), no research software was available. Experiments with other ways of generating a 3D mesh from a set of point clouds with normals (Poisson Surface Reconstruction) have not yet delivered any good results. (Got good pointers? Contact me!)
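
The "depth from known camera positions" idea is easiest to see in the simplest case of two side-by-side cameras. The sketch below is not how PMVS works internally (it matches and expands small patches across many views); it only illustrates the underlying principle, under the assumption of two identical, rectified cameras. All names and numbers are hypothetical.

```python
def depth_from_disparity(focal_px, baseline, disparity_px):
    """Depth of a point seen by two rectified, side-by-side cameras.

    focal_px     -- focal length in pixels (same for both cameras)
    baseline     -- distance between the two camera centres
    disparity_px -- horizontal shift of the point between the two images
    The result is in the same unit as the baseline.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline / disparity_px


# A point that shifts a lot between the two views is close by,
# a point that barely moves is far away.
for disparity in (100, 20, 5):
    print(disparity, depth_from_disparity(1000, 0.1, disparity))
```

This also hints at why too large an angle between photographs hurts: the matching step first has to find the same point in both images before any depth can be computed.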

All of this comes at a price: matching keypoints and expanding point cloud patches is a CPU- and memory-greedy business. We’re looking into parallelizing part of the process, so we can run it on multiple machines and fully exploit the available computing power. Compiling all the tools was a small hell, so in order to have them run independently on machines in the PC rooms at the university we’d have to compile them statically (permissions on those machines are limited, so I doubt we can install all required libraries there).
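
To give an idea of what that parallelization could look like for the matching step: every image has to be compared with every other image, so the number of pairs grows quadratically. A minimal sketch of dividing those pairs over a number of machines follows. This is an assumption about how one might split the work, not something Bundler does out of the box, and the file names are invented.

```python
from itertools import combinations


def split_pairs(image_names, n_workers):
    """Divide all unordered image pairs round-robin over n_workers."""
    if n_workers < 1:
        raise ValueError("need at least one worker")
    pairs = list(combinations(image_names, 2))
    # Worker i gets pairs i, i + n_workers, i + 2 * n_workers, ...
    return [pairs[i::n_workers] for i in range(n_workers)]


# Hypothetical set of 100 photographs spread over 4 machines.
images = ["img%03d.jpg" % i for i in range(100)]
for worker, chunk in enumerate(split_pairs(images, 4)):
    print(worker, len(chunk))
```

Each machine would then match only its own list of pairs, after which the results still have to be merged before the bundle adjustment can run.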

To visualise the point clouds, Scanalyze from Stanford University was suggested on several websites. Pardon my language, but this is a bitch to compile on Linux. In the end, we wound up coding our own viewer, based on the mesh viewer from the Trimesh2 library, which I already know from my thesis project.

Enough with the tech-babble, time for some results!

First we tried some basic objects, like this Harry Potter Book:

Resulting in (points enlarged to give a better impression of the colors)

Bar the noise on the book edges, this is a good result.

Then this Coca-Cola pen holder:

Resulting in

The pipeline interprets black as see-through, and I don’t think we took enough pictures for this one (angle difference is too big). This explains this rather crummy result, although the Tweety figure is modeled nicely, as are the rings.

Time to step up to the plate. Let’s try it with the front of our Computer Sciences building, shall we?

After 1 hour and 45 minutes of agonizing rendering

Apart from the noise … not too bad, aye?


Stepping up to the plate once more involved filming my head from several directions, then slicing this movie into separate frames and running these through the pipeline. Since processing all 400+ frames maxed out my cores and ran out of memory, we only used 1/5 of the available frames for the reconstruction, picked at a fixed interval.
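
Picking the frames at a fixed interval is a one-liner; the sketch below assumes the frames were already extracted to numbered files (the file names are hypothetical).

```python
def pick_frames(frame_names, step=5):
    """Keep every step-th frame, starting with the first one."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return frame_names[::step]


# Hypothetical movie of 400 frames, keeping 1 in 5.
frames = ["frame%04d.jpg" % i for i in range(400)]
selected = pick_frames(frames, 5)
print(len(selected), selected[:3])
```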

and the result

Resulting in this:
