Apple has designed a neural network architecture that can detect objects in point clouds obtained with a LIDAR sensor.
Recently Apple joined the field of autonomous vehicles. Although we don't know much about their car, many companies building autonomous vehicles use a so-called LIDAR sensor to detect obstacles around them. A LIDAR emits a light pulse and measures the time it takes for that pulse to bounce back to its sensor, which gives the distance from the vehicle to a nearby obstacle. By rotating the emitter, the sensor can measure distances to obstacles in every direction around the vehicle.
Distances obtained with a LIDAR are stored in a so-called point cloud: a collection of points in 3D space. Once this collection of points is visualized, humans are good at recognizing several types of objects in it, such as pedestrians, cars, and bicycles. Unfortunately, this remains a difficult task for computers. Try for yourself: can you indicate where the human and the cars are in the picture above?
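To make this concrete, here is a minimal sketch (in Python with NumPy) of how a single time-of-flight measurement and the pulse's angles could be turned into a 3D point, and how a sweep of such points forms a point cloud stored as a plain N x 3 array. The function name, the angles, and the example numbers are purely illustrative; they are not taken from any particular sensor or from Apple's paper.

```python
import numpy as np

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def pulse_to_point(time_of_flight_s, azimuth_rad, elevation_rad):
    """Convert one LIDAR return into an (x, y, z) point in the sensor frame.

    The pulse travels to the obstacle and back, so the one-way distance
    is half of (speed of light * time of flight).
    """
    distance = 0.5 * SPEED_OF_LIGHT * time_of_flight_s
    x = distance * np.cos(elevation_rad) * np.cos(azimuth_rad)
    y = distance * np.cos(elevation_rad) * np.sin(azimuth_rad)
    z = distance * np.sin(elevation_rad)
    return np.array([x, y, z])

# A point cloud is then simply the stack of all returns from one sweep:
# an (N, 3) array of x, y, z coordinates (real sensors often add a fourth
# column with the reflection intensity).
returns = [
    (1.0e-7, 0.00, 0.0),   # ~15 m straight ahead
    (2.0e-7, 1.57, 0.0),   # ~30 m to the left
    (0.5e-7, 3.14, 0.1),   # ~7.5 m behind, slightly above sensor height
]
point_cloud = np.stack([pulse_to_point(t, az, el) for t, az, el in returns])
print(point_cloud.shape)  # (3, 3)
```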
Traditional methods rely on handcrafted features to make sense of this data, for example by segmenting the cloud into sub-clouds or by separating it into surfaces. Another way to make sense of LIDAR data is to select a viewpoint, render the points into an image, and feed that image through existing computer-vision algorithms (a small sketch of this projection idea follows below). The downside of these approaches is that designing such features is hard, and it is even harder to design features that generalize well to all situations. Apple instead created an end-to-end neural network that solves this problem without relying on any handcrafted features or on machine learning algorithms other than neural networks.
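As an illustration of the "select a viewpoint" idea, the sketch below projects a point cloud onto a top-down height image that a standard 2D vision pipeline could consume. The grid ranges and cell size are made up for the example; this is not the specific projection used by any of the methods Apple compares against.

```python
import numpy as np

def birds_eye_view(point_cloud, x_range=(0.0, 70.0), y_range=(-40.0, 40.0), cell_size=0.1):
    """Project an (N, 3) point cloud onto a 2D top-down height map.

    Each cell of the image stores the maximum height (z) of the points that
    fall into it: one simple hand-designed representation that an ordinary
    2D computer-vision pipeline can work with.
    """
    x, y, z = point_cloud[:, 0], point_cloud[:, 1], point_cloud[:, 2]
    mask = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    x, y, z = x[mask], y[mask], z[mask]

    cols = ((x - x_range[0]) / cell_size).astype(int)
    rows = ((y - y_range[0]) / cell_size).astype(int)

    height = int(round((y_range[1] - y_range[0]) / cell_size))
    width = int(round((x_range[1] - x_range[0]) / cell_size))
    image = np.full((height, width), -np.inf)
    np.maximum.at(image, (rows, cols), z)   # keep the highest point per cell
    image[np.isinf(image)] = 0.0            # empty cells get a neutral value
    return image
```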
The first part of their approach is the so-called "feature learning network". Apple divides the space around the car into voxels (3D pixels); for car detection each voxel measures 0.4 meters in height and 0.2 by 0.2 meters in the ground plane. In every voxel they randomly select a fixed-size subset of points (some voxels contain many points, others only a few; sampling the same number from each voxel gives the network a uniform amount of input per voxel). They feed these points through a neural network that turns each voxel into a representation in 128-dimensional space.
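The following is a simplified sketch of that idea in PyTorch: a shared per-point network followed by a pooling step that turns the sampled points of a voxel into one 128-dimensional feature. Apple's actual feature learning network stacks several "voxel feature encoding" layers that also concatenate point-wise and aggregated features, so treat the layer sizes and the sample count of 35 points here as illustrative choices rather than the exact architecture.

```python
import torch
import torch.nn as nn

class VoxelFeatureEncoder(nn.Module):
    """Simplified sketch of the feature learning idea: a shared per-point MLP
    followed by a max-pool over the points in a voxel, producing one
    128-dimensional feature per voxel."""

    def __init__(self, in_dim=3, out_dim=128):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, out_dim), nn.ReLU(),
        )

    def forward(self, voxels):
        # voxels: (num_voxels, T, in_dim) -- T randomly sampled points per voxel
        point_features = self.point_mlp(voxels)          # (num_voxels, T, 128)
        voxel_features, _ = point_features.max(dim=1)    # (num_voxels, 128)
        return voxel_features


def sample_points_per_voxel(points_in_voxel, T=35):
    """Randomly keep at most T points of a voxel so every voxel presents the
    network with the same amount of input (pad with zeros if there are fewer)."""
    n = points_in_voxel.shape[0]
    if n >= T:
        idx = torch.randperm(n)[:T]
        return points_in_voxel[idx]
    padding = torch.zeros(T - n, points_in_voxel.shape[1])
    return torch.cat([points_in_voxel, padding], dim=0)


# Sampling example: a voxel with 50 raw points is reduced to 35 points.
sampled = sample_points_per_voxel(torch.randn(50, 3))
print(sampled.shape)  # torch.Size([35, 3])

# Encoding example: 10 voxels, 35 sampled points each, 3 coordinates per point.
features = VoxelFeatureEncoder()(torch.randn(10, 35, 3))
print(features.shape)  # torch.Size([10, 128])
```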
Doing this for each voxel gives a data structure that can be fed through the same kind of network architecture you see in neural-network approaches to computer vision. With several convolutional layers, the network projects its output onto a probability map and a regression map (see image below). The probability map indicates, for each voxel, whether it contains an object; the regression map indicates where exactly within that voxel the object sits and how large it is.
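A rough sketch of how such a convolutional head could produce those two maps is shown below, again in PyTorch. The real network first applies 3D convolutions and a considerably larger region proposal network; the channel counts, the grid size, and the seven box parameters here are illustrative assumptions rather than the exact layers from the paper.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of the convolutional part: take the grid of voxel features,
    run it through a few convolutions, and emit an objectness map plus a
    regression map describing where the object sits within each cell."""

    def __init__(self, in_channels=128, box_params=7):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        # One objectness score per grid cell ...
        self.score_head = nn.Conv2d(128, 1, kernel_size=1)
        # ... and, for the same cell, offsets describing the box
        # (e.g. x, y, z, width, length, height, rotation).
        self.regression_head = nn.Conv2d(128, box_params, kernel_size=1)

    def forward(self, voxel_feature_grid):
        # voxel_feature_grid: (batch, 128, grid_height, grid_width)
        x = self.backbone(voxel_feature_grid)
        probability_map = torch.sigmoid(self.score_head(x))
        regression_map = self.regression_head(x)
        return probability_map, regression_map


# Example: a batch of one 200 x 176 grid of 128-dimensional voxel features.
grid = torch.randn(1, 128, 200, 176)
prob_map, reg_map = DetectionHead()(grid)
print(prob_map.shape, reg_map.shape)  # (1, 1, 200, 176) (1, 7, 200, 176)
```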
Apple tested their approach on the KITTI Vision Benchmark Suite and compared it with other approaches that use the same kind of LIDAR data but do rely on handcrafted features. It turns out that their end-to-end network outperforms all of these existing approaches.
With this research, Apple gives a glimpse of the approach they are taking in their autonomous vehicle project. Earlier this year it was reported that Apple uses six out of a total of twelve LIDAR sensors on the roof of their test car. They published their results in a paper that can be downloaded from arXiv.