In our previous article [1], we discussed how to integrate Lucene with HBase for improved scalability and availability. In this article I will show how to extend this Implementation with the spatial support.
Lucene spatial contribution package [2, 3, 4, 5] provides powerful support for spatial search, but is limited to finding the closest point. In reality spatial search often has significantly more requirements, for example, which points belong to a given shape (circle, bounding box, polygon), which shapes intersect with a given shape and so on. Solution, presented in this article allows solving all of the above problems.
Spatial Search Approach
The proposed implementation is based on a two level search - first level search is based on Cartesian Grid [3] search and the second level implements shape specific spatial calculations.
For Cartesian Grid implementation I tile[1] the whole world at some zoom level[2]. In this case geometric shape is defined by the set of IDs of tiles that contain the complete shape. The calculation of these tiles IDs is trivial for any shape type and can be implemented in a several lines of code[3]. In this case both the document, referring to a certain shape and the search shape can be represented by the list of tile IDs. This means that the first level search implementation is just finding intersection between tile IDs of the search shape and the document’s shape.[4] This can be implemented as a “standard” text search operation, which Lucene is very good at. Once the initial selection is done a required geometrical arithmetic can be used to get more precise results by filtering of the results of the first level search. Technically these calculations can be as complex as they need to be to improve the precision of the search. Because these calculations are technically not part of the base implementation, every interested party can develop their own implementation that is most suited for their needs.
Figure 1 - Two level search
Let's consider a concrete example (Figure 1). Here I am showing only a limited amount of tiles - 1 through 12 and 4 documents "a" through "e". Based on its location and shape, each document can be described as a set of tile IDs, for example, "b" is contained in tiles 11 and 12. Now let's assume that we are searching for every document intersecting with a bounding box (wide line). As we can see at Figure 1 a search bounding box can be represented as a set of tiles with IDs 1 - 12. All shapes "a" - "e" will be found as a result of the first level search. The second level search will throw away shapes "a" and "e", that do not intersect with result bounding box and the overall search will return back shapes "b", "c" and "d".
Incorporating support for Spatial Search in Lucene
Incorporating described above algorithm into Lucene is fairly straightforward. In our implementation the job of Lucene engine is to do a first level search - tile matching. The second level search can be done through external filtering, that do not necessarily need to be incorporated into Lucene.
To support both, the Lucene document needs to be extended with additional fields:
- Shape field describes the shape[5] associated with the document and can be used by the filter to obtain the shape that the document is associated with to perform explicit filtering.
- Tile Id field(s) contain field(s) describing tile(s) that the shape is contained in for different zoom levels.
Although theoretically these additional fields can be generated by client directly, I have decided to introduce a special class – Spatial Document class, extending Lucene Document class by adding shape information. This allows us to hide these additional fields from the user and generate them automatically for any Spatial Document instance.
I decided to use similar approach for searcher as well. Instead attempting to extend Lucene’s Query class to support our geo spatial constructs, I have implemented a SpatialReaderFilter class that can be set on the Reader[6] class prior to the search.
This approach allows for building a generic first level geo spatial search extension for Lucene, which can be used as a foundation for solving any complex geo spatial problems.
Optimizing Geo Spatial information support
Putting geo spatial information directly into the document fields is the most simplistic approach for storing the geo spatial information in Lucene index. Its drawback that it either leads to the “explosion” of the index size[7] and necessity to do a lot of data “joins” (Lucene “and” query) or requires a “brute force” approach - traversing existing term/value index and using only items passing filter criteria (belonging to a set of tiles for a given zoom level). Both will work fine for simple cases, when the average size of the Lucene index is relatively small (approximately thousands and tens of thousands of items). In the more complex cases, when the size of the index grows, both approaches will lead to significant increase of the search time. In this latter case, a better approach to the storing document’s information for a given field/term value as multilevel hash tables (Figure 2).
Figure 2 - Storing of geo spatial indexes for a given field/term
The first level of hash is by zoom level. Level 1 - the whole world is reserved for the requests without spatial information and contains documents information for all documents containing given value for a given term. All other levels contain a hash map containing documents information for every document containing given value for a given term and located in a given tile id. Because the amount of documents contained in a given tile is always significantly smaller compared to the amount of documents for a whole world the search based on such index organization is going to be significantly faster searches containing geo spatial component, while preserving roughly the same search time for searches without them.
Although such index organization can increase required memory footprint, because the overall implementation is based on the expiring lazy cache, this increase will be relatively small.
It is fairly straightforward to modify in memory cache [1] to leverage multilevel hash tables (Figure 2) for index implementation.
Optimizing HBase tables
Although implementation of multilevel hash tables (Figure 2) can be supported in original HBase table design [1] by introducing additional keys structure, for our implementation providing geo spatial support I decided to modify it in order to better cope with the significantly increased amount of rows and further improve search scalability. Instead of using a single HBase index table, in this case I am using N+1 index tables (Figure 3) - one for global data and one for every zoom level.
Figure 3 - HBase Index tables
The whole world table in (Figure 3) is exactly the same as index table described in [1], while tables for every zoom level have the same content but a different key structure. For them a key is concatenation of not two, but three values - tile ID, field name and term value.
Besides the fact that splitting tables by zoom level allows to make each table smaller, which creates more manageable tables, it can add to parallelization of searches - requests for different zoom levels will read from different tables. Presented above keys definition allows for natural support for scanning on the field/term level for a given zoom level/tile ID required by Lucene implementation.
Implementing Second level search
Once the first level search is complete and Lucene's TopDocs class, containing search results is returned, a second level filtering of the results can be executed. A simple class (Listing 1) shows how this execution can be done.
package com.navteq.lucene.spatial.geometry.filters;import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;import com.navteq.lucene.hbase.indexing.HBaseIndexReader;
import com.navteq.lucene.spatial.GeometricShapeHelper;
import com.navteq.lucene.spatial.geometry.GeometricShape;public class SpatialFilterProcessor {
private AbstractSpatialFilter _filter = null;
public SpatialFilterProcessor(AbstractSpatialFilter filter){
_filter = filter;
}public TopDocs FilterDocuments(IndexReader reader, TopDocs set)
throws IOException {ScoreDoc[] original = set.scoreDocs;
List<Integer> ids = new ArrayList<Integer>(original.length);
List<ScoreDoc> filtered = new LinkedList<ScoreDoc>();
float maxScore = -1;
for (ScoreDoc d : original)
ids.add(d.doc);
List<Document> documents =
((HBaseIndexReader)reader).documents(ids);
int i = 0;
for (Document doc : documents) {
if (doc != null) {
GeometricShape shape =
GeometricShapeHelper.getGeometry(doc);
if (shape != null){
if (_filter.accept(shape)) {
filtered.add(original[i]);
if (original[i].score > maxScore)
maxScore = original[i].score;
}
}
}
i++;
}
ScoreDoc[] resSet = (ScoreDoc[])filtered.toArray(
new ScoreDoc[filtered.size()]);
int hits = set.totalHits - (original.length - resSet.length);
return new TopDocs(hits, resSet, maxScore);
}
}
Listing 1 - Invoking second level filtering
This class leverages optimization in our Lucene implementation, described in [1], allowing to read multiple documents based on the set if document ID. It relies on the fact that multi get operations in HBase are faster that the sequence of individual gets.
It also leverages abstract spatial filter interface (Listing 2) to check whether to accept given document, based on a shape associated with it.
package com.navteq.lucene.spatial.geometry.filters;import com.navteq.lucene.spatial.geometry.GeometricShape;
public interface AbstractSpatialFilter {
public abstract boolean accept(GeometricShape shape);
}
Listing 2 - Abstract spatial filter class
Although I have provided some of the basic implementations of the abstract spatial filter interface, for example, point in bounding box, point inside the circle, bounding box inside bounding box, etc., our implementation is geared towards user-driven extensibility. Each user is free to implemented spatial filters, that best suits his needs. This approach also allows to build compound spatial filters[8] out of the simple one, for example, point or bounding box inside a bounding box, etc.
Further improvements
Secondary level filtering integration described here (Listing 1) allows removing some of the search results, but does not modify the weights of the found documents. In reality, in the case of geo spatial search, the result weight is often driven by proximity/distance from the center point. Presented here implementation (Listing 1, Listing 2) can be easily modified to support this requirement.
Conclusion
The two level approach to the geo spatial search, described in this article allows to create a very powerful and extendable implementations Lucene-based geo spatial search engine, that can be used to solve many real world problems.
About the Authors
Boris Lublinsky is principal architect at Nokia, where he is working on defining architecture vision for large Hadoop – based data management and processing and SOA and implementing various Nokia projects. He is also an SOA editor for InfoQ and a participant of SOA RA working group in OASIS. Boris is an author and frequent speaker, his most recent book is "Applied SOA".
References
1. Boris Lublinsky, Michael Segel. Integrating Lucene with HBase.
2. Michael McCandless, Erik Hatcher, Otis Gospodnetic. Lucene in Action, Second Edition.
3. Local Lucene Geographical Search.
4. Grant Ingersoll. Location-aware search with Apache Lucene and Solr.
5. Mike Haller. Spatial search with Lucene.
[1] Tiling is a partitioning of a space (the whole world in our case) into a finite number of distinct shapes (we will use equal sized bounding boxes - tiles).
[2] Zoom level defines the size of tiles, typically for the zoom level n the amount of tiles for the world is 2^n.
[3] In reality it can be a little bit more complex, for example for polygons with holes, but for all practical purposes the straightforward enclosure in the bounding box and then mapping of the bounding box in tiles is good enough for implementation of first level search
[4] This is an oversimplification to explain a principle. The real implementation contains IDs for the multiple zoom levels and selects a most appropriate zoom level based on the size and shape of the search criteria.
[5] Current implementation is based on the serializable shape super class and several specific shape implementations, including point, bounding box, circle and polygon.
[6] A reader reset method (very lightweight), resets this filter and allows to set the new one.
[7] A new term/value needs to be created for every tile ID/zoom level combination
[8] Compare to Lucene logical filters.