Appropriate indexes are the key to getting the most out of pretty much any database system. I’ve recently taken a deeper look at how SpatiaLite indexes work, and I thought there might be some interest in this.
Why index?
The first thing to note is that the purpose of the index is to avoid scanning large tables (or worse, huge joins of large tables). Why? Because full table scans are slow. Of course, full table scans of really small tables probably won’t be noticed. Conversely, if the results are going to be pretty much everything in the table, then using an index to find that isn’t going to help.
So if we have a table of places in various parts of the world, and an application that displays that data on a map, then when the user zooms to world scale, you might be better off turning off the display of your data until the user zooms in to something more sensible. Of course, in this case, the drawing all those things to the screen is probably going to take much more time than a full table scan, but you hopefully get the idea.
So we could index on some property (like the name, or some feature class), but SpatiaLite does its magic in the field of coordinates (typically northings and eastings, or latitude and longitude).
Importing Data
So lets look at an example. We’ll need some data, and the wonderful people at /http://www.geonames.org/ have made some available. I downloaded the “allcountries.zip” data set from http://download.geonames.org/export/dump/. It is about 200MB zipped.
Unzip allcountries.zip, and you will now have allcountries.txt, which is about 1GB of comma separated values.
Now import it into SpatiaLite, using something like:
.timer ON BEGIN; -- Open the original tab separate values geonames file CREATE VIRTUAL TABLE geonamesvt USING VirtualText("allCountries.txt", "UTF-8", 0); -- Create the real table CREATE TABLE geonames (geonameid INT, name TEXT, asciiname TEXT, alternatenames TEXT, latitude DOUBLE, longitude DOUBLE, featureclass TEXT, featurecode TEXT, countrycode TEXT, cc2 TEXT, admin1code TEXT, admin2code TEXT, admin3code TEXT, admin4code TEXT, population LONG, elevation INT, gtopo30alt INT, timezone TEXT, modificationdate TEXT); -- import the contents of the temporary geonames table into the real table INSERT INTO geonames (geonameid, name, asciiname, alternatenames, latitude, longitude, featureclass, featurecode, countrycode, cc2, admin1code, admin2code, admin3code, admin4code, population, elevation, gtopo30alt, timezone, modificationdate) SELECT COL001,COL002,COL003,COL004,COL005,COL006,COL007,COL008,COL009,COL010,COL011,COL012,COL013,COL014,COL015,COL016,COL017,COL018,COL019 FROM geonamesvt; SELECT AddGeometryColumn("geonames", "Geometry", 4326, "POINT", "XY"); SELECT CreateSpatialIndex("geonames", "Geometry"); UPDATE geonames SET Geometry=MakePoint(longitude,latitude,4326); DROP TABLE geonamesvt; -- COMMIT that ANALYZE geonames; COMMIT; VACUUM;
If you save that SQL script to a file (say “importgeonames.sql”), you can use it as an input to spatialite on the command line:
spatialite geonames.sqlite < importgeonames.sql
.
This is going to take a long time – about 25 minutes on my laptop. The output will probably be about 2GB.
Full table scan
Now we can open that using spatialite:
spatialite geonames.sqlite
.
If you turn on timers, you can see how long the queries are taking. Lets do that, and then run a full table scan:
spatialite> .timer on spatialite> select COUNT(name) from geonames; 8178593 CPU Time: user 2.088131 sys 0.900056
You might want to do that a few times to see if its fairly consistent.
So know we know we have about 8 million entries in the table, and that a full table scan takes about 2 seconds. So if we see any query taking that kind of time, and not returning millions of results, we need to find out why.
Since this a bit geographical, I thought I’d look for places around Greenwich (of Observatory fame, but there is a lot of cool stuff out there if you ever get a chance to visit). Just to make it interesting, lets find places around there that start with “Green”
spatialite> SELECT geonames.name, geonames.latitude, geonames.longitude FROM geonames WHERE geonames.latitude > 51 AND geonames.latitude < 52 AND geonames.longitude > -0.5 and geonames.longitude < 0.5 AND geonames.name LIKE "Green%"; Greenwich|51.47785|-0.01176 Greensted|51.70286|0.21586 Greenhithe|51.45026|0.28539 Greenford|51.52866|-0.35508 Greenwich|51.45|0.05 Green Parks House Psychiatric Unit|51.36866|0.05864 Greenhill|51.58342|-0.3386 Green Man Roundabout|51.57223|0.01635 Greenwich DLR Station|51.4781|-0.014 Greenhithe Railway Station|51.4506|0.27923 Greenwich Railway Station|51.4781|-0.014 Greenwich DLR Station|51.47805|-0.01574 Green End Farm|51.96373|-0.37371 Greenford Station|51.5426|-0.3463 Greenford Underground Station|51.54241|-0.34598 Green Park Underground Station|51.50667|-0.1427 Green Park|51.50413|-0.14415 Greenwich Park|51.47663|0.00225 CPU Time: user 2.092131 sys 0.984061
So that query says “give me anything located in the area bounded by 0.5W and 0.5E and 51N and 52N degrees that has a name starting with Green”.
Those results look reasonable, but clearly we’re doing a full table scan to find them. In this query, I also used the latitude and longitude column values rather than SpatiaLite’s Geometry. We could try it with Geometry:
SELECT geonames.name, geonames.latitude, geonames.longitude FROM geonames WHERE ST_Y(geonames.Geometry) > 51 AND ST_Y(geonames.Geometry) < 52 AND ST_X(geonames.Geometry) > -0.5 and ST_X(geonames.Geometry) < 0.5 AND geonames.name LIKE "Green%"; Greenwich|51.47785|-0.01176 Greensted|51.70286|0.21586 Greenhithe|51.45026|0.28539 Greenford|51.52866|-0.35508 Greenwich|51.45|0.05 Green Parks House Psychiatric Unit|51.36866|0.05864 Greenhill|51.58342|-0.3386 Green Man Roundabout|51.57223|0.01635 Greenwich DLR Station|51.4781|-0.014 Greenhithe Railway Station|51.4506|0.27923 Greenwich Railway Station|51.4781|-0.014 Greenwich DLR Station|51.47805|-0.01574 Green End Farm|51.96373|-0.37371 Greenford Station|51.5426|-0.3463 Greenford Underground Station|51.54241|-0.34598 Green Park Underground Station|51.50667|-0.1427 Green Park|51.50413|-0.14415 Greenwich Park|51.47663|0.00225 CPU Time: user 6.408401 sys 1.108069
Thats worse (and I’m a bit surprised by how much worse). I also tried some MbrContains operations, but that was still slower – perhaps a topic for further investigation and blogging.
Using an Index
In any case, we spent all that effort up front to create an index (with the SELECT CreateSpatialIndex("geonames", "Geometry");
bit of the SQL script), so lets try using it.
select geonames.name, geonames.latitude, geonames.longitude from geonames, idx_geonames_Geometry WHERE idx_geonames_Geometry.ymin > 51 AND idx_geonames_Geometry.ymax < 52.0 AND idx_geonames_Geometry.xmin > -0.5 AND idx_geonames_Geometry.xmax < 0.5 AND idx_geonames_Geometry.pkid == geonames.rowid AND geonames.name LIKE "Green%"; Greenhithe|51.45026|0.28539 Greenhithe Railway Station|51.4506|0.27923 Greenwich|51.47785|-0.01176 Greenwich DLR Station|51.4781|-0.014 Greenwich Railway Station|51.4781|-0.014 Greenwich DLR Station|51.47805|-0.01574 Green Parks House Psychiatric Unit|51.36866|0.05864 Greenwich|51.45|0.05 Greenwich Park|51.47663|0.00225 Greenford|51.52866|-0.35508 Green Park Underground Station|51.50667|-0.1427 Green Park|51.50413|-0.14415 Greensted|51.70286|0.21586 Green End Farm|51.96373|-0.37371 Greenford Station|51.5426|-0.3463 Greenford Underground Station|51.54241|-0.34598 Greenhill|51.58342|-0.3386 Green Man Roundabout|51.57223|0.01635 CPU Time: user 0.016001 sys 0.000000
OK, so that is better – a couple of orders of magnitude faster.
How does the index work?
The magic to this is the use of idx_geonames_Geometry
table to find the row IDs that match the rowids in the geonames table. The spatial query in the idx_geonames_Geometry
table is optimised by SQLite using the R*Tree algorithm.
The idx_geonames_Geometry
table (or any other two dimensional R*Tree index) looks like:
spatialite> .schema idx_geonames_Geometry CREATE VIRTUAL TABLE "idx_geonames_Geometry" USING rtree( pkid, xmin, xmax, ymin, ymax);
pkid
is the primary key ID value – an Integer. The other four columns define the bounding box, and are floating point values.
That VIRTUAL TABLE
bit in the schema means that the data isn’t really in that table, but will be constructed by SQLite when needed. The real data is contained in what SQLite calls “shadow tables” – they have names that start with the name of the virtual table, and end in _node
, _rowid
and _parent
.
R*Tree is optimised for doing range queries (i.e. where some database column value is in some range of values). So it can find the items (well, the row id of the items) in the table we’re really interested in, based on a bounding box [xmin...xmax, ymin...ymax]
that comes from the index table.
Things to watch out for
The R*Tree doesn’t store exact minimum bounding rectangles (MBRs), but instead uses an approximation based on 32-bit floating point numbers, although the error is tiny (of the order 0.000012% according to the SQLite documentation). New versions of SQLite always round in the direction that makes sure that the location of items are included in the bounding box (i.e. xmin
and ymin
get rounded down, xmax
and ymax
get rounded up). That does what you want for the common case (geometry in an region of interest). It may not be what you want if you are doing points “not in” (e.g. excluding an inner region). It also may not be what you want if you are trying for overlap (i.e. where equal to matters). See the SQLite documentation for R*Tree if that matters to you.
The x coordinate is longitude (or easting). The y coordinate is latitude (or northing).
When selecting the bounding rectangle based on maps on the screen, make sure you get the largest rectangle. For example, for some projections, you can just use any two opposing corners. That may not work everywhere though, since as you approach the poles, the longitude range at the top of the screen isn’t the same as the longitude range at the bottom of the screen. Also, if the map can be rotated, the screen to geo conversion may not produce the greatest latitude number at the top of the screen.
Summary
So by eliminating the number of entries that we have to do something difficult on (and the SQLite LIKE isn’t that computationally intensive, compared to some of the more esoteric spatial operations that SpatiaLite can do) we can get to the “feasible set” very quickly, and put the CPU cycles to work most effectively.