03.14.13
Posted in Maps at 12:11 am by ducky
I imagine some epidemiologist somewhere, who has statistics on the something like the measles rate by postal code, who wants to see if there is a geographic trend, like if warmer places have more measles. She has a spreadsheet with the postal codes and the number of cases in that postal code, and wants to turn that into a map where each postal code’s colour represents the number of cases per capita in that postal code.
She should not need to know what a shapefile is, should not need to know that the name of the map type she wants is “choropleth”, and should not have to dig up the population of that postal code. The boundaries of the jurisdictions she cares about (postal codes, in this case) and the population are well-understood and don’t change often; the technology to make such a map out to be invisible to her. She should be able to upload a spreadsheet and get her map.
I find it almost morally wrong that it is so hard to make a map.
Making that possible would be my dream job. It is a small enough job that I could do it all by myself, but it is a large enough job that it would effectively prevent me from doing other paying work for probably about a year, and I can’t see a way to effectively monetize it.
The challenges are not in creating a map that is displayed onscreen — that’s the easy part. To develop this service would require (in order of difficulty):
- code and resources to enable users to store their data and map configurations securely;
- code to pick out jurisdiction names and data columns from spreadsheets, and/or a good UI to walk the user through picking the columns;
- fuzzy matching code which understands that e.g. “PEI” is really “Prince Edward Island”, a province in Canada; that “St John, LA” is actually “Saint John the Baptist Parish”, a county-equivalent in Louisiana; that there are two St. Louis counties in Misouri; that Nunavut didn’t exist before 1999;
- code to allow users to share their data if they so choose;
- UI (and underlying code) to make the shared data discoverable, usable, and combinable;
- code (and perhaps UI) to keep spammers from abusing the system;
- code to generate hardcopy of a user’s map (e.g. PNG or PDF);
- code for a user account mechanism and UI for signing in
This service would give value to many people: sales managers trying to figure out how to allocate sales districts, teachers developing lesson plans about migration of ethnic minorities, public health officials trying to understand risk factors, politicians targeting niche voters, urban planning activists trying to understand land use factors, etc.
Unfortunately, for the people to whom this really matters, if they already have money, they already ponied up the money for an ESRI mapping solution. If they don’t have money, then they won’t pay for this service.
GeoCommons tries to do this. GeoCommons makes maps from users’ data, and you stores and shares users’ data, but their map making is so slow it is basically unusable, and it is not easy to combine data from multiple sources into one map.
It might be that one of the “big data” organizations, e.g. Google or Amazon, might provide this as an enticement for getting people to use their other services. Google, for example, has a limited ability to do this kind of thing with their Fusion Tables (although if you want to do jurisdictions other than countries, then you have to provide a shapefile). Amazon provides a lot of data for use with the Amazon Web Services.
However, it would be almost as difficult for Google or Amazon to monetize this service as it would for me. Google could advertise and Amazon could restrict it to users of its AWS service, but it isn’t clear to me how much money that could bring in.
If anybody does figure out a way to monetize it, or wants to take a gamble on it being possible, please hire me!
Permalink
03.12.13
Posted in Hacking, Maps at 11:54 am by ducky
In the past, when people asked me how I managed to make map tiles so quickly on my World Wide Webfoot Maps site, I just smiled and said, “Cleverness.” I have decided to publish how I optimized my map tile generation in hopes that others can use these insights to make snappier map services. I give a little background of the problem immediately below; mapping people can skip to the technical details.
Background
Choropleth maps colour jurisdictions based on some attribute of the jurisdiction, like population. They are almost always implemented by overlaying tiles (256×256 pixel PNG images) on some mapping framework (like Google Maps).
|
Map tile from a choropleth map (showing 2012 US Presidential voting results by county) |
Most web sites with choropleth maps limit the user: users can’t change the colours, and frequently are restricted to a small set of zoom levels. This is because the tiles are so slow to render that the site developers must render the tile images ahead of time and store them. My mapping framework is so fast that I do not need to pre-render all the tiles for each attribute. I can allow the users to interact with the map, going to arbitrary zoom levels and changing the colour mapping.
Similarly, when people draw points on Google Maps, 100 is considered a lot. People have gone to significant lengths to develop several different techniques for clustering markers. By contrast, my code can draw thousands very quickly.
There are 32,038 ZIP codes in my database, and my framework can show a point for each with ease. For example, these tiles were generated at the time this web page loaded them.
|
32,038 zip codes at zoom level 0 (entire world) |
|
|
Zip codes of the Southeast US at zoom level 4 |
(If the images appeared too fast for you to notice, you can watch the generation here and here. If you get excited, you can change size or colour in the URL to convince yourself that the maps framework renders the tile on the fly.)
Technical Details
The quick summary of what I did to optimize the speed of the map tile generation was to pre-calculate the pixel coordinates, pre-render the geometry and add the colours later, and optimize the database. In more detail:
Note that I do NOT use parallelization or fancy hardware. I don’t do this in the cloud with seventy gajillion servers. When I first wrote my code, I was using a shared server on Dreamhost, with a 32-bit processor and operating system. Dreamhost has since upgraded to 64-bits, but I am still using a shared server.
Calculating pixel locations is expensive and frequent
For most mapping applications, buried in the midst of the most commonly-used loop to render points is a very expensive operation: translating from latitude/longitude to pixel coordinates, which almost always means translating to Mercator projection.
While converting from longitude to the x-coordinate in Mercator is computationally inexpensive, to convert from latitude to y-coordinate using the Mercator projection is quite expensive, especially for something which gets executed over and over and over again.
A spherical mercator translation (which is actually much simpler than the actual projection which Google uses) uses one logarithmic function, one trigonometric function, two multiplications, one addition, and some constants which will probably get optimized away by the compiler:
function lat2y(a) { return 180/Math.PI * Math.log(Math.tan(Math.PI/4+a*(Math.PI/180)/2)); }
(From the Open Street Maps wiki page on the Mercator projection)
Using Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs by Agner Fog, I calculated that a tangent can take between 11 and 190 cycles, and a logarithm can take between 10 and 175 cycles on post-Pentium processors. Adds and multiplies are one cycle each, so converting from latitude to y will take between 24 and 368 cycles (not counting latency). The average of those is almost 200 cycles.
And note that you have to do this every single time you do something with a point. Every. Single. Time.
If you use elliptical Mercator instead of spherical Mercator, it is much worse.
Memory is cheap
I avoid this cost by pre-calculating all of the points’ locations in what I call the Vast Coordinate System (VCS for short). The VCS is essentially a pixel space at Google zoom level 23. (The diameter of the Earth is 12,756,200 meters; at zoom level 23, there are 2^23 tiles, and each tile has 256 or 2^8 pixels, so there are 2^31 pixels around the equator. This means that the pixel resolution of this coordinate system is approximately .6cm at the equator, which should be adequate for most mapping applications.)
Because the common mapping frameworks work in powers of two, to get the pixel coordinate (either x or y) at a given zoom level from a VCS coordinate only requires one right-shift (cost: 1 cycle) to adjust for zoom level and one bitwise AND (cost: 1 cycle) to pick off the lowest eight bits. The astute reader will remember that calculating the Mercator latitude takes, for the average post-Pentium processor, around 100 times as many cycles.
Designing my framework around VCS and the Mercator does make it harder to change the projection, but Mercator won: it is what Google uses, what Yahoo uses, what Bing uses, and even what the open-source Leaflet uses. If you want to make map tiles to use with the most common services, you use Mercator.
Furthermore, should I decide that I absolutely have to use a different projection, I would only have to add two columns to my points database table and do a bunch of one-time calculations.
DISTINCT: Draw only one ambiguous point
If you have two points which are only 10 kilometers apart, then when you are zoomed way in, you might see two different pixels for those two points, but when you zoom out, at some point, the two points will merge and you will only see one pixel. Setting up my drawing routine to only draw the pixel once when drawing points is a big optimization in some circumstances.
Because converting from a VCS coordinate to a pixel coordinate is so lightweight, it can be done easily by MySQL, and the DISTINCT keyword can be used to only return one record for each distinct pixel coordinate.
The DISTINCT keyword is not a big win when drawing polygons, but it is a HUGE, enormous win when drawing points. Drawing points is FAST when I use the DISTINCT keyword, as shown above.
For polygons, you don’t actually want to remove all but one of a given point (as the DISTINCT keyword would do), you want to not draw two successive points that are the same. Doing so is a medium win (shaving about 25% of the time off) for polygon drawing when way zoomed out, but not much of a win when way zoomed in.
Skeletons: Changing the colour palette
While the VCS speed improvement means that I could render most tiles in real time, I still could not render tiles fast enough for a good user experience when the tiles had very large numbers of points. For example, the 2000 Census has 65,322 census tracts; at zoom level 0, that was too many to render fast enough.
Instead, I rendered and saved the geometry into “skeletons”, with one set of skeletons for each jurisdiction type (e.g. census tract, state/province, country, county). Instead of the final colour, I filled the polygons in the skeleton with an ID for the particular jurisdiction corresponding to that polygon. When someone asked for a map showing a particular attribute (e.g. population) and colour mapping, the code would retrieve (or generate) the skeleton, look up each element in the colour palette, decode the jurisdictionId, look up the attribute value for that jurisdictionId (e.g. what is the population for Illinois?), use the colour mapping to get the correct RGBA colour, and write that back to the colour palette. When all the colour palette entries had been updated, I gave it to the requesting browser as a PNG.
While I came up with the idea of fiddling the colour palette independently, it is not unique to me. A friend also came up with this idea independently, for example. What I did was take it a bit farther: I modified the gd libraries so they had a 16-bit colour palette in the skeletons which I wrote to disk. When writing out to PNG, however, my framework uses the standard format. I then created a custom version of PHP which statically linked my custom GD libraries.
(Some people have asked why I didn’t contribute my changes back to gd. It’s because the pieces I changed were of almost zero value to anyone else, while very far-reaching. I knew from testing that my changes didn’t break anything that I needed, but GD has many many features, and I couldn’t be sure that I could make changes in such a fundamental piece of the architecture without causing bugs in far-flung places without way more effort than I was willing to invest.)
More than 64K jurisdictions
16 bits of palette works fine if you have fewer than 64K jurisdictions on a tile (which the 2000 US Census Tract count just barely slid under), but not if you have more than 64K jurisdictions. (At least not with the gd libraries, which don’t reuse a colour palette entry if that colour gets completely overwritten in the image.)
You can instead walk through all the pixels in a skeleton, decode the jurisdiction ID from the pixel colour and rewrite that pixel instead of walking the colour palette. (You need to use a true-colour image if you do that, obviously.) For large numbers of colours, changing the colour palette is no faster than walking the skeleton; it is only a win for small numbers of colours. If you are starting from scratch, it is probably not worth the headache of modifying the graphics library and statically linking in a custom version of GD into PHP to walk the colour palette instead of walking the true-colour pixels.
(I had to modify GD anyway due to a bug I fixed in GD which didn’t get incorporated into the GD release for a very long time. My patch finally got rolled in, so you don’t need to do that yourself.)
My framework now checks to see how many jurisdiction are in the area of interest; if there are more than 64K, it creates a true-colour image, otherwise a paletted image. If the skeleton is true-colour, it walks pixels, otherwise it walks the palette.
Credits: My husband implemented the pixel-walking code.
On-demand skeleton rendering
While I did pre-render about 10-40 tiles per jurisdiction type, I did not render skeletons for the vast majority of tiles. Instead, I render and save a skeleton only when someone asks for it. I saw no sense in rendering ahead of time a high-maginification tile of a rural area. Note that I could only do this on-demand skeleton generation because the VCS speedup made it so fast.
I will also admit that I did generate final tiles (with the colour properly filled in, not a skeleton) to zoom level 8 for some of my most commonly viewed census tract attributes (e.g. population by census tract) with the default value for the colour mapping. I had noticed that people almost never change the colour mapping. I did not need to do this; the performance was acceptable without doing so. It did make things slightly snappier, but mostly it just seemed to me like a waste to repeatedly generate the same tiles. I only did this for US and Australian census jurisdictions.
MySQL vs. PostGIS
One happy sort-of accident is that my ISP, Dreamhost, provides MySQL but does not allow PostGIS. I could have found myself a different ISP, but I was happy with Dreamhost, Dreamhost was cheap, and I didn’t particularly want to change ISPs. This meant that I had to roll my own tools instead of relying upon the more fully-featured PostGiS.
MySQL is kind of crummy for GIS. Its union and intersection operators, for example, use bounding boxes instead of the full polygon. However, if I worked around that, I found that for what I needed to do, MySQL was actually faster (perhaps because it wasn’t paying the overhead of GIS functions that I didn’t need).
PostGIS’ geometries are apparently stored as serialized binary objects, which means that you have to pay the cost of deserializing the object every time you want to look it or one of its constituent elements. I have a table of points, a table of polygons, and a table of jurisdictionIds; I just ask for the points, no deserialization needed. Furthermore, at the time I developed my framework, there weren’t good PHP libraries for deserializing WKB objects, so I would have had to write my own.
Note: not having to deserialize is only a minor win. For most people, the convenience of the PostGIS functions should be worth the small speed penalty.
Database optimization
One thing that I did that was entirely non-sexy was optimizing the MySQL database. Basically, I figured out where there should be indices and put them there. This sped up the code significantly, but it did not take cleverness, just doggedness. There are many other sites which discuss MySQL optimization, so I won’t go into that here.
Future work: Feature-based maps
My framework is optimized for making polygons, but it should be possible to render features very quickly as well. It should be possible to, say, decide to show roads in green, not show mountain elevation, show cities in yellow, and not show city names.
To do this, make a true-colour skeleton where each bit of the pixel’s colour corresponds to the display of a feature. For example, set the least significant bit to 1 if a road is in that pixel. Set the next bit to 1 if there is a city there. Set the next bit to 1 if a city name is displayed there. Set the next bit to 1 if the elevation is 0-500m. Set the next bit to 1 if the elevation is 500m-1000m. Etc. You then have 32 feature elements
which you can turn on and off by adjusting your colour mapping function.
If you need more than 32 feature elements, then you could use two skeletons and merge the images after you do the colour substitutions.
You could also, if you chose, store elevation (or depth) in the pixel, and adjust the colouring of the elevation with your colour mapping function.
Addendum: I meant to mention, but forgot, UTFGrid. UTFGrid has an array backing the tile for lookup of features, so it is almost there. However, it is just used to make pop-up windows, not (as near as I can tell) to colour polygons.
Permalink