COVID19 Animation

Posted in Health, Maps at 3:11 pm by ducky

I have just published a video to YouTube which builds off of the COVID19 maps I mentioned in my last blog post.

After a brief introduction to cartograms, it shows rolling seven-day average COVID-19 case counts by county, animated by day.

My husband commented that it was easy to see how the suffering moved around. Big dark areas show where there was a lot of suffering. Small dark areas show where there was intense, concentrated suffering.

As I mentioned in my last post, the suffering moved around quite a bit, and that is really easy to see in the video.

Check it out!


COVID19 in the US

Posted in Maps at 12:09 pm by ducky

I have been working pretty intimately with US COVID19 cases and deaths data for the past six weeks because of my COVID map application, and here are my big takeaways:

It sucks to go first.

New York and New Orleans got absolutely hammered by COVID early on, and they had a huge number of both cases and deaths. (Yellow is confirmed cases, red is confirmed deaths. Dots are individual days; lines are seven-day rolling averages. Cases are per ten thousand people while deaths are per one hundred thousand people. All of the graphs in this post are at the same scale.)

New York had a huge spike in confirmed cases, followed almost immediately by a huge spike in deaths, so immediately that it’s hard to tell the red line from the yellow line. (I believe that the odd spike in mid-May is due to a reporting issue; about 1500 deaths were reported on that day, but I bet they didn’t happen on that day.)

Note that you need to be a little careful with your interpretation here because there were almost no tests available outside of research labs until mid-March, and very limited tests for quite some time afterwards. The case count is almost certainly an undercount; there is almost no lag between the case peak and the death peak because people didn’t get diagnosed until they were close to death.

Similarly, New Orlean’s first spike was followed almost immediately by a huge spike in deaths. But look at the later two spikes: deaths did not jump significantly after the later two spikes.

I see this over and over. For example, Miami got hit very hard in July:

While Miami did see an increase in deaths, it was not nearly as bad as the deaths New York and Miami saw in April.

Why is this? Unclear. I was not the only person to notice it: here’s a Washington Post article about the phenomenon.

These are a bunch of possibilities:

  1. The virus has mutated and gotten less deadly. The article above says that genetic studies do not bear that out, but doesn’t give any more details.
  2. More testing. While it is true that more testing makes the case count higher, that isn’t a big enough effect to account for the numbers.
  3. The virus is hitting younger people. Another way to look at it might be “we have learned how to protect nursing homes”. I don’t have enough data to be able to tell how big a factor that is.
  4. More people have (at least limited) immunity due to prior non-COVID19 corona virus infections. (Many “common cold” viruses are coronaviruses.) I don’t think that would explain it, unless you think that people in New Orleans didn’t have colds before March but did later on.
  5. We are better at treating the disease. This is certainly true. Giving steroids at the right time appears to be significantly better than not.
  6. People are getting infected lower doses of virus and hence are not getting as sick. This also appears to be the case. The viral load of people showing up at hospitals has been dropping, according to the Washington Post article. This is almost certainly due to people wearing masks, standing six feet apart, avoiding restaurants, etc.

The virus moves.

COVID-19 hotspots move around. First, it was New York.

13 April 2020

(This is a cartogram, where counties’ areas are proportional to their population. Big regions mean cities; small regions are rural. Yellow is good, brown is bad.)

In the summer the Sunbelt got hit.

26 July 2020

Currently, the hotspots are in the northern central-west US — Wisconsin, Wyoming, Montana, Utah, etc., and scattered rural areas.

1 October 2020

There are very few places where the virus has not gotten to — yet. Places that have had outbreaks, particularly which have had nasty outbreaks (I’m looking at you, Northeast), are mostly doing a good job of managing now.

It almost seems like places aren’t able to learn from everyone else’s misery, they need to have a nasty outbreak themselves to truly believe that it can happen to them.

Reopening universities in person was a bad idea.

Cases spiked in early September in college towns:

Home of the University of Illinois
Home of the University of Iowa
Home of Indiana University
Home of the University of Missouri
Home of the University of Pennsylvania

Go VT, NH, ME, PA!

The far northeastern states have done an outstanding job, with almost no cases — yet.

Aside from some minor outbreaks early on and an outbreak right now in Centre County — which is almost certainly due to Penn State University reopening (see above) — Pennsylvania has done a really good job — so far.

Western Oregon and Western Washington have also done a good job — so far. Colorado has mostly done a good job — so far.


Unemployment map

Posted in Maps at 8:54 pm by ducky

I have developed some maps which show seasonally adjusted unemployment rates, by month, by county, for the past 23 years, as either a cartogram or as a standard mercator projection.

One of the stories that my unemployment map tells clearly is just how bad the financial meltdown in 2008 was, and how sudden.

This might surprise you if you saw the video of unemployment by county by LaToya Egwuekwe which went viral.  It is a fine video for what it is, but I think it is slightly misleading.

Her video showed pretty clearly that things started to slide a little bit in late 2008, but the real rise in unemployment hit in 2009, with the worst being in June 2010.


Still from unemployment video by LaToya Egwuekwe

I was quite surprised when I saw this, as it didn’t match what I knew of the situation.

Below is what the national unemployment rate actually looked like over time. (The seasonally adjusted rate is darker; the unadjusted rate is lighter.)


The seasonally adjusted unemployment rate actually peaked in October 2009, not in mid-2010.

Why the difference?  If you look at the fine print of Egwuekwe’s video, it is a map of the 12-month rolling average of unemployment, which is a lagging indicator.  This means that, for example, in October 2008, right after the financial meltdown, the unemployment numbers she used for the map included November 2007 through September 2007 — which were generally pretty good.  Similarly, the unemployment later seemed to be higher than it really was because the rolling average included the previous year — which included very high unemployment.

Here is a comparison of the seasonally adjusted unemployment rate and the 12-month rolling average for 2007 through the present:seasonalVsRolling

It is perfectly understandable that Ms. Egwuekwe would use the 12-month rolling average instead of the seasonally-adjusted unemployment.  Why?  Because the Bureau of Labor Statistics does not publish seasonally adjusted unemployment rates at the county level, and it is a royal pain to calculate the seasonally adjusted unemployment for each county individually.  (I should know, because I did calculate seasonally adjusted unemployment for each county individually.)

She could have used the unadjusted unemployment rate for a county, but there can be so much seasonal variation that it is hard to see the longer-term trends.  This is particularly true in agricultural communities.  For example, here is a graph which shows the unadjusted unemployment rate for Clark County, IL in light blue, and the seasonally adjusted rate in darker blue.


(For comparison, the national unemployment rate is in light green.)

Thus, if Ms. Egwuekwe had used the raw unadjusted unemployment numbers for her video, the short-term fluctuations would have made it difficult to see the longer-term trends.  It was a reasonable tradeoff to make.

One other complaint I have about maps like hers, and well, almost all thematic maps, is that they give too much visual weight to rural areas compared to urban areas.  When you see a map like the one at the top of the post, you probably do sort of a visual averaging to figure out what the overall unemployment is like.  However, because there are a huge number of people in cities, and because cities are very small compared to the vast stretches of rural areas, what you see in rural areas dominates the map.  A population-based cartogram — where jurisdictions are distorted so that the area of a jurisdiction is proportional to its area — gives a map which is less misleading.

Again, it’s completely understandable that people would normally use maps which show area instead of population.  It’s a royal pain to generate a cartogram.  (I should know, I did generate cartograms.)

Here is a population-based cartogram of the unemployment in June 2008, before the financial crisis hit, when the US seasonally adjusted unemployment rate was 4.6%:


(Green is good; yellow is bad.  Full green is 0% unemployment; full yellow is 17% unemployment.  States are outlined.)

Now here is an image from the worst month after the financial meltdown, October 2009:



The financial crisis hit fast, and it hit hard.

The good news is that it is getting better.  Here is a map of the latest month which I have data for, June 2013:




Note: it is a little difficult to recognize places when they are distorted.  On my unemployment maps web page, you can show city names; you can click on a county to get more information about that county.



City Labels on Maps

Posted in Maps at 6:35 pm by ducky

When I watched people look at my cartograms, I saw that they frequently had trouble figuring out which distorted shapes on the map corresponded to which shapes on the more familiar non-distorted maps they were familiar with.

Cartogram without labels

Clearly, I needed to give them some reference points.

The first thing that I thought of doing was distorting Open Street Map tiles to match the cartographic distortion.  That didn’t work out so well, and I decided that I could get 90% of the benefit just by showing city names.

The next thing I tried was to make tiles with city names on them.  This turned out to be difficult because city names frequently crossed tile boundaries, and the city names were variable widths.

What worked: I made markers with custom icons, where the icon was an image of the city name, and placed the icons at the appropriate location on the cartogram.  This worked well: the city names moved with the background image when dragged, and creating the custom icons was quite lightweight.

Having solved how to show city names, I then needed to figure out when to show city names.  Clearly I should show bigger cities first, but if you zoom in, you want to see smaller cities.  I don’t know how Google does it, but they probably can afford to decide on a city-by-city basis at which zoom level that city should appear, but there are an awful lot of cities and an awful lot of zoom levels, and my name’s not Google.  I can’t afford to do that.

I experimented with coming up with a formula to specify what population threshold to use for which zoom level, but I was unsatisfied with that.  I couldn’t find a formula which would show enough cities in lightly-populated areas but not too many in densely populated areas.

The next thing I tried was to figure out which area of the map was showing, and to label only the top six (by population) visible cities. This means that you see only really big cities when zoomed out a lot:

Six biggest cities labelled

But when you zoom in (or move so that one of the labelled cities stops being in the view), more cities appeared:

(When I did this, I was surprised to find out how small (in population) some major cities are.  Jacksonville, FL is bigger than Miami.  Indianapolis is bigger than Boston.  El Paso is bigger than Seattle.  Now, partly that’s because I’m labelling cities and not metro areas, but it still surprised me.)

Even only showing six, there were still times when the cities names got crowded.  Also, when way zoomed out, big sections of the country didn’t have any labels.  What I finally did was look at the top 23 visible cities, and if there was a larger city nearby (in pixels), I skipped the smaller city.  This seems to work really well:

It sure beat keeping a list of which cities to show at which zoom level!



Cartographic “suicide caucus” map

Posted in Art, Maps at 10:13 am by ducky

Ryan Lizza posted an article and map in the New York Times which showed the locations of the US Congressional Districts whose Representatives backed the US federal government’s shutdown in an attempt to defund Obamacare.  Here is a version of the map which I made, with yellow for the “suicide caucus”:


The article and map were good.  I liked them.  But there’s a real danger when looking at a map that you will — consciously or unconsciously — mentally equate the relative number of pixels on a map into the relative number of people.  

Unfortunately, the geographical distribution of people is wildly, wildly uneven: from 0.04 people per square mile in the Yukon-Koyukuk Borough to more than 70,000 people per square mile in Manhattan.  Yes, there are 1.75 MILLION more people per square mile in Manhattan than rural Alaska.

The map above makes it look like a higher percentage of congresspeople supported the shutdown than actually did.  If you look at the shutdown districts on a cartogram — a map where the area of a congressional is distorted to be proportional to the population of that district — instead, it becomes even more clear just how few representatives were involved.


I have made a web page where you can explore congressional districts yourself.

In addition to seeing the maps above, you can also see thematic maps (both cartogram and regular) of

  • percent without insurance
  • percent white
  • median family income
  • median gross rent
  • median home value
  • percent living in poverty
  • percent of children living in poverty
  • percent of elderly living in poverty
  • median age
  • congressional election results from 2012

Additionally, if you click on a congressional district, you can see who represents that district, plus all of the above values for that district.  If you click on the map away from a congressional district, you can see a table comparing the shutdown districts with the non-shutdown districts.

You can also look at maps for the presidential 2012 election results and seasonally-adjusted unemployment, but because those are county-based figures, you can’t do a strict comparison between shutdown/non-shutdown districts, so they aren’t in the comparison table or the per-district summaries.

Implementation notes

I used ScapeToad to generate the cartograms.  It was a lot of trial and error to figure out settings which gave good results.  (For those of you who want to make cartograms themselves: I defined cartogram parameters manually, and used 6000 points for the first grid, 1024 for the second, and 3 iterations.)

I used QGIS and GRASS to clean it up afterward (to remove slivers and little tiny holes left between the districts sometimes) and to merge congressional districts to make cartogram shapes.

NB: I use the state boundaries which I created by merging the cartogramified congressional districts, even for the maps which are based on counties (e.g. unemployment and the presidential results).  It is pretty impressive how well the merged congressional district state boundaries match the county cartogram state borders.  It wasn’t at all obvious to me that would happen.  You could imagine that ScapeToad might have been more sensitive to the shapes of the counties, but somehow it all worked.  Kudos to ScapeToad!

At some zoom levels, not all the district boundaries get drawn.  That’s because I don’t want the map to be all boundary when way zoomed out, so I check the size before drawing boundaries.  If the jurisdiction is too small, I don’t draw the boundary.

As a starting point, I used the Congressional District shapefiles from the US Census Bureau. For the population used for generating the cartogram, I used the Census Bureau American Community Survey 2011 values.  For the other map attributes, I specify the source right under the “Color areas by”.

I made the map tiles myself, using custom PHP code.  You can read more about it in Optimizing Map Tile Generation.  I came up with my own algorithm for showing city labels.


Variably-sized points on maps

Posted in Hacking, Maps at 1:20 am by ducky

The Huffington Post made a very nice interactive map of homicides and accidental gun deaths since the shooting at Sandy Hook.  It’s a very nice map, but has the (very common problem) that it mostly shows where the population density is high: of course you will have more shootings if there are more people.

I wanted to tease out geographical/political effects from population density effects, so I plotted the gun deaths on a population-based cartogram.  Here was my first try.  (Click on it to get a bigger image.)

Unfortunately, the Huffington Post data gives the same latitude/longitude for every shooting in the same city.  This makes it seem like there are fewer deaths in populated areas than there really are.  So for my next pass, I did a relatively simple map where the radius of the dots was proportional to the square root of the number of gun deaths (so that the area of the dot would be proportional to the number of gun deaths).



This also isn’t great.  Some of the dots are so big that they obscure other dots, and you can’t tell if all the deaths were in one square block or spread out evenly across an entire county.

For the above map, for New York City, I dug through news articles to find the street address of each shooting and geocoded it (i.e. determined the lat/long of that specific address). You can see that the points in New York City (which is the sort of blobby part of New York State at the south) seem more evenly distributed than for e.g. Baltimore.  Had I not done that, there would have been one big red dot centered on Manhattan.

(Side note: It was hugely depressing to read article after article about people’s — usually young men’s — lives getting cut short, usually for stupid, stupid reasons.)

I went through and geocoded many of the cities.  I still wasn’t satisfied with how it looked: the size balance between the 1-death and the multiple-death circles looked wrong.  It turns out that it is really hard — maybe impossible — to get area equivalence for small dots.  The basic problem is that with radiuses are integers, limited by pixels.  In order to get the area proportional to gun deaths, you would want the radius to be proportional to the square root of the number of gun deaths, or {1, 1.414, 1.732 2.0, 2.236, 2.449, 2.645, 2.828, 3.000}, the rounded numbers will be {1, 1, 2, 2, 2, 2, 3, 3, 3}; instead of areas of {pi, 2*pi, 3*pi, 4*pi, …}, you get {pi, pi, 4*pi, 4*pi, 4*pi, 9*pi, 9*pi, 9*pi}.

Okay, fine.  We can use a trick like anti-aliasing, but for circles: if the square root of the number of gun deaths is between two integer values (e.g. 2.236 is between 2 and 3), draw a solid circle with a radius of the just-smaller integer (for 2.236, use 2), then draw a transparent circle with a radius of the just-higher integer (for 2.236, use 3), with the opacity higher the closer the square root is to the higher number. Sounds great in theory.

In practice, however, things still didn’t look perfect.  It turns out that for very small dot sizes, the dots’ approximations to circles is pretty poor.  If you actually zoom in and count the pixels, the area in pixels is {5, 13, 37, 57, 89, 118, 165, …} instead of what pi*R^2 would give you, namely {3.1, 12.6, 28.3, 50.3, 78.5, 113.1, 153.9, …}.


But wait, it’s worse: combine the rounding radiuses problem with the problem of approximating circles, and the area in pixels will be {5, 5, 13, 13, 13, 13, 37, 37, 37, …}, for errors of {59.2%, -60.2% -54.0% -74.1% -83.4% -67.3% -76.0%, …}.  In other words, the 1-death dot will be too big and the other dots will be too small.  Urk.

Using squares is better.  You still have the problem of the rounding the radius, but you don’t have the circle approximation problem.  So you get areas in pixels of {1, 1, 4, 4, 4, 9, 9, 9, …} instead of {1, 2, 3, 4, 5, 6, 7, 8, …} for errors of {0.0%, -50.0%, 33.3%, 0.0%, -20.0%, -33.3%, 28.6%, …} which is better, but still not great.

Thus: rectangles:

Geocoding provided by GeoCoder.ca



My Dream Job

Posted in Maps at 12:11 am by ducky

I imagine some epidemiologist somewhere, who has statistics on the something like the measles rate by postal code, who wants to see if there is a geographic trend, like if warmer places have more measles. She has a spreadsheet with the postal codes and the number of cases in that postal code, and wants to turn that into a map where each postal code’s colour represents the number of cases per capita in that postal code.

She should not need to know what a shapefile is, should not need to know that the name of the map type she wants is “choropleth”, and should not have to dig up the population of that postal code. The boundaries of the jurisdictions she cares about (postal codes, in this case) and the population are well-understood and don’t change often; the technology to make such a map out to be invisible to her. She should be able to upload a spreadsheet and get her map.

I find it almost morally wrong that it is so hard to make a map.

Making that possible would be my dream job. It is a small enough job that I could do it all by myself, but it is a large enough job that it would effectively prevent me from doing other paying work for probably about a year, and I can’t see a way to effectively monetize it.

The challenges are not in creating a map that is displayed onscreen — that’s the easy part. To develop this service would require (in order of difficulty):

  • code and resources to enable users to store their data and map configurations securely;
  • code to pick out jurisdiction names and data columns from spreadsheets, and/or a good UI to walk the user through picking the columns;
  • fuzzy matching code which understands that e.g. “PEI” is really “Prince Edward Island”, a province in Canada; that “St John, LA” is actually “Saint John the Baptist Parish”, a county-equivalent in Louisiana; that there are two St. Louis counties in Misouri; that Nunavut didn’t exist before 1999;
  • code to allow users to share their data if they so choose;
  • UI (and underlying code) to make the shared data discoverable, usable, and combinable;
  • code (and perhaps UI) to keep spammers from abusing the system;
  • code to generate hardcopy of a user’s map (e.g. PNG or PDF);
  • code for a user account mechanism and UI for signing in

This service would give value to many people: sales managers trying to figure out how to allocate sales districts, teachers developing lesson plans about migration of ethnic minorities, public health officials trying to understand risk factors, politicians targeting niche voters, urban planning activists trying to understand land use factors, etc.

Unfortunately, for the people to whom this really matters, if they already have money, they already ponied up the money for an ESRI mapping solution. If they don’t have money, then they won’t pay for this service.

GeoCommons tries to do this. GeoCommons makes maps from users’ data, and you stores and shares users’ data, but their map making is so slow it is basically unusable, and it is not easy to combine data from multiple sources into one map.

It might be that one of the “big data” organizations, e.g. Google or Amazon, might provide this as an enticement for getting people to use their other services. Google, for example, has a limited ability to do this kind of thing with their Fusion Tables (although if you want to do jurisdictions other than countries, then you have to provide a shapefile). Amazon provides a lot of data for use with the Amazon Web Services.

However, it would be almost as difficult for Google or Amazon to monetize this service as it would for me.  Google could advertise and Amazon could restrict it to users of its AWS service, but it isn’t clear to me how much money that could bring in.

If anybody does figure out a way to monetize it, or wants to take a gamble on it being possible, please hire me!


Optimizing Map Tile Generation

Posted in Hacking, Maps at 11:54 am by ducky

In the past, when people asked me how I managed to make map tiles so quickly on my World Wide Webfoot Maps site, I just smiled and said, “Cleverness.” I have decided to publish how I optimized my map tile generation in hopes that others can use these insights to make snappier map services. I give a little background of the problem immediately below; mapping people can skip to the technical details.


Choropleth maps colour jurisdictions based on some attribute of the jurisdiction, like population. They are almost always implemented by overlaying tiles (256×256 pixel PNG images) on some mapping framework (like Google Maps).

Map tile from a choropleth map (showing 2012 US Presidential voting results by county)

Most web sites with choropleth maps limit the user: users can’t change the colours, and frequently are restricted to a small set of zoom levels. This is because the tiles are so slow to render that the site developers must render the tile images ahead of time and store them.  My mapping framework is so fast that I do not need to pre-render all the tiles for each attribute. I can allow the users to interact with the map, going to arbitrary zoom levels and changing the colour mapping.

Similarly, when people draw points on Google Maps, 100 is considered a lot. People have gone to significant lengths to develop several different techniques for clustering markers. By contrast, my code can draw thousands very quickly.

There are 32,038 ZIP codes in my database, and my framework can show a point for each with ease. For example, these tiles were generated at the time this web page loaded them.

32,038 zip codes at zoom level 0 (entire world)
Zip codes of the Southeast US at zoom level 4

(If the images appeared too fast for you to notice, you can watch the generation here and here. If you get excited, you can change size or colour in the URL to convince yourself that the maps framework renders the tile on the fly.)

Technical Details

The quick summary of what I did to optimize the speed of the map tile generation was to pre-calculate the pixel coordinates, pre-render the geometry and add the colours later, and optimize the database. In more detail:

Note that I do NOT use parallelization or fancy hardware. I don’t do this in the cloud with seventy gajillion servers. When I first wrote my code, I was using a shared server on Dreamhost, with a 32-bit processor and operating system. Dreamhost has since upgraded to 64-bits, but I am still using a shared server.

Calculating pixel locations is expensive and frequent

For most mapping applications, buried in the midst of the most commonly-used loop to render points is a very expensive operation: translating from latitude/longitude to pixel coordinates, which almost always means translating to Mercator projection.

While converting from longitude to the x-coordinate in Mercator is computationally inexpensive, to convert from latitude to y-coordinate using the Mercator projection is quite expensive, especially for something which gets executed over and over and over again.

A spherical mercator translation (which is actually much simpler than the actual projection which Google uses) uses one logarithmic function, one trigonometric function, two multiplications, one addition, and some constants which will probably get optimized away by the compiler:

function lat2y(a) { return 180/Math.PI * Math.log(Math.tan(Math.PI/4+a*(Math.PI/180)/2)); }

(From the Open Street Maps wiki page on the Mercator projection)

Using Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs by Agner Fog, I calculated that a tangent can take between 11 and 190 cycles, and a logarithm can take between 10 and 175 cycles on post-Pentium processors. Adds and multiplies are one cycle each, so converting from latitude to y will take between 24 and 368 cycles (not counting latency). The average of those is almost 200 cycles.

And note that you have to do this every single time you do something with a point. Every. Single. Time.

If you use elliptical Mercator instead of spherical Mercator, it is much worse.

Memory is cheap

I avoid this cost by pre-calculating all of the points’ locations in what I call the Vast Coordinate System (VCS for short). The VCS is essentially a pixel space at Google zoom level 23. (The diameter of the Earth is 12,756,200 meters; at zoom level 23, there are 2^23 tiles, and each tile has 256 or 2^8 pixels, so there are 2^31 pixels around the equator. This means that the pixel resolution of this coordinate system is approximately .6cm at the equator, which should be adequate for most mapping applications.)

Because the common mapping frameworks work in powers of two, to get the pixel coordinate (either x or y) at a given zoom level from a VCS coordinate only requires one right-shift (cost: 1 cycle) to adjust for zoom level and one bitwise AND (cost: 1 cycle) to pick off the lowest eight bits. The astute reader will remember that calculating the Mercator latitude takes, for the average post-Pentium processor, around 100 times as many cycles.

Designing my framework around VCS and the Mercator does make it harder to change the projection, but Mercator won: it is what Google uses, what Yahoo uses, what Bing uses, and even what the open-source Leaflet uses. If you want to make map tiles to use with the most common services, you use Mercator.

Furthermore, should I decide that I absolutely have to use a different projection, I would only have to add two columns to my points database table and do a bunch of one-time calculations.

DISTINCT: Draw only one ambiguous point

If you have two points which are only 10 kilometers apart, then when you are zoomed way in, you might see two different pixels for those two points, but when you zoom out, at some point, the two points will merge and you will only see one pixel. Setting up my drawing routine to only draw the pixel once when drawing points is a big optimization in some circumstances.

Because converting from a VCS coordinate to a pixel coordinate is so lightweight, it can be done easily by MySQL, and the DISTINCT keyword can be used to only return one record for each distinct pixel coordinate.

The DISTINCT keyword is not a big win when drawing polygons, but it is a HUGE, enormous win when drawing points. Drawing points is FAST when I use the DISTINCT keyword, as shown above.

For polygons, you don’t actually want to remove all but one of a given point (as the DISTINCT keyword would do), you want to not draw two successive points that are the same. Doing so is a medium win (shaving about 25% of the time off) for polygon drawing when way zoomed out, but not much of a win when way zoomed in.

Skeletons: Changing the colour palette

While the VCS speed improvement means that I could render most tiles in real time, I still could not render tiles fast enough for a good user experience when the tiles had very large numbers of points. For example, the 2000 Census has 65,322 census tracts; at zoom level 0, that was too many to render fast enough.

Instead, I rendered and saved the geometry into “skeletons”, with one set of skeletons for each jurisdiction type (e.g. census tract, state/province, country, county). Instead of the final colour, I filled the polygons in the skeleton with an ID for the particular jurisdiction corresponding to that polygon. When someone asked for a map showing a particular attribute (e.g. population) and colour mapping, the code would retrieve (or generate) the skeleton, look up each element in the colour palette, decode the jurisdictionId, look up the attribute value for that jurisdictionId (e.g. what is the population for Illinois?), use the colour mapping to get the correct RGBA colour, and write that back to the colour palette. When all the colour palette entries had been updated, I gave it to the requesting browser as a PNG.

While I came up with the idea of fiddling the colour palette independently, it is not unique to me. A friend also came up with this idea independently, for example. What I did was take it a bit farther: I modified the gd libraries so they had a 16-bit colour palette in the skeletons which I wrote to disk. When writing out to PNG, however, my framework uses the standard format. I then created a custom version of PHP which statically linked my custom GD libraries.

(Some people have asked why I didn’t contribute my changes back to gd. It’s because the pieces I changed were of almost zero value to anyone else, while very far-reaching. I knew from testing that my changes didn’t break anything that I needed, but GD has many many features, and I couldn’t be sure that I could make changes in such a fundamental piece of the architecture without causing bugs in far-flung places without way more effort than I was willing to invest.)

More than 64K jurisdictions

16 bits of palette works fine if you have fewer than 64K jurisdictions on a tile (which the 2000 US Census Tract count just barely slid under), but not if you have more than 64K jurisdictions. (At least not with the gd libraries, which don’t reuse a colour palette entry if that colour gets completely overwritten in the image.)

You can instead walk through all the pixels in a skeleton, decode the jurisdiction ID from the pixel colour and rewrite that pixel instead of walking the colour palette. (You need to use a true-colour image if you do that, obviously.) For large numbers of colours, changing the colour palette is no faster than walking the skeleton; it is only a win for small numbers of colours. If you are starting from scratch, it is probably not worth the headache of modifying the graphics library and statically linking in a custom version of GD into PHP to walk the colour palette instead of walking the true-colour pixels.

(I had to modify GD anyway due to a bug I fixed in GD which didn’t get incorporated into the GD release for a very long time.  My patch finally got rolled in, so you don’t need to do that yourself.)

My framework now checks to see how many jurisdiction are in the area of interest; if there are more than 64K, it creates a true-colour image, otherwise a paletted image. If the skeleton is true-colour, it walks pixels, otherwise it walks the palette.

Credits: My husband implemented the pixel-walking code.

On-demand skeleton rendering

While I did pre-render about 10-40 tiles per jurisdiction type, I did not render skeletons for the vast majority of tiles. Instead, I render and save a skeleton only when someone asks for it. I saw no sense in rendering ahead of time a high-maginification tile of a rural area. Note that I could only do this on-demand skeleton generation because the VCS speedup made it so fast.

I will also admit that I did generate final tiles (with the colour properly filled in, not a skeleton) to zoom level 8 for some of my most commonly viewed census tract attributes (e.g. population by census tract) with the default value for the colour mapping. I had noticed that people almost never change the colour mapping. I did not need to do this; the performance was acceptable without doing so. It did make things slightly snappier, but mostly it just seemed to me like a waste to repeatedly generate the same tiles. I only did this for US and Australian census jurisdictions.

MySQL vs. PostGIS

One happy sort-of accident is that my ISP, Dreamhost, provides MySQL but does not allow PostGIS. I could have found myself a different ISP, but I was happy with Dreamhost, Dreamhost was cheap, and I didn’t particularly want to change ISPs. This meant that I had to roll my own tools instead of relying upon the more fully-featured PostGiS.

MySQL is kind of crummy for GIS. Its union and intersection operators, for example, use bounding boxes instead of the full polygon. However, if I worked around that, I found that for what I needed to do, MySQL was actually faster (perhaps because it wasn’t paying the overhead of GIS functions that I didn’t need).

PostGIS’ geometries are apparently stored as serialized binary objects, which means that you have to pay the cost of deserializing the object every time you want to look it or one of its constituent elements. I have a table of points, a table of polygons, and a table of jurisdictionIds; I just ask for the points, no deserialization needed. Furthermore, at the time I developed my framework, there weren’t good PHP libraries for deserializing WKB objects, so I would have had to write my own.

Note: not having to deserialize is only a minor win. For most people, the convenience of the PostGIS functions should be worth the small speed penalty.

Database optimization

One thing that I did that was entirely non-sexy was optimizing the MySQL database. Basically, I figured out where there should be indices and put them there. This sped up the code significantly, but it did not take cleverness, just doggedness. There are many other sites which discuss MySQL optimization, so I won’t go into that here.

Future work: Feature-based maps

My framework is optimized for making polygons, but it should be possible to render features very quickly as well. It should be possible to, say, decide to show roads in green, not show mountain elevation, show cities in yellow, and not show city names.

To do this, make a true-colour skeleton where each bit of the pixel’s colour corresponds to the display of a feature. For example, set the least significant bit to 1 if a road is in that pixel. Set the next bit to 1 if there is a city there. Set the next bit to 1 if a city name is displayed there. Set the next bit to 1 if the elevation is 0-500m. Set the next bit to 1 if the elevation is 500m-1000m. Etc. You then have 32 feature elements
which you can turn on and off by adjusting your colour mapping function.

If you need more than 32 feature elements, then you could use two skeletons and merge the images after you do the colour substitutions.

You could also, if you chose, store elevation (or depth) in the pixel, and adjust the colouring of the elevation with your colour mapping function.

Addendum: I meant to mention, but forgot, UTFGrid.  UTFGrid has an array backing the tile for lookup of features, so it is almost there.  However, it is just used to make pop-up windows, not (as near as I can tell) to colour polygons.


Maps of US Religions, take 2

Posted in Maps at 10:07 pm by ducky

I added a few more religious denominations to my elections/demographics site, again from Churches and Church Membership in the United States, 1990.

Note that these denominations have fewer adherents than the denominations I featured in my previous post, so these have full white corresponding to 0%, while full blue is 70% (vs. 100% in the previous post).

Here are adherents to the Unified Methodist Church:

% United Methodist Church Adherents 1990

I hadn’t realized that Methodists were concentrated in the center band of the country like that.

Here are the adherents to the Presbyterian Church (USA):

% Presbyterian Church USA Adherents 1990

I was surprised at how diffuse the Presbyterians are.

Here is the African Methodist Episcopal Zion adherents:

% African Methodist Episcopal Adherents 1990

I was surprised at how concentrated the AMEZ church was — in North Carolina and Alabama.


Maps of religion

Posted in Maps at 12:42 am by ducky

I recently added some data from Churches and Church Membership in the United States, 1990 to my election/demographics map. Collected by the Association of Statisticians of American Religious Bodies (ASARB) and distributed by the Association of Religion Data Archives. (1990).

There is data on about 130 denominations, with number of houses of worship, number of adherents, and number of members for (almost) every one, by county.  Houses of worship were surveyed, not individuals.  “Adherents” is a somewhat looser criterion than “Members”, but the survey allowed the houses of worship to interpret the question as they chose.  The combination of self-reporting and self-interpretation means that you probably shouldn’t pay too much attention to the raw numbers.  In particular, the respondents might well be over-estimating: Joe’s Church might be counting people who went to Joe’s Church only once.  However, I think the relative values across the country are interesting.

Here is the percentage of the population in the Continental US that is an adherent to any denomination (remember, as measured by the houses of worship).  The more blue, the more adherents.

Adherents as a % of population


I was a little surprised at how non-churchgoing the West Coast, Florida, and Maine were.

Here is the % of the population which adheres to the Church of Jesus Christ of Latter-Day Saints (also known as “the Mormons”):

% LDS Adherents - 1990


It isn’t surprising how the concentration of LDS adherents is centered in Utah, but I was surprised at how clearly you can see the Utah state borders.

Here is the percent of the population which adheres to any of the twelve denomination with the word “Lutheran” in the name:

% Lutheran Adherents 1990


Here is a map of the percentage of the population which adheres to a denomination with the name “Lutheran” in the name.  I was surprised at how concentrated the Lutherans were in the upper center of the country.  I had sort of thought that a group which had a “Missouri Synod” would have significant adherents in, you know, Missouri.

Here is a map of the percentage of the population which adheres to the Southern Baptist Convention. Note that there are 25 different denomination with the word “Baptist” in it, this is just the “big one”, the Southern Baptist Convention:

% Southern Baptist Adherents - 1990

I was really surprised at how clear the state boundaries were, especially for Missouri and Kansas.  I guess I kind of knew that the Southern Baptist Convention was sort of the religion of slavery, but I hadn’t realized just how long the geographical connection would persist.  (The Southern Baptist Convention split off from the northern branch in 1845, specifically over slavery.  They did apologize in 1995.)

Here is a map of the percentage of the population which adheres to Roman Catholicism:

% Catholic Adherents 1990

I was amazed at how few Catholics there were in the Deep South.  Aside from Latino influence in southern Texas and, to a lesser extent in Florida, plus the French influence in Louisiana, there are practically no Catholics in the south.  (At least, not in 1990.)  I grew up a few hours south of Chicago, so I rather had the impression that Catholics were ubiquitous.

I have a lot more data, but I’m not really sure what groupings make sense.  For example, do I group “Holy Apostolic Catholic Assyrian Church of the East” in with “Greek Orthodox”?  I have no idea if they have similar doctrines, if they hate each others’ guts, or both.  Similarly, I think it would be useful to group together evangelical churches, but I’m not sure how to tell which churches are properly called “evangelical”.  Stay tuned.

« Previous entries Next Page » Next Page »