04.18.13
Posted in Maps at 1:20 am by ducky
The Huffington Post made a very nice interactive map of homicides and accidental gun deaths since the shooting at Sandy Hook. It has the (very common) problem, however, that it mostly shows where the population density is high: of course you will have more shootings if there are more people.
I wanted to tease out geographical/political effects from population density effects, so I plotted the gun deaths on a population-based cartogram. Here was my first try. (Click on it to get a bigger image.)

Unfortunately, the Huffington Post data gives the same latitude/longitude for every shooting in the same city. This makes it seem like there are fewer deaths in populated areas than there really are. So for my next pass, I did a relatively simple map where the radius of the dots was proportional to the square root of the number of gun deaths (so that the area of the dot would be proportional to the number of gun deaths).

This also isn’t great. Some of the dots are so big that they obscure other dots, and you can’t tell if all the deaths were in one square block or spread out evenly across an entire county.
For the above map, for New York City, I dug through news articles to find the street address of each shooting and geocoded it (i.e. determined the lat/long of that specific address). You can see that the points in New York City (which is the sort of blobby part of New York State at the south) seem more evenly distributed than for e.g. Baltimore. Had I not done that, there would have been one big red dot centered on Manhattan.
(Side note: It was hugely depressing to read article after article about people’s — usually young men’s — lives getting cut short, usually for stupid, stupid reasons.)
I went through and geocoded many of the cities. I still wasn’t satisfied with how it looked: the size balance between the 1-death and the multiple-death circles looked wrong. It turns out that it is really hard (maybe impossible) to get area equivalence for small dots. The basic problem is that radiuses are integers, limited by pixels. In order to get the area proportional to gun deaths, you would want the radius to be proportional to the square root of the number of gun deaths, or {1, 1.414, 1.732, 2.000, 2.236, 2.449, 2.646, 2.828, 3.000}. The rounded numbers will be {1, 1, 2, 2, 2, 2, 3, 3, 3}, so instead of areas of {pi, 2*pi, 3*pi, 4*pi, …}, you get {pi, pi, 4*pi, 4*pi, 4*pi, 4*pi, 9*pi, 9*pi, 9*pi}.
Okay, fine. We can use a trick like anti-aliasing, but for circles: if the square root of the number of gun deaths is between two integer values (e.g. 2.236 is between 2 and 3), draw a solid circle with a radius of the just-smaller integer (for 2.236, use 2), then draw a transparent circle with a radius of the just-higher integer (for 2.236, use 3), with the opacity higher the closer the square root is to the higher number. Sounds great in theory.
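A minimal sketch of that two-circle trick follows. The function name `antiAliasedDot` is hypothetical, and the opacity rule is an assumption: the paragraph above only says the opacity should rise as the square root approaches the larger integer, so this version picks the opacity that makes the total "ink" proportional to the number of deaths.

```javascript
// Hypothetical helper: describe a dot as a solid inner circle plus a
// translucent outer circle, both with integer radii.
function antiAliasedDot(deaths) {
  const radius = Math.sqrt(deaths);        // ideal (fractional) radius
  const solidRadius = Math.floor(radius);  // just-smaller integer
  const haloRadius = Math.ceil(radius);    // just-higher integer

  // Assumption: choose the opacity so that
  //   solid area + opacity * (halo area - solid area) = ideal area.
  // The pi factors cancel, leaving squared radii only.
  const ring = haloRadius * haloRadius - solidRadius * solidRadius;
  const haloOpacity =
    ring === 0 ? 0 : (deaths - solidRadius * solidRadius) / ring;

  return { solidRadius, haloRadius, haloOpacity };
}

console.log(antiAliasedDot(5));
```

For 5 deaths (square root 2.236) this gives a solid circle of radius 2 and a radius-3 circle at 20% opacity.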
In practice, however, things still didn’t look perfect. It turns out that for very small dot sizes, the dots’ approximation to circles is pretty poor. If you actually zoom in and count the pixels, the area in pixels is {5, 13, 37, 57, 89, 118, 165, …} instead of what pi*R^2 would give you, namely {3.1, 12.6, 28.3, 50.3, 78.5, 113.1, 153.9, …}.

But wait, it’s worse: combine the rounding radiuses problem with the problem of approximating circles, and the area in pixels will be {5, 5, 13, 13, 13, 13, 37, 37, 37, …} instead of {3.1, 6.3, 9.4, 12.6, 15.7, 18.8, 22.0, …}, for errors of {59.2%, -20.4%, 37.9%, 3.5%, -17.2%, -31.0%, 68.2%, …}. In other words, the 1-death dot will be much too big, and the other dots lurch between too small and too big. Urk.
Using squares is better. You still have the problem of rounding the radius, but you don’t have the circle approximation problem. So you get areas in pixels of {1, 1, 4, 4, 4, 4, 9, 9, 9, …} instead of {1, 2, 3, 4, 5, 6, 7, 8, 9, …}, for errors of {0.0%, -50.0%, 33.3%, 0.0%, -20.0%, -33.3%, 28.6%, …}, which is better, but still not great.
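The error figures for squares can be reproduced with a few lines of arithmetic. This is only a sketch: `squareErrors` is a hypothetical name, and it assumes the side of the square is the square root of the death count rounded to the nearest whole pixel.

```javascript
// For each death count, round sqrt(deaths) to whole pixels, square it to
// get the drawn area, and compare with the ideal area (deaths pixels).
function squareErrors(maxDeaths) {
  const rows = [];
  for (let deaths = 1; deaths <= maxDeaths; deaths++) {
    const side = Math.round(Math.sqrt(deaths)); // integer side in pixels
    const area = side * side;                   // pixels actually drawn
    const errorPct = (100 * (area - deaths)) / deaths;
    rows.push({ deaths, side, area, errorPct });
  }
  return rows;
}

for (const row of squareErrors(7)) {
  console.log(row.deaths, row.area, row.errorPct.toFixed(1) + "%");
}
```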
Thus: rectangles:

Geocoding provided by GeoCoder.ca
03.14.13
Posted in Maps at 12:11 am by ducky
I imagine some epidemiologist somewhere, who has statistics on something like the measles rate by postal code, who wants to see if there is a geographic trend, like if warmer places have more measles. She has a spreadsheet with the postal codes and the number of cases in that postal code, and wants to turn that into a map where each postal code’s colour represents the number of cases per capita in that postal code.
She should not need to know what a shapefile is, should not need to know that the name of the map type she wants is “choropleth”, and should not have to dig up the population of that postal code. The boundaries of the jurisdictions she cares about (postal codes, in this case) and the population are well-understood and don’t change often; the technology to make such a map ought to be invisible to her. She should be able to upload a spreadsheet and get her map.
I find it almost morally wrong that it is so hard to make a map.
Making that possible would be my dream job. It is a small enough job that I could do it all by myself, but it is a large enough job that it would effectively prevent me from doing other paying work for probably about a year, and I can’t see a way to effectively monetize it.
The challenges are not in creating a map that is displayed onscreen — that’s the easy part. To develop this service would require (in order of difficulty):
- code and resources to enable users to store their data and map configurations securely;
- code to pick out jurisdiction names and data columns from spreadsheets, and/or a good UI to walk the user through picking the columns;
- fuzzy matching code which understands that e.g. “PEI” is really “Prince Edward Island”, a province in Canada; that “St John, LA” is actually “Saint John the Baptist Parish”, a county-equivalent in Louisiana; that there are two St. Louis counties in Missouri; that Nunavut didn’t exist before 1999;
- code to allow users to share their data if they so choose;
- UI (and underlying code) to make the shared data discoverable, usable, and combinable;
- code (and perhaps UI) to keep spammers from abusing the system;
- code to generate hardcopy of a user’s map (e.g. PNG or PDF);
- code for a user account mechanism and UI for signing in.
This service would give value to many people: sales managers trying to figure out how to allocate sales districts, teachers developing lesson plans about migration of ethnic minorities, public health officials trying to understand risk factors, politicians targeting niche voters, urban planning activists trying to understand land use factors, etc.
Unfortunately, for the people to whom this really matters, if they already have money, they already ponied up the money for an ESRI mapping solution. If they don’t have money, then they won’t pay for this service.
GeoCommons tries to do this. GeoCommons makes maps from users’ data, and it stores and shares users’ data, but their map making is so slow it is basically unusable, and it is not easy to combine data from multiple sources into one map.
It might be that one of the “big data” organizations, e.g. Google or Amazon, might provide this as an enticement for getting people to use their other services. Google, for example, has a limited ability to do this kind of thing with their Fusion Tables (although if you want to do jurisdictions other than countries, then you have to provide a shapefile). Amazon provides a lot of data for use with the Amazon Web Services.
However, it would be almost as difficult for Google or Amazon to monetize this service as it would be for me. Google could advertise and Amazon could restrict it to users of its AWS service, but it isn’t clear to me how much money that could bring in.
If anybody does figure out a way to monetize it, or wants to take a gamble on it being possible, please hire me!
03.12.13
Posted in Hacking, Maps at 11:54 am by ducky
In the past, when people asked me how I managed to make map tiles so quickly on my World Wide Webfoot Maps site, I just smiled and said, “Cleverness.” I have decided to publish how I optimized my map tile generation in hopes that others can use these insights to make snappier map services. I give a little background of the problem immediately below; mapping people can skip to the technical details.
Background
Choropleth maps colour jurisdictions based on some attribute of the jurisdiction, like population. They are almost always implemented by overlaying tiles (256×256 pixel PNG images) on some mapping framework (like Google Maps).

Map tile from a choropleth map (showing 2012 US Presidential voting results by county)
Most web sites with choropleth maps limit the user: users can’t change the colours, and frequently are restricted to a small set of zoom levels. Sometimes the maps take a very long time to display. This is because the tiles are so slow to render that the site developers must render the tile images ahead of time and store them. My mapping framework is so fast that I do not need to pre-render all the tiles for each attribute. I can allow the users to interact with the map, going to arbitrary zoom levels and changing the colour mapping.
Similarly, when people draw points on Google Maps, 100 is considered a lot. People have gone to significant lengths to develop several different techniques for clustering markers. By contrast, my code can draw thousands very quickly.
There are 32,038 ZIP codes in my database, and my framework can show a point for each with ease. For example, these tiles were generated at the time this web page loaded them.

32,038 zip codes at zoom level 0 (entire world)

Zip codes of the Southeast US at zoom level 4
(If the images appeared too fast for you to notice, you can watch the generation here and here. If you get excited, you can change size or colour in the URL to convince yourself that the maps framework renders the tile on the fly.)
Technical Details
The quick summary of what I did to optimize the speed of the map tile generation was to pre-calculate the pixel coordinates, pre-render the geometry and add the colours later, and optimize the database. In more detail:
Note that I do NOT use parallelization or fancy hardware. I don’t do this in the cloud with seventy gajillion servers. When I first wrote my code, I was using a shared server on Dreamhost, with a 32-bit processor and operating system. Dreamhost has since upgraded to 64 bits, but I am still using a shared server.
Calculating pixel locations is expensive and frequent
For most mapping applications, buried in the midst of the most commonly-used loop to render points is a very expensive operation: translating from latitude/longitude to pixel coordinates, which almost always means translating to Mercator projection.
While converting from longitude to the x-coordinate in Mercator is computationally inexpensive, to convert from latitude to y-coordinate using the Mercator projection is quite expensive, especially for something which gets executed over and over and over again.
A spherical Mercator translation (which is actually much simpler than the actual projection which Google uses) uses one logarithmic function, one trigonometric function, two multiplications, one addition, and some constants which will probably get optimized away by the compiler:
function lat2y(a) { return 180/Math.PI * Math.log(Math.tan(Math.PI/4+a*(Math.PI/180)/2)); }
(From the OpenStreetMap wiki page on the Mercator projection)
Using Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs by Agner Fog, a tangent can take between 11 and 190 cycles, and a logarithm can take between 10 and 175 cycles on post-Pentium processors. Adds and multiplies are one cycle each, so converting from latitude to y will take between 24 and 368 cycles (not counting latency). The average of those is almost 200 cycles.
And note that you have to do this every single time you do something with a point. Every. Single. Time.
If you use elliptical Mercator instead of spherical Mercator, it is much worse.
Memory is cheap
I avoid this cost by pre-calculating all of the points’ locations in what I call the Vast Coordinate System (VCS for short). The VCS is essentially a pixel space at Google zoom level 23. (The diameter of the Earth is 12,756,200 meters, which makes the equator about 40,075,000 meters around; at zoom level 23, there are 2^23 tiles across, and each tile is 256 or 2^8 pixels wide, so there are 2^31 pixels around the equator. This means that the pixel resolution of this coordinate system is approximately 1.9 cm at the equator, which should be adequate for most mapping applications.)
Because the common mapping frameworks work in powers of two, to get the pixel coordinate (either x or y) at a given zoom level from a VCS coordinate only requires one right-shift (cost: 1 cycle) to adjust for zoom level and one bitwise AND (cost: 1 cycle) to pick off the lowest eight bits. The astute reader will remember that calculating the Mercator latitude takes, for the average post-Pentium processor, around 100 times as many cycles.
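The following sketch illustrates the idea under stated assumptions; it is not the framework's actual code. The function names `latLngToVcs` and `vcsToTilePixel` are hypothetical, and the VCS origin is assumed to sit at the top-left corner of the world like the usual web-map tile scheme. The expensive Mercator maths runs once per point, when the point is stored; every later render needs only a shift and a mask.

```javascript
const VCS_ZOOM = 23;  // zoom level of the pre-computed pixel space
const TILE_BITS = 8;  // tiles are 256 = 2^8 pixels wide
const VCS_SIZE = 2 ** (VCS_ZOOM + TILE_BITS); // 2^31 pixels across the world

// One-time, expensive conversion (done when a point is loaded into the database).
function latLngToVcs(lat, lng) {
  const latRad = (lat * Math.PI) / 180;
  const mercY = Math.log(Math.tan(Math.PI / 4 + latRad / 2)); // range -pi..pi
  const x = Math.floor(((lng + 180) / 360) * VCS_SIZE);
  const y = Math.floor(((1 - mercY / Math.PI) / 2) * VCS_SIZE);
  // Clamp so the values always fit in 31 bits.
  return {
    x: Math.min(Math.max(x, 0), VCS_SIZE - 1),
    y: Math.min(Math.max(y, 0), VCS_SIZE - 1),
  };
}

// Per-render, cheap conversion: one right-shift and one bitwise AND.
function vcsToTilePixel(v, zoom) {
  const worldPixel = v >>> (VCS_ZOOM - zoom); // pixel index across the whole world
  return {
    tile: worldPixel >>> TILE_BITS,           // which tile the point falls in
    pixel: worldPixel & 0xff,                 // offset inside that 256-pixel tile
  };
}

const origin = latLngToVcs(0, 0);
console.log(vcsToTilePixel(origin.x, 0), vcsToTilePixel(origin.x, 1));
```

At zoom level 0 the point at latitude 0, longitude 0 lands in the middle of the single world tile; at zoom level 1 it lands on the left edge of the second tile.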
Designing my framework around VCS and the Mercator does make it harder to change the projection, but Mercator won: it is what Google uses, what Yahoo uses, what Bing uses, and even what the open-source Leaflet uses. If you want to make map tiles to use with the most common services, you use Mercator.
Furthermore, should I decide that I absolutely have to use a different projection, I would only have to add two columns to my points database table and do a bunch of one-time calculations.
DISTINCT: Draw only one ambiguous point
If you have two points which are only 10 kilometers apart, then when you are zoomed way in, you might see two different pixels for those two points, but when you zoom out, at some point, the two points will merge and you will only see one pixel. Setting up my drawing routine to only draw the pixel once when drawing points is a big optimization in some circumstances.
Because converting from a VCS coordinate to a pixel coordinate is so lightweight, it can be done easily by MySQL, and the DISTINCT keyword can be used to only return one record for each distinct pixel coordinate.
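The text above does this de-duplication inside MySQL with DISTINCT. To keep the examples in one language, here is the same idea sketched in JavaScript instead; `distinctPixels` is a hypothetical name, and the input is assumed to be an array of points that already carry VCS coordinates.

```javascript
// Collapse points that land on the same pixel at the requested zoom level,
// so each pixel is drawn only once.
function distinctPixels(vcsPoints, zoom) {
  const shift = 23 - zoom; // the VCS is a pixel space at zoom level 23
  const seen = new Set();
  const pixels = [];
  for (const { x, y } of vcsPoints) {
    const px = x >>> shift;
    const py = y >>> shift;
    const key = px + "," + py;
    if (!seen.has(key)) {
      seen.add(key);
      pixels.push({ px, py });
    }
  }
  return pixels;
}

const points = [
  { x: 1000000, y: 2000000 },
  { x: 1000100, y: 2000050 }, // near the first point
  { x: 900000000, y: 5 },
];
console.log(distinctPixels(points, 0).length, distinctPixels(points, 23).length);
```

Zoomed all the way out, the two nearby points share a pixel and only two pixels are drawn; at zoom level 23 all three survive.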
The DISTINCT keyword is not a big win when drawing polygons, but it is a HUGE, enormous win when drawing points. Drawing points is FAST when I use the DISTINCT keyword, as shown above.
For polygons, you don’t actually want to remove all but one of a given point (as the DISTINCT keyword would do), you want to not draw two successive points that are the same. Doing so is a medium win (shaving about 25% of the time off) for polygon drawing when way zoomed out, but not much of a win when way zoomed in.
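For polygon outlines, the filter described above can be sketched as follows. The name `dropRepeatedVertices` is hypothetical, and the ring is assumed to be an array of pixel coordinates in drawing order. Unlike DISTINCT, it removes a vertex only when it repeats the one immediately before it, so a ring that returns to an earlier pixel keeps its shape.

```javascript
// Remove vertices that repeat the immediately preceding vertex.
function dropRepeatedVertices(ring) {
  const out = [];
  for (const p of ring) {
    const last = out[out.length - 1];
    if (!last || last.px !== p.px || last.py !== p.py) {
      out.push(p);
    }
  }
  return out;
}

const ring = [
  { px: 0, py: 0 },
  { px: 0, py: 0 }, // same pixel as the previous vertex: dropped
  { px: 1, py: 0 },
  { px: 0, py: 0 }, // same pixel as the first vertex, but not successive: kept
  { px: 0, py: 0 }, // dropped
];
console.log(dropRepeatedVertices(ring));
```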
Skeletons: Changing the colour palette
While the VCS speed improvement means that I could render most tiles in real time, I still could not render tiles fast enough for a good user experience when the tiles had very large numbers of points. For example, the 2000 Census has 65,322 census tracts; at zoom level 0, that was too many to render fast enough.
Instead, I rendered and saved the geometry into “skeletons”, with one set of skeletons for each jurisdiction type (e.g. census tract, state/province, country, county). Instead of the final colour, I filled the polygons in the skeleton with an ID for the particular jurisdiction corresponding to that polygon. When someone asked for a map showing a particular attribute (e.g. population) and colour mapping, the code would retrieve (or generate) the skeleton, look up each element in the colour palette, decode the jurisdictionId, look up the attribute value for that jurisdictionId (e.g. what is the population for Illinois?), use the colour mapping to get the correct RGBA colour, and write that back to the colour palette. When all the colour palette entries had been updated, I gave it to the requesting browser as a PNG.
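A rough sketch of the palette pass follows. It is not the framework's PHP/gd code: the name `colourPalette` is hypothetical, the palette is modelled as a plain array in which entry i holds a jurisdiction ID, and the grey fallback for missing data is an assumption. The point it illustrates is that only the palette is rewritten; the pixels, which hold palette indexes, are never touched.

```javascript
// Turn a skeleton palette (jurisdiction IDs) into a display palette (RGBA).
function colourPalette(skeletonPalette, valuesById, colourFor) {
  return skeletonPalette.map((jurisdictionId) => {
    const value = valuesById.get(jurisdictionId);
    // Assumption: jurisdictions with no data are drawn grey.
    return value === undefined ? [128, 128, 128, 255] : colourFor(value);
  });
}

const skeletonPalette = [17, 42, 99];              // hypothetical jurisdiction IDs
const population = new Map([[17, 100], [42, 200]]); // hypothetical attribute values
const redder = (value) => [255, 255 - value, 255 - value, 255];
console.log(colourPalette(skeletonPalette, population, redder));
```

The cost of this pass depends on the number of palette entries rather than the number of pixels in the tile.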
While I came up with the idea of fiddling the colour palette independently, it is not unique to me. My friend also came up with this idea independently, for example. What I did was take it a bit farther: I modified the gd libraries so they had a 16-bit colour palette in the skeletons which I wrote to disk. When writing out to PNG, however, my framework uses the standard format. I then created a custom version of PHP which statically linked my custom PHP libraries.
(Some people have asked why I didn’t contribute my changes back to gd. It’s because the pieces I changed were of almost zero value to anyone else, while very far-reaching. I knew from testing that my changes didn’t break anything that *I* needed, but gd has many many features, and I couldn’t be sure that I could make changes in such a fundamental piece of the architecture without causing bugs in far-flung places without way more effort than I was willing to invest.)
More than 64K jurisdictions
16 bits of palette works fine if you have fewer than 64K jurisdictions on a tile (which the 2000 US Census Tract count just barely slid under), but not if you have more than 64K jurisdictions. (At least not with the gd libraries, which don’t reuse a colour palette entry if that colour gets completely overwritten in the image.)
You can instead walk through all the pixels in a skeleton, decode the jurisdiction ID from the pixel colour and rewrite that pixel instead of walking the colour palette. (You need to use a true-colour image if you do that, obviously.) For large numbers of colours, changing the colour palette is no faster than walking the skeleton; it is only a win for small numbers of colours. If you are starting from scratch, it is probably not worth the headache of modifying the graphics library and statically linking in a custom PHP to walk the colour palette instead of walking the true-colour pixels.
(I had to modify GD anyway due to a bug I fixed in GD which didn’t get incorporated into the GD release for a very long time.)
My framework now checks to see how many jurisdictions are in the area of interest; if there are more than 64K, it creates a true-colour image, otherwise a paletted image. If the skeleton is true-colour, it walks pixels, otherwise it walks the palette.
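The decision and the pixel walk might look something like the sketch below. Everything here is illustrative: the names `skeletonKind` and `recolourPixels` are hypothetical, the image is modelled as a flat RGBA byte array, and packing the jurisdiction ID into the red, green, and blue bytes is an assumed encoding that the text does not specify.

```javascript
const PALETTE_LIMIT = 65536; // 64K entries in a 16-bit palette

// More than 64K jurisdictions: true-colour skeleton; otherwise paletted.
function skeletonKind(jurisdictionCount) {
  return jurisdictionCount > PALETTE_LIMIT ? "truecolour" : "palette";
}

// Walk every pixel of a true-colour skeleton and replace the encoded
// jurisdiction ID with the final display colour.
function recolourPixels(rgba, valuesById, colourFor) {
  const out = new Uint8ClampedArray(rgba.length);
  for (let i = 0; i < rgba.length; i += 4) {
    // Assumed encoding: ID = red * 65536 + green * 256 + blue.
    const id = (rgba[i] << 16) | (rgba[i + 1] << 8) | rgba[i + 2];
    const value = valuesById.get(id);
    // Assumption: pixels with no data become fully transparent.
    const [r, g, b, a] = value === undefined ? [0, 0, 0, 0] : colourFor(value);
    out[i] = r;
    out[i + 1] = g;
    out[i + 2] = b;
    out[i + 3] = a;
  }
  return out;
}

const skeleton = [0, 1, 2, 255, 0, 0, 7, 255]; // two pixels: IDs 258 and 7
const values = new Map([[258, 10]]);
console.log(skeletonKind(70000), recolourPixels(skeleton, values, (v) => [v, 0, 0, 255]));
```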
Credits: My husband implemented the pixel-walking code.
On-demand skeleton rendering
While I did pre-render about 10-40 tiles per jurisdiction type, I did not render skeletons for the vast majority of tiles. Instead, I render and save a skeleton only when someone asks for it. I saw no sense in rendering ahead of time a high-magnification tile of a rural area. Note that I could only do this on-demand skeleton generation because the VCS speedup made it so fast.
I will also admit that I did generate final tiles (with the colour properly filled in, not a skeleton) to zoom level 8 for some of my most commonly viewed census tract attributes (e.g. population by census tract) with the default value for the colour mapping. I had noticed that people almost never change the colour mapping. I did not need to do this; the performance was acceptable without doing so. It did make things slightly snappier, but mostly it just seemed to me like a waste to repeatedly generate the same tiles. I only did this for US and Australian census jurisdictions.
MySQL vs. PostGIS
One happy sort-of accident is that my ISP, Dreamhost, provides MySQL but does not allow PostGIS. I could have found myself a different ISP, but I was happy with Dreamhost, Dreamhost was cheap, and I didn’t particularly want to change ISPs. This meant that I had to roll my own tools instead of relying upon the more fully-featured PostGIS.
MySQL is kind of crummy for GIS. Its union and intersection operators, for example, use bounding boxes instead of the full polygon. However, if I worked around that, I found that for what I needed to do, MySQL was actually faster (perhaps because it wasn’t paying the overhead of GIS functions that I didn’t need).
PostGIS’ geometries are apparently stored as serialized binary objects, which means that you have to pay the cost of deserializing the object every time you want to look at it or one of its constituent elements. I have a table of points, a table of polygons, and a table of jurisdictionIds; I just ask for the points, no deserialization needed. Furthermore, at the time I developed my framework, there weren’t good PHP libraries for deserializing WKB objects, so I would have had to write my own.
Note: not having to deserialize is only a minor win. For most people, the convenience of the PostGIS functions should be worth the small speed penalty.
Database optimization
One thing that I did that was entirely non-sexy was optimizing the MySQL database. Basically, I figured out where there should be indices and put them there. This sped up the code significantly, but it did not take cleverness, just doggedness. There are many other sites which discuss MySQL optimization, so I won’t go into that here.
Future work: Feature-based maps
My framework is optimized for making polygons, but it should be possible to render features very quickly as well. It should be possible to, say, decide to show roads in green, not show mountain elevation, show cities in yellow, and not show city names.
To do this, make a true-colour skeleton where each bit of the pixel’s colour corresponds to the display of a feature. For example, set the least significant bit to 1 if a road is in that pixel. Set the next bit to 1 if there is a city there. Set the next bit to 1 if a city name is displayed there. Set the next bit to 1 if the elevation is 0-500m. Set the next bit to 1 if the elevation is 500m-1000m. Etc. You then have 32 feature elements which you can turn on and off by adjusting your colour mapping function.
If you need more than 32 feature elements, then you could use two skeletons and merge the images after you do the colour substitutions.
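As a sketch of how such a colour mapping function might work: the bit assignments below follow the example in the text, but the names `FEATURE` and `colourForBits` are hypothetical, and the rule that the first matching entry in the style list wins is an assumption. A feature is hidden simply by leaving it out of the style list.

```javascript
// One bit per feature element, starting from the least significant bit.
const FEATURE = {
  ROAD: 1 << 0,
  CITY: 1 << 1,
  CITY_NAME: 1 << 2,
  ELEV_0_500: 1 << 3,
  ELEV_500_1000: 1 << 4,
};

// style is an ordered list of { mask, colour }. Assumption: the first entry
// whose feature bit is set in the pixel decides the colour.
function colourForBits(bits, style) {
  for (const { mask, colour } of style) {
    if (bits & mask) return colour;
  }
  return [0, 0, 0, 0]; // nothing visible here: transparent
}

// Roads in green, cities in yellow; elevation and city names are not shown.
const style = [
  { mask: FEATURE.ROAD, colour: [0, 255, 0, 255] },
  { mask: FEATURE.CITY, colour: [255, 255, 0, 255] },
];

console.log(colourForBits(FEATURE.ROAD | FEATURE.ELEV_0_500, style));
console.log(colourForBits(FEATURE.CITY | FEATURE.CITY_NAME, style));
console.log(colourForBits(FEATURE.ELEV_500_1000, style));
```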
You could also, if you chose, store elevation (or depth) in the pixel, and adjust the colouring of the elevation with your colour mapping function.
Addendum: I meant to mention, but forgot, UTFGrid. UTFGrid has an array backing the tile for lookup of features, so it is almost there. However, it is just used to make pop-up windows, not (as near as I can tell) to colour polygons.
03.26.12
Posted in Maps at 10:07 pm by ducky
I added a few more religious denominations to my elections/demographics site, again from Churches and Church Membership in the United States, 1990.
Note that these denominations have fewer adherents than the denominations I featured in my previous post, so these have full white corresponding to 0%, while full blue is 70% (vs. 100% in the previous post).
Here are adherents to the United Methodist Church:

% United Methodist Church Adherents 1990
I hadn’t realized that Methodists were concentrated in the center band of the country like that.
Here are the adherents to the Presbyterian Church (USA):

% Presbyterian Church USA Adherents 1990
I was surprised at how diffuse the Presbyterians are.
Here are the African Methodist Episcopal Zion adherents:

% African Methodist Episcopal Adherents 1990
I was surprised at how concentrated the AMEZ church was — in North Carolina and Alabama.
03.25.12
Posted in Maps at 12:42 am by ducky
I recently added some data from Churches and Church Membership in the United States, 1990 to my election/demographics map. The data was collected by the Association of Statisticians of American Religious Bodies (ASARB) and is distributed by the Association of Religion Data Archives.
There is data on about 130 denominations, with number of houses of worship, number of adherents, and number of members for (almost) every one, by county. Houses of worship were surveyed, not individuals. “Adherents” is a somewhat looser criterion than “Members”, but the survey allowed the houses of worship to interpret the question as they chose. The combination of self-reporting and self-interpretation means that you probably shouldn’t pay too much attention to the raw numbers. In particular, the respondents might well be over-estimating: Joe’s Church might be counting people who went to Joe’s Church only once. However, I think the relative values across the country are interesting.
Here is the percentage of the population in the Continental US that is an adherent to any denomination (remember, as measured by the houses of worship). The more blue, the more adherents.

Adherents as a % of population
I was a little surprised at how non-churchgoing the West Coast, Florida, and Maine were.
Here is the % of the population which adheres to the Church of Jesus Christ of Latter-Day Saints (also known as “the Mormons”):

% LDS Adherents - 1990
It isn’t surprising how the concentration of LDS adherents is centered in Utah, but I was surprised at how clearly you can see the Utah state borders.
Here is the percent of the population which adheres to any of the twelve denominations with the word “Lutheran” in the name:

% Lutheran Adherents 1990
I was surprised at how concentrated the Lutherans were in the upper center of the country. I had sort of thought that a group which had a “Missouri Synod” would have significant adherents in, you know, Missouri.
Here is a map of the percentage of the population which adheres to the Southern Baptist Convention. Note that there are 25 different denominations with the word “Baptist” in their names; this is just the “big one”:

% Southern Baptist Adherents - 1990
I was really surprised at how clear the state boundaries were, especially for Missouri and Kansas. I guess I kind of knew that the Southern Baptist Convention was sort of the religion of slavery, but I hadn’t realized just how long the geographical connection would persist. (The Southern Baptist Convention split off from the northern branch in 1845, specifically over slavery. They did apologize in 1995.)
Here is a map of the percentage of the population which adheres to Roman Catholicism:

% Catholic Adherents 1990
I was amazed at how few Catholics there were in the Deep South. Aside from Latino influence in southern Texas and, to a lesser extent, in Florida, plus the French influence in Louisiana, there are practically no Catholics in the South. (At least, not in 1990.) I grew up a few hours south of Chicago, so I rather had the impression that Catholics were ubiquitous.
I have a lot more data, but I’m not really sure what groupings make sense. For example, do I group “Holy Apostolic Catholic Assyrian Church of the East” in with “Greek Orthodox”? I have no idea if they have similar doctrines, if they hate each other’s guts, or both. Similarly, I think it would be useful to group together evangelical churches, but I’m not sure how to tell which churches are properly called “evangelical”. Stay tuned.
11.21.09
Posted in Maps at 1:27 am by ducky
My friend Maciek Chudek and I entered two maps into the Mashup Australia contest: Shades of a Sunburnt People and Stimulating a Sunburnt People. The former shows information about the 2006 Australian Census:

Median age
Redder areas have a higher median age; gold areas are younger. (The red maxes out at 45 years old; any area with a median age of 25 or under is full gold.) Grey areas are ones which had so few people that the Australian Bureau of Statistics withheld the data for privacy reasons.
Our other map shows information about the rail, roads, and community infrastructure component of the Australian economic stimulus package:

Australian stimulus program spending
Blue areas are represented by the Australian Labor Party (which controls Parliament), and reddish areas are controlled by other parties. The darker the colour, the more money has been allocated. Dots represent individual projects. As with the Canadian economic stimulus package, we found a systematic bias favouring areas represented by the governing party.
Since these are so similar to my US census map and the Canadian stimulus map, you might think that this was totally straightforward to do. You might be wrong. We did quite a bit of massaging the data to get it out, and Maciek did a lot of analysis of the stimulus information.
10.09.09
Posted in Maps at 11:34 am by ducky
I have two more political layers up on my political/demographic map: US state senators and US state representatives (or assemblymembers, as they are called in some states). Alaska and Hawaii didn’t fit nicely on these images, but you can see them on the political/demographic map.
In the pictures below (and on my site), red is Republican; blue is Democratic. (To those outside the US who are used to red meaning liberal, the US does its colours backwards, sorry.) Some districts elect multiple members; in those cases I average the colour, with exact Democratic/GOP balance being white. In cases where there is a vacancy or a third-party affiliation, the colour is also white.
Here are the state senators:

Continental US State Senator Party Affiliation
Below is the party affiliation of the lower chamber members (which are usually called Representatives, but also sometimes Assemblymembers or Delegates). Note that Nebraska doesn’t have a lower chamber.

Continental US State Lower Chamber Members' Party Affiliation
Most of the party affiliation data came from the excellent Project Vote Smart. What they didn’t have, I gleaned from the appropriate state legislature’s page, Wikipedia, or both.
For comparison, the images below show all the districts in the continental US in random colours:

Continental US State Senate Districts

Continental US State Lower Chamber Districts
There are almost 8000 state and federal legislators in the US for a population of 300M people, or about one legislator per 37,500 people. The number of legislators varies wildly by state, however. New Hampshire currently has 424 state and federal legislators representing a population of 1.3M, or one legislator for every 3066 people. California currently has 176 representing a population of 36M, or one legislator for every 204,000 people.
10.07.09
Posted in Maps at 7:11 pm by ducky
I just added the median household income to a demographics map, and my oh my you see so much more at the census tract level than you do at the county level. (I recommend making it a bit more opaque to help you see better.)
The map makes me think of mosquito bites: cities have a white center (low-income), surrounded by an angry red ring (the wealthy suburbs), with white again out in the rural areas:

In this image, full white is a median household income of $30,000 per year (in 1999 dollars), while full red is $150,000. Grey is for areas that the Census Bureau didn’t report a median income for — presumably because too few people lived there. The data is from the 2000 census.
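A colour mapping like this one can be sketched as a clamped blend between two colours. The name `rampColour` is hypothetical, and linear interpolation is an assumption; the text only gives the two end points.

```javascript
// Blend from one RGB colour to another as value moves from min to max.
// Values outside the range are clamped to the nearest end colour.
function rampColour(value, min, max, from, to) {
  const t = Math.min(1, Math.max(0, (value - min) / (max - min)));
  return from.map((c, i) => Math.round(c + t * (to[i] - c)));
}

const WHITE = [255, 255, 255];
const RED = [255, 0, 0];
console.log(rampColour(30000, 30000, 150000, WHITE, RED));  // full white
console.log(rampColour(90000, 30000, 150000, WHITE, RED));  // halfway
console.log(rampColour(200000, 30000, 150000, WHITE, RED)); // clamped to full red
```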
08.15.09
Posted in Hacking, Maps at 9:54 pm by ducky
It might not look like I have done much with my maps in a while, but I have been doing quite a lot behind the scenes.
Census Tracts
I am thrilled to say that I now have demographic data at the census tract level on my electoral map! Unlike my old demographic maps (e.g. my old racial demographics map), the code is fast enough that I don’t have to cache the overlay images. This means that I can allow people to zoom all the way out if they choose, while before I only let people zoom back to zoom level 5 (so you could only see about 1/4 of the continental US at once).
These speed improvements were not easy, and it’s still not super-fast, but it is acceptable. It takes between 5 and 30 seconds to show a thematic map for 65,323 census tracts. (If you think that is slow, go to Geocommons, the only other site I’ve found to serve similarly complex maps on-the-fly. They take about 40 seconds to show a thematic map for the 3,143 counties.)
A number of people have suggested that I could make things faster by aggregating the data — show things per-state when way zoomed out, then switch to per-county when closer in, then per-census tract when zoomed in even more. I think that sacrifices too much. Take, for example, these two slices of a demographic map of the percent of the population that is black. The %black by county is on the left, the %black by census tract is on the right. The redder an area is, the higher the percentage of black people is.

Percent of population that is black; by counties on left, by census tracts on the right
You’ll notice that the map on the right makes it much clearer just how segregated black communities are outside of the “black belt” in the South. It’s not just that black folks are a significant percentage of the population in only a few Northern counties; they are significantly present in only tiny little parts of those counties. That’s visible even at zoom level 4 (which is the zoom level that my electoral map opens on). Aggregating the data to the state level would be even more misleading.
Flexibility
Something else that you wouldn’t notice is that my site is now more buzzword-compliant! When I started, I hard-coded the information layers that I wanted: what the name of the attribute was in the database (e.g. whitePop), what the English-language description was (e.g. “% White”), what colour mapping to use, and what min/max numeric values to use. I now have all that information in an XML file on the server, and my client code calls to the server to get the information for the various layers with AJAX techniques. It is thus really easy for me to insert a new layer into a map or even to create a new map with different layers on it. (For example, I have dithered about making a map that shows only the unemployment rate by county, for each of the past twelve months.)
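As a rough illustration of what such a layer file could look like, here is a sketch that parses one layer definition with Python's standard library. The element and attribute names (`layers`, `layer`, `attribute`, `description`, `colours`, `min`, `max`) and the `load_layers` helper are all hypothetical; the post only says that the file holds the database attribute name, the description, the colour mapping, and the min/max values.

```python
import xml.etree.ElementTree as ET

# Hypothetical layer file; the real schema is not given in the post.
LAYERS_XML = """
<layers>
  <layer attribute="whitePop" description="% White"
         colours="white-to-red" min="0" max="100"/>
</layers>
"""

def load_layers(xml_text):
    """Return a list of dicts, one per <layer> element."""
    root = ET.fromstring(xml_text.strip())
    layers = []
    for node in root.findall("layer"):
        layers.append({
            "attribute": node.get("attribute"),      # database column name
            "description": node.get("description"),  # label shown to users
            "colours": node.get("colours"),          # name of a colour mapping
            "min": float(node.get("min")),
            "max": float(node.get("max")),
        })
    return layers
```

With a file like this, adding a layer means adding one more `<layer>` element rather than changing client code.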
Some time ago, I also added the ability for me to specify how to calculate a new number from two different attributes. Before, if I wanted to plot something like %white, I had to add a column to the database of (white population / total population) and map that. Instead, I added the ability to do divisions on-the-fly. Subtracting two attributes was also obviously useful for things like the difference in unemployment from year to year. While I don’t ever add two attributes together yet, I can see that I might want to, for example to show the percentage of people who are either Evangelical or Mormon. (If you come up with an idea for how multiplying two attributes might be useful, please let me know.)
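A minimal sketch of computing a value from two attributes on the fly, rather than storing a precomputed column. The function name `derived_value`, the operation names, and the `totalPop` attribute are assumptions made for illustration; only `whitePop` appears in the post.

```python
import operator

# Supported ways of combining two attributes (names are illustrative).
OPERATIONS = {
    "divide": operator.truediv,
    "subtract": operator.sub,
    "add": operator.add,
}

def derived_value(row, op, first, second):
    """Combine two attributes of one jurisdiction's row of data.

    Returns None when either attribute is missing or a division by zero
    would occur, so the caller can draw that area as "no data".
    """
    a, b = row.get(first), row.get(second)
    if a is None or b is None:
        return None
    if op == "divide" and b == 0:
        return None
    return OPERATIONS[op](a, b)
```

For example, `derived_value({"whitePop": 750, "totalPop": 1000}, "divide", "whitePop", "totalPop")` gives the fraction of the population that is white without needing a separate database column.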
Loading Data
Something else that isn’t obvious is that I have developed some tools to make it much easier for me to load attribute data. I now use a data definition file to spell out the mapping between fields in an input data file and where the data should go in the database. This makes it much faster for me to add data.
The process still isn’t completely turnkey, alas, because there are a million-six different oddnesses in the data. Here are some of the issues I’ve faced that make the data less than straightforward:
- Sometimes the data is ambiguous. For example, there are a number of states that have two jurisdictions with the same name. In Virginia, the census records Bedford City and Bedford County separately. Both are frequently just named “Bedford” in databases, so I have to go through by hand and figure out which Bedford it is and assign the right code to it. (And sometimes when the code is assigned, it is wrong.)
- Electoral results are reported by county everywhere except Alaska, where they are reported by state House district. That meant that I had to copy the county shapes to a US federal electoral districts database, then delete all the Alaskan polygons, load up the state House district polygons, and copy those to the US federal electoral districts database.
- I spent some time trying to reverse-engineer the (undocumented) Census Bureau site so that I could automate downloading Census Bureau data. No luck so far. (If you can help, please let me know!) This means that I have to go through an annoyingly manual process to download census tract attributes.
- Federal congressional districts have names like “CA-32” and “IL-7”, and the databases reflect that. I thought I’d just use the state jurisdiction ID (the FIPS code, for mapping geeks) for two digits and two digits for the district ID, so CA-32 would turn into 0632 and IL-7 would turn into 1707. Unfortunately, if a state has a small enough population, it only gets one congressional rep; the data file had entries like “AK-At large”, which not only messed up my parsing but also raised the question of whether at-large congresspeople should be district 0 or district 1. I scratched my head and decided to assign 0 to at-large districts. (So AK-At large became 0200.) Well, I found out later that data files seem to assign at-large districts the number 1, so I had to redo it.
None of these data issues are hard problems; they are just annoying, and they mean that I have to do some hand-tweaking of the process for almost every new jurisdiction type or attribute. It also takes time just to load the data up to my database server.
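The congressional district numbering described in the list above can be sketched as follows. The function name `district_id` and the `at_large_number` parameter are illustrative assumptions, and the FIPS table is only the three-state subset needed for the examples from the text.

```python
# Two-digit state FIPS codes (subset covering the examples in the text).
STATE_FIPS = {"AK": "02", "CA": "06", "IL": "17"}

def district_id(name, at_large_number=1):
    """Turn a name like 'CA-32' or 'AK-At large' into a four-digit ID.

    The first two digits are the state FIPS code; the last two are the
    district number. At-large districts get `at_large_number`, which
    defaults to 1 to match what other data files appear to use.
    """
    state, _, district = name.partition("-")
    if district.strip().lower() == "at large":
        number = at_large_number
    else:
        number = int(district)
    return f"{STATE_FIPS[state]}{number:02d}"
```

Keeping the at-large number as a parameter means that switching conventions (0 versus 1) is a one-argument change rather than a reload of hand-edited IDs.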
I am really excited to get the on-the-fly census tract maps working. I’ve been wanting it for about three years, and working on it off and on (mostly off) for about six months. It really closes a chapter for me.
Now there is one more quickie mapping application that I want to do, and then I plan to dive into adding Canadian information. If you know of good Canadian data that I can use freely, please let me know. (And yes, I already know about GeoGratis.)
05.10.09
Posted in Maps, Politics at 6:38 pm by ducky
I added a state legislatures partisanship layer to my election map, and also modified a metric which shows kind of how liberal an area is. For every governor, US senator, or US congressman in a district who is a Democrat, I added one. For every such official who is a Republican, I subtracted one. Now, with the new data, I also add one point for each state legislative chamber that is controlled by Democrats, and subtract one for each that is controlled by Republicans.
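The scoring rule above can be sketched in a few lines. The function name `partisan_score` and the single-letter party codes are assumptions for illustration, as is the choice to count anything other than a Democrat or Republican (an independent, say) as zero, which the post does not address.

```python
def partisan_score(parties):
    """Sum +1 for each Democratic seat and -1 for each Republican seat.

    `parties` holds one party code per seat or chamber: governor, two US
    senators, one US congressman, state senate, and state lower chamber.
    Any code other than "D" or "R" contributes 0 (an assumption).
    """
    score = 0
    for party in parties:
        if party == "D":
            score += 1
        elif party == "R":
            score -= 1
    return score
```

With six seats counted, an all-Democratic area scores +6 and an all-Republican one scores -6.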
This gives me a range of -6 to +6 (governor, two US senators, one US congressman, one state senate, one state lower chamber), which I can show in shades of red to blue:

Some things are not surprising: the northeast is very blue; Idaho and Utah are very red. However, I don’t get Arkansas. I wouldn’t have thought that it would be culturally very different from its neighbours, yet most of the state has the maximum value of +6.
Is this all due to Clinton? Did he build a really strong Democratic Party operation in Arkansas? Or did he throw a bunch of money towards Arkansas, for which they are still grateful?
Can anyone familiar with Arkansas shed any light on this?
UPDATE:
A reader from Arkansas explained that the Arkansas Democratic party is very entrenched and strong, but that the populace is not particularly liberal. Essentially, people who are Democrats in Arkansas would be Republicans just about anywhere else. (This is similar to the Liberal Party in BC, which is the most conservative of the three viable parties in BC. The Liberal Party in BC is much more conservative than the Canadian federal Liberal party.)