10.07.09

Median household income map done

Posted in Maps at 7:11 pm by ducky

I just added the median household income to a demographics map, and my oh my you see so much more at the census tract level than you do at the county level.  (I recommend making it a bit more opaque to help you see better.)

The map makes me think of mosquito bites: cities have a white center (low-income), surrounded by an angry red ring (the wealthy suburbs), with white again out in the rural areas:

medianHouseholdIncome1999

In this image, full white is a median household income of $30,000 per year (in 1999 dollars), while full red is $150,000.  Grey is for areas that the Census Bureau didn’t report a median income for — presumably because too few people lived there.  The data is from the 2000 census.
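A minimal sketch of a colour ramp like the one described above, in Python. This is not the code behind the map: the function name, the linear interpolation, and the exact grey value are assumptions for illustration.

```python
def income_to_rgb(income, lo=30_000, hi=150_000):
    """Map a median household income to an (r, g, b) tuple.

    Full white at `lo`, full red at `hi`, linear in between.
    Tracts with no reported median (None) come back grey.
    """
    if income is None:
        return (128, 128, 128)  # assumed grey for "no data"
    # Fraction of the way from lo to hi, clamped to [0, 1].
    t = (income - lo) / (hi - lo)
    t = max(0.0, min(1.0, t))
    # Red channel stays at 255; green and blue fade out as income rises.
    fade = round(255 * (1 - t))
    return (255, fade, fade)
```

Anything at or below $30,000 comes out white and anything at or above $150,000 comes out full red, so the extremes saturate rather than wrapping.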

08.15.09

Progress! Including census tracts!

Posted in Hacking, Maps at 9:54 pm by ducky

It might not look like I have done much with my maps in a while, but I have been doing quite a lot behind the scenes.

Census Tracts

I am thrilled to say that I now have demographic data at the census tract level on my electoral map!  Unlike my old demographic maps (e.g. my old racial demographics map), the code is fast enough that I don’t have to cache the overlay images.  This means that I can allow people to zoom all the way out if they choose, while before I only let people zoom back to zoom level 5 (so you could only see about 1/4 of the continental US at once).

These speed improvements were not easy, and it’s still not super-fast, but it is acceptable.  It takes between 5 and 30 seconds to show a thematic map for 65,323 census tracts.  (If you think that is slow, go to Geocommons, the only other site I’ve found to serve similarly complex maps on-the-fly.  They take about 40 seconds to show a thematic map for the 3,143 counties.)

A number of people have suggested that I could make things faster by aggregating the data — show things per-state when way zoomed out, then switch to per-county when closer in, then per-census tract when zoomed in even more.  I think that sacrifices too much.  Take, for example, these two slices of a demographic map of the percent of the population that is black.  The %black by county is on the left, the %black by census tract is on the right.  The redder an area is, the higher the percentage of black people is.

Percent of population that is black; by counties on left, by census tracts on the right

You’ll notice that the map on the right makes it much clearer just how segregated black communities are outside of the “black belt” in the South.  It’s not just that black folks are a significant percentage of the population in only a few Northern counties; they are only significantly present in tiny little parts of those counties.  That’s visible even at zoom level 4 (which is the zoom level that my electoral map opens on).  Aggregating the data to the state level would be even more misleading.

Flexibility

Something else that you wouldn’t notice is that my site is now more buzzword-compliant!  When I started, I hard-coded the information layers that I wanted: what the name of the attribute was in the database (e.g. whitePop), what the English-language description was (e.g. “% White”), what colour mapping to use, and what min/max numeric values to use.  I now have all that information in an XML file on the server, and my client code calls to the server to get the information for the various layers with AJAX techniques.  It is thus really easy for me to insert a new layer into a map or even to create a new map with different layers on it.  (For example, I have dithered about making a map that shows only the unemployment rate by county, for each of the past twelve months.)
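To make the idea concrete, here is a sketch of what reading such a layer file could look like, in Python. The post does not show the real XML, so the element name, the attribute names, and the sample values below are all invented for illustration; only the four kinds of information (database attribute, description, colour mapping, min/max) come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical layer file; the real schema is not shown in the post.
LAYERS_XML = """
<layers>
  <layer attribute="whitePop" description="% White"
         colours="white-green" min="0.7" max="1.0"/>
</layers>
"""

def load_layers(xml_text):
    """Return one dict per <layer> element in the (assumed) layer file."""
    layers = []
    for node in ET.fromstring(xml_text.strip()).iter("layer"):
        layers.append({
            "attribute": node.get("attribute"),      # column name in the database
            "description": node.get("description"),  # label shown to the user
            "colours": node.get("colours"),          # which colour mapping to use
            "min": float(node.get("min")),           # value drawn at one end of the ramp
            "max": float(node.get("max")),           # value drawn at the other end
        })
    return layers
```

With the definitions in data rather than code, adding a layer would mean adding one more `<layer>` element instead of touching the client.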

Some time ago, I also added the ability for me to specify how to calculate a new number from two different attributes.  Before, if I wanted to plot something like %white, I had to add a column to the database of (white population / total population) and map that.  Instead, I added the ability to do divisions on-the-fly.  Subtracting two attributes was also obviously useful for things like the difference in unemployment from year to year.  While I don’t ever add two attributes together yet, I can see that I might want to, like to show the percentage of people who are either Evangelical or Mormon.  (If you come up with an idea for how multiplying two attributes might be useful, please let me know.)
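A rough Python sketch of combining two attributes on the fly. The function, the operation names, and the `totalPop` column are assumptions made for the example (only `whitePop` appears above), and returning None for missing data or a zero denominator is one possible choice, not necessarily what the site does.

```python
import operator

# The three operations discussed above; the names are illustrative.
OPS = {
    "divide": operator.truediv,
    "subtract": operator.sub,
    "add": operator.add,
}

def derived_value(row, op, first, second):
    """Combine two attributes of one jurisdiction's row into a new number."""
    a, b = row.get(first), row.get(second)
    if a is None or b is None:
        return None          # missing data: nothing to plot
    if op == "divide" and b == 0:
        return None          # e.g. a tract with no residents
    return OPS[op](a, b)
```

For example, `derived_value({"whitePop": 750, "totalPop": 1000}, "divide", "whitePop", "totalPop")` gives 0.75, with no precomputed %white column needed.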

Loading Data

Something else that isn’t obvious is that I have developed some tools to make it much easier for me to load attribute data.  I now use a data definition file to spell out the mapping between fields in an input data file and where the data should go in the database.  This makes it much faster for me to add data.

The process still isn’t completely turnkey, alas, because there are a million-six different oddnesses in the data.  Here are some of the data issues I’ve faced that make it non-straightforward:

  • Sometimes the data is ambiguous.  For example, a number of states have two jurisdictions with the same name: the census records Bedford City, VA and Bedford County, VA as separate regions.  Both are frequently just named “Bedford” in databases, so I have to go through by hand and figure out which Bedford it is and assign the right code to it.  (And sometimes when the code is assigned, it is wrong.)
  • Electoral results are reported by county everywhere except Alaska, where they are reported by state House district.  That meant that I had to copy the county shapes to a US federal electoral districts database, then delete all the Alaskan polygons, load up the state House district polygons, and copy those to the US federal electoral districts database.
  • I spent some time trying to reverse-engineer the (undocumented) Census Bureau site so that I could automate downloading Census Bureau data.  No luck so far.  (If you can help, please let me know!)  This means that I have to go through an annoyingly manual process to download census tract attributes.
  • Federal congressional districts have names like “CA-32” and “IL-7”, and the databases reflect that.  I thought I’d just use the state jurisdiction ID (the FIPS code, for mapping geeks) for two digits and two digits for the district ID, so CA-32 would turn into 0632 and IL-7 would turn into 1707.  Unfortunately, if a state has a small enough population, they only get one congressional rep; the data file had entries like “AK-At large” which not only messed up my parsing, but raised the question of whether at-large congresspeople should be district 0 or district 1.  I scratched my head and decided to assign 0 to at-large districts.  (So AK-At large became 0200.)  Well, I found out later that data files seem to assign at-large districts the number 1, so I had to redo it.
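The district-ID scheme from the last bullet can be sketched in a few lines of Python. The function name and the tiny FIPS table are illustrative (a real table would cover every state), and the at-large number is a parameter because, as noted above, the convention had to change from 0 to 1.

```python
# State FIPS codes for the three states used as examples in the post.
STATE_FIPS = {"AK": "02", "CA": "06", "IL": "17"}

def district_id(name, at_large_number=1):
    """Turn a name like 'CA-32' or 'AK-At large' into a four-digit ID."""
    state, _, district = name.partition("-")
    fips = STATE_FIPS[state.strip().upper()]
    district = district.strip()
    if district.lower() == "at large":
        number = at_large_number   # 1 matches what other data files seem to use
    else:
        number = int(district)
    # Two digits of state FIPS, then the district zero-padded to two digits.
    return f"{fips}{number:02d}"
```

Here `district_id("CA-32")` gives "0632", `district_id("IL-7")` gives "1707", and `district_id("AK-At large")` gives "0201" (or "0200" with `at_large_number=0`).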

None of these data issues are hard problems; they are just annoying and mean that I have to do some hand-tweaking of the process for almost every new jurisdiction type or attribute.  It also takes time just to load the data up to my database server.

I am really excited to get the on-the-fly census tract maps working.  I’ve been wanting it for about three years, and working on it off and on (mostly off) for about six months.  It really closes a chapter for me.

Now there is one more quickie mapping application that I want to do, and then I plan to dive into adding Canadian information.  If you know of good Canadian data that I can use freely, please let me know.  (And yes, I already know about GeoGratis.)

05.10.09

Arkansas liberalism?

Posted in Maps, Politics at 6:38 pm by ducky

I added a state legislatures partisanship layer to my election map, and also modified a metric which shows kind of how liberal an area is.  For every governor, US senator, or US congressman in a district that is a Democrat, I added one.  For every legislator who is a Republican, I subtracted one.  Now, with the new data, I also add one point for each state legislative chamber that is controlled by Democrats, and subtract one for each that is controlled by Republicans.

This gives me a range of -6 to +6 (governor, two US senators, one US congressman, one state senate, one state lower chamber), which I can show in shades of red to blue:
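As a sketch, the metric amounts to a signed count. The Python below is illustrative only: the function name and the single-letter party codes are made up, and scoring independents as 0 is an assumption the post does not address.

```python
def partisan_score(officeholders, chambers):
    """Score one area from -6 (all Republican) to +6 (all Democratic).

    officeholders: parties of the governor, the two US senators, and the
                   US congressman, as "D" or "R".
    chambers:      controlling party of the state senate and the state
                   lower chamber, as "D" or "R".
    """
    points = {"D": 1, "R": -1}
    # Anything other than "D" or "R" scores 0 (an assumption).
    return sum(points.get(p, 0) for p in list(officeholders) + list(chambers))
```

An area with a Democratic governor, two Democratic senators, a Democratic congressman, and both chambers under Democratic control scores +6.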

demlegislators-2009

Some things are not surprising: the northeast is very blue; Idaho and Utah are very red.  However, I don’t get Arkansas.  I wouldn’t have thought that it would be culturally very different from its neighbours, yet most of the state has the maximum value of +6.

Is this all due to Clinton?  Did he build a really strong Democratic Party operation in Arkansas?  Or did he throw a bunch of money towards Arkansas, for which they are still grateful?

Can anyone familiar with Arkansas shed any light on this?

UPDATE:

A reader from Arkansas explained that the Arkansas Democratic party is very entrenched and strong, but that the populace is not particularly liberal.  Essentially, people who are Democrats in Arkansas would be Republicans just about anywhere else.  (This is similar to the Liberal Party in BC, which is the most conservative of the three viable parties in BC.  The Liberal Party in BC is much more conservative than the Canadian federal Liberal party.)

04.19.09

Unemployment maps

Posted in Maps, Politics at 4:56 pm by ducky

I have been looking at unemployment figures today.  Here’s unemployment rate by county for 2008, from the Bureau of Labor Statistics.  Pure white corresponds to a rate of 2%; pure green corresponds to a rate of 10%.

Unemployment Rate 2008

Note that the unemployment rate is dimensionless, i.e. it’s the number of people looking for work divided by the number of people in the workforce.  The BLS makes its estimates by interviewing thousands of people each month and carefully asking them questions about their employment.  If they have worked at all in the past week (even part-time), that counts as employed.  If they haven’t looked for work in four weeks, they do not count as being in the workforce.  This means that retired people, stay-at-home moms, and people who have given up do not count.  (The BLS has a good explanation of their methods.)
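In code, the definition is a one-liner. This Python sketch uses made-up numbers, and the function name is only for illustration; the point is that people outside the workforce appear in neither the numerator nor the denominator.

```python
def unemployment_rate(unemployed, employed):
    """Unemployment rate in percent: unemployed / (unemployed + employed).

    People who are not in the workforce (retired, not looking for work)
    are not passed in at all, so they do not affect the rate.
    """
    labor_force = unemployed + employed
    if labor_force == 0:
        return None
    return 100.0 * unemployed / labor_force
```

With a hypothetical 50 people looking for work and 950 employed, the rate is 5%, however many retirees live in the county.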

Here is the unemployment rate by county for 2007:

Unemployment Rate 2007

Again, I think the more interesting picture is the difference between the two years; red where unemployment has gone up, blue where unemployment has gone down.  Full red means a change of +5 percentage points or more; full blue means a change of -5 percentage points or more.

2008 unemployment rate minus 2007 unemployment rate

There’s an awful lot of red there, alas.  The unemployment rate fell in 272 counties and rose in 2767 counties.
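The red/blue scale of the difference map can be sketched as a two-sided ramp. This Python version is a guess at the behaviour, not the map's actual code: it assumes that no change is drawn as white and that the colour fades linearly toward full red at +5 points and full blue at -5 points.

```python
def change_to_rgb(delta, full_scale=5.0):
    """Colour for a change in unemployment rate, in percentage points."""
    # Scale to [-1, 1]; changes beyond +/- full_scale saturate.
    t = max(-1.0, min(1.0, delta / full_scale))
    fade = round(255 * (1 - abs(t)))
    if t >= 0:
        return (255, fade, fade)   # unemployment went up: toward red
    return (fade, fade, 255)       # unemployment went down: toward blue
```

A county whose rate rose by 8 points gets the same full red as one that rose by 5, which is what "or more" means in the description of the scale.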

Things worth noting about the above maps:

  • There are a fair number of state boundaries visible in the 2008 minus 2007 map.  For example, Wisconsin is particularly visible, Wyoming is visible to a lesser extent, and there is also a line visible running along the north side of Oklahoma, Arkansas, Alabama, Mississippi, and South Carolina, and a line along the east side of Alabama.  I think this means that state policies actually do matter.
  • West Virginia had lower unemployment in 2008 than 2007.  I presume that was demand for coal relating to the high price of oil in most of 2008.  The Oklahoma drops in unemployment may also be due to the higher value of oil; I don’t know why North Dakota did better.
  • Woods County, OK, which is the bright blue spot near the center of the country in the 2008 minus 2007 map, has a relatively small population: 120 people were unemployed in 2008 versus 261 in 2007.  Oklahoma’s economy has a large fossil fuel component.

It’s also interesting to look at the difference between 2008 and 1998:

2008 unemployment rate minus 1998 unemployment rate

Thoughts:

  • Poor Michigan.
  • In addition to West Virginia, the rural West had higher unemployment in 1998 than in 2008.  I presume this has to do with the very strong market for resources (trees, minerals, coal, etc.) for most of 2008.

I have added the first three maps to my elections map page.   (Note that I have data such that I can put even more maps on the page, but I worry about the UI getting too cluttered.  Thus, if you really want to see some map, let me know.)

Update: A friend of a friend pointed me at Wisconsin Business Climate Statistics.  That page points out that Wisconsin has a very lean government, low taxes, low crime, and a tax exemption for energy used in manufacturing.  Frankly, it sounds like a Republican party platform, even though Wisconsin is a very Democratic-leaning state.

04.13.09

Historical presidential maps

Posted in Maps, Politics at 5:51 pm by ducky

I recently got historical data on presidential election results by county from Robert Vanderbei, for presidential elections 1960-2004.  While it is interesting to look at the raw data, I find it even more interesting to look at the differences between years, like the 2008 vs. 2004 map I commented on already.  This helps separate how people felt about a particular pair of politicians from how liberal/conservative they are in general.  For example, here’s the 1960 (Nixon vs. Kennedy) map, with Democratic counties in blue and Republican counties in red:

dempresidentialmargin-1960

1960 -- Kennedy vs. Nixon

And here’s the 1964 (Johnson vs. Goldwater) map:

1964 -- Johnson vs. Goldwater

1964 Difference

You can see even from the 1964 map that LBJ was not very popular in the South (presumably because of his civil rights work), but the difference map below really hammers it home.  In this map, it is blue if LBJ did better than Kennedy and red if the reverse.  You can see from the difference map that the South really hated LBJ:

1964 results minus 1960 results

Another interesting thing about the 1960/1964 maps is that there is no evidence at all of “the black belt”.  Here is a map of counties which were majority black in 2000, with darker green the stronger their majority:

Majority-black counties (2000)

I have to believe that blacks would have overwhelmingly voted for LBJ — if they were able.  I think this is a pretty vivid demonstration of how thoroughly their voting rights were repressed.

1968 vs. 1960

The 1968 (Humphrey-Nixon) minus 1964 map is basically an inverse of the 1964 minus 1960 map, because the southern antipathy towards Johnson was so strong that it skews everything.  A more interesting map is to compare Humphrey vs. Nixon to Kennedy vs. Nixon:

1968 (Humphrey-Nixon) minus 1960 (Kennedy-Nixon)

Humphrey explicitly called for the Democrats to move away from states’ rights and towards civil rights, and that apparently played well in the upper Midwest and Northeast but not as well in the Southeast or West.  You can also see a faint outline of Minnesota (where Humphrey was from) and a strong outline of Maine (where Muskie, the Democratic VP, was from).  (Maryland, where Nixon’s VP Spiro Agnew was from, is too small to see in this picture.)  You can maaaybe start to see the majority-black counties in some states, but not in Georgia.

There are some blue areas in the above map, but those probably would be red if it weren’t for George Wallace.  Wallace ran as an independent, and did extremely well in southern states.  It is unlikely that he took any votes away from Humphrey, as he was an outspoken proponent of segregation.  While third-party candidates usually struggle to get over 10% of the vote, Wallace won a number of states outright.  Here is a map of counties that he won outright:

Counties won by Wallace in 1968

1972

Nixon was re-elected in a landslide.  Not only was McGovern staunchly anti-war during the Vietnam War, he was criticized for his first choice of running mate (who he fired).  The only obvious counties on this map that voted more for McGovern than for Humphrey were in McGovern’s home state of South Dakota:

1972 (McGovern vs. Nixon) minus 1968 (Humphrey vs. Nixon)

1976

The Carter/Ford minus McGovern/Nixon map looks almost exactly the opposite, as the Watergate scandal destroyed Nixon’s and Ford’s standing.  The South also rallied to Jimmy Carter, the first post-Civil War Southerner to be elected President.

1976 (Carter-Ford) minus 1972 (McGovern-Nixon)

1980

Jimmy Carter had his own troubles: the economy was in dire shape, in large part because the second oil crisis had driven up gas prices.  Carter also was not a strong leader: my memory of the time is that he suffered from what I called “Democrat’s dilemma”: being able to see all sides to all issues and thus unable to take a strong stand.  Ronald Reagan, who exuded a forceful, “can-do” attitude, was more successful than the disgraced Ford almost everywhere:

1980 (Carter vs. Reagan) minus 1976 (Carter vs. Ford)

1984

Reagan got even more popular in large swaths of the country.  Mondale could only manage to erode some of Reagan’s support in spots.

1984 (Mondale vs. Reagan) minus 1980 (Carter vs. Reagan)

Note that many of the blue counties above are areas of high Native American population.  The map below shows counties where more than 30% of the people identify as Native Americans:

Counties with more than 30% Native American

I suspect that Reagan did something to upset Native Americans, but I don’t know what that was.

1988

George H.W. Bush was able to get elected in 1988, but he was pretty uniformly less successful than Reagan.

1988 (Dukakis vs. Bush41) minus 1984 (Mondale vs. Reagan)

1992

Bush continued to do worse in 1992, again pretty much across the whole country, losing to Clinton.  Note that you can see the outline of Arkansas (home of Bill Clinton) clearly and Tennessee (home of Clinton’s VP Al Gore) somewhat.

1992 (Clinton vs. Bush41) minus 1988 (Dukakis vs. Bush41)

Ross Perot made a strong third-party run in 1992.  I’m not sure who he took more votes from.

1992 third-party votes (mostly Perot)

1996

The Republicans made some inroads in 1996 in the West — especially in Bob Dole’s native Kansas (outline visible in the center of the country) — but it wasn’t enough.  Clinton gained support in the upper Midwest, Northeast, Florida, Louisiana, and Southern Texas (which is heavily Latino).

1996 (Clinton vs. Dole) minus 1992 (Clinton vs. Bush41)

2000

Bush43 and Gore had a famously close race, but Bush43 did better than Dole almost everywhere (or Gore did worse than Clinton, depending on how you look at it).

2000 (Gore vs. Bush43) minus 1996 (Clinton vs. Dole)

2004

Bush43 strengthened his lead in the middle and southeast of the country in 2004, but lost support in some Northern and Western places:

2004 (Kerry vs. Bush43) minus 2000 (Gore vs. Bush43)

I’ve written about the 2008 vs. 2004 map already, so I won’t talk about it here.   Instead, I think it is interesting to compare the 2008 election to the 1960 election, to see how the country’s party affiliations have changed:

2008 (Obama vs. McCain) minus 1960 (Kennedy vs. Nixon)

The biggest difference is that the Southeast is much, much more Republican now (except for minority-heavy areas: the Black belt and parts of Florida).

The New England states and the upper Midwest are much more Democratic.  Native Americans voted heavily for Obama.  Most important, perhaps, is that the Pacific coastal areas are much, much more Democratic than they were in 1960.  (Those areas have also experienced a great deal of population growth, so this change is bigger than it looks.)  The only area that seems like it stayed sort of the same is a belt running through Missouri, Illinois, Indiana, and Kansas.

Note: The difference maps aren’t up on my maps page yet, but they hopefully will be soon.

02.09.09

Bush-Kerry, 2004 vs. 2008 layers added to map

Posted in Maps, Politics at 4:11 pm by ducky

I have added the 2004 presidential election results to my politics/demographics map.  Loading that data also let me provide a map of the difference between the 2004 and 2008 election:

2008 v. 2004 presidential election results

The map is red in counties that voted more Republican in 2008 and blue for counties that voted more Democratic.  I think it is interesting that Arizona — McCain’s home state — pops out very clearly, while Illinois does not.

Anthony Robinson did a heroic job pulling out all the 2004 data, but there were a lot of problems with the data.  I presume he had to work with preliminary data, while I could get final data.  I scraped information from AZ, CA, GA, IN, LA, MO, NC, NY, TX, and VA, so I have high confidence in those states.  I want to look a little more closely at Illinois — I need to be in Windows to get at that data, ugh.  Alaska is a pain because it is done by house district and not by county, so I will get that in a little later.

A friend of mine had seen a similar map in the New York Times, and was surprised that Louisiana was so red.  Given the snafus after Katrina, he would have expected it to be strongly anti-Republican.  I thought perhaps the data was just incorrect, but no: I verified the Louisiana data.  My best guess is that the white people outside New Orleans saw a lot of images about people behaving badly in New Orleans in the Katrina aftermath.  Those people were predominantly black, and I can believe that the white people nearby associated bad behaviour with “black people” instead of with “desperate poor people” or “strung-out drug addicts without a fix” or even “criminals”.  They might be more inclined to vote against Obama for being black.

I actually find Oklahoma odder: Oklahoma has a relatively high percentage of Native Americans.  In many places, they voted very strongly for Obama.  Not in Oklahoma, they didn’t.

I also added information on governors’ and senators’ party affiliations because it was really easy to do so.

Update1: Looking at the red zones, I notice that there are very hard edges on the redness at the northern boundaries of Arkansas and Tennessee.  Perhaps it isn’t really that Obama did particularly poorly there, but that Kerry did particularly well, perhaps by association with Arkansan Bill Clinton and Tennessean Al Gore?  It would be interesting to compare Obama’s margins with Jimmy Carter’s, but I don’t think it will be easy to get the 1976 election results by county.

Update2: I did find better data — all the way back to 1960! I blogged about it.

Update3: I found someone in Oklahoma.  He said that Oklahoma has a low educational level, and that while many people claimed Native heritage, many of the people were only a minority Native, and were culturally white.

02.08.09

New map: stimulus spending

Posted in Maps, Politics at 5:23 pm by ducky

I made a new map of where the stimulus spending was forecast to go. I pulled the data from the White House Web site a few days ago, so note that it is for the proposed stimulus bill, not the bill as it passed. Also, it shows per capita amounts, using 2006 population figures from Wikipedia, so there will be a minor error due to the years being different.

I didn’t make a lot of noise about this one because I hope to do another, better one when the numbers are more stable.

Jobs created or saved per capita

02.01.09

Election 2008 / Demographics map

Posted in Maps at 11:46 am by ducky

I made a demo of my new mapping framework, a choropleth map of the 2008 presidential election that you can combine with various demographic layers, zoom in, zoom out, change the colour mapping, and all kinds of good things.

The base layer is a map of the election results by county.  Here it is with all the controls clipped off:

Voting margin layer

Blue counties are ones where more people voted for Obama; red counties voted for McCain.  The intensity of the colour shows how big the spread between the two candidates was.  Counties are white if the race was close.

A liberal friend of mine said with some dismay, “That’s an awful lot of red!”  That’s true, but most of those red counties have very few people in them.   One way to get a sense of how few people live in the center of the country is to overlay ZIP code locations (presumably the centers of the ZIP codes):

Voting layer with ZIP codes overlay

(The ZIP layer is even more interesting in bigger images.  I’m showing cropped, tiny versions of the map; you can select medium and huge sizes too.)

Another way to see the population density is to overlay the population density layer on top of the voting layer.  I set the population density layer to go from full white at zero people per square mile to full green at 400 people per square mile:

Voting layer plus population density overlay

Note that 400 people per square mile is still very, very low compared to urban areas.  (New York City has around 50,000 people per square mile.  Some census tracts in San Francisco have around 200,000 people per square mile; San Quentin has a density of 250,000 people per square mile.)  I recommend playing with the colour mapping to get a sense of how dramatically unevenly the country is populated.

The next thing to look at is where the Obama voters were.  In addition to urban areas, racially diverse areas went for Obama pretty stunningly consistently.  If you lay the percent-White layer on top of the voting layer, it is really striking how often the blue of an Obama county coincides with the white of a low-percent-White county.  In the percent-White overlay in this map, full white corresponds to 70% of the population or less being White, while full green corresponds to 100% White:

Voting layer plus percent-White overlay

You can see that the West Coast and the South are more racially diverse, as well as some odd, isolated islands of blue in the middle of the country — mostly in the Mountain time zone.  What are those odd blocks?

Those isolated patches of high-Obama, low-White correlate with Native American population.  Here is a map of the percent-Native, with full white corresponding to 70% Native population and full green corresponding to 100% Native population.  (I turned off the voting layer in the next picture.)

Percent Native American Layer

Not every place with a high Native population went for Obama.  For example, Oklahoma has a high Native population, but Oklahoma went for McCain pretty reliably.

The sharp-eyed viewer will note a few blue counties in Idaho, Montana, and Wyoming that do not have a high percentage of Native Americans.  These are the urban areas of Butte and Missoula, plus the resort areas of Sun Valley and Jackson Hole.

A prominent feature on the electoral layer is the blue crescent in the South.  That corresponds neatly with high concentrations of African-Americans.  On the demographic layer on this map, full white corresponds to 0% African-American and full green corresponds to 50% African-American:

Voting layer plus percent Black layer

You can see correlation between Latino populations and Obama voters, although it isn’t as strong a correlation as elsewhere.  In the demographic layer of the map below, full white is 0% Latino and full green is 50% Latino:

Voting layer plus percent Latino layer

Finally, there are some pale counties in the Midwest that are surrounded by red counties.  I haven’t looked at each and every one, but I think they are college towns.  Here’s a zoomed-in overlay of median age over the voting layer, with full white corresponding to a median age of 25 and full green to a median age of 45:

Voting layer plus median age layer

The areas that voted for Obama that can’t be explained by population density, racial diversity, or college towns are the upper midwest (e.g. Minnesota, Wisconsin, and Iowa) and the upper Northeast (e.g. Maine and Vermont).   Obama spent a lot of time in Iowa, but I guess the others are just intrinsically liberal.

I also have point-based overlays for locations of Walmarts and health food stores.  During the campaign, Fivethirtyeight.com talked frequently about the Walmart-to-Starbucks ratio, so I wanted to show those.  Unfortunately, although I found a database of Starbucks locations, I wasn’t able to get permission to use it on my maps.  (It’s a pity, because Starbucks locations seemed to correlate very well with Obama votes.)  I bought data of Walmart locations and health food stores (figuring that health food stores would be a squishy liberal kind of thing) but eyeballing the data didn’t really convince me there was a correlation.  But you can go try it out for yourself.

You can also see the locations of all 756 Native / Alaskan / Hawai’ian reservations from the dots layer.

Finally, I have a layer that shows the total votes (Obama + McCain + everyone else) divided by the population.   I don’t know how much to trust this layer, as the demographic data I have is from the 2000 census.  Areas that gained or lost a lot of people in the past eight years will be skewed.  While the demographic data in all the demographic layers will be slightly off, it will be worse for this layer, especially in counties that don’t have a lot of people.  (Loving County, TX, had 79 presidential votes in 2008 but only 67 total population in 2000.)  Rural counties tend to have more land area than urban counties, so visually they will be overrepresented.
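The Loving County numbers show the problem directly. In this small Python sketch (the function name is illustrative), a ratio above 1 is a sign that the population figure is stale, not that more than everyone voted.

```python
def turnout_ratio(total_votes, population):
    """Total presidential votes divided by census population."""
    if not population:
        return None   # no population figure, so nothing to divide by
    return total_votes / population

# Loving County, TX: 79 votes in 2008 against a 2000 census population of 67.
loving = turnout_ratio(79, 67)
suspicious = loving is not None and loving > 1.0   # more votes than residents
```

For Loving County the ratio comes out a little under 1.18, so `suspicious` is True.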

I encourage you to go play with the maps.  In particular, I had to use zoom level 3 in order to get maps that would fit on my blog, but you lose a lot of detail and information.  Also, when you aren’t worrying about running afoul of Google’s copyrights, you can let the map layer peek through to give you more context for what you are seeing.

Enjoy!

11.30.08

Programming persistence

Posted in Hacking, Maps, Too Much Information at 11:21 pm by ducky

Warning: this is a long and geeky post.

From time to time in the past few years, I have mentioned that I was a little puzzled as to why more people didn’t render tiles on-the-fly for Google Maps, as I do in my U.S. Census Bureau/Google Maps mashup.

I have reappraised my attitude.  I have been redoing my mapping framework to make it easier to use.  I have reminded myself of all the hurdles I had to overcome, and discovered a number of annoying new ones.

First pass

I originally wrote my mapping framework in an extreme hurry.  It was a term project, and a month before the end of the term, I realized that it would be good for personal reasons to hand it in a week early.  The code functioned well enough to get me an A+, but I cut a huge number of corners.

Language/libraries/database choice

It was very important to minimize risk, so I wrote the framework in C++.  I would have liked to use a scripting language, but I knew that I would need to use a graphics library and a library to interpret shapefiles.  The only ones I found that looked reasonable were C-based libraries (Frank Warmerdam’s Shapelib library and Thomas Boutell’s gd library).  I knew it was possible to call C libraries from a scripting language using a tool called SWIG, but I hadn’t ever used SWIG and had heard that it was touchy.  Doing it in C++ was guaranteed to be painful, but I knew what the limits of that pain were.  I didn’t know what the limits of pain of using SWIG would be.

Projection

I also had problems figuring out how to convert from latitude/longitude to pixel coordinates in the Google tile space.  At the time (December 2005), I had a hard time simply finding out what the mathematics of the Mercator transformation were.  (It is easier to find Mercator projection information now.)  I was able to figure out something that worked most of the time, but if you zoomed out past a certain level, there would be a consistent error in the y-coordinates.  The more you zoomed out, the bigger the error.  I’m pretty sure it’s some sort of rounding error.  I looked at it several times, trying to figure out where I could possibly have a roundoff error, but never did figure it out.  I just restricted how far people could zoom out.  (It also took a very long time to render tiles if you were way zoomed out, so it seemed reasonable to restrict it.)

Polygon intersection

I remember that I spent quite a lot of time on my polygon intersection code. I believe that I looked around the Web and didn’t find any helpful code, so developed it from scratch on little sleep. (Remember, I was doing this in a frantic, frantic hurry.) I ended up with eight comparisons that needed to be done for each polygon in the database for every tile. More on this later.

Rendering bug

The version I handed in had a bug where horizontal lines would show up at the bottom of tiles frequently, as you can see in the bottom left tile:

I assumed that the bug was my fault, as gd is a very mature and well-used graphics library.  My old office partner Carlos Pero had used it way back in 1994 to develop Carlos’ Coloring Book.

After I handed in my project, I spent quite a lot of time going through my code trying to figure out where the problem was with no luck.  Frustrated, I downloaded and built gd so that I could put breakpoints into the gd code.  Much to my surprise, I discovered that the bug was in the gd library!  I thus had to study and understand the gd code, fix it, report the bug (and patch), and of course blog about it so that other people wouldn’t have the same problem.

Pointing my code to the fixed gd

Then, in order to actually get the fix, I had to figure out how to statically link gd into my binaries. I like my ISP (Dreamhost) and wasn’t particularly interested in changing, but that meant I couldn’t use the system-installed gd libraries.  Statically linking wasn’t a huge deal, but it took me at least several hours to figure out which flag to insert where in my makefile to get it to build statically.  It was just one more thing.

Second pass

I have graduated, but haven’t found a job yet, so I decided to revamp my mapping framework. In addition to the aesthetic joy of making nice clean code:

  • It would be an opportunity to learn and demonstrate competence in another technology.
  • I had ideas for how I could improve the performance by pre-computing some things.
  • With a more flexible framework, I would be able to do some cool new mashups that I figured would get me more exposure, and hence lead to some consulting jobs.

Language/libraries/database choice

Vancouver is a PHP town, so I thought I’d give PHP a shot. I expected that I might have to rewrite my code in C++ eventually, but that I could get the basics of my improved algorithms shaken out first.  (I’m not done yet, but so far, I have been very very pleased with that strategy.)

I also decided to use MySQL.  While the feeling in the GIS community is that Postgres’ GIS extensions (PostGIS) are better than the GIS extensions to MySQL, I can’t run Postgres on my ISP, and MySQL is used more than Postgres.

I had installed PHP4 and MySQL 4 on my home computer some time ago, when I was working on Mapeteria.  However, I recently upgraded my home Ubuntu installation to Hardy Heron, and PHP4 was no longer supported.  That meant I needed to install a variety of packages, and I went through a process of downloading, trying, discovering I was missing a package, downloading/installing, discovering I was missing a package, lather, rinse, repeat.  I needed to install mysql-server-5.0, mysql-client-5.0, php5, php5-mcrypt, php5-cli, php5-gd, libgd2-xpm-dev, php5-mysql, and php5-curl.  I also spent some time trying to figure out why php5 wouldn’t run scripts that were in my cgi-bin directory before discovering that with mod_php, it was supposed to run from everywhere but the cgi-bin directory.

Note that I could have done all my development on my ISP’s machines, but that seemed clunky.  I knew I’d want to be able to develop offline at some point, so wanted to get it done sooner rather than later.  It’s also a little faster to develop on my local system.

I did a little bit of looking around for a graphics library, but stopped when I found that PHP had hooks to the gd library.  I knew that if gd had not yet incorporated my horizontal lines bug fix, then I might have to drop back to C++ in order to link in “my” gd, but I figured I could worry about that later.

Projection

I made a conscious decision to write my Mercator conversion code from scratch, without looking at my C++ code.  I did this because I didn’t want to be influenced in a way that might lead me to get the same error at way-zoomed-out that I did before.  I was able to find equations on the Wikipedia Mercator page for transforming Mercator coordinates to X-Y coordinates, but those equations didn’t give a scale for the X-Y coordinates!  It took some trial and error to sort that out.
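For anyone else hunting for the missing scale, here is a minimal sketch of the conversion in Python (not my PHP code). It assumes the convention that is commonly documented today: 256-pixel tiles, with zoom level 0 showing the whole world in a single tile and each further level doubling the width. The function names are made up for illustration.

```python
import math

TILE_SIZE = 256  # assumed tile edge length in pixels


def latlon_to_world_pixel(lat_deg, lon_deg, zoom):
    """Return (x, y) in pixels, measured from the top-left corner of the whole map.

    Only meaningful for latitudes within about +/-85.05 degrees.
    """
    world = TILE_SIZE * (2 ** zoom)  # the scale: width of the whole map in pixels
    x = (lon_deg + 180.0) / 360.0 * world
    lat = math.radians(lat_deg)
    # Mercator y in radians; runs from about -pi to +pi over the visible map.
    merc_y = math.log(math.tan(lat) + 1.0 / math.cos(lat))
    # Flip the sign because pixel y grows toward the south.
    y = (1.0 - merc_y / math.pi) / 2.0 * world
    return x, y


def world_pixel_to_tile(x, y):
    """Split whole-map pixel coordinates into (tile x, tile y) and (pixel x, pixel y)."""
    tile_x, pixel_x = divmod(math.floor(x), TILE_SIZE)
    tile_y, pixel_y = divmod(math.floor(y), TILE_SIZE)
    return (tile_x, tile_y), (pixel_x, pixel_y)
```

Under these assumptions, latitude 0 and longitude 0 land at the exact centre of the map at every zoom level, which makes a handy first sanity check.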

Data

For the initial development, I decided to use country boundaries instead of census tract boundaries. The code wouldn’t care which data it was using, and it would be nice to have tiles that would render faster when way-zoomed-out. I whipped up a script to read a KML file with country boundaries (that I got from Valery Hronusov and used in my Mapeteria project) and load it into MySQL.  Unfortunately, I had real problems with precision.  I don’t remember whether it was PHP or MySQL, but I kept losing some precision in the latitude and longitude when I read and uploaded it.  I eventually converted to uploading integers that were 1,000,000 times the latitude and longitude, and so had no rounding difficulties.
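The integer trick looks roughly like this. This is a Python sketch rather than my PHP, the helper names are hypothetical, and parsing the text with Decimal (so the value never passes through a float) is my choice for the illustration:

```python
from decimal import Decimal, ROUND_HALF_UP

SCALE = 1_000_000  # store millionths of a degree


def to_microdegrees(text):
    """Convert a coordinate string such as '49.000001' to an integer."""
    scaled = Decimal(text) * SCALE
    return int(scaled.quantize(Decimal(1), rounding=ROUND_HALF_UP))


def to_degrees(microdegrees):
    """Convert the stored integer back to floating-point degrees for drawing."""
    return microdegrees / SCALE
```

Six decimal places of a degree is roughly a tenth of a metre at the equator, which is far finer than a country boundary needs.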

One thing that helped me enormously when working on the projection algorithm was to gather actual data via Google.  I found a number of places on the Google maps where three territories (e.g. British Columbia, Alberta, and Montana) came together.  I would determine the latitude/longitude of those points, then figure out what the tile coordinates, pixel X, and pixel Y of that point were for various zoom levels.  That let me assemble high-quality test cases, which were absolutely essential in figuring out what the transformation algorithm should be, but it was very slow, boring, and tedious to collect that data.

Polygon intersection

When it came time to implement my polygon bounding box intersection code again, I looked at my old polygon intersection code, saw that it took eight comparisons, and thought to myself, “That can’t be right!”  Indeed, it took me very little time to come up with a version with only four comparisons (and I was now able to find sources on the Web that describe that algorithm).
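A four-comparison bounding box test can be sketched as below. This is the standard overlap check for axis-aligned boxes, written in Python for illustration; the names are hypothetical, and it makes no attempt to handle boxes that wrap across the 180° meridian.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def boxes_intersect(a, b):
    """True if the two boxes overlap or touch.

    Two boxes miss each other only if one lies entirely to the left of,
    right of, above, or below the other, so four comparisons are enough.
    """
    return (a.min_x <= b.max_x and b.min_x <= a.max_x and
            a.min_y <= b.max_y and b.min_y <= a.max_y)
```

For each tile, only the polygons whose bounding box passes this test need to be drawn at all.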

Stored procedures

One thing that I saw regularly in employment ads was a request for experience with stored procedures, which became available with MySQL 5.  It seemed reasonable that using a stored procedure to calculate the bounding box intersection would be even faster, so I ran some timing tests.  In one, I used PHP to generate a complex WHERE clause string from eight values; in the other, I passed eight values to a stored procedure and used that in the WHERE clause.  Much to my surprise, it took almost 20 times more time to use the stored procedure!  I think I understand why, but it was interesting to discover that it was not always faster.

GIS extensions

My beloved husband had been harping on me to use the built-in GIS extensions.  I had been ignoring him because a large part of the point of this exercise was to learn more about MySQL, including stored procedures, but now that I found that the stored procedure was slow, it was time to time the built-in bounding box intersection routine.  If I stored the bounding box as a POLYGON type instead of as two coordinate pairs, then calculating the intersection took half the time.  Woot!

Rendering

I discovered that despite my having reported the horizontal lines bug fifteen months ago, the gd team hasn’t done anything with it yet.  Needless to say, this means that the version of libgd.a on Dreamhost has the bug in it. I thought about porting back to C++. I figured that porting back would probably take at minimum a week, and would raise the possibility of nasty pointer bugs, so it was worth spending a few days trying to get PHP to use my version of gd.

It is possible to compile your own version of PHP and use it, though it means using the CGI version of PHP instead of mod_php. I looked around for information on how to do that, and found a Dreamhost page on how to do so, but failed utterly when I followed the directions. I almost gave up at that point, but sent a detailed message to Dreamhost customer support explaining what I was trying to do, why, and what was blocking me. On US Thanksgiving Day, I got a very thoughtful response back from Robert at Dreamhost customer support which pointed me at a different how-to-compile-PHP-on-Dreamhost page that ultimately proved successful.  (This is part of why I like Dreamhost and don’t really want to change ISPs.)

Compiling unfamiliar packages can be a real pain, and this was no different.  The Dreamhost page (on their user-generated wiki) had a few scripts that would do the install/build for me, but they weren’t the whole story.  Each of the scripts downloaded a number of projects (like openSSL, IMAP, CURL, etc) in compressed form, extracted the files, and built them.  The scripts were somewhat fragile — they would just break if something didn’t work right.  They were sometimes opaque — they didn’t always print an error message if something broke.  If there was a problem, they started over from the beginning, removing everything that had been downloaded and extracted.  Something small — like if the mirror site for mcrypt was so busy that the download timed out — would mean starting from scratch.  (I ended up iteratively commenting out large swaths of the scripts so that I wouldn’t have to redo work.)

There was some problem with the IMAP build having to do with SSL.  I finally changed one of the flags so that IMAP built without SSL — figuring that I was unlikely to be using this instance of PHP to do IMAP, let alone IMAP with SSL — but it took several false starts, each taking quite a long time to go through.

Finally, once I got it to build without my custom gd, I tried folding in my gd.  I uploaded my gd/.libs directory, but that wasn’t enough — it wanted the gd.h file.  I suppose I could have tried to figure out what it wanted, where it wanted it, but I figured it would be faster to just re-build gd on my Dreamhost account, then do a make install to some local directory. Uploading my source was fast and the build was slow but straightforward. However, I couldn’t figure out how to specify where the install should go. The makefiles were all autogenerated and very difficult to follow. I tried to figure out where in configure the install directory got set, but that too was hard to decipher. Finally, I just hand-edited the default installation directory. So there. That worked. PHP built!

Unfortunately, it wouldn’t run. It turned out that the installation script had a bug in it:

cp ${INSTALLDIR}/bin/php ${HOME}/${DOMAIN}/cgi-bin/php.cgi

instead of

cp ${INSTALLDIR}/bin/php.cgi ${HOME}/${DOMAIN}/cgi-bin/php.cgi

But finally, after all that, success!

Bottom line

So let me review what it took to get to tile rendering on the second pass:

  1. Choose a database and figure out how to extract data from it, requiring reading and learning.
  2. Find and load boundary information into the database, requiring trial and error.
  3. Choose a graphics library and figure out how to draw coloured polygons with it, requiring reading and learning.
  4. Gather test cases for converting from latitude/longitude into Google coordinate system, requiring patience.  A lot of patience.
  5. Figure out how to translate from latitude/longitude pairs into the Google coordinate system, requiring algorithmic skills.
  6. Diagnose and fix a bug in a large-ish C graphics library, requiring skill debugging in C.
  7. Download and install PHP and MySQL, requiring system administration skills.
  8. Figure out how to build a custom PHP, requiring understanding of bash scripts and makefiles.

So now, I guess it isn’t that easy to generate tiles!

Note: there is an entirely different ecosystem for generating tiles, one that comes from the mainline GIS world, one that descends from the ESRI ecosystem. I expect that I could have used PostGIS and GeoTools with uDig; they look like fine tools, but they are complex tools with many, many features.  Had I gone that route, I would have had to wade through a lot of documentation of features I didn’t care about.  (I also would have had to figure out which ISP to move to in order to get Postgres.)  I think that it would have taken me long enough to learn and install that ecosystem’s tools that it wouldn’t have been worth it for the relatively simple things that I needed to do.  Your mileage may vary.

08.12.07

robobait: gd library bug – horizontal lines

Posted in Hacking, Maps, robobait at 11:12 pm by ducky

I had had a problem with horizontal lines in my census maps mashups for a long time.  Note the line at the bottom left.

Horizontal line bug

I was sure it was a bug in my code because the graphics library that I used, gd, was extremely mature and heavily-used.  (Way back in 1994 or 1995, my then-office partner Carlos Pero was using gd for Carlos’ Coloring Book!)

It turned out to be a bug in the gd polygon fill code. The fix turned out to be a very small number of lines of code, so if you are having problems with horizontal lines occasionally appearing in your images, look there.

More keywords: libgd, gdlib gd lib, flat, fill polygon, colored polygon, horizontal line, missing line, extra line, flat line, sideways line
