11.30.08

Programming persistence

Posted in Hacking, Maps, Too Much Information at 11:21 pm by ducky

Warning: this is a long and geeky post.

From time to time in the past few years, I have mentioned that I was a little puzzled as to why more people didn’t render tiles on-the-fly for Google Maps, as I do in my U.S. Census Bureau/Google Maps mashup.

I have reappraised my attitude. I have been redoing my mapping framework to make it easier to use. I have reminded myself of all the hurdles I had to overcome, and discovered a number of annoying new ones.

First pass

I originally wrote my mapping framework in an extreme hurry. It was a term project, and a month before the end of the term, I realized that it would be good for personal reasons to hand it in a week early. The code functioned well enough to get me an A+, but I cut a huge number of corners.

Language/libraries/database choice

It was very important to minimize risk, so I wrote the framework in C++. I would have liked to use a scripting language, but I knew that I would need to use a graphics library and a library to interpret shapefiles. The only ones I found that looked reasonable were C-based libraries (Frank Warmerdam’s Shapelib library and Thomas Boutell’s gd library). I knew it was possible to call C libraries from a scripting language using a tool called SWIG, but I hadn’t ever used SWIG and had heard that it was touchy. Doing it in C++ was guaranteed to be painful, but I knew what the limits of that pain were. I didn’t know what the limits of pain of using SWIG would be.

Projection

I also had problems figuring out how to convert from latitude/longitude to pixel coordinates in the Google tile space. At the time (December 2005), I had a hard time simply finding out what the mathematics of the Mercator transformation were. (It is easier to find Mercator projection information now.) I was able to figure out something that worked most of the time, but if you zoomed out past a certain level, there would be a consistent error in the y-coordinates. The more you zoomed out, the bigger the error. I’m pretty sure it’s some sort of rounding error. I looked at it several times, trying to figure out where I could possibly have a roundoff error, but never did figure it out. I just restricted how far people could zoom out. (It also took a very long time to render tiles if you were way zoomed out, so it seemed reasonable to restrict it.)

Polygon intersection

I remember that I spent quite a lot of time on my polygon intersection code. I believe that I looked around the Web and didn’t find any helpful code, so developed it from scratch on little sleep. (Remember, I was doing this in a frantic, frantic hurry.) I ended up with eight comparisons that needed to be done for each polygon in the database for every tile. More on this later.

Rendering bug

The version I handed in had a bug where horizontal lines would show up at the bottom of tiles frequently, as you can see in the bottom left tile:

It seemed pretty obvious that the bug was my fault, as gd is a very mature and well-used graphics library: my old office partner Carlos Pero had used it way back in 1994 to develop Carlos’ Coloring Book.

After I handed in my project, I spent quite a lot of time going through my code trying to figure out where the problem was with no luck. Frustrated, I downloaded and built gd so that I could put breakpoints into the gd code. Much to my surprise, I discovered that the bug was in the gd library! I thus had to study and understand the gd code, fix it, report the bug (and patch), and of course blog about it so that other people wouldn’t have the same problem.

Pointing my code to the fixed gd

Then, in order to actually get the fix, I had to figure out how to statically link gd into my binaries. I like my ISP (Dreamhost) and wasn’t particularly interested in changing, but that meant I couldn’t use the system-installed gd libraries. Statically linking wasn’t a huge deal, but it took me at least several hours to figure out which flag to insert where in my makefile to get it to build statically. It was just one more thing.

Second pass

I have graduated, but haven’t found a job yet, so I decided to revamp my mapping framework. In addition to the aesthetic joy of making nice clean code:

It would be an opportunity to learn and demonstrate competence in another technology.
I had ideas for how I could improve the performance by pre-computing some things.
With a more flexible framework, I would be able to do some cool new mashups that I figured would get me more exposure, and hence lead to some consulting jobs.

Language/libraries/database choice

Vancouver is a PHP town, so I thought I’d give PHP a shot. I expected that I might have to rewrite my code in C++ eventually, but that I could get the basics of my improved algorithms shaken out first. (I’m not done yet, but so far, I have been very very pleased with that strategy.)

I also decided to use MySQL. While the feeling in the GIS community is that the Postgres‘ GIS extensions (PostGIS) are better than the GIS extensions to MySQL, I can’t run Postgres on my ISP, and MySQL is used more than Postgres.

I had installed PHP4 and MySQL 4 on my home computer some time ago, when I was working on Mapeteria. However, I recently upgraded my home Ubuntu installation to Hardy Heron, and PHP4 was no longer supported. That meant I needed to install a variety of packages, and I went through a process of downloading, trying, discovering I was missing a package, downloading/installing, discovering I was missing a package, lather, rinse, repeat. I needed to install mysql-server-5.0, mysql-client-5.0, php5, php5-mcrypt, php5-cli, php5-gd, libgd2-xpm-dev, php5-mysql, and php5-curl. I also spent some time trying to figure out why php5 wouldn’t run scripts that were in my cgi-bin directory before discovering that with mod_php, it was supposed to run from everywhere but the cgi-bin directory.

Note that I could have done all my development on my ISP’s machines, but that seemed clunky. I knew I’d want to be able to develop offline at some point, so wanted to get it done sooner rather than later. It’s also a little faster to develop on my local system.

I did a little bit of looking around for a graphics library, but stopped when I found that PHP had hooks to the gd library. I knew that if gd had not yet incorporated my horizontal lines bug fix, then I might have to drop back to C++ in order to link in “my” gd, but I figured I could worry about that later.

Projection

I made a conscious decision to write my Mercator conversion code from scratch, without looking at my C++ code. I did this because I didn’t want to be influenced in a way that might lead me to get the same error at way-zoomed-out that I did before. I was able to find equations on the Wikipedia Mercator page for transforming Mercator coordinates to X-Y coordinates, but those equations didn’t give a scale for the X-Y coordinates! It took some trial and error to sort that out.
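As a rough sketch of the conversion discussed above (written in Python rather than the PHP used for the framework, and not the framework's actual code), the missing scale can be pinned down by assuming the commonly documented Google tile layout: the whole world is a square of 256 * 2^zoom pixels, with pixel (0, 0) at the top-left corner. The function name and the `TILE_SIZE` constant are illustrative choices.

```python
import math

TILE_SIZE = 256  # assumed tile edge length in pixels


def latlon_to_pixel(lat_deg, lon_deg, zoom):
    """Return global (x, y) pixel coordinates for a point at a zoom level."""
    world_size = TILE_SIZE * (2 ** zoom)  # width and height of the world in pixels

    # Longitude maps linearly onto x, with -180 degrees at the left edge.
    x = (lon_deg + 180.0) / 360.0 * world_size

    # Mercator y in radians; it runs from -pi to +pi across the square map.
    lat_rad = math.radians(lat_deg)
    merc_y = math.log(math.tan(math.pi / 4.0 + lat_rad / 2.0))

    # Rescale so +pi lands on pixel row 0 (north) and -pi on the bottom row.
    y = (1.0 - merc_y / math.pi) / 2.0 * world_size
    return x, y
```

Under these assumptions the scale factor is world_size / (2 * pi) pixels per Mercator unit, and the equator always lands on the middle row of the world.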

Data

For the initial development, I decided to use country boundaries instead of census tract boundaries. The code wouldn’t care which data it was using, and it would be nice to have tiles that would render faster when way-zoomed-out. I whipped up a script to read a KML file with country boundaries (that I got from Valery Hronusov and used in my Mapeteria project) and load it into MySQL. Unfortunately, I had real problems with precision. I don’t remember whether it was PHP or MySQL, but I kept losing some precision in the latitude and longitude when I read and uploaded it. I eventually converted to uploading integers that were 1,000,000 times the latitude and longitude, and so had no rounding difficulties.
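The scaled-integer trick can be sketched in a few lines of Python (an illustration only, not the loading script itself; the function names are made up for this example). Rounding once on the way in means the database only ever sees exact integers.

```python
SCALE = 1_000_000  # integer units per degree, as described above


def to_scaled_int(degrees):
    """Convert a latitude or longitude in degrees to a scaled integer."""
    return int(round(degrees * SCALE))


def from_scaled_int(value):
    """Convert a scaled integer back to degrees."""
    return value / SCALE
```

One unit is a millionth of a degree, which is roughly a tenth of a metre of latitude, so nothing visible at tile resolution is lost.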

One thing that helped me enormously when working on the projection algorithm was to gather actual data via Google. I found a number of places on the Google maps where three territories (e.g. British Columbia, Alberta, and Montana) came together. I would determine the latitude/longitude of those points, then figure out what the tile coordinates, pixel X, and pixel Y of that point were for various zoom levels. That let me assemble high-quality test cases, which were absolutely essential in figuring out what the transformation algorithm should be, but it was very slow, boring, and tedious to collect that data.
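Each test case of that kind boils down to a tile coordinate plus a pixel offset inside the tile. A minimal sketch of that split, assuming 256-pixel tiles and global pixel coordinates measured from the top-left of the world (the function name is hypothetical):

```python
TILE_SIZE = 256  # assumed tile edge length in pixels


def pixel_to_tile(pixel_x, pixel_y):
    """Split global pixel coordinates into (tile_x, tile_y, offset_x, offset_y)."""
    tile_x, offset_x = divmod(int(pixel_x), TILE_SIZE)
    tile_y, offset_y = divmod(int(pixel_y), TILE_SIZE)
    return tile_x, tile_y, offset_x, offset_y
```

A hand-collected point can then be compared against the computed tile and offset at each zoom level.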

Polygon intersection

When it came time to implement my polygon bounding box intersection code again, I looked at my old code, saw that it took eight comparisons, and thought to myself, “That can’t be right!” Indeed, it took me very little time to come up with a version with only four comparisons (and I was now able to find sources on the Web that describe that algorithm).
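The usual four-comparison test for axis-aligned bounding boxes looks something like the following Python sketch. It is the standard textbook form and may differ in detail from the version in the framework; the `Box` type is an assumption made for the example.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box; field names are illustrative."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def boxes_intersect(a, b):
    """Return True if the two boxes overlap or touch.

    Two boxes miss each other only if one lies entirely to the left of,
    right of, above, or below the other, so four comparisons suffice.
    """
    return (a.min_x <= b.max_x and a.max_x >= b.min_x and
            a.min_y <= b.max_y and a.max_y >= b.min_y)
```

The same test covers partial overlap, one box containing the other, and boxes that only share an edge.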

Stored procedures

One thing that I saw regularly in employment ads was a request for use of stored procedures, which became available with MySQL 5. It seemed reasonable that using a stored procedure to calculate the bounding box intersection would be even faster, so I ran some timing tests. In one, I used PHP to generate a complex WHERE clause string from eight values; in the other, I passed eight values to a stored procedure and used that in the WHERE clause. Much to my surprise, it took almost 20 times more time to use the stored procedure! I think I understand why, but it was interesting to discover that it was not always faster.
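For illustration, generating such a WHERE clause might look like the sketch below. It is in Python rather than PHP, and it makes several assumptions: the post does not say what the eight values were, so this version takes only the four tile bounds, and the column names (`min_lat`, `max_lat`, `min_lng`, `max_lng`) are hypothetical.

```python
def bbox_where_clause(tile_min_lat, tile_min_lng, tile_max_lat, tile_max_lng):
    """Build a WHERE clause matching rows whose bounding box overlaps the tile."""
    # Coerce to int so that only plain numbers are ever spliced into the SQL.
    lo_lat, lo_lng, hi_lat, hi_lng = (
        int(v) for v in (tile_min_lat, tile_min_lng, tile_max_lat, tile_max_lng)
    )
    # Hypothetical column names; each comparison pairs a row bound with a tile bound.
    return (
        "min_lat <= %d AND max_lat >= %d AND min_lng <= %d AND max_lng >= %d"
        % (hi_lat, lo_lat, hi_lng, lo_lng)
    )
```

A clause like this compares plain columns against constants, so the database can evaluate it without calling any routine per row.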

GIS extensions

My beloved husband had been harping on me to use the built-in GIS extensions. I had been ignoring him because a large part of the point of this exercise was to learn more about MySQL, including stored procedures, but now that I found that the stored procedure was slow, it was time to time the built-in bounding box intersection routine. If I stored the bounding box as a POLYGON type instead of as two coordinate pairs, then calculating the intersection took half the time. Woot!
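A sketch of what the query side of that might look like, again in Python for illustration. `MBRIntersects` and `GeomFromText` are MySQL spatial functions (later MySQL versions prefer `ST_GeomFromText`); the table name `regions` and the column name `bbox` are assumptions, not taken from the framework.

```python
def bbox_to_wkt(min_x, min_y, max_x, max_y):
    """Return the box as a closed WKT polygon ring (first corner repeated last)."""
    corners = [
        (min_x, min_y), (max_x, min_y),
        (max_x, max_y), (min_x, max_y),
        (min_x, min_y),
    ]
    return "POLYGON((%s))" % ", ".join("%s %s" % corner for corner in corners)


def mbr_query(min_x, min_y, max_x, max_y):
    """Build a query for rows whose stored box intersects the given box."""
    # "regions" and "bbox" are hypothetical table and column names.
    return (
        "SELECT id FROM regions "
        "WHERE MBRIntersects(bbox, GeomFromText('%s'))"
        % bbox_to_wkt(min_x, min_y, max_x, max_y)
    )
```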

Rendering

I discovered that despite my having reported the horizontal lines bug fifteen months ago, the gd team hasn’t done anything with it yet. Needless to say, this means that the version of libgd.a on Dreamhost has the bug in it. I thought about porting back to C++. I figured that porting back would probably take at minimum a week, and would raise the possibility of nasty pointer bugs, so it was worth spending a few days trying to get PHP to use my version of gd.

It is possible to compile your own version of PHP and use it, though it means using the CGI version of PHP instead of mod_php. I looked around for information on how to do that, and found a Dreamhost page on how to do so, but failed utterly when I followed the directions. I almost gave up at that point, but sent a detailed message to Dreamhost customer support explaining what I was trying to do, why, and what was blocking me. On US Thanksgiving Day, I got a very thoughtful response back from Robert at Dreamhost customer support which pointed me at a different how-to-compile-PHP-on-Dreamhost page that ultimately proved successful. (This is part of why I like Dreamhost and don’t really want to change ISPs.)

Compiling unfamiliar packages can be a real pain, and this was no different. The Dreamhost page (on their user-generated wiki) had a few scripts that would do the install/build for me, but they weren’t the whole story. Each of the scripts downloaded a number of projects (like openSSL, IMAP, CURL, etc) in compressed form, extracted the files, and built them. The scripts were somewhat fragile — they would just break if something didn’t work right. They were sometimes opaque — they didn’t always print an error message if something broke. If there was a problem, they started over from the beginning, removing everything that had been downloaded and extracted. Something small — like if the mirror site for mcrypt was so busy that the download timed out — would mean starting from scratch. (I ended up iteratively commenting out large swaths of the scripts so that I wouldn’t have to redo work.)

There was some problem with the IMAP build having to do with SSL. I finally changed one of the flags so that IMAP built without SSL — figuring that I was unlikely to be using this instance of PHP to do IMAP, let alone IMAP with SSL — but it took several false starts, each taking quite a long time to go through.

Finally, once I got it to build without my custom gd, I tried folding in my gd. I uploaded my gd/.libs directory, but that wasn’t enough — it wanted the gd.h file. I suppose I could have tried to figure out what it wanted, where it wanted it, but I figured it would be faster to just re-build gd on my Dreamhost account, then do a make install to some local directory. Uploading my source was fast and the build was slow but straightforward. However, I couldn’t figure out how to specify where the install should go. The makefiles were all autogenerated and very difficult to follow. I tried to figure out where in configure the install directory got set, but that too was hard to decipher. Finally, I just hand-edited the default installation directory. So there. That worked. PHP built!

Unfortunately, it wouldn’t run. It turned out that the installation script had a bug in it:

cp ${INSTALLDIR}/bin/php ${HOME}/${DOMAIN}/cgi-bin/php.cgi

instead of

cp ${INSTALLDIR}/bin/php.cgi ${HOME}/${DOMAIN}/cgi-bin/php.cgi

But finally, after all that, success!

Bottom line

So let me review what it took to get to tile rendering on the second pass:

Choose a database and figure out how to extract data from it, requiring reading and learning.
Find and load boundary information into the database, requiring trial and error.
Choose a graphics library and figure out how to draw coloured polygons with it, requiring reading and learning.
Gather test cases for converting from latitude/longitude into Google coordinate system, requiring patience. A lot of patience.
Figure out how to translate from latitude/longitude pairs into the Google coordinate system, requiring algorithmic skills.
Diagnose and fix a bug in a large-ish C graphics library, requiring skill debugging in C.
Download and install PHP and MySQL, requiring system administration skills.
Figure out how to build a custom PHP, requiring understanding of bash scripts and makefiles.

So now, I guess it isn’t that easy to generate tiles!

Note: there is an entirely different ecosystem for generating tiles, one that comes from the mainline GIS world, one that descends from the ESRI ecosystem. I expect that I could have used PostGIS and GeoTools with uDig. They look like fine tools, but they are complex tools with many, many features. Had I gone that route, I would have had to wade through a lot of documentation of features I didn’t care about. (I also would have had to figure out which ISP to move to in order to get Postgres.) I think that it would have taken me long enough to learn / install that ecosystem’s tools that it wouldn’t have been worth it for the relatively simple things that I needed to do. Your mileage may vary.

Permalink 1 Comment

08.12.07

robobait: gd library bug – horizontal lines

Posted in Hacking, Maps, robobait at 11:12 pm by ducky

I had had a problem with horizontal lines in my census maps mashups for a long time. Note the line at the bottom left.

Horizontal line bug

I was sure it was a bug in my code because the graphics library that I used, gd, was extremely mature and heavily-used. (Way back in 1994 or 1995, my then-office partner Carlos Pero was using gd for Carlos’ Coloring Book!)

It turned out to be a bug in the gd polygon fill code. The fix turned out to be a very small number of lines of code, so if you are having problems with horizontal lines occasionally appearing in your images, look there.

More keywords: libgd, gdlib gd lib, flat, fill polygon, colored polygon, horizontal line, missing line, extra line, flat line, sideways line

Permalink 2 Comments

census maps mapplets

Posted in Hacking, Maps at 10:53 pm by ducky

James Macgill prodded me to turn my census maps into a mapplet, and so I finally made a census mapplet.

Most of you are probably wondering what a mapplet is. A mapplet is a Google map that has been encapsulated in a way that makes it easy to combine with other mashups. To see them, go to maps.google.com and select the My Maps tab. You’ll see a list of mapplets next to checkboxes.

I’ve been enjoying playing with combining my demographics maps with other mapplets, like

population density + sea level rise
various demographics + real estate listings
% black + Chicago Transit Authority lines

Permalink Comments off

05.26.07

Open-sourcing code

Posted in Hacking, Maps at 5:49 pm by ducky

I just open-sourced the code for Mapeteria. If any of you are PHP4 gods, I have a few questions…

Permalink Comments off

05.19.07

Mapeteria: user-generated thematic maps

Posted in Hacking, Maps at 8:08 pm by ducky

A year ago, while I was in the midst of working on my Census Maps mashup, my Green College colleague Jana came up to me with a question. “I have a table of data about heat pump emissions savings for each province, and I want to make a map that colors each province based on the savings for that province. What program should I use to do that?”

I thought about all the work that I’d done for the Census Maps mashup — learning the Google Maps API, digging up the shape files for census tract boundaries, downloading and learning how to use the shapelib libraries to process the shapefiles, downloading and learning how to use gd, reacquainting myself with C++, reacquainting myself with gdb, debugging, trying to figure out why certain census tracts looked strange, etc, and rendered her an authoritative response: “Use Photoshop”, I said.

I was really dismayed that I had to tell her to use a paint program. Why should she — a geographer — have to learn about vertices and alpha channels and statically loaded libraries? Why wasn’t there some service where she could feed in a spreadsheet file and get back a map?

Well, I finally got tired of waiting for Google to do it, so developed Mapeteria — my own service for users to generate their own thematic maps.

If you give Mapeteria a CSV file (which is one of the formats that any spreadsheet program will be delighted to save as) plus a little more information about how it should be displayed, it will give you back a map. You can either get a KML file (which you can look at in Google Earth) or a Google Maps mashup that shows the map directly in your web browser.
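The core of a service like that is turning each row's value into a fill colour. The following Python sketch is not Mapeteria's code; it assumes a two-column CSV with no header row (region name, numeric value) and a simple white-to-red ramp. The one firm fact it relies on is that KML writes colours as hex in aabbggrr order.

```python
import csv
import io


def value_to_kml_color(value, lo, hi):
    """Map a value in [lo, hi] to an opaque KML colour from white to red."""
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)
    fade = int(round(255 * (1.0 - t)))  # blue and green fade out as t grows
    # KML colour order is alpha, blue, green, red.
    return "ff%02x%02xff" % (fade, fade)


def colors_from_csv(text):
    """Return {region: kml_colour} for CSV text with rows of 'region,value'."""
    rows = [(name, float(value)) for name, value in csv.reader(io.StringIO(text))]
    values = [v for _, v in rows]
    lo, hi = min(values), max(values)
    return {name: value_to_kml_color(v, lo, hi) for name, v in rows}
```

Each colour would then be attached to the polygon for the matching region when the KML is written out.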

So Jana, here’s your map!

Emissions savings of heat pumps vs. natural gas

Permalink Comments off

01.18.07

Google Maps China

Posted in Maps at 9:02 pm by ducky

One of the valuable services that blogs do is to help publicize things. Well, it always takes me a while to remember/figure out where Google hides their China street maps, so I might as well help the rest of the world remember as well: it is at

http://bendi.google.com

Don’t ask me why you have to go there to find the maps, why you can’t get to them via http://maps.google.com. I don’t know.

(You can see street maps of Hong Kong and satellite imagery of everywhere on http://maps.google.com.)

Permalink 1 Comment

03.18.06

Single Operation Multiple Data

Posted in Hacking, Maps at 5:40 pm by ducky

One of the most venerable types of parallel processing is called SIMD, for Single Instruction Multiple Data. In those types of computers, you would do the exact same thing on many different pieces of data (like add two, multiply by five, etc) at the same time. There are some problems that lend themselves to SIMD processing very well. Unfortunately, there are a huge number of problems that do not lend themselves well to SIMD. It’s rare that you want to process every piece of data exactly the same.

Google has done a really neat thing with their architecture and software tools. They have abstracted things such that it looks to the developer like they have a single operation multiple data machine, where an operation can be something relatively complicated.

For example, to create one of my map tiles, I determine the coordinates of the tile, retrieve information about the geometry, retrieve information about the demographics, and draw the tile. With Google tools, once I have a list of tile coordinates, I could send one group of worker-computers (A group) off to retrieve the geometry information and a second (B group) off to retrieve the demographic information. Another group (the C group) could then draw the tiles. (Each worker in the C group would use data from exactly one A worker and one B worker.)

The A and B tasks are pretty simple, and maybe could be done by an old-style SIMD computer, but C’s job is much too complex to do in a SIMD computer. What steps are performed depends entirely on what is in the data. For a tile out at sea, the C worker doesn’t need to draw anything. For a tile in the heart of Los Angeles, it has to draw lots and lots of little polygons. But at this level of abstraction, I can think of “draw the tile” as one operation.

Under the covers, Google does a lot of work to make it look like everything is beautifully parallel. In reality, there probably aren’t as many workers as tiles, but the Google tools take care of dispatching jobs to workers until all the jobs are finished. To the developer, it all looks really clean and tidy.
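The shape of that abstraction can be imitated on a single machine with a worker pool. The sketch below uses Python's standard `concurrent.futures` module, not Google's tools, and the three task functions are placeholders standing in for the real geometry lookup, demographics lookup, and drawing code.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_geometry(tile):
    """Placeholder for the 'A group' task: look up polygons for a tile."""
    return {"tile": tile, "polygons": []}


def fetch_demographics(tile):
    """Placeholder for the 'B group' task: look up demographic values."""
    return {"tile": tile, "values": {}}


def draw_tile(geometry, demographics):
    """Placeholder for the 'C group' task: combine one A and one B result."""
    assert geometry["tile"] == demographics["tile"]
    x, y = geometry["tile"]
    return "tile %d,%d: %d polygons" % (x, y, len(geometry["polygons"]))


def render_all(tiles, workers=4):
    """Run the three groups over every tile; the pool hands jobs to workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        geometry = pool.map(fetch_geometry, tiles)          # A group
        demographics = pool.map(fetch_demographics, tiles)  # B group
        # Each draw job consumes exactly one A result and one B result.
        return list(pool.map(draw_tile, geometry, demographics))
```

There are only four workers here regardless of how many tiles there are, yet the calling code reads as "do this one operation to every tile".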

There are way more problems that lend themselves to SOMD than to SIMD, so I think this approach has enormous potential.

Permalink Comments off

03.14.06

Who are the maps for?

Posted in Maps, Random thoughts at 11:28 pm by ducky

As my maps approach something reasonable for public distribution, I’ve been talking to more people about them. People are starting to ask me, “Who do you think will use them? What do you think they will use them for?”

I’m not quite sure how to answer that. I imagine marketing people will be interested, though I have to believe that they already have this information.

Would researchers use it? Maybe for preliminary investigation, but I would hope they’d use ArcGIS for anything they want to publish. While the maps “look right” to me for most places I know about, there are a few places that don’t look right to me. ArcGIS is fundamentally better — they have many many more resources than I do to get things right.

The “value add” for my maps is not “better”, but “cheaper” and “more accessible”. Twelve-year-old Katie isn’t going to buy a copy of ArcGIS for her social studies class, but maybe she could use my maps for a report on the racial demographics of Texas. The Southern Poverty Law Center probably isn’t going to buy ArcGIS, but might go create a list of links to prisons to help people understand how African-Americans are hugely overrepresented in U.S. jails. Maybe Frieda and Joe will look at it to figure out what neighborhoods in Chicago they’d like to live in.

But my hunch is that most of the “use” won’t be obviously useful. I have certainly spent an awful lot of time just wandering around in the maps, exploring the demographics of my native country. Was this productive?

My maps aren’t very good for giving me answers, but they have given me lots of questions. Why are there so few rural blacks in Florida, when there are so many just across the border in Georgia? Why are there so few Latinos in East Texas compared to West Texas? Why is the median age so low on so many Native reservations? Why are there so many vacant housing units in northern Michigan and Minnesota?

However, I feel like these are good questions to have. Maybe I can’t articulate why I feel like a richer person for having explored U.S. demographics, but I absolutely do.

And if Katie, and Frieda, and Joe, and the Southern Poverty Law Center also feel enriched, then I will feel like I have succeeded.

Permalink Comments off

02.23.06

More advice to Google about maps

Posted in Maps, Technology trends at 11:06 pm by ducky

Because all the data associated with Google Maps goes through Google, they can keep track of that information. If they wanted to, they could store enough information to tell you what the most popular map markers within two miles of 1212 W. Springfield, Urbana, Illinois were. Maybe one would be from Joe’s Favorite Bars mashup and maybe one would be from the Museums of the World mashup. Maybe fifty would show buildings on the University of Illinois campus from the official UIUC mashup, and maybe two would be from Josie’s History of Computing mashup.

Google could of course then use that mashup data in their location-sensitive queries, so if I asked for “history computing urbana il”, they would give me Josie’s links instead of returning the Urbana Free Library. (They would need to be careful in how they did this in a way that didn’t tromp on Josie, if they want to stick to their “Don’t be evil” motto.)

This is another argument for why they should recognize a vested interest in making it easy for developers to add their own area-based data. If Google allows people to easily put up information about specific polygons, then Google can search those polygons. Right now, because I had to do my maps as overlays, Google can’t pull any information out of them.

If Google makes polygons and their corresponding data easy to name, identify, and access, they will be able to do very powerful things in the future.

Addendum: I haven’t reverse-engineered the Google Maps javascript — I realized that it’s quite possible that the marker overlays are all done on the client side. (Desirable and likely, in fact.) In that case, they wouldn’t have the data. However, it would be trivial to insert some code to send information about the markers up to the server. Would that be evil? I’m not sure.

Permalink Comments off

02.17.06

Disaster maps

Posted in Hacking, Maps, Technology trends at 2:27 pm by ducky

I was in San Jose when the 1989 Loma Prieta earthquake hit, and I remember that nobody knew what was going on for several days. I have an idea for how to disseminate information better in a disaster, leveraging the power of the Internet and the masses.

I envision a set of maps associated with a disaster: ones for the status of phone, water, natural gas, electricity, sewer, current safety risks, etc. For example, where the phones are working just fine, the phone map shows green. Where the phone system is up, but the lines are overloaded, the phone map shows yellow. Where the phones are completely dead, the phone map shows red. Where the electricity is out, the power map shows red.

To make a report, someone with knowledge — let’s call her Betsy — would go to the disaster site, click on a location, and see a very simple pop-up form asking about phone, water, gas, electricity, etc. She would fill in what she knows about that location, and submit. That information would go to several sets of servers (geographically distributed so that they won’t all go out simultaneously), which would stuff the update in their databases. That information would be used to update the maps: a dot would appear at the location Betsy reported.

How does Betsy connect to the Internet, if there’s a disaster?

She can move herself out of the disaster area. (Many disasters are highly localized.) Perhaps she was downtown, where the phones were out, and then rode her bicycle home, to where everything was fine. She could report on both downtown and her home. Or maybe Betsy is a pilot and overflew the affected area.
She could be some place unaffected, but take a message from someone in the disaster area. Sometimes there is intermittent communication available, even in a disaster area. After the earthquake, our phone was up but had a busy signal due to so many people calling out. What you are supposed to do in that situation is to make one phone call to someone out of state, and have them contact everybody else. So I would phone Betsy, give her the information, and have her report the information.
Internet service, because of its very nature, can be very robust. I’ve heard of occasions where people couldn’t use the phones, but could use the Internet.

One obvious concern is about spam or vandalism. I think Wikipedia has shown that with the right tools, community involvement can keep spam and vandalism at a minimum. There would need to be a way for people to question a report and have that reflected in the map. For example, the dot for the report might become more transparent the more people questioned it.
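One way the fading could work, purely as an illustration: halve the dot's opacity for every few challenges. Both the formula and the `half_life` parameter are inventions for this sketch; the right curve would need to be tuned against real reports.

```python
def report_opacity(questions, half_life=3):
    """Return dot opacity in (0, 1]; it halves every `half_life` challenges."""
    return 0.5 ** (questions / half_life)
```

An unchallenged report stays fully opaque, and a heavily questioned one fades toward invisible without ever being deleted outright.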

The disaster site could have many more things on it, depending upon the type of disaster: aerial photographs, geology/hydrology maps, information about locations to get help, information about locations to volunteer help, topographic maps (useful in floods), etc.

What would be needed to pull this off?

At least two servers, preferably at least three, that are geographically separated.
A big honkin’ database that can be synchronized between the servers.
Presentation servers, which work at displaying the information. There could be a Google Maps version, a Yahoo Maps version, a Microsoft version, etc.
A way for the database servers and the presentation servers to talk to each other.
Some sort of governance structure. Somebody is going to have to make decisions about what information is appropriate for that disaster. (Hydrology maps might not be useful in a fire.) Somebody is going to have to be in communication with the presentation servers to coordinate presenting the information. Somebody is going to have to make final decisions on vandalism. This governance structure could be somebody like the International Red Cross or something like the Wikimedia Foundation.
Buy-in from various institutions to publicize the site in the event of a disaster. There’s no point in the site existing if nobody knows about it, but if Google, Yahoo, MSN, and AOL all put links to the site when a disaster hit, that would be excellent.

I almost did this project for an MS thesis project, but decided against it, so I’m posting the idea here in the hopes that someone could run with it. I don’t foresee having the time myself.

Permalink Comments off
