11.30.08

Programming persistence

Posted in Hacking, Maps, Too Much Information at 11:21 pm by ducky

Warning: this is a long and geeky post.

From time to time in the past few years, I have mentioned that I was a little puzzled as to why more people didn’t render tiles on-the-fly for Google Maps, as I do in my U.S. Census Bureau/Google Maps mashup.

I have reappraised my attitude.  Redoing my mapping framework to make it easier to use has reminded me of all the hurdles I had to overcome, and I have discovered a number of annoying new ones.

First pass

I originally wrote my mapping framework in an extreme hurry.  It was a term project, and a month before the end of the term, I realized that it would be good for personal reasons to hand it in a week early.  The code functioned well enough to get me an A+, but I cut a huge number of corners.

Language/libraries/database choice

It was very important to minimize risk, so I wrote the framework in C++.  I would have liked to use a scripting language, but I knew that I would need a graphics library and a library to interpret shapefiles, and the only ones I found that looked reasonable were C-based libraries (Frank Warmerdam’s Shapelib library and Thomas Boutell’s gd library).  I knew it was possible to wrap C libraries for a scripting language using a tool called SWIG, but I hadn’t ever used SWIG and had heard that it was touchy.  Doing it in C++ was guaranteed to be painful, but I knew what the limits of that pain were.  I didn’t know what the limits of the pain of using SWIG would be.

Projection

I also had problems figuring out how to convert from latitude/longitude to pixel coordinates in the Google tile space.  At the time (December 2005), I had a hard time simply finding out what the mathematics of the Mercator transformation were.  (It is easier to find Mercator projection information now.)  I was able to figure out something that worked most of the time, but if you zoomed out past a certain level, there would be a consistent error in the y-coordinates.  The more you zoomed out, the bigger the error.  I’m pretty sure it’s some sort of rounding error.  I looked at it several times, trying to figure out where I could possibly have a roundoff error, but never did figure it out.  I just restricted how far people could zoom out.  (It also took a very long time to render tiles if you were way zoomed out, so it seemed reasonable to restrict it.)

Polygon intersection

I remember that I spent quite a lot of time on my polygon intersection code. I believe that I looked around the Web, didn’t find any helpful code, and so developed it from scratch on little sleep. (Remember, I was doing this in a frantic, frantic hurry.) I ended up with eight comparisons that needed to be done for each polygon in the database for every tile. More on this later.

Rendering bug

The version I handed in had a bug where horizontal lines would frequently show up at the bottom of tiles, as you could see in the bottom left tile of the screenshot.

It seemed pretty obvious that the bug was my fault: gd is a very mature and well-used graphics library, and my old office partner Carlos Pero had used it way back in 1994 to develop Carlos’ Coloring Book.

After I handed in my project, I spent quite a lot of time going through my code trying to figure out where the problem was with no luck.  Frustrated, I downloaded and built gd so that I could put breakpoints into the gd code.  Much to my surprise, I discovered that the bug was in the gd library!  I thus had to study and understand the gd code, fix it, report the bug (and patch), and of course blog about it so that other people wouldn’t have the same problem.

Pointing my code to the fixed gd

Then, in order to actually get the fix, I had to figure out how to statically link gd into my binaries. I like my ISP (Dreamhost) and wasn’t particularly interested in changing, but that meant I couldn’t use the system-installed gd libraries.  Statically linking wasn’t a huge deal, but it took me at least several hours to figure out which flag to insert where in my makefile to get it to build statically.  It was just one more thing.

Second pass

I have graduated, but haven’t found a job yet, so I decided to revamp my mapping framework. In addition to the aesthetic joy of making nice clean code:

  • It would be an opportunity to learn and demonstrate competence in another technology.
  • I had ideas for how I could improve the performance by pre-computing some things.
  • With a more flexible framework, I would be able to do some cool new mashups that I figured would get me more exposure, and hence lead to some consulting jobs.

Language/libraries/database choice

Vancouver is a PHP town, so I thought I’d give PHP a shot. I expected that I might have to rewrite my code in C++ eventually, but that I could get the basics of my improved algorithms shaken out first.  (I’m not done yet, but so far, I have been very very pleased with that strategy.)

I also decided to use MySQL.  While the feeling in the GIS community is that Postgres’ GIS extensions (PostGIS) are better than MySQL’s, I can’t run Postgres on my ISP, and MySQL is more widely used than Postgres.

I had installed PHP4 and MySQL 4 on my home computer some time ago, when I was working on Mapeteria.  However, I recently upgraded my home Ubuntu installation to Hardy Heron, and PHP4 was no longer supported.  That meant I needed to install a variety of packages, and I went through a cycle of downloading and installing, trying, and discovering I was missing yet another package; lather, rinse, repeat.  I needed to install mysql-server-5.0, mysql-client-5.0, php5, php5-mcrypt, php5-cli, php5-gd, libgd2-xpm-dev, php5-mysql, and php5-curl.  I also spent some time trying to figure out why php5 wouldn’t run scripts that were in my cgi-bin directory before realizing that with mod_php, scripts are supposed to run from everywhere except the cgi-bin directory.

Note that I could have done all my development on my ISP’s machines, but that seemed clunky.  I knew I’d want to be able to develop offline at some point, so I wanted to set that up sooner rather than later.  It’s also a little faster to develop on my local system.

I did a little bit of looking around for a graphics library, but stopped when I found that PHP had hooks to the gd library.  I knew that if gd had not yet incorporated my horizontal lines bug fix, then I might have to drop back to C++ in order to link in “my” gd, but I figured I could worry about that later.

Projection

I made a conscious decision to write my Mercator conversion code from scratch, without looking at my C++ code.  I did this because I didn’t want to be influenced in a way that might lead me to reproduce the same way-zoomed-out error I had gotten before.  I was able to find equations on the Wikipedia Mercator page for transforming Mercator coordinates to X-Y coordinates, but those equations didn’t give a scale for the X-Y coordinates!  It took some trial and error to sort that out.
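For concreteness, here is a minimal sketch of the sort of conversion involved, with the missing scale filled in: at zoom level z, the whole map is 256 * 2^z pixels on a side.  (These are illustrative names, not my framework’s actual code.)

<?php
// Sketch of a Mercator lat/lng -> Google pixel conversion (illustrative,
// not the framework's actual code).  At zoom $zoom, the world map is
// 256 * 2^$zoom pixels on a side; tile coordinates are pixel / 256.
function latLngToPixel($lat, $lng, $zoom) {
    $worldSize = 256 * pow(2, $zoom);          // pixels across the whole map
    $x = ($lng + 180.0) / 360.0 * $worldSize;  // longitude maps linearly
    $sinLat = sin(deg2rad($lat));              // latitude gets the Mercator stretch
    $y = (0.5 - log((1 + $sinLat) / (1 - $sinLat)) / (4 * M_PI)) * $worldSize;
    return array((int)floor($x), (int)floor($y));
}

// Sanity check: lat/lng (0, 0) at zoom 0 should land dead center of the
// single 256x256 world tile, i.e. pixel (128, 128).
list($x, $y) = latLngToPixel(0.0, 0.0, 0);
echo "$x, $y\n";  // prints: 128, 128
?>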

Data

For the initial development, I decided to use country boundaries instead of census tract boundaries. The code wouldn’t care which data it was using, and it would be nice to have tiles that rendered faster when way zoomed out. I whipped up a script that read a KML file with country boundaries (which I got from Valery Hronusov and had used in my Mapeteria project) and loaded it into MySQL.  Unfortunately, I had real problems with precision.  I don’t remember whether it was PHP or MySQL at fault, but I kept losing some precision in the latitude and longitude when I read and uploaded the data.  I eventually switched to uploading integers that were 1,000,000 times the latitude and longitude, and after that had no rounding difficulties.
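The workaround amounted to scaling before the data ever hit the database.  Something like this sketch (the table and column names are made up):

<?php
// Sketch of the precision workaround (table and column names made up):
// store round(coordinate * 1,000,000) as an integer rather than a float,
// so nothing gets silently truncated on the way into MySQL.
$lat = 49.2827123;
$lng = -123.1207456;
$latE6 = (int)round($lat * 1000000);   // 49282712
$lngE6 = (int)round($lng * 1000000);   // -123120746
$sql = sprintf("INSERT INTO boundary_points (lat_e6, lng_e6) VALUES (%d, %d)",
               $latE6, $lngE6);
// Divide by 1,000,000.0 when reading the values back out.
?>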

One thing that helped me enormously when working on the projection algorithm was to gather actual data via Google.  I found a number of places on Google Maps where three territories (e.g. British Columbia, Alberta, and Montana) come together.  I would determine the latitude/longitude of those points, then figure out what the tile coordinates, pixel X, and pixel Y of each point were at various zoom levels.  That let me assemble high-quality test cases, which were absolutely essential in figuring out what the transformation algorithm should be, but collecting that data was very slow, boring, and tedious.

Polygon intersection

When it came time to implement my polygon bounding box intersection code again, I looked at my old polygon intersection code, saw that it took eight comparisons, and thought to myself, “That can’t be right!”  Indeed, it took me very little time to come up with a version with only four comparisons (and this time I was able to find sources on the Web that describe that algorithm).
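The four-comparison version is the standard axis-aligned bounding-box test: two boxes overlap exactly when neither one is entirely off to one side of the other.  A sketch (again illustrative, not my framework’s actual code):

<?php
// Standard four-comparison overlap test for two axis-aligned bounding
// boxes, each given as array(minX, minY, maxX, maxY).
function bboxesIntersect($a, $b) {
    return $a[0] <= $b[2]    // A's left edge is not right of B's right edge
        && $a[2] >= $b[0]    // A's right edge is not left of B's left edge
        && $a[1] <= $b[3]    // A's bottom edge is not above B's top edge
        && $a[3] >= $b[1];   // A's top edge is not below B's bottom edge
}
?>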

Stored procedures

One thing that I saw regularly in employment ads was a request for experience with stored procedures, which became available in MySQL 5.  It seemed reasonable that using a stored procedure to calculate the bounding box intersection would be even faster, so I ran some timing tests.  In one, I used PHP to generate a complex WHERE clause string from eight values; in the other, I passed the eight values to a stored procedure and used that in the WHERE clause.  Much to my surprise, the stored procedure took almost 20 times as long!  I think I understand why, but it was interesting to discover that stored procedures are not always faster.
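For what it’s worth, the faster approach looked roughly like this sketch (table and column names are made up), with PHP assembling the WHERE clause string itself:

<?php
// Sketch of the faster approach: build the bounding-box WHERE clause in
// PHP instead of calling a stored procedure (names are made up).
function bboxWhereClause($minX, $minY, $maxX, $maxY) {
    return sprintf("min_x <= %d AND max_x >= %d AND min_y <= %d AND max_y >= %d",
                   $maxX, $minX, $maxY, $minY);
}
// Find every polygon whose bounding box touches the tile's bounding box
// (coordinates here are the integer lat/lng-times-1,000,000 values).
$sql = "SELECT id FROM polygons WHERE "
     . bboxWhereClause(-123500000, 49000000, -123000000, 49500000);
?>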

GIS extensions

My beloved husband had been harping on me to use the built-in GIS extensions.  I had been ignoring him because a large part of the point of this exercise was to learn more about MySQL, including stored procedures, but now that I had found that the stored procedure was slow, it was time to benchmark the built-in bounding box intersection routine.  If I stored the bounding box as a POLYGON type instead of as two coordinate pairs, calculating the intersection took half the time.  Woot!
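In other words, something like this sketch (again with made-up names), storing the bounding box in a POLYGON column and letting MySQL’s MBRIntersects() do the minimum-bounding-rectangle test:

<?php
// Sketch of the GIS-extensions version (table and column names made up):
// each row's bounding box lives in a POLYGON column named bbox, and
// MySQL's MBRIntersects() tests minimum-bounding-rectangle overlap.
$tileBox = "GeomFromText('POLYGON((-123.5 49.0, -123.0 49.0, "
         . "-123.0 49.5, -123.5 49.5, -123.5 49.0))')";
$sql = "SELECT id FROM polygons WHERE MBRIntersects(bbox, $tileBox)";
?>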

Rendering

I discovered that despite my having reported the horizontal lines bug fifteen months ago, the gd team still hasn’t done anything with it.  Needless to say, this means that the version of libgd.a on Dreamhost has the bug in it. I thought about porting back to C++, but I figured that would probably take a week at minimum and would raise the possibility of nasty pointer bugs, so it was worth spending a few days trying to get PHP to use my version of gd.

It is possible to compile your own version of PHP and use it, though it means using the CGI version of PHP instead of mod_php. I looked around for information on how to do that and found a Dreamhost page on how to do so, but failed utterly when I followed the directions. I almost gave up at that point, but sent a detailed message to Dreamhost customer support explaining what I was trying to do, why, and what was blocking me. On US Thanksgiving Day, I got a very thoughtful response back from Robert at Dreamhost customer support, which pointed me at a different how-to-compile-PHP-on-Dreamhost page that ultimately proved successful.  (This is part of why I like Dreamhost and don’t really want to change ISPs.)

Compiling unfamiliar packages can be a real pain, and this was no different.  The Dreamhost page (on their user-generated wiki) had a few scripts that would do the download and build for me, but they weren’t the whole story.  Each of the scripts downloaded a number of projects (openSSL, IMAP, cURL, etc.) in compressed form, extracted the files, and built them.  The scripts were somewhat fragile — they would just break if something didn’t work right.  They were sometimes opaque — they didn’t always print an error message if something broke.  If there was a problem, they started over from the beginning, removing everything that had been downloaded and extracted.  Something small — like the mirror site for mcrypt being so busy that the download timed out — would mean starting from scratch.  (I ended up iteratively commenting out large swaths of the scripts so that I wouldn’t have to redo work.)

There was some problem with the IMAP build having to do with SSL.  I finally changed one of the flags so that IMAP built without SSL — figuring that I was unlikely to be using this instance of PHP to do IMAP, let alone IMAP with SSL — but it took several false starts, each taking quite a long time to go through.

Finally, once I got it to build without my custom gd, I tried folding in my gd. I uploaded my gd/.libs directory, but that wasn’t enough — it wanted the gd.h file. I suppose I could have tried to figure out exactly what it wanted and where it wanted it, but I figured it would be faster to just re-build gd on my Dreamhost account and then do a make install to some local directory. Uploading my source was fast, and the build was slow but straightforward. However, I couldn’t figure out how to specify where the install should go. The makefiles were all autogenerated and very difficult to follow. I tried to figure out where in configure the install directory got set, but that too was hard to decipher. Finally, I just hand-edited the default installation directory. So there. That worked. PHP built!

Unfortunately, it wouldn’t run. It turned out that the installation script had a bug in it:

cp ${INSTALLDIR}/bin/php ${HOME}/${DOMAIN}/cgi-bin/php.cgi

instead of

cp ${INSTALLDIR}/bin/php.cgi ${HOME}/${DOMAIN}/cgi-bin/php.cgi

But finally, after all that, success!

Bottom line

So let me review what it took to get to tile rendering on the second pass:

  1. Choose a database and figure out how to extract data from it, requiring reading and learning.
  2. Find and load boundary information into the database, requiring trial and error.
  3. Choose a graphics library and figure out how to draw coloured polygons with it, requiring reading and learning.
  4. Gather test cases for converting from latitude/longitude into the Google coordinate system, requiring patience.  A lot of patience.
  5. Figure out how to translate from latitude/longitude pairs into the Google coordinate system, requiring algorithmic skills.
  6. Diagnose and fix a bug in a large-ish C graphics library, requiring skill debugging in C.
  7. Download and install PHP and MySQL, requiring system administration skills.
  8. Figure out how to build a custom PHP, requiring understanding of bash scripts and makefiles.

So now, I guess it isn’t that easy to generate tiles!

Note: there is an entirely different ecosystem for generating tiles, one that comes from the mainline GIS world and descends from the ESRI ecosystem. PostGIS, GeoTools, and uDig look like fine tools, but they are complex tools with many, many features.  Had I gone that route, I would have had to wade through a lot of documentation about features I didn’t care about.  (I also would have had to figure out which ISP to move to in order to get Postgres.)  I think that it would have taken me long enough to learn and install that ecosystem’s tools that it wouldn’t have been worth it for the relatively simple things that I needed to do.  Your mileage may vary.

10.20.08

Hire me!

Posted in Canadian life, Eclipse, Hacking, programmer productivity at 3:28 pm by ducky

I am looking for a job. If you know anybody in the Vancouver area who is looking for a really good hire, point them at this blog posting or send them to my resume.

Ideally, I’d like an intermediate (possibly junior, let’s talk) Java software development position, preferably one where I could become an expert in Java-based web applications. (Java because I think it has the best long-term prospects. My second choice would be a Ruby position, as it looks like Ruby has long-term staying power as well.) I believe that I would advance quickly, as I have very broad experience with many languages and environments.

Development is only partially about coding, and I am very, very strong in the secondary skills. While studying programmer productivity for my MSCS, I uncovered a few unknown or poorly known, broadly applicable, teachable skills that improve programmer productivity. Hire me, and not only will you get a good coder, you will get someone who makes the coders around her better.

I am particularly good at seeing how to solve users’ problems, sometimes problems that they themselves do not see. I am good at interaction design, at seeing the possibilities in a new technology, and at combining the two. I have won a few awards for web sites I developed.

I also have really good communication skills. I’ve written books, I blog, and I’ve written a number of articles. I even have significant on-camera television experience.

If you want a really low-stress way of checking me out, invite me to give a talk on programmer productivity. (I like giving the talk, and am happy to give it to basically any group of developers, at basically any time, anywhere in the BC Lower Mainland. Outside of the Lower Mainland, you have to pay me for the travel time.)

Do you know of any developer opportunities that I should check out? Let me know at ducky at webfoot dot com.

09.10.08

Tripoli: Differential code coverage tool

Posted in Eclipse, Hacking, programmer productivity at 6:41 pm by ducky

In the observations of professional programmers that I did for my thesis, I frequently saw them get sidetracked because they didn’t have good information about whether the code they were looking at was executed as part of the feature they were exploring. I theorized that being able to look at the differences in code coverage between two runs — one where they exercised the feature and one where they didn’t — would be useful. Being able to see which methods executed in the first but not the second would hopefully help developers focus in quickly on the methods that were important to their task.
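At its core, the idea is nothing more than a set difference over the two runs’ coverage data.  Conceptually (an illustrative sketch, not Tripoli’s actual code, which is a Java-based Eclipse plug-in):

<?php
// Conceptual sketch of differential coverage: report the methods that
// executed in the feature run but not in the baseline run.
$featureRun  = array("Frame.paint", "Frame.endResizingFrame", "Cursor.set");
$baselineRun = array("Frame.paint", "Cursor.set");
$suspects = array_diff($featureRun, $baselineRun);
print_r($suspects);  // => Frame.endResizingFrame
?>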

In my last month at UBC, I whipped up a prototype, an Eclipse plug-in called Tripoli, and ran a quickie user study. The differential code coverage information really does make a big difference. You can read a detailed account in UBC Technical Report TR-2008-14, but basically graduate students using Tripoli were frequently more successful than professional programmers who didn’t use Tripoli.

As an example, in the Size task, the subject needed to find where a pane was resized. That was hard. There is no menu item or keystroke invocation for developers to use as a starting point. They also didn’t know what the pane was called: was it a canvas, a pane, a panel, a frame, a drawing, or a view? Between pilot subjects and actual subjects, at least eleven people tried and failed to find a method that was always called when a pane was resized and never called when it wasn’t. I even spent a few hours trying to figure out where the pane was resized and couldn’t.

As soon as I got Tripoli working, I tried the Size task, and in my first diff, I saw ten methods, including one named endResizingFrame(). And yes, it turned out to be the right spot. I realized that most of the extra methods showed up because, in my baseline run, I hadn’t hovered over the corner of the frame in a way that changed the cursor into the grab handle, so I reran.  This time I got exactly one method: endResizingFrame().  Wow. The graduate students in my second study were also all able to find endResizingFrame() by using Tripoli.

Tripoli does take some getting used to. Even as the author, it took me a while before I consistently remembered that I could use Tripoli on my own problems. I’ve also clearly gotten better over time at figuring out what to do in my second run to ensure that the “diff” has very few methods in it.

If you want to try Tripoli out, I’ve posted it online. Just remember it’s a prototype.

07.29.08

geek cool alert: Triage

Posted in Hacking, programmer productivity, Technology trends, Uncategorized at 11:01 am by ducky

There’s a cool paper on a tool to do semi-automatic debugging: Triage: diagnosing production run failures at the user’s site. While Triage was designed to diagnose bugs at a customer site (where the software developers don’t have access to either the configuration or the data), I think a similar tool would be very valuable even for debugging in-house.

They use a number of different techniques to debug C++ code.

  • Checkpoint the code at a number of steps.
  • Attempt to reproduce the bug.  This tells them whether it is deterministic or not.
  • Analyze the memory by walking the heap and stack to find possible corruptions.
  • Roll back to previous checkpoints and rerun, looking for buffer overflows, dangling pointers, double frees, data races, semantic bugs, etc.
  • Fuzz the inputs: intentionally vary the inputs, thread scheduling, memory layouts, signal delivery, and even control flows and memory states to narrow down the conditions that trigger the failure for easy reproduction.
  • Compare the code paths from failing replays and non-failing replays to determine what code was involved in the failure.
  • Generate a report.  This gives information on the failure and a suggestion of which lines to look at to fix it.

They did a user study and found that programmers took 45% less time to debug “real” bugs when they used Triage than when they didn’t, and 18% less time for “toy” bugs.  (“…although Triage still helped, the effect was not as large since the toy bugs are very simple and straightforward to diagnose even without Triage.”)

It looks like the subjects were given the Triage bug reports before they started work, so the time that it takes to run Triage wasn’t factored into their debugging time.  The time it took Triage to run was significant (up to 64 minutes for one of the bugs), but presumably a Triage run would be done in the background.  I could set Triage running while I went to lunch, for example.

This looks cool.

05.12.08

it's not just me!

Posted in Hacking, Random thoughts at 8:31 pm by ducky

It’s not just me! Someone else gets frustrated by Web sites that won’t allow dashes or spaces in credit card numbers!

It is just unfathomable to me why any Web site would not insert the ONE LINE OF CODE to handle spaces and dashes. I am glad to see that it’s not just me.
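For the record, that one line is something like this:

<?php
// The one line: strip spaces and dashes before validating the number.
$cardNumber = "4111-1111 1111 1111";  // the number as the user typed it
$cardNumber = preg_replace('/[\s-]/', '', $cardNumber);  // "4111111111111111"
?>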

05.10.08

geek humour

Posted in Family, Hacking, Random thoughts at 9:38 am by ducky

My husband and I are geeks. This manifests itself in many ways. One way is that when we moved up to Canada from Palo Alto, we numbered all of the boxes and logged all of the contents of all of the boxes.

In anticipation of our move to a tiny tiny apartment in downtown Vancouver, I packed up a box of books and class notes to take down to our storage locker in the US. Jim said that he’d been assigning new boxes numbers in the 200 series — 200, 201, etc.

Me: “Jim, would you be a name service for box numbers?”

He pulled out his PDA and got ready.

Me: “Um, ‘Hello.'”

Jim said nothing but was suppressing a grin. He continued to say nothing.

Me: “Doh! Right! Carriage-return, carriage-return!”

Much laughter ensued. We are such geeks. 🙂

(When talking DIRECTLY to Web servers, i.e. not through a Web browser, you have to issue a command like “GET / HTTP/1.0” and follow it with *two* carriage returns. One won’t do, and it’s a really easy mistake to make.)

(PS, Yes, I know that HELOs (used in email protocols) don’t need two carriage returns.)

(PPS, Yes, I know that technically it’s CRLF, CRLF, not CR, CR.)
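(PPPS, for the curious, here is a PHP sketch of talking to a web server by hand:

<?php
// Sketch: speaking HTTP/1.0 to a web server directly.  The request line
// must be followed by CRLF CRLF -- the blank line tells the server that
// the request is complete.  Send only one, and the server sits there
// waiting for the rest of your request.
$fp = fsockopen("www.example.com", 80, $errno, $errstr, 10);
if ($fp) {
    fwrite($fp, "GET / HTTP/1.0\r\n\r\n");
    while (!feof($fp)) {
        echo fgets($fp, 1024);
    }
    fclose($fp);
}
?>

)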

04.16.08

help? tab spam in different IDEs?

Posted in Eclipse, Hacking, Technology trends at 5:42 pm by ducky

There are a zillion graphical IDEs out there, and I really don’t want to download and try each one. I don’t even want to try twenty of them. So, dear readers, care to help me?

All the IDEs that I’ve seen have a main center panel with the source of the file you’re looking at. Above that, they all seem to have a row of tabs, one per file that you’ve opened. (Does anybody have anything different?)

Here is a link to a screenshot of Eclipse. (Sorry, it was too wide for my blog.) Eclipse puts a little “>>” to show that there are more files (as pointed to by the ugly black arrow), with the number of hidden tabs next to it (“4” in this case). Clicking on the “>>4” gives a drop-down menu (shown here in yellow).

What happens in other IDEs when you have more tabs than will fit across the horizontal width of the source file? How does your IDE handle it? Or does your IDE have a different tabbing model altogether, e.g. it opens one window per file?  I would greatly appreciate hearing from you.

You can either email me ducky at webfoot dot com or post in the comments; I’ll wait a few days and then summarize in another blog posting.  Remember to tell me the name of your IDE.

03.11.08

robobait: how to turn off microphone in Windows XP on ThinkPad

Posted in Hacking, robobait at 9:24 am by ducky

To turn off the microphone on a ThinkPad running Windows XP:

Start->

Control Panel->

Adjust the system volume ->

In the Device volume area in the middle of the Volume tab, click on Advanced. I know, I know, it looks like you are adjusting speaker volume, but go ahead and do it.

If the microphone doesn’t show up as one of the devices, select the menu item Options->Properties and put a check in the Microphone box.

In the microphone sub-panel, put a check in the Mute box.

(I am soooo not a Windows person…)

02.21.08

open source funding models

Posted in Hacking, Random thoughts at 9:05 pm by ducky

Paul Ramsey posted recently that open source is funded by four basic types of people:

  • The altruist / tinkerer
  • The service provider
  • The systems integrator
  • The company man

He continued by saying, “Notably missing from this list is ‘the billionaire’ and ‘the venture capitalist’.”

I have to quibble a little bit.

Mitch Kapor personally financed (most of) the Open Source Applications Foundation, which pumped quite a bit of money into open source projects.

Obviously a lot of it went into the Chandler project, but some went into various framework projects in order to make them work well enough to use.  For example, OSAF supported someone full-time for several years to improve the Mac version of wxWidgets.

There is also a fair amount of work done by people who made a ton of money and are now tinkering.  I know personally a few people who made boodles on IPOs, retired, and now spend their time on open-source projects.  While you could say that these are in Paul’s “tinkerer” class, these people can invest a lot more energy than someone on nights and weekends is likely to.

I will acknowledge, however, that the fraction of open source work financed by rich folks is probably small.

However, I think Paul is missing some significant sources of open source financing.  Paul’s definition of “company man” talks of people who have a little bit of discretionary time at their work.  Maybe it’s different outside of Silicon Valley, but I don’t know many people who have much discretionary time at work.  On the other hand, there are a fair number of people whose work leads them to contribute to open source in the course of their work.

  • Sometimes they use an open-source project in their work, find that it has some bugs that they need to fix to make their project work, and contribute the fixes back.  This is self-interest: they would rather not port their fixes into each new rev of the open-source project.  My husband said that his former company contributed some fixes for Tk for this reason.
  • There are a few companies who use a technology so much that they feel it worthwhile to support work on that technology.  Google pays Guido van Rossum’s salary, for example.
  • I don’t know what category to put Mozilla into, but it makes money from partnerships (i.e. Google) to finance its development.
  • There is a non-zero amount of money that goes into open source — directly or indirectly — from governments and other non-profit funding agencies.  For example, the Common Solutions Group gave USD$1.25M to OSAF; the Mellon Foundation gave USD$1.6M.  This made perfect sense: it is far cheaper for university consortia to give OSAF a few million dollars to develop an open-source calendar than it is to give Microsoft tens of millions every year for Exchange.
  • While I don’t have hard data, I bet that a fair amount of open-source work gets done at universities.  All of my CS grad student colleagues work with open source because it’s cheap and easy to publish with.  While they might not release entire projects, I bet I’m not the only one who has fed bug fixes back to the project.

10.26.07

seeking story source

Posted in Hacking at 12:29 pm by ducky

I heard a story once, but don’t have any idea where it came from. It’s too good a story to let die, so do any of you know where this story came from?

A computer science department had a programming competition of some sort. The winner would be the team that computed the result fastest. The contestants could enter as many times as they wanted without penalty.

One team entered a (slow) program that calculated the value, wrote it to a file, submitted it, and quit. They then entered a second program which read the value from that file and submitted it. They were way, way faster than any of the other teams.

The computer science department was split. About a third didn’t care. A third thought that the team were cheaters who should be completely disqualified and have scorn heaped upon them. A third thought that the team was brilliant and should get the top prize and high praises.

The way I remember the story, the department was ripped apart by this issue. It laid bare fundamental differences in values, such that the faculty was unable to work together.

Is this apocryphal? Urban legend? True? Help me out here….
