08.16.14

The Perfect To-Do List Manager

Posted in Hacking, Random thoughts, Technology trends at 5:11 pm by ducky

There are a huge number of to-do list managers (TDLMs) out in the world now, but none of them do what I want.  Apparently, it’s not just me: I just read an article which said that when students were asked what mobile apps they really wanted, 20% said they wanted “a comprehensive to-do + calendaring + life management app that helps them better organize their lives”.  TWENTY percent!

Is it really that hard?

I have strong opinions about what I want, and I don’t think it’s that hard, so I will describe my desires here in the hopes that somebody will make me the perfect TDLM.  (Some of the features you can see in a TDLM which I wrote for a class project.  Sometimes I think about writing my perfect TDLM, but I’m busy with other things.  I want it to exist, not to write it myself.)

The most important thing is that the TDLM should make you feel better about your tasks.  The biggest problem with TDLMs right now is that they make you feel guilty: the list grows and grows and grows because there are an infinite number of things it would be nice to do and only a finite amount of time.  This means that every time you open the TDLM, you feel overwhelmed by guilt at all the things you haven’t done yet.

1. Hide stuff you can’t work on right now because of blocking tasks.  Don’t show me “paint the bedroom” if I haven’t finished the task of “choose colour for bedroom”.  (This means you need UI for showing what tasks depend upon which other tasks, and I warn you that’s not as easy as you think.)

2. Hide stuff you won’t work on right now because you are busy with other things.  Don’t show me “paint the bedroom” if I have decided that I’m not going to start that project until I finish doing my taxes.  “Do taxes” is not truly a blocking task — it’s not like I am going to use the tax forms to apply the paint — but hide it anyway. (This means you need UI for showing what the sequencing of tasks is.)

3. Hide stuff you won’t work on right now because it is the wrong time of year.  Maybe you want a task of “buy new winter jacket”, but you want to wait until the end of winter to take advantage of the sales on coats.  You should be able to tell your TDLM to hide that task until March.  (Or until May, if you live in Manitoba.)  Or “rotate tires” is something which only needs to happen every six months.

Note that this implies connecting the TDLM to a calendar, at least minimally.

4. Allow recurring to-do list items.  I don’t want to have to make a new task for our wedding anniversary every year.  I want to set it once and forget it.  Usually people put things on their calendars for repeating events, but “Wedding Anniversary” goes on August 22nd and is not a task.  “Plan something for anniversary” is a recurring task but should be hidden until about August 1st.

The TDLM should distinguish between recurring tasks which expire and those which do not.  Non-expiring tasks are ones like “pay phone bill”.  Even if you forget to pay it by the due date, you still need to deal with it.  On the other hand, “run 2km” is an expiring item: if you couldn’t do your 2km run on Monday, it probably does not mean that you should run 4km on Wednesday.
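
To make items 1 through 4 concrete, here is a rough sketch of how a single task record might capture those rules.  Every field name here is made up; this is an illustration, not a design.

    // A sketch, not a spec: hypothetical fields for the hiding and recurrence
    // rules in items 1-4 above.
    const planAnniversary = {
      id: 17,
      title: "Plan something for anniversary",
      blockedBy: [],                 // item 1: hard dependencies on other tasks
      after: [],                     // item 2: "don't show until X is done", even though X is not a real dependency
      hideUntil: "August 1",         // item 3: hide until a calendar date
      recurrence: { every: "1 year", expires: false }   // item 4: non-expiring recurrence
    };

    // An expiring recurrence, "run 2km"-style: a missed instance just lapses.
    const run = {
      title: "Run 2km",
      recurrence: { every: "2 days", expires: true }
    };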

5. Make me feel super-good about finishing tasks.  A lot of TDLMs handle checking something as done by making it disappear.   This is the worst.  I’ve spent hours, weeks, or months looking at that dang task, and when I finally finish it, I want to savour the moment!  I want my TDLM to cheer, to have fireworks explode on the screen, and maybe even have the text of the task writhe in agony as it dies an ugly painful death.  I want there to be a display case in my TDLM of things that I have finished recently that I can look at with pride.  “Yeah”, I can think, “I am ***DONE*** with painting the bedroom!”  Maybe I don’t need full fireworks for a simple, one-step task which took 15 minutes, but if it was a 2000-step task which took 5 years (like getting a PhD or something like that), I want the TDLM to cheer for a full five minutes.

6. Let me see what I did. Sometimes, I feel like I didn’t get anything done, and it is reassuring to look at a list of the things that I actually did accomplish.  It might be nice to show it in a horizontal latest-first timeline form:

  • 4:47 pm Laundry
  • 3:13 pm Groceries
  • 12:11 pm Replace laptop display
  • (etc)

I would also like to be able to modify the task completion times.  “Oh, I actually finished replacing the laptop last night, I just didn’t feel like telling the TDLM because it was late and I was tired.”

7. Let me see what I am going to do.  People usually use calendars for this, but as I mentioned before, calendars are kind of the wrong tool.  I don’t really want to see “buy birthday present for Mom” in the same place as “Meet with boss, 10:30 AM”.  Plus, a strict time-base makes zero sense if the dependencies are other tasks.

8. Let me import/modify/export task hierarchies.  Suppose you want to have a wedding.  (Mazel Tov!)  There are predictable things which you need to do: book a space for the wedding, a space for the reception, book an officiant, book a caterer, choose a menu, etc.   If, say, you want a wedding sort of like your friend Joanne’s, it would be nice if Joanne could email you the hierarchy of tasks that she did for her wedding, and you could just drop it in to your TDLM.  (Perhaps that way, you wouldn’t forget to rent a dance floor.)

But maybe you have some Greek heritage and Joanne does not, so you need to add “get a stefana” to your list.  You should be able to do that — and then export your new wedding task list for your brother when he gets married.  Even better, you ought to be able to upload it to a site which hosts lots of packaged tasks, maybe even a whole section on weddings (so your brother could pick and choose which wedding task list he likes best).

Needless to say, the exported task hierarchy should be in a form which lends itself well to version control and diffing.  🙂
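
For illustration only, Joanne’s hierarchy might export as something like this.  The format and field names are invented; any plain nested text format that puts one task per line would diff nicely.

    // Hypothetical export format for a task hierarchy (a sketch, not a standard).
    const weddingTasks = {
      title: "Wedding",
      children: [
        { title: "Book a space for the ceremony" },
        { title: "Book a space for the reception",
          children: [ { title: "Rent a dance floor" } ] },
        { title: "Book an officiant" },
        { title: "Book a caterer",
          children: [ { title: "Choose a menu" } ] },
        { title: "Get a stefana" }   // the Greek addition from the example above
      ]
    };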

9. Let me share my task list with other people.  I would like to be able to share my “home” task list with my husband, so that he could assign me tasks like “buy three kitchen sponges”.  Ideally, I think I’d like there to be three task lists: his, mine, and ours.

My husband and I would probably set things up so that we both have read/write permission on all three, even though there are some things that only one of us can or should do.  I can imagine other couples might want to not have write permission on each other’s, only on the “ours” one.

10. Make it easy to discuss tasks.  This means assigning a simple ID and URL to the task.  If Jim and I are going to share tasks, we are going to discuss them.  It would be nice to be able to say, “Task #45” instead of “that one about the paintbrushes”.  It would also be nice to be able to email a link to him which will take him right to Task #45.

11. (Nice to have) Allow putting a time estimate on the task.  If you know that it takes you about two hours to get to your locker, change clothes, stretch, run 2km, stretch, shower, change clothes, and get back to your workplace, then it might be nice to put in an estimate for the “run 2km” task.

If you can put a time estimate on a task and adjust it later, the TDLM could keep track of estimated vs. actual, and start to help you adjust your estimates.  “For tasks which you estimate are going to be about 3hrs, you spend an average of 4.15 hrs.”

It would also be nice if the TDLM could help you make estimates based on similar tasks which you completed.  When entering an estimate for painting the living room, it would be nice if the TDLM mentioned, off to the side, how long it took you to paint the bathroom and the bedroom.  (It’s even okay if it also tells you how long it took you to paint the landscape or your fingernails; presumably you’d be smart enough to figure out which tasks were relevant.)

12. (Nice to have) Make the TDLM geo-aware.  It would be kind of nice to be able to hide tasks until I was at or near a particular location.  For example, if I am not in a big hurry to paint the bedroom, hide “buy paint” until I am actually at the paint store.

Something requested by the students in the article I mentioned earlier was being told to leave in order to make it to the next appointment.  “Doctor’s appointment at 3pm” is a calendar event, but “get to doctor’s office” is a task which needs to happen at a time which depends upon how long it takes to get to the doctor’s office from where you are.  That’s another way that geo-awareness could be useful.

13. (Maybe nice to have) Be able to mark urgency.  I am not actually certain how useful this is.  I have had TDLMs which allowed me to mark urgency, and I found that I almost never used it.  I think people will expect it, however.

14. (Nice to have, but difficult) Integrate with my applications.  Tasktop Technologies has a product called Tasktop Dev, which keeps track of what you do in the source code editor (and some other applications, e.g. the web browser and Microsoft Office) while you are working on a specific task.  (You have to tell it, “now I am working on task #47” so that it knows to start watching.)  Then there is a record of what you worked on for that task.  That is useful if you need to stop and restart the task (especially over a long period of time), or if you need to go back a long time later and see what you did.  (“What was the URL of that caterer with the really nice cheesecake?”)

In a work environment, it would be nice to integrate it with other task management systems (AKA “bug trackers”) like Jira or Asana or Bugzilla.

This is what I want.  If it persists in not existing, I might have to do it myself someday.

04.18.13

Variably-sized points on maps

Posted in Hacking, Maps at 1:20 am by ducky

The Huffington Post made a very nice interactive map of homicides and accidental gun deaths since the shooting at Sandy Hook.  It’s well done, but it has the (very common) problem that it mostly shows where the population density is high: of course you will have more shootings if there are more people.

I wanted to tease out geographical/political effects from population density effects, so I plotted the gun deaths on a population-based cartogram.  Here was my first try.  (Click on it to get a bigger image.)

Unfortunately, the Huffington Post data gives the same latitude/longitude for every shooting in the same city.  This makes it seem like there are fewer deaths in populated areas than there really are.  So for my next pass, I did a relatively simple map where the radius of the dots was proportional to the square root of the number of gun deaths (so that the area of the dot would be proportional to the number of gun deaths).

 

 

This also isn’t great.  Some of the dots are so big that they obscure other dots, and you can’t tell if all the deaths were in one square block or spread out evenly across an entire county.

For the above map, for New York City, I dug through news articles to find the street address of each shooting and geocoded it (i.e. determined the lat/long of that specific address). You can see that the points in New York City (which is the sort of blobby part of New York State at the south) seem more evenly distributed than for e.g. Baltimore.  Had I not done that, there would have been one big red dot centered on Manhattan.

(Side note: It was hugely depressing to read article after article about people’s — usually young men’s — lives getting cut short, usually for stupid, stupid reasons.)

I went through and geocoded many of the cities.  I still wasn’t satisfied with how it looked: the size balance between the 1-death and the multiple-death circles looked wrong.  It turns out that it is really hard — maybe impossible — to get area equivalence for small dots.  The basic problem is that radiuses are integers, limited by pixels.  In order to get the area proportional to gun deaths, you would want the radius to be proportional to the square root of the number of gun deaths, i.e. {1, 1.414, 1.732, 2.0, 2.236, 2.449, 2.645, 2.828, 3.000}, but the rounded radiuses will be {1, 1, 2, 2, 2, 2, 3, 3, 3}; instead of areas of {pi, 2*pi, 3*pi, 4*pi, …}, you get {pi, pi, 4*pi, 4*pi, 4*pi, 4*pi, 9*pi, 9*pi, 9*pi}.

Okay, fine.  We can use a trick like anti-aliasing, but for circles: if the square root of the number of gun deaths is between two integer values (e.g. 2.236 is between 2 and 3), draw a solid circle with a radius of the just-smaller integer (for 2.236, use 2), then draw a transparent circle with a radius of the just-higher integer (for 2.236, use 3), with the opacity higher the closer the square root is to the higher number. Sounds great in theory.
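
Here is a sketch of that blending idea, written against the HTML5 canvas API purely for illustration (the dot colour and data plumbing are placeholders):

    // Draw a dot whose *area* is proportional to the number of deaths, using
    // the floor/ceil blending trick described above.
    function drawDeathDot(ctx, x, y, deaths) {
      const r = Math.sqrt(deaths);      // ideal (fractional) radius
      const rLow = Math.floor(r);       // just-smaller integer radius
      const rHigh = Math.ceil(r);       // just-larger integer radius
      const frac = r - rLow;            // how close we are to the larger radius

      // Solid circle at the smaller radius.
      ctx.fillStyle = "rgba(200, 0, 0, 1.0)";
      ctx.beginPath();
      ctx.arc(x, y, rLow, 0, 2 * Math.PI);
      ctx.fill();

      // Translucent circle at the larger radius; the ring between the two
      // radii gets an opacity proportional to the fractional part.
      if (frac > 0) {
        ctx.fillStyle = "rgba(200, 0, 0, " + frac.toFixed(3) + ")";
        ctx.beginPath();
        ctx.arc(x, y, rHigh, 0, 2 * Math.PI);
        ctx.fill();
      }
    }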

In practice, however, things still didn’t look perfect.  It turns out that for very small dot sizes, the dots’ approximations to circles are pretty poor.  If you actually zoom in and count the pixels, the area in pixels is {5, 13, 37, 57, 89, 118, 165, …} instead of what pi*R^2 would give you, namely {3.1, 12.6, 28.3, 50.3, 78.5, 113.1, 153.9, …}.

 

But wait, it’s worse: combine the rounding radiuses problem with the problem of approximating circles, and the area in pixels will be {5, 5, 13, 13, 13, 13, 37, 37, 37, …}, for errors of {59.2%, -60.2%, -54.0%, -74.1%, -83.4%, -67.3%, -76.0%, …}.  In other words, the 1-death dot will be too big and the other dots will be too small.  Urk.

Using squares is better.  You still have the problem of rounding the radius, but you don’t have the circle approximation problem.  So you get areas in pixels of {1, 1, 4, 4, 4, 4, 9, 9, 9, …} instead of {1, 2, 3, 4, 5, 6, 7, 8, …}, for errors of {0.0%, -50.0%, 33.3%, 0.0%, -20.0%, -33.3%, 28.6%, …}, which is better, but still not great.

Thus: rectangles:

Geocoding provided by GeoCoder.ca

 

03.12.13

Optimizing Map Tile Generation

Posted in Hacking, Maps at 11:54 am by ducky

In the past, when people asked me how I managed to make map tiles so quickly on my World Wide Webfoot Maps site, I just smiled and said, “Cleverness.” I have decided to publish how I optimized my map tile generation in hopes that others can use these insights to make snappier map services. I give a little background of the problem immediately below; mapping people can skip to the technical details.

Background

Choropleth maps colour jurisdictions based on some attribute of the jurisdiction, like population. They are almost always implemented by overlaying tiles (256×256 pixel PNG images) on some mapping framework (like Google Maps).

Map tile from a choropleth map (showing 2012 US Presidential voting results by county)

Most web sites with choropleth maps limit the user: users can’t change the colours, and frequently are restricted to a small set of zoom levels. This is because the tiles are so slow to render that the site developers must render the tile images ahead of time and store them.  My mapping framework is so fast that I do not need to pre-render all the tiles for each attribute. I can allow the users to interact with the map, going to arbitrary zoom levels and changing the colour mapping.

Similarly, when people draw points on Google Maps, 100 is considered a lot. People have gone to significant lengths to develop several different techniques for clustering markers. By contrast, my code can draw thousands very quickly.

There are 32,038 ZIP codes in my database, and my framework can show a point for each with ease. For example, these tiles were generated at the time this web page loaded them.

32,038 zip codes at zoom level 0 (entire world)
Zip codes of the Southeast US at zoom level 4

(If the images appeared too fast for you to notice, you can watch the generation here and here. If you get excited, you can change size or colour in the URL to convince yourself that the maps framework renders the tile on the fly.)

Technical Details

The quick summary of what I did to optimize the speed of the map tile generation was to pre-calculate the pixel coordinates, pre-render the geometry and add the colours later, and optimize the database. In more detail:

Note that I do NOT use parallelization or fancy hardware. I don’t do this in the cloud with seventy gajillion servers. When I first wrote my code, I was using a shared server on Dreamhost, with a 32-bit processor and operating system. Dreamhost has since upgraded to 64-bits, but I am still using a shared server.

Calculating pixel locations is expensive and frequent

For most mapping applications, buried in the midst of the most commonly-used loop to render points is a very expensive operation: translating from latitude/longitude to pixel coordinates, which almost always means translating to Mercator projection.

While converting from longitude to the x-coordinate in Mercator is computationally inexpensive, converting from latitude to the y-coordinate is quite expensive, especially for something which gets executed over and over and over again.

A spherical Mercator translation (which is actually much simpler than the projection which Google uses) uses one logarithmic function, one trigonometric function, two multiplications, one addition, and some constants which will probably get optimized away by the compiler:

function lat2y(a) { return 180/Math.PI * Math.log(Math.tan(Math.PI/4+a*(Math.PI/180)/2)); }

(From the Open Street Maps wiki page on the Mercator projection)

Using Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs by Agner Fog, a tangent can take between 11 and 190 cycles, and a logarithm can take between 10 and 175 cycles on post-Pentium processors. Adds and multiplies are one cycle each, so converting from latitude to y will take between 24 and 368 cycles (not counting latency). The average of those is almost 200 cycles.

And note that you have to do this every single time you do something with a point. Every. Single. Time.

If you use elliptical Mercator instead of spherical Mercator, it is much worse.

Memory is cheap

I avoid this cost by pre-calculating all of the points’ locations in what I call the Vast Coordinate System (VCS for short). The VCS is essentially a pixel space at Google zoom level 23. (At zoom level 23, the world is 2^23 tiles across, and each tile is 256 or 2^8 pixels wide, so there are 2^31 pixels around the equator; the circumference of the Earth is about 40,075,000 meters, so the pixel resolution of this coordinate system is approximately 2cm at the equator, which should be adequate for most mapping applications.)

Because the common mapping frameworks work in powers of two, to get the pixel coordinate (either x or y) at a given zoom level from a VCS coordinate only requires one right-shift (cost: 1 cycle) to adjust for zoom level and one bitwise AND (cost: 1 cycle) to pick off the lowest eight bits. The astute reader will remember that calculating the Mercator latitude takes, for the average post-Pentium processor, around 100 times as many cycles.
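
A minimal sketch of that conversion, in JavaScript with hypothetical names (my real code is PHP, but the arithmetic is the same):

    // vcs is an x or y coordinate that was pre-computed once, at import time,
    // as a pixel coordinate at zoom level 23 (so it fits in 0 .. 2^31 - 1).
    function vcsToTileAndPixel(vcs, zoom) {
      // One shift adjusts from zoom level 23 down to the requested zoom...
      const globalPixel = vcs >>> (23 - zoom);
      // ...then the low eight bits are the pixel offset within the 256-pixel
      // tile, and the remaining high bits are the tile number.
      return { tile: globalPixel >>> 8, pixel: globalPixel & 0xFF };
    }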

Designing my framework around VCS and the Mercator does make it harder to change the projection, but Mercator won: it is what Google uses, what Yahoo uses, what Bing uses, and even what the open-source Leaflet uses. If you want to make map tiles to use with the most common services, you use Mercator.

Furthermore, should I decide that I absolutely have to use a different projection, I would only have to add two columns to my points database table and do a bunch of one-time calculations.

DISTINCT: Draw only one ambiguous point

If you have two points which are only 10 kilometers apart, then when you are zoomed way in, you might see two different pixels for those two points, but when you zoom out, at some point, the two points will merge and you will only see one pixel. Setting up my drawing routine to only draw the pixel once when drawing points is a big optimization in some circumstances.

Because converting from a VCS coordinate to a pixel coordinate is so lightweight, it can be done easily by MySQL, and the DISTINCT keyword can be used to only return one record for each distinct pixel coordinate.
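
For illustration, here is the kind of query that makes possible; the table and column names are hypothetical, not my actual schema:

    // minX/maxX/minY/maxY are the tile's bounds in VCS coordinates.  MySQL
    // does the shift itself, and DISTINCT collapses every point that lands
    // on the same pixel into a single row.
    function buildPointQuery(zoom, minX, maxX, minY, maxY) {
      const shift = 23 - zoom;   // the VCS is pixel space at zoom level 23
      return "SELECT DISTINCT (vcs_x >> " + shift + ") AS px, " +
             "(vcs_y >> " + shift + ") AS py " +
             "FROM points" +
             " WHERE vcs_x BETWEEN " + minX + " AND " + maxX +
             " AND vcs_y BETWEEN " + minY + " AND " + maxY;
    }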

The DISTINCT keyword is not a big win when drawing polygons, but it is a HUGE, enormous win when drawing points. Drawing points is FAST when I use the DISTINCT keyword, as shown above.

For polygons, you don’t actually want to remove all but one of a given point (as the DISTINCT keyword would do), you want to not draw two successive points that are the same. Doing so is a medium win (shaving about 25% of the time off) for polygon drawing when way zoomed out, but not much of a win when way zoomed in.

Skeletons: Changing the colour palette

While the VCS speed improvement means that I could render most tiles in real time, I still could not render tiles fast enough for a good user experience when the tiles had very large numbers of points. For example, the 2000 Census has 65,322 census tracts; at zoom level 0, that was too many to render fast enough.

Instead, I rendered and saved the geometry into “skeletons”, with one set of skeletons for each jurisdiction type (e.g. census tract, state/province, country, county). Instead of the final colour, I filled the polygons in the skeleton with an ID for the particular jurisdiction corresponding to that polygon. When someone asked for a map showing a particular attribute (e.g. population) and colour mapping, the code would retrieve (or generate) the skeleton, look up each element in the colour palette, decode the jurisdictionId, look up the attribute value for that jurisdictionId (e.g. what is the population for Illinois?), use the colour mapping to get the correct RGBA colour, and write that back to the colour palette. When all the colour palette entries had been updated, I gave it to the requesting browser as a PNG.
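
Stripped of all the gd and PHP details, the palette walk is just this (a sketch; the palette layout and the two lookup helpers are hypothetical):

    // Each palette entry in a saved skeleton is really a jurisdiction ID.
    // attributeValue(id) returns e.g. the population of that jurisdiction;
    // colourFor(value) applies the user's colour mapping and returns RGBA.
    function colourizeSkeleton(palette, attributeValue, colourFor) {
      return palette.map(function (jurisdictionId) {
        return colourFor(attributeValue(jurisdictionId));
      });
    }

For most tiles the palette has far fewer entries than the tile’s 65,536 pixels, which is why this is so cheap.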

While I came up with the idea of fiddling the colour palette independently, it is not unique to me. My friend also came up with this idea independently, for example. What I did was take it a bit farther: I modified the gd libraries so they had a 16-bit colour palette in the skeletons which I wrote to disk. When writing out to PNG, however, my framework uses the standard format. I then created a custom version of PHP which statically linked my custom GD libraries.

(Some people have asked why I didn’t contribute my changes back to gd. It’s because the pieces I changed were of almost zero value to anyone else, while very far-reaching. I knew from testing that my changes didn’t break anything that I needed, but GD has many many features, and I couldn’t be sure that I could make changes in such a fundamental piece of the architecture without causing bugs in far-flung places without way more effort than I was willing to invest.)

More than 64K jurisdictions

16 bits of palette works fine if you have fewer than 64K jurisdictions on a tile (which the 2000 US Census Tract count just barely slid under), but not if you have more than 64K jurisdictions. (At least not with the gd libraries, which don’t reuse a colour palette entry if that colour gets completely overwritten in the image.)

You can instead walk through all the pixels in a skeleton, decode the jurisdiction ID from the pixel colour and rewrite that pixel instead of walking the colour palette. (You need to use a true-colour image if you do that, obviously.) For large numbers of colours, changing the colour palette is no faster than walking the skeleton; it is only a win for small numbers of colours. If you are starting from scratch, it is probably not worth the headache of modifying the graphics library and statically linking in a custom version of GD into PHP to walk the colour palette instead of walking the true-colour pixels.
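
The pixel walk is the same idea applied to every pixel instead of every palette entry (again a sketch, assuming the jurisdiction ID is packed into the low 24 bits of each true-colour pixel):

    // pixels is a flat array of 32-bit values from a true-colour skeleton.
    function colourizeTrueColourSkeleton(pixels, attributeValue, colourFor) {
      for (let i = 0; i < pixels.length; i++) {
        const jurisdictionId = pixels[i] & 0xFFFFFF;            // decode the ID
        pixels[i] = colourFor(attributeValue(jurisdictionId));  // packed RGBA value
      }
    }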

(I had to modify GD anyway due to a bug I fixed in GD which didn’t get incorporated into the GD release for a very long time.  My patch finally got rolled in, so you don’t need to do that yourself.)

My framework now checks to see how many jurisdictions are in the area of interest; if there are more than 64K, it creates a true-colour image, otherwise a paletted image. If the skeleton is true-colour, it walks pixels, otherwise it walks the palette.

Credits: My husband implemented the pixel-walking code.

On-demand skeleton rendering

While I did pre-render about 10-40 tiles per jurisdiction type, I did not render skeletons for the vast majority of tiles. Instead, I render and save a skeleton only when someone asks for it. I saw no sense in rendering ahead of time a high-magnification tile of a rural area. Note that I could only do this on-demand skeleton generation because the VCS speedup made it so fast.

I will also admit that I did generate final tiles (with the colour properly filled in, not a skeleton) to zoom level 8 for some of my most commonly viewed census tract attributes (e.g. population by census tract) with the default value for the colour mapping. I had noticed that people almost never change the colour mapping. I did not need to do this; the performance was acceptable without doing so. It did make things slightly snappier, but mostly it just seemed to me like a waste to repeatedly generate the same tiles. I only did this for US and Australian census jurisdictions.

MySQL vs. PostGIS

One happy sort-of accident is that my ISP, Dreamhost, provides MySQL but does not allow PostGIS. I could have found myself a different ISP, but I was happy with Dreamhost, Dreamhost was cheap, and I didn’t particularly want to change ISPs. This meant that I had to roll my own tools instead of relying upon the more fully-featured PostGIS.

MySQL is kind of crummy for GIS. Its union and intersection operators, for example, use bounding boxes instead of the full polygon. However, if I worked around that, I found that for what I needed to do, MySQL was actually faster (perhaps because it wasn’t paying the overhead of GIS functions that I didn’t need).

PostGIS’ geometries are apparently stored as serialized binary objects, which means that you have to pay the cost of deserializing the object every time you want to look at it or one of its constituent elements. I have a table of points, a table of polygons, and a table of jurisdictionIds; I just ask for the points, no deserialization needed. Furthermore, at the time I developed my framework, there weren’t good PHP libraries for deserializing WKB objects, so I would have had to write my own.

Note: not having to deserialize is only a minor win. For most people, the convenience of the PostGIS functions should be worth the small speed penalty.

Database optimization

One thing that I did that was entirely non-sexy was optimizing the MySQL database. Basically, I figured out where there should be indices and put them there. This sped up the code significantly, but it did not take cleverness, just doggedness. There are many other sites which discuss MySQL optimization, so I won’t go into that here.

Future work: Feature-based maps

My framework is optimized for making polygons, but it should be possible to render features very quickly as well. It should be possible to, say, decide to show roads in green, not show mountain elevation, show cities in yellow, and not show city names.

To do this, make a true-colour skeleton where each bit of the pixel’s colour corresponds to the display of a feature. For example, set the least significant bit to 1 if a road is in that pixel. Set the next bit to 1 if there is a city there. Set the next bit to 1 if a city name is displayed there. Set the next bit to 1 if the elevation is 0-500m. Set the next bit to 1 if the elevation is 500m-1000m. Etc. You then have 32 feature elements which you can turn on and off by adjusting your colour mapping function.
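
For illustration, with made-up feature names and layer settings, the colour mapping function might look something like this:

    // Each bit of a skeleton pixel means "this feature is present here".
    const ROAD      = 1 << 0;
    const CITY      = 1 << 1;
    const CITY_NAME = 1 << 2;
    const ELEV_LOW  = 1 << 3;   // elevation 0-500m
    const ELEV_MID  = 1 << 4;   // elevation 500m-1000m

    // layers is the user's choice, e.g. { roads: "#00ff00", cities: "#ffff00" };
    // a feature whose layer is switched off simply contributes no colour.
    function colourForPixel(featureBits, layers) {
      if ((featureBits & ROAD) && layers.roads)  return layers.roads;
      if ((featureBits & CITY) && layers.cities) return layers.cities;
      return "#ffffff";   // background
    }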

If you need more than 32 feature elements, then you could use two skeletons and merge the images after you do the colour substitutions.

You could also, if you chose, store elevation (or depth) in the pixel, and adjust the colouring of the elevation with your colour mapping function.

Addendum: I meant to mention, but forgot, UTFGrid.  UTFGrid has an array backing the tile for lookup of features, so it is almost there.  However, it is just used to make pop-up windows, not (as near as I can tell) to colour polygons.

01.02.10

Early days of the computer revolution

Posted in Hacking at 3:39 pm by ducky

I was talking to a friend of mine who I’m guessing was born in the late 1970s, and mentioned that I had been using computers since 1968ish and email since 1974.  (Yes, really.)

I saw a lightbulb go off over his head.  He knew I was in my mid-forties, and knew I was a computer geek, but hadn’t ever really put two and two together.  “Oh!  You were around for the start of the personal computer revolution?  What was that like?  That must have been totally *COOL*!!!”

I said, “Not really.”

He was stunned.  How could it have possibly not been totally cool and awesome?  I could see him struggling with trying to figure out how to express his confusion, how to figure out what question to even ask.

I said, “Look.  You are old enough that you were around for the start of the mobile phone revolution, right?  That must have been totally *COOL*!!!”

“Oh”, he said.  “I get it.”

Mobile phones, when they first started out, were kind of cool, I guess, but they weren’t “magic” then: they took real effort and patience.  They were wickedly expensive, heavy, had lousy user interfaces, and you had to constantly worry about whether you had enough battery life for the call.  The reception was frequently (usually?) poor, so even if you could make a connection, your call frequently got dropped.  The signal quality was poor, so you had to TALK LOUDLY to be heard and really concentrate to understand the other person.  And they didn’t do much.  It took a long time for mobile phones to become “magic”, and they only got better incrementally.

When they first came out, personal computers were kind of cool, I guess, but they weren’t “magic” then: they took real effort and patience.  They were wickedly expensive.  They crashed frequently enough that you always had to worry about saving your work.  They had so little disk space that managing your storage was a constant struggle (and why floppies held on for so very long after the introduction of the hard drive).

When I started my first job out of college (working at a DRAM factory for Intel in 1984), they had only recently put in place two  data-entry clerks to input information about the materials (“lots”) as the lots traversed the manufacturing plant. (When did the lot arrive at a processing step?  When did it get processed?  Who processed it?  What were the settings and reading on the machine?)  However, to get at that information, engineers like me had to get a signature from higher level of management to authorize a request to MIS (which might get turned down!) for that information.

A few months after I started, they bought three IBM PC XTs and put them in a cramped little room for us engineers to use.  I believe the only programs on them were a word processor and a spreadsheet.  There was no storage available to us.  They had hard drives, but we were not allowed to leave anything on the hard drive; we had to take our work away on floppies.  The PCs were not networked, so not only was there no email (and of course no Web), but no way to access the data that was collected out on the manufacturing floor.

If I wanted to make a spreadsheet analyzing e.g. the relationship between measured thickness of the aluminum layer, the measured sputtering voltage, and how long it had been since the raw materials had been replenished, I would go to the factory floor, walk around to find different lots in different stages of processing, copy the information to a piece of paper, take the piece of paper and a floppy disk to the computer room (hoping that there was a free computer), copy the data from the piece of paper into the spreadsheet, print the spreadsheet (maybe making a graph, but that was a little advanced), and copy the data onto my floppy if I wanted to look at it again.

Even though we “had computers”, we had no network, no email, and no wiki.  The way I shared information was still:

  • make a bunch of photocopies and put them in people’s (physical) mailboxes,
  • make a bunch of photocopies and walk around putting them on people’s desks,
  • make a bunch of photocopies and pass them out at a meeting where I presented my results, or
  • make one photocopy, write a routing list on it (a list of names with checkboxes), and put it on the first person’s desk.  They would read it, check their name off, and pass it to someone else on the list.

At the next company I worked at (1985-7), I got a PC on my desk because I was implementing the materials tracking system (because I had bitched about how stupid it was not to have one — yeah!!).  Our company had a network, but it was expensive and complicated enough that my desktop computer wasn’t on the network, nor were the two machines on the floor.  Ethernet used a coax cable and (if I recall correctly) you had to make a physical connection by puncturing the cable just right.  The configuration was tricky and not very fault-tolerant: if one computer on the network was misconfigured, it would mess up the entire network AND it was difficult even for a skilled network technician to figure out which computer was misconfigured.

At the next company I worked for (1987), I had a Sun workstation on my desk, and we had a network file servers, but I don’t think we used email, even internally.

At my next company (1988-90), I had a Wyse 50 “glass teletype” on my desktop and full email capability.  The Wyse 50 didn’t have any graphics, but that wasn’t a real big deal because no programs I would ever want to use at work had graphics of any sort. My department used email heavily, but there were some departments in the company that did not use email, so there were lots of memos that were still issued on paper.  They would go either into my (physical) mailbox, or would be pinned to cubicle corridor walls.

While I had the theoretical ability to send email to the outside world then, almost nobody I knew outside the company had email, and figuring out how to address messages to get to the outside was difficult: you had to specify all the intermediate computers, e.g. sun!ubc!decwrl!decshr!slaney.

It wasn’t until that company imploded in about 1990 and my colleagues scattered to other computer companies that I had anyone outside my company to correspond with.  (Fortunately, at about the same time, it got easier to address external email messages.)

It wasn’t until 1991 that I stopped seeing paper memos — a full fifteen years after the introduction of the Apple II computer.  My husband reports that he also stopped seeing paper memos in about 1991.

I would contend that computers didn’t really start to become “magic” until about 1996 or so, when the World Wide Web had been absorbed by the masses and various Web services were available.  Only after about 1996 could you pretty reliably assume that anyone (well, those born after WW2 started, at least) used computers or had an email address.

So when personal computers first came out, they were not totally cool.  The idea of personal computers was totally cool.  The potential was totally cool.  But that potential was unrealized for many many years.

08.15.09

Progress! Including census tracts!

Posted in Hacking, Maps at 9:54 pm by ducky

It might not look like I have done much with my maps in a while, but I have been doing quite a lot behind the scenes.

Census Tracts

I am thrilled to say that I now have demographic data at the census tract level on my electoral map!  Unlike my old demographic maps (e.g. my old racial demographics map), the code is fast enough that I don’t have to cache the overlay images.  This means that I can allow people to zoom all the way out if they choose, while before I only let people zoom back to zoom level 5 (so you could only see about 1/4 of the continental US at once).

These speed improvements were not easy, and it’s still not super-fast, but it is acceptable.  It takes between 5-30 seconds to show a thematic map for 65,323 census tracts. (If you think that is slow, go to Geocommons, the only other site I’ve found to serve similarly complex maps on-the-fly.  They take about 40 seconds to show a thematic map for the 3,143 counties.)

A number of people have suggested that I could make things faster by aggregating the data — show things per-state when way zoomed out, then switch to per-county when closer in, then per-census tract when zoomed in even more.  I think that sacrifices too much.  Take, for example, these two slices of a demographic map of the percent of the population that is black.  The %black by county is on the left, the %black by census tract is on the right.  The redder an area is, the higher the percentage of black people is.

Percent of population that is black; by counties on left, by census tracts on the right


You’ll notice that the map on the right makes it much clearer just how segregated black communities are outside of the “black belt” in the South.  It’s not just that black folks are a significant percentage of the population in a few Northern counties, they are only significantly present in tiny little parts of Northern counties.  That’s visible even at zoom level 4 (which is the zoom level that my electoral map opens on).  Aggregating the data to the state level would be even more misleading.

Flexibility

Something else that you wouldn’t notice is that my site is now more buzzword-compliant!  When I started, I hard-coded the information layers that I wanted: what the name of the attribute was in the database (e.g. whitePop), what the English-language description was (e.g. “% White”), what colour mapping to use, and what min/max numeric values to use.  I now have all that information in an XML file on the server, and my client code calls to the server to get the information for the various layers with AJAX techniques.  It is thus really easy for me to insert a new layer into a map or even to create a new map with different layers on it.  (For example, I have dithered about making a map that shows only the unemployment rate by county, for each of the past twelve months.)

Some time ago, I also added the ability to specify how to calculate a new number from two different attributes.  Before, if I wanted to plot something like %white, I had to add a column to the database of (white population / total population) and map that.  Instead, I added the ability to do divisions on-the-fly.  Subtracting two attributes was also obviously useful for things like the difference in unemployment from year to year.  While I don’t add two attributes together yet, I can see that I might want to, e.g. to show the percentage of people who are either Evangelical or Mormon.  (If you come up with an idea for how multiplying two attributes might be useful, please let me know.)

Loading Data

Something else that isn’t obvious is that I have developed some tools to make it much easier for me to load attribute data.  I now use a data definition file to spell out the mapping between fields in an input data file and where the data should go in the database.  This makes it much faster for me to add data.

The process still isn’t completely turnkey, alas, because there are a million-six different oddnesses in the data.  Here are some of the issues that I’ve faced with data that makes it non-straightforward:

  • Sometimes the data is ambiguous.  For example, there are a number of states that have two jurisdictions with the same name: the census separately records a region for Bedford City, VA and one for Bedford County, VA.  Both are frequently just named “Bedford” in databases, so I have to go through by hand and figure out which Bedford it is and assign the right code to it.  (And sometimes when the code is assigned, it is wrong.)
  • Electoral results are reported by county everywhere except Alaska, where they are reported by state House district.  That meant that I had to copy the county shapes to a US federal electoral districts database, then delete all the Alaskan polygons, load up the state House district polygons, and copy those to the US federal electoral districts database.
  • I spent some time trying to reverse-engineer the (undocumented) Census Bureau site so that I could automate downloading Census Bureau data.  No luck so far.  (If you can help, please let me know!)  This means that I have to go through an annoyingly manual process to download census tract attributes.
  • Federal congressional districts have names like “CA-32” and “IL-7”, and the databases reflect that.  I thought I’d just use the state jurisdiction ID (the FIPS code, for mapping geeks) for two digits and two digits for the district ID, so CA-32 would turn into 0632 and IL-7 would turn into 1707.  Unfortunately, if a state has a small enough population, it only gets one congressional rep; the data file had entries like “AK-At large”, which not only messed up my parsing, but raised the question of whether at-large congresspeople should be district 0 or district 1.  I scratched my head and decided to assign 0 to at-large districts.  (So AK-At large became 0200.)  Well, I found out later that data files seem to assign at-large districts the number 1, so I had to redo it.

None of these data issues are hard problems, they are just annoying and mean that I have to do some hand-tweaking of the process for almost every new jurisdiction type or attribute.  It also takes time just to load the data up to my database server.

I am really excited to get the on-the-fly census tract maps working.  I’ve been wanting it for about three years, and working on it off and on (mostly off) for about six months.  It really closes a chapter for me.

Now there is one more quickie mapping application that I want to do, and then I plan to dive into adding Canadian information.  If you know of good Canadian data that I can use freely, please let me know.  (And yes, I already know about GeoGratis.)

05.20.09

Gender and programming

Posted in Hacking, programmer productivity, robobait, Technology trends at 11:44 am by ducky

I had a very brief but very interesting talk with Prof. Margaret Burnett.  She does research on gender and programming at Oregon State University, but was in town for the International Conference on Software Engineering.  She said that many studies have shown that women are — in general — more risk averse than men are.  (I’ve also commented on this.)  She said that her research found that risk-averse people (most women and some men) are less likely to tinker, to explore, to try out novel features in both tools and languages when programming.

I extrapolate that this means that risk-seeking people (most men and some women) are more likely to have better command of tools, and this ties into something that I’ve been voicing frustration with for some time — there is no instruction on how to use tools in the CS curriculum — but I had never seen it as a gender-bias issue before.  I can see how a male universe would think there was no need to explain how to use tools because they figured that the guys would just figure it out on their own.  And most guys might — but most of the women and some of the men might not figure out how to use tools on their own.

In particular, there is no instruction on how to use the debugger: not on what features are available, not on when you should use a debugger vs. not, and none on good debugging strategy.  (I’ve commented on that here.)  Some of using the debugger is art, true, but there are teachable strategies – practically algorithms — for how to use the debugger to achieve specific ends.   (For example, I wrote up how to use the debugger to localize the causes of hangs.)

Full of excitement from Prof. Burnett’s revelations, I went to dinner with a bunch of people connected to the research lab I did my MS research in.  All men, of course.  I related how Prof. Burnett said that women didn’t tinker, and how this obviously implied to me that CS departments should give some instruction on how to use tools.  The guys had a different response: “The departments should teach the women how to tinker.”

That was an unsatisfying response to me, but it took me a while to figure out why.  It suggests that the risk-averse pool doesn’t know how to tinker, while in my risk-averse model, it is not appropriate to tinker: one shouldn’t goof off fiddling with stuff that has a risk of not being useful when there is work to do!

(As a concrete example, it has been emotionally very difficult for me to write this blog post today.  I think it is important and worthwhile, but I have a little risk-averse agent in my head screaming, screaming at me that I shouldn’t be wasting my time on this: I should be applying for jobs, looking for an immigration lawyer, doing laundry, or working on improving the performance of my maps code.  In other words, writing this post is risky behaviour: it takes time for no immediate payoff, and only a low chance of a future payoff.  It might also be controversial enough that it upsets people.  Doing laundry, however, is a low-risk behaviour: I am guaranteed that it will make my life fractionally better.)

To change the risk-averse population’s behaviour, you would have to change their entire model of risk-reward.  I’m not sure that’s possible, but I also think that you shouldn’t want to change the attitude.  You want some people to be risk-seeking, as they are the ones who will get you the big wins.  However, they will also get you the big losses.  The risk-averse people are the ones who provide stability.

Also note that because there is such asymmetry in task completion time between above-median and below-median, you might expect that a bunch of median programmers are, in the aggregate, more productive than a group at both extremes.  (There are limits to how much faster you can get at completing a task, but there are no limits to how much slower you can get.)  It might be that risk aversion is a good thing!

There was a study I heard of second-hand (I wish I had a citation — anybody know?) that found that startups with a lot of women (I’m remembering 40%) had much MUCH higher survival rates than ones with lower proportions of women.  This makes perfect sense to me; a risk-averse population would rein in the potentially destructive tendencies of a risk-seeking population.

Thus I think it does make sense to provide academic training in how to use tools.  This should perhaps be coupled with some propaganda about how it is important to set aside some time in the future to get comfortable with tools.  (Perhaps it should be presented as risky to not spend time tinkering with tools!)

UPDATE: There’s an interesting (though all-too-brief!) article that mentions differences in the biochemical responses to risk that men and women produce.  It says that men produce adrenaline, which is fun.  Women produce acetylcholine, which the article says pretty much makes them want to vomit.  That could certainly change one’s reaction to risk.

04.21.09

UBC programming team takes 34th in the world!

Posted in Canadian life, Hacking, University life at 11:47 am by ducky

The UBC programming team took 34th place at the 2009 International Collegiate Programming Contest (ICPC) World Finals in Stockholm!  W00t!!

This marks the sixth year in a row that UBC has gone to the World Finals, despite being entirely undergraduates and entirely without World Finals experience.  (We have information on past teams, but don’t know the seniority of the 2004 team.)

Congratulations, team!

11.30.08

Programming persistence

Posted in Hacking, Maps, Too Much Information at 11:21 pm by ducky

Warning: this is a long and geeky post.

From time to time in the past few years, I have mentioned that I was a little puzzled as to why more people didn’t render tiles on-the-fly for Google Maps, as I do in my U.S. Census Bureau/Google Maps mashup.

I have reappraised my attitude.  I have been redoing my mapping framework to make it easier to use.  I have reminded myself of all the hurdles I had to overcome, and discovered a number of annoying new ones.

First pass

I originally wrote my mapping framework in an extreme hurry.  It was a term project, and a month before the end of the term, I realized that it would be good for personal reasons to hand it in a week early.  The code functioned well enough to get me an A+, but I cut a huge number of corners.

Language/libraries/database choice

It was very important to minimize risk, so I wrote the framework in C++.  I would have liked to use a scripting language, but I knew that I would need to use a graphics library and a library to interpret shapefiles.  The only ones I found that looked reasonable were C-based libraries (Frank Warmerdam’s Shapelib library and Thomas Boutell’s gd library).   I knew it was possible to call C libraries from a scripting language using a tool called SWIG, but I hadn’t ever used SWIG and had heard that it was touchy.  Doing it in C++ was guaranteed to be painful, but I knew what the limits of that pain were.  I didn’t know what the limits of pain of using SWIG would be.

Projection

I also had problems figuring out how to convert from latitude/longitude to pixel coordinates in the Google tile space.  At the time (December 2005), I had a hard time simply finding out what the mathematics of the Mercator transformation were.  (It is easier to find Mercator projection information now.)  I was able to figure out something that worked most of the time, but if you zoomed out past a certain level, there would be a consistent error in the y-coordinates.  The more you zoomed out, the bigger the error.  I’m pretty sure it’s some sort of rounding error.  I looked at it several times, trying to figure out where I could possibly have a roundoff error, but never did figure it out.  I just restricted how far people could zoom out.  (It also took a very long time to render tiles if you were way zoomed out, so it seemed reasonable to restrict it.)

Polygon intersection

I remember that I spent quite a lot of time on my polygon intersection code. I believe that I looked around the Web and didn’t find any helpful code, so developed it from scratch on little sleep. (Remember, I was doing this in a frantic, frantic hurry.) I ended up with eight comparisons that needed to be done for each polygon in the database for every tile. More on this later.

Rendering bug

The version I handed in had a bug where horizontal lines would show up at the bottom of tiles frequently, as you can see in the bottom left tile:

It was pretty obvious that the bug was my fault, as gd is a very mature and well-used graphics library.  My old office partner Carlos Pero had used it way back in 1994 to develop Carlos’ Coloring Book, so it was clear to me that the problem was my fault.

After I handed in my project, I spent quite a lot of time going through my code trying to figure out where the problem was with no luck.  Frustrated, I downloaded and built gd so that I could put breakpoints into the gd code.  Much to my surprise, I discovered that the bug was in the gd library!  I thus had to study and understand the gd code, fix it, report the bug (and patch), and of course blog about it so that other people wouldn’t have the same problem.

Pointing my code to the fixed gd

Then, in order to actually get the fix, I had to figure out how to statically link gd into my binaries. I like my ISP (Dreamhost) and wasn’t particularly interested in changing, but that meant I couldn’t use the system-installed gd libraries.  Statically linking wasn’t a huge deal, but it took me at least several hours to figure out which flag to insert where in my makefile to get it to build statically.  It was just one more thing.

Second pass

I have graduated, but haven’t found a job yet, so I decided to revamp my mapping framework. In addition to the aesthetic joy of making nice clean code:

  • It would be an opportunity to learn and demonstrate competence in another technology.
  • I had ideas for how I could improve the performance by pre-computing some things.
  • With a more flexible framework, I would be able to do some cool new mashups that I figured would get me more exposure, and hence lead to some consulting jobs.

Language/libraries/database choice

Vancouver is a PHP town, so I thought I’d give PHP a shot. I expected that I might have to rewrite my code in C++ eventually, but that I could get the basics of my improved algorithms shaken out first.  (I’m not done yet, but so far, I have been very very pleased with that strategy.)

I also decided to use MySQL.  While the feeling in the GIS community is that Postgres’ GIS extensions (PostGIS) are better than the GIS extensions to MySQL, I can’t run Postgres on my ISP, and MySQL is used more than Postgres.

I had installed PHP4 and MySQL 4 on my home computer some time ago, when I was working on Mapeteria.  However, I recently upgraded my home Ubuntu installation to Hardy Heron, and PHP4 was no longer supported.  That meant I needed to install a variety of packages, and I went through a process of downloading, trying, discovering I was missing a package, downloading/installing, discovering I was missing a package, lather, rinse, repeat.  I needed to install mysql-server-5.0, mysql-client-5.0, php5, php5-mcrypt, php5-cli, php5-gd, libgd2-xpm-dev, php5-mysql, and php5-curl.  I also spent some time trying to figure out why php5 wouldn’t run scripts that were in my cgi-bin directory before realizing/discovering that with mod_php, it was supposed to run from everywhere but the cgi-bin directory.

Note that I could have done all my development on my ISP’s machines, but that seemed clunky.  I knew I’d want to be able to develop offline at some point, so wanted to get it done sooner rather than later.  It’s also a little faster to develop on my local system.

I did a little bit of looking around for a graphics library, but stopped when I found that PHP had hooks to the gd library.  I knew that if gd had not yet incorporated my horizontal lines bug fix, then I might have to drop back to C++ in order to link in “my” gd, but I figured I could worry about that later.

Projection

I made a conscious decision to write my Mercator conversion code from scratch, without looking at my C++ code.  I did this because I didn’t want to be influenced in a way that might lead me to get the same error at way-zoomed-out that I did before.  I was able to find equations on the Wikipedia Mercator page for transforming Mercator coordinates to X-Y coordinates, but those equations didn’t give a scale for the X-Y coordinates!  It took some trial and error to sort that out.
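
For reference, here is the transformation with the scale folded in, sketched in JavaScript.  This is the standard spherical Mercator tile math, not necessarily line-for-line what my code ended up doing: at zoom level z, the world is 256 * 2^z pixels across.

    function latLngToPixel(lat, lng, zoom) {
      const worldPixels = 256 * Math.pow(2, zoom);   // the missing scale factor
      const x = (lng + 180) / 360 * worldPixels;
      const latRad = lat * Math.PI / 180;
      const mercY = Math.log(Math.tan(Math.PI / 4 + latRad / 2));  // spherical Mercator
      const y = (0.5 - mercY / (2 * Math.PI)) * worldPixels;       // y = 0 at the north edge
      return { x: x, y: y };
    }

The tile coordinates are then just Math.floor(x / 256) and Math.floor(y / 256), and the pixel within the tile is the remainder.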

Data

For the initial development, I decided to use country boundaries instead of census tract boundaries. The code wouldn’t care which data it was using, and it would be nice to have tiles that would render faster when way-zoomed-out. I whipped up a script that read a KML file with country boundaries (that I got from Valery Hronusov and used in my Mapeteria project) and loaded it into MySQL.  Unfortunately, I had real problems with precision.  I don’t remember whether it was PHP or MySQL, but I kept losing some precision in the latitude and longitude when I read and uploaded it.  I eventually converted to uploading integers that were 1,000,000 times the latitude and longitude, and so had no rounding difficulties.

One thing that helped me enormously when working on the projection algorithm was to gather actual data via Google.  I found a number of places on the Google maps where three territories (e.g. British Columbia, Alberta, and Montana) came together.  I would determine the latitude/longitude of those points, then figure out what the tile coordinates, pixel X, and pixel Y of that point were for various zoom levels.  That let me assemble high-quality test cases, which were absolutely essential in figuring out what the transformation algorithm should be, but it was very slow, boring, and tedious to collect that data.

Polygon intersection

When it came time to implement my polygon bounding box intersection code again, I looked at my old polygon intersection code again, saw that it took eight comparisons, and thought to myself, “That can’t be right!”  Indeed, it took me very little time to come up with a version with only four comparisons, (and was now able to find sources on the Web that describe that algorithm).
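
The four-comparison version is just the standard axis-aligned bounding-box test, sketched here in JavaScript:

    // Two boxes overlap unless one is entirely to one side of the other.
    function bboxIntersects(a, b) {
      return a.minX <= b.maxX && b.minX <= a.maxX &&
             a.minY <= b.maxY && b.minY <= a.maxY;
    }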

Stored procedures

One thing that I saw regularly in employment ads was a request for use of stored procedures, which became available with MySQL 5.  It seemed reasonable that using a stored procedure to calculate the bounding box intersection would be even faster, so I ran some timing tests.  In one, I used PHP to generate a complex WHERE clause string from eight values; in the other, I passed eight values to a stored procedure and used that in the WHERE clause.  Much to my surprise, it took almost 20 times more time to use the stored procedure!  I think I understand why, but it was interesting to discover that it was not always faster.

GIS extensions

My beloved husband had been harping on me to use MySQL’s built-in GIS extensions.  I had been ignoring him because a large part of the point of this exercise was to learn more about MySQL, including stored procedures, but now that I had found that the stored procedure was slow, it was time to time the built-in bounding-box intersection routine.  If I stored the bounding box as a POLYGON type instead of as two coordinate pairs, calculating the intersection took half the time.  Woot!
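
For the record, the spatial version looked roughly like this.  MySQL’s MBRIntersects() compares minimum bounding rectangles, which is exactly the bounding-box test above; the column names are again illustrative.

<?php
// Sketch: the bounding box is stored in a POLYGON column (here called bbox),
// and MySQL's MBRIntersects() does the rectangle-overlap test.
$qMinX = -123500000; $qMinY = 49000000; $qMaxX = -123000000; $qMaxY = 49500000;

$queryBox = "GeomFromText('POLYGON((" .
            "$qMinX $qMinY, $qMaxX $qMinY, $qMaxX $qMaxY, $qMinX $qMaxY, $qMinX $qMinY))')";

$sql = "SELECT polygon_id FROM polygons WHERE MBRIntersects(bbox, $queryBox)";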

Rendering

I discovered that despite my having reported the horizontal lines bug fifteen months ago, the gd team hadn’t done anything with it yet.  Needless to say, this means that the version of libgd.a on Dreamhost has the bug in it.  I thought about porting back to C++, but figured that porting back would probably take at minimum a week and would raise the possibility of nasty pointer bugs, so it was worth spending a few days trying to get PHP to use my version of gd.

It is possible to compile your own version of PHP and use it, though it means using the CGI version of PHP instead of mod_php.  I looked around for information on how to do that and found a Dreamhost page on how to do so, but failed utterly when I followed the directions.  I almost gave up at that point, but sent a detailed message to Dreamhost customer support explaining what I was trying to do, why, and what was blocking me.  On US Thanksgiving Day, I got a very thoughtful response back from Robert at Dreamhost customer support which pointed me at a different how-to-compile-PHP-on-Dreamhost page that ultimately proved successful.  (This is part of why I like Dreamhost and don’t really want to change ISPs.)

Compiling unfamiliar packages can be a real pain, and this was no different.  The Dreamhost page (on their user-generated wiki) had a few scripts that would do the download and build for me, but they weren’t the whole story.  Each of the scripts downloaded a number of projects (like OpenSSL, IMAP, cURL, etc.) in compressed form, extracted the files, and built them.  The scripts were somewhat fragile: they would just break if something didn’t work right, and they didn’t always print an error message saying what had broken.  If there was a problem, they started over from the beginning, removing everything that had been downloaded and extracted.  Something small, like the mirror site for mcrypt being so busy that the download timed out, would mean starting from scratch.  (I ended up iteratively commenting out large swaths of the scripts so that I wouldn’t have to redo work.)

There was some problem with the IMAP build having to do with SSL.  I finally changed one of the flags so that IMAP built without SSL — figuring that I was unlikely to be using this instance of PHP to do IMAP, let alone IMAP with SSL — but it took several false starts, each taking quite a long time to go through.

Finally, once I got PHP to build without my custom gd, I tried folding in my gd.  I uploaded my gd/.libs directory, but that wasn’t enough: the build wanted the gd.h file too.  I suppose I could have tried to figure out exactly what it wanted and where it wanted it, but I figured it would be faster to just rebuild gd on my Dreamhost account and do a make install to some local directory.  Uploading my source was fast, and the build was slow but straightforward.  However, I couldn’t figure out how to specify where the install should go.  The makefiles were all autogenerated and very difficult to follow.  I tried to figure out where in configure the install directory got set, but that too was hard to decipher.  Finally, I just hand-edited the default installation directory.  So there.  That worked.  PHP built!

Unfortunately, it wouldn’t run. It turned out that the installation script had a bug in it:

cp ${INSTALLDIR}/bin/php ${HOME}/${DOMAIN}/cgi-bin/php.cgi

instead of

cp ${INSTALLDIR}/bin/php.cgi ${HOME}/${DOMAIN}/cgi-bin/php.cgi

But finally, after all that, success!

Bottom line

So let me review what it took to get to tile rendering on the second pass:

  1. Choose a database and figure out how to extract data from it, requiring reading and learning.
  2. Find and load boundary information into the database, requiring trial and error.
  3. Choose a graphics library and figure out how to draw coloured polygons with it, requiring reading and learning.
  4. Gather test cases for converting from latitude/longitude into Google coordinate system, requiring patience.  A lot of patience.
  5. Figure out how to translate from latitude/longitude pairs into the Google coordinate system, requiring algorithmic skills.
  6. Diagnose and fix a bug in a large-ish C graphics library, requiring C debugging skills.
  7. Download and install PHP and MySQL, requiring system administration skills.
  8. Figure out how to build a custom PHP, requiring understanding of bash scripts and makefiles.

So now, I guess it isn’t that easy to generate tiles!

Note: there is an entirely different ecosystem for generating tiles, one that comes from the mainline GIS world and descends from the ESRI ecosystem.  PostGIS, GeoTools, and uDig look like fine tools, and I expect I could have used them, but they are complex tools with many, many features.  Had I gone that route, I would have had to wade through a lot of documentation for features I didn’t care about.  (I also would have had to figure out which ISP to move to in order to get Postgres.)  I think it would have taken me long enough to learn and install that ecosystem’s tools that it wouldn’t have been worth it for the relatively simple things I needed to do.  Your mileage may vary.

10.20.08

Hire me!

Posted in Canadian life, Eclipse, Hacking, programmer productivity at 3:28 pm by ducky

I am looking for a job. If you know anybody in the Vancouver area who is looking for a really good hire, point them at this blog posting or send them to my resume.

Ideally, I’d like an intermediate (possibly junior, let’s talk) Java software development position, preferably one where I could become an expert in Java-based web applications.  (Java because I think it has the best long-term prospects.  My second choice would be a Ruby position, as it looks like Ruby has long-term staying power as well.)  I believe that I would advance quickly, as I have very broad experience with many languages and environments.

Development is only partially about coding, and I am very, very strong in the secondary skills.  While studying programmer productivity for my MSCS, I uncovered a few unknown or poorly known, broadly applicable, teachable skills that will improve programmer productivity.  Hire me, and not only will you get a good coder, you will also get someone who makes the coders around her better.

I am particularly good at seeing how to solve users’ problems, sometimes problems that they themselves do not see.  I am good at interaction design, at seeing the possibilities in a new technology, and at seeing how to combine the two.  I have won a few awards for web sites I developed.

I also have really good communication skills. I’ve written books, I blog, and I’ve written a number of articles. I even have significant on-camera television experience.

If you want a really low-stress way of checking me out, invite me to give a talk on programmer productivity.  (I like giving the talk, and am happy to give it to basically any group of developers, basically any time, anywhere in the BC Lower Mainland.  Outside of the Lower Mainland, you have to pay me for the travel time.)

Do you know of any developer opportunities that I should check out? Let me know at ducky at webfoot dot com.

09.10.08

Tripoli: Differential code coverage tool

Posted in Eclipse, Hacking, programmer productivity at 6:41 pm by ducky

In the observations of professional programmers that I did for my thesis, I frequently saw them get sidetracked because they didn’t have good information about whether the code they were looking at was executed as part of the feature they were exploring. I theorized that being able to look at the differences in code coverage between two runs — one where they exercised the feature and one where they didn’t — would be useful. Being able to see which methods executed in the first but not the second would hopefully help developers focus in quickly on the methods that were important to their task.
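
At its core, the “diff” is nothing fancier than a set difference over per-run coverage data.  A sketch (in PHP only for consistency with the rest of this blog; Tripoli itself is an Eclipse plug-in and works on Java method coverage):

<?php
// Sketch of the core idea behind differential code coverage: take the methods
// executed in the run that exercised the feature, remove the methods executed
// in the run that didn't, and what's left is probably relevant to the feature.
$featureRun = array(/* method names covered while exercising the feature */);
$controlRun = array(/* method names covered while not exercising it */);

$interesting = array_diff($featureRun, $controlRun);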

In my last month at UBC, I whipped up a prototype, an Eclipse plug-in called Tripoli, and ran a quickie user study. The differential code coverage information really does make a big difference. You can read a detailed account in UBC Technical Report TR-2008-14, but basically graduate students using Tripoli were frequently more successful than professional programmers who didn’t use Tripoli.

As an example, in the Size task, the subject needed to find where a pane was resized.  That was hard.  There is no menu item or keystroke invocation for developers to use as a starting point.  They also didn’t know what the pane was called: was it a canvas, a pane, a panel, a frame, a drawing, or a view?  Between pilot subjects and actual subjects, at least eleven people tried and failed to find a method that was always called when a pane was resized and never called when it wasn’t.  I even spent a few hours trying to figure out where the pane was resized and couldn’t.

As soon as I got Tripoli working, I tried the Size task, and in my first diff, I saw ten methods, including one named endResizingFrame().  And yes, it turned out to be the right spot.  I realized that most of the extra methods showed up because in my second run I hadn’t hovered over the corner of the frame (where the cursor changes into the grab handle), so I reran.  This time I got exactly one method: endResizingFrame().  Wow.  The graduate students in my second study were also all able to find endResizingFrame() by using Tripoli.

Tripoli does take some getting used to. Even as the author, it took me a while before I consistently remembered right away that I could use Tripoli on my problems. I’ve also clearly gotten better over time at figuring out what to do in my second run to ensure that the “diff” has very few methods in it.

If you want to try Tripoli out, I’ve posted it online. Just remember it’s a prototype.
