About Me

My photo

Father, hacker, partner, feminist, atheist, socialist, SJW. Ex-Russian, Canadian, Québécois par adoption; universal basic income NDP-er (and I vote!); electric-car driving pansy; lapsed artist and photographer.

Abusive comments will be deleted, and abusive posters banned without warning.

Wednesday, March 05, 2008

Fuzzy path matching

One of the gripes we often hear in our office is that we enforce case-sensitivity in URLs. In other words, "/About/" != "/about/" != "/ABOUT/". Unix professionals would just shrug to that and say "that's just the way things work," but to the average user (especially to those few who never turn off their capslock button), that seems like an arbitrary rule. If you think about it long enough, you can sort of see things from their perspective -- after all, if both www.mcgill.ca and WWW.MCGILL.CA go to the same place, why is that not true for paths as well?

Now, I'm generally very opposed to the idea that we should quietly correct people's mistakes and pretend they did nothing wrong -- that was the mantra of Internet Explorer and I don't need to remind you where that got us. Since we already use custom 404 pages on mcgill.ca, my first idea was to simply check whether a lowercase version of the path existed, and then offer the user a link to the correct page.

Then, I got to thinking -- what if we could go a bit further than that? There's a number of cases where 404 errors are not the fault of the client -- for example, when pages get moved within the same site or between sites. Even though we try to provide redirects when large changes are made, doing so every time a page is renamed is just silly and will hopelessly clutter our redirects table.

Overall, I wanted to catch the following cases (all real-life examples):
  • someone makes a small phonetic misspelling, e.g. /mcdonald/ vs /macdonald/
  • a page is moved to a different site, e.g. /students/ask/health/ to /studenthealth/ask/health/
  • a page is moved somewhere within the same site, e.g. /macdonald/about/shuttle/ to /macdonald/contact/shuttle/
Clearly, the solution is some kind of "fuzzy" matching, and thankfully postgresql-contrib comes with a fuzzy-matching library containing a number of useful functions, among them soundex, levenshtein, and a number of metaphone implementations. Installing these functions is easy -- just read the README file that comes with postgresql-contrib and run one simple command.

Now, the easiest solution is to use levenshtein() to calculate the "difference" between the path that the user provided and the paths that we do have on record, and return all results that are less than, say, 4:
# SELECT path FROM page WHERE levenshtein('/mcdonald/', path) < 4;
path
-------------
/macdonald/
(1 row)
Easy, but quite heavy -- we have over 20,000 records in the page table, so the cost of this operation makes such searches very impractical. Moreover, this wouldn't cover the other two cases where I wanted to also match against subpaths -- I could rig something up with regexes, but that would end up being even heavier.

Next, I concentrated on soundex() and metaphone() functions. They are fairly similar -- given a text string, they will return a reduced phonetic representation of it, for example:
# SELECT soundex('/mcdonald/'), soundex('/macdonald/');
soundex | soundex
---------+---------
M235 | M235
(1 row)
Or:
# SELECT dmetaphone('/mcdonald/'), dmetaphone('/macdonald/');
dmetaphone | dmetaphone
------------+------------
MKTN | MKTN
(1 row)
The exception here is metaphone() which requires an additional parameter, which lets you set the maximum length of the resulting string. E.g.:
# SELECT metaphone('Elvis has left the building', 12);
metaphone
--------------
ELFSHSLFT0BL
(1 row)
At first, I tried using metaphone() to generate a phonetic representation of the complete path, capped at 50 characters and store it in a separate column called "fuzzy_path." E.g.:
# SELECT metaphone('/students/ask/health/', 50) AS fuzzy_path;
fuzzy_path
-------------
STTNTSSKHL0
(1 row)
However, it proved to be not quite suitable for the job. Because I wanted to also catch the cases when a page got moved to a slightly different location. In other words, for "/students/ask/health/," I wanted to also look for:
  • %/ask/health/
  • /students/%/health/
or their close equivalent. The trouble is, metaphone('/ask/health/', 50) will not actually be a substring of metaphone('/students/ask/health/', 50):
# SELECT metaphone('/students/ask/health/', 50), metaphone('/ask/health/', 50);
metaphone | metaphone
-------------+-----------
STTNTSSKHL0 | ASKHL0
(1 row)
Because metaphone is geared towards phonetics, the output will differ depending on where in the "sentence" the letters occur. Moreover, paths may not necessarily be phonetic -- trying to run metaphone functions on something like "/wwa/" (stands for "who we are") returns an empty string:
# SELECT '/' || dmetaphone('wwa') || '/';
?column?
----------
//
(1 row)
After trying several possible solutions, I ended up writing a small function in plsql that returned a soundex version of each of the path "chunks," like so:
# SELECT get_fuzzy_path('/collaboration/wwa/');
get_fuzzy_path
----------------
/C416/W000/
(1 row)
Then I wrote a trigger that auto-populates the "fuzzy_path" column of the page table whenever a new row is inserted, or whenever the "path" column of an existing record is modified, and so far I am quite happy with the solution. Because I pre-calculate the fuzzy paths and then use simple LIKE comparisons against indexed values, finding what I'm looking for is very quick and light on the server.

The end-result: happy clients and a sane solution. What's not to like? :)


EDIT: I had to disable the check for */lasttwo/chunks/, as that was causing full seq scans on the fuzzy_path column. I should have remembered that LIKE matches are tricky with indexes, even when they are defined as varchar_pattern_ops. So, for now I'm only doing fuzzy matching on misspellings and pages within the same top-level dir.

2 comments:

joshuadf said...

Very nice. When I worked at a Major US Medical Library we had a somewhat similar solution but used a big flat file (RewriteMap I think) which was regularly updated by someone watching search logs and 404s. Clearly your solution has the advantage of not requiring human intervention, though it is tied to PostgreSQL which was not our content DB at the time.

gtermars said...

How about using a PostgreSQL array type for your fuzzy_match column? Then, you could populate the array elements with the path components.

I -think- that'd help by allowing you to use an index on the column instead of doing LIKE searches (which IIRC don't use indexes, but that could just be an "older Pg version" thing).