Now, I'm generally very opposed to the idea that we should quietly correct people's mistakes and pretend they did nothing wrong -- that was the mantra of Internet Explorer and I don't need to remind you where that got us. Since we already use custom 404 pages on mcgill.ca, my first idea was to simply check whether a lowercase version of the path existed, and then offer the user a link to the correct page.
Then, I got to thinking -- what if we could go a bit further than that? There's a number of cases where 404 errors are not the fault of the client -- for example, when pages get moved within the same site or between sites. Even though we try to provide redirects when large changes are made, doing so every time a page is renamed is just silly and will hopelessly clutter our redirects table.
Overall, I wanted to catch the following cases (all real-life examples):
- someone makes a small phonetic misspelling, e.g. /mcdonald/ vs /macdonald/
- a page is moved to a different site, e.g. /students/ask/health/ to /studenthealth/ask/health/
- a page is moved somewhere within the same site, e.g. /macdonald/about/shuttle/ to /macdonald/contact/shuttle/
Now, the easiest solution is to use levenshtein() to calculate the "difference" between the path that the user provided and the paths that we do have on record, and return all results that are less than, say, 4:
Easy, but quite heavy -- we have over 20,000 records in the page table, so the cost of this operation makes such searches very impractical. Moreover, this wouldn't cover the other two cases where I wanted to also match against subpaths -- I could rig something up with regexes, but that would end up being even heavier.
# SELECT path FROM page WHERE levenshtein('/mcdonald/', path) < 4;
Next, I concentrated on soundex() and metaphone() functions. They are fairly similar -- given a text string, they will return a reduced phonetic representation of it, for example:
# SELECT soundex('/mcdonald/'), soundex('/macdonald/');
soundex | soundex
M235 | M235
The exception here is metaphone() which requires an additional parameter, which lets you set the maximum length of the resulting string. E.g.:
# SELECT dmetaphone('/mcdonald/'), dmetaphone('/macdonald/');
dmetaphone | dmetaphone
MKTN | MKTN
At first, I tried using metaphone() to generate a phonetic representation of the complete path, capped at 50 characters and store it in a separate column called "fuzzy_path." E.g.:
# SELECT metaphone('Elvis has left the building', 12);
However, it proved to be not quite suitable for the job. Because I wanted to also catch the cases when a page got moved to a slightly different location. In other words, for "/students/ask/health/," I wanted to also look for:
# SELECT metaphone('/students/ask/health/', 50) AS fuzzy_path;
Because metaphone is geared towards phonetics, the output will differ depending on where in the "sentence" the letters occur. Moreover, paths may not necessarily be phonetic -- trying to run metaphone functions on something like "/wwa/" (stands for "who we are") returns an empty string:
# SELECT metaphone('/students/ask/health/', 50), metaphone('/ask/health/', 50);
metaphone | metaphone
STTNTSSKHL0 | ASKHL0
After trying several possible solutions, I ended up writing a small function in plsql that returned a soundex version of each of the path "chunks," like so:
# SELECT '/' || dmetaphone('wwa') || '/';
Then I wrote a trigger that auto-populates the "fuzzy_path" column of the page table whenever a new row is inserted, or whenever the "path" column of an existing record is modified, and so far I am quite happy with the solution. Because I pre-calculate the fuzzy paths and then use simple LIKE comparisons against indexed values, finding what I'm looking for is very quick and light on the server.
# SELECT get_fuzzy_path('/collaboration/wwa/');
The end-result: happy clients and a sane solution. What's not to like? :)
EDIT: I had to disable the check for */lasttwo/chunks/, as that was causing full seq scans on the fuzzy_path column. I should have remembered that LIKE matches are tricky with indexes, even when they are defined as varchar_pattern_ops. So, for now I'm only doing fuzzy matching on misspellings and pages within the same top-level dir.