URI Comparison Functions
While investigating URI parsing issues in various products, I’ve run across many instances of code that tries, and fails, to compare two URIs for equality. In some cases the author writes a custom comparison and seems to be unaware of URI semantics; in other cases the author delegates to a Windows-provided function that doesn’t quite work for the scenario. In this blog post I’ll describe some of the unmanaged URI comparison functions available to Win32 developers, and a few common mistakes to avoid.
The latest URI RFC, RFC 3986, does an excellent job of describing a ladder of URI comparisons. Each rung on the ladder trades comparison speed against the number of false negatives. A false negative in this case means that the URI comparison function says two URIs are not equivalent when they are. However, nowhere on the ladder will a comparison generate a false positive. That is, a URI comparison function should never incorrectly report that two URIs are equivalent.
IUri::IsEqual
The IUri::IsEqual method is the comparison method provided by the IUri family of APIs. IUri::IsEqual can be very fast because it’s based on state parsed out of the string URI when the IUri object is created. This comparison method is semantically equivalent to taking two URIs, performing the canonicalization steps described in the CreateUri documentation, and comparing the results character by character. Knowledge of common schemes such as http and ftp is built in, so in the URI RFC’s terminology this is a Scheme-Based Normalization equality comparison. IUri and the associated APIs are available on systems with IE7, which includes all Vista systems. If this method is available and you don’t need a comparison that takes protocol-specific information into account, then this is the preferred method of URI equality comparison.
IMoniker::IsEqual
You can use CreateURLMonikerEx to create an IMoniker object that represents your URI and use IMoniker::IsEqual to compare it with another such IMoniker. The comparison used here is a case-sensitive string comparison of the display strings of the IMonikers. These display strings are a normalized form of the URIs passed into CreateURLMonikerEx, so the comparison is a Scheme-Based Normalization equality comparison similar to IUri::IsEqual. The difference between the two is that the UrlMoniker implementation of IMoniker::IsEqual may not perform all of the normalizations that IUri::IsEqual does, including percent-encoding normalization. CreateURLMonikerEx and IMoniker::IsEqual have been available since Windows 95, so they are an acceptable alternative to IUri::IsEqual if the IUri APIs are not available to you. If you do use CreateURLMonikerEx, be sure to pass the correct flags to avoid creating legacy file URIs.
String Comparison
At one extreme of the URI comparison ladder is the simple but trusty string comparison. A function such as wcscmp can tell you whether two URI strings are identical. If IUri::IsEqual is unavailable, or you use a URI normalization function that is specific to your own URI scheme, you can build a URI comparison function around any normalization function. Simply apply your favorite URI normalization function to the two string URIs and then use a case-sensitive string comparison on the results.
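As a rough illustration of wrapping a comparison around a normalization function, here is a minimal sketch in portable C++. The names NormalizeSketch and UrisEqual are made up for this example, and the normalizer is deliberately incomplete: it assumes absolute URIs, and it only lowercases the scheme and uppercases the hex digits of percent-encoded octets. A real normalizer would do considerably more.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical, deliberately incomplete normalizer (illustration only).
// Assumes an absolute URI, so the text before the first ':' is the scheme.
std::string NormalizeSketch(const std::string& uri) {
    std::string out = uri;

    // Schemes are case insensitive, so lowercase the scheme.
    const std::size_t colon = out.find(':');
    if (colon != std::string::npos) {
        for (std::size_t i = 0; i < colon; ++i) {
            out[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(out[i])));
        }
    }

    // Hex digits in percent-encoded octets are case insensitive, so uppercase them.
    for (std::size_t i = 0; i + 2 < out.size(); ++i) {
        if (out[i] == '%' &&
            std::isxdigit(static_cast<unsigned char>(out[i + 1])) &&
            std::isxdigit(static_cast<unsigned char>(out[i + 2]))) {
            out[i + 1] = static_cast<char>(std::toupper(static_cast<unsigned char>(out[i + 1])));
            out[i + 2] = static_cast<char>(std::toupper(static_cast<unsigned char>(out[i + 2])));
            i += 2;
        }
    }
    return out;
}

using Normalizer = std::function<std::string(const std::string&)>;

// Normalize both inputs, then compare the results case sensitively.
bool UrisEqual(const std::string& a, const std::string& b, const Normalizer& normalize) {
    return normalize(a) == normalize(b);
}
```

With this sketch, UrisEqual("HTTP://example.com/a%2fb", "http://example.com/a%2Fb", NormalizeSketch) reports a match, while two URIs whose paths differ only in case still compare as different.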
URIs Are Case Sensitive
There’s not much more to say on this topic that the URI RFC hasn’t already said, except to warn against using case-insensitive string comparisons. Only the scheme, the hostname, and the hex digits of percent-encoded octets are case insensitive, so a case-insensitive string comparison is not appropriate in general and will generate false positives when comparing URIs. Note that this differs from Windows file paths, which are case insensitive throughout.
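A quick way to see the false positive is to compare two URIs whose paths differ only in case. The sketch below is portable C++ with made-up function names and example.com URIs chosen purely for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Case-insensitive comparison: the approach warned against above.
bool EqualIgnoringCase(const std::string& a, const std::string& b) {
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(), [](char x, char y) {
               return std::tolower(static_cast<unsigned char>(x)) ==
                      std::tolower(static_cast<unsigned char>(y));
           });
}

// Case-sensitive comparison: may miss some equivalent URIs (false negatives),
// but two strings it reports as equal really are the same URI.
bool EqualCaseSensitive(const std::string& a, const std::string& b) {
    return a == b;
}
```

For "http://example.com/Report.PDF" and "http://example.com/report.pdf", EqualIgnoringCase reports a match even though the paths may name different resources, while EqualCaseSensitive correctly treats them as different.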
UrlCompare Issues
The function UrlCompare is deceptively named in that it sounds like it compares two URIs for equality. Unfortunately it has a couple of significant issues that produce false positives, so you should avoid using it when possible, or at least be aware of and compensate for the cases in which it can generate an incorrect result.
Percent-Encoding Makes a Difference
The function takes two input URIs, decodes all percent-encoded octets, and compares the results character by character. This is inappropriate because you cannot necessarily decode arbitrary percent-encoded octets in a URI and get an equivalent URI as a result. See the URI RFC’s section on when to encode or decode (RFC 3986, section 2.4) for more information. For example, the following two non-equivalent URIs would be declared equivalent by UrlCompare:
https://example.com%2Fwww.contoso.com/
https://example.com/www.contoso.com/
Even though the host of the first URI is a sub-domain of contoso.com and the host of the second URI is example.com, they are declared to be equivalent. A worse consequence of the same issue is that, because the percent-encoded sequence %00 is decoded to a null terminator, anything following a %00 is ignored by the comparison. For example, the following two non-equivalent URIs would be declared equivalent:
https://%00.example.com/
https://%00www.contoso.com/downloads/details.aspx?foo=baz
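To make the failure mode concrete, here is a small portable C++ model of the behavior described in this section. It is not UrlCompare’s actual implementation, and the function names are invented for this sketch; it simply decodes every percent-encoded octet and then compares the results as null-terminated strings.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

// Decodes every %XX sequence, including octets that must stay encoded
// for the URI to keep its meaning (such as %2F or %00).
std::string DecodeAllPercentEncoding(const std::string& uri) {
    std::string out;
    for (std::size_t i = 0; i < uri.size(); ++i) {
        if (uri[i] == '%' && i + 2 < uri.size() &&
            std::isxdigit(static_cast<unsigned char>(uri[i + 1])) &&
            std::isxdigit(static_cast<unsigned char>(uri[i + 2]))) {
            out.push_back(static_cast<char>(std::stoi(uri.substr(i + 1, 2), nullptr, 16)));
            i += 2;
        } else {
            out.push_back(uri[i]);
        }
    }
    return out;
}

// Model of the flawed comparison: decode everything, then compare as
// C strings, so anything after a decoded %00 is silently ignored.
bool FlawedDecodeThenCompare(const std::string& a, const std::string& b) {
    const std::string decodedA = DecodeAllPercentEncoding(a);
    const std::string decodedB = DecodeAllPercentEncoding(b);
    return std::strcmp(decodedA.c_str(), decodedB.c_str()) == 0;
}
```

Under this model both pairs of URIs shown above compare as equal: the first because %2F turns into a path separator, the second because both strings are cut off at the decoded null character.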
Trailing Slashes Are Important Too
The second issue concerns the function’s fIgnoreSlash parameter which, when set to TRUE, tells UrlCompare to ignore any trailing ‘/’ characters on either of the input URIs. This is not appropriate for general use because in general URI comparison trailing slashes cannot be ignored. For Windows file paths that refer to non-root directories you can generally remove trailing backslashes without worrying about changing the path’s semantics, because a file and a directory in the same path cannot have the same name. Accordingly there’s no ambiguity between “C:\Users” and “C:\Users\”. This is not the case for URIs. Two URIs that are equal except for a trailing slash on the path may resolve to completely different resources. I should also point out that UrlCompare ignores the slash that literally trails at the end of the URI string, not a slash at the end of the path component of the URI.
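The distinction between the end of the URI string and the end of the path component is easy to miss, so here is a tiny portable C++ model of it. This is a sketch with invented names, not the real UrlCompare code, and it assumes that exactly one trailing ‘/’ is dropped from each input.

```cpp
#include <string>

// Removes one '/' from the very end of the string, if present.
// Assumption for this sketch: only a single trailing slash is dropped.
std::string StripFinalSlash(std::string uri) {
    if (!uri.empty() && uri.back() == '/') {
        uri.pop_back();
    }
    return uri;
}

// Model of a comparison that ignores a slash at the end of the URI string.
bool EqualIgnoringFinalSlash(const std::string& a, const std::string& b) {
    return StripFinalSlash(a) == StripFinalSlash(b);
}
```

In this model "http://example.com/docs/" and "http://example.com/docs" compare as equal, although they may identify different resources. Meanwhile "http://example.com/docs/?page=1" and "http://example.com/docs?page=1" still compare as different, because the slash there ends the path but not the string.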
Do not depend on UrlCompare to correctly say whether two URIs are equivalent. Relying on UrlCompare for general URI comparison could result in security issues.
CoInternetCompareUrl Issues
The function CoInternetCompareUrl delegates its comparison to an interface registered for the URI scheme but unfortunately in some cases CoInternetCompareUrl has the same issues as UrlCompare.
CoInternetCompareUrl delegates to the IInternetProtocolInfo::CompareUrl method of the pluggable protocol registered for the URI scheme of CoInternetCompareUrl’s first parameter. This means that the comparison is as good as the pluggable protocol’s implementer made it. A CompareUrl could be tied to the protocol’s caching implementation and report two URIs as being equal if it knows that the same content will be delivered for two different URIs. On the other hand, a CompareUrl could report that two character for character equal URIs are not equal. That’s a hypothetical example but it illustrates that CompareUrl and consequently CoInternetCompareUrl don’t necessarily follow URI comparison rules.
As noted in the documentation for CompareUrl, CompareUrl may return INET_E_DEFAULT_ACTION to let CoInternetCompareUrl take care of the comparison in a generic fashion. Sadly, the method of comparison used in this case is exactly the same as UrlCompare.
Accordingly, for the same reasons you shouldn’t use UrlCompare, if you must use the CompareUrl methods defined by pluggable protocol handlers you should call them directly rather than relying on CoInternetCompareUrl. But in general, if you don’t care about pluggable protocol handlers, avoid both CoInternetCompareUrl and IInternetProtocolInfo::CompareUrl.
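The delegation described in this section can be modeled in a few lines of portable C++. Everything here is hypothetical scaffolding for illustration: the HandlerTable stands in for the set of registered pluggable protocols, CompareResult::DefaultAction stands in for INET_E_DEFAULT_ACTION, and falling back to the generic comparison when no handler is found is an assumption of this sketch.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// What a handler can answer; DefaultAction stands in for INET_E_DEFAULT_ACTION.
enum class CompareResult { Equal, NotEqual, DefaultAction };

using SchemeComparer = std::function<CompareResult(const std::string&, const std::string&)>;
using GenericComparer = std::function<bool(const std::string&, const std::string&)>;

// Hypothetical stand-in for the registered pluggable protocol handlers.
using HandlerTable = std::map<std::string, SchemeComparer>;

// Returns the text before the first ':', or an empty string if there is none.
std::string SchemeOf(const std::string& uri) {
    const std::size_t colon = uri.find(':');
    return colon == std::string::npos ? std::string() : uri.substr(0, colon);
}

// Model of the delegation: ask the handler for the FIRST URI's scheme, and
// use the generic comparison only if it declines (or, by assumption, if no
// handler is registered).
bool DelegatingCompare(const HandlerTable& handlers, const std::string& a,
                       const std::string& b, const GenericComparer& generic) {
    const auto it = handlers.find(SchemeOf(a));
    if (it != handlers.end()) {
        const CompareResult result = it->second(a, b);
        if (result != CompareResult::DefaultAction) {
            return result == CompareResult::Equal;
        }
    }
    return generic(a, b);
}
```

The point of the model is that the answer is only as trustworthy as the handler: a handler that always returns NotEqual makes even identical strings compare as different, and a handler that returns DefaultAction hands the decision to whatever generic comparison sits underneath.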
Conclusion
To summarize, IUri::IsEqual is a good Scheme-Based Normalization URI comparison function, UrlCompare and CoInternetCompareUrl should be avoided for fear of security bugs, and with no better choices a simple case sensitive string comparison will suffice.
If you know of other URI comparison functions or have other related comments or questions please let us know!
Dave Risney
Software Design Engineer
Comments
Anonymous
October 24, 2007
I think it's silly for you to request feedback on this blog, since you get tons of feedback about the issues people really care and want to know about every time you post something new, and for the most part you just flat ignore it.
Anonymous
October 24, 2007
Thanks for the update on IE8! Glad to see all those things are being fixed! Oh, and those new features are really cool! ZZZZZzzzzzzzzzzzzzz...... Back to dreamland I guess!
Anonymous
October 24, 2007
But how do I use these in IE in my JavaScript application? Thanks!
Anonymous
October 24, 2007
Just keep your promise (http://www.crn.com/it-channel/183701230)
Anonymous
October 25, 2007
There is also the open-source Google URL Parsing and Canonicalization library: http://code.google.com/p/google-url/ It sounds like this library is more thorough than the IURI approach. It will handle IDN, escaping, unescaping, etc. and tries to be compatible with IE. It is on the conservative side, for example, it doesn't normalize the case of percent-escaped characters that should remain escaped, although this particular behavior may be changed.
Anonymous
October 25, 2007
@Robbert Broersma, these are all unmanaged win32 Windows functions and aren't available in JavaScript. I don't know of any URI APIs available via JavaScript running in IE. You will have to use a string comparison or look for some sort of JavaScript library that does this.
Anonymous
October 25, 2007
> Is anyone here making use of the :active pseudo-class? I've found out that it works great as an alternative to :focus
The main benefit of :focus is that it's activated when you tab-select an element, :active is not an alternative/workaround for :focus.
Anonymous
October 26, 2007
This is regarding the IE developer toolbar. Please change the shortcut combination for the ruler to something other than Shift+R, because many time when we need to type capital R, the ruler comes up. I downloaded the final version and still this issue is not fixed.
Anonymous
October 26, 2007
@anonymuos: Thanks for the feedback, and sorry for the inconvenience. We've got a bug on this.
Anonymous
October 26, 2007
Regarding: http://www.jabcreations.net/ The goggles, they do nothing!
Anonymous
October 26, 2007
@EricLaw [MSFT] Glad to see this is being tracked. Is it being tracked in the Web Developer Toolbar DB or in the generic IE DB? Either way, can you post a URL to it so that we can vote on it, and track it too. I'd also like to enter some issue for the various focus stealing bugs in the tool but I don't know where the bug tracking page is. thanks
Anonymous
October 28, 2007
hi ie dev team! make sure to include lot's of unusable filters to replace todays W3C standards in IE8 too, so you just prove once again the ego of microsoft!
Anonymous
October 28, 2007
This article is really interesting. Thanks for the information.
Anonymous
October 28, 2007
When is the Title of this post going to be the title of the Blog? We've been waiting patiently, and impatiently for news on this for a year. I think rc is right. You guys have stopped development AGAIN, and won't start up again until you start to feel your browser market share slide under 50% again. Its a shame. IE has such potential to become a decent browser. So sad.
Anonymous
October 29, 2007
Hi Dave, Thanks for the summary of functions and details regarding this issue. Much appreciated.
Anonymous
October 29, 2007
@Alan, Schemes are described as case-insensitive in RFC 3986 section 3.1 <http://tools.ietf.org/html/rfc3986#section-3.1>: "Although schemes are case-insensitive, the canonical form is lowercase and documents that specify schemes must do so with lowercase letters."
And percent-encoded octets in section 2.1 <http://tools.ietf.org/html/rfc3986#section-2.1>: "The uppercase hexadecimal digits 'A' through 'F' are equivalent to the lowercase digits 'a' through 'f', respectively. If two URIs differ only in the case of hexadecimal digits used in percent-encoded octets, they are equivalent. For consistency, URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings."
If you'd like an accurate URI comparison you should use a case sensitive string comparison over insensitive. In some specific cases it may make sense to use a case insensitive comparison but those would be exceptions to the rules defined by the URI RFC and if you're breaking those rules you should be aware of how that comparison is used, what URIs will be incorrectly determined to be equivalent, and what bugs could come from this. That is, if you're not sure what you should use, in general, a case sensitive string comparison is the safer choice.
Anonymous
October 29, 2007
@@howard: IE marketshare hasn't been below 50% in more than 8 years or so, and they're waaaay above that number now: http://en.wikipedia.org/wiki/Image:Layout_engine_usage_share.svg
Anonymous
October 29, 2007
Here's a good JavaScript URL parser I've used: http://blog.stevenlevithan.com/archives/parseuri I think Dojo has one too.
Anonymous
October 30, 2007
@Jon I'm afraid you can't view your link in IE. Microsoft browsers don't support SVG. Bruce
Anonymous
October 30, 2007
@Bruce What are you talking about? I can view it fine and I'm in IE6. After bringing it up in Firefox it doesn't look much different, either.
Anonymous
October 30, 2007
@huh Wikipedia shows a pixelized PNG version of the actual SVG file when it detects IE. Click on the link (either the text below the image or the image itself) to go to the actual SVG file and IE will ask "What do you want to do with this?..." Opera, Firefox, Safari all show the SVG layout. If you see the image in IE then you have a plug-in from Adobe installed.
Anonymous
November 06, 2007
I'm afraid you can't view your link in IE.
Anonymous
November 23, 2007
Having a path to boundless authorities concerned with this is incomparable.
Anonymous
January 01, 2008
This post is directly related to some work I'm going to be doing so I was happy to stumble across...