Browser Security Handbook, part 1
February 4, 2010, 3:32 PM

Basic concepts behind web browsers

This section provides a review of core standards and technologies behind current browsers, and their security-relevant properties. No specific attention is given to features implemented explicitly for security purposes; these are discussed later in the document.

Uniform Resource Locators

All web resources are addressed with the use of uniform resource identifiers. Being able to properly parse the format, and make certain assumptions about the data present therein, is of significance to many server-side security mechanisms.

The abstract syntax for URIs is described in RFC 3986. The document defines a basic hierarchical URI structure, defining a white list of unreserved characters that may appear in URIs as-is as element names, with no particular significance assigned (0-9 A-Z a-z - . _ ~), spelling out reserved characters that have special meanings and can be used in some places only in their desired function (: / ? # [ ] @ ! $ & ' ( ) * + , ; =), and establishing a hexadecimal percent-denoted encoding (%nn) for everything outside these sets (including the stray % character itself).
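The escaping rule described above can be tried out with Python's standard urllib.parse module. This is only a minimal sketch: the helper name escape_component is made up for this example, and passing safe="" is a choice made here so that every reserved character gets escaped, which is stricter than what a browser would do for a full URL.

```python
from urllib.parse import quote, unquote

def escape_component(text):
    # safe="" exempts no reserved character, so only the unreserved set
    # (0-9 A-Z a-z - . _ ~) is left as-is; everything else becomes %nn.
    return quote(text, safe="")

# Unreserved characters pass through untouched.
print(escape_component("AZaz09-._~"))
# Space, reserved delimiters, and the stray % itself are all escaped.
print(escape_component("a b/c?d=e%f"))
# Decoding reverses the transformation.
print(unquote("a%20b%2Fc"))
```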

Some additional mechanisms are laid out in RFC 1738, which defines URI syntax within the scope of HTTP, FTP, NNTP, Gopher, and several other specific protocols. Together, these RFCs define the following syntax for common Internet resources (the compliance with a generic naming strategy is denoted by the // prefix):

scheme://[login[:password]@](host_name|host_address)[:port][/hierarchical/path/to/resource[?search_string][#fragment_id]]

Since the presence of a scheme is the key differentiator between relative references permitted in documents for usability reasons, and fully-qualified URLs, and since : itself has other uses later in the URL, the set of characters permitted for scheme name must be narrow and clearly defined (0-9 A-Z a-z + - .) so that all implementations may make the distinction accurately.
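As a rough illustration of that distinction, the sketch below separates a scheme from the rest of a reference using the character set quoted above. The function name split_scheme is hypothetical, and the requirement that the first character be a letter comes from the RFC 3986 grammar rather than from this text. As the tables further down show, real browsers are more lenient than this.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), terminated by ":"
SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")

def split_scheme(reference):
    """Return (scheme, rest); scheme is None for relative references."""
    match = SCHEME_RE.match(reference)
    if match is None:
        return None, reference
    # Scheme names are case-insensitive, so normalize to lowercase.
    return match.group(1).lower(), reference[match.end():]

print(split_scheme("http://example.com/"))
print(split_scheme("path/to:file"))  # ":" appears after "/", so no scheme
```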

On top of the aforementioned documents, a W3C draft RFC 1630 and a non-HTTP RFC 2368 de facto outline some additional concepts, such as the exact HTTP search string syntax (param1=val1[&param2=val2&...]), or the ability to use the + sign as a shorthand notation for spaces (the character itself does not function in this capacity elsewhere in the URL, which is somewhat counterintuitive).
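The special role of + can be seen by comparing the query string parser in Python's urllib.parse with the generic percent-decoder. The parameter names and values below are invented for the example.

```python
from urllib.parse import parse_qs, unquote, unquote_plus

# In a search string, "+" stands for a space and "%2B" for a literal plus.
params = parse_qs("q=browser+security&tag=c%2B%2B")
print(params)

# Generic percent-decoding leaves "+" alone; the query-specific variant
# converts it to a space.
print(unquote("a+b%20c"))
print(unquote_plus("a+b%20c"))
```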

Although a broad range of reserved characters is defined as delimiters in generic URL syntax, only a subset is given a clear role in HTTP addresses at any point; the function of [, ], !, $, ', (, ), *, ;, or , is not explicitly defined anywhere, but the characters are sometimes used to implement esoteric parameter passing conventions in oddball web application frameworks. The RFC itself sometimes implies that characters with no specific function within the scheme should be treated as regular, non-reserved ASCII characters; and elsewhere, suggests they retain a special meaning - in both cases creating ambiguities.

The standards that specify the overall URL syntax are fairly laid back - for example, they permit IP addresses such as 74.125.19.99 to be written in completely unnecessary and ambiguous ways such as 74.0x7d.023.99 (mixing decimal, octal, and hexadecimal notation) or 74.8196963 (24 bits coalesced). To add insult to injury, browsers deviate from these standards in random ways, for example accepting URLs with technically illegal characters, and then trying to escape them automatically, or passing them as-is to underlying implementations - such as the DNS resolver, which itself then rejects or passes through such queries in a completely OS-specific manner.
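To see why the alternative notations above denote the same address, here is a small sketch that normalizes them using rules in the style of the classic inet_aton() parser. The function names are made up for this example, and the code is an illustration of the notation only, not the exact logic of any particular browser or resolver.

```python
def parse_part(text):
    """Parse one dot-separated part as hex (0x..), octal (0..), or decimal."""
    lowered = text.lower()
    if lowered.startswith("0x"):
        return int(lowered[2:], 16)
    if len(lowered) > 1 and lowered.startswith("0"):
        return int(lowered, 8)
    return int(lowered, 10)

def normalize_ipv4(address):
    """Convert a loosely written IPv4 address to plain dotted decimal."""
    parts = [parse_part(p) for p in address.split(".")]
    if not 1 <= len(parts) <= 4:
        raise ValueError("expected one to four parts")
    *leading, last = parts
    # Every part except the last is a single byte; the last part fills
    # all remaining bytes (so "74.8196963" packs 24 bits into one number).
    remaining_bytes = 4 - len(leading)
    if any(not 0 <= p <= 0xFF for p in leading):
        raise ValueError("leading part out of range")
    if not 0 <= last < 256 ** remaining_bytes:
        raise ValueError("last part out of range")
    value = 0
    for p in leading:
        value = (value << 8) | p
    value = (value << (8 * remaining_bytes)) | last
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

print(normalize_ipv4("74.0x7d.023.99"))
print(normalize_ipv4("74.8196963"))
```

A server-side filter that compares host names as strings would treat these spellings as three different hosts, which is the practical risk of the lax syntax.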

A particularly interesting example of URL parsing inconsistencies is the following pair of URLs. The first one resolves to one host in Firefox, and to a different one in most other browsers; the second one behaves uniquely in Internet Explorer instead:


Below is a more detailed review of the key differences that often need to be accounted for:

Test: Characters ignored in front of URL schemes
    MSIE6:   \x01-\x20
    MSIE7:   \x01-\x20
    MSIE8:   \x01-\x20
    FF2:     \t \r \n \x20
    FF3:     \t \r \n \x20
    Safari:  \x20
    Opera:   \t \r \n \x0B \x0C \xA0
    Chrome:  \x00-\x20
    Android: \x20

Test: Non-standard characters permitted in URL scheme names (excluding 0-9 A-Z a-z + - .)
    MSIE6:   \t \r \n
    MSIE7:   \t \r \n
    MSIE8:   \t \r \n
    FF2:     \t \r \n
    FF3:     \t \r \n
    Safari:  none
    Opera:   \r \n +UTF8
    Chrome:  \0 \t \r \n
    Android: none

Test: Non-standard characters kept as-is, with no escaping, in URL query strings (excluding 0-9 A-Z a-z - . _ ~ : / ? # [ ] @ ! $ & ' ( ) * + , ; =)*
    MSIE6:   " < > \ ^ ` { | } \x7F
    MSIE7:   " < > \ ^ ` { | } \x7F
    MSIE8:   " < > \ ^ ` { | } \x7F
    FF2:     \ ^ { | }
    FF3:     \ ^ { | }
    Safari:  ^ { | }
    Opera:   ^ { | } \x7F
    Chrome:  " \ ^ ` { | }
    Android: n/a

Test: Non-standard characters fully ignored in host names
    MSIE6:   \t \r \n
    MSIE7:   \t \r \n \xAD
    MSIE8:   \t \r \n \xAD
    FF2:     \t \r \n \xAD
    FF3:     \t \r \n \xAD
    Safari:  \xAD
    Opera:   \x0A-\x0D \xA0 \xAD
    Chrome:  \t \r \n \xAD
    Android: none

Test: Types of partial or broken URLs auto-corrected to fully qualified ones
    MSIE6:   //y \\y
    MSIE7:   //y \\y
    MSIE8:   //y \\y
    FF2:     //y x:///y x://[y]
    FF3:     //y x:///y x://[y]
    Safari:  //y \\y x:/y x:///y
    Opera:   //y \\y x://[y]
    Chrome:  //y \\y x:///y
    Android: //y \\y

Test: Is fragment ID (hash) encoded by applying RFC-mandated URL escaping rules?
    MSIE6:   NO
    MSIE7:   NO
    MSIE8:   NO
    FF2:     PARTLY
    FF3:     PARTLY
    Safari:  YES
    Opera:   NO
    Chrome:  NO
    Android: YES

Test: Are non-reserved %nn sequences in URL path decoded in address bar?
    MSIE6:   NO
    MSIE7:   YES
    MSIE8:   YES
    FF2:     NO
    FF3:     YES
    Safari:  NO
    Opera:   YES
    Chrome:  YES
    Android: n/a

Test: Are non-reserved %nn sequences in URL path decoded in location.href?
    MSIE6:   NO
    MSIE7:   YES
    MSIE8:   YES
    FF2:     NO
    FF3:     NO
    Safari:  YES
    Opera:   YES
    Chrome:  YES
    Android: YES

Test: Are non-reserved %nn sequences in URL path decoded in actual HTTP requests sent?
    MSIE6:   NO
    MSIE7:   YES
    MSIE8:   YES
    FF2:     NO
    FF3:     NO
    Safari:  NO
    Opera:   NO
    Chrome:  YES
    Android: NO

Test: Characters rejected in URL login or password (excluding / # ; ? : % @)
    MSIE6:   \x00 \
    MSIE7:   \x00 \
    MSIE8:   \x00 \
    FF2:     none
    FF3:     none
    Safari:  \x00-\x20 " < > [ \ ] ^ ` { | } \x7f-\xff
    Opera:   \x00
    Chrome:  \x00 \x01 \
    Android: \x00-\x20 " < > [ \ ] ^ ` { | } \x7f-\xff

Test: URL authentication data splitting behavior with multiple @ characters
    MSIE6:   leftmost
    MSIE7:   leftmost
    MSIE8:   leftmost
    FF2:     rightmost
    FF3:     rightmost
    Safari:  leftmost
    Opera:   rightmost
    Chrome:  rightmost
    Android: leftmost

* Interestingly, Firefox 3.5 takes a safer but non-RFC-compliant approach of encoding stray ' characters as %27 in URLs, in addition to the usual escaping rules.

NOTE: As an anti-phishing mechanism, additional restrictions on the use of login and password fields in URLs are imposed by many browsers; see the section on HTTP authentication later on.
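The practical effect of the last test (splitting on the leftmost versus the rightmost @) can be sketched as follows. This assumes that "leftmost" and "rightmost" refer to which @ character is taken as the boundary between authentication data and host name; the function name and the host names are invented for the example.

```python
def split_authority(authority, mode):
    """Split 'userinfo@host' on the leftmost or rightmost @ character."""
    if "@" not in authority:
        return "", authority
    if mode == "leftmost":
        userinfo, _, host = authority.partition("@")
    elif mode == "rightmost":
        userinfo, _, host = authority.rpartition("@")
    else:
        raise ValueError("mode must be 'leftmost' or 'rightmost'")
    return userinfo, host

authority = "user@example.com@evil.test"
# The same string yields two different host names depending on the rule.
print(split_authority(authority, "leftmost"))
print(split_authority(authority, "rightmost"))
```

Any server-side check that extracts the host with one rule while the visiting browser uses the other may end up validating a different host than the one actually contacted.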

Please note that when links are embedded within HTML documents, HTML entity decoding takes place before the link is parsed. Because of this, if a tab is ignored in URL schemes, a link such as javascript&#09;:alert(1) may be accepted and executed as JavaScript, just as javascript<TAB>:alert(1) would be. Furthermore, certain characters, such as \x00 in Internet Explorer, or \x08 in Firefox, may be ignored by HTML parsers if used in certain locations, even though they are not treated differently by URL-handling code itself.
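A short sketch of why this matters for filters: a check performed on the raw attribute value misses the scheme, while a check performed after entity decoding and removal of ignored characters finds it. The function name effective_scheme and the choice to strip exactly tab, CR, and LF are assumptions of this example; the table above shows that the real set of ignored characters differs per browser.

```python
import html
import re

SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*(?=:)")

def effective_scheme(attribute_value):
    """Approximate the scheme a lenient browser would see in an HTML link."""
    # Step 1: HTML entity decoding happens before URL parsing.
    decoded = html.unescape(attribute_value)
    # Step 2 (assumption): tab, CR, and LF are dropped, and leading
    # whitespace is skipped, as several browsers in the table do.
    cleaned = "".join(ch for ch in decoded if ch not in "\t\r\n").lstrip()
    match = SCHEME_RE.match(cleaned)
    return match.group(0).lower() if match else None

link = "javascript&#09;:alert(1)"
# A naive check on the raw value does not notice the scheme...
print(link.lower().startswith("javascript:"))
# ...but the decoded and cleaned value clearly carries it.
print(effective_scheme(link))
```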

Unicode in URLs

Much like several related technologies used for web content transport, URLs do not have any particular character set defined by relevant RFCs; RFC 3986 ambiguously states: "In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification." The same dismissive approach is taken in HTTP header specification.

As a result, in the context of web URLs, any high-bit user input may and should be escaped as %nn sequences, but there is no specific guidance provided on how to transcode user input in the system's native code page when talking to other parties that may not share this code page. On a system that uses UTF-8 and receives a URL containing Unicode ą in the path, this corresponds to a sequence of 0xC4 0x85 natively; however, when sent to a server that uses ISO-8859-2, the correct value sent should be 0xB1 (or alternatively, additional information about client-specific encoding should be included to make the conversion possible on the server side). In practice, most browsers deal with this by sending UTF-8 data by default on any text entered in the URL bar by hand, and using page encoding on all followed links.
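The byte values quoted above are easy to confirm. The snippet below is only an illustration; it uses Python's built-in codecs and writes the character as an escape sequence so the example does not depend on the encoding of the source file.

```python
from urllib.parse import quote

char = "\u0105"  # the letter ą from the example above

utf8_bytes = char.encode("utf-8")
latin2_bytes = char.encode("iso-8859-2")
print(utf8_bytes, latin2_bytes)

# The same character produces different %nn sequences depending on
# which code page the sender assumes.
print(quote(char, encoding="utf-8"))
print(quote(char, encoding="iso-8859-2"))
```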

Another limitation of URLs traces back to DNS as such; RFC 1035 permits only characters A-Z a-z 0-9 - in DNS labels, with period (.) used as a delimiter; some resolver implementations further permit underscore (_) to appear in DNS names, violating the standard, but other characters are almost universally banned. With the growth of the web, the need to accommodate non-Latin alphabets in host names was perceived by multiple parties - and the use of %nn encoding was not an option, because % as such was not on the list.

To solve this, RFC 3490 lays out a rather contrived encoding scheme that permits Unicode data to be stored in DNS labels, and RFC 3492 outlines a specific implementation within DNS labels - commonly referred to as Punycode - that follows the notation of xn--[US-ASCII part]-[encoded Unicode data]. Any Punycode-aware browser faced with non-US-ASCII data in a host name is expected to transform it to this notation first, and then perform a traditional DNS lookup for the encoded string.

Putting these two methods together, the following transformation is expected to be made internally by the browser:

http://www.ręczniki.pl/?ręcznik=1 → http://www.xn--rczniki-98a.pl/?r%C4%99cznik=1
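The transformation above can be reproduced with Python's built-in "idna" codec, which implements RFC 3490, together with UTF-8 percent-escaping for the query string. This is a minimal sketch of the two steps, not a description of how any browser implements them; the variable names are made up for the example.

```python
from urllib.parse import quote

host = "www.r\u0119czniki.pl"   # www.ręczniki.pl
param = "r\u0119cznik"          # ręcznik

# Host name: each non-ASCII label is converted to xn-- (Punycode) form.
ascii_host = host.encode("idna").decode("ascii")

# Query string: high-bit characters are sent as %nn-escaped UTF-8 here.
ascii_query = quote(param, safe="") + "=1"

url = "http://" + ascii_host + "/?" + ascii_query
print(url)
```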

Key security-relevant differences in high-bit URL handling are outlined below:

Test: Request URL path encoding when following plain links
    MSIE6:   UTF-8
    MSIE7:   UTF-8
    MSIE8:   UTF-8
    FF2:     page encoding
    FF3:     UTF-8
    Safari:  UTF-8
    Opera:   UTF-8
    Chrome:  UTF-8
    Android: UTF-8

Test: Request URL query string encoding when following plain links
    MSIE6:   page encoding, no escaping
    MSIE7:   page encoding, no escaping
    MSIE8:   page encoding, no escaping
    FF2:     page encoding
    FF3:     page encoding
    Safari:  page encoding
    Opera:   page encoding
    Chrome:  page encoding
    Android: page encoding

Test: Request URL path encoding for XMLHttpRequest calls
    MSIE6:   page encoding
    MSIE7:   page encoding
    MSIE8:   page encoding
    FF2:     page encoding
    FF3:     page encoding
    Safari:  page encoding
    Opera:   page encoding
    Chrome:  page encoding
    Android: page encoding

Test: Request URL query string encoding for XMLHttpRequest calls
    MSIE6:   page encoding, no escaping
    MSIE7:   page encoding, no escaping
    MSIE8:   page encoding, no escaping
    FF2:     page encoding
    FF3:     page encoding
    Safari:  mangled
    Opera:   page encoding
    Chrome:  mangled
    Android: mangled

Test: Request URL path encoding for manually entered URLs
    MSIE6:   UTF-8
    MSIE7:   UTF-8
    MSIE8:   UTF-8
    FF2:     UTF-8
    FF3:     UTF-8
    Safari:  UTF-8
    Opera:   UTF-8
    Chrome:  UTF-8
    Android: UTF-8

Test: Request URL query string encoding for manually entered URLs
    MSIE6:   transcoded to 7-bit
    MSIE7:   transcoded to 7-bit
    MSIE8:   transcoded to 7-bit
    FF2:     UTF-8
    FF3:     UTF-8
    Safari:  UTF-8
    Opera:   stripped to ?
    Chrome:  UTF-8
    Android: UTF-8

Test: Raw Unicode in host names auto-converted to Punycode?
    MSIE6:   NO
    MSIE7:   YES
    MSIE8:   YES
    FF2:     YES
    FF3:     YES
    Safari:  YES
    Opera:   YES
    Chrome:  YES
    Android: NO

Test: Is percent-escaped UTF-8 in host names auto-converted to Punycode?
    MSIE6:   (NO)
    MSIE7:   YES
    MSIE8:   YES
    FF2:     NO
    FF3:     on retry
    Safari:  YES
    Opera:   YES
    Chrome:  YES
    Android: (NO)

Test: URL bar Unicode display method for host names
    MSIE6:   (Punycode)
    MSIE7:   Unicode
    MSIE8:   Unicode
    FF2:     Unicode
    FF3:     Unicode
    Safari:  Unicode
    Opera:   Unicode
    Chrome:  Unicode
    Android: (Punycode)

Test: URL bar Unicode display method outside host names
    MSIE6:   Unicode
    MSIE7:   Unicode
    MSIE8:   Unicode
    FF2:     %nn
    FF3:     Unicode
    Safari:  Unicode
    Opera:   as ?
    Chrome:  Unicode
    Android: n/a

NOTE 1: Firefox generally uses UTF-8 to encode most URLs, but will send characters supported by ISO-8859-1 using that codepage.

NOTE 2: As an anti-phishing mechanism, additional restrictions on the use of some or all Unicode characters in certain top-level domains are imposed by many browsers; see the chapter on domain name restrictions later on.

True URL schemes

URL schemes are used to indicate what general protocol needs to be used to retrieve the data, and how the information needs to be processed for display. Schemes may also be tied to specific default values of some URL fields - such as a TCP port - or specific non-standard URL syntax parsing rules (the latter is usually denoted by the absence of a // string after scheme identifier).

Back in 1994, RFC 1738 laid out several URL schemes in the context of web browsing, some of which are not handled natively by browsers, or have fallen into disuse. As the web matured, multiple new open and proprietary schemes appeared with little or no consistency, and the set of protocols supported by each browser began to diverge. RFC 2718 attempted to specify some ground rules for the creation of new protocols, and RFC 4395 mandated that new schemes be registered with IANA (which publishes the list of registered schemes). In practice, however, few parties could be bothered to follow this route.

To broadly survey the capabilities of modern browsers, it makes sense to divide URL schemes into a group natively supported by the software, and another group supported by a plethora of plugins, extensions, or third-party programs. The first list is relatively short:

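Scheme: file (RFC 1738)
    MSIE6:   YES
    MSIE7:   YES
    MSIE8:   YES
    FF2:     YES (local)
    FF3:     YES (local)
    Safari:  YES (local)
    Opera:   YES
    Chrome:  YES
    Android: NO

Scheme: Gopher (RFC 4266)
    MSIE6:   NO
    MSIE7:   NO
    MSIE8:   NO
    FF2:     YES
    FF3:     YES
    Safari:  NO
    Opera:   defunct?
    Chrome:  NO
    Android: NO

Scheme: news (draft RFC)
    MSIE6:   NO
    MSIE7:   NO
    MSIE8:   NO
    FF2:     NO
    FF3:     NO
    Safari:  NO
    Opera:   YES
    Chrome:  NO
    Android: NO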

These protocols may be used to deliver natively rendered content that is interpreted and executed using the security rules implemented within the browsers.

The other list, third-party protocols routed to other applications, depends quite heavily on system configuration. The set of protocols and their handlers is usually maintained in a separate system-wide registry. Browsers may whitelist some of them by default (executing external programs without prompting), or blacklist some (preventing their use altogether); the default action is often to display a mildly confusing prompt. The most common protocols in this family include:

acrobat                              - Acrobat Reader
[...]                                - some instant messengers, IP phones
[...] / itpc / itms / ...            - Apple iTunes
FirefoxURL                           - Mozilla Firefox
[...]                                - Microsoft Windows Help subsystem
[...]                                - address book functionality
[...]                                - various mail agents
[...] / mmsu / msbd / rtsp / ...     - streaming media of all sorts
[...]-offdap                         - Microsoft Office
[...] / snews / nntp                 - various news clients
[...] / stssync                      - Microsoft Outlook
[...] / telnet / tn3270              - telnet client
[...]                                - Windows Explorer
[...]                                - various IP phone software

New handlers might be registered in OS- and application-specific ways, for example by registering new HKCR\Protocols\Handlers keys in the Windows registry, or adding network.protocol-handler.* settings in Firefox.

As a rule, the latter set of protocols is not honored within the renderer when referencing document elements such as images, script or applet sources, and so forth; they do work, however, as <IFRAME> and link targets, and will launch a separate program as needed. These programs may sometimes integrate seamlessly with the renderer, but the rules by which they render content and impose security restrictions are generally unrelated to the browser as such.

Historically, the ability to plant such third-party schemes in links proved to be a significant exploitation vector, as poorly written, vulnerable external programs often register protocol handlers without the user's knowledge or consent; as such, it is prudent to reject unexpected schemes where possible, and exercise caution when introducing new ones on the client side (including observing safe parameter passing conventions).
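One way to act on this advice on the server side is an explicit scheme allowlist applied to every user-supplied link. The sketch below is a minimal example under stated assumptions: the allowed set (http and https only), the function name, and the decision to reject relative references are all policy choices made for this illustration, and the parsing is delegated to Python's urllib.parse rather than to browser-accurate logic.

```python
from urllib.parse import urlsplit

# Hypothetical policy: only plain web links are acceptable.
ALLOWED_SCHEMES = {"http", "https"}

def is_acceptable_link(url):
    """Accept a link only if it carries an explicitly allowed scheme."""
    try:
        scheme = urlsplit(url.strip()).scheme.lower()
    except ValueError:
        # Malformed input (for example, unbalanced brackets) is rejected.
        return False
    # Relative references have an empty scheme and are rejected as well.
    return scheme in ALLOWED_SCHEMES

print(is_acceptable_link("https://example.com/"))
print(is_acceptable_link("FirefoxURL://example.com/"))
print(is_acceptable_link("mailto:someone@example.com"))
```

An allowlist is preferable to a blocklist here because the set of handlers registered on a client machine is open-ended and unknown to the server.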

Added by: b1zz4rd