Browser Security Handbook, part 1

Written and maintained by Michal Zalewski <lcamtuf@google.com>.
Copyright 2008, 2009 Google Inc, rights reserved.
Released under terms and conditions of the CC-3.0-BY license.

Basic concepts behind web browsers

Uniform Resource Locators

Unicode in URLs

HTML entity encoding

Javascript character encoding

CSS character encoding

Basic concepts behind web browsers

This section provides a review of core standards and technologies behind current browsers, and their security-relevant properties. No specific attention is given to features implemented explicitly for security purposes; these are discussed later in the document.

Uniform Resource Locators

All web resources are addressed with the use of uniform resource identifiers. Being able to properly parse the format, and make certain assumptions about the data present therein, is of significance to many server-side security mechanisms.

The abstract syntax for URIs is described in RFC 3986. The document defines a basic hierarchical URI structure, defining a white list of unreserved characters that may appear in URIs as-is as element names, with no particular significance assigned (0-9 A-Z a-z - . _ ~), spelling out reserved characters that have special meanings and can be used in some places only in their desired function (: / ? # [ ] @ ! $ & ' ( ) * + , ; =), and establishing a hexadecimal percent-denoted encoding (%nn) for everything outside these sets (including the stray % character itself).
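As a rough illustration of the escaping rule, the sketch below uses Python's standard urllib.parse.quote to percent-encode everything outside the unreserved set. The helper name escape_component is an assumption made for this example, not something defined by the RFC.

```python
from urllib.parse import quote, unquote

def escape_component(text: str) -> str:
    """Percent-encode every character outside the RFC 3986 unreserved set."""
    # safe="" removes quote()'s default exemption for "/"; letters, digits
    # and "-._~" are always passed through unchanged (Python 3.7+).
    return quote(text, safe="")

print(escape_component("a-b_c.d~e"))     # unreserved characters are untouched
print(escape_component("50% off/now?"))  # reserved characters and the stray % are escaped
print(unquote(escape_component("50% off/now?")))  # decoding restores the input
```

Note that a component encoded this way treats every reserved character as data; a full URL must be assembled from separately escaped components rather than escaped as a whole.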

Some additional mechanisms are laid out in RFC 1738, which defines URI syntax within the scope of HTTP, FTP, NNTP, Gopher, and several other specific protocols. Together, these RFCs define the following syntax for common Internet resources (the compliance with a generic naming strategy is denoted by the // prefix):

scheme://[login[:password]@](host_name|host_address)[:port][/hierarchical/path/to/resource[?search_string][#fragment_id]]
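To make the syntax above concrete, the following minimal sketch splits a URL into its fields with Python's standard urllib.parse.urlsplit. The sample URL, the credentials in it, and the helper name describe are made up for illustration; this shows how one particular library parses the format, not how any browser does.

```python
from urllib.parse import urlsplit

def describe(url: str) -> dict:
    """Return the fields of the generic URL syntax as a dictionary."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "login": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
        "search_string": parts.query,
        "fragment_id": parts.fragment,
    }

sample = "http://alice:secret@example.com:8080/hierarchical/path?search=1#frag"
for field, value in describe(sample).items():
    print(f"{field:14} {value}")
```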

Since the presence of a scheme is the key differentiator between relative references permitted in documents for usability reasons, and fully-qualified URLs, and since : itself has other uses later in the URL, the set of characters permitted for scheme name must be narrow and clearly defined (0-9 A-Z a-z + - .) so that all implementations may make the distinction accurately.

On top of the aforementioned documents, a W3C draft RFC 1630 and a non-HTTP RFC 2368 de facto outline some additional concepts, such as the exact HTTP search string syntax (param1=val1[&param2=val2&...]), or the ability to use the + sign as a shorthand notation for spaces (the character itself does not function in this capacity elsewhere in the URL, which is somewhat counterintuitive).
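The special role of + in the search string can be seen with Python's standard urllib.parse helpers. This is only a sketch of one library's behavior; the parameter names and values are invented for the example.

```python
from urllib.parse import parse_qs, unquote, unquote_plus

query = "param1=two+words&param2=1%2B1"

# parse_qs applies query-string rules: "+" becomes a space, %2B a literal plus.
params = parse_qs(query)
print(params)

# Outside the query string, "+" carries no such meaning and is left alone.
print(unquote("two+words"))       # path-style decoding keeps the plus sign
print(unquote_plus("two+words"))  # query-style decoding yields a space
```

A literal plus sign inside a query value therefore has to be sent as %2B, which is easy to forget when building URLs by hand.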

Although a broad range of reserved characters is defined as delimiters in generic URL syntax, only a subset is given a clear role in HTTP addresses at any point; the function of [, ], !, $, ', (, ), *, ;, or , is not explicitly defined anywhere, but the characters are sometimes used to implement esoteric parameter passing conventions in oddball web application frameworks. The RFC itself sometimes implies that characters with no specific function within the scheme should be treated as regular, non-reserved ASCII characters; and elsewhere, suggests they retain a special meaning - in both cases creating ambiguities.

The standards that specify the overall URL syntax are fairly laid back - for example, they permit IP addresses such as 74.125.19.99 to be written in completely unnecessary and ambiguous ways such as 74.0x7d.023.99 (mixing decimal, octal, and hexadecimal notation) or 74.8196963 (24 bits coalesced). To add insult to injury, browsers deviate from these standards in random ways, for example accepting URLs with technically illegal characters, and then trying to escape them automatically, or passing them as-is to underlying implementations - such as the DNS resolver, which itself then rejects or passes through such queries in a completely OS-specific manner.
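The alternative IP address notations can be reduced to the canonical dotted form with a few lines of code. The sketch below is a hand-written approximation of the classic parsing rules (a 0x prefix means hexadecimal, a leading 0 means octal, and the last part fills all remaining bytes); the function names are assumptions, and real resolvers may differ in edge cases.

```python
def parse_part(part: str) -> int:
    """Parse one dotted component as hexadecimal, octal, or decimal."""
    low = part.lower()
    if low.startswith("0x"):
        return int(low[2:], 16)
    if len(low) > 1 and low.startswith("0"):
        return int(low, 8)
    return int(low, 10)

def normalize_ipv4(text: str) -> str:
    """Convert a loosely written IPv4 address to plain dotted decimal."""
    parts = [parse_part(p) for p in text.split(".")]
    if not 1 <= len(parts) <= 4:
        raise ValueError("expected one to four components")
    *head, last = parts
    remaining = 4 - len(head)  # number of bytes covered by the last part
    if any(not 0 <= v <= 255 for v in head) or not 0 <= last < 256 ** remaining:
        raise ValueError("component out of range")
    value = 0
    for v in head:
        value = (value << 8) | v
    value = (value << (8 * remaining)) | last
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

print(normalize_ipv4("74.0x7d.023.99"))  # 74.125.19.99
print(normalize_ipv4("74.8196963"))      # 74.125.19.99
```

Any filter that compares host names or addresses as strings should normalize them in a comparable way first, or reject non-canonical forms outright.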

A particularly interesting example of URL parsing inconsistencies is the two following URLs. The first one resolves to one host in Firefox, and to a different host in most other browsers; the second one behaves uniquely in Internet Explorer:

http://example.com\@coredump.cx/
http://example.com;.coredump.cx/
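How the first URL can end up pointing at two different hosts is easiest to see with two toy parsers. Both functions below are hypothetical and deliberately simplistic; they do not reproduce the code of any specific browser, and only show that a single decision about the backslash changes the result.

```python
import re

def host_strict(url: str) -> str:
    """The authority ends at the first /, ? or #; a backslash is ordinary data."""
    rest = url.split("://", 1)[1]
    authority = re.split(r"[/?#]", rest, maxsplit=1)[0]
    # Everything before the last "@" is treated as login data.
    return authority.rpartition("@")[2]

def host_lenient(url: str) -> str:
    """Same parser, but backslashes are first rewritten to forward slashes."""
    return host_strict(url.replace("\\", "/"))

url = "http://example.com\\@coredump.cx/"
print(host_strict(url))   # coredump.cx
print(host_lenient(url))  # example.com
```

A server-side check built on one interpretation offers no protection against a client that follows the other.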

Below is a more detailed review of the key differences that often need to be accounted for:

Test description	MSIE6	MSIE7	MSIE8	FF2	FF3	Safari	Opera	Chrome	Android
Characters ignored in front of URL schemes	\x01-\x20	\x01-\x20	\x01-\x20	\t \r \n \x20	\t \r \n \x20	\x20	\t \r \n \x0B \x0C \xA0	\x00-\x20	\x20
Non-standard characters permitted in URL scheme names (excluding `0-9 A-Z a-z + - .`)	\t \r \n	\t \r \n	\t \r \n	\t \r \n	\t \r \n	none	\r \n +UTF8	\0 \t \r \n	none
Non-standard characters kept as-is, with no escaping, in URL query strings (excluding `0-9 A-Z a-z - . _ ~ : / ? # [ ] @ ! $ & ' ( ) * + , ; =`)^*	" < > \ ^ ` { \| } \x7F	" < > \ ^ ` { \| } \x7F	" < > \ ^ ` { \| } \x7F	\ ^ { \| }	\ ^ { \| }	^ { \| }	^ { \| } \x7F	" \ ^ ` { \| }	n/a
Non-standard characters fully ignored in host names	\t \r \n	\t \r \n \xAD	\t \r \n \xAD	\t \r \n \xAD	\t \r \n \xAD	\xAD	\x0A-\x0D \xA0 \xAD	\t \r \n \xAD	none
Types of partial or broken URLs auto-corrected to fully qualified ones	//y \\y	//y \\y	//y \\y	//y x:///y x://`[y]`	//y x:///y x://`[y]`	//y \\y x:/y x:///y	//y \\y x://`[y]`	//y \\y x:///y	//y \\y
Is fragment ID (hash) encoded by applying RFC-mandated URL escaping rules?	NO	NO	NO	PARTLY	PARTLY	YES	NO	NO	YES
Are non-reserved `%nn` sequences in URL path decoded in address bar?	NO	YES	YES	NO	YES	NO	YES	YES	n/a
Are non-reserved `%nn` sequences in URL path decoded in `location.href`?	NO	YES	YES	NO	NO	YES	YES	YES	YES
Are non-reserved `%nn` sequences in URL path decoded in actual HTTP requests sent?	NO	YES	YES	NO	NO	NO	NO	YES	NO
Characters rejected in URL login or password (excluding `/ # ; ? : % @`)	\x00 \	\x00 \	\x00 \	none	none	\x00-\x20 " < > `[` \ `]` ^ ` { \| } \x7f-\xff	\x00	\x00 \x01 \	\x00-\x20 " < > `[` \ `]` ^ ` { \| } \x7f-\xff
URL authentication data splitting behavior with multiple `@` characters	leftmost	leftmost	leftmost	rightmost	rightmost	leftmost	rightmost	rightmost	leftmost

^* Interestingly, Firefox 3.5 takes a safer but non-RFC-compliant approach of encoding stray ' characters as %27 in URLs, in addition to the usual escaping rules.

NOTE: As an anti-phishing mechanism, additional restrictions on the use of login and password fields in URLs are imposed by many browsers; see the section on HTTP authentication later on.
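The last row of the table, the splitting behavior with multiple @ characters, can be sketched as follows. The function name, the rule labels, and the sample authority string are assumptions chosen for this illustration.

```python
def split_userinfo(authority: str, rule: str) -> tuple:
    """Split 'login@host' at the leftmost or the rightmost @ character."""
    if "@" not in authority:
        return "", authority
    if rule == "leftmost":
        userinfo, _, host = authority.partition("@")
    elif rule == "rightmost":
        userinfo, _, host = authority.rpartition("@")
    else:
        raise ValueError("rule must be 'leftmost' or 'rightmost'")
    return userinfo, host

authority = "user@example.com@coredump.cx"
print(split_userinfo(authority, "leftmost"))   # ('user', 'example.com@coredump.cx')
print(split_userinfo(authority, "rightmost"))  # ('user@example.com', 'coredump.cx')
```

With the same input, the two rules disagree about which part is the host name, so code that validates the destination of a link needs to apply the same rule as the browser that will follow it, or refuse such input.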

Please note that when links are embedded within HTML documents, HTML entity decoding takes place before the link is parsed. Because of this, if a tab is ignored in URL schemes, a link such as javascript&#09;:alert(1) may be accepted and executed as JavaScript, just as javascript<TAB>:alert(1) would be. Furthermore, certain characters, such as \x00 in Internet Explorer, or \x08 in Firefox, may be ignored by HTML parsers if used in certain locations, even though they are not treated differently by URL-handling code itself.
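A filter that inspects the raw attribute value will miss a scheme hidden behind an entity-encoded tab such as &#09;. The sketch below decodes entities first and then discards the characters that the most permissive browsers in the tables above ignore; effective_scheme is a hypothetical helper, and the set of stripped characters is an assumption rather than a complete model of any browser.

```python
import html
import re

_SCHEME = re.compile(r"([A-Za-z][A-Za-z0-9+.\-]*):")
_LEADING_JUNK = "".join(chr(c) for c in range(0x21))  # \x00 through \x20

def effective_scheme(attr_value: str):
    """Return the scheme a lenient browser might see, or None if there is none."""
    decoded = html.unescape(attr_value)       # entity decoding happens first
    stripped = decoded.lstrip(_LEADING_JUNK)  # characters ignored before the scheme
    stripped = re.sub(r"[\t\r\n]", "", stripped)  # characters ignored inside it
    match = _SCHEME.match(stripped)
    return match.group(1).lower() if match else None

print(effective_scheme("javascript&#09;:alert(1)"))  # javascript
print(effective_scheme("&#x20;JavaScript:alert(1)"))  # javascript
print(effective_scheme("/relative/path"))             # None
```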

Unicode in URLs

Much like several related technologies used for web content transport, URLs do not have any particular character set defined by relevant RFCs; RFC 3986 ambiguously states: "In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification." The same dismissive approach is taken in HTTP header specification.

As a result, in the context of web URLs, any high-bit user input may and should be escaped as %nn sequences, but there is no specific guidance provided on how to transcode user input in system's native code page when talking to other parties that may not share this code page. On a system that uses UTF-8 and receives a URL containing Unicode ą in the path, this corresponds to a sequence of 0xC4 0x85 natively; however, when sent to a server that uses ISO-8859-2, the correct value sent should be 0xB1 (or alternatively, additional information about client-specific encoding should be included to make the conversion possible on server side). In practice, most browsers deal with this by sending UTF-8 data by default on any text entered in the URL bar by hand, and using page encoding on all followed links.
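The byte values quoted above can be checked directly. This short sketch percent-encodes the same character under both encodings using Python's standard library; it only demonstrates the ambiguity and does not model any browser.

```python
from urllib.parse import quote

char = "\u0105"  # the letter ą

print(char.encode("utf-8").hex())       # c485
print(char.encode("iso-8859-2").hex())  # b1

# The same character yields two different escaped forms on the wire.
print(quote(char, encoding="utf-8"))       # %C4%85
print(quote(char, encoding="iso-8859-2"))  # %B1
```

A server that receives %B1 or %C4%85 has no way to tell from the URL alone which encoding the client used.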

Another limitation of URLs traces back to DNS as such; RFC 1035 permits only characters A-Z a-z 0-9 - in DNS labels, with period (.) used as a delimiter; some resolver implementations further permit underscore (_) to appear in DNS names, violating the standard, but other characters are almost universally banned. With the growth of the web, the need to accommodate non-Latin alphabets in host names was perceived by multiple parties - and the use of %nn encoding was not an option, because % as such was not on the list.

To solve this, RFC 3490 lays out a rather contrived encoding scheme that permits Unicode data to be stored in DNS labels, and RFC 3492 outlines a specific implementation within DNS labels - commonly referred to as Punycode - that follows the notation of xn--[US-ASCII part]-[encoded Unicode data]. Any Punycode-aware browser faced with non-US-ASCII data in a host name is expected to transform it to this notation first, and then perform a traditional DNS lookup for the encoded string.

Putting these two methods together, the following transformation is expected to be made internally by the browser:

http://www.ręczniki.pl/?ręcznik=1 → http://www.xn--rczniki-98a.pl/?r%C4%99cznik=1
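The transformation above can be approximated with Python's standard library, whose built-in idna codec follows RFC 3490. The function name to_wire_url is an assumption, the query is assumed to be sent as UTF-8, and the sketch ignores ports and login data for brevity.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def to_wire_url(url: str) -> str:
    """Punycode-encode the host and percent-escape UTF-8 in path and query."""
    parts = urlsplit(url)
    # Host name: each non-ASCII label becomes xn--<punycode>.
    host = parts.hostname.encode("idna").decode("ascii")
    # Path and query: high-bit characters become %nn sequences of UTF-8 bytes.
    path = quote(parts.path, safe="/")
    query = quote(parts.query, safe="=&")
    return urlunsplit((parts.scheme, host, path, query, parts.fragment))

print(to_wire_url("http://www.ręczniki.pl/?ręcznik=1"))
# http://www.xn--rczniki-98a.pl/?r%C4%99cznik=1
```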

Key security-relevant differences in high-bit URL handling are outlined below:

Test description	MSIE6	MSIE7	MSIE8	FF2	FF3	Safari	Opera	Chrome	Android
Request URL path encoding when following plain links	UTF-8	UTF-8	UTF-8	page encoding	UTF-8	UTF-8	UTF-8	UTF-8	UTF-8
Request URL query string encoding when following plain links	page encoding, no escaping	page encoding, no escaping	page encoding, no escaping	page encoding	page encoding	page encoding	page encoding	page encoding	page encoding
Request URL path encoding for `XMLHttpRequest` calls	page encoding	page encoding	page encoding	page encoding	page encoding	page encoding	page encoding	page encoding	page encoding
Request URL query string encoding for `XMLHttpRequest` calls	page encoding, no escaping	page encoding, no escaping	page encoding, no escaping	page encoding	page encoding	mangled	page encoding	mangled	mangled
Request URL path encoding for manually entered URLs	UTF-8	UTF-8	UTF-8	UTF-8	UTF-8	UTF-8	UTF-8	UTF-8	UTF-8
Request URL query string encoding for manually entered URLs	transcoded to 7-bit	transcoded to 7-bit	transcoded to 7-bit	UTF-8	UTF-8	UTF-8	stripped to `?`	UTF-8	UTF-8
Raw Unicode in host names auto-converted to Punycode?	NO	YES	YES	YES	YES	YES	YES	YES	NO
Is percent-escaped UTF-8 in host names auto-converted to Punycode?	(NO)	YES	YES	NO	on retry	YES	YES	YES	(NO)
URL bar Unicode display method for host names	(Punycode)	Unicode	Unicode	Unicode	Unicode	Unicode	Unicode	Unicode	(Punycode)
URL bar Unicode display method outside host names	Unicode	Unicode	Unicode	`%nn`	Unicode	Unicode	as `?`	Unicode	n/a

NOTE 1: Firefox generally uses UTF-8 to encode most URLs, but will send characters supported by ISO-8859-1 using that codepage.

NOTE 2: As an anti-phishing mechanism, additional restrictions on the use of some or all Unicode characters in certain top-level domains are imposed by many browsers; see the chapter on domain name restrictions later on.

True URL schemes

URL schemes are used to indicate what general protocol needs to be used to retrieve the data, and how the information needs to be processed for display. Schemes may also be tied to specific default values of some URL fields - such as a TCP port - or specific non-standard URL syntax parsing rules (the latter is usually denoted by the absence of a // string after scheme identifier).
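The link between a scheme and its default field values can be sketched as a simple lookup. The table below lists only a few well-known default TCP ports, and the names DEFAULT_PORTS and effective_port are assumptions made for this example.

```python
from urllib.parse import urlsplit

# Well-known default ports for a handful of schemes (not an exhaustive list).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21, "gopher": 70}

def effective_port(url: str):
    """Return the explicit port if present, else the default for the scheme."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme.lower())  # None for unknown schemes

print(effective_port("https://example.com/"))      # 443
print(effective_port("http://example.com:8080/"))  # 8080
print(effective_port("mailto:user@example.com"))   # None
```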

Back in 1994, RFC 1738 laid out several URL schemes in the context of web browsing, some of which are not handled natively by browsers, or have fallen into disuse. As the web matured, multiple new open and proprietary schemes appeared with little or no consistency, and the set of protocols supported by each browser began to diverge. RFC 2718 attempted to specify some ground rules for the creation of new protocols, and RFC 4395 mandated that new schemes be registered with IANA, which maintains a public list of them. In practice, however, few parties could be bothered to follow this route.

To broadly survey the capabilities of modern browsers, it makes sense to divide URL schemes into a group natively supported by the software, and another group supported by a plethora of plugins, extensions, or third-party programs. The first list is relatively short:

Scheme name	MSIE6	MSIE7	MSIE8	FF2	FF3	Safari	Opera	Chrome	Android
HTTP (RFC 2616)	YES	YES	YES	YES	YES	YES	YES	YES	YES
HTTPS (RFC 2818)	YES	YES	YES	YES	YES	YES	YES	YES	YES
SHTTP (RFC 2660)	as HTTP	as HTTP	as HTTP	NO	NO	NO	NO	NO	NO
FTP (RFC 1738)	YES	YES	YES	YES	YES	YES	YES	YES	NO
file (RFC 1738)	YES	YES	YES	YES (local)	YES (local)	YES (local)	YES	YES	NO
Gopher (RFC 4266)	NO	NO	NO	YES	YES	NO	defunct?	NO	NO
news (draft RFC)	NO	NO	NO	NO	NO	NO	YES	NO	NO

These protocols may be used to deliver natively rendered content that is interpreted and executed using the security rules implemented within the browsers.

The other list, third-party protocols routed to other applications, depends quite heavily on system configuration. The set of protocols and their handlers is usually maintained in a separate system-wide registry. Browsers may whitelist some of them by default (executing external programs without prompting), or blacklist some (preventing their use altogether); the default action is often to display a mildly confusing prompt. Most common protocols in this family include:

acrobat                           - Acrobat Reader
callto                            - some instant messengers, IP phones
daap / itpc / itms / ...          - Apple iTunes
FirefoxURL                        - Mozilla Firefox
hcp                               - Microsoft Windows Help subsystem
ldap                              - address book functionality
mailto                            - various mail agents
mmst / mmsu / msbd / rtsp / ...   - streaming media of all sorts
mso-offdap                        - Microsoft Office
news / snews / nntp               - various news clients
outlook / stssync                 - Microsoft Outlook
rlogin / telnet / tn3270          - telnet client
shell                             - Windows Explorer
sip                               - various IP phone software

New handlers might be registered in OS- and application-specific ways, for example by registering new HKCR\Protocols\Handlers keys in the Windows registry, or adding network.protocol-handler.* settings in Firefox.

As a rule, the latter set of protocols is not honored within the renderer when referencing document elements such as images, script or applet sources, and so forth; they do work, however, as <IFRAME> and link targets, and will launch a separate program as needed. These programs may sometimes integrate seamlessly with the renderer, but the rules by which they render content and impose security restrictions are generally unrelated to the browser as such.

Historically, the ability to plant such third-party schemes in links proved to be a significant exploitation vector, as poorly written, vulnerable external programs often register protocol handlers without the user's knowledge or consent; as such, it is prudent to reject unexpected schemes where possible, and exercise caution when introducing new ones on the client side (including observing safe parameter passing conventions).
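Rejecting unexpected schemes is usually done with a whitelist rather than a blacklist. The sketch below is one conservative way to do so; the ALLOWED_SCHEMES policy and the function name are assumptions, and the check deliberately removes all whitespace and control characters before looking at the scheme, since browsers differ on which of them are ignored.

```python
import re

ALLOWED_SCHEMES = frozenset({"http", "https"})  # hypothetical site policy
_SCHEME = re.compile(r"([A-Za-z][A-Za-z0-9+.\-]*):")

def is_acceptable_link(url: str) -> bool:
    """Accept relative references and whitelisted schemes, reject the rest."""
    # Drop spaces and control characters that some browsers silently ignore.
    cleaned = "".join(ch for ch in url if ch > " " and ch != "\x7f")
    match = _SCHEME.match(cleaned)
    if match:
        return match.group(1).lower() in ALLOWED_SCHEMES
    # No well-formed scheme: treat as relative only if no colon precedes
    # the first /, ? or #; anything else is too ambiguous to accept.
    head = re.split(r"[/?#]", cleaned, maxsplit=1)[0]
    return ":" not in head

for link in ("https://example.com/", "/docs/page", "hcp://services/",
             "java\tscript:alert(1)", " shell:foo"):
    print(repr(link), is_acceptable_link(link))
```

The check is applied to the decoded attribute value; input that is still HTML-encoded has to be entity-decoded before it is tested.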
