3:39 PMBrowser Security Handbook, part 1.2
HTML features a special encoding scheme called HTML entities. The purpose of this scheme is to make it possible to safely render certain reserved HTML characters (e.g., < > &) within documents, as well as to carry high bit characters safely over 7-bit media. The scheme nominally permits three types of notation:
Unfortunately, various browsers follow different parsing rules to these HTML entity notations; all rendering engines recognize entities with no proper ; terminator, and all permit entities with excessively long, zero-padded notation, but with various thresholds:
* Entries one byte longer than this limit still get parsed, but incorrectly; for example, A becomes a sequence of three characters, \x06 5 ;. Two characters and more do not get parsed at all - A is displayed literally).
An interesting piece of trivia is that, as per HTML entity encoding requirements, links such as:
<a href="http://example.com/?p1=v1&p2=v2">Click here</a>
In practice, however, the convention is almost never followed by web developers, and browsers compensate for it by treating invalid HTML entities as literal &-containing strings.
As the web matured and the concept of client-side programming gained traction, the need arose for HTML document to be programatically accessible and modifiable on the fly in response to user actions. One technology, Document Object Model (also referred to as "dynamic HTML"), emerged as the prevailing method for accomplishing this task.
Document Object Model also permits third-party documents to be referenced, subject to certain security checks discussed later on; for example, the following code accesses the <INPUT> tag in the second <IFRAME> on a current page:
DOM object hierarchy - as seen by programmers - begins with an implicit "root" object, sometimes named defaultView or global; all the scripts running on a page have defaultView as their default scope. This root object has the following members:
Trivia: traversing document.* is a possible approach to look up specific input or output fields when modifying complex pages from within scripts, but it is not a very practical one; because of this, most scripts resort to document.getElementById() method to uniquely identify elements regardless of their current location on the page. Although ID= parameters for tags within HTML documents are expected to be unique, there is no such guarantee. Currently, all major browsers seem to give the first occurrence a priority.
Strictly speaking, HTML4 standard specifies that any </... sequence may be used to break out of non-HTML blocks such as <SCRIPT> (reference); in practice, this suit is not followed by browsers, and a literal </SCRIPT> string is required instead. This property is not necessarily safe to rely on, however.
Various differences between browser implementations are outlined below:
The fourth scheme is most space-efficient, although generally error-prone; for example, a failure to escape \ as \\ may lead to a string of \" + alert(1);" being converted to \\" + alert(1), and getting interpreted incorrectly; it also partly collides with the remaining \-prefixed escaping schemes, and is not compatible with C syntax.
The parsing of these schemes is uniform across all browsers if fewer than the expected number of input digits is seen: the value is interpreted correctly, and no subsequent text is consumed ("\1test" is accepted and becomes "\001test").
|Total comments: 1|