How Browsers Work

Web browsers are probably the most widely used software. In this book I will explain how they work behind the scenes. We will see what happens from the moment you type ‘google.com’ in the address bar until you see the Google page on the browser screen.

The browsers we will talk about

There are five major browsers in use today - Internet Explorer, Firefox, Safari, Chrome and Opera.
I will give examples from the open source browsers - Firefox, Chrome and Safari, which is partly open source.
According to the W3C browser statistics, currently (October 2009), the usage share of Firefox, Safari and Chrome together is nearly 60%.

So nowadays open source browsers are a substantial part of the browser business.

The browser’s main functionality

The browser’s main function is to present the web resource you choose, by requesting it from the server and displaying it in the browser window. The resource format is usually HTML, but it can also be PDF, an image and more. The location of the resource is specified by the user using a URI (Uniform Resource Identifier). More on that in the network chapter.

The way the browser interprets and displays HTML files is specified in the HTML and CSS specifications. These specifications are maintained by the W3C (World Wide Web Consortium) organization, which is the standards organization for the web.

The current version of HTML is 4 (http://www.w3.org/TR/html401/). Version 5 is in progress. The current CSS version is 2 (http://www.w3.org/TR/CSS2/) and version 3 is in progress.

For years browsers conformed to only a part of the specifications and developed their own extensions. That caused serious compatibility issues for web authors. Today most of the browsers more or less conform to the specifications.

Browsers’ user interfaces have a lot in common with each other. Among the common user interface elements are:

  • Address bar for inserting the URI
  • Back and forward buttons
  • Bookmarking options
  • Refresh and stop buttons for refreshing and stopping the loading of the current document
  • Home button that gets you to your home page

Strangely enough, the browser’s user interface is not specified in any formal specification. It is just good practice, shaped over years of experience and by browsers imitating each other. The HTML5 specification doesn’t define UI elements a browser must have, but lists some common elements. Among those are the address bar, status bar and tool bar. There are, of course, features unique to a specific browser, like the Firefox downloads manager.

More on that in the user interface chapter.

The browser’s high level structure

The browser’s main components are (1.1):

  1. The user interface - this includes the address bar, back/forward buttons, bookmarking menu etc. Every part of the browser display except the main window where you see the requested page.

  2. The browser engine - the interface for querying and manipulating the rendering engine.

  3. The rendering engine - responsible for displaying the requested content. For example, if the requested content is HTML, it is responsible for parsing the HTML and CSS and displaying the parsed content on the screen.

  4. Networking - used for network calls, like HTTP requests. It has a platform independent interface and underlying implementations for each platform.

  5. UI backend - used for drawing basic widgets like combo boxes and windows. It exposes a generic interface that is not platform specific. Underneath it uses the operating system’s user interface methods.

  6. JavaScript interpreter - used to parse and execute the JavaScript code.

  7. Data storage - this is a persistence layer. The browser needs to save all sorts of data on the hard disk, for example cookies. The new HTML specification (HTML5) defines a ‘web database’, which is a complete (although light) database in the browser.

Figure 1: Browser main components.

It is important to note that Chrome, unlike most browsers, holds multiple instances of the rendering engine - one for each tab. Each tab is a separate process.

I will devote a chapter to each of these components.

Communication between the components

Both Firefox and Chrome developed a special communication infrastructure.

They will be discussed in a special chapter.

The rendering engine

The responsibility of the rendering engine is, well, rendering: displaying the requested contents on the browser screen.

By default the rendering engine can display HTML and XML documents and images. It can display other types through a plug-in (a browser extension). An example is displaying PDF using a PDF viewer plug-in. We will talk about plug-ins and extensions in a special chapter. In this chapter we will focus on the main use case - displaying HTML and images that are formatted using CSS.

Rendering engines

Our reference browsers - Firefox, Chrome and Safari - are built upon two rendering engines. Firefox uses Gecko - a “home made” Mozilla rendering engine. Both Safari and Chrome use Webkit.

Webkit is an open source rendering engine which started as an engine for the Linux platform and was modified by Apple to support Mac and Windows. See http://webkit.org/ for more details.

The main flow

The rendering engine will start getting the contents of the requested document from the networking layer. This will usually be done in 8K chunks.

After that, this is the basic flow of the rendering engine:

Figure 2: Rendering engine basic flow.

The rendering engine will start parsing the HTML document and turn the tags to DOM nodes in a tree called the “content tree”. It will parse the style data, both in external CSS files and in style elements. The styling information together with visual instructions in the HTML will be used to create another tree - the render tree.

The render tree contains rectangles with visual attributes like color and dimensions. The rectangles are in the right order to be displayed on the screen.

After the construction of the render tree it goes through a “layout” process. This means giving each node the exact coordinates where it should appear on the screen. The next stage is painting - the render tree will be traversed and each node will be painted using the UI backend layer.

It’s important to understand that this is a gradual process. For better user experience, the rendering engine will try to display contents on the screen as soon as possible. It will not wait until all HTML is parsed before starting to build and lay out the render tree. Parts of the content will be parsed and displayed, while the process continues with the rest of the contents that keep coming from the network.

Main flow examples

Figure 3: Webkit main flow

Figure 4: Mozilla’s Gecko rendering engine main flow (3.6)

From figures 3 and 4 you can see that although Webkit and Gecko use slightly different terminology, the flow is basically the same.

Gecko calls the tree of visually formatted elements the “Frame tree”. Each element is a frame. Webkit uses the term “Render Tree”, and it consists of “Render Objects”. Webkit uses the term “layout” for the placing of elements, while Gecko calls it “Reflow”. “Attachment” is Webkit’s term for connecting DOM nodes and visual information to create the render tree. A minor non-semantic difference is that Gecko has an extra layer between the HTML and the DOM tree. It is called the “content sink” and is a factory for making DOM elements. We will talk about each part of the flow:

Parsing - general

Since parsing is a very significant process within the rendering engine, we will go into it a little more deeply. Let’s begin with a little introduction about parsing.

Parsing a document means translating it to some structure that makes sense - something the code can understand and use. The result of parsing is usually a tree of nodes that represent the structure of the document. It is called a parse tree or a syntax tree.

Example - parsing the expression “2 + 3 - 1” could return this tree:

Figure 5: mathematical expression tree node

Grammars

Parsing is based on the syntax rules the document obeys - the language or format it was written in. Every format you can parse must have a deterministic grammar consisting of vocabulary and syntax rules. It is called a context free grammar. Human languages are not such languages and therefore cannot be parsed with conventional parsing techniques.

Parser - Lexer combination

Parsing can be separated into two sub processes - lexical analysis and syntax analysis.

Lexical analysis is the process of breaking the input into tokens. Tokens are the language vocabulary - the collection of valid building blocks. In human language it will consist of all the words that appear in the dictionary for that language.

Syntax analysis is the application of the language syntax rules.

Parsers usually divide the work between two components - the lexer (sometimes called tokenizer) that is responsible for breaking the input into valid tokens, and the parser that is responsible for constructing the parse tree by analyzing the document structure according to the language syntax rules. The lexer knows how to strip irrelevant characters like white spaces and line breaks.

Figure 6: from source document to parse trees

The parsing process is iterative. The parser will usually ask the lexer for a new token and try to match the token with one of the syntax rules. If a rule is matched, a node corresponding to the token will be added to the parse tree and the parser will ask for another token.
If no rule matches, the parser will store the token internally, and keep asking for tokens until a rule matching all the internally stored tokens is found. If no rule is found then the parser will raise an exception. This means the document was not valid and contained syntax errors.

Translation

Many times the parse tree is not the final product. Parsing is often used in translation - transforming the input document to another format. An example is compilation. The compiler that compiles a source code into machine code first parses it into a parse tree and then translates the tree into a machine code document.

Figure 7: compilation flow

Parsing example

In figure 5 we built a parse tree from a mathematical expression. Let’s try to define a simple mathematical language and see the parse process.

Vocabulary: Our language can include integers, plus signs and minus signs.

Syntax:

  1. The language syntax building blocks are expressions, terms and operations.
  2. Our language can include any number of expressions.
  3. An expression is defined as a “term” followed by an “operation” followed by another term
  4. An operation is a plus token or a minus token
  5. A term is an integer token or an expression

Let’s analyze the input “2 + 3 - 1”.
The first substring that matches a rule is “2”: according to rule #5 it is a term. The second match is “2 + 3”: this matches the third rule - a term followed by an operation followed by another term. The next match will only be hit at the end of the input. “2 + 3 - 1” is an expression because we already know that “2 + 3” is a term, so we have a term followed by an operation followed by another term. “2 + +” will not match any rule and is therefore an invalid input.

Formal definitions for vocabulary and syntax

Vocabulary is usually expressed by regular expressions.

For example our language will be defined as:

INTEGER : 0|[1-9][0-9]*
PLUS : +
MINUS : -

As you see, integers are defined by a regular expression.
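
As a rough illustration, the vocabulary can be turned into a small lexer with Python's `re` module. This is only a sketch: the names `TOKEN_SPEC` and `tokenize` are made up for the example, and the `SKIP` rule (ignoring white space) is an assumption that is not part of the three definitions.

```python
import re

# Token definitions copied from the vocabulary; SKIP is an added assumption.
TOKEN_SPEC = [
    ("INTEGER", r"0|[1-9][0-9]*"),
    ("PLUS", r"\+"),
    ("MINUS", r"-"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, lexeme) pairs; raise on characters outside the vocabulary."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":  # white space is stripped, not emitted
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Calling `tokenize("2 + 3 - 1")` yields five tokens: three INTEGER tokens separated by a PLUS and a MINUS token.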

Syntax is usually defined in a format called BNF. Our language will be defined as:

expression := term operation term
operation := PLUS | MINUS
term := INTEGER | expression

We said that a language can be parsed by regular parsers if its grammar is a context free grammar. An intuitive definition of a context free grammar is a grammar that can be entirely expressed in BNF. For a formal definition see http://en.wikipedia.org/wiki/Context-free_grammar

Types of parsers

There are two basic types of parsers - top down parsers and bottom up parsers. An intuitive explanation is that top down parsers see the high level structure of the syntax and try to match one of them. Bottom up parsers start with the input and gradually transform it into the syntax rules, starting from the low level rules until high level rules are met.

Let’s see how the two types of parsers will parse our example:

The top down parser will start from the higher level rule - it will identify “2 + 3” as an expression. It will then identify “2 + 3 - 1” as an expression (the process of identifying the expression involves matching the other rules, but the start point is the highest level rule).
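
A hand-written top down parser for this tiny language might look like the sketch below. It starts from the expression rule and works down to integers. Two things are assumptions of the example rather than part of the grammar: the recursive rule "a term can be an expression" is handled with a loop that folds each new operation into the tree built so far, and the result is a nested tuple `(operation, left, right)`. The names `tokenize` and `parse_expression` are illustrative.

```python
import re


def tokenize(text):
    # Integers become one token each; any other non-space character is its own token.
    return re.findall(r"[0-9]+|\S", text)


def parse_expression(tokens):
    """Start at the expression rule and descend: integer, then (operation integer) repeated."""
    pos = 0

    def parse_integer():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise SyntaxError(f"expected an integer at token {pos}")
        value = int(tokens[pos])
        pos += 1
        return value

    node = parse_integer()
    while pos < len(tokens):
        operation = tokens[pos]
        if operation not in ("+", "-"):
            raise SyntaxError(f"expected '+' or '-' at token {pos}")
        pos += 1
        # The tree built so far becomes the left term of the new expression.
        node = (operation, node, parse_integer())
    return node
```

For “2 + 3 - 1” the result is `("-", ("+", 2, 3), 1)`, which has the same shape as the tree in figure 5, while “2 + +” raises a `SyntaxError`.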

The bottom up parser will scan the input until a rule is matched. It will then replace the matching input with the rule. This will go on until the end of the input. The partly matched expression is placed on the parser’s stack.

Stack                  Input
                       2 + 3 - 1
term                   + 3 - 1
term operation         3 - 1
expression             - 1
expression operation   1
expression

This type of bottom up parser is called a shift reduce parser, because the input is shifted to the right (imagine a pointer pointing first at the input start and moving to the right) and is gradually reduced to syntax rules.
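
The shift and reduce steps can be sketched in a few lines of Python. This is a toy written for this one grammar, not a general shift reduce parser: the function name `shift_reduce` and the trace format are made up, and the reduction of an integer to a term is folded into the shift step to keep the trace short.

```python
def shift_reduce(tokens):
    """Return the (stack, remaining input) pairs seen while parsing a list of tokens."""
    stack = []
    remaining = list(tokens)
    trace = [(" ".join(stack), " ".join(remaining))]
    while remaining:
        token = remaining.pop(0)  # shift: move one token from the input to the stack
        if token.isdigit():
            stack.append("term")  # term := INTEGER
        elif token in ("+", "-"):
            stack.append("operation")  # operation := PLUS | MINUS
        else:
            raise SyntaxError(f"unknown token {token!r}")
        # reduce: expression := term operation term (an expression also counts as a term)
        if (len(stack) >= 3 and stack[-3] in ("term", "expression")
                and stack[-2] == "operation" and stack[-1] == "term"):
            stack[-3:] = ["expression"]
        trace.append((" ".join(stack), " ".join(remaining)))
    if stack != ["expression"]:
        raise SyntaxError("input is not a valid expression")
    return trace
```

Running `shift_reduce("2 + 3 - 1".split())` returns one (stack, input) pair per step, and the sequence of stacks is: empty, term, term operation, expression, expression operation, expression.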

Generating parsers automatically

There are tools that can generate a parser for you. They are called parser generators. You feed them with the grammar of your language - its vocabulary and syntax rules - and they generate a working parser. Creating a parser requires a deep understanding of parsing and it’s not easy to create an optimized parser by hand, so parser generators can be very useful.

Webkit uses two well known parser generators - Flex for creating a lexer and Bison for creating a parser (you might run into them with the names Lex and Yacc). Flex’s input is a file containing regular expression definitions of the tokens. Bison’s input is the language syntax rules in BNF format.

HTML Parser

The job of the HTML parser is to parse the HTML markup into a parse tree.

The HTML grammar definition

The vocabulary and syntax of HTML are defined in specifications created by the W3C organization. The current version is HTML4 and work on HTML5 is in progress.

Not a context free grammar

As we have seen in the parsing introduction, grammar syntax can be defined formally using formats like BNF.
Unfortunately all the conventional parser topics do not apply to HTML (I didn’t bring them up just for fun - they will be used in parsing CSS and JavaScript). HTML cannot easily be defined by a context free grammar that parsers need.
There is a formal format for defining HTML - DTD (Document Type Definition) - but it is not a context free grammar.
This appears strange at first sight - HTML is rather close to XML. There are lots of available XML parsers. There is an XML variation of HTML - XHTML - so what’s the big difference?
The difference is that the HTML approach is more “forgiving”: it lets you omit certain tags which are added implicitly, sometimes omit the start or end of tags etc. On the whole it’s a “soft” syntax, as opposed to XML’s stiff and demanding syntax.
Apparently this seemingly small difference makes a world of difference. On one hand this is the main reason why HTML is so popular - it forgives your mistakes and makes life easy for the web author. On the other hand, it makes it difficult to write a formal grammar. So to summarize - HTML cannot be parsed easily, not by conventional parsers since its grammar is not a context free grammar, and not by XML parsers.

HTML DTD

HTML definition is in a DTD format. This format is used to define languages of the SGML family. The format contains definitions for all allowed elements, their attributes and hierarchy. As we saw earlier, the HTML DTD doesn’t form a context free grammar.

There are a few variations of the DTD. The strict mode conforms solely to the specifications but other modes contain support for markup used by browsers in the past. The purpose is backwards compatibility with older content. The current strict DTD is here: http://www.w3.org/TR/html4/strict.dtd

DOM

The output tree - the parse tree - is a tree of DOM element and attribute nodes. DOM is short for Document Object Model. It is the object presentation of the HTML document and the interface of HTML elements to the outside world, like JavaScript.
The root of the tree is the “Document” object.

The DOM has an almost one to one relation to the markup. Example, this markup:

<html>
<body>
<p>
Hello World
</p>
<div> <img src="example.png"/></div>
</body>
</html>

Would be translated to the following DOM tree:

Figure 8: DOM tree of the example markup
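
To get a feel for the near one to one relation between markup and tree, here is a small sketch that builds a tree from the example markup with Python's standard `html.parser` module. It is not a DOM implementation: the `Node` class, `build_tree` and `outline` are names invented for the example, white-space-only text is dropped, and no implicit elements (such as `head`) are added.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = dict(attrs or [])
        self.children = []


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.document = Node("#document")
        self.open_elements = [self.document]  # innermost open element is last

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.open_elements[-1].children.append(node)
        self.open_elements.append(node)

    def handle_endtag(self, tag):
        # Assumes well-nested markup: only close the element that is currently open.
        if len(self.open_elements) > 1 and self.open_elements[-1].name == tag:
            self.open_elements.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.open_elements[-1].children.append(Node("#text", [("value", text)]))


def build_tree(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.document


def outline(node, depth=0):
    """Return the tree as a list of indented lines, one per node."""
    label = node.name if node.name != "#text" else f'#text "{node.attrs["value"]}"'
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

Printing `"\n".join(outline(build_tree(markup)))` for the example markup shows `html` under the document, `body` under `html`, and `p` (with its text) and `div` (with its `img`) under `body`.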

Like HTML, DOM is specified by the W3C organization. See http://www.w3.org/DOM/DOMTR. It is a generic specification for manipulating documents. A specific module describes HTML specific elements. The HTML definitions can be found here: http://www.w3.org/TR/2003/REC-DOM-Level-2-HTML-20030109/idl-definitions.html.

When I say the tree contains DOM nodes, I mean the tree is constructed of elements that implement one of the DOM interfaces. Browsers use concrete implementations that have other attributes used by the browser internally.

The parsing algorithm

As we saw in the previous sections, HTML cannot be parsed using the regular top down or bottom up parsers.

The reasons are:

  1. The forgiving nature of the language.
  2. The fact that browsers have traditionally tolerated errors in order to support well known cases of invalid HTML.
  3. The parsing process is reentrant. Usually the source doesn’t change during parsing, but in HTML, script tags containing “document.write” can add extra tokens, so the parsing process actually modifies the input.

Unable to use the regular parsing techniques, browsers create custom parsers for parsing HTML.

The parsing algorithm is described in detail by the HTML5 specification. The algorithm consists of two stages - tokenization and tree construction.

Tokenization is the lexical analysis, parsing the input into tokens. Among HTML tokens are start tags, end tags, attribute names and attribute values.

The tokenizer recognizes the token, gives it to the tree constructor and consumes the next character for recognizing the next token and so on until the end of the input.

Figure 6: HTML parsing flow (taken from HTML5 spec)

The tokenization algorithm

The algorithm’s output is an HTML token. The algorithm is expressed as a state machine. Each state consumes one or more characters of the input stream and updates the next state according to those characters. The decision is influenced by the current tokenization state and by the tree construction state. This means the same consumed character can lead to a different next state, depending on the current state. The algorithm is too complex to describe in full, so let’s see a simple example that will help us understand the principle.

Basic example - tokenizing the following HTML:

<html>
<body>
Hello world
</body>
</html>

The initial state is the “Data state”. When the “<” character is encountered, the state is changed to “Tag open state”. Consuming an “a-z” character causes creation of a “Start tag token”, and the state is changed to “Tag name state”. We stay in this state until the “>” character is consumed. Each character is appended to the new token name. In our case the created token is an “html” token.

When the “>” character is reached, the current token is emitted and the state changes back to the “Data state”. The “<body>” tag will be treated by the same steps. So far the “html” and “body” tags were emitted. We are now back at the “Data state”. Consuming the “H” character of “Hello world” will cause creation and emitting of a character token. This goes on until the “<” of “</body>” is reached. We will emit a character token for each character of “Hello world”.

We are now back at the “Tag open state”. Consuming the next input “/” will cause creation of an “end tag token” and a move to the “Tag name state”. Again we stay in this state until we reach “>”. Then the new tag token will be emitted and we go back to the “Data state”. The “</html>” input will be treated like the previous case.

Figure 9: Tokenizing the example input
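
The three states used in this walk-through can be sketched as a tiny state machine. This follows the simplified description given here, not the full HTML5 algorithm: attributes, comments and the many other states of the specification are left out, and the token format (a pair of kind and value) and the name `tokenize_html` are assumptions of the example.

```python
DATA, TAG_OPEN, TAG_NAME = "Data state", "Tag open state", "Tag name state"


def tokenize_html(source):
    """Emit ('start tag', name), ('end tag', name) and ('character', ch) tokens."""
    state = DATA
    tokens = []
    current = None  # tag token under construction, as [kind, name]
    for ch in source:
        if state == DATA:
            if ch == "<":
                state = TAG_OPEN
            else:
                tokens.append(("character", ch))  # one token per character
        elif state == TAG_OPEN:
            if ch == "/":
                current = ["end tag", ""]
                state = TAG_NAME
            elif ch.isalpha():
                current = ["start tag", ch.lower()]
                state = TAG_NAME
            else:
                raise ValueError(f"unexpected {ch!r} in {state}")
        elif state == TAG_NAME:
            if ch == ">":
                tokens.append((current[0], current[1]))  # emit the finished tag token
                current = None
                state = DATA
            else:
                current[1] += ch.lower()  # append to the token name
    return tokens
```

Feeding it `<html><body>Hi</body></html>` produces two start tag tokens, one character token per letter, and two end tag tokens, in that order.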

Tree construction algorithm

When the parser is created the Document object is created. During the tree construction stage the DOM tree with the Document in its root will be modified and elements will be added to it. Each token emitted by the tokenizer will be processed by the tree constructor. For each token the specification defines which DOM element is relevant to it and will be created for this token. Besides being added to the DOM tree, the element is also added to a stack of open elements. This stack is used to correct nesting mismatches and unclosed tags. The algorithm is also described as a state machine. The states are called “insertion modes”.

Let’s see the tree construction process for the example input:

<html>
<body>
Hello world
</body>
</html>

The input to the tree construction stage is a sequence of tokens from the tokenization stage. The first mode is the “initial mode”. Receiving the html token will cause a move to the “before html” mode and a reprocessing of the token in that mode. This will cause a creation of the HTMLHtmlElement element and it will be appended to the root Document object.
The state will be changed to “before head”. We receive the “body” token. An HTMLHeadElement will be created implicitly although we don’t have a “head” token and it will be added to the tree.
We now move to the “in head” mode and then to “after head”. The body token is reprocessed, an HTMLBodyElement is created and inserted and the mode is transferred to “in body”.
The character tokens of the “Hello world” string are now received. The first one will cause creation and insertion of a “Text” node and the other characters will be appended to that node.
The receiving of the body end token will cause a transfer to “after body” mode. We will now receive the html end tag which will move us to “after after body” mode. Receiving the end of file token will end the parsing.

Figure 10: tree construction of example html
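
The mode transitions of this walk-through can be replayed with a short sketch. It only handles the token sequence of the example (an html start tag, a body start tag, text, and the two end tags) and raises on anything else, so it illustrates insertion modes and token reprocessing rather than implementing the specification. The `Element` class, the token pairs and `construct_tree` are hypothetical names.

```python
class Element:
    def __init__(self, name):
        self.name = name
        self.text = ""  # only used by "#text" nodes
        self.children = []


def construct_tree(tokens):
    """Return (document, list of insertion modes visited)."""
    document = Element("#document")
    stack = [document]  # stack of open elements
    mode = "initial"
    modes = [mode]

    def switch(new_mode):
        nonlocal mode
        mode = new_mode
        modes.append(new_mode)

    def open_element(name):
        node = Element(name)
        stack[-1].children.append(node)
        stack.append(node)

    for kind, value in tokens:
        if kind == "character" and mode != "in body":
            if value.isspace():
                continue  # white space outside the body is dropped in this sketch
            raise ValueError("text outside the body is not handled in this sketch")
        done = False
        while not done:
            done = True
            if mode == "initial":
                switch("before html")
                done = False  # reprocess the same token in the new mode
            elif mode == "before html" and (kind, value) == ("start tag", "html"):
                open_element("html")
                switch("before head")
            elif mode == "before head" and (kind, value) == ("start tag", "body"):
                open_element("head")  # created implicitly, no head token was seen
                switch("in head")
                stack.pop()  # the implicit head is closed straight away
                switch("after head")
                done = False  # reprocess the body token
            elif mode == "after head" and (kind, value) == ("start tag", "body"):
                open_element("body")
                switch("in body")
            elif mode == "in body" and kind == "character":
                siblings = stack[-1].children
                if siblings and siblings[-1].name == "#text":
                    siblings[-1].text += value  # later characters extend the Text node
                else:
                    text = Element("#text")
                    text.text = value
                    siblings.append(text)
            elif mode == "in body" and (kind, value) == ("end tag", "body"):
                switch("after body")
            elif mode == "after body" and (kind, value) == ("end tag", "html"):
                switch("after after body")
            else:
                raise ValueError(f"token {kind} {value!r} is not handled in mode {mode!r}")
    return document, modes
```

The returned list of modes makes the transitions visible, including the implicit `head` element that is created even though the input never mentions it.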

Actions when the parsing is finished

At this stage the browser will mark the document as interactive and start parsing scripts that are in “deferred” mode - those that should be executed after the document is parsed. The document state will then be set to “complete” and a “load” event will be fired.

You can see the full algorithms for tokenization and tree construction in HTML5 specification - http://www.w3.org/TR/html5/syntax.html#html-parser

Browsers’ error tolerance

You never get an “Invalid Syntax” error on an HTML page. Browsers fix any invalid content and go on.
Take this HTML for example:

<html>
<mytag>
</mytag>
<div>
<p>
</div>
Really lousy HTML
</p>
</html>

I must have violated about a million rules (“mytag” is not a standard tag, wrong nesting of the “p” and “div” elements and more) but the browser still shows it correctly and doesn’t complain. So a lot of the parser code is fixing the HTML author mistakes.

The error handling is quite consistent in browsers but amazingly enough it’s not part of HTML current specification. Like bookmarking and back/forward buttons it’s just something that developed in browsers over the years. There are known invalid HTML constructs that repeat themselves in many sites and the browsers try to fix them in a conformant way with other browsers.

The HTML5 specification does define some of these requirements. Webkit summarizes this nicely in the comment at the beginning of the HTML parser class:

The parser parses tokenized input into the document, building up the document tree. If the document is well-formed, parsing it is straightforward.

Unfortunately, we have to handle many HTML documents that are not well-formed, so the parser has to be tolerant about errors.

We have to take care of at least the following error conditions:

1. The element being added is explicitly forbidden inside some outer tag.
In this case we should close all tags up to the one, which forbids the element, and add it afterwards.

2. We are not allowed to add the element directly.
It could be that the person writing the document forgot some tag in between (or that the tag in between is optional).
This could be the case with the following tags: HTML HEAD BODY TBODY TR TD LI (did I forget any?).

3. We want to add a block element inside to an inline element. Close all inline elements up to the next higher block element.

4. If this doesn't help, close elements until we are allowed to add the element or ignore the tag.

Let’s see some Webkit error tolerance examples:

</br> instead of <br>

Some sites use </br> instead of <br>. In order to be compatible with IE and Firefox, Webkit treats this like <br>.
The code:

if (t->isCloseTag(brTag) && m_document->inCompatMode()) {
    reportError(MalformedBRError);
    t->beginTag = true;
}

Note - the error handling is internal - it won’t be presented to the user.

A stray table

A stray table is a table inside another table contents but not inside a table cell.
Like this example:

<table>
<table>
<tr><td>inner table</td></tr>
</table>
<tr><td>outer table</td></tr>
</table>

Webkit will change the hierarchy to two sibling tables:

<table>
<tr><td>outer table</td></tr>
</table>
<table>
<tr><td>inner table</td></tr>
</table>

The code:

if (m_inStrayTableContent && localName == tableTag)
    popBlock(tableTag);

Webkit uses a stack for the current element contents - it will pop the inner table out of the outer table stack. The tables will now be siblings.

Nested form elements

In case the user puts a form inside another form, the second form is ignored.
The code:

if (!m_currentFormElement) {
    m_currentFormElement = new HTMLFormElement(formTag, m_document);
}

A too deep tag hierarchy

The comment speaks for itself.

// www.liceo.edu.mx is an example of a site that achieves a level of nesting of about 1500 tags, all from a bunch of <b>s.
// We will only allow at most 20 nested tags of the same type before just ignoring them all together.
bool HTMLParser::allowNestedRedundantTag(const AtomicString& tagName)
{
    unsigned i = 0;
    for (HTMLStackElem* curr = m_blockStack;
         i < cMaxRedundantTagDepth && curr && curr->tagName == tagName;
         curr = curr->next, i++) { }
    return i != cMaxRedundantTagDepth;
}

Misplaced html or body end tags

Again - the comment speaks for itself.

// Support for really broken html.
// We never close the body tag, since some stupid web pages close it before the actual end of the doc.
// Let's rely on the end() call to close things.
if (t->tagName == htmlTag || t->tagName == bodyTag)
    return;

So web authors beware - unless you want to appear as an example in Webkit’s error tolerance code, write well formed HTML.

CSS parsing

Remember the parsing concepts in the introduction? Well, unlike HTML, CSS is a context free grammar and can be parsed using the types of parsers described in the introduction. In fact the CSS specification defines CSS lexical and syntax grammar (http://www.w3.org/TR/CSS2/grammar.html).

Let’s see some examples:
The lexical grammar (vocabulary) is defined by regular expressions for each token:

comment   \/\*[^*]*\*+([^/*][^*]*\*+)*\/
num       [0-9]+|[0-9]*"."[0-9]+
nonascii  [\200-\377]
nmstart   [_a-z]|{nonascii}|{escape}
nmchar    [_a-z0-9-]|{nonascii}|{escape}
name      {nmchar}+
ident     {nmstart}{nmchar}*

“ident” is short for identifier, like a class name. “name” is an element id (that is referred to by “#”).
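
To see what these token definitions accept, they can be transcribed into Python regular expressions. The transcription below is deliberately simplified: the `{nonascii}` and `{escape}` alternatives are left out, case-insensitive matching is an assumption, and the constant names are made up for the example.

```python
import re

# Simplified transcription: only the ASCII parts of nmstart and nmchar are kept.
NMSTART = r"[_a-z]"
NMCHAR = r"[_a-z0-9-]"

IDENT = re.compile(rf"{NMSTART}{NMCHAR}*", re.IGNORECASE)
NAME = re.compile(rf"{NMCHAR}+", re.IGNORECASE)
NUM = re.compile(r"[0-9]+|[0-9]*\.[0-9]+")


def is_ident(text):
    return IDENT.fullmatch(text) is not None


def is_name(text):
    return NAME.fullmatch(text) is not None


def is_num(text):
    return NUM.fullmatch(text) is not None
```

Under these definitions “error” is both an ident and a name, while “2col” is a name but not an ident, because an ident cannot start with a digit.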

The syntax grammar is described in BNF.

ruleset
  : selector [ ',' S* selector ]*
    '{' S* declaration [ ';' S* declaration ]* '}' S*
  ;
selector
  : simple_selector [ combinator selector | S+ [ combinator selector ] ]
  ;
simple_selector
  : element_name [ HASH | class | attrib | pseudo ]*
  | [ HASH | class | attrib | pseudo ]+
  ;
class
  : '.' IDENT
  ;
element_name
  : IDENT | '*'
  ;
attrib
  : '[' S* IDENT S* [ [ '=' | INCLUDES | DASHMATCH ] S*
    [ IDENT | STRING ] S* ] ']'
  ;
pseudo
  : ':' [ IDENT | FUNCTION S* [IDENT S*] ')' ]
  ;

Explanation: A ruleset is this structure:

div.error , a.error {
  color:red;
  font-weight:bold;
}

div.error and a.error are selectors. The part inside the curly braces contains the rules that are applied by this ruleset. This structure is defined formally in this definition:

ruleset
  : selector [ ',' S* selector ]*
    '{' S* declaration [ ';' S* declaration ]* '}' S*
  ;

This means a ruleset is a selector, or optionally a number of selectors separated by a comma and spaces (S stands for white space). A ruleset contains curly braces and inside them a declaration, or optionally a number of declarations separated by a semicolon. “declaration” and “selector” will be defined in the following BNF definitions.
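
The shape of a ruleset can be illustrated by splitting one apart by hand. The sketch below is not a generated parser and does not follow the full grammar: it simply splits on the braces, commas, semicolons and colons, tolerates a trailing semicolon, and does not look inside selectors or values. `parse_ruleset` is a name invented for the example.

```python
def parse_ruleset(text):
    """Split a single ruleset into (list of selectors, dict of declarations)."""
    head, brace, rest = text.partition("{")
    body, closing, _ = rest.partition("}")
    if not brace or not closing:
        raise SyntaxError("a ruleset needs '{' and '}'")

    selectors = [selector.strip() for selector in head.split(",")]
    if not all(selectors):
        raise SyntaxError("empty selector")

    declarations = {}
    for chunk in body.split(";"):
        if not chunk.strip():
            continue  # tolerate a trailing semicolon
        prop, colon, value = chunk.partition(":")
        if not colon:
            raise SyntaxError(f"declaration without ':' in {chunk.strip()!r}")
        declarations[prop.strip()] = value.strip()
    return selectors, declarations
```

For the example ruleset this gives the selectors `div.error` and `a.error` and the declarations `color: red` and `font-weight: bold`.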

Webkit CSS parser

Webkit uses the Flex and Bison parser generators to create parsers automatically from the CSS grammar files. As you recall from the parser introduction, Bison creates a bottom up shift reduce parser. Firefox uses a top down parser written manually. In both cases each CSS file is parsed into a StyleSheet object, and each object contains CSS rules. The CSS rule objects contain selector and declaration objects and other objects corresponding to the CSS grammar.

Figure 7: parsing CSS

Parsing scripts

This will be dealt with in the chapter about JavaScript.

The order of processing scripts and style sheets

Scripts

The model of the web is synchronous. Authors expect scripts to be parsed and executed immediately when the parser reaches a

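Article title: How Browsers Work

Author: kinboy

Published: 2018-07-17 13:24:33

Last updated: 2019-07-15 18:05:10

Original link: http://kinboyw.github.io/2018/07/17/How-Does-Browser-Work/

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please retain the original link and author when reposting.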
