@@ -7,21 +7,22 @@ compiled into machine-readable code ahead of time. Instead, the source files are
7
7
interpreted when the program is executed. This can be very convenient for developers for rapid
8
8
prototyping, as it skips a lengthy compilation phase. However, it also poses some unique challenges
9
9
to performance, which is one of the primary reasons interpreters can be complex. php-src borrows
10
- many concepts from compilers and other interpreters.
10
+ many concepts from other compilers and interpreters.
11
11
12
12
**********
13
- Concepts
13
+ Pipeline
14
14
**********
15
15
16
- The goal of the interpreter is to read the users source files from disk, and to simulate the users
17
- intent. This process can be split into distinct phases that are easier to understand and implement.
16
+ The goal of the interpreter is to read the user's source files and to simulate the user's intent.
17
+ This process can be split into distinct phases that are easier to understand and implement.
18
18
19
19
- Tokenization - splitting whole source files into words, called tokens.
20
20
- Parsing - building a tree structure from tokens, called AST (abstract syntax tree).
21
- - Compilation - turning the tree structure into a list of operations, called opcodes.
21
+ - Compilation - traversing the AST and building a list of operations, called opcodes.
22
22
- Interpretation - reading and executing opcodes.
23
23
24
- php-src as a whole can be seen as a pipeline consisting of these stages.
24
+ php-src as a whole can be seen as a pipeline consisting of these stages, each consuming the output
25
+ of the previous phase and producing the input for the next.
25
26
26
27
.. code:: haskell
27
28
@@ -31,7 +32,7 @@ php-src as a whole can be seen as a pipeline consisting of these stages.
31
32
|> compiler -- opcodes
32
33
|> interpreter
33
34
34
- Let's go into these phases in a bit more detail.
35
+ Let's go into each phase in a bit more detail.
35
36
36
37
**************
37
38
Tokenization
@@ -76,97 +77,73 @@ stream of characters. The definition for PHP lives in ``Zend/zend_language_scann
76
77
*********
77
78
78
79
Parsing is the process of reading the tokens generated from the tokenizer and building a tree
79
- structure from it. To humans, nesting seems obvious when looking at source code, given indentation
80
- through whitespace and the usage of symbols like ``()`` and ``{}``. The tokens are transformed into
81
- a tree structure to more closely reflect the source code the way humans see it. In PHP, the AST is
82
- represented by generic AST nodes with a ``kind`` field. There are "normal" nodes with a
83
- predetermined number of children, lists with an arbitrary number of children, and
84
- :doc:`../core/data-structures/zval` nodes that store some underlying primitive value, like a string.
80
+ structure from it. To humans, how source code elements are grouped seems obvious through whitespace
81
+ and the usage of symbols like ``()`` and ``{}``. However, computers cannot visually glance over the
82
+ code to determine these boundaries quickly. To make it easier and faster to work with, we build a
83
+ tree structure from the tokens to more closely reflect the source code the way humans see it.
85
84
86
85
Here is a simplified example of what an AST from the tokens above might look like.
87
86
88
87
.. code:: text
89
88
90
- zend_ast_list {
91
- kind: ZEND_AST_IF,
92
- children: 1,
93
- child: [
94
- zend_ast {
95
- kind: ZEND_AST_IF_ELEM,
96
- child: [
97
- zend_ast {
98
- kind: ZEND_AST_VAR,
99
- child: [
100
- zend_ast_zval {
101
- kind: ZEND_AST_ZVAL,
102
- zval: "cond",
103
- },
104
- ],
105
- },
106
- zend_ast_list {
107
- kind: ZEND_AST_STMT_LIST,
108
- children: 1,
109
- child: [
110
- zend_ast {
111
- kind: ZEND_AST_ECHO,
112
- child: [
113
- zend_ast_zval {
114
- kind: ZEND_AST_ZVAL,
115
- zval: "Cond is true\n",
116
- },
117
- ],
118
- },
119
- ],
120
- },
121
- ],
89
+ ZEND_AST_IF {
90
+ ZEND_AST_IF_ELEM {
91
+ ZEND_AST_VAR {
92
+ ZEND_AST_ZVAL { "cond" },
122
93
},
123
- ],
94
+ ZEND_AST_STMT_LIST {
95
+ ZEND_AST_ECHO {
96
+ ZEND_AST_ZVAL { "Cond is true\n" },
97
+ },
98
+ },
99
+ },
124
100
}
125
101
126
- The nodes may also store additional flags in the ``attr`` field for various purposes depending on
127
- the node kind. They also store their original position in the source code in the ``lineno`` field.
128
- These fields are omitted in the example for brevity.
102
+ Each AST node has a type and may have children. Nodes also store their original position in the
103
+ source code, and may define some arbitrary flags. These are omitted for brevity.
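As a rough illustration of the node shape described above, here is a minimal Python sketch. The ``AstNode`` class, its field names, and the ``dump`` helper are hypothetical and chosen for this example only; they do not mirror the actual C structures in php-src.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AstNode:
    """Hypothetical, simplified AST node; not the real php-src layout."""
    kind: str                                   # node type, e.g. "ZEND_AST_ECHO"
    children: List["AstNode"] = field(default_factory=list)
    value: Optional[str] = None                 # payload, only set on ZEND_AST_ZVAL leaves
    lineno: int = 0                             # original line in the source file
    attr: int = 0                               # kind-specific flags


def dump(node: AstNode, depth: int = 0) -> List[str]:
    """Return one indented line per node, visiting parents before children."""
    label = node.kind if node.value is None else f"{node.kind} {node.value!r}"
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


# Roughly: if ($cond) { echo "Cond is true\n"; }
example_ast = AstNode("ZEND_AST_IF", lineno=1, children=[
    AstNode("ZEND_AST_IF_ELEM", lineno=1, children=[
        AstNode("ZEND_AST_VAR", lineno=1, children=[
            AstNode("ZEND_AST_ZVAL", value="cond", lineno=1),
        ]),
        AstNode("ZEND_AST_STMT_LIST", lineno=2, children=[
            AstNode("ZEND_AST_ECHO", lineno=2, children=[
                AstNode("ZEND_AST_ZVAL", value="Cond is true\n", lineno=2),
            ]),
        ]),
    ]),
])

print("\n".join(dump(example_ast)))
```

The line numbers assigned here are made up; a real parser would record them while consuming tokens.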
129
104
130
105
Like with tokenization, we use a tool called ``Bison`` to generate the parser implementation from a
131
106
grammar specification. The grammar lives in the ``Zend/zend_language_parser.y`` file. Check the
132
107
`Bison documentation`_ for details. Luckily, the syntax is quite approachable.
133
108
134
109
.. _bison documentation: https://www.gnu.org/software/bison/manual/
135
110
111
+ Parsing is described in more detail in its `dedicated chapter <todo>`__.
112
+
136
113
*************
137
114
Compilation
138
115
*************
139
116
140
117
Computers don't understand human language, or even programming languages. They only understand
141
118
machine code, which consists of sequences of simple, mostly atomic instructions for doing one thing. For
142
119
example, they may add two numbers, load some memory from RAM, jump to an instruction under a certain
143
- condition, etc. It turns out that even complex expressions can be reduced to a number of these
144
- simple instructions.
120
+ condition, etc. It turns out that even the most complex expressions can be reduced to a number of
121
+ these simple instructions.
145
122
146
123
PHP is a bit different, in that it does not execute machine code directly. Instead, instructions run
147
124
on a "virtual machine", often abbreviated to VM. This is just a fancy way of saying that there is no
148
- physical machine that understands these instructions, but that this machine is implemented in
149
- software. This is our interpreter. This also means that we are free to make up instructions
150
- ourselves at will. Some of these instructions look very similar to something you'd find in an actual
151
- CPU instruction set (e.g. adding two numbers), while others are on a much higher level (e.g. load
152
- property of object by name).
125
+ physical machine you can buy that understands these instructions, but that this machine is
126
+ implemented in software. This is our interpreter. This also means that we are free to make up
127
+ instructions ourselves at will. Some of these instructions look very similar to something you'd find
128
+ in an actual CPU instruction set (e.g. adding two numbers), while others are much more high-level
129
+ (e.g. load property of object by name).
153
130
154
131
With that little detour out of the way, the job of the compiler is to read the AST and translate it
155
- into our virtual machine instructions, also called opcodes. This code lives in
156
- ``Zend/zend_compile.c``. The compiler is invoked for each function in your program, and generates a
157
- list of opcodes.
132
+ into our virtual machine instructions, also called opcodes. The code responsible for this
133
+ transformation lives in ``Zend/zend_compile.c``. It essentially traverses the AST and generates a
134
+ number of instructions for each node before moving on to the next.
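The traversal can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the tuple-based AST encoding, the ``operand``, ``compile_node`` and ``compile_script`` helpers, and the ``("CV", name)`` / ``("CONST", value)`` operand tags. It handles only the node kinds from the running example and only an ``if`` with a single branch; the real compiler in ``Zend/zend_compile.c`` is far more involved.

```python
# Hypothetical AST encoding for this sketch: (kind, child, child, ...).
# ZEND_AST_ZVAL leaves hold their value directly: ("ZEND_AST_ZVAL", value).

def operand(node):
    """Turn a variable or literal node into an instruction operand."""
    kind = node[0]
    if kind == "ZEND_AST_VAR":
        return ("CV", node[1][1])         # variable, named by its ZVAL child
    if kind == "ZEND_AST_ZVAL":
        return ("CONST", node[1])         # literal value
    raise NotImplementedError(kind)


def compile_node(node, ops):
    """Append the instructions for one node, recursing into its children."""
    kind = node[0]
    if kind == "ZEND_AST_STMT_LIST":
        for stmt in node[1:]:
            compile_node(stmt, ops)
    elif kind == "ZEND_AST_ECHO":
        ops.append(["ECHO", operand(node[1]), None])
    elif kind == "ZEND_AST_IF":
        # Simplification: exactly one branch, no elseif/else.
        if_elem = node[1]
        cond, body = if_elem[1], if_elem[2]
        jmpz = ["JMPZ", operand(cond), None]
        ops.append(jmpz)
        compile_node(body, ops)
        jmpz[2] = len(ops)                # back-patch: jump past the body
    else:
        raise NotImplementedError(kind)


def compile_script(ast):
    """Compile a whole script and terminate it with an implicit return."""
    ops = []
    compile_node(ast, ops)
    ops.append(["RETURN", ("CONST", 1), None])
    return ops


example_ast = (
    "ZEND_AST_IF",
    ("ZEND_AST_IF_ELEM",
        ("ZEND_AST_VAR", ("ZEND_AST_ZVAL", "cond")),
        ("ZEND_AST_STMT_LIST",
            ("ZEND_AST_ECHO", ("ZEND_AST_ZVAL", "Cond is true\n")))),
)

for index, (name, op1, op2) in enumerate(compile_script(example_ast)):
    print(f"{index:04d} {name} {op1!r} {op2!r}")
```

Note how the jump target of ``JMPZ`` is only known after the body has been compiled, which is why the sketch fills it in afterwards.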
158
135
159
- Here's what the opcodes for the AST above might look like:
136
+ Here's what the surprisingly compact opcodes for the AST above might look like:
160
137
161
138
.. code:: text
162
139
163
140
0000 JMPZ CV0($cond) 0002
164
141
0001 ECHO string("Cond is true\n")
165
142
0002 RETURN int(1)
166
143
167
- *************
168
- Interpreter
169
- *************
144
+ ****************
145
+ Interpretation
146
+ ****************
170
147
171
148
Finally, the opcodes are read and executed by the interpreter. PHP uses `three-address code`_ for
172
149
instructions. This essentially means that each instruction may have a result value, and at most two
@@ -176,9 +153,8 @@ operands. Most modern CPUs also use this format. Both result and operands in PHP
176
153
.. _three-address code: https://en.wikipedia.org/wiki/Three-address_code
177
154
178
155
How exactly each opcode behaves depends on its purpose. You can find a complete list of opcodes in
179
- the generated ``Zend/zend_vm_opcodes.h`` file. The VM lives mostly in the ``Zend/zend_vm_def.h``
180
- file, which contains custom DSL that is expanded by ``Zend/zend_vm_gen.php`` to generate the
181
- ``Zend/zend_vm_execute.h`` file, containing the actual VM code.
156
+ the generated ``Zend/zend_vm_opcodes.h`` file. The behavior of each instruction is defined in
157
+ ``Zend/zend_vm_def.h``.
182
158
183
159
Let's step through the opcodes from the example above:
184
160
@@ -193,18 +169,16 @@ Let's step through the opcodes form the example above:
193
169
With these simple rules, we can see that the interpreter will ``echo`` only when ``$cond`` is
194
170
truthy, and skip over the ``echo`` otherwise.
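To make the stepping concrete, here is a toy interpreter loop in Python for the three opcodes in the example. The tuple encoding of instructions, the ``("CV", name)`` / ``("CONST", value)`` operand tags, and the ``execute`` function are assumptions made for this sketch and are unrelated to how ``Zend/zend_vm_def.h`` is written. Python's notion of truthiness also stands in for PHP's, which differs in some cases.

```python
def execute(ops, variables):
    """Run (opcode, op1, op2) tuples until RETURN; return (value, output)."""

    def read(op):
        # Resolve an operand to a value: variables are looked up by name,
        # constants are used as they are.
        tag, payload = op
        return variables[payload] if tag == "CV" else payload

    output = []
    pc = 0                                # index of the next instruction
    while True:
        opcode, op1, op2 = ops[pc]
        if opcode == "JMPZ":
            # Jump to op2 when op1 is falsy, otherwise fall through.
            pc = op2 if not read(op1) else pc + 1
        elif opcode == "ECHO":
            output.append(str(read(op1)))
            pc += 1
        elif opcode == "RETURN":
            return read(op1), "".join(output)
        else:
            raise NotImplementedError(opcode)


# Hand-written equivalent of the opcode listing above.
OPS = [
    ("JMPZ", ("CV", "cond"), 2),
    ("ECHO", ("CONST", "Cond is true\n"), None),
    ("RETURN", ("CONST", 1), None),
]

print(execute(OPS, {"cond": True}))
print(execute(OPS, {"cond": False}))
```

With ``cond`` set to a truthy value the loop visits instructions 0, 1 and 2; with a falsy value it goes from 0 straight to 2.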
195
171
196
- That's it! This is how PHP works, fundamentally. Of course, PHP consists of many more opcodes. The
197
- VM is quite complex, and will be discussed separately in the `virtual machine <todo>`__ chapter.
172
+ That's it! This is how PHP works, fundamentally. Of course, we skipped over a ton of details. The VM
173
+ is quite complex, and will be discussed separately in the `virtual machine <todo>`__ chapter.
198
174
199
175
*********
200
176
Opcache
201
177
*********
202
178
203
179
As you may imagine, running this whole pipeline every time PHP serves a request is time-consuming.
204
- Luckily, it is also not necessary. We can cache the opcodes in memory between requests. When a file
205
- is included, we can look for the file in cache, and verify via timestamp that it has not been
206
- modified since it was compiled. If it has not, we may reuse the opcodes from cache. This
207
- dramatically speeds up the execution of PHP programs. This is precisely what the opcache extension
180
+ Luckily, it is also not necessary. We can cache the opcodes in memory between requests, to skip over
181
+ all of the phases, except for the execution phase. This is precisely what the opcache extension
208
182
does. It lives in the ``ext/opcache`` directory.
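The caching idea can be sketched as a small in-memory table keyed by file path. The ``OpcodeCache`` class and its ``load`` method are hypothetical names for this example, and the sketch assumes one possible validation strategy, comparing a file modification timestamp supplied by the caller; it says nothing about how ``ext/opcache`` stores or shares its data.

```python
class OpcodeCache:
    """Toy opcode cache: compile a file once, reuse the result afterwards."""

    def __init__(self, compile_file):
        self._compile_file = compile_file   # callable: path -> list of opcodes
        self._entries = {}                  # path -> (mtime, opcodes)

    def load(self, path, mtime):
        """Return cached opcodes for path, recompiling if mtime changed."""
        entry = self._entries.get(path)
        if entry is not None and entry[0] == mtime:
            return entry[1]                 # hit: tokenizing, parsing, compiling skipped
        ops = self._compile_file(path)      # miss or stale entry: run the full pipeline
        self._entries[path] = (mtime, ops)
        return ops


compile_calls = []

def fake_compile(path):
    # Stand-in for the tokenizer, parser and compiler; records each call.
    compile_calls.append(path)
    return [("RETURN", ("CONST", 1), None)]


cache = OpcodeCache(fake_compile)
cache.load("index.php", mtime=100)          # first request: compiles
cache.load("index.php", mtime=100)          # second request: served from cache
cache.load("index.php", mtime=200)          # file changed: compiles again
print(len(compile_calls))
```

Only the execution phase has to run on a cache hit, which is where the time saving comes from.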
209
183
210
184
Opcache also performs some optimizations on the opcodes before caching them. As opcaches are