Commit e9ed474

Wording
1 parent 0115548 commit e9ed474

3 files changed: 57 additions, 87 deletions

docs/source/core/data-structures/zval.rst

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -5,7 +5,7 @@
55
PHP is a dynamic language. As such, a variable can typically contain a value of any type, and the
66
type of the variable may even change during the execution of the program. Under the hood, this is
77
implemented through the ``zval`` struct. It is one of the most important data structures in php-src.
8-
It is essentially a tagged union, meaning it consists of an integer tag, representing the type of
8+
It is essentially a "tagged union", meaning it consists of an integer tag, representing the type of
99
the variable, and a union for the value itself. Let's look at the value first.
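
To make the "tagged union" idea concrete, here is a minimal sketch in C. It is not the real ``zval`` definition: the ``toy_`` names, the set of types, and the field order are all assumptions chosen for illustration. It only shows the general pattern of one tag that records which union member is currently valid.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type tags; php-src defines its own constants. */
typedef enum { TOY_NULL, TOY_LONG, TOY_DOUBLE, TOY_STRING } toy_type;

/* A toy tagged union: a union for the payload plus a tag saying
 * which member of the union currently holds the value. */
typedef struct {
    union {
        int64_t     lval;
        double      dval;
        const char *str;
    } value;
    toy_type type;
} toy_val;

/* Writing always updates the payload and the tag together. */
static void toy_set_long(toy_val *v, int64_t n)
{
    v->value.lval = n;
    v->type = TOY_LONG;
}

static void toy_set_string(toy_val *v, const char *s)
{
    v->value.str = s;
    v->type = TOY_STRING;
}

/* Reading is only valid through the member that matches the tag. */
static int64_t toy_get_long(const toy_val *v)
{
    assert(v->type == TOY_LONG);
    return v->value.lval;
}
```

The same variable can hold an integer now and a string later: only the tag and the active union member change, while the storage stays the same.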
1010

1111
************
@@ -118,7 +118,7 @@ access the same data. The ``ZEND_ENDIAN_LOHI_3`` macro is used to guarantee orde
118118
big- and little-endian architectures.
119119

120120
If you're familiar with C, you'll know that the compiler likes to add padding to structures with
121-
odd sizes. It does that because the CPU can work with some offsets more efficiently than others.
121+
"odd" sizes. It does that because the CPU can work with some offsets more efficiently than others.
122122
Ignoring the ``zval.u2`` field for a second, our struct would be 12 bytes in total, 8 coming from
123123
``zval.value`` and 4 from ``zval.u1``. A compiler on a 64-bit architecture will generally bump this
124124
to 16 bytes by adding 4 bytes of useless padding. If this padding is added anyway, we might as well

docs/source/index.rst

Lines changed: 9 additions & 13 deletions
@@ -30,6 +30,10 @@ as various extensions that provide common functionality. This documentation is i
3030
understand how the interpreter works, how you can build and test changes, and how you can create
3131
extensions yourself.
3232

33+
This documentation is not intended to be comprehensive, but is meant to explain core concepts that
34+
are not easy to grasp by reading code alone. It describes best practices, and will frequently omit
35+
APIs that are discouraged for general use.
36+
3337
******************
3438
How to get help?
3539
******************
@@ -46,17 +50,9 @@ touch.
4650
Prerequisites
4751
***************
4852

49-
The php-src interpreter is written in C, and so are most of the bundled extensions. Extensions can
50-
also be written in C++. ext-intl is currently the only bundled extension written in C++. As such, it
51-
is advisable that you have *some* knowledge of C.
52-
53-
It is also advisable to get familiar with the semantics of PHP itself, as this will help you
54-
determine correct behavior for bugs, and desireable behavior for new language features.
53+
The php-src interpreter is written in C, and so are most of the bundled extensions. While extensions
54+
may also be written in C++, ext-intl is currently the only bundled extension to do so. It is
55+
advisable that you have *some* knowledge of C before jumping into php-src.
5556

56-
*********
57-
Content
58-
*********
59-
60-
This documentation is not intended to be comprehensive, but is meant to explain core concepts that
61-
are not easy to grasp by reading code alone. It describes best practices, and will frequently omit
62-
APIs that are discouraged for general use.
57+
It is also advisable to get familiar with the semantics of PHP itself, so that you may better
58+
differentiate between bugs and expected behavior, and model new language features.

docs/source/introduction/high-level-overview.rst

Lines changed: 46 additions & 72 deletions
@@ -7,21 +7,22 @@ compiled into machine-readable code ahead of time. Instead, the source files are
77
interpreted when the program is executed. This can be very convenient for developers for rapid
88
prototyping, as it skips a lengthy compilation phase. However, it also poses some unique challenges
99
to performance, which is one of the primary reasons interpreters can be complex. php-src borrows
10-
many concepts from compilers and other interpreters.
10+
many concepts from other compilers and interpreters.
1111

1212
**********
13-
Concepts
13+
Pipeline
1414
**********
1515

16-
The goal of the interpreter is to read the user's source files from disk, and to simulate the user's
17-
intent. This process can be split into distinct phases that are easier to understand and implement.
16+
The goal of the interpreter is to read the user's source files, and to simulate the user's intent.
17+
This process can be split into distinct phases that are easier to understand and implement.
1818

1919
- Tokenization - splitting whole source files into words, called tokens.
2020
- Parsing - building a tree structure from tokens, called AST (abstract syntax tree).
21-
- Compilation - turning the tree structure into a list of operations, called opcodes.
21+
- Compilation - traversing the AST and building a list of operations, called opcodes.
2222
- Interpretation - reading and executing opcodes.
2323

24-
php-src as a whole can be seen as a pipeline consisting of these stages.
24+
php-src as a whole can be seen as a pipeline consisting of these stages, using the input of the
25+
previous phase and producing some output for the next.
2526

2627
.. code:: haskell
2728
@@ -31,7 +32,7 @@ php-src as a whole can be seen as a pipeline consisting of these stages.
3132
|> compiler -- opcodes
3233
|> interpreter
3334
34-
Let's go into these phases in a bit more detail.
35+
Let's go into each phase in a bit more detail.
3536

3637
**************
3738
Tokenization
@@ -76,97 +77,73 @@ stream of characters. The definition for PHP lives in ``Zend/zend_language_scann
7677
*********
7778

7879
Parsing is the process of reading the tokens generated from the tokenizer and building a tree
79-
structure from it. To humans, nesting seems obvious when looking at source code, given indentation
80-
through whitespace and the usage of symbols like ``()`` and ``{}``. The tokens are transformed into
81-
a tree structure to more closely reflect the source code the way humans see it. In PHP, the AST is
82-
represented by generic AST nodes with a ``kind`` field. There are "normal" nodes with a
83-
predetermined number of children, lists with an arbitrary number of children, and
84-
:doc:`../core/data-structures/zval` nodes that store some underlying primitive value, like a string.
80+
structure from it. To humans, how source code elements are grouped seems obvious through whitespace
81+
and the usage of symbols like ``()`` and ``{}``. However, computers cannot visually glance over the
82+
code to determine these boundaries quickly. To make it easier and faster to work with, we build a
83+
tree structure from the tokens to more closely reflect the source code the way humans see it.
8584

8685
Here is a simplified example of what an AST from the tokens above might look like.
8786

8887
.. code:: text
8988
90-
zend_ast_list {
91-
kind: ZEND_AST_IF,
92-
children: 1,
93-
child: [
94-
zend_ast {
95-
kind: ZEND_AST_IF_ELEM,
96-
child: [
97-
zend_ast {
98-
kind: ZEND_AST_VAR,
99-
child: [
100-
zend_ast_zval {
101-
kind: ZEND_AST_ZVAL,
102-
zval: "cond",
103-
},
104-
],
105-
},
106-
zend_ast_list {
107-
kind: ZEND_AST_STMT_LIST,
108-
children: 1,
109-
child: [
110-
zend_ast {
111-
kind: ZEND_AST_ECHO,
112-
child: [
113-
zend_ast_zval {
114-
kind: ZEND_AST_ZVAL,
115-
zval: "Cond is true\n",
116-
},
117-
],
118-
},
119-
],
120-
},
121-
],
89+
ZEND_AST_IF {
90+
ZEND_AST_IF_ELEM {
91+
ZEND_AST_VAR {
92+
ZEND_AST_ZVAL { "cond" },
12293
},
123-
],
94+
ZEND_AST_STMT_LIST {
95+
ZEND_AST_ECHO {
96+
ZEND_AST_ZVAL { "Cond is true\n" },
97+
},
98+
},
99+
},
124100
}
125101
126-
The nodes may also store additional flags in the ``attr`` field for various purposes depending on
127-
the node kind. They also store their original position in the source code in the ``lineno`` field.
128-
These fields are omitted in the example for brevity.
102+
Each AST node has a type and may have children. They also store their original position in the
103+
source code, and may define some arbitrary flags. These are omitted for brevity.
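
As a rough sketch of what such a node could look like in C, the block below models a generic node with a kind, flags, a source line, and children, then builds the tree from the example. The ``toy_`` names, the fixed child limit, and the line numbers are assumptions made for illustration. The real node types in php-src are laid out differently.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical node kinds mirroring the example tree. */
typedef enum {
    TOY_AST_IF, TOY_AST_IF_ELEM, TOY_AST_VAR,
    TOY_AST_STMT_LIST, TOY_AST_ECHO, TOY_AST_VALUE
} toy_ast_kind;

#define TOY_AST_MAX_CHILDREN 4

typedef struct toy_ast {
    toy_ast_kind    kind;    /* the node's type */
    uint16_t        attr;    /* arbitrary per-kind flags (unused here) */
    uint32_t        lineno;  /* original position in the source */
    const char     *value;   /* payload, only set for TOY_AST_VALUE */
    size_t          num_children;
    struct toy_ast *child[TOY_AST_MAX_CHILDREN];
} toy_ast;

static toy_ast *toy_ast_new(toy_ast_kind kind, uint32_t lineno, const char *value)
{
    toy_ast *node = calloc(1, sizeof(*node));
    assert(node != NULL);
    node->kind = kind;
    node->lineno = lineno;
    node->value = value;
    return node;
}

/* Appends a child and returns the parent, so calls can be nested. */
static toy_ast *toy_ast_add(toy_ast *parent, toy_ast *child)
{
    assert(parent->num_children < TOY_AST_MAX_CHILDREN);
    parent->child[parent->num_children++] = child;
    return parent;
}

/* Counts a node and all of its descendants. */
static size_t toy_ast_count(const toy_ast *node)
{
    size_t total = 1;
    for (size_t i = 0; i < node->num_children; i++) {
        total += toy_ast_count(node->child[i]);
    }
    return total;
}

static void toy_ast_free(toy_ast *node)
{
    for (size_t i = 0; i < node->num_children; i++) {
        toy_ast_free(node->child[i]);
    }
    free(node);
}

/* Builds the tree from the example; the line numbers are made up. */
static toy_ast *toy_ast_build_example(void)
{
    toy_ast *var = toy_ast_add(toy_ast_new(TOY_AST_VAR, 1, NULL),
                               toy_ast_new(TOY_AST_VALUE, 1, "cond"));
    toy_ast *echo = toy_ast_add(toy_ast_new(TOY_AST_ECHO, 2, NULL),
                                toy_ast_new(TOY_AST_VALUE, 2, "Cond is true\n"));
    toy_ast *body = toy_ast_add(toy_ast_new(TOY_AST_STMT_LIST, 1, NULL), echo);
    toy_ast *elem = toy_ast_add(
        toy_ast_add(toy_ast_new(TOY_AST_IF_ELEM, 1, NULL), var), body);
    return toy_ast_add(toy_ast_new(TOY_AST_IF, 1, NULL), elem);
}
```

The fixed-size child array keeps the sketch short. A real implementation would size the child storage per node kind or grow it on demand.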
129104

130105
Like with tokenization, we use a tool called ``Bison`` to generate the parser implementation from a
131106
grammar specification. The grammar lives in the ``Zend/zend_language_parser.y`` file. Check the
132107
`Bison documentation`_ for details. Luckily, the syntax is quite approachable.
133108

134109
.. _bison documentation: https://www.gnu.org/software/bison/manual/
135110

111+
Parsing is described in more detail in its `dedicated chapter <todo>`__.
112+
136113
*************
137114
Compilation
138115
*************
139116

140117
Computers don't understand human language, or even programming languages. They only understand
141118
machine code, which is a sequence of simple, mostly atomic instructions that each do one thing. For
142119
example, they may add two numbers, load some memory from RAM, jump to an instruction under a certain
143-
condition, etc. It turns out that even complex expressions can be reduced to a number of these
144-
simple instructions.
120+
condition, etc. It turns out that even the most complex expressions can be reduced to a number of
121+
these simple instructions.
145122

146123
PHP is a bit different, in that it does not execute machine code directly. Instead, instructions run
147124
on a "virtual machine", often abbreviated to VM. This is just a fancy way of saying that there is no
148-
physical machine that understands these instructions, but that this machine is implemented in
149-
software. This is our interpreter. This also means that we are free to make up instructions
150-
ourselves at will. Some of these instructions look very similar to something you'd find in an actual
151-
CPU instruction set (e.g. adding two numbers), while others are on a much higher level (e.g. load
152-
property of object by name).
125+
physical machine you can buy that understands these instructions, but that this machine is
126+
implemented in software. This is our interpreter. This also means that we are free to make up
127+
instructions ourselves at will. Some of these instructions look very similar to something you'd find
128+
in an actual CPU instruction set (e.g. adding two numbers), while others are much more high-level
129+
(e.g. load property of object by name).
153130

154131
With that little detour out of the way, the job of the compiler is to read the AST and translate it
155-
into our virtual machine instructions, also called opcodes. This code lives in
156-
``Zend/zend_compile.c``. The compiler is invoked for each function in your program, and generates a
157-
list of opcodes.
132+
into our virtual machine instructions, also called opcodes. The code responsible for this
133+
transformation lives in ``Zend/zend_compile.c``. It essentially traverses the AST and generates a
134+
number of instructions, before going to the next node.
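
The traversal can be sketched as follows. This is a deliberately tiny model and not the code in ``Zend/zend_compile.c``: the ``toy_`` types, the two statement kinds, and the fixed-size opcode array are assumptions. It shows one detail that a compiler of this kind needs: the jump target of a conditional is not known when the jump is emitted, so it is patched after the body has been compiled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced AST: an `if` with one body, or an `echo`. */
typedef enum { TOY_NODE_IF, TOY_NODE_ECHO } toy_node_kind;

typedef struct toy_node {
    toy_node_kind          kind;
    int                    var_slot; /* IF: slot of the condition variable */
    const struct toy_node *body;     /* IF: statement run when truthy */
    const char            *text;     /* ECHO: string to print */
} toy_node;

typedef enum { TOY_OP_JMPZ, TOY_OP_ECHO, TOY_OP_RETURN } toy_opcode;

typedef struct {
    toy_opcode  code;
    int         op1;    /* JMPZ: variable slot; RETURN: return value */
    size_t      target; /* JMPZ: index of the instruction to jump to */
    const char *text;   /* ECHO: string to print */
} toy_op;

#define TOY_MAX_OPS 16

typedef struct {
    toy_op ops[TOY_MAX_OPS];
    size_t count;
} toy_op_array;

/* Appends one instruction and returns its index. */
static size_t toy_emit(toy_op_array *a, toy_op op)
{
    assert(a->count < TOY_MAX_OPS);
    a->ops[a->count] = op;
    return a->count++;
}

static void toy_compile_stmt(toy_op_array *a, const toy_node *n)
{
    switch (n->kind) {
    case TOY_NODE_ECHO:
        toy_emit(a, (toy_op){ TOY_OP_ECHO, 0, 0, n->text });
        break;
    case TOY_NODE_IF: {
        /* Emit the jump first; its target is unknown until the body is done. */
        size_t jmp = toy_emit(a, (toy_op){ TOY_OP_JMPZ, n->var_slot, 0, NULL });
        toy_compile_stmt(a, n->body);
        a->ops[jmp].target = a->count; /* patch: skip to just after the body */
        break;
    }
    }
}

/* Compiles one statement and appends a final RETURN 1. */
static void toy_compile_file(toy_op_array *a, const toy_node *root)
{
    a->count = 0;
    toy_compile_stmt(a, root);
    toy_emit(a, (toy_op){ TOY_OP_RETURN, 1, 0, NULL });
}
```

Compiling an ``if`` node whose body is a single ``echo`` yields three instructions in this model: a conditional jump to index 2, the echo, and the return.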
158135

159-
Here's what the opcodes for the AST above might look like:
136+
Here's what the surprisingly compact opcodes for the AST above might look like:
160137

161138
.. code:: text
162139
163140
0000 JMPZ CV0($cond) 0002
164141
0001 ECHO string("Cond is true\n")
165142
0002 RETURN int(1)
166143
167-
*************
168-
Interpreter
169-
*************
144+
****************
145+
Interpretation
146+
****************
170147

171148
Finally, the opcodes are read and executed by the interpreter. PHP uses `three-address code`_ for
172149
instructions. This essentially means that each instruction may have a result value, and at most two
@@ -176,9 +153,8 @@ operands. Most modern CPUs also use this format. Both result and operands in PHP
176153
.. _three-address code: https://en.wikipedia.org/wiki/Three-address_code
177154

178155
How exactly each opcode behaves depends on its purpose. You can find a complete list of opcodes in
179-
the generated ``Zend/zend_vm_opcodes.h`` file. The VM lives mostly in the ``Zend/zend_vm_def.h``
180-
file, which contains custom DSL that is expanded by ``Zend/zend_vm_gen.php`` to generate the
181-
``Zend/zend_vm_execute.h`` file, containing the actual VM code.
156+
the generated ``Zend/zend_vm_opcodes.h`` file. The behavior of each instruction is defined in
157+
``Zend/zend_vm_def.h``.
182158

183159
Let's step through the opcodes from the example above:
184160

@@ -193,18 +169,16 @@ Let's step through the opcodes form the example above:
193169
With these simple rules, we can see that the interpreter will ``echo`` only when ``$cond`` is
194170
truthy, and skip over the ``echo`` otherwise.
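
The stepping rules can be sketched as a small dispatch loop. This is a toy model under stated assumptions, not the VM defined in ``Zend/zend_vm_def.h``: the ``toy_`` names are hypothetical, variables are plain integers, and ``ECHO`` appends to a caller-provided buffer instead of writing to the output layer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOY_JMPZ, TOY_ECHO, TOY_RETURN } toy_opcode;

typedef struct {
    toy_opcode  opcode;
    int         op1;    /* JMPZ: variable slot; RETURN: return value */
    size_t      target; /* JMPZ: index of the instruction to jump to */
    const char *text;   /* ECHO: string to print */
} toy_op;

/* The three opcodes from the example, with $cond stored in slot 0. */
static const toy_op toy_example[] = {
    { TOY_JMPZ,   0, 2, NULL },
    { TOY_ECHO,   0, 0, "Cond is true\n" },
    { TOY_RETURN, 1, 0, NULL },
};

/* Executes opcodes until RETURN; echoed text is appended to `out`. */
static int toy_execute(const toy_op *ops, const int *vars, char *out, size_t out_size)
{
    size_t ip = 0; /* index of the instruction to execute next */

    assert(out_size > 0);
    out[0] = '\0';
    for (;;) {
        const toy_op *op = &ops[ip];
        switch (op->opcode) {
        case TOY_JMPZ:
            /* Jump if the operand is falsy, otherwise fall through. */
            ip = (vars[op->op1] == 0) ? op->target : ip + 1;
            break;
        case TOY_ECHO:
            assert(strlen(out) + strlen(op->text) < out_size);
            strcat(out, op->text);
            ip++;
            break;
        case TOY_RETURN:
            return op->op1;
        }
    }
}
```

With ``vars[0]`` set to ``1`` the buffer ends up containing the echoed string. With ``vars[0]`` set to ``0`` the jump skips the ``ECHO`` and the buffer stays empty. Both runs return ``1``.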
195171

196-
That's it! This is how PHP works, fundamentally. Of course, PHP consists of many more opcodes. The
197-
VM is quite complex, and will be discussed separately in the `virtual machine <todo>`__ chapter.
172+
That's it! This is how PHP works, fundamentally. Of course, we skipped over a ton of details. The VM
173+
is quite complex, and will be discussed separately in the `virtual machine <todo>`__ chapter.
198174

199175
*********
200176
Opcache
201177
*********
202178

203179
As you may imagine, running this whole pipeline every time PHP serves a request is time-consuming.
204-
Luckily, it is also not necessary. We can cache the opcodes in memory between requests. When a file
205-
is included, we can look for the file in cache, and verify via timestamp that it has not been
206-
modified since it was compiled. If it has not, we may reuse the opcodes from cache. This
207-
dramatically speeds up the execution of PHP programs. This is precisely what the opcache extension
180+
Luckily, it is also not necessary. We can cache the opcodes in memory between requests, to skip over
181+
all of the phases, except for the execution phase. This is precisely what the opcache extension
208182
does. It lives in the ``ext/opcache`` directory.
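
The caching idea can be sketched as a lookup table keyed by file path. The earlier wording of this paragraph describes validating entries by timestamp, which is what this sketch assumes. Everything else (the ``toy_`` names, the fixed number of slots, the opaque ``opcodes`` pointer) is hypothetical and unrelated to how ``ext/opcache`` is actually structured.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define TOY_CACHE_SLOTS 8

typedef struct {
    const char *path;    /* script path, used as the cache key */
    time_t      mtime;   /* file modification time at compile time */
    const void *opcodes; /* opaque handle to the compiled opcodes */
} toy_cache_entry;

typedef struct {
    toy_cache_entry slots[TOY_CACHE_SLOTS];
    size_t          used;
} toy_cache;

/* Returns cached opcodes, or NULL if the file is unknown or was modified. */
static const void *toy_cache_lookup(const toy_cache *c, const char *path, time_t mtime)
{
    for (size_t i = 0; i < c->used; i++) {
        if (strcmp(c->slots[i].path, path) == 0) {
            return c->slots[i].mtime == mtime ? c->slots[i].opcodes : NULL;
        }
    }
    return NULL;
}

/* Stores freshly compiled opcodes, replacing a stale entry for the same path. */
static void toy_cache_store(toy_cache *c, const char *path, time_t mtime, const void *opcodes)
{
    for (size_t i = 0; i < c->used; i++) {
        if (strcmp(c->slots[i].path, path) == 0) {
            c->slots[i].mtime = mtime;
            c->slots[i].opcodes = opcodes;
            return;
        }
    }
    assert(c->used < TOY_CACHE_SLOTS);
    c->slots[c->used++] = (toy_cache_entry){ path, mtime, opcodes };
}
```

On a hit, the caller can go straight to executing the returned opcodes. On a miss, it runs tokenization, parsing, and compilation as usual and stores the result for the next request.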
209183

210184
Opcache also performs some optimizations on the opcodes before caching them. As opcaches are
