-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Implement object-based token_get_all() alternative #5176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pushed another commit that adds local string interning -- may or may not want to enable that, but this is just a technical detail. Some results for token_get_all() on a large PHP file (100x):
We can see that TOKEN_AS_OBJECT drops memory usage by ~40% and improves performance by ~20%. Interning further drops memory usage by 8% and regresses performance by 9%. As token_get_all() is usually "fast enough" (not the bottleneck at all), the memory usage saving is probably more valuable. |
Should the class be different named then PhpToken to reduce potential of conflicting with already existing userland classes? |
@staabm No, I don't care about that at all. But if there is a better name (on it's own merits, not because of concerns about conflicts), then of course we can consider that -- I'm not particularly attached to this one. |
I see. I would suggest |
Hi 😄 This is indeed already much better than the // before:
if (is_string($token) && $token === ';'); // === 59 would be worst here
else (is_array(($token) && $token[0] === T_OBJECT_OPERATOR);
// now
if ($token->text === ';');
else ($token->id === T_OBJECT_OPERATOR); Would it be damaging to have a single field for both id and text? Or to have an method if ($token->is(';'));
else ($token->is(T_OBJECT_OPERATOR)); Maybe there is advantage to have all in a single field if we consider how switch is often used: switch($token->type) // I could not think of a better name
case T_CLASS:
case T_FUNCTION:
case T_INTERFACE:
case ')':
case '}':
case ';':
// [...] Anyway it will be easy to wrap the token instances into a class with new methods on userland, at the cost of having two objects per token instead of one. |
@marcioAlmada I don't see how a single field for id/text would work. For tokens with more than one character, those do not have a one-to-one relationship. I think it's fine to compare id/text depending on whether it's a single-char token or a mutli-char token. An alternative would be to declare your own Having an
This could be more efficient internally. Another potentially useful method is However, I'm not entirely sure it's a good idea to include these initially. The previous discussion on this topic (https://externals.io/message/98610) degenerated because someone decided that as soon as an "object" is somehow involved, we need some ridiculously over-engineered object-oriented interface instead. I would like to avoid this. |
RFC: https://wiki.php.net/rfc/token_as_object I've included additional methods as an open question. |
Nice :) |
@nicolas-grekas Probably makes sense... I've added it as an open question for now: https://wiki.php.net/rfc/token_as_object#construction_method I was originally concerned about "combinatorial explosion" if each new flag gets a new function (or rather, multiplies all existing functions by two), but we don't really add new flags to this functionality that often... |
Agreed. These methods seem enough as a baseline for most people doing tokenizing/parsing work without compromising too much, so +1 Maybe the Probably my last suggestion on this, as the RFC and patch already looks good IMMO. A new public property to compute the ignorable state on construction, could be doable: class PhpToken {
/** One of the T_* constants, or an integer < 256 representing a single-char token. */
public int $id;
/** The textual content of the token. */
public string $text;
/** The starting line number (1-based) of the token. */
public int $line;
/** The starting position (0-based) in the tokenized string. */
public int $pos;
+ /** Indicates T_WHITESPACE, T_COMMENT, T_DOC_COMMENT, T_OPEN_TAG */
+ public bool $ignorable;
} However, at this point, that's just a nit-pick. In practice I'm not sure how general of an 'optimization' this would be. |
Added method implementations to this PR now... @marcioAlmada So there is a trade-off between memory usage and speed here. You are right that having a separate |
a3d38bb
to
b506865
Compare
Thanks for working on this - I'm looking forward to the time and memory improvements of this. I've noticed that if you add instance properties to the subclass of PhpToken, they don't get instantiated to the default - they're uninitialized instead. If this is intended, this should be documented in the RFC. --TEST--
Extending the PhpToken class with extra properties
--FILE--
<?php
$code = <<<'PHP'
Hello, world!
PHP;
// Observed: When MyPhpToken::getAll() is called, otherProp is absent
// Expected: Properties get initialized to the default constant expressions.
// If default constant expressions are invalid (E.g. MISSING_CONSTANT), it throws.
class MyPhpToken extends PhpToken {
public $otherProp = 123;
public function getLoweredText(): string {
return strtolower($this->text);
}
}
var_export(MyPhpToken::getAll('hello, world'));
// The below is what actually happens, not the desired behavior
?>
--EXPECT--
array (
0 =>
MyPhpToken::__set_state(array(
'id' => 312,
'text' => 'hello, world',
'line' => 1,
'pos' => 0,
)),
) |
cc @sebastianbergmann re php-token-stream |
@TysonAndre Nice catch! That wasn't intentional, and initialization of extra properties is now fixed. I've taken care to implement this in a way that only adds additional runtime cost if there really are extra properties. |
1bc1617
to
fa67619
Compare
Looking good :) Wondering, when dealing with the token name. When the token name is unknown (because the id is), can we maybe update to either;
basically, what I don't like is currently the magic value <?php
$a = token_name(-1); // $a = 'UNKNOWN'; Also, does the setup allow for custom tokens? So maybe something like |
@SpacePossum Returning
This will increase memory usage, so I don't think it's a good idea. The class does allow extension though, so you can extend the |
Returning the character as name for tokens with id < 256 makes perfect sense to me, than 👍 on allowing extending the Token class, that will work fine for our use-case. |
I think the token class extension solves all the issues regarding having behavior and memory trade offs (avoid userland composition)... maybe it's better to drop the default methods or do we intend to keep them due to efficiency? |
fa67619
to
86bdc84
Compare
Return instances of "static". Make the constructor final to allow this.
86bdc84
to
ab7c362
Compare
> The new PhpToken class adds an object-based interface to the tokenizer. > It provides a more uniform and ergonomic representation, while being more memory efficient and faster. Refs: * https://wiki.php.net/rfc/token_as_object * https://github.com/php/php-src/blob/4522cbb789ea4d0b70b5a1bdf7f3c0f4648d8fb7/UPGRADING#L929-L933 * php/php-src#5176 * php/php-src@5a09b9f
> The new PhpToken class adds an object-based interface to the tokenizer. > It provides a more uniform and ergonomic representation, while being more memory efficient and faster. Refs: * https://wiki.php.net/rfc/token_as_object * https://github.com/php/php-src/blob/4522cbb789ea4d0b70b5a1bdf7f3c0f4648d8fb7/UPGRADING#L929-L933 * php/php-src#5176 * php/php-src@5a09b9f
RFC: https://wiki.php.net/rfc/token_as_object
Adds new object-based interface for token_get_all(). See RFC for details.