Implement object-based token_get_all() alternative #5176

nikic · 2020-02-12T15:38:58Z

RFC: https://wiki.php.net/rfc/token_as_object

Adds new object-based interface for token_get_all(). See RFC for details.

class PhpToken implements Stringable {
    /** One of the T_* constants, or an integer < 256 representing a single-char token. */
    public int $id;
    /** The textual content of the token. */
    public string $text;
    /** The starting line number (1-based) of the token. */
    public int $line;
    /** The starting position (0-based) in the tokenized string. */
    public int $pos;
 
    /**
     * Same as token_get_all(), but returning array of PhpToken.
     * @return static[]
     */
    public static function getAll(string $code, int $flags = 0): array;
 
    final public function __construct(int $id, string $text, int $line = -1, int $pos = -1);

    /** Get the name of the token. */
    public function getTokenName(): ?string;
 
    /**
     * Whether the token has the given ID, the given text,
     * or has an ID/text part of the given array.
     * 
     * @param int|string|array $kind
     */
    public function is($kind): bool;
 
    /** Whether this token would be ignored by the PHP parser. */
    public function isIgnorable(): bool;

    /** Same as $token->text. */
    public function __toString(): string;
}

nikic · 2020-02-12T16:17:04Z

Pushed another commit that adds local string interning -- may or may not want to enable that, but this is just a technical detail.

Some results for token_get_all() on a large PHP file (100x):

Without interning: 
    Default:
        Time: 0.42794013023376s
        Memory Usage: 14.037727355957MB
    TOKEN_AS_OBJECT:
        Time: 0.32422184944153s
        Memory Usage: 8.0559387207031MB
With interning: 
    Default:
        Time: 0.41672205924988s
        Memory Usage: 13.329658508301MB
    TOKEN_AS_OBJECT:
        Time: 0.35216188430786s
        Memory Usage: 7.3478698730469MB

We can see that TOKEN_AS_OBJECT drops memory usage by ~40% and improves performance by ~20%.

Interning further drops memory usage by 8% and regresses performance by 9%. As token_get_all() is usually "fast enough" (not the bottleneck at all), the memory usage saving is probably more valuable.

staabm · 2020-02-12T17:43:52Z

Should the class be different named then PhpToken to reduce potential of conflicting with already existing userland classes?

nikic · 2020-02-12T17:52:55Z

@staabm No, I don't care about that at all. But if there is a better name (on it's own merits, not because of concerns about conflicts), then of course we can consider that -- I'm not particularly attached to this one.

staabm · 2020-02-12T19:27:42Z

I see. I would suggest PhpParserToken because „token“ itself is a term which is used in a lot of different contexts

marcioAlmada · 2020-02-13T05:56:45Z

Hi 😄

This is indeed already much better than the array|string token mess. But the current interface still has a problem that always annoyed me, which is to decide how to compare the tokens:

// before:

if (is_string($token) && $token === ';'); // === 59 would be worst here
else (is_array(($token) && $token[0] === T_OBJECT_OPERATOR);

// now

if ($token->text === ';');
else ($token->id === T_OBJECT_OPERATOR);

Would it be damaging to have a single field for both id and text? Or to have an method is(int|string $id) method to handle the comparison?

if ($token->is(';'));
else ($token->is(T_OBJECT_OPERATOR));

Maybe there is advantage to have all in a single field if we consider how switch is often used:

switch($token->type) // I could not think of a better name
case T_CLASS:
case T_FUNCTION:
case T_INTERFACE:
case ')':
case '}':
case ';':
// [...]

Anyway it will be easy to wrap the token instances into a class with new methods on userland, at the cost of having two objects per token instead of one.

nikic · 2020-02-13T08:56:51Z

@marcioAlmada I don't see how a single field for id/text would work. For tokens with more than one character, those do not have a one-to-one relationship.

I think it's fine to compare id/text depending on whether it's a single-char token or a mutli-char token. An alternative would be to declare your own define('T_SEMICOLON', \ord(';')), if you would like to base everything on T_* constants.

Having an is() method still may be useful -- I think not so much for consolidating the check for id/text, but to support the common extension of specifying multiple allowed token types:

public function is(int|string|array $type) {
    if (is_array($type)) {
        foreach ($type as $singleType) {
            if ($this->is($singleType)) {
                return true;
            }
        }
        return false;
    } else if (is_string($type)) {
        return $this->text === $type;
    } else {
        return $this->id === $type,
    }
}

This could be more efficient internally. Another potentially useful method is isIgnorable(), which would basically be the same as is([T_WHITESPACE, T_COMMENT, T_DOC_COMMENT, T_OPEN_TAG]), just more efficient and matching PHP semantics. Another one is getTokenName(), so you don't have to go through token_name() for debug output.

However, I'm not entirely sure it's a good idea to include these initially. The previous discussion on this topic (https://externals.io/message/98610) degenerated because someone decided that as soon as an "object" is somehow involved, we need some ridiculously over-engineered object-oriented interface instead. I would like to avoid this.

nikic · 2020-02-13T09:42:46Z

RFC: https://wiki.php.net/rfc/token_as_object

I've included additional methods as an open question.

nicolas-grekas · 2020-02-13T09:59:56Z

Nice :)
Would it make sense to make the new behavior polyfillable, i.e. use a new function (or static method) instead of a flag?

nikic · 2020-02-13T10:30:11Z

@nicolas-grekas Probably makes sense... I've added it as an open question for now: https://wiki.php.net/rfc/token_as_object#construction_method

I was originally concerned about "combinatorial explosion" if each new flag gets a new function (or rather, multiplies all existing functions by two), but we don't really add new flags to this functionality that often...

marcioAlmada · 2020-02-13T11:33:06Z

RFC: https://wiki.php.net/rfc/token_as_object

I've included additional methods as an open question.

Agreed. These methods seem enough as a baseline for most people doing tokenizing/parsing work without compromising too much, so +1

Maybe the isIgnorable method could be further optimized. It's very common to see this method called at least once per token on hot paths, if the parsing method backtracks this can be much more, perhaps.

Probably my last suggestion on this, as the RFC and patch already looks good IMMO. A new public property to compute the ignorable state on construction, could be doable:

class PhpToken {
    /** One of the T_* constants, or an integer < 256 representing a single-char token. */
    public int $id;
    /** The textual content of the token. */
    public string $text;
    /** The starting line number (1-based) of the token. */
    public int $line;
    /** The starting position (0-based) in the tokenized string. */
    public int $pos;
+   /** Indicates T_WHITESPACE, T_COMMENT, T_DOC_COMMENT, T_OPEN_TAG */
+   public bool $ignorable;
}

However, at this point, that's just a nit-pick. In practice I'm not sure how general of an 'optimization' this would be.

nikic · 2020-02-13T12:38:04Z

Added method implementations to this PR now...

@marcioAlmada So there is a trade-off between memory usage and speed here. You are right that having a separate ignorable property is going to be more efficient, because accessing the property is cheaper than calling a method (even if the method implementation here is really cheap). However, having this boolean (1 bit of information...) would add 128 bits of extra memory usage to each token. As memory usage is pretty sensitive here, I think it's not a good tradeoff.

TysonAndre · 2020-02-16T17:33:31Z

Thanks for working on this - I'm looking forward to the time and memory improvements of this.

I've noticed that if you add instance properties to the subclass of PhpToken, they don't get instantiated to the default - they're uninitialized instead.
I don't know if the class this is based on would have the same issue.

If this is intended, this should be documented in the RFC.
If this is not intended, this should be fixed (e.g. take a slower code path for instantiating subclasses, check EG(exception), or make PhpToken final if that's unexpectedly impractical.)

--TEST--
Extending the PhpToken class with extra properties
--FILE--
<?php

$code = <<<'PHP'
Hello, world!
PHP;

// Observed: When MyPhpToken::getAll() is called, otherProp is absent
// Expected: Properties get initialized to the default constant expressions.
// If default constant expressions are invalid (E.g. MISSING_CONSTANT), it throws.
class MyPhpToken extends PhpToken {
    public $otherProp = 123;
    public function getLoweredText(): string {
        return strtolower($this->text);
    }
}

var_export(MyPhpToken::getAll('hello, world'));
// The below is what actually happens, not the desired behavior
?>
--EXPECT--
array (
  0 => 
  MyPhpToken::__set_state(array(
     'id' => 312,
     'text' => 'hello, world',
     'line' => 1,
     'pos' => 0,
  )),
)

ext/tokenizer/tests/PhpToken_constructor.phpt

carusogabriel · 2020-02-16T19:43:01Z

cc @gsherwood @jrfnl @kukulich @SpacePossum @julienfalque @keradus

jrfnl · 2020-02-16T21:32:20Z

cc @sebastianbergmann re php-token-stream

nikic · 2020-02-17T09:14:45Z

@TysonAndre Nice catch! That wasn't intentional, and initialization of extra properties is now fixed. I've taken care to implement this in a way that only adds additional runtime cost if there really are extra properties.

SpacePossum · 2020-02-19T10:27:54Z

Looking good :)

Wondering, when dealing with the token name. When the token name is unknown (because the id is), can we maybe update to either;

return null
throw an exception

basically, what I don't like is currently the magic value 'UNKNOWN' is returned;

<?php

$a = token_name(-1); // $a = 'UNKNOWN';

Also, does the setup allow for custom tokens?
For example we use a lot of custom id's and names so it is easier to analyze the tokens. For example all the different curly brace meanings.

So maybe something like Token::__construct(int $id, string $text, int $line, int $pos, string $name = null) ?

nikic · 2020-02-19T13:42:15Z

@SpacePossum Returning null instead of "UNKNOWN" would be a possibility. Not sure what should be returned for tokens < 256. I'm currently returning the character (so a ; token will return ; as the name), but possibly that should also be null?

So maybe something like Token::__construct(int $id, string $text, int $line, int $pos, string $name = null) ?

This will increase memory usage, so I don't think it's a good idea. The class does allow extension though, so you can extend the getTokenName() method to also map your custom IDs to the appropriate names.

SpacePossum · 2020-02-19T14:05:36Z

Returning the character as name for tokens with id < 256 makes perfect sense to me, than null always means "unknown".

👍 on allowing extending the Token class, that will work fine for our use-case.

marcioAlmada · 2020-02-20T13:15:26Z

I think the token class extension solves all the issues regarding having behavior and memory trade offs (avoid userland composition)... maybe it's better to drop the default methods or do we intend to keep them due to efficiency?

Return instances of "static". Make the constructor final to allow this.

> The new PhpToken class adds an object-based interface to the tokenizer. > It provides a more uniform and ergonomic representation, while being more memory efficient and faster. Refs: * https://wiki.php.net/rfc/token_as_object * https://github.com/php/php-src/blob/4522cbb789ea4d0b70b5a1bdf7f3c0f4648d8fb7/UPGRADING#L929-L933 * php/php-src#5176 * php/php-src@5a09b9f

nikic added the RFC label Feb 12, 2020

nikic mentioned this pull request Feb 12, 2020

Optimize memory consumption nikic/PHP-Parser#651

Closed

nikic force-pushed the token-get-all-object branch from a3d38bb to b506865 Compare February 14, 2020 16:09

TysonAndre reviewed Feb 16, 2020

View reviewed changes

ext/tokenizer/tests/PhpToken_constructor.phpt Show resolved Hide resolved

nikic changed the title ~~Implement TOKEN_AS_OBJECT mode~~ Implement object-based token_get_all() alternative Feb 17, 2020

nikic force-pushed the token-get-all-object branch from 1bc1617 to fa67619 Compare February 17, 2020 09:21

nikic force-pushed the token-get-all-object branch from fa67619 to 86bdc84 Compare March 2, 2020 17:34

nikic added 4 commits March 26, 2020 10:38

Implement TOKEN_AS_OBJECT mode

6284c9f

Locally intern strings in token_get_all()

b41e1ca

Add PhpToken methods

a1f7db4

Introduce PhpToken::getAll() instead of flag

d310410

nikic added 10 commits March 26, 2020 10:38

Add PhpToken::__construct()

9ad94a8

Allow extending PhpToken

18e53f4

Return instances of "static". Make the constructor final to allow this.

Fix bool/int confusion

1b6b962

Fix initialization of extra properties

df5fc1b

Update constants and check abstract classes

b782726

Add SKIPIF sections

a852dbd

Return null from getTokenName() if token unknown

8b58f41

Add __toString() method

24e5967

Locally intern strings in token_get_all()

10590f6

Add test

ab7c362

nikic force-pushed the token-get-all-object branch from 86bdc84 to ab7c362 Compare March 26, 2020 10:00

php-pulls closed this in 5a09b9f Mar 26, 2020

carusogabriel added this to the PHP 8.0 milestone May 29, 2020

jrfnl mentioned this pull request Aug 17, 2020

PHP 8.0: NewClasses: handle 3 new classes PHPCompatibility/PHPCompatibility#1211

Merged

Implement object-based token_get_all() alternative #5176

Implement object-based token_get_all() alternative #5176

Uh oh!

Conversation

nikic commented Feb 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nikic commented Feb 12, 2020

Uh oh!

staabm commented Feb 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nikic commented Feb 12, 2020

Uh oh!

staabm commented Feb 12, 2020

Uh oh!

marcioAlmada commented Feb 13, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nikic commented Feb 13, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nikic commented Feb 13, 2020

Uh oh!

nicolas-grekas commented Feb 13, 2020

Uh oh!

nikic commented Feb 13, 2020

Uh oh!

marcioAlmada commented Feb 13, 2020

Uh oh!

nikic commented Feb 13, 2020

Uh oh!

TysonAndre commented Feb 16, 2020

Uh oh!

Uh oh!

carusogabriel commented Feb 16, 2020

Uh oh!

jrfnl commented Feb 16, 2020

Uh oh!

nikic commented Feb 17, 2020

Uh oh!

SpacePossum commented Feb 19, 2020

Uh oh!

nikic commented Feb 19, 2020

Uh oh!

SpacePossum commented Feb 19, 2020

Uh oh!

marcioAlmada commented Feb 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

nikic commented Feb 12, 2020 •

edited

Loading

staabm commented Feb 12, 2020 •

edited

Loading

marcioAlmada commented Feb 13, 2020 •

edited

Loading

nikic commented Feb 13, 2020 •

edited

Loading

marcioAlmada commented Feb 20, 2020 •

edited

Loading