Skip to content

Implement object-based token_get_all() alternative #5176

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 14 commits into from

Conversation

nikic
Copy link
Member

@nikic nikic commented Feb 12, 2020

RFC: https://wiki.php.net/rfc/token_as_object

Adds new object-based interface for token_get_all(). See RFC for details.

class PhpToken implements Stringable {
    /** One of the T_* constants, or an integer < 256 representing a single-char token. */
    public int $id;
    /** The textual content of the token. */
    public string $text;
    /** The starting line number (1-based) of the token. */
    public int $line;
    /** The starting position (0-based) in the tokenized string. */
    public int $pos;
 
    /**
     * Same as token_get_all(), but returning array of PhpToken.
     * @return static[]
     */
    public static function getAll(string $code, int $flags = 0): array;
 
    final public function __construct(int $id, string $text, int $line = -1, int $pos = -1);

    /** Get the name of the token. */
    public function getTokenName(): ?string;
 
    /**
     * Whether the token has the given ID, the given text,
     * or has an ID/text part of the given array.
     * 
     * @param int|string|array $kind
     */
    public function is($kind): bool;
 
    /** Whether this token would be ignored by the PHP parser. */
    public function isIgnorable(): bool;

    /** Same as $token->text. */
    public function __toString(): string;
}

@nikic nikic added the RFC label Feb 12, 2020
@nikic
Copy link
Member Author

nikic commented Feb 12, 2020

Pushed another commit that adds local string interning -- may or may not want to enable that, but this is just a technical detail.

Some results for token_get_all() on a large PHP file (100x):

Without interning: 
    Default:
        Time: 0.42794013023376s
        Memory Usage: 14.037727355957MB
    TOKEN_AS_OBJECT:
        Time: 0.32422184944153s
        Memory Usage: 8.0559387207031MB
With interning: 
    Default:
        Time: 0.41672205924988s
        Memory Usage: 13.329658508301MB
    TOKEN_AS_OBJECT:
        Time: 0.35216188430786s
        Memory Usage: 7.3478698730469MB

We can see that TOKEN_AS_OBJECT drops memory usage by ~40% and improves performance by ~20%.

Interning further drops memory usage by 8% and regresses performance by 9%. As token_get_all() is usually "fast enough" (not the bottleneck at all), the memory usage saving is probably more valuable.

@staabm
Copy link
Contributor

staabm commented Feb 12, 2020

Should the class be different named then PhpToken to reduce potential of conflicting with already existing userland classes?

@nikic
Copy link
Member Author

nikic commented Feb 12, 2020

@staabm No, I don't care about that at all. But if there is a better name (on it's own merits, not because of concerns about conflicts), then of course we can consider that -- I'm not particularly attached to this one.

@staabm
Copy link
Contributor

staabm commented Feb 12, 2020

I see. I would suggest PhpParserToken because „token“ itself is a term which is used in a lot of different contexts

@marcioAlmada
Copy link
Contributor

marcioAlmada commented Feb 13, 2020

Hi 😄

This is indeed already much better than the array|string token mess. But the current interface still has a problem that always annoyed me, which is to decide how to compare the tokens:

// before:

if (is_string($token) && $token === ';'); // === 59 would be worst here
else (is_array(($token) && $token[0] === T_OBJECT_OPERATOR);

// now

if ($token->text === ';');
else ($token->id === T_OBJECT_OPERATOR);

Would it be damaging to have a single field for both id and text? Or to have an method is(int|string $id) method to handle the comparison?

if ($token->is(';'));
else ($token->is(T_OBJECT_OPERATOR));

Maybe there is advantage to have all in a single field if we consider how switch is often used:

switch($token->type) // I could not think of a better name
case T_CLASS:
case T_FUNCTION:
case T_INTERFACE:
case ')':
case '}':
case ';':
// [...]

Anyway it will be easy to wrap the token instances into a class with new methods on userland, at the cost of having two objects per token instead of one.

@nikic
Copy link
Member Author

nikic commented Feb 13, 2020

@marcioAlmada I don't see how a single field for id/text would work. For tokens with more than one character, those do not have a one-to-one relationship.

I think it's fine to compare id/text depending on whether it's a single-char token or a mutli-char token. An alternative would be to declare your own define('T_SEMICOLON', \ord(';')), if you would like to base everything on T_* constants.

Having an is() method still may be useful -- I think not so much for consolidating the check for id/text, but to support the common extension of specifying multiple allowed token types:

public function is(int|string|array $type) {
    if (is_array($type)) {
        foreach ($type as $singleType) {
            if ($this->is($singleType)) {
                return true;
            }
        }
        return false;
    } else if (is_string($type)) {
        return $this->text === $type;
    } else {
        return $this->id === $type,
    }
}

This could be more efficient internally. Another potentially useful method is isIgnorable(), which would basically be the same as is([T_WHITESPACE, T_COMMENT, T_DOC_COMMENT, T_OPEN_TAG]), just more efficient and matching PHP semantics. Another one is getTokenName(), so you don't have to go through token_name() for debug output.

However, I'm not entirely sure it's a good idea to include these initially. The previous discussion on this topic (https://externals.io/message/98610) degenerated because someone decided that as soon as an "object" is somehow involved, we need some ridiculously over-engineered object-oriented interface instead. I would like to avoid this.

@nikic
Copy link
Member Author

nikic commented Feb 13, 2020

RFC: https://wiki.php.net/rfc/token_as_object

I've included additional methods as an open question.

@nicolas-grekas
Copy link
Contributor

Nice :)
Would it make sense to make the new behavior polyfillable, i.e. use a new function (or static method) instead of a flag?

@nikic
Copy link
Member Author

nikic commented Feb 13, 2020

@nicolas-grekas Probably makes sense... I've added it as an open question for now: https://wiki.php.net/rfc/token_as_object#construction_method

I was originally concerned about "combinatorial explosion" if each new flag gets a new function (or rather, multiplies all existing functions by two), but we don't really add new flags to this functionality that often...

@marcioAlmada
Copy link
Contributor

RFC: https://wiki.php.net/rfc/token_as_object

I've included additional methods as an open question.

Agreed. These methods seem enough as a baseline for most people doing tokenizing/parsing work without compromising too much, so +1

Maybe the isIgnorable method could be further optimized. It's very common to see this method called at least once per token on hot paths, if the parsing method backtracks this can be much more, perhaps.

Probably my last suggestion on this, as the RFC and patch already looks good IMMO. A new public property to compute the ignorable state on construction, could be doable:

class PhpToken {
    /** One of the T_* constants, or an integer < 256 representing a single-char token. */
    public int $id;
    /** The textual content of the token. */
    public string $text;
    /** The starting line number (1-based) of the token. */
    public int $line;
    /** The starting position (0-based) in the tokenized string. */
    public int $pos;
+   /** Indicates T_WHITESPACE, T_COMMENT, T_DOC_COMMENT, T_OPEN_TAG */
+   public bool $ignorable;
}

However, at this point, that's just a nit-pick. In practice I'm not sure how general of an 'optimization' this would be.

@nikic
Copy link
Member Author

nikic commented Feb 13, 2020

Added method implementations to this PR now...

@marcioAlmada So there is a trade-off between memory usage and speed here. You are right that having a separate ignorable property is going to be more efficient, because accessing the property is cheaper than calling a method (even if the method implementation here is really cheap). However, having this boolean (1 bit of information...) would add 128 bits of extra memory usage to each token. As memory usage is pretty sensitive here, I think it's not a good tradeoff.

@nikic nikic force-pushed the token-get-all-object branch from a3d38bb to b506865 Compare February 14, 2020 16:09
@TysonAndre
Copy link
Contributor

Thanks for working on this - I'm looking forward to the time and memory improvements of this.

I've noticed that if you add instance properties to the subclass of PhpToken, they don't get instantiated to the default - they're uninitialized instead.
I don't know if the class this is based on would have the same issue.

If this is intended, this should be documented in the RFC.
If this is not intended, this should be fixed (e.g. take a slower code path for instantiating subclasses, check EG(exception), or make PhpToken final if that's unexpectedly impractical.)

--TEST--
Extending the PhpToken class with extra properties
--FILE--
<?php

$code = <<<'PHP'
Hello, world!
PHP;

// Observed: When MyPhpToken::getAll() is called, otherProp is absent
// Expected: Properties get initialized to the default constant expressions.
// If default constant expressions are invalid (E.g. MISSING_CONSTANT), it throws.
class MyPhpToken extends PhpToken {
    public $otherProp = 123;
    public function getLoweredText(): string {
        return strtolower($this->text);
    }
}

var_export(MyPhpToken::getAll('hello, world'));
// The below is what actually happens, not the desired behavior
?>
--EXPECT--
array (
  0 => 
  MyPhpToken::__set_state(array(
     'id' => 312,
     'text' => 'hello, world',
     'line' => 1,
     'pos' => 0,
  )),
)

@carusogabriel
Copy link
Contributor

cc @gsherwood @jrfnl @kukulich @SpacePossum @julienfalque @keradus

@jrfnl
Copy link
Contributor

jrfnl commented Feb 16, 2020

cc @sebastianbergmann re php-token-stream

@nikic nikic changed the title Implement TOKEN_AS_OBJECT mode Implement object-based token_get_all() alternative Feb 17, 2020
@nikic
Copy link
Member Author

nikic commented Feb 17, 2020

@TysonAndre Nice catch! That wasn't intentional, and initialization of extra properties is now fixed. I've taken care to implement this in a way that only adds additional runtime cost if there really are extra properties.

@nikic nikic force-pushed the token-get-all-object branch from 1bc1617 to fa67619 Compare February 17, 2020 09:21
@SpacePossum
Copy link

Looking good :)

Wondering, when dealing with the token name. When the token name is unknown (because the id is), can we maybe update to either;

  • return null
  • throw an exception

basically, what I don't like is currently the magic value 'UNKNOWN' is returned;

<?php

$a = token_name(-1); // $a = 'UNKNOWN';

Also, does the setup allow for custom tokens?
For example we use a lot of custom id's and names so it is easier to analyze the tokens. For example all the different curly brace meanings.

So maybe something like Token::__construct(int $id, string $text, int $line, int $pos, string $name = null) ?

@nikic
Copy link
Member Author

nikic commented Feb 19, 2020

@SpacePossum Returning null instead of "UNKNOWN" would be a possibility. Not sure what should be returned for tokens < 256. I'm currently returning the character (so a ; token will return ; as the name), but possibly that should also be null?

So maybe something like Token::__construct(int $id, string $text, int $line, int $pos, string $name = null) ?

This will increase memory usage, so I don't think it's a good idea. The class does allow extension though, so you can extend the getTokenName() method to also map your custom IDs to the appropriate names.

@SpacePossum
Copy link

Returning the character as name for tokens with id < 256 makes perfect sense to me, than null always means "unknown".

👍 on allowing extending the Token class, that will work fine for our use-case.

@marcioAlmada
Copy link
Contributor

marcioAlmada commented Feb 20, 2020

I think the token class extension solves all the issues regarding having behavior and memory trade offs (avoid userland composition)... maybe it's better to drop the default methods or do we intend to keep them due to efficiency?

@nikic nikic force-pushed the token-get-all-object branch from fa67619 to 86bdc84 Compare March 2, 2020 17:34
@nikic nikic force-pushed the token-get-all-object branch from 86bdc84 to ab7c362 Compare March 26, 2020 10:00
@php-pulls php-pulls closed this in 5a09b9f Mar 26, 2020
@carusogabriel carusogabriel added this to the PHP 8.0 milestone May 29, 2020
jrfnl added a commit to PHPCompatibility/PHPCompatibility that referenced this pull request Aug 16, 2020
> The new PhpToken class adds an object-based interface to the tokenizer.
>  It provides a more uniform and ergonomic representation, while being more
memory efficient and faster.

Refs:
* https://wiki.php.net/rfc/token_as_object
* https://github.com/php/php-src/blob/4522cbb789ea4d0b70b5a1bdf7f3c0f4648d8fb7/UPGRADING#L929-L933
* php/php-src#5176
* php/php-src@5a09b9f
jrfnl added a commit to PHPCompatibility/PHPCompatibility that referenced this pull request Aug 17, 2020
> The new PhpToken class adds an object-based interface to the tokenizer.
>  It provides a more uniform and ergonomic representation, while being more
memory efficient and faster.

Refs:
* https://wiki.php.net/rfc/token_as_object
* https://github.com/php/php-src/blob/4522cbb789ea4d0b70b5a1bdf7f3c0f4648d8fb7/UPGRADING#L929-L933
* php/php-src#5176
* php/php-src@5a09b9f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants