
Commit c56b17b

takluyver authored and willingc committed
bpo-12486: Document tokenize.generate_tokens() as public API (#6957)
* Document tokenize.generate_tokens()
* Add news file
* Add test for generate_tokens
* Document behaviour around ENCODING token
* Add generate_tokens to __all__
1 parent c2745d2 commit c56b17b

File tree

4 files changed (+35, -6 lines)

Doc/library/tokenize.rst

Lines changed: 12 additions & 1 deletion
@@ -57,6 +57,16 @@ The primary entry point is a :term:`generator`:
    :func:`.tokenize` determines the source encoding of the file by looking for a
    UTF-8 BOM or encoding cookie, according to :pep:`263`.
 
+.. function:: generate_tokens(readline)
+
+   Tokenize a source reading unicode strings instead of bytes.
+
+   Like :func:`.tokenize`, the *readline* argument is a callable returning
+   a single line of input. However, :func:`generate_tokens` expects *readline*
+   to return a str object rather than bytes.
+
+   The result is an iterator yielding named tuples, exactly like
+   :func:`.tokenize`. It does not yield an :data:`~token.ENCODING` token.
 
 All constants from the :mod:`token` module are also exported from
 :mod:`tokenize`.
@@ -79,7 +89,8 @@ write back the modified script.
    positions) may change.
 
    It returns bytes, encoded using the :data:`~token.ENCODING` token, which
-   is the first token sequence output by :func:`.tokenize`.
+   is the first token sequence output by :func:`.tokenize`. If there is no
+   encoding token in the input, it returns a str instead.
 
 
 :func:`.tokenize` needs to detect the encoding of source files it tokenizes. The
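
The documented behaviour is easy to demonstrate. Below is a minimal sketch (not part of the commit) of what the added paragraphs describe: generate_tokens() consumes str lines and yields no ENCODING token, so untokenize() gives back a str, whereas the bytes-based tokenize() path still round-trips to bytes.

# Minimal sketch illustrating the documented behaviour; not code from this commit.
import io
import tokenize

source = "spam = ham + 1\n"

# generate_tokens() takes a readline returning str and yields no ENCODING token.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
assert tokens[0].type != tokenize.ENCODING

# With no ENCODING token in the input, untokenize() returns a str,
# as the amended untokenize() paragraph notes.
assert isinstance(tokenize.untokenize(tokens), str)

# Contrast: tokenize() reads bytes and emits ENCODING first, so the
# round trip comes back as bytes.
btokens = list(tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline))
assert btokens[0].type == tokenize.ENCODING
assert isinstance(tokenize.untokenize(btokens), bytes)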

Lib/test/test_tokenize.py

Lines changed: 15 additions & 2 deletions
@@ -1,8 +1,8 @@
 from test import support
 from tokenize import (tokenize, _tokenize, untokenize, NUMBER, NAME, OP,
                      STRING, ENDMARKER, ENCODING, tok_name, detect_encoding,
-                     open as tokenize_open, Untokenizer)
-from io import BytesIO
+                     open as tokenize_open, Untokenizer, generate_tokens)
+from io import BytesIO, StringIO
 import unittest
 from unittest import TestCase, mock
 from test.test_grammar import (VALID_UNDERSCORE_LITERALS,
@@ -919,6 +919,19 @@ async def bar(): pass
     DEDENT     ''            (7, 0) (7, 0)
     """)
 
+class GenerateTokensTest(TokenizeTest):
+    def check_tokenize(self, s, expected):
+        # Format the tokens in s in a table format.
+        # The ENDMARKER is omitted.
+        result = []
+        f = StringIO(s)
+        for type, token, start, end, line in generate_tokens(f.readline):
+            if type == ENDMARKER:
+                break
+            type = tok_name[type]
+            result.append(f"    {type:10} {token!r:13} {start} {end}")
+        self.assertEqual(result, expected.rstrip().splitlines())
+
 
 def decistmt(s):
     result = []
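
Because GenerateTokensTest subclasses TokenizeTest and only overrides check_tokenize(), the existing expected-token tables are exercised against generate_tokens() as well. A rough standalone sketch of what that check computes (the dump_tokens helper is made up for illustration, not part of the test suite):

# Illustrative sketch, runnable outside the test suite; dump_tokens() is invented here.
from io import StringIO
from tokenize import generate_tokens, tok_name, ENDMARKER

def dump_tokens(source):
    rows = []
    for type, token, start, end, line in generate_tokens(StringIO(source).readline):
        if type == ENDMARKER:      # the tests stop before the final ENDMARKER
            break
        rows.append(f"    {tok_name[type]:10} {token!r:13} {start} {end}")
    return rows

for row in dump_tokens("1 + 1\n"):
    print(row)
# Rows are shaped like the expected tables in TokenizeTest, e.g.
#     NUMBER     '1'           (1, 0) (1, 1)
#     OP         '+'           (1, 2) (1, 3)
#     NUMBER     '1'           (1, 4) (1, 5)
#     NEWLINE    '\n'          (1, 5) (1, 6)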

Lib/tokenize.py

Lines changed: 6 additions & 3 deletions
@@ -37,7 +37,7 @@
 blank_re = re.compile(br'^[ \t\f]*(?:[#\r\n]|$)', re.ASCII)
 
 import token
-__all__ = token.__all__ + ["tokenize", "detect_encoding",
+__all__ = token.__all__ + ["tokenize", "generate_tokens", "detect_encoding",
                            "untokenize", "TokenInfo"]
 del token
 
@@ -653,9 +653,12 @@ def _tokenize(readline, encoding):
     yield TokenInfo(ENDMARKER, '', (lnum, 0), (lnum, 0), '')
 
 
-# An undocumented, backwards compatible, API for all the places in the standard
-# library that expect to be able to use tokenize with strings
 def generate_tokens(readline):
+    """Tokenize a source reading Python code as unicode strings.
+
+    This has the same API as tokenize(), except that it expects the *readline*
+    callable to return str objects instead of bytes.
+    """
     return _tokenize(readline, None)
 
 def main():
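
One practical effect of the __all__ change above: generate_tokens is now part of the module's advertised interface, so star imports and __all__-based tooling see it. A quick sanity check, sketched under the assumption of a Python build that includes this commit:

# Sketch: confirm the new export; assumes this change is present in the build.
import tokenize

assert "generate_tokens" in tokenize.__all__

# Star imports now pick the name up as well.
namespace = {}
exec("from tokenize import *", namespace)
assert "generate_tokens" in namespace
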
News file (added)

Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
+:func:`tokenize.generate_tokens` is now documented as a public API to
+tokenize unicode strings. It was previously present but undocumented.
