bpo-21475: Support the Sitemap extension in robotparser #6883


Merged (13 commits) on May 16, 2018.
9 changes: 9 additions & 0 deletions Doc/library/urllib.robotparser.rst
@@ -76,6 +76,15 @@ structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.

      .. versionadded:: 3.6

   .. method:: site_maps()

      Returns the contents of the ``Sitemap`` parameter from
      ``robots.txt`` in the form of a :func:`list`. If there is no such
      parameter or the ``robots.txt`` entry for this parameter has
      invalid syntax, returns ``None``.

      .. versionadded:: 3.8
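For illustration, a minimal usage sketch of the new method (the robots.txt URL is illustrative; requires a build with this change, i.e. Python 3.8+):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.musi-cal.com/robots.txt")  # illustrative URL
rp.read()                                          # fetch and parse robots.txt
print(rp.site_maps())  # list of Sitemap URLs, or None if the file declares none
```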


The following example demonstrates basic use of the :class:`RobotFileParser`
class::
21 changes: 21 additions & 0 deletions Lib/test/test_robotparser.py
@@ -12,6 +12,7 @@ class BaseRobotTest:
    agent = 'test_robotparser'
    good = []
    bad = []
    site_maps = None

    def setUp(self):
        lines = io.StringIO(self.robots_txt).readlines()
@@ -36,6 +37,9 @@ def test_bad_urls(self):
            with self.subTest(url=url, agent=agent):
                self.assertFalse(self.parser.can_fetch(agent, url))

    def test_site_maps(self):
        self.assertEqual(self.parser.site_maps(), self.site_maps)


class UserAgentWildcardTest(BaseRobotTest, unittest.TestCase):
robots_txt = """\
@@ -65,6 +69,23 @@ class CrawlDelayAndCustomAgentTest(BaseRobotTest, unittest.TestCase):
    bad = ['/cyberworld/map/index.html']


class SitemapTest(BaseRobotTest, unittest.TestCase):
    robots_txt = """\
# robots.txt for http://www.example.com/

User-agent: *
Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
Request-rate: 3/15
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

"""
    good = ['/', '/test.html']
    bad = ['/cyberworld/map/index.html']
    site_maps = ['http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml',
                 'http://www.google.com/hostednews/sitemap_index.xml']
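As an aside, a small sketch of what this test case exercises, using a trimmed-down version of the robots.txt above (assumes this PR's patch, i.e. Python 3.8+):

```python
import io
import urllib.robotparser

robots_txt = """\
User-agent: *
Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
Disallow: /cyberworld/map/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(io.StringIO(robots_txt).readlines())

print(parser.site_maps())
# ['http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml',
#  'http://www.google.com/hostednews/sitemap_index.xml']
print(parser.can_fetch('test_robotparser', '/test.html'))                  # True
print(parser.can_fetch('test_robotparser', '/cyberworld/map/index.html'))  # False
```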


class RejectAllRobotsTest(BaseRobotTest, unittest.TestCase):
robots_txt = """\
# go away
12 changes: 12 additions & 0 deletions Lib/urllib/robotparser.py
@@ -27,6 +27,7 @@ class RobotFileParser:

    def __init__(self, url=''):
        self.entries = []
        self.sitemaps = []
        self.default_entry = None
        self.disallow_all = False
        self.allow_all = False
@@ -141,6 +142,12 @@ def parse(self, lines):
                            and numbers[1].strip().isdigit()):
                            entry.req_rate = RequestRate(int(numbers[0]), int(numbers[1]))
                        state = 2
                elif line[0] == "sitemap":
                    # According to http://www.sitemaps.org/protocol.html
                    # "This directive is independent of the user-agent line,
                    #  so it doesn't matter where you place it in your file."
                    # Therefore we do not change the state of the parser.
                    self.sitemaps.append(line[1])
        if state == 2:
            self._add_entry(entry)
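Since the directive is collected without touching the parser state, a ``Sitemap`` line can appear anywhere in the file, even before the first ``User-agent`` group. A small sketch of the intended behaviour (assuming this patch is applied; URLs are illustrative):

```python
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
# The Sitemap line precedes any User-agent group; it is still recorded
# because handling it does not change the parser state.
parser.parse([
    "Sitemap: https://example.com/sitemap.xml",
    "",
    "User-agent: *",
    "Disallow: /private/",
])
print(parser.site_maps())                      # ['https://example.com/sitemap.xml']
print(parser.can_fetch("mybot", "/public/x"))  # True
```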

@@ -189,6 +196,11 @@ def request_rate(self, useragent):
                return entry.req_rate
        return self.default_entry.req_rate

    def site_maps(self):
        if not self.sitemaps:
            return None
        return self.sitemaps

Review comment (Member): Could you also add a test for this branch?

Reply (Contributor Author): I believe this branch is tested by test_site_maps in all the other robotparser tests: each of them checks that the result is None, except for my single class that tests the positive case.

Reply (Member): Oops, you're correct. I didn't click the expand button, so I didn't notice that the test_site_maps method is part of BaseRobotTest.
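To illustrate the branch discussed above: when no ``Sitemap`` directive is present, the internal list stays empty and the accessor returns ``None``. A minimal sketch assuming this patch:

```python
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
])
# No Sitemap directive was encountered, so the internal list is empty
# and site_maps() returns None rather than an empty list.
print(parser.site_maps())   # None
```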

    def __str__(self):
        entries = self.entries
        if self.default_entry is not None:
2 changes: 2 additions & 0 deletions Misc/ACKS
@@ -109,6 +109,7 @@ Anthony Baxter
Mike Bayer
Samuel L. Bayer
Bo Bayles
Christopher Beacham AKA Lady Red
Tommy Beadle
Donald Beaudry
David Beazley
@@ -1760,6 +1761,7 @@ Dik Winter
Blake Winton
Jean-Claude Wippler
Stéphane Wirtel
Peter Wirtz
Lars Wirzenius
John Wiseman
Chris Withers
@@ -0,0 +1,3 @@
Added support for Site Maps to urllib's ``RobotFileParser`` as
:meth:`RobotFileParser.site_maps() <urllib.robotparser.RobotFileParser.site_maps>`.
Patch by Lady Red, based on patch by Peter Wirtz.