bpo-35922: Fix RobotFileParser when robots.txt is invalid#11791
Conversation
taleinat
left a comment
There was a problem hiding this comment.
In addition to testing with empty robots.txt content, it would be good to also test with:
- one not referencing the given user-agent and without a default
- one with a default or with the given user-agent, but which doesn't define
Crawl_delayandRequest_rate
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
Co-Authored-By: Tal Einat <taleinat+github@gmail.com>
|
Thanks for the review @taleinat! I made all the changes you requested, let me know if you want me to make other updates. I have made the requested changes; please review again |
|
Thanks for making the requested changes! @taleinat: please review the changes made to this pull request. |
taleinat
left a comment
There was a problem hiding this comment.
These changes are very well done, and they were done very quickly; nice!
Taking a deeper look at the structure of the existing tests, we would do better to build on the existing BaseRequestRateTest. Specifically, I think its test_request_rate() method should be changed to always check the crawl delay and request rate, so that setting them to None will check that RobotFileParser actually returns None.
Doing so should allow removing NoDefaultUserAgentTest, since that case will already be covered by the existing CrawlDelayAndRequestRateTest. EmptyFileTest should stay but not require any special logic, just define robots_txt to an empty stings. Similarly, NoRequestRateAndCrawlDelayTest could be written much more simply.
Let me know if you'd like to do this yourself or if you'd prefer me to make these changes myself.
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase And if you don't make the requested changes, you will be put in the comfy chair! |
|
Yeah, it will be better that way, I will push a new commit tomorrow. |
|
Thanks @taleinat, I have made the requested changes; please review again |
|
Thanks for making the requested changes! @taleinat: please review the changes made to this pull request. |
|
@remilapeyre: Status check is done, and it's a failure ❌ . |
No, please ignore those messages. They seem to be caused by a bug in Miss Islington. |
|
@remilapeyre: Status check is done, and it's a failure ❌ . |
Or maybe Miss Islington was just made aware of the failure earlier... |
|
The commit message should use "robots.txt", not "robot.txt". |
berkerpeksag
left a comment
There was a problem hiding this comment.
Good catch! I just left very minor comments. Otherwise it looks good to me.
| return self.default_entry.delay | ||
| if self.default_entry: | ||
| return self.default_entry.delay | ||
| else: |
There was a problem hiding this comment.
The else branch isn't needed:
if self.default_entry:
return self.default_entry.delay
return None| return self.default_entry.req_rate | ||
| if self.default_entry: | ||
| return self.default_entry.req_rate | ||
| else: |
| @@ -0,0 +1,4 @@ | |||
| Fix :meth:`RobotFileParser.crawl_delay()` and | |||
There was a problem hiding this comment.
Nit: () can be removed. Sphinx will it add it automatically.
| @@ -0,0 +1,4 @@ | |||
| Fix :meth:`RobotFileParser.crawl_delay()` and | |||
| :meth:`RobotFileParser.request_rate()` to return ``None`` rather than | |||
There was a problem hiding this comment.
Nit: () can be removed. Sphinx will it add it automatically.
|
|
||
|
|
||
| class EmptyFileTest(BaseRequestRateTest, unittest.TestCase): | ||
| robots_txt = "" |
There was a problem hiding this comment.
Nit: Please use single quotes.
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
|
I have made the requested changes; please review again. |
|
@berkerpeksag, just awaiting your final okay, and I'll go about merging and backporting this. |
berkerpeksag
left a comment
There was a problem hiding this comment.
Good catch! Thank you for fixing this.
|
Thanks @remilapeyre for the PR, and @taleinat for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7, 3.8. |
|
GH-14121 is a backport of this pull request to the 3.8 branch. |
|
GH-14122 is a backport of this pull request to the 3.7 branch. |
…delay or request rate (pythonGH-11791) Co-Authored-By: Tal Einat <taleinat+github@gmail.com> (cherry picked from commit 8047e0e) Co-authored-by: Rémi Lapeyre <remi.lapeyre@henki.fr>
…delay or request rate (pythonGH-11791) Co-Authored-By: Tal Einat <taleinat+github@gmail.com>
…delay or request rate (pythonGH-11791) Co-Authored-By: Tal Einat <taleinat+github@gmail.com>
I don't think this change requires a news entry.
https://bugs.python.org/issue35922