Fix StaleElementReferenceException in Crawler #2679
Comments
The issue appears to be related to the dynamic DOM. Adding a time.sleep(2) on the same pages before extracting the sub-links solved the issue (outside of Haystack, using the same code it uses). A better solution would be to implement an expected condition, something using
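For illustration, the idea behind an explicit wait (what Selenium's WebDriverWait.until does internally) boils down to a poll-until loop. This is a pure-Python sketch so it can be read and run without Selenium installed; in real Crawler code the condition would check the page's DOM, e.g. that the number of anchor elements has stabilized:

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    Mirrors the shape of Selenium's WebDriverWait.until; `condition` stands in
    for an expected condition such as presence_of_all_elements_located.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within %.1fs" % timeout)

# Example: a counter standing in for the page's DOM "settling" on the
# third poll (no browser needed for the sketch).
state = {"calls": 0}

def dom_ready():
    state["calls"] += 1
    return state["calls"] >= 3

assert wait_until(dom_ready, timeout=5, poll=0.01)
```

Unlike a flat time.sleep(2), this returns as soon as the condition holds, so pages that render quickly are not penalized with the full wait.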
Hi @danielbichuetti thanks for the detailed issue! We have some ongoing work in #2658 that I believe is related to the problem you're having. We'll ensure we fix this one when working on #2658 |
Hi @masci, I'm sorry for not sending everything in one message; I only hit the error and investigated further after some time. Using Selenium, the only options would be an implicit wait (driver.implicitly_wait(x)  # x seconds) or an explicit wait. The latter would need the user to supply an EC (expected_condition), either ready-made or custom. There is a framework, Playwright, from the same authors as Puppeteer, currently developed by Microsoft. Disney, ING and Adobe use it for web testing, and we use it at our company for some web scraping. It's simple to install, and it can either use a pre-installed browser or download one itself. It has a pretty clear design, and there are some wait conditions in the navigation classes, like:
A lot of web scraping companies use Playwright now.
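The snippet originally attached to the comment above is not preserved here. As an illustration of the kind of wait conditions meant, Playwright's sync API lets page.goto take a wait_until load state; the function below is a hypothetical helper (the import is kept local so the sketch can be read without Playwright installed):

```python
def fetch_rendered_html(url: str, wait: str = "networkidle") -> str:
    """Fetch `url` with Playwright and wait for the given load state.

    wait_until accepts "load", "domcontentloaded", "networkidle" or
    "commit"; "networkidle" waits until the page stops making requests,
    which is usually enough for dynamically injected sub-links.
    """
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until=wait)
        html = page.content()
        browser.close()
        return html
```

With such a helper, sub-link extraction would run against fully rendered HTML instead of a still-changing DOM.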
I've tested a lot more, and an implicit wait is not a good choice. It's pretty unstable, and sometimes it just doesn't wait the correct amount of time (I didn't look at the source code to understand why). Also, it's suited to waiting for elements to become ready, while in some cases elements are removed from the DOM. A better solution would be waiting (time.sleep) or a custom WebDriver function. Another option would be to implement a Playwright port of the Crawler. Introducing a new variable page_wait_time, following your patterns for init, crawl and run, with the time in seconds to wait, makes a plausible solution for now. Maybe another parameter could receive a WebDriverWait function (page_wait_function), if the user wants to wait on a specific EC. Are you open to contributors? May I code this fix and then start a Playwright class (PlaywrightCrawler)?
Big time! Please go ahead, I'll support you through the review process - thanks in advance!
Describe the bug
When crawling some webpages, the attempt to extract sub-links raises a StaleElementReferenceException.
Error message
Expected behavior
The Crawler should capture the contents of pages up to the first depth.
Additional context
To Reproduce
FAQ Check
System: