Fix StaleElementReferenceException in Crawler #2679

Closed
danielbichuetti opened this issue Jun 19, 2022 · 5 comments
danielbichuetti commented Jun 19, 2022

Describe the bug
When crawling some webpages, the attempt to extract sub-links raises a StaleElementReferenceException.

Error message

StaleElementReferenceException Traceback (most recent call last)
Input In [20], in <cell line: 1>()
----> 1 docs = crawler.crawl(
2 urls=urls,
3 filter_urls=['/depeso/'],
4 crawler_depth=1
5 )

File /anaconda/envs/haystack/lib/python3.9/site-packages/haystack/nodes/connector/crawler.py:162, in Crawler.crawl(self, output_dir, urls, crawler_depth, filter_urls, overwrite_existing_files, id_hash_keys)
159 for url_ in urls:
160 existed_links: List = list(sum(list(sub_links.values()), []))
161 sub_links[url_] = list(
--> 162 self.extract_sublinks_from_url(
163 base_url=url, filter_urls=filter_urls, existed_links=existed_links
164 )
165 )
166 for url, extracted_sublink in sub_links.items():
167 file_paths += self._write_to_files(
168 extracted_sublink, output_dir=output_dir, base_url=url, id_hash_keys=id_hash_keys
169 )

File /anaconda/envs/haystack/lib/python3.9/site-packages/haystack/nodes/connector/crawler.py:291, in Crawler._extract_sublinks_from_url(self, base_url, filter_urls, existed_links)
288 sub_links.add(base_url)
290 for i in a_elements:
--> 291 sub_link = i.get_attribute("href")
292 if not (existed_links and sub_link in existed_links):
293 if self._is_internal_url(base_url=base_url, sub_link=sub_link) and (
294 not self._is_inpage_navigation(base_url=base_url, sub_link=sub_link)
295 ):

File /anaconda/envs/haystack/lib/python3.9/site-packages/selenium/webdriver/remote/webelement.py:173, in WebElement.get_attribute(self, name)
171 if getAttribute_js is None:
172 _load_js()
--> 173 attribute_value = self.parent.execute_script(
174 "return (%s).apply(null, arguments);" % getAttribute_js,
175 self, name)
176 return attribute_value

File /anaconda/envs/haystack/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py:884, in WebDriver.execute_script(self, script, *args)
881 converted_args = list(args)
882 command = Command.W3C_EXECUTE_SCRIPT
--> 884 return self.execute(command, {
885 'script': script,
886 'args': converted_args})['value']

File /anaconda/envs/haystack/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py:430, in WebDriver.execute(self, driver_command, params)
428 response = self.command_executor.execute(driver_command, params)
429 if response:
--> 430 self.error_handler.check_response(response)
431 response['value'] = self._unwrap_value(
432 response.get('value', None))
433 return response

File /anaconda/envs/haystack/lib/python3.9/site-packages/selenium/webdriver/remote/errorhandler.py:247, in ErrorHandler.check_response(self, response)
245 alert_text = value['alert'].get('text')
246 raise exception_class(message, screen, stacktrace, alert_text) # type: ignore[call-arg] # mypy is not smart enough here
--> 247 raise exception_class(message, screen, stacktrace)

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: headless chrome=102.0.5005.115)

Expected behavior
The Crawler should capture the contents of pages down to the first depth level.

Additional context

To Reproduce

from haystack.nodes import Crawler
from datetime import datetime

now = datetime.now().strftime("%Y-%m-%d-%H%M%S")
crawler = Crawler(output_dir='migalhas/' + now)
urls = []
for i in range(3200):
    urls.append('https://www.migalhas.com.br/depeso?pagina=' + str(i))
docs = crawler.crawl(
    urls=urls,
    crawler_depth=1
)

FAQ Check

System:

  • OS: Ubuntu 18.04
  • GPU/CPU: Intel Xeon
  • Haystack version (commit or version number): 1.5.0
  • DocumentStore:
  • Reader:
  • Retriever:

danielbichuetti commented Jun 19, 2022

The issue appears to be related to the dynamic DOM. Including a time.sleep(2) on the same pages before extracting the sub-links solved the issue (outside of Haystack, with the same code it uses). A better solution would be to implement an expected condition, something using

from selenium.webdriver.support import expected_conditions as EC

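For example, a minimal sketch of what that explicit wait could look like, waiting until the anchor elements are attached before reading their href attributes (the 10-second timeout, the "a" locator, and the bare driver variable are just illustrative; inside the Crawler it would be self.driver):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait until <a> elements are present in the DOM before touching them,
# instead of reading references that may go stale while the page still loads.
a_elements = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.TAG_NAME, "a"))
)
hrefs = [el.get_attribute("href") for el in a_elements]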

masci commented Jun 20, 2022

Hi @danielbichuetti thanks for the detailed issue! We have some ongoing work in #2658 that I believe is related to the problem you're having.

We'll ensure we fix this one when working on #2658


danielbichuetti commented Jun 20, 2022

Hi @masci

I'm sorry for not putting everything into one message; I hit the error and only investigated further some time later. Using Selenium, the only options would be an implicit wait (driver.implicitly_wait(x), where x is in seconds) or an explicit wait. The latter would require the user to pass an EC (expected condition), either a ready-made or a custom one.

There is a framework, Playwright, from the same authors as Puppeteer, currently developed by Microsoft. Disney, ING, and Adobe use it for web testing, and we use it at our company for some web scraping. It's simple to install, and it can either use a pre-installed browser or download one itself. It has a pretty clear design, and there are some wait conditions in the navigation classes, like:

page.goto("https://www.migalhas.com.br", wait_until="networkidle")
page.wait_for_load_state("networkidle")  # waits for the "networkidle" state
page.wait_for_function("() => window.amILoadedYet()")  # or even the most sophisticated one

A lot of web scraping companies use Playwright now.
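
For illustration, a self-contained sketch of how sub-link extraction could look with Playwright's sync API (the URL, the "networkidle" condition, and the a[href] selector are just example choices):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # goto() returns only after the network has gone idle, so the DOM has
    # settled before we read the hrefs.
    page.goto("https://www.migalhas.com.br/depeso?pagina=1", wait_until="networkidle")
    hrefs = page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
    browser.close()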


I've tested a lot more, and implicit wait is not a good choice. It's pretty unstable, and sometimes it simply doesn't wait the correct amount of time (I didn't look at the source code to understand why). Also, it's suited to waiting for elements to become ready, whereas in some cases elements are removed from the DOM. A better solution would be a fixed wait (time.sleep) or a custom WebDriver wait function. Another option would be to implement a Playwright port of the Crawler.

Adding a new parameter, page_wait_time, following your existing patterns (init, crawl and run), with the time in seconds to wait, seems like a plausible solution for now. And maybe another parameter to receive a WebDriverWait condition (page_wait_function), if the user wants to wait for a specific EC. A rough sketch follows below.
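
As that rough sketch, here is what those two parameters could drive internally; the names, defaults, and the standalone wait_for_page helper are only my suggestion, not existing Haystack code:

import time
from typing import Callable, Optional

from selenium.webdriver.support.ui import WebDriverWait


def wait_for_page(driver, page_wait_time: float = 0.0,
                  page_wait_function: Optional[Callable] = None,
                  timeout: float = 10.0) -> None:
    # If the user supplied an expected condition, let WebDriverWait poll it;
    # otherwise fall back to a plain fixed sleep (or no wait at all).
    if page_wait_function is not None:
        WebDriverWait(driver, timeout).until(page_wait_function)
    elif page_wait_time > 0:
        time.sleep(page_wait_time)

The Crawler could call this once per navigation, right before locating the <a> elements.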

Are you open to contributions? May I code this fix and then start a Playwright class (PlaywrightCrawler)?


masci commented Jun 21, 2022

Are you open to contributions? May I code this fix and then start a Playwright class (PlaywrightCrawler)?

Big time! Please go ahead, I'll support you through the review process - thanks in advance!
