Fix StaleElementReferenceException in Crawler #2679

Closed
danielbichuetti opened this issue Jun 19, 2022 · 5 comments
danielbichuetti commented Jun 19, 2022

Describe the bug
When crawling some webpages, the attempt to extract sub-links raises a StaleElementReferenceException.

Error message

StaleElementReferenceException Traceback (most recent call last)
Input In [20], in <cell line: 1>()
----> 1 docs = crawler.crawl(
2 urls=urls,
3 filter_urls=['/depeso/'],
4 crawler_depth=1
5 )

File /anaconda/envs/haystack/lib/python3.9/site-packages/haystack/nodes/connector/crawler.py:162, in Crawler.crawl(self, output_dir, urls, crawler_depth, filter_urls, overwrite_existing_files, id_hash_keys)
159 for url_ in urls:
160 existed_links: List = list(sum(list(sub_links.values()), []))
161 sub_links[url_] = list(
--> 162 self.extract_sublinks_from_url(
163 base_url=url, filter_urls=filter_urls, existed_links=existed_links
164 )
165 )
166 for url, extracted_sublink in sub_links.items():
167 file_paths += self._write_to_files(
168 extracted_sublink, output_dir=output_dir, base_url=url, id_hash_keys=id_hash_keys
169 )

File /anaconda/envs/haystack/lib/python3.9/site-packages/haystack/nodes/connector/crawler.py:291, in Crawler._extract_sublinks_from_url(self, base_url, filter_urls, existed_links)
288 sub_links.add(base_url)
290 for i in a_elements:
--> 291 sub_link = i.get_attribute("href")
292 if not (existed_links and sub_link in existed_links):
293 if self._is_internal_url(base_url=base_url, sub_link=sub_link) and (
294 not self._is_inpage_navigation(base_url=base_url, sub_link=sub_link)
295 ):

File /anaconda/envs/haystack/lib/python3.9/site-packages/selenium/webdriver/remote/webelement.py:173, in WebElement.get_attribute(self, name)
171 if getAttribute_js is None:
172 _load_js()
--> 173 attribute_value = self.parent.execute_script(
174 "return (%s).apply(null, arguments);" % getAttribute_js,
175 self, name)
176 return attribute_value

File /anaconda/envs/haystack/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py:884, in WebDriver.execute_script(self, script, *args)
881 converted_args = list(args)
882 command = Command.W3C_EXECUTE_SCRIPT
--> 884 return self.execute(command, {
885 'script': script,
886 'args': converted_args})['value']

File /anaconda/envs/haystack/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py:430, in WebDriver.execute(self, driver_command, params)
428 response = self.command_executor.execute(driver_command, params)
429 if response:
--> 430 self.error_handler.check_response(response)
431 response['value'] = self._unwrap_value(
432 response.get('value', None))
433 return response

File /anaconda/envs/haystack/lib/python3.9/site-packages/selenium/webdriver/remote/errorhandler.py:247, in ErrorHandler.check_response(self, response)
245 alert_text = value['alert'].get('text')
246 raise exception_class(message, screen, stacktrace, alert_text) # type: ignore[call-arg] # mypy is not smart enough here
--> 247 raise exception_class(message, screen, stacktrace)

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: headless chrome=102.0.5005.115)

Expected behavior
The Crawler should capture the contents of pages down to the first depth level.

Additional context

To Reproduce

from haystack.nodes import Crawler
from datetime import datetime

now = datetime.now().strftime("%Y-%m-%d-%H%M%S")
crawler = Crawler(output_dir='migalhas/' + now)
urls = []
for i in range(3200):
    urls.append('https://www.migalhas.com.br/depeso?pagina=' + str(i))
docs = crawler.crawl(
    urls=urls,
    crawler_depth=1
)

FAQ Check

System:

  • OS: Ubuntu 18.04
  • GPU/CPU: Intel Xeon
  • Haystack version (commit or version number): 1.5.0
  • DocumentStore:
  • Reader:
  • Retriever:

danielbichuetti commented Jun 19, 2022

The issue appears to be related to the dynamic DOM. Including a time.sleep(2) on the same pages before extracting the sub-links solved the issue (outside of Haystack, with the same code it uses). A better solution would be to implement an expected condition, something using

from selenium.webdriver.support import expected_conditions as EC

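For example, a minimal sketch of what that explicit wait could look like, waiting until the anchor elements are attached before reading their href attributes (the 10-second timeout, the "a" locator, and the bare driver variable are just illustrative; inside the Crawler it would be self.driver):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait until <a> elements are present in the DOM before touching them,
# instead of reading references that may go stale while the page still loads.
a_elements = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.TAG_NAME, "a"))
)
hrefs = [el.get_attribute("href") for el in a_elements]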

masci commented Jun 20, 2022

Hi @danielbichuetti thanks for the detailed issue! We have some ongoing work in #2658 that I believe is related to the problem you're having.

We'll ensure we fix this one when working on #2658


danielbichuetti commented Jun 20, 2022

Hi @masci

I'm sorry for not putting everything into one message; I hit the error and only investigated further some time later. Using Selenium, the only options would be an implicit wait (driver.implicitly_wait(x), where x is in seconds) or an explicit wait. The latter would require the user to pass an EC (expected condition), either a ready-made or a custom one.

There is a framework, Playwright, from the same authors as Puppeteer, currently developed by Microsoft. Disney, ING, and Adobe use it for web testing, and we use it at our company for some web scraping. It's simple to install, and it can either use a pre-installed browser or download one itself. It has a pretty clear design, and there are some wait conditions in the navigation classes, like:

page.goto("https://www.migalhas.com.br", wait_until="networkidle")
page.wait_for_load_state("networkidle")  # waits for the "networkidle" state
page.wait_for_function("() => window.amILoadedYet()")  # or even the most sophisticated one

A lot of web scraping companies use Playwright now.
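
For illustration, a self-contained sketch of how sub-link extraction could look with Playwright's sync API (the URL, the "networkidle" condition, and the a[href] selector are just example choices):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # goto() returns only after the network has gone idle, so the DOM has
    # settled before we read the hrefs.
    page.goto("https://www.migalhas.com.br/depeso?pagina=1", wait_until="networkidle")
    hrefs = page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
    browser.close()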


I've tested a lot more, and implicit wait is not a good choice. It's pretty unstable, and sometimes it simply doesn't wait the correct amount of time (I didn't look at the source code to understand why). Also, it's suited to waiting for elements to become ready, whereas in some cases elements are removed from the DOM. A better solution would be a fixed wait (time.sleep) or a custom WebDriver wait function. Another option would be to implement a Playwright port of the Crawler.

Adding a new parameter, page_wait_time, following your existing patterns (init, crawl and run), with the time in seconds to wait, seems like a plausible solution for now. And maybe another parameter to receive a WebDriverWait condition (page_wait_function), if the user wants to wait for a specific EC. A rough sketch follows below.
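
As that rough sketch, here is what those two parameters could drive internally; the names, defaults, and the standalone wait_for_page helper are only my suggestion, not existing Haystack code:

import time
from typing import Callable, Optional

from selenium.webdriver.support.ui import WebDriverWait


def wait_for_page(driver, page_wait_time: float = 0.0,
                  page_wait_function: Optional[Callable] = None,
                  timeout: float = 10.0) -> None:
    # If the user supplied an expected condition, let WebDriverWait poll it;
    # otherwise fall back to a plain fixed sleep (or no wait at all).
    if page_wait_function is not None:
        WebDriverWait(driver, timeout).until(page_wait_function)
    elif page_wait_time > 0:
        time.sleep(page_wait_time)

The Crawler could call this once per navigation, right before locating the <a> elements.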

Are you open to contributions? May I code this fix and then start a Playwright class (PlaywrightCrawler)?


masci commented Jun 21, 2022

Are you open to contributions? May I code this fix and then start a Playwright class (PlaywrightCrawler)?

Big time! Please go ahead, I'll support you through the review process - thanks in advance!
