Browser Control #79

Open
philippta opened this issue Jan 6, 2025 · 4 comments
Labels
enhancement New feature or request

Comments

@philippta
Owner

philippta commented Jan 6, 2025

Currently, when browser mode is enabled:

  1. flyscrape uses a browser to navigate to a page,
  2. waits until the page has loaded and returns its HTML,
  3. which can then be used for scraping.
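
The three steps above can be sketched in Go as follows. This is only an illustration, not flyscrape's actual implementation: the `Page` and `Browser` interfaces, the `renderHTML` function, and the fake types are hypothetical names introduced here so the sketch runs without a real browser.

```go
package main

import (
	"errors"
	"fmt"
)

// Page and Browser are hypothetical interfaces for this sketch only;
// they are not flyscrape's or go-rod's actual types.
type Page interface {
	WaitLoad() error       // block until the page has finished loading
	HTML() (string, error) // return the rendered document as HTML
}

type Browser interface {
	Open(url string) (Page, error) // navigate to url
}

// renderHTML mirrors the three steps: navigate, wait for load, return HTML.
func renderHTML(b Browser, url string) (string, error) {
	if url == "" {
		return "", errors.New("empty url")
	}
	page, err := b.Open(url)
	if err != nil {
		return "", fmt.Errorf("open %s: %w", url, err)
	}
	if err := page.WaitLoad(); err != nil {
		return "", fmt.Errorf("wait for load: %w", err)
	}
	return page.HTML()
}

// fakePage and fakeBrowser stand in for a real browser in this sketch.
type fakePage struct{ html string }

func (p fakePage) WaitLoad() error       { return nil }
func (p fakePage) HTML() (string, error) { return p.html, nil }

type fakeBrowser struct{}

func (fakeBrowser) Open(url string) (Page, error) {
	return fakePage{html: "<html><body>" + url + "</body></html>"}, nil
}

func main() {
	html, err := renderHTML(fakeBrowser{}, "https://example.com/")
	if err != nil {
		panic(err)
	}
	fmt.Println(html) // the HTML is what the scrape function then receives
}
```

Once `renderHTML` returns, the script in this model has no further access to the browser, which is the limitation the browser control feature is meant to lift.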

The browser control feature should support direct control of the browser, so that it can be used for page interaction and data extraction.

Example code:

export const config = {
  url: 'https://example.com/',
  browserControl: true,
};

export default function ({ doc, browser }) {
  browser.waitPageLoaded();
  browser.waitVisible(".products");

  const productImages = browser.find(".products").map((product) => {
    product.find(".swatches").click();

    return product.find(".product-image").attr("src");
  });

  return {
    productImages,
  };
}
philippta added the enhancement (New feature or request) label Jan 6, 2025
@dejurin
Contributor

dejurin commented Feb 18, 2025

Does that mean we will be able to apply proxies?
I am not very familiar with go-rod, so I apologize if my question seems trivial.

Logic suggests that the browser already is the connection to the host, and that proxies have to be set before the browser is created. In any case, I would like to know whether there are plans to support proxies in browser mode.

func Example_customize_chrome_launch() {
	// Set custom Chrome options; see the docs of launcher.New for more info.
	// Recent go-rod versions expose Must* helpers, which panic on error.
	url := launcher.New().
		Set("proxy-server", "127.0.0.1:8080"). // add a flag; here we set an HTTP proxy
		Delete("use-mock-keychain").           // delete a flag
		MustLaunch()

	browser := rod.New().ControlURL(url).MustConnect()
	defer browser.MustClose()

	// Authenticate against the proxy for the next auth request.
	// Here we use the CLI tool "mitmproxy --proxyauth user:pass" as an example.
	go browser.MustHandleAuth("user", "pass")()

	// mitmproxy needs cert config to support HTTPS, so use HTTP here as an example.
	fmt.Println(browser.MustPage("http://example.com/").MustElement("title").MustText())

	// Skip
	// Output: Example Domain
}

@philippta
Owner Author

philippta commented Feb 18, 2025

The intention behind browser control is to allow scraping dynamic websites, where content only becomes available after, for example, pressing a button, clicking on a tab, or scrolling down. Sometimes sections of a page also load only after a longer period of time, so the script should be able to wait for an element to become visible.

Currently, proxies are not supported when browser mode is enabled.
I can investigate and see if there’s a way to make proxies work with browser mode. However, browser mode should not be used to configure the browser, only to enrich interaction with a page.

I would like to keep things as simple and intuitive as possible, and I don’t think any user would expect to configure browser proxies through the browser object (in the scrape function), given there’s already a well-specified proxies setting in the config object.

@dejurin
Contributor

dejurin commented Feb 18, 2025

I was too hasty. Having now read the go-rod manual in detail, I realize it doesn't work that way. You're right, and I appreciate your simplicity and thoughtfulness.

Apparently the same proxy list (from the config) could be integrated into browser mode. If I understood the go-rod manual correctly, this is possible, as the example above 👆 shows.
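
As a rough sketch of what that integration might involve, one entry from the config's proxy list could be turned into a Chromium `--proxy-server` launch flag before the browser is created, with any credentials split off for a later auth step (such as go-rod's `HandleAuth` from the example above). The function name `proxyLaunchArgs` and the sample proxy URLs are assumptions for illustration; how flyscrape would pick or rotate proxies is not specified in this thread.

```go
package main

import (
	"fmt"
	"net/url"
)

// proxyLaunchArgs converts one proxy URL into a Chromium --proxy-server flag
// and returns any embedded credentials separately, since they have to be
// supplied through an auth handler rather than the flag itself.
// Hypothetical helper, not part of flyscrape or go-rod.
func proxyLaunchArgs(proxy string) (flag, user, pass string, err error) {
	u, err := url.Parse(proxy)
	if err != nil {
		return "", "", "", fmt.Errorf("parse proxy %q: %w", proxy, err)
	}
	if u.Scheme == "" || u.Host == "" {
		return "", "", "", fmt.Errorf("proxy %q needs a scheme and host", proxy)
	}
	if u.User != nil {
		user = u.User.Username()
		pass, _ = u.User.Password()
	}
	return "--proxy-server=" + u.Scheme + "://" + u.Host, user, pass, nil
}

func main() {
	// Sample entries standing in for the proxies setting of the config.
	proxies := []string{
		"http://user:pass@127.0.0.1:8080",
		"http://127.0.0.1:8081",
	}
	for _, p := range proxies {
		flag, user, _, err := proxyLaunchArgs(p)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Println(flag, "needs auth:", user != "")
	}
}
```

Because the flag is applied at launch, each browser instance would be tied to one proxy in this model; switching proxies would mean launching a new browser.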

@dejurin
Contributor

dejurin commented Feb 18, 2025

Back on topic: browser control is a much-needed feature. Looking forward to the implementation.

2 participants