bug: a detected url: "http://<html></html>" (depending on the crawl configuration) #74

Closed
brbog opened this issue Jun 2, 2022 · 0 comments
brbog commented Jun 2, 2022

This bug has probably been present in every version and branch of crawler4j.
I opened a pull request with more explanation and a workaround, documented in the Javadoc of the tests. The workaround could probably also be applied inside edu.uci.ics.crawler4j.parser.Parser, around line 89: test whether getHtml() returns "<html></html>" and skip URL extraction if it does. Looks like a safe bet?
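
A minimal sketch of that guard, to make the idea concrete. This is not the code from the pull request or from Parser: the class `EmptyHtmlGuard` and the method `shouldExtractUrls` are hypothetical names, and trimming the input before comparing is an extra assumption on my side. The caller would pass in whatever getHtml() returned.

```java
// Hypothetical helper, not part of crawler4j: decides whether URL extraction
// should run for the HTML string obtained from getHtml().
final class EmptyHtmlGuard {

    // The empty document that ends up being reported as "http://<html></html>".
    static final String EMPTY_HTML = "<html></html>";

    private EmptyHtmlGuard() {
    }

    // Returns false for null or for the bare "<html></html>" document
    // (surrounding whitespace ignored, which is an assumption), true otherwise.
    static boolean shouldExtractUrls(String html) {
        return html != null && !EMPTY_HTML.equals(html.trim());
    }

    public static void main(String[] args) {
        String empty = "<html></html>";
        String real = "<html><body><a href=\"http://example.com/\">x</a></body></html>";
        System.out.println("extract from empty document: " + shouldExtractUrls(empty));
        System.out.println("extract from real document:  " + shouldExtractUrls(real));
    }
}
```

Keeping the check in one small method would also make it easy to delete again once the wrong extraction no longer happens.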

Feel free to use it however fits the project best. The tests are written in JUnit 5 instead of Spock, but I didn't use mocking frameworks or add other dependencies like that. The test that verifies that the wrong extraction happens should be useful, as it can also signal when it would be safe to remove the "fix code".

@rzo1 rzo1 closed this as completed in 6ea2b78 Jun 7, 2022
@rzo1 rzo1 added the bug label Jun 7, 2022
@rzo1 rzo1 self-assigned this Jun 7, 2022
@rzo1 rzo1 modified the milestones: v4.8.4, v4.9.1 Jun 7, 2022