bug: a detected url: "http://<html></html>" (depending on the crawl configuration) #74

brbog · 2022-06-02T11:43:42Z

This bug has probably been present in every version and branch of crawler4j.
I opened a pull request with a fuller explanation and a workaround, documented in the Javadoc of the tests. The workaround could probably also be applied inside edu.uci.ics.crawler4j.parser.Parser (around line 89): test whether getHtml() returns "<html></html>" and skip URL extraction if it does. Does that look like a safe bet?
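To make the proposed check concrete, here is a minimal sketch of the guard. It is not the code from the pull request and it does not touch crawler4j itself: the class `EmptyHtmlGuard`, its methods, and the `extractor` parameter are hypothetical names used only for illustration. The only thing taken from the issue is the comparison against "<html></html>".

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper for illustration only; not part of crawler4j. */
final class EmptyHtmlGuard {

    // The placeholder markup described in this issue.
    static final String EMPTY_HTML = "<html></html>";

    private EmptyHtmlGuard() {
    }

    /** Returns false when the HTML is missing or is only the empty placeholder. */
    static boolean shouldExtractUrls(String html) {
        return html != null && !EMPTY_HTML.equals(html.trim());
    }

    /**
     * Runs the given extractor (an assumption standing in for the real
     * URL-extraction step) only when the HTML is worth parsing.
     */
    static List<String> extractIfMeaningful(String html,
                                            Function<String, List<String>> extractor) {
        if (!shouldExtractUrls(html)) {
            return Collections.emptyList();
        }
        return extractor.apply(html);
    }
}
```

With a guard like this in front of the extraction step, the placeholder markup yields no outgoing URLs, so "http://<html></html>" would never be scheduled. Whether trimming whitespace before the comparison is appropriate depends on what getHtml() actually returns, which is not confirmed here.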

Feel free to use whatever fits the project best. The tests are written in JUnit 5 instead of Spock, but I didn't use a mocking framework or add any other dependencies. The test that verifies that the wrong extraction happens should be useful, since it also signals when it would be safe to remove the "fix code".


brbog mentioned this issue Jun 7, 2022

bug: a detected url: "http://<html></html>" #72

Merged

rzo1 closed this as completed in 6ea2b78 Jun 7, 2022

rzo1 added the bug label Jun 7, 2022

rzo1 self-assigned this Jun 7, 2022

rzo1 modified the milestones: v4.8.4, v4.9.1 Jun 7, 2022
