Releases: HHN/crawler4j
v4.7.0
Breaking Changes
Robots
- Replaces homebrew robotstxt code with crawler-commons
Normalization
- Replaces homebrew URL normalization with crawler-commons
You now need to pass a BasicURLNormalizer into the PageFetcher and the CrawlController, e.g.
BasicURLNormalizer normalizer = BasicURLNormalizer.newBuilder().idnNormalization(BasicURLNormalizer.IdnNormalization.NONE).build();
Please note that this BasicURLNormalizer can support IdnNormalization.
Dependency Upgrades
- Updates Tika to 2.1.0 (check and update your excludes if you import crawler4j into your own code base)
- Updates Jackson to 2.13.0 (test scope only)
- Updates PostgreSQL driver to 42.3.0 (examples only)
- Updates Flyway to 8.0.1 (examples only)
- Updates Guava to 31.0.1-jre
- Updates Groovy to 3.0.9 (test scope only)
Additional Notes
Full Changelog: v4.6.0...v4.7.0
v4.6.0
v4.5.1
Breaking Changes
- Updates Bytecode Level to Java 11
Dependency Upgrades
- Updates Tika to 1.26
- Updates Log4J2 to 2.14.1
- Updates URL detector to 0.1.23 and switches to https://github.com/URL-Detector/URL-Detector
v4.5.0
First release of my crawler4j fork.
It includes some breaking changes:
- New module structure
- New groupId to be able to deploy to Maven Central
- New artifactIds for core artifacts, see README.md
Other changes:
- Frontier (database) abstraction layer
- First draft of an HSQLDB-based frontier implementation