-
Notifications
You must be signed in to change notification settings - Fork 260
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WARC format improvements #691
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sebastian-nagel, quite a chunky PR - I'll have a closer look at its various parts later and give it some serious testing but it looks good on the whole.
@Override | ||
public Response intercept(Interceptor.Chain chain) throws IOException { | ||
|
||
String startFetchTime = DateTimeFormatter.ISO_INSTANT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't that the same as Instant.now().toString()? the formatter used is
https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_INSTANT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like, although I don't like the toString() methods which often implicitly assumes some localization. It's also not the best design to do the date formatting defined by the WARC format in the protocol layer. But re-parsing the date from a String is also not ideal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the protocol could store the value as a long and let the WARC code format it whichever way it wants e.g. using Instant.ofEpochSecond(long epochSecond)
// get actual fetch time from metadata if any | ||
String captureTime = metadata.getFirstValue("_request.time_"); | ||
if (captureTime == null) { | ||
Date now = new Date(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use Instant instead of Date? Would mean changing in the rest of this class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. It would actually only change these lines.
external/warc/src/main/java/com/digitalpebble/stormcrawler/warc/WARCRecordFormat.java
Show resolved
Hide resolved
@sebastian-nagel could you update the README for the WARC module e.g. show how to set withRequestRecords() in Java topology and Flux? |
Yes, I'll do and also address the other comments regarding the fetch date. After a second check also this should be addressed:
|
- use okhttp's internal byte buffer, do not buffer content twice - store server IP address if HTTP headers are saved - optionally keep partially fetched content - track reason for trimmed content (content or time limit reached, network disconnect)
(http.store.headers == true)
- use capture time (when fetching began) for WARC-Date - add WARC-IP-Address
- enable WARCHdfsBolt/GzipHdfsBolt to write multiple gzip container `format(tuple)` - implement WARCRequestRecordFormat to be written before response record - request records are enabled by calling `withRequestRecords()` on an already created WARCHdfsBolt
between HTTP header and payload
and Transfer-Encoding) which may cause errors when processing WARC records with uncompressed, unchunked content
- define keys to hold HTTP headers, IP address and capture time in ProtocolResponse metadata as constants - add WARC-IP-Address to WARC request records - store start of fetch time as epoch millis in ProtocolResponse metadata - format WARC-Date with precision in seconds to meet WARC 1.0 spec - use new date and time API for processing dates
- add withRequestRecords - complete configuration via Flux - add note that currently only the okhttp protocol implementation stores all metadata required to write complete WARC records
5c1b7ce
to
0d10188
Compare
Rebased PR and addressed all discussed points. Also tested the WARC bolt via Flux, including passing the warcinfo map and configuring the bolt to use the RawLocalFileSystem. |
} | ||
|
||
class HTTPHeadersInterceptor implements Interceptor { | ||
|
||
final ZoneId TIME_ZONE_UTC = ZoneId.of(ZoneOffset.UTC.toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, it's removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks - could you remove the imports as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, will test it on a crawl mostly to check that the changes to the protocol implementation do not have a negative impact
@sebastian-nagel could we have http.content.partial.as.trimmed as well as any other new config in the default config file with some comments to document it? The default config file acts as documentation in a way. |
- document `http.content.partial.as.trimmed` in crawler-default.yaml
Done. Btw, crawler-default.yaml mentions |
thanks @sebastian-nagel that was part of the initial protocol from Nutch, will remove it |
thanks for this contrib @sebastian-nagel |
value = new String(Base64.getDecoder().decode(value)); | ||
} | ||
|
||
responsemetadata.addValue(key.toLowerCase(Locale.ROOT), value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sebastian-nagel not normalising the keys to lowercase has serious implications on the subsequent steps and is also incompatible with what the other implementation produces. What problem were you trying to solve here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a mistake - given that this was together with moving custom headers to constants (being lowercase).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a closer look: that's a regression. I see: HTTP header names are required in lowercase everywhere else. Sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no problems, I should have spotted it when merging. Thanks for confirming!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it would break some other things, eg. last-modified handling. Maybe add a comment?
Btw., the source.buffer()
calls are deprecated, should be getBuffer()
now.
…ercase http headers in okhttp
This PR combines a list of improvements to the WARC writer bolt and record formats but also to the okhttp-based protocol implementation:
withRequestRecords()
on an existing WARCHdfsBolt.WARC-Date
http.content.partial.as.trimmed
is trueWARC-Truncated
field (address WARC file format improvement: add WARC-Truncated header commoncrawl/news-crawl#31)X-Crawler
. Address WARC file format fix: mask HTTP header fields Content-Encoding and Transfer-Encoding, adjust Content-Length commoncrawl/news-crawl#30.I can split the PR into multiple one if required.
The result WARC files have been checked and validated using warcio's new check utility (webrecorder/warcio#68). An example request/response record pair with all changes applied: