Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WARC format improvements #691

Merged
merged 12 commits into from
Mar 23, 2019
Merged

Conversation

sebastian-nagel
Copy link
Contributor

This PR combines a list of improvements to the WARC writer bolt and record formats but also to the okhttp-based protocol implementation:

I can split the PR into multiple one if required.

The result WARC files have been checked and validated using warcio's new check utility (webrecorder/warcio#68). An example request/response record pair with all changes applied:

WARC/1.0
WARC-Type: request
WARC-Record-ID: <urn:uuid:8dfc0435-7417-4b4c-8189-705d766a3b16>
Content-Length: 377
WARC-Date: 2019-03-13T13:56:00.953Z
WARC-Target-URI: https://www.theguardian.com/politics/2019/mar/13/theresa-may-confirms-she-will-vote-to-block-no-deal-brexit-pmqs-jeremy-corbyn
Content-Type: application/http; msgtype=request
WARC-Block-Digest: sha1:KVPT2QYXP5FMSI542GRPAG4VXOUVCNRC

GET /politics/2019/mar/13/theresa-may-confirms-she-will-vote-to-block-no-deal-brexit-pmqs-jeremy-corbyn 
User-Agent: CCBot/3.0 (http://commoncrawl.org/faq/; info@commoncrawl.org)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
Host: www.theguardian.com
Connection: Keep-Alive
Accept-Encoding: gzip



WARC/1.0
WARC-Record-ID: <urn:uuid:203780a2-dcb8-49a7-973e-1d88b125f587>
WARC-Concurrent-To: <urn:uuid:8dfc0435-7417-4b4c-8189-705d766a3b16>
Content-Length: 1041736
WARC-Date: 2019-03-13T13:56:00.953Z
WARC-Type: response
WARC-IP-Address: 151.101.113.111
WARC-Target-URI: https://www.theguardian.com/politics/2019/mar/13/theresa-may-confirms-she-will-vote-to-block-no-deal-brexit-pmqs-jeremy-corbyn
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha1:YWXUX773QVKHQPO4MQXAFFBOH7IIYIA4
WARC-Block-Digest: sha1:DY6IIMGARYQGPERMPEUT6RUB3VZRQF76

HTTP/1.1 200 OK
X-Crawler-Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
ETag: W/"hash-826299181980705017"
Link: <https://assets.guim.co.uk/stylesheets/ed6dfcab2e0ddab1c51867a93921f6ca/content.garnett.css>; rel=preload; as=style; nopush,<https://assets.guim.co.uk/polyfill.io/v3/polyfill.min.js?rum=0&features=es6,es7,es2017,default-3.6,HTMLPictureElement,IntersectionObserver,IntersectionObserverEntry&flags=gated&callback=guardianPolyfilled&unknown=polyfill>; rel=preload; as=script; nopush,<https://assets.guim.co.uk/javascripts/ff1648a1c5e6beec11e0/graun.standard.js>; rel=preload; as=script; nopush,<https://assets.guim.co.uk/javascripts/9a2a60d293fd119118cf/graun.commercial.js>; rel=preload; as=script; nopush
X-Crawler-Content-Length: 164346
Content-Length: 1040002
Accept-Ranges: bytes
Date: Wed, 13 Mar 2019 13:56:00 GMT
Age: 36
Connection: keep-alive
Set-Cookie: GU_mvt_id=186384; expires=Tue, 11 Jun 2019 13:56:00 GMT; path=/; domain=.theguardian.com; Secure
X-Timer: S1552485361.961381,VS0,VE0
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Content-Security-Policy: default-src https:; script-src https: 'unsafe-inline' 'unsafe-eval'; style-src https: 'unsafe-inline'; img-src https: data: blob:; media-src https: data: blob:; font-src https: data:; connect-src https: wss:
Referrer-Policy: no-referrer-when-downgrade
Feature-Policy: camera 'none'; microphone 'none'; midi 'none'
X-GU-Edition: int
Cache-Control: max-age=60, stale-while-revalidate=6, stale-if-error=864000, private
Vary: Accept-Encoding,User-Agent
Set-Cookie: GU_geo_continent=EU; path=/; Secure


<!DOCTYPE html>
<html id="js-context" class="js-off is-not-modern id--signed-out" lang="en" data-page-path="/politics/2019/mar/13/theresa-may-confirms-she-will-vote-to-block-no-deal-brexit-pmqs-jeremy-corbyn">
<head>
<!--
     __        __                      _     _      _
     \ \      / /__    __ _ _ __ ___  | |__ (_)_ __(_)_ __   __ _
      \ \ /\ / / _ \  / _` | '__/ _ \ | '_ \| | '__| | '_ \ / _` |
       \ V  V /  __/ | (_| | | |  __/ | | | | | |  | | | | | (_| |
        \_/\_/ \___|  \__,_|_|  \___| |_| |_|_|_|  |_|_| |_|\__, |
                                                            |___/
    Ever thought about joining us?
    https://workforus.theguardian.com/careers/digital-development/
     --->
<title>Theresa May confirms she will vote to block no-deal Brexit | Politics | The Guardian</title>
<meta charset="utf-8">

Copy link
Contributor

@jnioche jnioche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sebastian-nagel, quite a chunky PR - I'll have a closer look at its various parts later and give it some serious testing but it looks good on the whole.

@Override
public Response intercept(Interceptor.Chain chain) throws IOException {

String startFetchTime = DateTimeFormatter.ISO_INSTANT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't that the same as Instant.now().toString()? the formatter used is
https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_INSTANT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like, although I don't like the toString() methods which often implicitly assumes some localization. It's also not the best design to do the date formatting defined by the WARC format in the protocol layer. But re-parsing the date from a String is also not ideal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the protocol could store the value as a long and let the WARC code format it whichever way it wants e.g. using Instant.ofEpochSecond(long epochSecond)

// get actual fetch time from metadata if any
String captureTime = metadata.getFirstValue("_request.time_");
if (captureTime == null) {
Date now = new Date();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use Instant instead of Date? Would mean changing in the rest of this class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. It would actually only change these lines.

@jnioche
Copy link
Contributor

jnioche commented Mar 15, 2019

@sebastian-nagel could you update the README for the WARC module e.g. show how to set withRequestRecords() in Java topology and Flux?

@sebastian-nagel
Copy link
Contributor Author

Yes, I'll do and also address the other comments regarding the fetch date. After a second check also this should be addressed:

  • add IP address also to request records
  • the WARC-Date should be only with a precision in seconds to be compatible with v1.0 of the WARC standard (1.1 allows a higher precision)

@jnioche jnioche added this to the 1.14 milestone Mar 18, 2019
- use okhttp's internal byte buffer, do not buffer content twice
- store server IP address if HTTP headers are saved
- optionally keep partially fetched content
- track reason for trimmed content (content or time limit reached,
  network disconnect)
- use capture time (when fetching began) for WARC-Date
- add WARC-IP-Address
- enable WARCHdfsBolt/GzipHdfsBolt to write multiple gzip container
  `format(tuple)`
- implement WARCRequestRecordFormat to be written before response record
- request records are enabled by calling `withRequestRecords()` on an
  already created WARCHdfsBolt
and Transfer-Encoding) which may cause errors when processing WARC
records with uncompressed, unchunked content
- define keys to hold HTTP headers, IP address and capture time
  in ProtocolResponse metadata as constants
- add WARC-IP-Address to WARC request records
- store start of fetch time as epoch millis in ProtocolResponse metadata
- format WARC-Date with precision in seconds to meet WARC 1.0 spec
- use new date and time API for processing dates
- add withRequestRecords
- complete configuration via Flux
- add note that currently only the okhttp protocol implementation
  stores all metadata required to write complete WARC records
@sebastian-nagel
Copy link
Contributor Author

Rebased PR and addressed all discussed points. Also tested the WARC bolt via Flux, including passing the warcinfo map and configuring the bolt to use the RawLocalFileSystem.

}

class HTTPHeadersInterceptor implements Interceptor {

final ZoneId TIME_ZONE_UTC = ZoneId.of(ZoneOffset.UTC.toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it's removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks - could you remove the imports as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks!

Copy link
Contributor

@jnioche jnioche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, will test it on a crawl mostly to check that the changes to the protocol implementation do not have a negative impact

@jnioche
Copy link
Contributor

jnioche commented Mar 22, 2019

@sebastian-nagel could we have http.content.partial.as.trimmed as well as any other new config in the default config file with some comments to document it? The default config file acts as documentation in a way.

- document `http.content.partial.as.trimmed` in crawler-default.yaml
@sebastian-nagel
Copy link
Contributor Author

Done. Btw, crawler-default.yaml mentions http.store.responsetime which seems unused (not implemented).

@jnioche
Copy link
Contributor

jnioche commented Mar 22, 2019

thanks @sebastian-nagel that was part of the initial protocol from Nutch, will remove it

@jnioche jnioche merged commit 613c537 into apache:master Mar 23, 2019
@jnioche
Copy link
Contributor

jnioche commented Mar 23, 2019

thanks for this contrib @sebastian-nagel

@sebastian-nagel sebastian-nagel deleted the sc-warc-improvements branch March 25, 2019 10:44
value = new String(Base64.getDecoder().decode(value));
}

responsemetadata.addValue(key.toLowerCase(Locale.ROOT), value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sebastian-nagel not normalising the keys to lowercase has serious implications on the subsequent steps and is also incompatible with what the other implementation produces. What problem were you trying to solve here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a mistake - given that this was together with moving custom headers to constants (being lowercase).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a closer look: that's a regression. I see: HTTP header names are required in lowercase everywhere else. Sorry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no problems, I should have spotted it when merging. Thanks for confirming!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it would break some other things, eg. last-modified handling. Maybe add a comment?
Btw., the source.buffer() calls are deprecated, should be getBuffer() now.

jnioche added a commit that referenced this pull request Apr 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants