Skip to content

Latest commit

 

History

History
185 lines (137 loc) · 6.46 KB

INTRO.mdwn

File metadata and controls

185 lines (137 loc) · 6.46 KB

Arkimet

Introduction

Arkimet is a set of tools to archive, dispatch and distribute data files.

Features

General

  • All arkimet functionality besides metadata extraction and dataset recovery is file format agnostic
  • Support for new metadata types, match functions and file formats can be added with plugins
  • Data is treated like an opaque, read only binary string, never modified to guarantee integrity
  • Archive data files are only accessed using append operations, to avoid the risk of accidentally corrupting existing data
  • Metadata is archived in a compact, binary representation to save space, but it can be handled in easy to read, easy to parse, standard formats.

Metadata

  • The extraction of metadata is flexible and customisable, and for generic formats like GRIB or HDF5 it can be scripted using Python
  • Metadata is stored in a compact binary representation to save space, but can be converted to standard formats when required
  • Metadata can have timestamped annotations to track its workflow
  • Metadata can be summarised to allow to know what data can be found in a big dataset without scanning its contents.

Remote access

  • Remote data access is provided through arki-server, an HTTP server application
  • arki-server can serve data from local datasets, as well as from remote datasets served by other arki-server instances (this allows, for example, to provide a single arki-server external front-end to various internal arki-servers in an organisation)
  • arki-server can be run behind apache mod-proxy to provide encrypted (SSL) or authenticated access
  • Client data access is done using libCURL, and can access the server over SSL or HTTP proxies
  • Summary of query results can be obtained quickly before transfering the result data]]
  • Postprocessing chains can be provided by the server to transfer only the postprocessed data (e.g. an average value instead of a big grid)

Archive

  • File layout can be customised depending on data volumes (one file per day, one file per month, etc.)
  • It is possible to select what metadata is indexed for faster queries
  • It is possible to know what metadata fields are used to identify a data item for the purpose of catching duplicate inserts
  • It is possible to insert, replace or remove items from a dataset
  • Datasets are self-contained, so it is possible to store the dataset in offline media, and query it right away as soon as the offline media is mounted online
  • Dataset summaries are small, can easily be shared across the Internet and can be used to keep track of offline datasets
  • Dataset summaries can be indexed in a second level index that implements complex data manipulations like spatial queries

User interfaces

  • A powerful and flexible suite of commandline tools allows to easily integrate arkimet into automated data processing chains.
  • arki-server not only allows remote access to the datasets, but it also provides a low-level, web-based query interface
  • ArkiMEOW is a web-based front-end to arkimet that provides simple and powerful browsing and data retrieval for end users.

Factsheet

The metadata currently supported are:

  • Reference time
  • Origin
  • Product
  • Level
  • Time Range
  • Area
  • Product definition information

The formats used to handle metadata are:

  • Arkimet-specific compact binary representation
  • Human-readable, easy to parse YAML representation http://www.yaml.org/
  • XML/RDF can be implemented as soon as a suitable standard is defined

The file formats that can be scanned are:

  • WMO GRIB (edition 1 and edition 2), any template can be supported via eccodes and Python scripts
  • WMO BUFR (edition 3 and edition 4), via DB-All.e
  • More can easily be added via plugins

The match expressions currently available are:

  • Exact or partial match on any types of metadata

  • Advanced matching of reference times:

     # Open or close intervals
     reftime:>=2005
     reftime:>=2005,<2007
     # Exact years, month, days, hours, etc.
     reftime:=2007
     reftime:=2007-05
     # Time extremes can vary in precision
     reftime:>=2005-06-01 12:00
     # Also interval of time of day can be specified
     reftime:>=12:00,<18:00
     # And repetitions during the day
     reftime:>=2007-06%6h
     reftime:=2007,>=12:00/30m,<18:00
    
  • Configurable aliases of subexpressions for easy recall via mnemonic names

Licensing and acquisition

Software dependencies

Examples

To extract metadata:

$ arki-scan --dump --annotate file.grib

To create a summary of the content of many GRIB files:

$ arki-scan --summary *.grib > summary

To view the contents of the generated summary file:

$ arki-dump --yaml summary

To select GRIB messages from a file:

$ arki-scan file.grib | arki-grep --data 'origin:GRIB1,98;reftime:>=2008' > file1.grib

To dispatch grib messages into datasets:

$ arki-mergeconf datasets/* > allowed-datasets
$ arki-scan --dispatch=allowed-datasets file.grib > dispatched.md

To query:

$ arki-query 'origin:GRIB1,98;reftime:>=2008' datasets/* --data

To get a preview of the results without performing the query:

$ arki-query 'origin:GRIB1,98;reftime:>=2008' datasets/* --summary --yaml --annotate

To export some datasets for remote access:

$ arki-mergeconf datasets/* > allowed-datasets
$ arki-server --port=8080 allowed-datasets

To query a remote dataset:

$ arki-mergeconf http://host.name > allowed-datasets
$ arki-query 'origin:GRIB1,98;reftime:>=2008' allowed-datasets --data