Arkimet is a set of tools to archive, dispatch and distribute data files.
- All arkimet functionality besides metadata extraction and dataset recovery is file format agnostic
- Support for new metadata types, match functions and file formats can be added with plugins
- Data is treated like an opaque, read only binary string, never modified to guarantee integrity
- Archive data files are only accessed using append operations, to avoid the risk of accidentally corrupting existing data
- Metadata is archived in a compact, binary representation to save space, but it can be handled in easy to read, easy to parse, standard formats.
- The extraction of metadata is flexible and customisable, and for generic formats like GRIB or HDF5 it can be scripted using Python
- Metadata is stored in a compact binary representation to save space, but can be converted to standard formats when required
- Metadata can have timestamped annotations to track its workflow
- Metadata can be summarised to allow to know what data can be found in a big dataset without scanning its contents.
- Remote data access is provided through arki-server, an HTTP server application
- arki-server can serve data from local datasets, as well as from remote datasets served by other arki-server instances (this allows, for example, to provide a single arki-server external front-end to various internal arki-servers in an organisation)
- arki-server can be run behind apache mod-proxy to provide encrypted (SSL) or authenticated access
- Client data access is done using libCURL, and can access the server over SSL or HTTP proxies
- Summary of query results can be obtained quickly before transfering the result data]]
- Postprocessing chains can be provided by the server to transfer only the postprocessed data (e.g. an average value instead of a big grid)
- File layout can be customised depending on data volumes (one file per day, one file per month, etc.)
- It is possible to select what metadata is indexed for faster queries
- It is possible to know what metadata fields are used to identify a data item for the purpose of catching duplicate inserts
- It is possible to insert, replace or remove items from a dataset
- Datasets are self-contained, so it is possible to store the dataset in offline media, and query it right away as soon as the offline media is mounted online
- Dataset summaries are small, can easily be shared across the Internet and can be used to keep track of offline datasets
- Dataset summaries can be indexed in a second level index that implements complex data manipulations like spatial queries
- A powerful and flexible suite of commandline tools allows to easily integrate arkimet into automated data processing chains.
- arki-server not only allows remote access to the datasets, but it also provides a low-level, web-based query interface
- ArkiMEOW is a web-based front-end to arkimet that provides simple and powerful browsing and data retrieval for end users.
The metadata currently supported are:
- Reference time
- Origin
- Product
- Level
- Time Range
- Area
- Product definition information
The formats used to handle metadata are:
- Arkimet-specific compact binary representation
- Human-readable, easy to parse YAML representation http://www.yaml.org/
- XML/RDF can be implemented as soon as a suitable standard is defined
The file formats that can be scanned are:
- WMO GRIB (edition 1 and edition 2), any template can be supported via eccodes and Python scripts
- WMO BUFR (edition 3 and edition 4), via DB-All.e
- More can easily be added via plugins
The match expressions currently available are:
-
Exact or partial match on any types of metadata
-
Advanced matching of reference times:
# Open or close intervals reftime:>=2005 reftime:>=2005,<2007 # Exact years, month, days, hours, etc. reftime:=2007 reftime:=2007-05 # Time extremes can vary in precision reftime:>=2005-06-01 12:00 # Also interval of time of day can be specified reftime:>=12:00,<18:00 # And repetitions during the day reftime:>=2007-06%6h reftime:=2007,>=12:00/30m,<18:00
-
Configurable aliases of subexpressions for easy recall via mnemonic names
Licensing and acquisition
- arkimet is Free Software, licensed under the GNU General Public License version 2 or later.
- the source is available at http://www.arpa.emr.it/sim/?arkimet
Software dependencies
- arkimet is written in C++, and it regularly compiles and passes the test suite using GCC version 3.3.3, 4.1.2 and 4.2.3.
- It should run on any reasonable Unix system, and has been tested on Debian, RedHat and Suse Linux systems.
- Python version 3.6 or later is required for the customisable logic and GRIB import
- ECMWF grib_api is required for GRIB (edition 1 and 2) support http://www.ecmwf.int/products/data/software/grib_api.html
- SQLite is optionally required for indexed datasets http://www.sqlite.org/
- CURL is optionally required to access remote datasets http://curl.haxx.se/
- python and python-cherrypy are optionally required to serve datasets over HTTP http://www.python.org/ http://www.cherrypy.org/
- DB-All.e version 4.0 or later is optionally required to enable BUFR support http://www.arpa.emr.it/dettaglio_documento.asp?id=514&idlivello=64
To extract metadata:
$ arki-scan --dump --annotate file.grib
To create a summary of the content of many GRIB files:
$ arki-scan --summary *.grib > summary
To view the contents of the generated summary file:
$ arki-dump --yaml summary
To select GRIB messages from a file:
$ arki-scan file.grib | arki-grep --data 'origin:GRIB1,98;reftime:>=2008' > file1.grib
To dispatch grib messages into datasets:
$ arki-mergeconf datasets/* > allowed-datasets
$ arki-scan --dispatch=allowed-datasets file.grib > dispatched.md
To query:
$ arki-query 'origin:GRIB1,98;reftime:>=2008' datasets/* --data
To get a preview of the results without performing the query:
$ arki-query 'origin:GRIB1,98;reftime:>=2008' datasets/* --summary --yaml --annotate
To export some datasets for remote access:
$ arki-mergeconf datasets/* > allowed-datasets
$ arki-server --port=8080 allowed-datasets
To query a remote dataset:
$ arki-mergeconf http://host.name > allowed-datasets
$ arki-query 'origin:GRIB1,98;reftime:>=2008' allowed-datasets --data