The code for our leapfrog implementation for Apache Jena is available here
-
Java 11
apt install openjdk-11-jdk
-
any x64 linux distribution with glib support
-
Docker
-
- On a debian-based distro:
sudo apt install bzip2
Some of the following steps can take hours to complete, so we recommend using tmux to execute them.
- On a debian-based distro:
-
Clone this repository. if you use ssh keys, use:
git clone git@gitlab.com:learnedrdf/benchmark.git
otherwise
git clone https://gitlab.com/learnedrdf/benchmark.git
-
Download the dataset used and move it to the
benchmark
folder -
Or you can ask administrators to get access to full English-only Wikidata dataset.
-
Extract it
bzip2 -d wikidata-filtered.nt.bz2
-
Or you can construct the dataset from the latest truthy wikidata dump
-
Pull the ubuntu image
docker pull ubuntu
-
Download the files apache-jena-3.17.0.tar.gz from Apache Jena downloads page and move it into
jena
folder -
Change directory into
jena
folder -
Extract it
tar -xf apache-jena-3.17.0.tar.gz
-
Open a containerized environment
docker run -v ${PWD}/../:/srv -i -t ubuntu /bin/bash
-
Change directory to
srv/jena
-
Upgrade packages
apt update
-
Install JDK 11
apt install openjdk-11-jdk -y
-
Install the Java runtime environment
apt install default-jre -y
-
Create the database for Jena and Jenaclone
apache-jena-3.17.0/bin/tdbloader2 --loc=db/jena ../wikidata-filtered.nt
-
Edit the file
apache-jena-3.9.0/bin/tdbloader2index
and go to the line followinggenerate_index "$K3 $K1 $K2" "$DATA_TRIPLES" OSP
-
Add the following lines
generate_index "$K1 $K3 $K2" "$DATA_TRIPLES" SOP generate_index "$K2 $K1 $K3" "$DATA_TRIPLES" PSO generate_index "$K3 $K2 $K1" "$DATA_TRIPLES" OPS
-
Create the database for the leapfrog impementation
apache-jena-3.17.0/bin/tdbloader2 --loc=db/leapfrog ../wikidata-filtered.nt
- Build the Jenaclone project using the
mvn install
Maven lifecycle, as specified in the documentation of the Jenaclone repository - From the root directory of Jenaclone, copy the file
jena-fuseki2/jena-fuseki-server/target/jena-fuseki-server-3.17.0
intobenchmark/jena/jars
of the benchmark root directory - Make sure the Jenaclone .jar file is named
fuseki-jenaclone.jar
- Change directory into the root directory of this repository
docker build -t jenaclone_benchmark .
- Run the container
docker run \ --rm \ -v ${PWD}/benchmark/:/benchmark \ jenaclone_benchmark
Now the results are available in the folders queries/*/output
You can find our results in the results folder. For each query pattern you will find a folder containing two files, one for the Original Apache Jena and for for our modification of Apache Jena. Each line of a file contains three values separated by a semicolon: queryNumber;numberOfResutls;executionTimeInNanoseconds
After running the benchmark and the results can be found in benchmark/benchmark/queries/*/output
, move them to the corresponding folder in results
.
Convert the benchmark results to .csv files
python3 to_csv.py
This script requiret matplotlib, which can be install with
pip3 install matplotlib
Now, to run the analysis tool, run the command
python3 analysis.py bgps optionals existence_check
You can specify youself the folder containing results you want an analysis of. For example the following command will only provide an analysis of the optional query benchmark results
python3 analysis.py optionals
To print the query benchmark running times along their query with intermediate result set sizes, run the command
python3 result_size_analysis.py <RDF_DATASET> <RESULT_FOLDER1_> {RESULT_FOLDER_N}
<RDF_DATASET>
is the dataset file to compute intermediate result set sizes. Make sure this is the same dataset file used to load the Jena triplestore used in the benchmark results. Following arguments are benchmark result folders, just like when running the analysis.py
tool. At least one folder is required.
Even though the command seems to load the data into TDB2 instead of TDB1, it is not true. Therefore, make sure not to add the --tdb2
flag when running the server, since this would make the server return empty result sets because there is no TDB2 data.
According to the Apache Jena documentation, tdbloader translates into tdb1.xloader, where the prefix determines which version of TDB to use.