If you’re already familiar with the platform, you can probably skip this high-level overview:
Dynabench is organized around tasks (e.g. NLI or QA). Each task can have one or more owners, who have control over how the task is set up. For example, task owners can specify which metrics and datasets they want to use in the evaluation-as-a-service framework for evaluating models. Anyone can upload models to the evaluation cloud, where scores for metrics are computed on the selected datasets. Once those models have been uploaded and evaluated, they can be put in the loop for dynamic data collection and human-in-the-loop evaluation. Task owners can collect data via the Web interface on dynabench.org, or with crowd workers on e.g. Mechanical Turk by using the Mephisto crowdsourcing tool suite.
One crucial aspect of Dynabench is that it can be used to evaluate current models with humans in the loop. Rather than having a static test set, we can have actual people try to find weaknesses in existing models, either by actively trying to fool them, or by identifying weaknesses in natural interactions. Alignment of AI models with real humans is crucial, especially as models are deployed in more and more critical places in the world. The data that comes out of this process can be used to train stronger models, or at the very least can help identify ways to construct better models.
Suppose there was no Natural Language Inference task yet, and we wanted to start one. We would log onto our Dynabench account, and on our profile page fill out the "Request new task" form. Once approved, we would have a dedicated task page that we control, as the task owner, via the corresponding task owner page. The first thing we would do is choose existing datasets that we want to evaluate models on when they are uploaded, along with the metrics we want to use for evaluation. Next, we would submit baseline models, or ask the community to submit them. If we then wanted to collect a new round of dynamic adversarial data, where workers are asked to create examples that the models get wrong, we can upload new contexts to the system and start collecting data. Once we have enough data and if we find that training on the data helps improve the system, we can upload better models and put those in the data collection loop again so as to build even stronger models. The same basic process was used for constructing data sets like Adversarial Natural Language Inference and Beat the AI. Our tools make it easy for anyone to construct datasets with humans and models in the loop.
This manual covers the following:
- Owners
- Datasets
- Rounds
- Contexts
- Model evaluation
- Collecting data
- Task configs
- Frequently Asked Questions
This manual describes how Task Owners can configure their tasks through the task owner interface. To go to the task owner interface for your task, click on the "gear" icon in the top right-hand corner of your task page. Here's a useful checklist to go through the whole process of creating and setting up a dynatask:
- Submit your task proposal
- Configure your task config object
- Add other task owners, if applicable
- Add contexts.
- Add your relevant (leaderboard/non-leaderboard) datasets
- Submit at least one model for evaluation
- Put one or more models in the loop for data collection, if desired.
You can now collect new data, or if you like just use Dynabench as a leaderboard platform and start collecting data against your state-of-the-art models at a later stage.
Every task can have multiple owners. The owners and their affiliations are displayed on the task page. Every task needs to have at least one registered owner on the Dynabench platformn. Owners can be added/removed under the Owners tab in the task owner interface.
Every task has two main types of datasets: leaderboard and non-leaderboard datasets. The former show up on the leaderboard and are used for computing any aggregation metrics, such as the dynascore. The latter are only shown on the individual model page. Concrete examples would be: test sets would be leaderboard datasets, while dev sets, checklists and stress test sets would in most cases be non-leaderboard datasets.
Under the datasets tab in the task owners interface, you will be able to add, modify and remove new datasets for evaluation. New datasets are uploaded in a jsonl
format, with fields corresponding to the fields defined in your task config, as well as a uid
field designating a unique identifier for the given example. For example, the first two lines for a classification task could read:
{"uid": "1", "text": "Hello world", "label": 1}
{"uid": "2", "text": "Foo bar", "label": 0}
If your task supports fairness and robustness evaluations, for every dataset you upload, you can also upload perturbed versions that will be used for the fairness and robustness metrics. In order to generate perturbations, use the tools in the evaluation/scripts
directory.
Tasks can collect data over multiple rounds. A round of data collection usually (but not necessarily) has the same model(s) in the loop, and once finished usually becomes a leaderboard evaluation dataset. Task owners can export round data. Under the rounds tab in the task owners interface, you will be able to configure existing rounds, as well as to start new ones. Here, for example, you would determine what models to use in the loop, what contexts to collect data for, and with what settings.
Most tasks on Dynabench involve contextualized data collection: a question is written for a given passage or image, a hypothesis is written for a given premise, et cetera. For every round in your task, you can add contexts under the rounds tab by clicking the "Upload Contexts" section. The format for a context corresponds to the fields defined in your task config. Optionally, you can specify tag
and filter for these tags if you customize your data collection frontend (e.g. when you want to collect Mechanical Turker data for a subset of your full dataset). Note that the tag
key must be in your uploaded contexts, but you can set it to null
. The format for the contexts is jsonl
. For example, the first two lines for a classification task could read:
{"context": {"statement": "Lorem ipsum dolor et cetera"}, "tag": null, "metadata": {}}
{"context": {"statement": "Some other context"}, "tag": null, "metadata": {}}
Most tasks on Dynabench involve models in the loop. Models are uploaded for evaluation via Dynalab and follow the input/output fields defined in your task config. They are evaluated on all datasets. You can set the current models in the loop for every round under the rounds tab. For all models, users are encouraged to submit detailed model cards and provide as many details as they can. Only "published" models are shown on the task leaderboard. Task owners are encouraged to submit their own baselines shortly after task creation.
Data collection consists of two components: creation and validation. In the former, new examples are created by Dynabench visitors or crowdworkers. In the latter, existing examples are validated to ensure correctness. In both cases, interfaces exist both for regular web-based users as well as for Mechanical Turk workers, via Mephisto. Task owners can configure how exactly examples should be validated (e.g., how many validations are needed to consider an example fully validated, etc).
Data collection happens for the currently active round. Throughout the data collection process, data can be exported via the task owner page and inspected at will. Scripts are provided for inspecting data and paying crowd workers bonuses.
The input/output format for a task, both in terms of what the model expects and how the annotator frontend behaves, are configured in the task config under the Advanced tab in the task owners interface. Here are examples of annotator configs for NLI, QA, Sentiment Analysis, Hate Speech Detection, and Machine Translation:
We recommend that you look through the examples before reading further. The task config can have the following fields:
-
goal_message
(optional; default is no message): a string that is displayed at the top of the create interface as a goal for annotators. See an example -
content_warning
(optional; default is no message): a string that is displayed at the top of the validate interface as a warning to validators. See an example -
model_wrong_metric
(optional; default isask_user
): This defines whether a person fooled a model. There are three options to choose from:exact_match
,string_f1
, andask_user
:-
exact_match
requiresreference_names
as an argument.reference_names
defines which outputs the metric should use. -
string_f1
requiresreference_name
as an argument.reference_name
defines which output the metric should use. Thethreshold
is a number between 0 and 1 and is also a required argument. -
ask_user
does not require any arguments. It tells the create interface to ask the annotator whether the model was fooled or not.
-
-
aggregation_metric
(optional; default isdynascore
): This is a string that defines how the leaderboard ranks models. There is only one aggregation metric to choose:dynascore
. -
perf_metric
: This is the metric that is used to evaluate the performance of an uploaded model. There are six to choose from:macro_f1
,squad_f1
,accuracy
,bleu
,sp_bleu
,vqa_accuracy
. They all requirereference_name
as an argument, which specifies which model output to use to compute the score. The score is computed by comparing the model output keyed byreference_name
with the gold output keyed byreference_name
in an uploaded dataset. If you are usingsquad_f1
,accuracy
, orvqa
accuracy, you have the option of uploading datasets with a list of candidates keyed byreference_name
instead of just one. For example:Model output for an example: {"answer": "the sky is blue"} Example in the dataset: {"question": "What color is the sky?", "answer": ["the sky is blue", "the sky is light blue", "the sky is cerulean"]}
-
delta_metrics
(optional; default is no delta metrics): A delta metric returns the value that the perf metric would return, if it was run on a perturbed version of the dataset. There are two to choose from:fairness
androbustness
; neither require any arguments. If you select either of these delta_metrics, then any datasets that you upload for your task must be accompanied by the corresponding perturbed versions. We have released the tools that we use for perturbing Hate Speech, Question Answering, Natural Language Inference, and Sentiment Analysis datasets in theevaluation/scripts
directory. -
context
,input
,output
(optional; default is no context, input, output): These are lists of objects that define the I/O for a task. There must be at least one object in each list for your task to be valid. This means that all dynabench tasks must have something for thecontext
, even if it is a placeholder.context
is the example information that task owner provides.input
is the example information that the annotators enter.output
is what the model outputs, giveninput
andcontext
. When objects ininput
andoutput
have the same name, a model receives the input without these objects. For example: if an annotator is asked to provide a target label in their input and a model is asked to return a label to compare to this target label, the target label is filtered out of the annotator’s input before it is sent to the model. -
metadata
(optional, default is no metadata conditions): A dictionary containing two lists which are keyed bycreate
andvalidate
. Just likecontext
,input
, andoutput
, these are also lists of objects that define the I/O for a task, except they are optional. The objects defined increate
appear after an annotator enters an example into the create interface, and so are dependent on a model’s response. For the objects increate
, you can specifymodel_wrong_condition
astrue
orfalse
to define whether these objects appear when the annotator’s example is model fooling or not. You can also not specifymodel_wrong_condition
if you want any of the objects to appear regardless of the model's response. The objects defined invalidate
appear in the validate interface after a validator has selectedflagged
,correct
, orincorrect
. For the objects invalidate
, you can specifyvalidated_label_condition
asincorrect
,correct
orflagged
to define the conditions where you show these objects to validators. You can also not specifyvalidated_label_condition
if you want an object to always appear in the validate interface. -
Objects that compose the I/O for a task: In 7 and 8, we explained where to enter objects that compose the I/O for a task. We call them “annotation objects” Here are the options that you can select from:
-
string
: Data for this type is stored as a string.placeholder
is a required argument. -
image
: Data for this type is stored in our database as a url to the image. The task owner should upload images in contexts and datasets as urls. But the object that is sent as an input to models is a base64 string representation of the image. For model testing purposes, you can get this base64 string from this python code:base64.encodebytes(requests.get(url, verify=False).content).decode("utf-8")
. -
context_string_selection
: Data for this type is stored as a string; it must be a substring of another annotation object incontext
.reference_name
designates the name of a string-type object in the context to select from, and it is a required argument. -
multiclass
: Data for this type is stored as a string; it must be from a set of owner-defined labels.labels
is a list of strings that defines the set of available labels, and it is a required argument. You can specify an argument for this type:as_goal_message
. When true, it is displayed as a goal message with an embedded dropdown component (see an example), and when false, is displayed as a basic multiple choice dropdown component. -
prob
: Data for this type is stored as a dictionary, where a key is a label name and a value is a float. Values should sum to 1.reference_name
designates the name of amulticlass
object to get the labels from, and it is a required argument. You can specify an argument for this type:single_prob
. Whensingle_prob
is true, data for this type is stored as a float between 0 and 1. It typically represents the confidence of a model about its answer. This annotation object cannot be used in theinput
.
-
Yes! Submit a PR, and reach out to us if you need help. We've tried to make adding a new annotation component as easy as possible. You need to provide a frontend definition for your custom component here and a syntax-checker for it here (to ensure that the backend rejects improperly formatted data)
Yes! Submit a PR, and reach out to us if you need help. You'll need to add your metric here and make a syntax checker for it's task config YAML specs here
Yes! Submit a PR, and reach out to us if you need help. You'll need to add your metric here along with a syntax checker for it's task config YAML specs.
Yes! But this might take a while. We have not yet modularized aggregation metrics.
If your task has been accepted but does not show up, did you make it public by unchecking the "hidden" field? To give task owners some time to configure their task properly, tasks are initialized as hidden by default.
Absolutely! Dynabench is open source and a community effort: you are welcome to submit PRs. We would be happy to help if you have any questions. Please open an issue!