Welcome to the DSTC 8 competition! In this task, you will develop a system that can predict or generate the user's response in a dialogue from any domain.
Test Set | Metric | Baseline 1 | Baseline 2 | Team A | Team B | Team C | Team D |
---|---|---|---|---|---|---|---|
MetaLWOz (heldout) - pure task | BLEU-1 | 0.0957 | 0.0713 | 0.2471 | 0.1273 | 0.1395 | 0.0925 |
MetaLWOz (heldout) - pure task | BLEU-4 | 0.0235 | 0.0154 | 0.1109 | 0.0345 | 0.0366 | 0.0167 |
MetaLWOz (heldout) - cross task | BLEU-1 | 0.0594 | 0.0510 | 0.1729 | 0.1039 | 0.1228 | 0.0932 |
MetaLWOz (heldout) - cross task | BLEU-4 | 0.0041 | 0.0031 | 0.0350 | 0.0167 | 0.0226 | 0.0177 |
MultiWOz (single domain per dialogue) | BLEU-1 | 0.1781 | 0.1436 | 0.3928 | 0.1148 | 0.1002 | 0.2478 |
MultiWOz (single domain per dialogue) | BLEU-4 | 0.0257 | 0.0140 | 0.1572 | 0.0217 | 0.0191 | 0.0683 |
MultiWOz (single domain per dialogue) | Intent F1 | 0.5153 | 0.4661 | 0.7869 | 0.6449 | 0.6140 | 0.5498 |
MultiWOz (single domain per dialogue) | Intent + Slots F1 | 0.2658 | 0.1960 | 0.5993 | 0.4833 | 0.4187 | 0.4234 |
For the evaluation dataset, please check the data sources page.
Evaluation for this task uses both automatic and human metrics.
During development, participants can track their progress using word-overlap metrics, e.g. with nlg-eval. Depending on the parameters passed to scripts/make_test_set, you can measure within-task or cross-task generalization within a MetaLWOz domain.
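For example, with predictions and references written to aligned plain-text files (one response per line; the file names below are placeholders), scores can be computed from nlg-eval's Python API:

```python
# Sketch: score generated responses with nlg-eval
# (https://github.com/Maluuba/nlg-eval).
# 'predictions.txt' and 'references.txt' are placeholder file names;
# each contains one response per line, aligned by line number.
from nlgeval import compute_metrics

metrics = compute_metrics(hypothesis='predictions.txt',
                          references=['references.txt'])
print(metrics['Bleu_1'], metrics['Bleu_4'])
```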
Towards the end of the evaluation phase, we will provide a zip file with dialogues in a novel domain, along with a file specifying the dialogues and turns that participants should predict. The file format is the same as the one produced by scripts/make_test_set: each line is a valid JSON object with the following schema: { "support_dlgs": ["SUPPORT_DLG_ID_1", "SUPPORT_DLG_ID_2", ...], "target_dlg": "TARGET_DLG_ID", "predict_turn": "ZERO-BASED-TURN-INDEX" }
Dialogue IDs uniquely identify a dialogue in the provided MetaLWOz zip file.
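As a rough sketch, the test spec and dialogue archive could be loaded as follows; the assumptions that dialogues sit in *.jsonl members of the zip and carry an "id" field should be checked against the actual archive layout:

```python
import json
import zipfile

def load_test_spec(path):
    """Yield one test instance per line of the JSONL test spec."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)

def index_dialogues(zip_path):
    """Index dialogues by id from the MetaLWOz zip.

    Assumes dialogues are stored one JSON object per line in the
    archive's *.jsonl files and carry an "id" field; verify this
    against the actual archive layout.
    """
    index = {}
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith(".jsonl"):
                with zf.open(name) as f:
                    for line in f:
                        dlg = json.loads(line)
                        index[dlg["id"]] = dlg
    return index
```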
To generate predictions, condition your (pre-trained) model on the support dialogues, and use the target dialogue history as context to predict the indicated user turn.
Make sure that (1) your model has never seen the test domain before prediction, and (2) your model is reset before adapting it to the support set and predicting each target dialogue.
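A minimal sketch of this protocol, reusing the loading helpers above; `adapt` and `generate` are hypothetical methods standing in for your model's adaptation and decoding steps, and the dialogue "turns" field is assumed to list utterances in order:

```python
import copy

def predict_all(base_model, test_spec, dialogues):
    """Sketch of the evaluation protocol. `adapt` and `generate` are
    hypothetical stand-ins for your model's adaptation and decoding
    steps; `dialogues` maps dialogue ids to dialogue objects."""
    predictions = []
    for inst in test_spec:
        # (2) Reset: start from the pre-trained weights for every dialogue.
        model = copy.deepcopy(base_model)
        # Adapt on the support dialogues from the new domain.
        model.adapt([dialogues[i] for i in inst["support_dlgs"]])
        # Use the target dialogue history up to the indicated turn as context.
        target = dialogues[inst["target_dlg"]]
        turn = inst["predict_turn"]
        response = model.generate(target["turns"][:turn])
        predictions.append({
            "dlg_id": inst["target_dlg"],
            "predict_turn": turn,
            "response": response,
        })
    return predictions
```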
We will score the responses submitted by participants using the automatic and human metrics described above.
Submissions should have one response per line, in JSON format, with this schema: { "dlg_id": "DIALOGUE ID FROM ZIP FILE", "predict_turn": "ZERO-BASED PREDICT TURN INDEX", "response": "PREDICTED RESPONSE" }
where dlg_id and predict_turn correspond to the target_dlg and predict_turn fields of the test specification file described above, respectively.
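For instance, predictions in the form produced by the sketch above can be serialized into a valid submission file like this (the function name and output path are arbitrary):

```python
import json

def write_submission(predictions, path):
    """Write one JSON object per line, matching the submission schema."""
    with open(path, "w") as f:
        for p in predictions:
            f.write(json.dumps(p) + "\n")
```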
Additionally, we ask that submissions be clearly marked according to which test spec from the evaluation dataset they correspond to, e.g. with a subdirectory for each test spec, or with corresponding prefixes in the filenames inside the final zip archive.
A sample submission is available, based on generating responses with the retrieval baseline published on GitHub:
./scripts/retrieval-baseline predict your-model eval_data/dstc8-metalwoz-heldout.zip \
--test-spec eval_data/test-spec-metalwoz-held-out-pure-task.jsonl \
--nlg-eval-out-dir submission/predictions-metalwoz-heldout-pure
./scripts/retrieval-baseline predict your-model eval_data/dstc8-metalwoz-heldout.zip \
--test-spec eval_data/test-spec-metalwoz-held-out-cross-task.jsonl \
--nlg-eval-out-dir submission/predictions-metalwoz-heldout-cross
./scripts/retrieval-baseline predict your-model eval_data/dstc8-multiwoz2.0.zip \
--test-spec eval_data/test-spec-multiwoz2.0.jsonl \
--nlg-eval-out-dir submission/predictions-multiwoz2.0
When submitting your results, please fill in the fields of the submission form below. Please provide an email address at which the organizers can contact you, if it differs from the one registered with Codalab.
By default, we will take your last submission to the platform as your final submission. We will evaluate additional submissions if time and budget allow.
By registering for this competition, the participant agrees to the following terms:

- Participants will not publish their results, code, or models prior to the DSTC 8 workshop.
- The organizers reserve the right to
- Final submissions will be accepted up to October 6, 2019 at 11:59pm Eastern Standard Time.
- Results must be submitted from one Codalab account per team.
- Each team may only have a single Codalab account.
- Submissions must be annotated with an affiliation to be considered for evaluation.
- All results must be submitted through the Codalab platform.
- Participants agree not to share code privately or outside of the Codalab platform until the DSTC 8 workshop.
- Participants agree to use the validation data provided in the evaluation phase for model validation and evaluation only.
- Participants who use external data for model training or evaluation agree to describe the external data in their submission.
- Data for the competition is provided via external links and is subject to the licenses included therein.
- Participants will not abuse the Codalab infrastructure to gain a competitive advantage in the competition.
- Participants will conduct themselves in a respectful manner on the Codalab website; failure to do so may result in disqualification.
Baseline code and other task details can be found here.
In goal-oriented dialogue, data is scarce. This is a problem for dialogue system designers, who cannot rely on large pre-trained models. The aim of our challenge is to develop natural language generation (NLG) models which can be quickly adapted to a new domain given a few goal-oriented dialogues from that domain.
The suggested approach roughly follows the idea of meta-learning (e.g. MAML: Finn, Abbeel & Levine 2017; Antoniou et al. 2018; Ravi & Larochelle 2017): during the training phase, train a model on dialogues from many source domains so that it can be adapted quickly to a new domain.
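Purely to illustrate the inner/outer-loop structure (this is not the task baseline, and it uses a toy scalar regression problem in place of dialogue generation), a first-order MAML training loop looks roughly like this:

```python
import random

# Toy first-order MAML sketch. Each "domain" is a line y = a*x, the model
# is y_hat = w*x, and adaptation is a few gradient steps on a support set.
# This only illustrates the meta-learning loop; it is not the task baseline.

def make_domain():
    a = random.uniform(-2.0, 2.0)
    return lambda x: a * x

def loss_and_grad(w, domain, xs):
    # Mean squared error of y_hat = w*x, and its gradient w.r.t. w.
    errs = [w * x - domain(x) for x in xs]
    loss = sum(e * e for e in errs) / len(xs)
    grad = sum(2.0 * e * x for e, x in zip(errs, xs)) / len(xs)
    return loss, grad

w = 0.0                          # meta-parameters shared across domains
inner_lr, outer_lr = 0.1, 0.01
for _ in range(1000):
    domain = make_domain()       # sample a training domain
    support = [random.uniform(-1, 1) for _ in range(5)]
    query = [random.uniform(-1, 1) for _ in range(5)]
    w_fast = w                   # reset to meta-parameters, then adapt
    for _ in range(3):
        _, g = loss_and_grad(w_fast, domain, support)
        w_fast -= inner_lr * g
    # First-order meta-update: query-set gradient at the adapted parameters.
    _, g = loss_and_grad(w_fast, domain, query)
    w -= outer_lr * g
```

In the actual challenge, w would be the weights of an NLG model and the inner loop would be fine-tuning on the support dialogues of a new domain.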
During the evaluation phase, the model should predict the final user turn of an incomplete dialogue, given some (hundreds of) example dialogues from the same domain.
You can contact all the contest organizers at dstc8-task2@microsoft.com. The organizers are affiliated with MSR Montréal.
Phase 1 (start: June 17, 2019, midnight): Test the format of your submissions and troubleshoot errors here. Note that the leaderboard in this phase is not scored.

Phase 2 (start: June 17, 2019, midnight): Final model predictions submitted to the competition.

Competition end: Oct. 14, 2019, 6:59 a.m.