SemEval-2017 Task 9 (generation subtask): English News/Forum generation from AMR

Organized by jonmay

First phase: Development/Dry Run (starts Aug. 1, 2016, midnight UTC)

End: Competition Ends (Jan. 21, 2017, midnight UTC)

Welcome!


Overview:

Abstract Meaning Representation (AMR) is a compact, readable, whole-sentence semantic annotation. Annotation components include entity identification and typing, PropBank semantic roles, individual entities playing multiple roles, entity grounding via wikification, as well as treatments of modality, negation, etc.

Here is an example AMR for the sentence “The London emergency services said that altogether 11 people had been sent to hospital for treatment due to minor wounds.”

(s / say-01
      :ARG0 (s2 / service
            :mod (e / emergency)
            :location (c / city :wiki "London"
                  :name (n / name :op1 "London")))
      :ARG1 (s3 / send-01
            :ARG1 (p / person :quant 11)
            :ARG2 (h / hospital)
            :mod (a / altogether)
            :purpose (t / treat-03
                  :ARG1 p
                  :ARG2 (w / wound-01
                        :ARG1 p
                        :mod (m / minor)))))

Note the inclusion of PropBank semantic frames (‘say-01’, ‘send-01’, ‘treat-03’, ‘wound-01’), grounding via wikification (‘London’), and multiple roles played by an entity (e.g. ‘11 people’ are the ARG1 of send-01, the ARG1 of treat-03, and the ARG1 of wound-01).
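
For readers who want to work with the notation programmatically, the following sketch loads the graph above and lists its triples. It relies on the third-party penman Python package, which is an assumption of this example rather than anything required by the task; any AMR reader will do.

# A minimal sketch using the third-party "penman" package (pip install penman).
import penman

amr = """
(s / say-01
      :ARG0 (s2 / service
            :mod (e / emergency)
            :location (c / city :wiki "London"
                  :name (n / name :op1 "London")))
      :ARG1 (s3 / send-01
            :ARG1 (p / person :quant 11)
            :ARG2 (h / hospital)
            :mod (a / altogether)
            :purpose (t / treat-03
                  :ARG1 p
                  :ARG2 (w / wound-01
                        :ARG1 p
                        :mod (m / minor)))))
"""

graph = penman.decode(amr)           # parse the PENMAN string into a graph
print(graph.top)                     # 's', the root variable
for source, role, target in graph.triples:
    print(source, role, target)      # e.g. ('t', ':ARG1', 'p') exposes the reentrant 'p'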

In 2016 SemEval held its first AMR parsing challenge and received strong submissions from 11 diverse teams. In 2017 we have extended the challenge to both parsing of biomedical data and generation. This subtask is concerned with the latter:

Subtask 2: AMR-to-English Generation

In this completely new subtask, participants will be provided with AMRs and will have to generate valid English sentences. Scoring will make use of human evaluation. The domain of this subtask will be general news and discussion forum text, much as in the 2016 parsing task.

For the AMR above, a correct answer would, of course, be "The London emergency services said that altogether 11 people had been sent to hospital for treatment due to minor wounds." However, another correct answer would be "London emergency services say that altogether eleven people were sent to the hospital for treating of their minor wounds." Sentences will be scored automatically by single-reference BLEU, and possibly by other automated metrics as well, but they will also be scored by human preference judgments, using the methods (and interface) employed by WMT. Ultimately, the submissions judged best by human evaluators get the SemEval trophy.

Example general-domain data with AMRs can be found here.

Existing AMR-related research: Kevin Knight has been keeping a list here. It is hard to keep the list up to date, though, so please send email to jonmay@isi.edu if your work is missing and you would like it cited.


How to Participate In The Evaluation

Participation is a two-phase process:

  1. Participate in the Development/Dry Run (optional but highly recommended)
  2. Participate in the Evaluation

Participation in each phase is more or less the same:

  1. Train a generation system and run it on the appropriate test set.
  2. Create an answer file for the test set containing the generated sentences. The answer file must follow this format (a small checking sketch is given after this list):
    • The sentences should be in the same order as the AMRs in the test set.
    • There should be a single line of text for each AMR and the total number of lines in the file should be equal to the number of AMRs.
    • Automated scoring is case insensitive, but the references are not tokenized.
    • Unlike in the parsing subtask, there must be no empty lines between generated sentences.
    • Unlike in the parsing subtask, lines prefixed with "#" (or any other symbol) will not be ignored.
  3. Create a submission package. This is a .zip file containing a single file named `answer.txt'.
  4. Navigate to the `Participate' tab and the `Submit/View Results' subtab. Enter any information into the box and click `Submit' to upload your submission package.
  5. Refresh the page periodically until the status of your system is `Finished'. If something goes wrong, you may wish to look at the various output logs, which include your scores, to help debug.
    • During the Development/Dry Run Phase you may resubmit an unlimited number of times.
    • During the Evaluation Phase you may submit only twice, to discourage hill-climbing on the test data. Your last submission will be considered your official submission.
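
The format checks in step 2 and the packaging in step 3 are easy to automate. Below is a minimal sketch, not an official checker: it assumes the test AMRs sit in a file named amrs.txt separated by blank lines (as in prior AMR releases) and that your sentences are in answer.txt; both file names are placeholders.

# A minimal sketch, not an official checker. Assumes the test AMRs are in
# "amrs.txt" (a placeholder name), separated by blank lines as in prior AMR
# releases, and that your generated sentences are in "answer.txt".
import zipfile

with open("amrs.txt", encoding="utf-8") as f:
    blocks = [b for b in f.read().split("\n\n") if b.strip()]
num_amrs = len(blocks)

with open("answer.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()

# One non-empty line per AMR, no blank lines in between.
assert len(lines) == num_amrs, "answer.txt has %d lines for %d AMRs" % (len(lines), num_amrs)
assert all(line.strip() for line in lines), "empty lines are not allowed"

# Step 3: the submission package is a .zip containing the single file answer.txt.
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as z:
    z.write("answer.txt")
print("wrote submission.zip with", num_amrs, "sentences")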

Evaluation Criteria

Please note that all evaluation criteria are subject to change at the whim of the task organizer.

The primary trophy-determining metric for this subtask will be a human judgment obtained from the union of (possibly empty) sets of judges: SemEval participants, other NLP researchers, other individuals known to the task organizer, and crowdsourced workers. The TrueSkill algorithm, as described in the WMT 2016 findings paper, will be used to produce a numerical metric.
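
For intuition about how the ranking is produced, pairwise preference judgments ("output A is better than output B for this AMR") are aggregated into per-system ratings. The toy sketch below uses the third-party trueskill Python package purely as an illustration; the official setup follows the WMT 2016 findings paper and is not reproduced here.

# Illustration only, using the third-party "trueskill" package (pip install trueskill);
# the official ranking follows the WMT 2016 findings paper, not this exact setup.
import trueskill

system_a = trueskill.Rating()   # both systems start from the default prior
system_b = trueskill.Rating()

# A judge prefers system A's sentence over system B's for one AMR:
system_a, system_b = trueskill.rate_1vs1(system_a, system_b)

# Repeated over many judgments, the rating means (mu) induce the final ranking.
print(system_a.mu, system_b.mu)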

Automated metrics, which may include but are not limited to BLEU, will be used in the online submission system. These metrics are not official.
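
For a local approximation of the online automated scores, single-reference, case-insensitive BLEU can be computed as sketched below. The exact scorer used by the submission system is not specified here; the sacrebleu package and the references.txt file name are assumptions of this example.

# A local approximation only; the submission system's exact BLEU scorer is not specified here.
# Uses the third-party "sacrebleu" package (pip install sacrebleu); "references.txt" is a placeholder.
import sacrebleu

hypotheses = open("answer.txt", encoding="utf-8").read().splitlines()
references = open("references.txt", encoding="utf-8").read().splitlines()

# Single reference, lowercased to mimic case-insensitive scoring.
score = sacrebleu.corpus_bleu(hypotheses, [references], lowercase=True)
print(score.score)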

We welcome proposals of human and automated metrics for this task, since it is not at all clear that the methods proposed above are in fact the best way to evaluate systems. That being said, unless otherwise indicated by the task organizer, the trophy-determining metric is the one listed above.

Terms and Conditions

By submitting to the 'Evaluation' phase of this track you agree to the public release of your submissions' scores at the SemEval 2017 workshop and in the associated publicly available proceedings, at the task organizer's discretion. Scores may include, but are not limited to, automatic and manual quantitative judgments, qualitative judgments, and other metrics as the task organizer sees fit. You accept that the ultimate decision of metric choice and score value is that of the task organizer. You further agree that your system will be named according to the team name provided at the time of submission, or to a suitable shorthand, as determined by the task organizer. You agree that the task organizer is under no obligation to release scores and that scores may be withheld if it is the task organizer's judgment that the submission was incomplete, deceptive, or violated the letter or spirit of the competition's rules. Inclusion or exclusion of a submission's scores is not an endorsement or non-endorsement of a team's or individual's submission, system, or science. You further acknowledge that all trophy-making decisions are made at the sole discretion of the task organizer and that the organizer may present zero or more trophies. The definition of what constitutes a trophy is up to the task organizer.

Development/Dry Run

Start: Aug. 1, 2016, midnight

Description: Generation from the News/Forum AMRs in LDC2016E25. See 'Evaluation' under 'Learn the Details' for information on how to submit.

Evaluation

Start: Jan. 9, 2017, midnight

Description: Generation from the SemEval 2017 Task 9 News/Forum AMR Evaluation corpus. This data will be released when the evaluation period begins. See 'Evaluation' under 'Learn the Details' for information on how to submit.

Competition Ends

Jan. 21, 2017, midnight
