SemEval-2018 Task 12 - The Argument Reasoning Comprehension Task Forum


> [Announcement] Test phase and organizational info

Dear participants, we have received some new information and recommendations from the SemEval organizers, which we would like to share and comment on here to make everything as transparent as possible.

"Evaluation period begins Monday. Usually, evaluation periods for individual tasks are 7 to 14 days, but there is no hard and fast rule about this. Keep your participants updated on the exact time frame. Upload the test data at the appropriate time. Some tasks may involve more than one sub-task, each having a separate evaluation time frame, or the sub-tasks may run in parallel."

* Test data are available
* The test phase ends in three weeks, on Monday, January 29, at 23:59 UTC

"Keep the participants updated on how the leaderboard works and what settings you have for it. For example, if you have set things up a certain way, you may want to say that the official leaderboard will eventually show the results for the last submission. For this last submission make sure to include all the files for all the subtasks you want to participate in."

* The leaderboard and achieved scores are not visible during the test phase, so you should upload only one solution. If you upload more than one (we set the maximum to 3, just in case), only the last submission will be considered official.

"After the evaluation period, you are to write a task-description paper and the participants get to write their system-description papers."

* As usual in SemEval

"Participants review each others submissions."

* As usual in SemEval

"Shortly after the end of the evaluation period, ask all participants to provide details of their team and submission. The information can be used for a number of purposes including: writing a summary of participating systems in the task paper, determine the resources used by each submission, determining details about each team, etc."

* Depending on the number of participants, we'll either prepare an online form or send around an e-mail inquiry. The anticipated deadline for this is Friday, February 2nd (four days after the test phase ends).

"Ideally, wait a few days before making the leaderboard public. This is to allow people to fill in the form mentioned above, and also to make sure there are no issues with the submissions."

* Official results will presumably be shown on Monday, February 5th.

"Wait at least a few days after the leaderboard is made public before releasing the gold data. Inevitably, there will be some team contacting you saying there was some issue which caused their result to be really low."

* Let's hope that won't be the case :) We'll release the gold data one week after the results are made public.

Should you have any questions, feel free to comment!

Posted by: ivan.habernal @ Jan. 8, 2018, 2:04 p.m.

Another point: the issue that caused submissions to fail before the evaluation phase was fixed yesterday (affected participants: lanman, Joker, hongking9). I have deleted all failed submissions, so you are now allowed to re-submit your solutions.

Posted by: ivan.habernal @ Jan. 9, 2018, 3:01 p.m.

Hi Ivan,

we are currently undecided about which approach to use for our predictions. Since we are limited to one submission, it will probably be the approach that yields the best results on the dev set.

Would it be possible to submit additional (unranked?) results of other architectures / models / parameters and to receive their respective scores as well? (Preferably before the release of the gold data, so that these additional evaluations cannot be tuned towards the test data.) Depending on how different our architectures and results turn out to be, we might want to describe more than one approach in our system paper. Would that be OK, or should we limit the system paper to the official results?

Best regards
Matthias

Posted by: Liebeck @ Jan. 15, 2018, 12:04 p.m.

Hi Matthias,

The motivation for the "one shot" evaluation is clear: it simulates the real scenario in which the future is unknown, yet one has to deploy the presumably "best" system (and the selection problem itself is far from trivial). The split into train (years 2011-2015), dev (2016), and test (2017) is meant to reflect that.
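
For concreteness, here is a minimal sketch of that "select on dev, submit once" workflow. This is an illustrative outline only, not part of the official task tooling; the model names, loader functions, and file names are placeholders.

```python
# Illustrative sketch only: pick the single official submission by dev accuracy.
# All model names, loaders, and file names below are placeholders.
from typing import Callable, Dict, List, Tuple

# A "model" is any function mapping one instance (claim, reason, two warrants)
# to a label in {0, 1}, i.e. which of the two warrants it considers correct.
Model = Callable[[dict], int]
LabeledData = List[Tuple[dict, int]]

def accuracy(model: Model, data: LabeledData) -> float:
    """Fraction of instances where the model picks the annotated warrant."""
    return sum(model(x) == y for x, y in data) / len(data)

def select_best(candidates: Dict[str, Model], dev: LabeledData) -> str:
    """The 'one shot' choice: keep the candidate with the highest dev accuracy."""
    return max(candidates, key=lambda name: accuracy(candidates[name], dev))

# Hypothetical usage (the loaders and writer are not defined here):
#   dev = load_labeled("dev.tsv")       # 2016 instances, with gold labels
#   test = load_unlabeled("test.tsv")   # 2017 instances, no labels
#   candidates = {"model_a": model_a, "model_b": model_b}
#   best = select_best(candidates, dev)
#   write_predictions(candidates[best], test, "predictions.txt")  # the one official submission
```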

However, I see no problem with describing the other options you've tried, evaluated on the gold data once it is released; just remember that these results are not the official ones, as they are already "tweaks on the test set".

Hope it helps,

Ivan

Posted by: ivan.habernal @ Jan. 15, 2018, 12:47 p.m.

Hi Ivan,

okay, thanks. If we decide to describe additional approaches in our system paper, we will benchmark them after the release of the gold data.

Best regards
Matthias

Posted by: Liebeck @ Jan. 15, 2018, 12:52 p.m.