SemEval 2019 Task 9 - SubTask B - Suggestion Mining from Online Reviews and Forums

Organized by TDaudert


Scoring the Trial Data
Aug. 12, 2018, midnight UTC


Scoring the Test Data
Jan. 10, 2019, midnight UTC


Competition Ends
Jan. 31, 2019, midnight UTC

Welcome to Subtask B of SemEval 2019 Task 9 on suggestion mining. A detailed introduction is provided on the CodaLab page for Subtask A.


Task Organisers
     Genesys Telecommunications Laboratory Inc, Galway, Ireland
     Insight Centre for Data Analytics, National University of Ireland Galway, Galway, Ireland
     Insight Centre for Data Analytics, National University of Ireland Galway, Galway, Ireland
     Computer Science Department, University of Dayton, Dayton, Ohio, USA 
For any public discussion related to the task, please post to the Google group.


The suggestion mining task comprises two subtasks. Participating teams should take part in at least one of them.
All scripts and data can be downloaded here:

Sub-task A 

Under this subtask, participants will perform domain-specific suggestion mining, where the test dataset will belong to the same domain as the training and development datasets, i.e. a suggestion forum for Windows platform developers. A separate CodaLab page is set up for Subtask A.

Sub-task B

Under this subtask, participants will perform cross-domain suggestion mining, where the training/development and test datasets belong to separate domains. The training and development datasets remain the same as in Subtask A, while the test dataset belongs to the domain of hotel reviews.
This means that a model trained on the suggestion forum dataset will be evaluated on the hotel review dataset.


Evaluation Metrics 

Classification performance of the submissions will be evaluated on the basis of the F1 score for the positive class, i.e. the suggestion class. The F1 score ranges from 0 to 1. The class distribution in the provided test datasets will be balanced prior to the release of the test set.

                            Actual: Suggestion    Actual: Non-Suggestion
Predicted: Suggestion       True Positive         False Positive
Predicted: Non-Suggestion   False Negative        True Negative


Given that Psugg, Rsugg, and F1sugg are the precision, recall and F1 score for the suggestion class:

Psugg = True Positives / (True Positives + False Positives)

Rsugg = True Positives / (True Positives + False Negatives)

F1sugg  = 2 * (Psugg * Rsugg) / (Psugg + Rsugg)
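The metric above can be sketched in a few lines of Python. This is a minimal illustration of the formulas, not the official evaluation script; the label convention (1 = suggestion, 0 = non-suggestion) is an assumption here.

```python
# Precision, recall, and F1 for the positive (suggestion) class,
# computed directly from gold and predicted label sequences.

def f1_suggestion(gold, pred):
    """Return (precision, recall, F1) for the suggestion class (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0]
p, r, f1 = f1_suggestion(gold, pred)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.667 0.667
```

Note that true negatives do not appear in any of the three formulas, so a system is not rewarded simply for rejecting non-suggestions.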


Rule based systems

Submissions are not limited to statistical classifiers. In the case of rule-based systems, participants can choose to participate in both subtasks with the same system, or submit different rule-based systems for the two subtasks.
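To make the notion of a rule-based system concrete, here is a toy sketch of one: a classifier that flags a sentence as a suggestion when it matches a few hand-written lexical patterns. The pattern list is a made-up example for illustration, not derived from the task data or from any participating system.

```python
import re

# Hypothetical cue patterns for suggestion-like language.
SUGGESTION_PATTERNS = [
    r"\b(please|kindly)\b",
    r"\b(should|could|would)\b",
    r"\b(suggest|recommend|propose)\b",
    r"^(add|allow|make|let|try)\b",   # imperative sentence openings
]

def is_suggestion(sentence: str) -> bool:
    """Return True if any hand-written cue pattern matches the sentence."""
    s = sentence.lower().strip()
    return any(re.search(p, s) for p in SUGGESTION_PATTERNS)

print(is_suggestion("Please add dark mode to the settings page."))  # True
print(is_suggestion("The app crashed twice yesterday."))            # False
```

A real rule-based entry would need far richer patterns (and perhaps syntactic cues), but the same system could be run unchanged on both the forum and hotel-review test sets.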


Additional resources

Both rule-based and statistical systems are allowed to use additional language resources, with one exception: participants are prohibited from using additional hand-labeled training datasets for any of the domains, i.e. data where sentences are manually labeled as suggestions or non-suggestions.
Any other resources that are readily available on the web or are generated using automated systems are permitted, e.g. scraping text from a website where it can be automatically identified as suggestions, or automatically tagging additional data using a system trained on the provided training data.
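The "automatically tagging additional data" option above is essentially self-labeling: score unlabeled sentences with an existing model and keep only confident predictions as silver training data. The sketch below illustrates the idea; the scoring function is a stand-in, where a real system would use a model trained on the provided gold data.

```python
def suggestion_score(sentence: str) -> float:
    """Placeholder confidence score; a trained model would go here."""
    cues = ("please", "should", "recommend", "suggest")
    hits = sum(cue in sentence.lower() for cue in cues)
    return min(1.0, hits / 3)

def silver_label(unlabeled, threshold=0.5):
    """Return (sentence, label) pairs for confident predictions only."""
    silver = []
    for s in unlabeled:
        score = suggestion_score(s)
        if score >= threshold:
            silver.append((s, 1))   # confident suggestion
        elif score == 0.0:
            silver.append((s, 0))   # confident non-suggestion
        # scores in between are discarded as uncertain
    return silver

pool = [
    "You should really add an offline mode, please.",
    "I stayed three nights in March.",
    "Maybe you should try.",
]
print(silver_label(pool))
```

Because no human labels the resulting pairs, a dataset produced this way stays within the rules, unlike additional hand-labeled data.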


The datasets and evaluation scripts are available on our GitHub page. The datasets will be made available incrementally as per the SemEval-2019 timelines. Please refer to the Terms and Conditions for the timelines.

Submitted systems

  • Teams are allowed to use the development set for training
  • Teams should not use manually labeled training datasets (suggestion-labeled sentences) other than the one provided as part of this competition.
  • Teams are allowed to use silver standard datasets, for example, sentences scraped from the web which are likely to be suggestions or non-suggestions but are not further manually labeled.
  • Only one final submission will be recorded per team. The CodaLab website will only show an updated submission if its results are higher.


  • All data released for this task is done so under a Creative Commons license (the licenses can also be found with the data).

  • Organizers of the competition may choose to publicize, analyze, and modify in any way any content submitted as a part of this task. Wherever appropriate, an academic citation for the submitting group will be added (e.g. in a paper summarizing the task).


Teams wishing to participate in SemEval 2019 should strictly adhere to the following deadlines.

Task Schedule for SemEval2019

  • 20 Aug 2018: Trial data and evaluation script available
  • 18 Sep 2018: Training and development data ready. Benchmark system results available.
  • 10 Jan 2019: Evaluation starts
  • 31 Jan 2019: Evaluation period ends
  • 05 Feb, 2019: Results posted
  • 28 Feb 2019: System and Task description paper submissions due by 23:59 GMT -12:00
  • 14 Mar 2019: Paper reviews due (for both systems and tasks)
  • 06 Apr 2019: Author notifications
  • 20 Apr 2019: Camera ready submissions due
  • Summer 2019 (TBD): SemEval 2019


Participants should comply with the general rules of SemEval.

The organizers are free to penalize or disqualify participants for any violation of the above rules, or for misuse, unethical behaviour, or other behaviour they agree is not acceptable in a scientific competition in general and in this one in particular.


Please contact the task organisers or post on the task mailing list if you have any further queries.


Annotation Overview

The Oxford dictionary defines a suggestion as "an idea or plan put forward for consideration". Some of the listed synonyms of suggestion are proposal, proposition, recommendation, advice, hint, tip, and clue. In our annotation study, we observe that human perception of the term suggestion is subjective, and this affects the preparation of hand-labeled datasets for suggestion mining.

The datasets provided under this task are backed by a study of suggestions appearing in different domains and formalisation of the definition of suggestions in the context of suggestion mining [1]. The datasets have been annotated in two phases, where phase-1 employs crowdsourced annotators, and phase-2 employs in-house expert annotators.

The final datasets comprise only those sentences tagged as suggestions which explicitly express a suggestion (explicit suggestions), and not those which merely provide information from which a suggestion could be inferred (implicit suggestions). For example, 'I loved the cupcakes from the bakery next door' is an implicit form of a suggestion, which can be explicitly expressed as 'Do try the cupcakes from the bakery next door'.

In this year's SemEval, we are evaluating suggestion mining systems for two domains:


Suggestion Forums - Subtask A

Suggestion forums are dedicated forums used to provide suggestions for improving an entity. The data is collected from feedback posts on the Universal Windows Platform developer suggestion forum.
People often provide context in suggestion posts, which becomes repetitive when there are a large number of posts under the same topic. Suggestion mining can act as automatic summarisation in this use case, by identifying the sentences where a concrete suggestion is expressed. We observe that the datasets derived from this domain contain a relatively larger number of positive-class instances compared to the other domains. The sentences are automatically split using the Stanford parser.
Under Subtask A, training and validation sets will be provided for this domain, and the submissions will be evaluated on a test dataset from the same domain.
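The sentence-splitting step mentioned above was done with the Stanford parser; the naive regex splitter below is only a rough stand-in to illustrate what that preprocessing produces, not the tool the organisers actually used.

```python
import re

def split_sentences(text: str):
    # Split after ., !, or ? when followed by whitespace and a capital letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

post = "Please add tabs to the settings app. It is hard to navigate. Thanks!"
print(split_sentences(post))
# ['Please add tabs to the settings app.', 'It is hard to navigate.', 'Thanks!']
```

Each resulting sentence is then labeled independently as a suggestion or non-suggestion, which is why the dataset counts below are given per sentence.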
Number of sentences in each dataset

                  Trial data: Train set   Trial data: Test set   Train   Development   Test
Suggestions               585                     296             TBD       TBD        TBD
Non-suggestions          1915                     296             TBD       TBD        TBD




Hotel Reviews - Subtask B

Wachsmuth et al. (2014) [2] provide a large sentiment analysis dataset of hotel reviews from the TripAdvisor website. We take a subset of these reviews; the sentences were already split in the dataset. The hotel review dataset will be used as the test dataset for Subtask B. As mentioned in the Terms and Conditions, participants are free to use additional non-labeled datasets. The raw hotel review dataset provided by Wachsmuth et al. (2014) could be one such dataset, and is openly available.

Number of sentences in each dataset

                  Trial data: Test set   Test set
Suggestions               404               TBD
Non-suggestions           404               TBD







[1] Sapna Negi, Maarten de Rijke, and Paul Buitelaar. Open Domain Suggestion Mining: Problem Definition and Datasets. arXiv preprint arXiv:1806.02179 (2018).

[2] Henning Wachsmuth, Martin Trenkmann, Benno Stein, Gregor Engels, and Tsvetomira Palakarska. "A review corpus for argumentation analysis." In Proceedings of the 15th International Conference on Computational Linguistics and Intelligent Text Processing, volume 8404 of LNCS, pages 115–127, Kathmandu, Nepal, 2014. Springer.

#   Username       Score
1   The-Baseline   0.7740
2   CVxTz          0.7327
3   Mulx10         0.6667