SemEval 2019 Task 9 - SubTask A - Suggestion Mining from Online Reviews and Forums

Organized by TDaudert

First phase

Scoring the Trial Data
Aug. 18, 2018, midnight UTC


Competition Ends
Jan. 31, 2019, midnight UTC


Welcome to the pilot challenge on suggestion mining!

Suggestion mining can be defined as the extraction of suggestions from unstructured text, where the term 'suggestions' refers to expressions of tips, advice, recommendations, etc. Consumer opinions towards commercial entities like brands, services, and products are generally expressed through online reviews, blogs, discussion forums, or social media platforms. These opinions largely express positive and negative sentiments towards a given entity, but also tend to contain suggestions for improving the entity or tips for fellow consumers. Traditional opinion mining systems mainly focus on automatically calculating the sentiment distribution towards an entity of interest by means of Sentiment Analysis methods. A suggestion mining component can extend the capabilities of traditional opinion mining systems, which can then cater to additional applications. Such systems can empower both public and private sectors by extracting the suggestions that are spontaneously expressed on various online platforms, enabling organisations to collect suggestions from much larger and more varied sources of opinions than the traditional suggestion box or online feedback forms.

Suggestion mining remains a relatively young area compared to Sentiment Analysis, especially in the context of recent advancements in neural network based approaches for learning feature representations. Suggestion mining research could drive the engagement of both commercial entities and the research communities working on problems like opinion mining, supervised learning, representation learning, etc. From a linguistic viewpoint, topics of interest to be explored within this task include extra-propositional aspects like mood and modality, as well as determining the importance of different kinds of syntactic and semantic features. It is observed that in some cases the grammatical properties of a sentence alone can decide its label, while at other times semantics plays a significant role. In this pilot SemEval task, we introduce suggestion mining as a simple task of classifying given sentences into suggestion and non-suggestion classes. With this task, we will evaluate the submitted systems on two domains: a software developers' suggestion forum and hotel reviews. We will also evaluate the cross-domain performance of statistical models, since suggestions tend to possess similar linguistic properties across domains. This can also serve as an evaluation for transfer learning methods.

Suggestion mining from online reviews and forums

Depending on the domain (source and topic of data), a number of applications can be associated with the automatic extraction of suggestions. For online hotel reviews, one such application could be room tips extraction. Platforms like Tripadvisor often ask the reviewers of a hotel to write additional 'room tips' for future travellers, yet such tips also tend to appear within the reviews themselves, e.g. 'Be sure to specify a room at the back of the hotel'. Another obvious application would be to collect suggestions for improvement in the hotel, e.g. 'An electric kettle would have been a good addition to the room.' Given that a large number of reviews and opinions about a given hotel can be collected from multiple sources, a suggestion mining system can provide a broad range of suggestions.

Room tips on Tripadvisor

Considering the domain of discussion forums, all the posts within a single thread are centered around answering a question or replying to a topic defined in the first post of the thread. Often these questions or topics are advice-seeking, for example, the "Advice for our week in Vienna" thread on a travel discussion forum. The answers on discussion forums are conversational and may contain more contextual information than the first post sought. A suggestion mining system can extract the exact sentences where the advice is expressed, which makes it act as a suggestion summarisation system for discussion forums. Another type of discussion forum is the dedicated suggestion forum pertaining to a commercial entity, for example, a suggestion forum for sharing platform capability requests and general ideas for improving the Windows developer platform. Such forums operate by developers or customers posting messages explaining the improvements they want to see in the product, which is different from online reviews, where the main objective is to provide positive or negative ratings and the rest of the information is additional. In the case of developer suggestion forums, the contextual text describing the functionality of the product gets repetitive over a large number of posts, and a suggestion mining system would help to extract the sentences containing concrete suggestions. A suggestion post is shown in the image below, where only the first and last sentences express suggestions while the rest can be considered context.


 Some examples of suggestions found among the text from different opinion platforms are listed below. 

Example Suggestion

 Electronics Reviews

  I would recommend doing the upgrade to be sure you have the best chance at trouble free operation.

 Electronics Reviews

  My one recommendation to creative is to get some marketing people to work on the names of these things.

 Hotel Reviews

  An electric kettle would have been a good addition to the room.

 Hotel Reviews

  Be sure to specify a room at the back of the hotel.

 Travel Discussion Forum

  If you do book your own airfare, be sure you don’t have problems if Insight has to cancel the tour or reschedule it




Some of the observed challenges in suggestion mining are:

  • Class imbalance: Suggestions appear sparsely in domains like hotel reviews (6% - 13%), which leads to higher data annotation costs and results in a skewed class distribution for model training.
  • Figurative expressions: Text from social media and other sources often contains figurative use of language. For example, 'Try asking for extra juice at breakfast - its 22 euros!!!!!' is sarcasm rather than a suggestion. Therefore, a sentence in the form of a suggestion may not always be a suggestion, and vice versa.
  • Context dependency: At times, context plays a major role in determining whether a sentence is a suggestion or not. For example, 'There is a parking garage on the corner of the Forbes showroom.' can be perceived as a suggestion (for parking space) when it appears in a restaurant review and the human annotator gets to read the full review, while the same sentence would not be labeled as a suggestion if it appeared in a description of the locality of a Forbes showroom.
  • Long and complex sentences: Often, a suggestion is expressed in only one part of a long sentence, or appears as a very long sentence, such as 'I think that there should be a nice feature where you can be able to slide the status bar down and view all the push notifications that you got but you didn't view, just like android and IOS, but the best part is that it fixes many problems like when people wanted a short cut to turn WiFi on and off and data on and off so that would be a nice feature to have 2'. This poses challenges to the training algorithms for learning effective features, as well as for certain pre-processing steps like part of speech tagging.

Relevant Publications

  1. Sapna Negi, Maarten de Rijke, and Paul Buitelaar. Open Domain Suggestion Mining: Problem Definition and Datasets. arXiv preprint arXiv:1806.02179 (2018).
  2. Sapna Negi and Paul Buitelaar. Inducing Distant Supervision in Suggestion Mining through Part-of-Speech Embeddings. arXiv preprint arXiv:1709.07403 (2017).
  3. Sapna Negi, Kartik Asooja, Shubham Mehrotra, and Paul Buitelaar. A Study of Suggestions in Opinionated Texts and their Automatic Detection. *SEM 2016, co-located with ACL 2016, Berlin, Germany.
  4. Sapna Negi and Paul Buitelaar. Suggestion Mining from Opinionated Text. In: Pozzi, F. A.; Fersini, E.; Messina, E.; Liu, B. (Eds.), The Handbook of Sentiment Analysis in Social Networks, Elsevier.
  5. Sapna Negi and Paul Buitelaar. Towards the Extraction of Customer-to-Customer Suggestions from Reviews. EMNLP 2015, Lisbon, Portugal.
  6. Caroline Brun and Caroline Hagege. Suggestion Mining: Detecting Suggestions for Improvement in Users’ Comments. Research in Computing Science, 2013.

Task Organisers
     Genesys Telecommunications Laboratory Inc, Galway, Ireland
     Insight Centre for Data Analytics, National University of Ireland Galway, Galway, Ireland
     Insight Centre for Data Analytics, National University of Ireland Galway, Galway, Ireland
     Computer Science Department, University of Dayton, Dayton, Ohio, USA 
For any public discussions related to the task, please post to the Google group.


The suggestion mining task comprises two subtasks. Teams should participate in at least one of the subtasks. Relevant scripts and datasets are available at:

Sub-task A 

Under this subtask, participants will perform domain-specific suggestion mining, where the test dataset will belong to the same domain as the training and development datasets, i.e. a suggestion forum for Windows platform developers.

Sub-task B

Under this subtask, participants will perform cross-domain suggestion mining, where the train/development and test datasets will belong to separate domains. The train and development datasets will remain the same as in subtask A, while the test dataset will belong to the domain of hotel reviews.
This means that a model trained on the suggestion forum dataset will be evaluated on the hotel review dataset.
A separate CodaLab page is set up for sub-task B.


Evaluation Metrics 

Classification performance of the submissions will be evaluated on the basis of the F1 score for the positive class, i.e. the suggestion class. The F1 score ranges from 0 to 1. The class distribution in the provided test datasets will be balanced out prior to the release of the test set.

                         Actual Label
Predicted Label    Suggestion        Non-Suggestion
Suggestion         True Positive     False Positive
Non-Suggestion     False Negative    True Negative


Given that Psugg, Rsugg, and F1sugg are the precision, recall and F1 score for the suggestion class:

Psugg = True Positives / (True Positives + False Positives)

Rsugg = True Positives / (True Positives + False Negatives)

F1sugg  = 2 * (Psugg * Rsugg) / (Psugg + Rsugg)
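
The definitions above translate directly into a few lines of code. The following is a minimal sketch, not the official evaluation script: the function name `suggestion_f1`, the label strings, and the convention of returning 0.0 when a denominator is zero are all assumptions made for illustration.

```python
def suggestion_f1(gold, predicted, positive="suggestion"):
    """Return (precision, recall, F1) for the positive (suggestion) class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels: one true positive, one false negative, one false positive.
gold = ["suggestion", "suggestion", "non-suggestion", "non-suggestion"]
predicted = ["suggestion", "non-suggestion", "suggestion", "non-suggestion"]
precision, recall, f1 = suggestion_f1(gold, predicted)
print(precision, recall, f1)
```

True negatives do not enter any of the three quantities, which is why the score reflects performance on the suggestion class only.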


Rule based systems

Submissions are not limited to statistical classifiers. In the case of rule based systems, participants can choose to participate in the two subtasks with the same system. Participants can also submit different rule based systems for the two subtasks.
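
To make the idea of a rule based system concrete, here is a minimal sketch of a keyword and pattern matcher. It is not the organisers' benchmark system: the cue patterns in `CUE_PATTERNS` and the function name `classify` are hypothetical and were chosen only to cover a few of the example sentences quoted on this page.

```python
import re

# Hypothetical cue patterns for illustration only; a real rule based
# system would need a much larger and carefully validated rule set.
CUE_PATTERNS = [
    r"\b(should|recommend|suggest)\w*\b",                # modal and advice verbs
    r"\bbe sure to\b",                                   # imperative cue
    r"\bwould (be|have been) (a )?(nice|good|great)\b",  # wish-like phrasing
    r"^(please|try|avoid|make sure)\b",                  # sentence-initial imperatives
]
COMPILED = [re.compile(pattern) for pattern in CUE_PATTERNS]


def classify(sentence):
    """Label a sentence 'suggestion' if any cue pattern matches it."""
    text = sentence.strip().lower()
    if any(pattern.search(text) for pattern in COMPILED):
        return "suggestion"
    return "non-suggestion"


for sentence in [
    "Be sure to specify a room at the back of the hotel.",
    "An electric kettle would have been a good addition to the room.",
    "The room was clean and the staff were friendly.",
]:
    print(classify(sentence), "|", sentence)
```

Because the rules look only at surface form, the same system can be applied unchanged to both subtasks. The flip side is that it also labels the sarcastic 'Try asking for extra juice at breakfast - its 22 euros!!!!!' as a suggestion, which illustrates the figurative-language challenge described earlier.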


Additional resources

Both rule based systems and statistical systems are allowed to use additional language resources, with one exception: participants are prohibited from using additional hand-labeled training datasets for any of the domains, i.e. data where sentences are manually labeled as suggestions and non-suggestions.
Any other resources which are readily available on the web or are generated using automated systems are allowed, e.g. scraping text from a website which can be automatically identified as suggestions, or automatically tagging additional data using a system trained on the provided training data.
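
One way to read the last example (automatically tagging additional data) is as a confidence-filtered pseudo-labelling step. The sketch below is an illustration under stated assumptions, not a procedure prescribed by the organisers: `build_silver_set`, `toy_score`, and the 0.9 / 0.1 thresholds are hypothetical, and `toy_score` merely stands in for a classifier trained on the provided training data.

```python
def build_silver_set(unlabeled, score_fn, high=0.9, low=0.1):
    """Keep only sentences the scorer is confident about, with their inferred label.

    score_fn maps a sentence to an estimated probability of being a suggestion.
    Sentences scored between `low` and `high` are discarded as too uncertain.
    """
    silver = []
    for sentence in unlabeled:
        probability = score_fn(sentence)
        if probability >= high:
            silver.append((sentence, "suggestion"))
        elif probability <= low:
            silver.append((sentence, "non-suggestion"))
    return silver


def toy_score(sentence):
    # Hypothetical stand-in for a model trained on the provided training data.
    text = sentence.lower()
    if "be sure to" in text or "recommend" in text:
        return 0.95
    if "should" in text:
        return 0.6
    return 0.05


unlabeled = [
    "Be sure to specify a room at the back of the hotel.",
    "The lobby should probably be bigger.",
    "We stayed for three nights.",
]
silver = build_silver_set(unlabeled, toy_score)
for sentence, label in silver:
    print(label, "|", sentence)
```

No human labels are involved at any point, so the resulting data would count as automatically generated rather than hand labeled.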


The datasets and evaluation scripts are available at our GitHub page. The datasets will be made available incrementally as per the SemEval-2019 timelines. Please refer to the Terms and Conditions for the timelines.

Submitted systems

  • Teams are allowed to use the development set for training.
  • Teams should not use manually labeled training datasets (suggestion labeled sentences) other than the one provided as a part of this competition.
  • Teams are allowed to use silver standard datasets, for example, sentences scraped from the web which are likely to be suggestions or non-suggestions but are not further manually labeled.
  • Only one final submission will be recorded per team. The CodaLab website will only show an updated submission if its results are higher.


  • All data released for this task is done so under the Creative Commons License (licenses could also be found with the data).

  • Organizers of the competition might choose to publicize, analyze and change in any way any content submitted as a part of this task. Wherever appropriate, an academic citation for the submitting group will be added (e.g. in a paper summarizing the task).


The teams wishing to participate in SemEval 2019 should strictly adhere to the following deadlines.

Task Schedule for SemEval2019

  • 20 Aug 2018: Trial data and evaluation script available
  • 17 Sep 2018: Training, development data ready. Benchmark system results available.
  • 10 Jan 2019: Evaluation starts
  • 31 Jan 2019: Evaluation period ends
  • 05 Feb 2019: Results posted
  • 28 Feb 2019: System and Task description paper submissions due by 23:59 GMT -12:00
  • 14 Mar 2019: Paper reviews due (for both systems and tasks)
  • 06 Apr 2019: Author notifications
  • 20 Apr 2019: Camera ready submissions due
  • Summer 2019 (TBD): SemEval 2019


Competitors should comply with the general rules of SemEval.

 The organizers are free to penalize or disqualify participants for any violation of the above rules or for misuse, unethical behaviour or other behaviours they agree are not accepted in a scientific competition in general and in the specific one at hand.


Please contact the task organisers or post on the task mailing list if you have any further queries.


Annotation Overview

The Oxford dictionary defines a suggestion as 'an idea or plan put forward for consideration'. Some of the listed synonyms of suggestion are proposal, proposition, recommendation, advice, hint, tip, and clue. In our annotation study, we observe that human perception of the term suggestion is subjective, and this affects the preparation of hand-labeled datasets for suggestion mining.

The datasets provided under this task are backed by a study of suggestions appearing in different domains and formalisation of the definition of suggestions in the context of suggestion mining [1]. The datasets have been annotated in two phases, where phase-1 employs crowdsourced annotators, and phase-2 employs in-house expert annotators.

The final datasets mark as suggestions only those sentences which explicitly express suggestions (explicit suggestions), and not those which merely provide information that could be used to infer suggestions (implicit suggestions). For example, 'I loved the cup cakes from the bakery next door' is an implicit form of a suggestion, which can be explicitly expressed as 'Do try the cupcakes from the bakery next door'.

In this year's SemEval, we are evaluating suggestion mining systems for two domains, suggestion forums and hotel reviews. Datasets are available at:


Suggestion Forums - Subtask A

Suggestion forums are dedicated forums which are used to provide suggestions for improvement in an entity. The data is collected from feedback posts on Universal Windows Platform, available on
Often people tend to provide context in suggestion posts, which gets repetitive when there is a large number of posts under the same topic (see the snapshot below). Suggestion mining can act as automatic summarisation in this use case, by identifying the sentences where a concrete suggestion is expressed. We observe that the datasets derived from this domain contain a relatively larger number of positive class instances compared to the other domains. The sentences are automatically split using the Stanford parser.
Under subtask A, training and validation sets will be provided for this domain, and the submissions will be evaluated on a test dataset from the same domain.
Number of sentences in each dataset
                  Trial data: Train set   Trial data: Test set   Train   Development   Test
Suggestions       1428                    296                    TBD     TBD           TBD
Non-suggestions   4356                    296                    TBD     TBD           TBD
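
As a rough illustration of the class balance in the trial data, the share of suggestions can be computed from the counts in the table above. The dictionary keys and the helper `suggestion_share` are names introduced here for illustration only.

```python
# (suggestions, non-suggestions) counts taken from the trial data columns above.
counts = {
    "trial train": (1428, 4356),
    "trial test": (296, 296),
}


def suggestion_share(n_suggestions, n_non_suggestions):
    """Fraction of sentences that belong to the suggestion class."""
    return n_suggestions / (n_suggestions + n_non_suggestions)


shares = {name: suggestion_share(*pair) for name, pair in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%} suggestions")
```

Roughly a quarter of the trial training sentences are suggestions, while the trial test set is balanced, in line with the statement that test class distributions are balanced out before release.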




Hotel Reviews - Subtask B

Wachsmuth et al. [2] provide a large sentiment analysis dataset of hotel reviews from the TripAdvisor website. We take a subset of these reviews; the sentences were already split in the dataset. The hotel review dataset will be used as the test dataset for subtask B. As mentioned in the Terms and Conditions, participants are free to use additional non-labeled datasets. The raw hotel review dataset provided by Wachsmuth et al. (2014) could be one such dataset, and is openly available.

Number of sentences in each dataset
                  Trial data: Test set   Test set
Suggestions       404                    TBD
Non-suggestions   404







[1] Sapna Negi, Maarten de Rijke, and Paul Buitelaar. Open Domain Suggestion Mining: Problem Definition and Datasets. arXiv preprint arXiv:1806.02179 (2018).

[2] Henning Wachsmuth, Martin Trenkmann, Benno Stein, Gregor Engels, and Tsvetomira Palakarska. A review corpus for argumentation analysis. In Proceedings of the 15th International Conference on Computational Linguistics and Intelligent Text Processing, volume 8404 of LNCS, pages 115–127, Kathmandu, Nepal, 2014. Springer.

Scoring the Trial Data

Start: Aug. 18, 2018, midnight

Scoring the Test Data

Start: Jan. 10, 2019, midnight

Competition Ends

Jan. 31, 2019, midnight
