CIKM AnalytiCup 2017: Lazada Product Title Quality Challenge

On Lazada, we have millions of products across thousands of categories. To stand out from the crowd, sellers employ creative, sometimes disruptive efforts to improve their search relevancy or attract the attention of customers. Product titles like "hot sexy red clutch rug sack travel backpack unisex cheap with free gift" degrade the user experience by cluttering the site with irrelevant, misleading titles.

In this Product Title Quality Challenge, we provide you with a set of product titles, descriptions, and attributes, together with the associated title quality scores (clarity and conciseness) as labeled by Lazada's internal QC team. Your task is to build a product title quality model that can automatically grade the clarity and the conciseness of a product title.

This challenge is part of the CIKM AnalytiCup 2017, organized in conjunction with the ACM CIKM 2017 conference. The reports of the semi-finalist teams will be publicly released online. We also invite the finalists (top 3 teams), and potentially some semi-finalists, to present their approaches at the CIKM AnalytiCup Workshop on November 6 in Singapore.

Let's work together to free Lazada from poor-quality titles and restore a catalog of clean, relevant product titles.

Prizes

The Prize Pool for the Lazada Product Title Quality Challenge is SGD15,000 and will be distributed among the participants as follows:

  • The 1st Place receives SGD6,000 (~USD4200) in Prize Money + SGD2,000 (~USD1400) in Travel Support*
  • The 2nd Place receives SGD2,000 (~USD1400) in Prize Money + SGD2,000 (~USD1400) in Travel Support*
  • The 3rd Place receives SGD1,000 (~USD700) in Prize Money + SGD2,000 (~USD1400) in Travel Support*

*In addition to Prize Money, any top-3 team that sends a member to present its solution in person at the CIKM AnalytiCup Workshop, held on November 6 in Singapore, will receive a further SGD2,000 in Travel Support.

In order to be eligible for any award, winners are required to submit the code and solution report (4 pages in ACM format) to the organizers by the stipulated deadline.

Lazada reserves the right not to award some or all of the prizes if the competition criteria are not met.

Timeline

  • 00:00 UTC, 1 May 2017: Phase 1 of the contest begins. Validation Leaderboard opens. Participants can make up to 24 submissions per day (up to 2000 submissions in total).
  • 00:00 UTC, 24 July 2017: Phase 2 of the contest begins. Testing data (without labels) is released. Testing Leaderboard opens. Participants can make up to 10 submissions in total.
  • 23:59 UTC, 7 August 2017: The contest ends.
  • 12:00 UTC, 14 August 2017: Semi-finalists (top 10 teams based on the Phase 2: Testing Leaderboard) are announced. The semi-finalists are requested to submit a 4-page report and their code.
  • 23:59 UTC, 4 September 2017: Report & code are due.
  • 12:00 UTC, 22 September 2017: Finalists (the top 3 teams among the semi-finalists whose methods are well-justified) are announced.
  • 6 November 2017: The exact placements of finalists (1st place, 2nd place, 3rd place) will be announced during the workshop at the conference.

Organizers

This competition is sponsored by Lazada Group, and is organized by:
John Berns | Kai Xin Thia | Michal Polanowski | Karthik Nallamothu | Eugene Yan,
in conjunction with CIKM AnalytiCup Co-chairs:
Hady W. Lauw | Xiaoli Li | Nik Spirin.

Questions?

  • Please first go through all the pages in this competition for complete information.
  • Information on how to make submissions can be found on the Participate tab.
  • If you have further questions, please post them on the Forums tab.

Wishing you all the best!

Data

The data contains product titles that are stratified-sampled by sales figures across multiple product categories in Singapore, Malaysia, and the Philippines.  Each line in the file refers to a specific product. For example:
my,NO037FAAA8CLZ2ANMY,RUDY Dress,Fashion,Women,Clothing,"<ul> <li>Short Sleeve</li> <li>3 Colours 8 Sizes</li> <li>Dress</li> </ul> ",33.0,local

The comma-separated values in each line correspond to the following features:

  • country : The country where the product is marketed, with three possible values: my for Malaysia, ph for the Philippines, sg for Singapore.

  • sku_id : Unique product id, e.g., "NO037FAAA8CLZ2ANMY"

  • title : Product title, e.g., "RUDY Dress"

  • category_lvl_1 : General category that the product belongs to, e.g., "Fashion"

  • category_lvl_2 : Intermediate category that the product belongs to, e.g., "Women"

  • category_lvl_3 : Specific category that the product belongs to, e.g., "Clothing"

  • short_description : Short description of the product, which may contain html formatting, e.g., "<ul> <li>Short Sleeve</li> <li>3 Colours 8 Sizes</li> <li>Dress</li> </ul> "

  • price : Price in the local currency, e.g., "33.0".  When country is my, the price is in Malaysian Ringgit.  When country is sg, the price is in Singapore Dollar.  When country is ph, the price is in Philippine Peso.

  • product_type : One of three possible values: local means the product is delivered locally, international means the product is delivered from abroad, NA means not applicable.

To get access to the data, you need to log in and go to the Participate tab.
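
To work with the file programmatically, any standard CSV parser will do. Below is a minimal loading sketch in Python; "data_train.csv" is a placeholder filename (substitute the actual file from the Participate tab), and the column names are taken from the feature list above.

```python
import pandas as pd

# Column names from the feature list above; the data file has no header row.
COLUMNS = [
    "country", "sku_id", "title", "category_lvl_1", "category_lvl_2",
    "category_lvl_3", "short_description", "price", "product_type",
]

# "data_train.csv" is a placeholder filename, not necessarily the official one.
df = pd.read_csv("data_train.csv", header=None, names=COLUMNS)
print(df[["country", "title", "price"]].head())
```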

Labels

Each product in the data is also associated with two quality scores labeled by Lazada's internal QC team based on the following content guidelines.

Clarity

  • 1 means that within 5 seconds you can understand the title and what the product is. It is easy to read, and you can quickly figure out what its key attributes are (color, size, model, etc.).
  • 0 means you are not sure what the product is after 5 seconds. It is difficult to read, and you will have to read the title more than once.

Conciseness

  • 1 means it is short yet contains all the necessary information.
  • 0 means it is either too long, with many unnecessary words, or too short, making it unclear what the product is.

Problem Definition

Your task is to build product title quality models that can automatically grade product titles based on the provided features, and output two scores for each product:

  • Clarity: a probability value between 0 and 1.  For example: 0.34139.
  • Conciseness: a probability value between 0 and 1.  For example: 0.98747.
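
For illustration only (this is not the organizers' reference method), a minimal baseline could fit one probabilistic classifier per score on TF-IDF features of the title. The label file names below are assumptions; use whatever files the Participate tab actually provides.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder filenames; substitute the actual files from the Participate tab.
titles = pd.read_csv("data_train.csv", header=None).iloc[:, 2]         # title column
clarity = pd.read_csv("clarity_train.labels", header=None).iloc[:, 0]
conciseness = pd.read_csv("conciseness_train.labels", header=None).iloc[:, 0]

# Word-level TF-IDF features over the raw titles.
vectorizer = TfidfVectorizer(max_features=50000)
X = vectorizer.fit_transform(titles)

# One binary classifier per quality score; predict_proba(...)[:, 1] is the
# probability of the positive class (label 1), which is what the task expects.
clarity_model = LogisticRegression(max_iter=1000).fit(X, clarity)
conciseness_model = LogisticRegression(max_iter=1000).fit(X, conciseness)

clarity_pred = clarity_model.predict_proba(X)[:, 1]
conciseness_pred = conciseness_model.predict_proba(X)[:, 1]
```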

Evaluation Metrics

Submissions will be evaluated using the Root-Mean-Square Error (RMSE), defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{\mathrm{pred}}^{(i)} - y_{\mathrm{ref}}^{(i)}\right)^2}$$

In the above formula, N is the number of instances, y_pred^(i) is the predicted probability value for instance i, and y_ref^(i) is the ground-truth value for that instance.

We will first measure the RMSE separately for clarity and conciseness. The ranking among participants is determined by the Overall RMSE, which is the average of the RMSE for clarity and the RMSE for conciseness. The lower the overall score, the higher the ranking.
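
A minimal sketch of the metric in Python, using toy values in place of real predictions and labels:

```python
import numpy as np

def rmse(y_pred, y_ref):
    """Root-mean-square error between predicted probabilities and ground truth."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_ref = np.asarray(y_ref, dtype=float)
    return np.sqrt(np.mean((y_pred - y_ref) ** 2))

# Toy values only, to show how the Overall RMSE is computed.
clarity_rmse = rmse([0.34139, 0.90], [0, 1])
conciseness_rmse = rmse([0.98747, 0.20], [1, 0])
overall = (clarity_rmse + conciseness_rmse) / 2
print(f"Overall RMSE: {overall:.5f}")
```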

Training / Validation / Testing Splits

  • The largest split is the Training Data, containing more than 36 thousand product titles. In addition to the features, you are also provided with the clarity and conciseness labels.
  • The second split is the Validation Data, containing more than 11 thousand product titles with the features, but no labels. This is used for validation (Phase 1), which runs from 1 May to 23 July. Until 23 July, participants can submit their solutions subject to the submission limits (24 submissions per day, 2000 submissions in total). The ranking will be continuously updated on the public leaderboard.
  • The third split is the Testing Data, containing more than 12 thousand product titles with the features, but no labels. This is used for the final evaluation (Phase 2), which runs from 24 July to 7 August. Participants are allowed to make up to 10 submissions in total.

We use this three-stage process to avoid possible "leaderboard boosting", in which the rankings/scores from the validation stage could be used to overfit the model to the test set. By having the third hold-out set, we minimize this possibility and guarantee fair evaluation.

Competition Rules

Breach of any competition rule specified below is grounds for disqualification.  In the event of any dispute in connection with the Competition, or with the interpretation or implementation of these rules, the decision of the Organizers shall be final.

Eligibility

The competition is open to everyone except employees of Lazada and anyone involved in organizing the competition.

One account per participant

You cannot sign up to CodaLab from multiple accounts, and therefore you cannot submit from multiple accounts.

Team limits

The maximum size of a team is four (4) participants.

Team mergers

Team mergers are allowed throughout Phase 1 and can be performed by the team leader (go to your account's User Settings and indicate the team name and members). In order to merge, the combined team must have a total submission count less than or equal to the maximum allowed as of the merge date, i.e., the number of submissions per day multiplied by the number of days the competition has been running (for example, 10 days into Phase 1 the cap is 24 × 10 = 240 submissions). The organizers do not provide any assistance regarding team mergers.

No private sharing outside teams

Privately sharing code or data outside of teams is not permitted. It is permitted to share code if it is made available to all participants on the forums or as a public repo (e.g., GitHub).

Submissions

You may submit a maximum of 24 entries per day during Phase 1 (Validation Leaderboard). For Phase 2 (Testing Leaderboard), you can only submit 10 entries in total.

At the end of Phase 2, the following teams must submit their code as well as a report describing their solution (4 pages in ACM format) by the stipulated deadline:

  • Semi-finalist teams in the top 10 of the Phase 2 Testing Leaderboard based on the Overall score
  • The team with the best Clarity score
  • The team with the best Conciseness score

The submitted code and reports may be inspected to check the validity of the solutions.  The reports will eventually be made publicly available on the CIKM conference website.

Selected teams will also be invited to present their solutions in person at the CIKM AnalytiCup Workshop on November 6 in Singapore.  To allocate the limited presentation slots, preference will be given to award-winning teams, as well as teams deemed by the organizers to have interesting or remarkable solutions.

Winners

The ranking of entries based on the Overall score during Phase 2 will be used to determine the winners, subject to the validity of the solutions.  Finalists will be the top 3 among the semi-finalist teams.   A tie in the Overall score will be broken in favor of the earlier submission on the final leaderboard.

Phase 1: Validation Leaderboard

Start: May 1, 2017, midnight UTC

Description: Ongoing model development and evaluation based on validation data. The results are shown on the Validation Leaderboard.

Phase 2: Testing Leaderboard

Start: July 24, 2017, midnight UTC

Description: Final submission based on the testing data. The results are shown on the Testing Leaderboard.

Competition Ends

Aug. 7, 2017, 11:59 p.m. UTC
