On Lazada, we have millions of products across thousands of categories. To stand out from the crowd, sellers employ creative, sometimes disruptive tactics to improve their search relevance or attract the attention of customers. Product titles like "hot sexy red clutch rug sack travel backpack unisex cheap with free gift" degrade the user experience by cluttering the site with irrelevant, misleading titles.
In this Product Title Quality Challenge, we provide you with a set of product titles, descriptions, and attributes, together with the associated title quality scores (clarity and conciseness) as labeled by Lazada's internal QC team. Your task is to build a product title quality model that automatically grades the clarity and conciseness of a product title.
This challenge is part of the CIKM AnalytiCup 2017, organized as part of the ACM CIKM 2017 conference. The reports of the semi-finalist teams will be publicly released online. We also invite the finalists (top 3 teams), and potentially some semi-finalists, to present their approaches at the CIKM AnalytiCup Workshop on November 6 in Singapore.
Let's work together to free Lazada from poor-quality titles and restore a state of clean, relevant product titles.
The Prize Pool for the Lazada Product Title Quality Challenge is SGD15,000 and will be distributed among the participants as follows:
*In addition to the prize money, any top-3 team that sends a member to present their solution in person at the CIKM AnalytiCup Workshop held on November 6 in Singapore will receive a further SGD2,000 in travel support.
In order to be eligible for any award, winners are required to submit the code and solution report (4 pages in ACM format) to the organizers by the stipulated deadline.
Lazada reserves the right not to award some or all of the prizes if the competition criteria are not met.
This competition is sponsored by Lazada Group, and is organized by:
John Berns | Kai Xin Thia | Michal Polanowski | Karthik Nallamothu | Eugene Yan,
in conjunction with CIKM AnalytiCup Co-chairs:
Hady W. Lauw | Xiaoli Li | Nik Spirin.
Wishing you all the best!
The data contains product titles sampled, stratified by sales figures, across multiple product categories in Singapore, Malaysia, and the Philippines. Each line in the file refers to a specific product. For example:
my,NO037FAAA8CLZ2ANMY,RUDY Dress,Fashion,Women,Clothing,"<ul> <li>Short Sleeve</li> <li>3 Colours 8 Sizes</li> <li>Dress</li> </ul> ",33.0,local
The comma-separated values in each line correspond to the following features:
country : The country where the product is marketed, with three possible values: my for Malaysia, ph for Philippines, sg for Singapore
sku_id : Unique product id, e.g., "NO037FAAA8CLZ2ANMY"
title : Product title, e.g., "RUDY Dress"
category_lvl_1 : General category that the product belongs to, e.g., "Fashion"
category_lvl_2 : Intermediate category that the product belongs to, e.g., "Women"
category_lvl_3 : Specific category that the product belongs to, e.g., "Clothing"
short_description : Short description of the product, which may contain html formatting, e.g., "<ul> <li>Short Sleeve</li> <li>3 Colours 8 Sizes</li> <li>Dress</li> </ul> "
price : Price in the local currency, e.g., "33.0". When country is my, the price is in Malaysian Ringgit. When country is sg, the price is in Singapore Dollar. When country is ph, the price is in Philippine Peso.
product_type : The product's delivery type, with three possible values: local means the product is delivered locally, international means the product is delivered from abroad, NA means not applicable.
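The data-format description above can be exercised with the standard library's `csv` module, which correctly handles the quoted `short_description` field. The column list below simply restates the features in the order given; the variable names are our own.

```python
import csv
import io

# Column names, in the order described in the data specification.
COLUMNS = [
    "country", "sku_id", "title", "category_lvl_1", "category_lvl_2",
    "category_lvl_3", "short_description", "price", "product_type",
]

# The example line from the data description.
sample = ('my,NO037FAAA8CLZ2ANMY,RUDY Dress,Fashion,Women,Clothing,'
          '"<ul> <li>Short Sleeve</li> <li>3 Colours 8 Sizes</li> '
          '<li>Dress</li> </ul> ",33.0,local')

# csv.reader respects the double quotes around short_description,
# so embedded commas or HTML do not break the field boundaries.
row = next(csv.reader(io.StringIO(sample)))
record = dict(zip(COLUMNS, row))

print(record["title"])  # RUDY Dress
print(record["price"])  # 33.0
```

Note that `price` is parsed as a string; convert it with `float()` and interpret the currency according to the `country` field.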
To access the data, log in and go to the Participate tab.
Each product in the data is also associated with two quality scores labeled by Lazada's internal QC team based on the following content guidelines.
Your task is to build product title quality models that can automatically grade product titles based on the provided features, and output two scores for each product:
Submissions will be evaluated using the Root-Mean-Square Error (RMSE), defined as follows:

RMSE = sqrt( (1/N) * Σ_{i=1}^{N} (ypred_i - yref_i)^2 )

In the above formula, N is the number of instances, ypred_i is the predicted probability value for instance i, and yref_i is the ground-truth value for that instance.
We will first measure the RMSE separately for clarity and conciseness. The ranking among participants is determined by the Overall RMSE, which is the average of the RMSE for clarity and the RMSE for conciseness. The lower the overall score, the higher the ranking.
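The evaluation described above can be sketched in a few lines of plain Python. The prediction and ground-truth values below are made-up numbers for illustration only, not actual competition data.

```python
import math

def rmse(y_pred, y_ref):
    """Root-Mean-Square Error between predictions and ground-truth values."""
    n = len(y_pred)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(y_pred, y_ref)) / n)

# Hypothetical per-product scores, for illustration only.
clarity_rmse = rmse([0.9, 0.4, 0.7], [1.0, 0.0, 1.0])
conciseness_rmse = rmse([0.2, 0.8, 0.5], [0.0, 1.0, 1.0])

# The ranking metric is the average of the two per-score RMSEs;
# lower is better.
overall_rmse = (clarity_rmse + conciseness_rmse) / 2
print(overall_rmse)
```

A perfect submission would score an Overall RMSE of 0, since both per-score RMSEs would be zero.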
We use a three-stage process to avoid possible "leaderboard boosting", where the rankings and scores from the validation stage could be used to overfit the model to the test set. By holding out a third evaluation set, we minimize this possibility and guarantee fair evaluation.
Breach of any competition rule specified below is grounds for disqualification. In the event of any dispute in connection with the Competition, or with the interpretation or implementation of these rules, the decision of the Organizers shall be final.
The competition is open to everyone except employees of Lazada and anyone involved in organizing the competition.
You may not sign up for multiple CodaLab accounts, and therefore may not submit from multiple accounts.
The maximum size of a team is four (4) participants.
Team mergers are allowed throughout Phase 1, and can be performed by the team leader (go to your account's User Settings and indicate the team name and members). In order to merge, the combined team must have a total submission count less than or equal to the maximum allowed as of the merge date. The maximum allowed is the number of submissions per day multiplied by the number of days the competition has been running. The organizers do not provide any assistance with team mergers.
Privately sharing code or data outside of a team is not permitted. Sharing code is permitted if it is made available to all participants on the forums or as a public repository (e.g., GitHub).
You may submit a maximum of 24 entries per day during Phase 1 (Validation Leaderboard). For Phase 2 (Testing Leaderboard), you may submit only 10 entries in total.
At the end of Phase 2, the selected teams are to submit their code as well as a report describing their solution (4 pages in ACM format) by the stipulated deadline.
The submitted code and reports may be inspected to check the validity of the solution. The reports will eventually be made publicly available on the CIKM conference website.
Selected teams will also be invited to present their solutions in person at the CIKM AnalytiCup Workshop on November 6 in Singapore. To allocate the limited presentation slots, preference will be given to award-winning teams, as well as teams deemed by the organizers to have interesting or remarkable solutions.
The ranking of entries based on the Overall score during Phase 2 will be used to determine the winners, subject to the validity of the solutions. Finalists will be the top 3 among the semi-finalist teams. A tie in the Overall score will be broken in favor of the earlier submission on the final leaderboard.
Phase 1 (Validation Leaderboard)
Start: May 1, 2017, midnight
Description: Ongoing model development and evaluation based on validation data. The results are shown on the Validation Leaderboard.
Phase 2 (Testing Leaderboard)
Start: July 24, 2017, midnight
Description: Final submission based on the testing data. The results are shown on the Testing Leaderboard.
Competition end: Aug. 7, 2017, 11:59 p.m.