CIKM AnalytiCup 2017: Lazada Product Title Quality Challenge Forum


> Official External Data Thread

For this competition, you are expected to work with the dataset by Lazada. Use of external data (beyond that provided by the competition) is permitted, provided the data is freely available.

If you are using a source of external data, you must post the source to the official external data forum thread no later than two weeks prior to the deadline of Phase 1. Once a source is posted here, you do not need to repost it.

This requirement is to ensure:
1. You have obtained the data legally.
2. The organizers and the community have the opportunity to examine the validity and the appropriateness of the data.
3. All the participants are on the same footing, and no one is advantaged because of privileged access to a special dataset.
4. Sharing data with each other might result in better insights and more interesting models.

The organizers reserve the right to rule out specific datasets if they are found to be inappropriate.

Posted by: hadylauw @ May 12, 2017, 9:31 a.m.


I will probably end up using GloVe word embeddings, plus some Python packages that provide trained models, such as spaCy, NLTK and antispam.

Posted by: mnicosia @ July 8, 2017, 11:09 a.m.

I will also use GloVe word embeddings and maybe some corpora from NLTK.

Posted by: victor191 @ July 9, 2017, 9 a.m.

I used pre-trained word2vec, char2vec and NLTK resources.

Posted by: thanhvu @ July 9, 2017, 1:16 p.m.

I will use publicly available text embeddings. I will also use Lazada website links for any additional data.

Posted by: GD @ July 10, 2017, 4:20 a.m.

I have used GloVe embeddings, spaCy and NLTK.

Posted by: mcp @ July 10, 2017, 6:30 a.m.

I used SentiWordNet.

Posted by: Murkrow @ July 10, 2017, 10:20 a.m.

I might use a color dictionary that I generated myself.

Posted by: sherryxue1991 @ July 10, 2017, 11:02 a.m.

- Data: Wikipedia English articles data dump
- Pre-trained models: GloVe (Stanford NLP Group), Word2Vec (Google), OpenNLP

Posted by: DataNinja @ July 10, 2017, 3:01 p.m.

Can I use some web pages from the Lazada website? Thank you.

Posted by: sherryxue1991 @ July 10, 2017, 3:27 p.m.

I am using word embeddings (available publicly), spaCy and NLTK.

Posted by: samarthagarwal23 @ July 10, 2017, 4:01 p.m.

I also use the NLTK stopword corpus.

Posted by: Murkrow @ July 10, 2017, 4:03 p.m.

We used GloVe embeddings, spaCy, NLTK, and a list of product brands from Lazada.

Posted by: Saigonapps @ July 10, 2017, 10:09 p.m.

@sherryxue1991 Sure, you can use Lazada website data.
> Can I use some web pages from the Lazada website? Thank you.

Posted by: kaixin.thia_lazada @ July 11, 2017, 3:25 a.m.

Hi, my team may use GloVe embeddings, spaCy and NLTK.

Posted by: TangYifan @ July 18, 2017, 7:58 a.m.

I would use tools like NLTK, spaCy, BeautifulSoup, sklearn and Keras for this competition.

Posted by: naiven @ July 18, 2017, 10:39 a.m.


My team is using GloVe word embeddings and MeaningCloud text analysis.

Best regards,

Posted by: octavioloyola @ July 19, 2017, 6:34 p.m.

YifanTang's post (cont'd)

We also use GoogleNews-vectors-negative300.bin

Posted by: fangyizhang @ July 21, 2017, 8:46 a.m.