CIKM Cup 2016 Track 1: Cross-Device Entity Linking Challenge

Organized by spirinus

Previous

Phase 1: Validation Leaderboard
Aug. 5, 2016, midnight UTC

Current

Phase 2: Testing Leaderboard
Oct. 2, 2016, midnight UTC

End

Competition Ends
Oct. 5, 2020, midnight UTC
Online advertising is perhaps the most successful business model for the Internet to date and a major element of the online ecosystem. Advertising companies help their clients market products and services to the right audiences of online users. In doing so, they collect a lot of user-generated data (e.g. browsing logs and ad clicks), perform sophisticated user profiling, and compute the similarity of ads to user profiles. User identity plays an essential role in the success of an online advertising company or platform.
 
As the number and variety of devices increase, online user activity becomes highly fragmented. People check their mobile phones on the go, do their main work on laptops, and read documents on tablets. Unless a service supports persistent user identities (e.g. Facebook Login), the same user on different devices is treated as several independent users. Rather than modeling at the user level, online advertising companies have to deal with weak user identities at the level of devices. Moreover, even the same device can be shared by many users, e.g. kids and parents sharing a computer at home. Therefore, building an accurate user identity is a very difficult and important problem for advertising companies. The crucial task in this process is finding the same user across multiple devices and integrating her/his digital traces to perform more accurate profiling.
 
The Cross-Device Entity Linking Challenge provides a unique opportunity for academia and industry researchers to work on this challenging task. We encourage both early career and senior researchers to participate in the challenge by testing new ideas for cross-device matching and consolidating the approaches already published and described in the existing work. Successful participation in the challenge requires solid knowledge of entity resolution, link prediction, and record linkage algorithms, to name just a few.
 
For model development, we release a new dataset provided by Data-Centric Alliance (DCA). The dataset contains an anonymized browse log for a set of anonymized userIDs representing the same user across multiple devices. We also provide obfuscated site URLs and HTML titles. From a graph-theoretical perspective, we release data about nodes (userIDs at the level of devices and the corresponding click-stream logs) and a subset of known existing edges. The participants have to predict new edges (identify the same user across multiple devices). The evaluation is done by comparing the predicted edges with the ground-truth edges using the F1 measure.
 
The Challenge is part of CIKM 2016 and continues the CIKM Cup series organized as part of the ACM CIKM conference. The reports of the winning teams will be publicly released online. We also invite all participants to present their approaches at the CIKM Cup Workshop on October 28th in Indianapolis, USA.
 
Since online advertising is an industry dealing with sensitive and large-scale datasets, it is hard for academic researchers to get access to and work with such datasets. Therefore, this challenge might be especially interesting for researchers from academia who want to work with a real large-scale advertising dataset and experiment with various known and new graph mining algorithms applied to the cross-device matching problem.
 
We also warmly welcome and encourage the participation of:
  • industry researchers from companies working on online advertising, including major RTB/DMP/DSP vendors such as BlueKai, Turn, Lotame, eXelate, OpenX, etc.;
  • industry researchers and engineers who have accumulated a lot of expertise relevant to this problem. We encourage teams from top research labs such as Microsoft Research, Google Research, Yahoo Labs, Yandex, and Baidu Labs to join in;
  • early career data scientists and professors teaching data mining and (social/information) network mining, who could leverage the challenge to teach or learn by doing, with unique access to a large-scale real-world dataset.
We hope that you will enjoy participating in the Cross-Device Entity Linking Challenge and push your creativity and data mining talent to the limit. Good luck!

Main challenge

You should find the correct pairs between users that are not represented in train.csv (these are the test userIDs).

Every test userID (excluding ~0.5% "noise" userIDs) is connected with some other test userIDs.

Metric

The goal of this competition is to identify the same users across multiple devices (predict new edges in the anonymized userID graph) using the browse log and associated meta-data. The participants have to submit the most likely pairs of anonymized userIDs (edges in the matching graph), where both userIDs are associated with the same user.  One userID might match to many userIDs because a user might have more than 2 devices. 
 
Submissions will be evaluated using the F1 measure (the harmonic mean of Precision and Recall) by comparing the ground-truth userID matching with the one predicted by the participants. Specifically, Precision is the number of correctly predicted pairs divided by the number of pairs submitted by a participant, and Recall is the number of correctly predicted pairs divided by the number of all ground-truth matching pairs available in the validation (phase 1) or test (phase 2) set.
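
For offline evaluation on a held-out part of the training pairs, the metric described above can be approximated with a few lines of Python. This is only a sketch, not the official scoring script: the function name f1_score is made up for illustration, pairs are assumed to be tuples already in alphabetical order, and duplicate pairs are assumed to count once.

```python
def f1_score(predicted_pairs, truth_pairs):
    """Return (precision, recall, f1) for two collections of (userID, userID) pairs."""
    predicted = set(predicted_pairs)  # assumption: duplicate lines count once
    truth = set(truth_pairs)
    if not predicted or not truth:
        return 0.0, 0.0, 0.0
    correct = len(predicted & truth)
    precision = correct / len(predicted)  # correct pairs among submitted pairs
    recall = correct / len(truth)         # correct pairs among ground-truth pairs
    if correct == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Toy example with made-up IDs: 2 of 3 submitted pairs are correct,
# and 2 of 4 ground-truth pairs are found.
truth = [("a", "b"), ("a", "c"), ("d", "e"), ("f", "g")]
predicted = [("a", "b"), ("d", "e"), ("b", "c")]
precision, recall, f1 = f1_score(predicted, truth)
print(precision, recall, f1)
```

In the toy example Precision is 2/3 and Recall is 2/4, which gives F1 = 4/7.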
 
The ground-truth data is provided by the Data Centric Alliance (DCA) and its partners. The fact that two ground-truth userIDs are associated with the same user is established as follows:
  • The user uses both a mobile and a web app provided by the same company with a persistent user identity (e.g. Facebook Login; DISCLAIMER: Facebook is used here only to illustrate the concept of persistent identity and is not related to the CIKM Cup 2016 Cross-Device Entity Linking Challenge). This company provides the matching information to DCA under a privacy-preserving partnership agreement.
  • The userIDs are known to belong to the same user based on the information provided by ad exchange platforms during the RTB process.
  • For robustness, we add a pair of userIDs to the ground-truth matching only if it is confirmed in three different sources.
In total, there are 721,443 ground-truth pairs of matched userIDs. A larger subset of these pairs (506,136) is released for the model development and the rest of the known matching pairs (215,307) are used for testing. 50% of the correct pairs are used for the first phase and 50% are used for the second phase during the final stage of the Challenge. 

Submission Format

The participants have to submit a list of userID,userID pairs. Each line must contain two comma-separated values (CSV format):
 
(line 1) userID,userID
(line 2) userID,userID
(line 3) userID,userID
...
(line 215307) userID,userID
 
The lines 1 through 215,307 contain pairs of userIDs. Please check the format of the provided baseline submission in the case of difficulties. For example, the first five lines from a baseline submission are:
 
d3cc7d6292a7e3b91d1ee70eabafeb3f,fab17b41a6e200b6296448fdc245055f
4c5d237606a5bcd590ed656302a293c1,c2cf6c52419c784bc9672c315c951174
d3cc7d6292a7e3b91d1ee70eabafeb3f,fab17b41a6e200b6296448fdc245055f
94c2d6a5d5c504412494a05860a65d53,9c7634d90def17915bd5118f89688193
5fab89ec22ea3d9c403b0b539882e492,f067ceab3ce63afcc256690b05515b75
 
If more than 215,307 pairs are provided, the scoring algorithm will read all of them and calculate the F1 measure. If fewer than 215,307 pairs are provided, we will calculate the F1 measure only for the submitted pairs; the missing lines are not filled in with incorrect pairs. For both Phase 1 and Phase 2, the participants have to submit 215,307 matching pairs. Depending on the stage of the challenge, the scoring script will automatically use 50% of the correct pairs to update the leaderboard.

Baseline

The baseline works as follows:
  1. Build a TF-IDF matrix from each user's domains.
  2. For each userID that is not represented in train.csv, find the 15 nearest neighbors by Euclidean distance, generate 15 pairs in alphabetical order, and save the distance scores.
  3. Sort the pairs by distance in ascending order.
  4. Take the top 215,307 pairs.
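
The four steps can be sketched in plain Python as follows. This is not the organizers' script: the function names, the smoothed IDF formula, the L2 normalization, and the toy input are all assumptions made for illustration. The input dictionary is assumed to contain only test userIDs, each mapped to the list of hashed domains that user visited.

```python
import math
from collections import Counter


def tfidf_vectors(user_domains):
    """Step 1: sparse, L2-normalized TF-IDF vector per user (smoothed IDF is an assumption)."""
    n_users = len(user_domains)
    doc_freq = Counter()
    for domains in user_domains.values():
        doc_freq.update(set(domains))
    vectors = {}
    for uid, domains in user_domains.items():
        weights = {d: count * (math.log((1 + n_users) / (1 + doc_freq[d])) + 1.0)
                   for d, count in Counter(domains).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors[uid] = {d: w / norm for d, w in weights.items()} if norm else {}
    return vectors


def euclidean(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in set(u) | set(v)))


def baseline_pairs(user_domains, k=15, top_n=215307):
    vectors = tfidf_vectors(user_domains)
    uids = sorted(vectors)
    scored = {}
    for uid in uids:
        # Step 2: k nearest neighbors by brute force (quadratic, fine only for toy data).
        nearest = sorted((euclidean(vectors[uid], vectors[other]), other)
                         for other in uids if other != uid)[:k]
        for distance, other in nearest:
            pair = (uid, other) if uid < other else (other, uid)  # alphabetical order
            scored[pair] = distance
    # Step 3: ascending distance. Step 4: keep the top_n closest pairs.
    ranked = sorted(scored.items(), key=lambda item: (item[1], item[0]))
    return [pair for pair, _ in ranked[:top_n]]


# Toy input with made-up userIDs and domain tokens.
toy = {
    "u1": ["news", "mail", "news"],
    "u2": ["news", "mail"],
    "u3": ["games", "video"],
    "u4": ["games", "video", "video"],
}
pairs = baseline_pairs(toy, k=1, top_n=2)
print(pairs)
```

The brute-force neighbor search compares every user with every other user, so for the full test set (98,255 users) an indexed nearest-neighbor search over a sparse matrix would be needed instead.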
 
The corresponding Python script is available here. The organizers tested this script by creating a baseline submission.
 
IMPORTANT: The name of the submission file should be submission.txt and prior to submitting it must be zip-archived. The final file format is submission.txt.zip.
 
IMPORTANT: All pairs should be in alphabetical order. For example, if the pair (abc, dce) is correct, the pair (dce, abc) will be counted as incorrect.
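
The two requirements above (alphabetical order within each pair, and a zip archive containing submission.txt) can be handled with the Python standard library. The sketch below is only an illustration: the function name write_submission, the toy IDs, and the temporary output directory are assumptions, and string sorting is assumed to match the alphabetical order expected by the scoring script.

```python
import os
import tempfile
import zipfile


def write_submission(pairs, zip_path):
    """Write pairs as 'userID,userID' lines into submission.txt inside a zip archive."""
    lines = []
    for first_id, second_id in pairs:
        # Put the two IDs of each pair in alphabetical order, as the rules require.
        low, high = sorted((first_id, second_id))
        lines.append(f"{low},{high}")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # The file inside the archive must be called submission.txt.
        archive.writestr("submission.txt", "\n".join(lines) + "\n")
    return lines


# Toy pairs with made-up IDs; the archive is written to a temporary directory.
demo_dir = tempfile.mkdtemp()
zip_path = os.path.join(demo_dir, "submission.txt.zip")
lines = write_submission([("dce", "abc"), ("abc", "xyz")], zip_path)
print(lines)
```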

Train / Test Public / Test Private Splits

We partitioned the data set into three parts:
  • The first and the largest part is used for the model development. You can use this part to train and evaluate your model offline on your own machine.
  • The second part is used for validation (phase 1), which runs from Aug 5th to Oct 2nd. Until Oct 2nd, the participants can submit their solutions as long as they stay within the daily submission limit (15 submissions per day, to keep the load on the server manageable). The ranking will be continuously updated on the public leaderboard.
  • The third part is used for the final evaluation in the period from Oct 2nd till October 5th. The participants are allowed to submit the final prediction only 3 times. After that the system will not accept the files.
We use the three-stage process to avoid possible "leaderboard boosting", when the ranking/scores from the validation stage could be used for overfitting the model to the test set. By having the third hold-out set, we minimize this possibility and guarantee fair evaluation.

Prizes

Money and a meeting with a Distinguished Researcher

The Prize pool for the Cross-device Matching Challenge is $5,000 and will be distributed among the participants as follows:
  • The 1st Place receives $1,600 in prize money and an opportunity to meet with a Distinguished Researcher with expertise relevant to the Challenge topic.
  • The 2nd Place receives $1,300 in prize money and an opportunity to meet with a Distinguished Researcher with expertise relevant to the Challenge topic.
  • The 3rd Place receives $1,000 in prize money and an opportunity to meet with a Distinguished Researcher with expertise relevant to the Challenge topic.
  • The 4th Place receives $700 in prize money and an opportunity to meet with a Distinguished Researcher with expertise relevant to the Challenge topic.
  • The 5th Place receives $400 in prize money and an opportunity to meet with a Distinguished Researcher with expertise relevant to the Challenge topic.
To receive the prize money, the winners must publicly describe their approaches in the form of a research report. No participation in the CIKM Workshop is required. However, we highly encourage all participants to attend the CIKM Workshop to meet fellow data scientists and present their approaches on October 28th in Indianapolis, USA. The meeting with the Distinguished Researcher (the name will be announced shortly; we are currently discussing this opportunity with several strong candidates) will be arranged during the CIKM conference (details will be announced separately) or via Skype if a winner cannot make it to the conference.

Collaboration Opportunity

The top 3 participants from academia, based on the final leaderboard ranking, will be offered an opportunity to collaborate with the data provider (DCA) after the competition is over. If a winner is from industry, s/he will not be eligible for this prize. For example, for the leaderboard (1 industry, 2 academia, 3 academia, 4 industry, 5 academia), the participants ranked 2, 3, and 5 will be offered the opportunity to continue the collaboration.
 
We offer this prize because we understand the challenges that academia faces without access to real-world datasets. At the same time, the organizer cannot release the dataset permanently and to everyone because of the sensitive nature of the data. We hope that with this merit-based data sharing mechanism, we can both enable high-quality research with publicly accessible results and protect the privacy of online users.

Competition Rules

One account per participant

You cannot sign up to CodaLab from multiple accounts and therefore you cannot submit from multiple accounts.

No private sharing outside teams

Privately sharing code or data outside of teams is not permitted. It's okay to share code if it is made available to all participants on the forums or as a public GitHub repo.

Team Mergers

Team mergers are allowed and can be performed by the team leader. In order to merge, the combined team must have a total submission count less than or equal to the maximum allowed as of the merge date. The maximum allowed is the number of submissions per day multiplied by the number of days the competition has been running. The organizers don't provide any assistance regarding the team mergers. 
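
As a worked example of the merger rule above, the check can be written as a small Python sketch. The function names are made up for illustration, the daily limit of 15 is taken from the Submission Limits rule, and counting the start day as day 1 is an assumption, since the rules do not say how days are counted.

```python
from datetime import date


def max_allowed_submissions(start_day, merge_day, per_day=15):
    """Submissions per day multiplied by the number of days the competition has run."""
    days_running = (merge_day - start_day).days + 1  # assumption: start day counts as day 1
    return per_day * days_running


def merge_allowed(submission_counts, start_day, merge_day, per_day=15):
    """True if the combined submission count of the merging teams is within the limit."""
    return sum(submission_counts) <= max_allowed_submissions(start_day, merge_day, per_day)


start = date(2016, 8, 5)       # Phase 1 start date
merge_day = date(2016, 8, 14)  # hypothetical merge date
limit = max_allowed_submissions(start, merge_day)
ok = merge_allowed([90, 60], start, merge_day)
too_many = merge_allowed([90, 70], start, merge_day)
print(limit, ok, too_many)
```

Under these assumptions, two teams with 90 and 60 submissions could merge on the tenth day (limit 150), while teams with 90 and 70 submissions could not.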

Team Limits

The maximum size of a team is three participants.

Submission Limits

You may submit a maximum of 15 entries per day during the first stage (validation). For the second stage (test), you can only submit three times.

Terms and Conditions

  • You agree to the Challenge Rules you are reading now.
  • The organizers, employees of DCA, and all people who had access to the ground-truth data are not eligible for the Prize.
  • Team mergers are allowed until the second stage of the competition starts (Oct 2nd, 2016).
  • The winners will be offered an opportunity to collaborate with the data provider after the competition is over.
  • The winners are required to share a public report/paper (4-8 pages, ACM double-column format) to be eligible for the CIKM Cup Award Certificate, the Prize Money allocation and the meeting with a Distinguished Researcher. All participants are highly encouraged but not required to submit papers documenting their approaches and present them during the CIKM Cup workshop in Indianapolis on October 28th. The papers will be shared publicly on the official CIKM Cup 2016 website, as was done for WSDM Cup 2016. Unlike at previous CIKM conferences, the workshop proceedings this year will NOT be included in the ACM Digital Library. This eliminates any concern of self-plagiarism if the authors resubmit their workshop papers to a formal publication venue.
  • The participants do not have to share or open source their source code. This is a common convention in the research community allowing researchers from industrial labs to participate in the challenge.
  • You agree that all submissions that you make during this competition could be used by the Organizer to build an aggregated ensemble cross-device matching model, with the results of this experiment released in a publicly accessible research report.

The participants can post a question to the CIKM Cup 2016 CodaLab forum or email the organizers at cikmcup [ symbol ] gmail [ symbol ] com with the subject "CIKM Cup 2016: Track 1 (DCA)" (we will do our best to share relevant updates with all participants but encourage people to use the forum).

Data Files

The dataset can be downloaded from here.  
 
To allay privacy concerns and protect business-sensitive information, the data is fully anonymized. Only meaningless IDs of users and hashed URLs/titles are released. All actions performed by a userID are grouped together.
 
There are four different files described below.

facts.json (~2.64GB)

A browsing log containing a list of events for a specific userID: fid is an eventID; ts is a timestamp; and uid is a userID. We anonymized userIDs by hashing internal anonymized DCA userIDs one more time with an MD5-based hash function. An example record is presented below:
 

{"facts": [{"fid": 9140201, "ts": 1464769462076}, {"fid": 8799201, "ts": 1464759923649}, {"fid": 7644575, "ts": 1464759921103}, {"fid": 7286929, "ts": 1464759913447}, {"fid": 7644575, "ts": 1464759891103}, {"fid": 7286929, "ts": 1464759883447}, {"fid": 10816834, "ts": 1464759535330}, {"fid": 8799201, "ts": 1464759484110}], "uid": "59e3393261202d419e3c2721a6e15f9f"}
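
A record like the one above can be read with Python's json module. The sketch below assumes that facts.json stores one such JSON object per line (the description does not state this explicitly); the function name parse_fact_line is made up for illustration, and the sample is a shortened copy of the example record.

```python
import json


def parse_fact_line(line):
    """Return (uid, events) where events is a list of (ts, fid) sorted by timestamp."""
    record = json.loads(line)
    events = sorted((fact["ts"], fact["fid"]) for fact in record["facts"])
    return record["uid"], events


# Shortened copy of the example record (first two events only).
sample = ('{"facts": [{"fid": 9140201, "ts": 1464769462076}, '
          '{"fid": 8799201, "ts": 1464759923649}], '
          '"uid": "59e3393261202d419e3c2721a6e15f9f"}')
uid, events = parse_fact_line(sample)
print(uid, events)

# Assumed usage on the full file, one record per line:
# with open("facts.json") as handle:
#     for line in handle:
#         uid, events = parse_fact_line(line)
```

Reading the file line by line keeps memory usage low, which matters for a file of about 2.64GB.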

urls.csv (~1.2GB)

A mapping from an fid to the hashed URL. The hashing is done by replacing all words in the URL with their hash-codes. To hash URLs, we: (1) build the vocabulary by concatenating all available textual data such as URLs and titles; (2) for each unique word, assign a hash-code using an MD5-based hash function; (3) replace each word with the corresponding hash-code. Slashes in the URLs are preserved in the original form. An example is presented below:

13469796,ed95a9a5be30e4c8/2a3448823137f338/06429febabd51328?868579c6e3d277e
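
A line of urls.csv can be split into its fid and hashed URL tokens as sketched below. The function name is made up for illustration, and two things are assumptions rather than documented facts: that the part before the first slash corresponds to the hashed domain, and that "/" and "?" are the only separators that need handling.

```python
import re


def parse_url_line(line):
    """Return (fid, domain, tokens) for one 'fid,hashed_url' line."""
    fid_text, hashed_url = line.rstrip("\n").split(",", 1)
    # Assumption: the first segment of the hashed URL is the hashed domain.
    domain = hashed_url.split("/", 1)[0]
    # Assumption: "/" and "?" are the only separators between hashed words.
    tokens = [token for token in re.split(r"[/?]", hashed_url) if token]
    return int(fid_text), domain, tokens


sample = "13469796,ed95a9a5be30e4c8/2a3448823137f338/06429febabd51328?868579c6e3d277e"
fid, domain, tokens = parse_url_line(sample)
print(fid, domain, len(tokens))
```

Joining these results with the fid values from facts.json gives the list of domains per userID, which is the kind of input a domain-based model needs.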

titles.csv (~240.3MB)

A mapping from an fid to the hashed HTML title. The hashing is done by replacing all words in the title with their hash-codes based on MD5. We use the same encoding as for URLs, i.e. if a word appears both in a URL and in a title, you will see the same hash-code in both. Spaces between words are preserved. An example is presented below:

6847456,e89c7e0a7501863e e16ec727e36197f3 b764c27d3881dc01 003936c4683cbc1d a37475fd0852f506 e89c7e0a7501863e
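
Since spaces between words are preserved, a title line can be tokenized by splitting on whitespace. The sketch below uses the example line above; the function name parse_title_line is made up for illustration.

```python
from collections import Counter


def parse_title_line(line):
    """Return (fid, tokens) for one 'fid,hashed title' line."""
    fid_text, hashed_title = line.rstrip("\n").split(",", 1)
    return int(fid_text), hashed_title.split()


sample = ("6847456,e89c7e0a7501863e e16ec727e36197f3 b764c27d3881dc01 "
          "003936c4683cbc1d a37475fd0852f506 e89c7e0a7501863e")
fid, tokens = parse_title_line(sample)
counts = Counter(tokens)  # per-title word counts, e.g. as bag-of-words features
print(fid, len(tokens), counts["e89c7e0a7501863e"])
```

In the example, the first and the last word of the title share the same hash-code, so that token is counted twice.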

train.csv (~33.4MB)

A set of matching userIDs for training a supervised cross-device matching model. The userIDs from train.csv and test.csv don't overlap.

Dataset Statistics

  • The number of unique tokens in titles (dictionary size): 8,485,859
  • The number of unique tokens in URLs (dictionary size): 27,398,114
  • The average number of events/facts per userID: 197
  • The median number of events/facts per userID: 106
  • The number of unique domain names: 282,613
  • The number of all events for all users combined: 66,808,490
  • The number of unique userIDs: 339,405
  • The number of unique websites (domain + URL path): 14,148,535
  • The number of users in the train set: 240,732
  • The number of users in the test set (public and private leaderboard combined): 98,255
  • Known matching pairs for training: 506,136
  • Known matching pairs for testing (public and private leaderboard combined): 215,307

Phase 1: Validation Leaderboard

Start: Aug. 5, 2016, midnight

Description: Ongoing model development and evaluation with the results on the public leaderboard.

Phase 2: Testing Leaderboard

Start: Oct. 2, 2016, midnight

Description: Final submission.

Competition Ends

Oct. 5, 2020, midnight

# Username Score
1 agrigorev 0.3851214625
2 u.tanielian 0.4255321060
3 dremovd 0.4149406605