Given a new question Q (the original question) and the first ten related questions retrieved from the forum by a search engine, each associated with the first ten comments in its thread, the goal is to rank the 100 comments according to their relevance to the original question. We want the “Good” comments to be ranked above the “PotentiallyUseful” and “Bad” comments; the latter two are both treated as bad in the evaluation. Although systems are expected to rank all 100 comments, we take an application-oriented view in the evaluation, assuming that users would like to have good comments concentrated in the first ten positions. We believe users care much less about what happens in lower positions (e.g., after the 10th) in the ranking, as they typically do not ask for the next page of results in a search engine such as Google or Bing. This is reflected in our primary evaluation score, MAP, which we restrict to consider only the top ten results.
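To make the metric concrete, here is a minimal sketch of MAP restricted to the top ten positions. It is not the official scorer: the function names are hypothetical, and it assumes that each question's average precision is normalized by the number of “Good” comments found in the top ten, which may differ from what the provided script does.

```python
def rank_by_score(scored_comments):
    """Order (score, is_good) pairs by descending score; return the labels."""
    ordered = sorted(scored_comments, key=lambda pair: pair[0], reverse=True)
    return [is_good for _score, is_good in ordered]


def average_precision_at_k(labels_in_ranked_order, k=10):
    """Average precision over the top-k positions of one ranked list.

    Assumption: normalize by the number of good comments seen in the top k.
    """
    hits = 0
    precision_sum = 0.0
    for position, is_good in enumerate(labels_in_ranked_order[:k], start=1):
        if is_good:
            hits += 1
            # Precision at this position: good comments so far / position.
            precision_sum += hits / float(position)
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings, k=10):
    """Mean of the per-question average precision values."""
    if not rankings:
        return 0.0
    total = sum(average_precision_at_k(labels, k) for labels in rankings)
    return total / float(len(rankings))
```

Here only “Good” comments count as True; “PotentiallyUseful” and “Bad” are both False. Comments ranked below position ten do not affect the value at all, which matches the application-oriented view described above.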

Once you've created an account and registered, you can begin submitting your output for evaluation. You can also run the evaluation locally using the script provided in the Dropbox folder; it requires Python 2.7 in your environment. More details can be found within the script. You can also find snippets of what the truth.relevancy file will contain and of what the scorer expects from your submission.predictions file.


To complete your system submission, submit a single zip file containing a text file (submission.predictions) with your system’s predictions, together with your source code.
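One way to assemble such an archive is sketched below with Python's standard zipfile module. The function name, the default archive name and the flat layout (every file placed at the root of the zip) are assumptions for illustration; check the submission page for the layout that is actually required.

```python
import os
import zipfile


def build_submission_zip(predictions_path, source_paths, zip_path="submission.zip"):
    """Bundle the predictions file and source files into one zip archive.

    Assumption: all files sit at the root of the archive, and the
    predictions file is stored under the name submission.predictions.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(predictions_path, arcname="submission.predictions")
        for path in source_paths:
            # Keep only the file name, dropping any local directory prefix.
            archive.write(path, arcname=os.path.basename(path))
    return zip_path
```

Because only base names are kept, two source files with the same name in different directories would collide; preserve relative paths instead if your code is spread over several folders.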

The scorer takes as input a "GOLD_FILE" and a "PREDICTIONS_FILE". Both files should contain one prediction per line in the following format:

"Question_ID"     "Answer_ID"     "RANK"     "SCORE"     "LABEL"
where a tab character is used as the separator.

The file should be sorted by "Question_ID", then by "Answer_ID" (this is already the order in the provided XML files, so no additional sorting is needed). "RANK" is a positive integer reflecting the rank of the answer with respect to the question. The value of "RANK" is not actually used in scoring (any integer may be put there); it is included only to make the "GOLD_FILE" easier to read. "SCORE" is a real number reflecting the relevance of the answer with respect to the question; a higher value means higher relevance. In the "PREDICTIONS_FILE", this value determines the ranking of the answers (in descending order) for each question, and is therefore key to calculating MAP.
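The format above can be produced with a few lines of Python. The following is a sketch under stated assumptions: the function name, the example IDs and the label strings "true" and "false" are placeholders rather than values confirmed by the scorer, and rows are assumed to arrive already in the order of the XML files, so they are not re-sorted. "RANK" is derived from "SCORE" purely for readability, since the scorer ignores it.

```python
from collections import defaultdict


def format_prediction_lines(rows):
    """Turn (question_id, answer_id, score, label) rows into file lines.

    Assumption: rows are already in XML order, so their order is preserved.
    """
    # Collect the scores of each question to derive a readable RANK column.
    by_question = defaultdict(list)
    for question_id, answer_id, score, _label in rows:
        by_question[question_id].append((score, answer_id))

    rank_of = {}
    for question_id, scored in by_question.items():
        # Highest score first, so the best answer gets rank 1.
        ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
        for rank, (_score, answer_id) in enumerate(ordered, start=1):
            rank_of[(question_id, answer_id)] = rank

    lines = []
    for question_id, answer_id, score, label in rows:
        rank = rank_of[(question_id, answer_id)]
        fields = [question_id, answer_id, str(rank), repr(float(score)), label]
        lines.append("\t".join(fields))  # tab-separated, one prediction per line
    return lines


# Hypothetical IDs and labels, for illustration only.
example_rows = [
    ("Q1", "Q1_C1", 0.2, "false"),
    ("Q1", "Q1_C2", 0.9, "true"),
]
example_lines = format_prediction_lines(example_rows)
```

Writing the result is then a matter of joining the lines with newlines and saving them as submission.predictions.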

NOTE: CodaLab is an open-source framework for running competitions. Your system submissions will be ranked according to the accuracy of the system, but the ranking will be public, so it is very important that the username you choose for the submission does not disclose your identity. So that we can identify which student gets credit for which system submission, please state in your report and in your source code the name you used to identify your submission in CodaLab (user name).
