Please note that we deliberately deactivated the leaderboard for the test phase in order to avoid optimization on the test set. This also applies to the rank: the rank shown in the leaderboard is based on the score for correct submissions, i.e. 0 or 1. That means every correct submission shares the same rank, so CodaLab's ordering among them is effectively random.
If you want to compare your implementation with those of other participants, we refer you to the 'Evaluation Validation Set' phase. This phase is designed to give you feedback on the validation set, i.e. your submission will be evaluated against the 'blurbs_dev_label.txt' file of the public dataset. For a fair comparison with the baseline system and other participants' systems, you should train *only* on 'blurbs_train.txt' and submit labels for 'blurbs_dev_nolabel.txt'. To submit to the 'Evaluation Validation Set' phase, go to 'Submit / View Results' and select that phase. You can then go to the results section and select this phase to view your standing compared to others. Note that the baseline system was submitted by 'Raly'. Please also note that the rank is based on the average of the scores for subtask A and subtask B.
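As a minimal illustration of that ranking rule, the leaderboard position is derived from the plain average of the two subtask scores. The numbers below are made up; the actual per-subtask metric is defined by the task:

```python
# Hypothetical subtask scores (placeholders, not real leaderboard values).
score_subtask_a = 0.8
score_subtask_b = 0.7

# The rank in the 'Evaluation Validation Set' phase is based on this average.
ranking_score = (score_subtask_a + score_subtask_b) / 2
print(ranking_score)  # 0.75
```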
Hope this helps, best,
GermEval #1 Organizers
I get different results when running the evaluation script manually compared to the CodaLab upload results (Evaluation Validation Set phase). For the manual run of evaluate.py, I use the gold.txt file that came with it, and I place the same file I upload to CodaLab in the input directory next to gold.txt.
The only difference is that the uploaded file is zipped, but that shouldn't change anything.
Has the evaluation script or gold file changed (compared to the one used by CodaLab)?

Posted by: polylemma @ July 26, 2019, 11:25 a.m.
Please note that the gold.txt file in the 'public_dat/evaluation/input_dev' folder is NOT the development set labels. This file is only an example and contains only a fraction of the data. The development set labels are in 'public_dat/blurbs_dev_label.txt'. If you replace 'public_dat/evaluation/input_dev/gold.txt' with 'public_dat/blurbs_dev_label.txt' and run the evaluation script, you will get the same results as in the uploaded version.

Posted by: remstef @ July 26, 2019, 11:52 a.m.
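The file swap described above can be sketched as a short shell session. The directory layout below is recreated with placeholder contents purely for illustration; in practice the files come with the public dataset, and the exact invocation of evaluate.py may differ:

```shell
# Recreate the thread's layout with placeholder files (illustration only).
mkdir -p public_dat/evaluation/input_dev
printf 'real dev labels\n' > public_dat/blurbs_dev_label.txt
printf 'example-only gold\n' > public_dat/evaluation/input_dev/gold.txt

# Replace the example gold.txt with the real development labels,
# so a local run of evaluate.py matches the CodaLab result.
cp public_dat/blurbs_dev_label.txt public_dat/evaluation/input_dev/gold.txt
cat public_dat/evaluation/input_dev/gold.txt
```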
Thank you very much. Makes total sense, and yes, the results are the same now :)

Posted by: polylemma @ July 26, 2019, 11:58 a.m.
Glad to hear that, and sorry for the confusion.
Best of luck for the final phase,