Hello. I was evaluating my Task 4 models both locally on my machine and on the leaderboard. I observed that the evaluation set of 1200 students (on CodaLab) / 900 (locally) has high variance due to the choice of seed used for splitting. The agreement on the best model across different evaluation sets (i.e., different split seeds) is very low. For example, if you (the organizers) ran three additional seeds, my last three models would likely perform very differently and their ranking would change by a wide margin.
This is not an issue in itself, but I feel it makes evaluation on the final set somewhat random (roughly a 60% chance that the best model on one split is also the best on another split). Do you think evaluating on multiple split seeds (more than 20) would make the evaluation more robust?
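To make the concern concrete, here is a minimal, self-contained sketch (illustrative only, not the competition's evaluation code) that simulates how often the best model on one split is also the best on another when the models' true accuracies are close. All numbers below (the 0.700/0.697/0.695 accuracies, 1200 students, 20 seeds) are hypothetical assumptions:

```python
import itertools
import numpy as np

true_acc = {"model_a": 0.700, "model_b": 0.697, "model_c": 0.695}  # hypothetical, closely matched models
n_students = 1200   # students in one evaluation split
n_seeds = 20        # number of split seeds to simulate

def observed_best(seed):
    """Return the name of the best model on one simulated evaluation split."""
    rng = np.random.default_rng(seed)
    # Finite-sample (binomial) noise from evaluating on n_students students.
    observed = {name: rng.binomial(n_students, acc) / n_students
                for name, acc in true_acc.items()}
    return max(observed, key=observed.get)

best_per_seed = [observed_best(seed) for seed in range(n_seeds)]
agreement = np.mean([a == b
                     for a, b in itertools.combinations(best_per_seed, 2)])
print("best model per seed:", best_per_seed)
print(f"pairwise agreement on the best model: {agreement:.2f}")
```

With accuracy gaps of a few tenths of a percent and only ~1200 students per split, the sampling noise is of the same order as the gaps, so the identity of the "best" model changes frequently from seed to seed.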
Posted by: arighosh @ Oct. 5, 2020, 9:16 p.m.

Thanks for pointing it out. We have modified the evaluation program a bit in an attempt to make the evaluation more robust. See the message we just sent to all participants. Let me know if there are any issues. Thanks!
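For context, here is a rough sketch of the kind of change that makes such an evaluation more robust, assuming the updated program averages each submission's metric over several differently seeded splits (consistent with the ten reruns mentioned in the next post). The names `robust_score`, `score_fn`, `dummy_metric`, and the 1200-student split size are hypothetical placeholders, not the actual evaluation code:

```python
import numpy as np

def robust_score(score_fn, data, labels, seeds, split_size=1200):
    """Average a metric over several differently seeded evaluation splits."""
    per_seed = []
    for seed in seeds:
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(labels), size=split_size, replace=False)  # one split
        per_seed.append(score_fn(data[idx], labels[idx]))
    return float(np.mean(per_seed)), float(np.std(per_seed))

# Toy usage with random data and a dummy metric standing in for the real one.
def dummy_metric(X, y):
    return float(np.mean(y))  # placeholder for the task's actual metric

data = np.random.rand(5000, 8)
labels = np.random.randint(0, 2, size=5000)
mean_score, std_score = robust_score(dummy_metric, data, labels, seeds=range(10))
print(f"mean over 10 splits: {mean_score:.4f} (std {std_score:.4f})")
```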
Posted by: moonlightlane @ Oct. 15, 2020, 3:16 a.m.

Hello, I see that each of my submissions has been rerun on 10 different datasets. Thanks for your effort. I'm concerned about two of my submissions, submit_003.zip and submit_021.zip. Each of them was rerun twice at around 10/15/2020 03:01, as shown on the private submission page. Is there something wrong with these submissions, and how can I check that they were evaluated successfully? Thank you very much.
Posted by: TAL_ML_Group @ Oct. 16, 2020, 2:49 a.m.

Hi! There is nothing wrong; I probably just clicked rerun twice by mistake. I can confirm that those two runs were successful. Thanks!
Posted by: moonlightlane @ Oct. 16, 2020, 3:35 p.m.

Oh, I got it. Thank you very much!
Posted by: TAL_ML_Group @ Oct. 17, 2020, 2:46 a.m.