Since the human evaluation score was introduced to provide a more reliable evaluation for Subtask C, future work will not be able to use it to compare against the competition's systems.
We therefore propose reporting an additional automatic metric, MoverScore, in the task description paper:
MoverScore is easy to compute, and it would be a suitable evaluation metric if it aligns well with the human evaluation scores. Reporting it would also strengthen the case in the literature for text generation metrics that are more reliable than BLEU. As the Subtask C results show, team rankings changed significantly when the human evaluation score was used instead of the BLEU score.