Our team is having trouble reloading checkpoints and evaluating fully trained models. At the end of training, Ray saves a model folder containing files that we use to load the trained model (same as in the provided baseline code). However, once loaded, the trained model performs substantially worse than it did during the last training steps. We also tried to resume training from previous checkpoints, but the problem remains and the model appears to start training almost from scratch. We are not sure whether the problem occurs at load or at save time.
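Roughly, the pattern we follow looks like the sketch below; the trainer class, env name and config here are only placeholders and not our exact setup, the important part is that we rebuild the trainer and then call restore() on the saved checkpoint:

```python
# Minimal sketch of the save/restore cycle we use (placeholder trainer/env/config).
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()
config = {"num_workers": 2}  # same model/env config as used during training
trainer = PPOTrainer(config=config, env="CartPole-v0")

for _ in range(10):
    trainer.train()

# save() writes the checkpoint files under the results directory and returns the path
checkpoint_path = trainer.save()
print("saved to", checkpoint_path)

# Later, in a fresh process: rebuild the trainer with the same config, then restore.
eval_trainer = PPOTrainer(config=config, env="CartPole-v0")
eval_trainer.restore(checkpoint_path)
```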
Have you come across similar issues? Do you have any suggestions on how we could solve this problem? We are unable to submit our best models due to these saving/loading issues.
Thank you in advance,
Team162 - drive++
Hi,
Thanks for your message. I've not run into this issue myself, but I haven't tested this extensively either. To be clear: when reloading the model, is it throwing an exception and then continuing training from scratch (which RLlib will do depending on the 'max_failures' setting), or is there no error message at all? To check, go to the results folder you've specified (probably ~/ray_results), open it, and look for any error.txt files. If there are any, they should give you a clue about the problem, e.g. an incompatibility between the action/observation spaces of your saved model and your current version of the code.
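Something like the snippet below will list any error.txt files for you; ~/ray_results is just the default location, so point it at whatever local_dir you configured:

```python
# Scan the results directory for RLlib/Tune failure logs and print their contents.
import glob
import os

results_dir = os.path.expanduser("~/ray_results")  # adjust if you set a custom local_dir
for err_file in glob.glob(os.path.join(results_dir, "**", "error.txt"), recursive=True):
    print("====", err_file)
    with open(err_file) as f:
        print(f.read())
```

Setting 'max_failures' to 0 in your run configuration will also make a failing trial stop immediately instead of being silently restarted, which makes this kind of problem easier to spot.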
Posted by: HuaweiUK @ Dec. 4, 2019, 11:06 a.m.