I tried to use the official Docker image 'scottyak00/codalab-00:gpu-worker-tf-2.2'. It downloaded successfully but fails to run. Could you help me with this, please?
The command I used and the error message were:
(base) aaa@desktop:~$ sudo docker run scottyak00/codalab-00:gpu-worker-tf-2.2
/usr/local/lib/python3.6/dist-packages/celery/platforms.py:812: RuntimeWarning: You are running the worker with superuser privileges, which is
absolutely not recommended!
Please specify a different user using the -u option.
User information: uid=0 euid=0 gid=0 egid=0
uid=uid, euid=euid, gid=gid, egid=egid,
[2020-07-17 17:01:56,961: WARNING/MainProcess] /usr/local/lib/python3.6/dist-packages/billiard/__init__.py:321: RuntimeWarning: force_execv is not supported as the billiard C extension is not installed
[2020-07-17 17:01:56,972: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
Trying again in 2.00 seconds...
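For what it's worth, the last error means the Celery worker inside the container cannot reach an AMQP broker ([Errno 111] Connection refused; RabbitMQ listens on port 5672 by default). A quick stdlib-only probe, with a hypothetical helper name, to check whether anything is listening on that port:

```python
import socket

def broker_listening(host: str = "127.0.0.1", port: int = 5672,
                     timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    5672 is the default AMQP port used by RabbitMQ; Celery's
    "[Errno 111] Connection refused" means nothing is listening there.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("broker reachable:", broker_listening())
```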
Hi! We are investigating this right now. In the meantime, are you able to run the code locally?
You can also use the Colab environment (https://colab.research.google.com/drive/1u4NQCUhubWsK9GcrwLVa4bw3O_f0su83?usp=sharing) we prepared to test it out (since the hosted runtime should have the updated dependencies).
The sample code runs locally (I checked this with TensorFlow 2.0 as written in the README.md, but I don't know why the details page says TF 2.2 -- https://competitions.codalab.org/competitions/25301#learn_the_details-details ).
The Colab also works well. I just want to check whether the package 'jax' is already installed (in the Colab environment, 'import jax' works fine).
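As a side note, package availability can be probed without crashing the run; a small sketch using only the standard library (`has_package` is a hypothetical helper, not part of the competition code):

```python
import importlib.util

def has_package(name: str) -> bool:
    """Return True if `name` is importable in the current environment."""
    return importlib.util.find_spec(name) is not None

# e.g. run this inside the Colab runtime or the Docker image
print({pkg: has_package(pkg) for pkg in ("jax", "tensorflow", "scipy")})
```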
The TensorFlow version on the server is actually 2.2. The README is a bit outdated. Sorry for the confusion.
The Colab has the same version of TensorFlow as the Docker image, but I don't believe the Docker image has JAX installed.
If you want to use JAX, we can certainly install it. Is there some JAX functionality that you would like to use?
Oh, JAX is not that critical. I can definitely find the TF2 counterparts of the functions I need. I was just curious. Posted by: zhanyu @ July 17, 2020, 6:11 p.m.
Another question is about the public_data.
I uploaded the three baselines -- distance, jacobian, sharpness -- to the website. The order on the website (online evaluation) is jacobian > distance > sharpness, while in my local environment with the public_data the order is sharpness > distance > jacobian (completely different from the online result).
Could you please explain this? Given this discrepancy, is the local public_data a good proxy for the online task?
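To make the discrepancy concrete, here is a minimal sketch comparing the rankings induced by two score tables; the measure names come from the post, but the numeric scores are invented purely for illustration:

```python
def ranking(scores):
    """Measure names sorted from highest to lowest score."""
    return [name for name, _ in sorted(scores.items(), key=lambda kv: -kv[1])]

# Hypothetical scores -- only the resulting ORDER mirrors the post.
local  = {"sharpness": 0.9, "distance": 0.6, "jacobian": 0.3}
online = {"jacobian": 0.8, "distance": 0.5, "sharpness": 0.2}

print("local :", ranking(local))   # sharpness > distance > jacobian
print("online:", ranking(online))  # jacobian > distance > sharpness
print("same order:", ranking(local) == ranking(online))  # False
```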
In order to avoid overfitting, the private data is intentionally created to be different from the public data. Although we cannot provide the details of the private data, it uses a different image dataset and contains different hyperparameter categories. We understand that this is a very difficult task, but we believe that a good measure should be able to capture generalization of all these data. We will be closely monitoring the competition and listening to the participants’ feedback. Posted by: ydjiang @ July 17, 2020, 8:22 p.m.
Can we use SciPy in the submission code? Posted by: zhanyu @ July 17, 2020, 11:30 p.m.
We can add SciPy. Unfortunately, I don't actually have push access to the Docker image we use, so I need to wait for my colleague to do it.
I believe scikit-learn is in there, and since it depends on SciPy, SciPy may already be included.
I just submitted a job, and it seems the online environment does support SciPy. Thank you!
Could you tell me the GPU memory of the environment? The homepage says '1 Nvidia K80 GPU' (https://competitions.codalab.org/competitions/25301#learn_the_details-details), which should be 24 GB, right? But according to the stderr file I have, the available GPU memory is only 12 GB.
If you click on the ingestion error log, you will see that the program does have access to an Nvidia K80.
But, I am not sure why the memory is different from the specs...
I think Google Cloud by default gives you half of the board: the Tesla K80 is a dual-GPU card with two GPUs, each of which has 12 GB of memory.
Is 12 GB too small for your workload? I usually find decreasing the minibatch size to be quite effective.
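One way to automate that is to back off the batch size whenever training hits an out-of-memory error; a minimal framework-agnostic sketch, where `train_step` and `oom_error` are placeholders (with TensorFlow you would pass `tf.errors.ResourceExhaustedError` as `oom_error`):

```python
def fit_with_backoff(train_step, batch_size, min_batch=1, oom_error=MemoryError):
    """Halve the batch size until train_step stops running out of memory.

    `train_step` and `oom_error` are placeholders: plug in your real
    training call and your framework's OOM exception type.
    """
    while batch_size >= min_batch:
        try:
            train_step(batch_size)
            return batch_size
        except oom_error:
            batch_size //= 2
    raise RuntimeError("even the minimum batch size ran out of memory")

# Toy stand-in: pretend anything above 32 samples exceeds the 12 GB card.
def fake_step(bs):
    if bs > 32:
        raise MemoryError

print(fit_with_backoff(fake_step, 256))  # 32
```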
No problem. I was just curious about the memory since everything worked fine on my local V100. I can definitely adjust my batch size.