Deep Learning AMI with Source Code (CUDA 9, Ubuntu)
Amazon Web Services | 5.0 | Linux/Unix, Ubuntu 16.04 - 64-bit Amazon Machine Image (AMI)
No nvidia-smi, nvcc, or tensorflow
ubuntu@ip-172-31-33-144:/usr/local$ nvcc
The program 'nvcc' is currently not installed. You can install it by typing:
sudo apt install nvidia-cuda-toolkit
nvidia-smi: command not found.
tensorflow cannot be imported in Python.
How am I supposed to use this AMI?
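A quick way to narrow down a report like this is to check whether the tools are missing entirely or just not on PATH. The sketch below is a hypothetical helper, not something shipped with the AMI: the function name `check_gpu_stack` is made up, it assumes Python 3 (for `shutil.which`), and the extra search directory `/usr/local/cuda/bin` is only a conventional CUDA install location that may not apply to this image.

```python
import importlib.util
import os
import shutil


def check_gpu_stack(commands=("nvidia-smi", "nvcc"),
                    modules=("tensorflow",),
                    extra_dirs=("/usr/local/cuda/bin",)):
    """Report where each CLI tool and Python module is found, or None.

    extra_dirs is an ASSUMPTION: a conventional CUDA location that is
    searched in addition to PATH, in case nvcc is installed but not exported.
    """
    search_path = os.pathsep.join(
        [os.environ.get("PATH", "")] + list(extra_dirs)
    )
    report = {}
    for cmd in commands:
        # shutil.which returns the full path to the executable, or None.
        report[cmd] = shutil.which(cmd, path=search_path)
    for mod in modules:
        # find_spec returns None when a top-level module cannot be located.
        spec = importlib.util.find_spec(mod)
        report[mod] = spec.origin if spec is not None else None
    return report


if __name__ == "__main__":
    for name, location in check_gpu_stack().items():
        print("{:<12} {}".format(name, location or "NOT FOUND"))
```

If a tool shows up only under the extra directory, adding that directory to PATH may be enough. If a module is reported missing, it may be installed for a different Python interpreter than the one running the check.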
Fails to install my SSH key
Unable to connect to the instance. Definitely not a problem with my key. Have tried launching from the marketplace and also directly from the control panel.
Tensorflow Batchnorm Issue but otherwise good
This is a great instance for the CUDA versions and configuration, and once I fixed the issue below my training was very fast. HOWEVER, you should be very careful when using TensorFlow on this instance. It is a Frankenstein's Monster of bleeding-edge TensorFlow (1.4-rc0) plus some PRs that have not even been merged to master, included to take advantage of the Voltas and CUDA 9.
My issue was:
'AttributeError: can't set attribute' while using the BatchNormalization layer in TensorFlow. It relates to this PR (https://github.com/tensorflow/tensorflow/pull/13388), where a 'dtype' is added to BatchNorm to allow for FP16 and FP32 operations. The TensorFlow included in this AMI has an extra line in /usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/normalization.py on line 145, 'self.dtype = dtype', which causes the error above when using the normal BatchNorm API. Commenting this line out fixes the problem.
Weirdly, this assignment on line 145 is not included in the PR (although the dates and authors match), so I think there must have been a rebase or something. Regardless, the line exists in the TensorFlow in this AMI and will cause you pain on almost any neural network, because they almost all use BatchNorm. I couldn't figure out where I should post this because the code on GitHub does not have this problem.
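For anyone wondering why a single assignment produces this error: in Python, "can't set attribute" is what you get when assigning to a property that has no setter. The sketch below reproduces that mechanism with stand-in classes. It assumes `dtype` is exposed as a read-only property on the base layer class, which is a guess based on the error message. The classes `Layer`, `BrokenBatchNorm`, and `FixedBatchNorm` are illustrative only and are not TensorFlow's actual code.

```python
class Layer(object):
    """Stand-in base class (hypothetical, NOT TensorFlow's implementation)."""

    def __init__(self, dtype="float32"):
        self._dtype = dtype

    @property
    def dtype(self):
        # Read-only property: no setter is defined, so assignment fails.
        return self._dtype


class BrokenBatchNorm(Layer):
    """Mimics the stray assignment described in the review."""

    def __init__(self, dtype="float32"):
        super(BrokenBatchNorm, self).__init__(dtype=dtype)
        self.dtype = dtype  # raises AttributeError: the property has no setter


class FixedBatchNorm(Layer):
    """Same layer with the stray assignment removed (i.e. commented out)."""

    def __init__(self, dtype="float32"):
        super(FixedBatchNorm, self).__init__(dtype=dtype)


def try_build(cls):
    """Return the constructed layer, or the AttributeError it raised."""
    try:
        return cls(dtype="float16")
    except AttributeError as exc:
        return exc


if __name__ == "__main__":
    print(repr(try_build(BrokenBatchNorm)))
    print(try_build(FixedBatchNorm).dtype)
```

The base class already stores the value, so dropping the extra assignment loses nothing in this sketch, which would be consistent with commenting the line out being a safe workaround.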
Other than that, this is a fine AMI, and I'm grateful to AWS for providing it and for their continued advances in GPUs.
Incomplete
I tried to run testall, but it fails.
The Tensorflow test fails with keras not found.
I tried to upgrade Tensorflow and it broke the install: CPU mode only, no GPU.
This AMI needs work.
Tensorflow is 1.4.0-rc0.
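Before and after an upgrade like the one described above, it can help to record which TensorFlow version is installed and whether it still sees a GPU. The following is a minimal sketch, not an official diagnostic: `tensorflow_status` is a hypothetical helper name, and it relies on `tf.test.is_gpu_available()`, which exists in TensorFlow 1.x and is deprecated in 2.x.

```python
def tensorflow_status():
    """Return TensorFlow's install state, version, and GPU visibility."""
    status = {"installed": False, "version": None, "gpu_available": None}
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is missing or failed to load for this interpreter.
        return status
    status["installed"] = True
    status["version"] = tf.__version__
    # True only if TensorFlow can actually use a GPU device.
    status["gpu_available"] = bool(tf.test.is_gpu_available())
    return status


if __name__ == "__main__":
    for key, value in tensorflow_status().items():
        print("{}: {}".format(key, value))
```

If `gpu_available` flips from True to False after an upgrade, one common cause is that a CPU-only package replaced the GPU build, or that the new build expects a different CUDA version than the one on the instance.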