AWS Deep Learning AMI (Ubuntu 16.04)
Amazon Web Services | 48Linux/Unix, Ubuntu 16.04 - 64-bit Amazon Machine Image (AMI)
pytorch_36 breaks
Also unlike the description, the pytorch versioning running there was 1.0.0 not 1.4
Updating the torch version lead to some technical issues. Terrible.
- Leave a Comment |
- 1 comment |
- Mark review as helpful
This AMI Breaks Horovod under tensorflow_p36 env
So this AMI SHOULD NOT be used for Horovod from what I can see - there are multiple versions of openMPI installed and it really breaks everything.
I'm not sure I see why there packages installed globally and also under conda environments, it causes big problems.
I'm happy to be proved wrong but after 12 hours of messing about with it I've decided to build my own image which I shall share!
tensorflow_p36 doesn't work with GPU
You pretty much need to set it up everything yourself. Here is what you get out-of-the-box:
(tensorflow_p36) ubuntu@ip-172-31-6-11:~$ source activate tensorflow_p36
(tensorflow_p36) ubuntu@ip-172-31-6-11:~$ python
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2019-05-01 23:37:47.936444: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA
Aborted (core dumped)
Doesn't work
I've used previous versions <=v20.0 quite often and they worked okayish. In v20 the unattended upgrade process hangs itself and doesn't terminate even if you wait a long time. So you actually have to kill a bunch of processes before you can install additional packages. But now v22.0 doesn't even start due to a missing snapshot.
Tensorflow was not using GPU
I used this AMI for 40 Hours.
and I found out why my models were very slowly trained :
tensorflow/core/common_runtime/gpu/gpu_device.cc:1 406] Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:0 0:03.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum r equired Cuda capability is 3.5.
env
conda/3python 3.6/tensorflow
I used a GX2 instance for nothing.
Thanks a lot for the time lost
Not great
This AMI did not work as I expected. I found it easier to install Tensorflow for Python from binaries myself, or using the official Docker Tensorflow image. For context I run Tensorflow for Python on MacOS and Ubuntu. It sounds like the AMI works as expected for people using Conda?