
This AMI Breaks Horovod under tensorflow_p36 env

  • By Marc
  • on 09/24/2019

From what I can see, this AMI SHOULD NOT be used for Horovod: there are multiple versions of OpenMPI installed, and that really breaks everything.

I'm not sure I see why there are packages installed both globally and under the conda environments; it causes big problems.

I'm happy to be proved wrong, but after 12 hours of messing about with it I've decided to build my own image, which I shall share!


  • By aws-nskul
  • on 09/30/2019

Thank you for the feedback. Currently there are two locations where mpirun (via OpenMPI) is available: 1. /usr/bin and 2. /home/ubuntu/anaconda3/envs/<env>/bin, where <env> is the environment corresponding to a framework such as TensorFlow, MXNet or Chainer.

The mpirun binary outside the conda environments (in /usr/bin/) exists to maintain compatibility with the CNTK framework, which expects an older version (v1.10) at that location. The newer OpenMPI versions are available in the conda environments.

We recommend using the absolute path of the mpirun binary to run MPI workloads, or using the --prefix flag: https://www.open-mpi.org/faq/?category=running#mpirun-prefix. For example, for the TensorFlow Python 3.6 environment, use /home/ubuntu/anaconda3/envs/tensorflow_p36/bin/mpirun. We will update our user guide to make this clearer.
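
To illustrate the absolute-path recommendation, here is a minimal Python sketch that builds an mpirun command line from the conda environment's own binaries instead of whatever happens to be first on PATH. The helper names (env_mpirun, build_command), the script name train.py and the process count of 4 are hypothetical and only for illustration; the environment path is the one quoted above. It is a sketch under those assumptions, not an official launcher.

```python
import posixpath
import shutil

# Root of the conda environments, as given in the reply above.
CONDA_ENVS = "/home/ubuntu/anaconda3/envs"


def env_mpirun(env_name, envs_root=CONDA_ENVS):
    """Return the absolute path of mpirun inside a conda environment."""
    return posixpath.join(envs_root, env_name, "bin", "mpirun")


def build_command(env_name, num_procs, script, envs_root=CONDA_ENVS):
    """Build an mpirun command that uses only the environment's binaries.

    Both mpirun and python are taken from <envs_root>/<env_name>/bin so the
    older system-wide OpenMPI in /usr/bin is never picked up by accident.
    """
    python = posixpath.join(envs_root, env_name, "bin", "python")
    return [env_mpirun(env_name, envs_root), "-np", str(num_procs), python, script]


if __name__ == "__main__":
    # Show which mpirun a bare "mpirun" would resolve to (may be /usr/bin/mpirun).
    print("mpirun found on PATH:", shutil.which("mpirun"))
    # train.py is a hypothetical training script name.
    print(" ".join(build_command("tensorflow_p36", 4, "train.py")))
```

Comparing the first printed line with the path in the second is a quick way to check whether a bare mpirun call would have picked up the older system-wide OpenMPI.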