Paris-Saclay University uses AWS to advance data science through collaborative challenges

This is a guest post by Maria Teleńczuk, research engineer at the Paris-Saclay Center for Data Science (CDS), and Alexandre Gramfort, senior research scientist at INRIA, the French National Institute for Research in Digital Science and Technology. They explain how they adapted their open source data challenge platform RAMP to train the models submitted by the participants using Amazon Elastic Compute Cloud (Amazon EC2) Spot instances, and how they leveraged AWS to support three student challenges.

RAMP: How it all began

It took years of research and data collection to confirm that the Higgs boson exists. The story made headlines when it was eventually proven, but researchers wanted to dive deeper and understand more. That is why physicists and computer scientists at the European Organization for Nuclear Research (CERN), Paris-Saclay Center for Data Science (CDS), and other institutions organized a data challenge hosted on Kaggle, a popular online platform for data science competitions.

The winner of the challenge was invited to Paris-Saclay University to explain and share their solution, which would benefit physicists. The solution code was written in Lisp with some custom compilation tricks, which made it hard for researchers to recompile the code on their own systems. The real difficulty turned out to be reusing the Lisp code, not understanding the algorithms conceptually.

That is how an idea emerged: creating an open-source prototyping system, available to anyone to accelerate the dissemination of machine learning (ML) in experimental science. Participants would submit their code and not just the results. Each solution would be evaluated on a remote server or in the cloud, enabling data scientists to focus on the prediction task. Solutions would then be ranked in a public leaderboard and, importantly, after the challenge all submissions would be made public to maximize the spreading of good ideas. This is how the RAMP platform (Rapid Analytics and Model Prototyping) was created and made available via open source. The CDS team also launched RAMP.studio to put RAMP in action and host their own challenges.

Six years later, more than 20 different scientific challenges have been set up by the Center for Data Science at Paris-Saclay University and successfully run with RAMP by researchers in physics, pharmacology, neuroscience, and earth sciences. These challenges were proposed publicly, sometimes attracting international data scientists, and most of them were used in graduate training programs within the university. RAMP is still actively developed by the Center for Data Science.

The 2020 datacamp explained

Last year, the CDS team organized a datacamp with three different challenges: localizing the sources of MEG signals produced by neuronal electrical activity, detecting human gait, and classifying variable stars. These three topics were proposed by different research units within the university.

Source localization of MEG signal: Our brains generate electromagnetic signals that originate from underlying neuronal activity. Those signals can be recorded noninvasively by means of magnetoencephalography (MEG). However, the exact location of the sources of those signals is difficult to discern. Source localization is a long-standing challenge, and researchers still debate competing hypotheses. To approach this problem from a different angle, CDS engineers simulated realistic MEG signals using the MNE Python library. By doing so, they were able to control the number and location of active sources. Given this simulated MEG signal for a few different subjects, students were asked to predict where the neural sources are located. If successful, their algorithms could be applied to real MEG signals and lead to a better understanding of where recorded brain signals originate during different cognitive tasks and, hence, to a better understanding of the human brain.

Detection of human gait: Everyone has specific characteristics in how they walk, and the difference is even more significant between patients suffering from Parkinson's disease and healthy people. Understanding and detecting those differences could potentially lead to early diagnosis of some diseases. Using data from a publication, the students were given accelerometer and gyrometer data collected from the feet of a mixed population while walking. The task was to detect the start and end time of footsteps, which can then be used for quantitative diagnosis.

Classification of variable stars: Variable stars are those whose light is not emitted steadily. The reason might be that two stars are rotating around each other. If the stars are beside each other, the strength of their light adds up; if the smaller star is in the back, the emitted light is lower because it comes only from the larger star; if the smaller star is in the front, the strength of the light adds up again, but not as strongly as in the first case because some of the light of the larger star is hidden. Knowing the origin of this variability is crucial when surveying the sky. Given the data on the variable stars, so-called “light curves,” the students were asked to provide an automatic classification algorithm.

Students in the master's program in data science at the Institut Polytechnique de Paris (IP Paris) each selected a challenge and were provided with detailed explanations, input data, and sample solutions, as most of them were new to the problem. Students were allowed to use any freely available resources, all the knowledge they had gathered so far, and any other tools they wished. They submitted their solutions to RAMP, which retrained them and applied them to hidden private datasets. The score of the current best-performing algorithm was revealed on the leaderboard, motivating teams to submit better solutions.

However, with ever larger datasets and more sophisticated ML algorithms comes the need for more compute power. With more than 250 participants submitting multiple solutions each day, and each solution requiring hours of computation, sometimes on GPU resources, the infrastructure behind RAMP.studio had to scale on demand to keep the training time reasonable for students.

Bringing RAMP to the next scale using AWS

Using AWS and the AWS Cloud Credit for Research program, the CDS team leveraged the scalability and the elasticity of the cloud to add GPU computing resources on demand. They adapted the RAMP platform to launch Amazon EC2 instances on AWS when students submit new solutions to evaluate.

The CDS team deployed a pipeline that automatically creates, on request, a custom instance image (called an Amazon Machine Image, or AMI) that includes the necessary libraries, the RAMP code, the starting kit for the challenge, and the private dataset. This makes it possible to quickly update the AMI if students request additional Python libraries. Then, when participants submit a solution, the RAMP platform launches a new Amazon EC2 instance, loads the solution onto the instance, and starts the training remotely. Once the training has completed, the results are copied over to the RAMP platform, the leaderboard is updated, and the EC2 instance is terminated.
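The per-submission lifecycle described above (launch, load, train, copy results, update the leaderboard, terminate) can be illustrated with a minimal Python sketch. This is not the RAMP code: the Worker interface, the evaluate_submission function, and the FakeWorker stand-in are hypothetical names introduced here for illustration, and no real AWS call is made. The one design point it shows is that termination sits in a finally clause, so an instance is not left running if training fails.

```python
from typing import Callable, Dict, List, Protocol


class Worker(Protocol):
    """Hypothetical interface to one remote training instance."""

    def upload(self, submission: str) -> None: ...
    def train(self) -> None: ...
    def fetch_results(self) -> Dict[str, float]: ...
    def terminate(self) -> None: ...


def evaluate_submission(
    launch: Callable[[], Worker],
    submission: str,
    leaderboard: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Run one submission on a fresh worker and record its results."""
    worker = launch()  # assumed to start an instance from the prepared image
    try:
        worker.upload(submission)         # load the solution onto the instance
        worker.train()                    # start the training remotely
        results = worker.fetch_results()  # copy the results back
        leaderboard[submission] = results
        return results
    finally:
        worker.terminate()  # always release the instance, even on failure


class FakeWorker:
    """In-memory stand-in so the sketch runs without any cloud account."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def upload(self, submission: str) -> None:
        self.calls.append("upload")

    def train(self) -> None:
        self.calls.append("train")

    def fetch_results(self) -> Dict[str, float]:
        self.calls.append("fetch_results")
        return {"score": 0.5}  # made-up value for the demonstration

    def terminate(self) -> None:
        self.calls.append("terminate")


fake = FakeWorker()
board: Dict[str, Dict[str, float]] = {}
evaluate_submission(lambda: fake, "team_a/submission_1", board)
```

In a real deployment the launch callable would wrap whatever mechanism starts the EC2 instance from the AMI; here it simply returns the fake worker.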

The RAMP platform launches Amazon EC2 Spot instances by preference, and falls back to On-Demand instances if no Spot capacity is available, or if a running Spot instance is interrupted and cannot be relaunched within a set period of time after the submission.
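As a rough illustration of this Spot-first policy, the following Python sketch decides which kind of instance to use at each launch or relaunch attempt. It is a simplification under stated assumptions, not the platform's actual logic: SpotUnavailable, launch_for_submission, and the spot_window_s parameter are hypothetical names, and the two launcher callables stand in for real EC2 requests.

```python
from typing import Callable, Tuple


class SpotUnavailable(Exception):
    """Hypothetical error a launcher raises when no Spot capacity is available."""


def launch_for_submission(
    launch_spot: Callable[[], str],
    launch_on_demand: Callable[[], str],
    seconds_since_submission: float,
    spot_window_s: float,
) -> Tuple[str, str]:
    """Return (instance_id, market) for one launch or relaunch attempt.

    Spot is tried only while the submission is still inside the allowed
    window; otherwise, or if Spot has no capacity, On-Demand is used.
    """
    if seconds_since_submission <= spot_window_s:
        try:
            return launch_spot(), "spot"
        except SpotUnavailable:
            pass  # no Spot capacity: fall through to On-Demand
    return launch_on_demand(), "on-demand"


def no_spot_capacity() -> str:
    """Stand-in launcher that always reports missing Spot capacity."""
    raise SpotUnavailable()


# Placeholder instance ids, for the demonstration only.
first_try = launch_for_submission(lambda: "i-spot", lambda: "i-od", 10, 600)
no_capacity = launch_for_submission(no_spot_capacity, lambda: "i-od", 10, 600)
too_late = launch_for_submission(lambda: "i-spot", lambda: "i-od", 700, 600)
```

Calling the same function again after an interruption, with the updated time since submission, covers the relaunch case: once the window has passed, the submission goes straight to an On-Demand instance.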

Datacamp metrics

The platform was ready to scale, and a large group of students was let into RAMP.studio to explore, code, invent, and enjoy from November to December 2020. In all, 272 participants submitted 1480 solutions for the variable star challenge, 1083 for the MEG signal challenge, and 567 for the human gait challenge.

Models submitted for the MEG signal and human gait challenges took an average of 1.2 hours to train, and together required more than 3250 hours of g4dn.xlarge instance running time. Models for the variable star challenge took an average of only 5 minutes to train and did not require GPU resources.

The following diagram shows how the score of the best-performing model evolved over time as the deadline for completion approached. The students tried to get the lowest possible score for the MEG signal challenge, with 0 being the absolute best, and a score as close to 1 as possible for the human gait and variable star challenges. Now the scientists who proposed these challenges will explore the creative solutions proposed by students, which will hopefully contribute to future publications and research directions.

What’s next and learn more

After this successful datacamp edition, the CDS team is in the process of designing new challenges and, with the support of AWS, plans to enrich the experience of yet another group of RAMP.studio participants in applying ever more creative solutions.

The RAMP platform and its integration with AWS are also available to any school or institution that wants to organize its own data science challenges and provide a delightful experience for participants. For more information, contact the CDS team at admin@ramp.studio.

If you are a researcher and the cloud can help you accelerate innovation, apply for the AWS Cloud Credit for Research program.
