Prof. Dirk Valkenborg is the programme director of the Data Science track in the Master of Statistics & Data Science at UHasselt. He teaches a course on Machine Learning in this programme, and during this course, his students use the supercomputing infrastructure of the Flemish Supercomputer Center (VSC). We sat down with Prof. Valkenborg to ask him why he is so keen on getting his students enthusiastic about working with a supercomputer.
Why did you introduce supercomputing to your students?
In the Master of Statistics & Data Science, particularly within the Data Science track, students must handle large datasets. So, we needed a solution to provide our students with more computational power. We also wanted to teach them some basic skills in supercomputing as preparation for their later professional life. Several options were available for consideration.
One possibility was extending our local infrastructure and granting students access to these facilities. But this extension would be very costly and would require hardware updates and continuous maintenance, so it was not feasible. Alternatively, we could use a platform like Google Colab or Kaggle, but these also have limitations. When you need to train on a larger dataset, you get thrown out of the service because the job takes too much time.
We contacted the VSC to check whether there was any possibility of giving our students access to the VSC computing facilities, and this arrangement was feasible. Of course, the entry threshold here is higher. The other two options had a more accessible entry: run a Jupyter notebook on a GPU via Colab or Kaggle. With local infrastructure, we would give the students a bit more computing power. Still, their mode of operation would remain the same as when using their regular laptop. Students would not think about parallelisation, making data available, job submission, etc. Using VSC infrastructure forces them to do so.
Also, local infrastructure does not scale, whereas the VSC infrastructure does.
You decided to introduce the students to supercomputing in the first semester of the second year of the Master's. Why?
In our master's path, the focus on the master thesis project is only in the last three months of the academic year. Getting students on board with supercomputing during that short period is challenging. Therefore, giving them a taste of the VSC capacity earlier in their educational journey would be much more convenient.
I teach a Machine Learning course in the second year of the Master of Statistics & Data Science. In this course, a relatively large dataset needs to be analysed, and this can be done perfectly well on supercomputing infrastructure. So, we added this as an extra competence for this course.
Prof. Bex and I organise two sessions in the Machine Learning course to gently introduce students to remote computing. One of the final objectives of the Machine Learning course is that their final model runs on a supercomputer.
What is the added value for the students?
Companies like VITO or Janssen Pharmaceutica have their own supercomputing infrastructure with a slightly different interface, but it is all terminal-based. Once students get on board with the philosophy, they can work on other infrastructures too.
So, it gives them additional skills in their first steps in the job market.
Or if they want to do a Ph.D., they are one step ahead. Many of our Ph.D. students get the same training as the one we provide in our Machine Learning course.
If we ask Ph.D. students who have worked on their Ph.D. for several years to work with a supercomputer, they are reluctant to make that switch. They think: 'I will just leave my laptop on for a few nights'. You can compare it with learning a new language: the sooner you start, the better. The earlier students learn their way around a supercomputer, the better.
Which students are currently involved in this?
We currently focus on students of the Master of Statistics & Data Science. Still, these are elective courses and can also be taken by others (students entering from biology, biostatistics or bioinformatics). This is the student's choice, but we are considering how to make this compulsory for all our students because supercomputing is also relevant in statistics, bioinformatics, epidemiology, etc.
How did you convince the students to step in? How did you lower the threshold for them?
I first explained the project's added value and told them it was necessary. It is a competence evaluated through a pass/fail system. A student who cannot run the script on the supercomputer fails.
I test two competencies in this way. First, are students technically proficient enough to submit a job on a supercomputer? Second, are they able to write a script? I still notice that, when working with scripting languages (especially R), students often have a text file full of R code, which they run block by block. They select a few lines of code, run that code, scroll to another piece in their document, and then select the next code they wish to run. This is incompatible with the philosophy of submitting a job to a supercomputer. There, students have to program in a logical sequence, so that the script runs from start to finish.
Did you learn any lessons from this experience? How would you address them?
Because I am the only professor requiring this, it’s a bit of a forced exercise: you see students checking this off and struggling for a while but quickly falling back into old habits. To better consolidate this, I recommend that every professor work with the pass/fail method if possible.
I also noticed that students don’t start working on the supercomputer immediately after the introduction. Because the assignment is to run the final model on the supercomputer, which happens sometime in December, they postponed it until then. So, during that period, we received many questions about how to work with it, because the introduction was too long ago. Luckily, the introduction session was recorded, so they could look at it again.
For the future, we are thinking about ways to activate students to work with it immediately and consolidate their knowledge. We could explain how to do it and let them practice through exercises so they do it repeatedly. After three or four times, they will master the skill.
Why would you convince other professors to use supercomputing?
For me, it was a need that became more pressing in the master thesis. If students want to work with data and start running computations on it, they have to switch to a larger system. Introducing them to supercomputing early in their career is an excellent way to prepare them for this.
In the course where I apply it, it is not strictly necessary, but this way, they can already learn those skills on a small case and see the added value of such a system.
The more we require students to use this, the more natural this becomes, and that’s a plus.
Thank you, Prof. Valkenborg, for this inspirational talk. We wish you good luck with the course!
EuroCC Belgium encourages the use of supercomputing in education and wants to help others teach how to use a supercomputer effectively. Therefore, we provide a professional teaching kit, including a comprehensive slide deck. More material (videos etc.) will follow soon.
Check out our teaching kit via this link. Unlock the potential of others with our training tools!