Acceleration Plan in practice: “By using simulation data you test the ‘fairness’ of machine learning”.


Machine learning is increasingly being used to make predictions. But is this done in an ethical way? MSc student Michelle Otten used the simulation dataset to test this: “There are many risks of unconscious discrimination”.

For her thesis, Michelle Otten investigated existing models within machine learning in combination with ethics. “I focused specifically on predicting student success. For example, educational institutions can use machine learning to improve the quality of their teaching and provide timely assistance or extra challenges where needed. Using a simulation dataset, I wanted to test whether established models work ethically and fairly to ensure equal opportunities.”

Name: Michelle Otten (25)
Position: MSc Student Business Analytics & Management at the Erasmus University Rotterdam (EUR)
Favourite YouTube channel: ‘Impact Theory’, in which Tom Bilyeu interviews many experts on psychology, health, relationships and achieving your ambitions
If I were Minister of Education, Culture and Science (OCW) tomorrow, the first thing I would do would be: to reverse the abolition of study financing, so that everyone has the opportunity to continue studying without worry.

Machine learning

Michelle explains: “Machine learning is a form of artificial intelligence. It finds patterns in existing data and reproduces them to predict future data. This gives reasonably reliable and strong predictions, but a model cannot simply be used in every situation. Because the algorithms are complicated, errors can creep in unintentionally.”
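One common way to make "a group is disadvantaged" measurable is to compare how often a model gives a positive prediction to each group. The sketch below is an illustration only and not necessarily the method used in the thesis; the function names, the group labels and the toy data are all hypothetical.

```python
from collections import defaultdict


def selection_rates(predictions, groups):
    """Return the share of positive predictions (1) for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += pred
    return {group: positives[group] / totals[group] for group in totals}


def parity_gap(predictions, groups):
    """Difference between the highest and lowest selection rate."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: 1 means the model predicts study success.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(selection_rates(predictions, groups))
print(parity_gap(predictions, groups))
```

A gap of zero would mean both groups receive positive predictions at the same rate. Whether a non-zero gap is acceptable depends on the purpose of the model, which is a judgement the metric cannot make.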

"Sometimes specific population groups are disadvantaged by unintentional errors."

Sensitive variables

Equal opportunities are determined by the standards in the Discrimination Act. “The challenge mainly lies in the conscious or unconscious use of variables that can potentially lead to discrimination. In the Dutch AVG legislation [the GDPR], these are called ‘sensitive variables’, such as gender, political affiliation or socio-economic background. Research also takes ‘proxy variables’ into account. These are variables that correlate strongly with variables that can lead to discrimination. So the variables appear neutral, but in fact they are not.”
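The idea of a proxy variable can be sketched as a simple screening step: compute the correlation between each candidate feature and a sensitive variable, and flag the features where it is high. This is a minimal illustration, not the procedure from the thesis; the feature names, the toy values and the 0.7 threshold are assumptions chosen for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def flag_proxies(features, sensitive, threshold=0.7):
    """Names of features whose absolute correlation with the
    sensitive variable is at least `threshold` (an assumed cut-off)."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, sensitive)) >= threshold
    ]


# Hypothetical data: 1 marks a student from a lower socio-economic background.
sensitive = [0, 0, 0, 1, 1, 1]
features = {
    "postcode_income_index": [5, 6, 5, 2, 1, 2],  # looks neutral, tracks the group
    "exam_grade": [6, 7, 8, 6, 7, 8],             # unrelated to the group here
}

print(flag_proxies(features, sensitive))
```

Linear correlation only catches simple relationships. A feature that passes this check can still act as a proxy in combination with other features.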

"In order to assess whether a model leads to discrimination in practice, you have to look very specifically at the purpose."

Michelle explains: “It is not that you cannot use these sensitive variables to make predictions, but you have to be careful in which situation you use them. This remains a subjective consideration. The most important thing is to look at the purpose for which you are using the variables.”

Simulated data

In order to test existing models for ethical risks, Michelle created two use cases: “I created two models that use (exam) grades to shape admission criteria on the one hand and to identify which students might be underperforming on the other.” However, the current AVG legislation makes it difficult for educational institutions to release or share the required data with external parties such as Michelle.

"Thanks to the simulation dataset, I suddenly had access to data on over 20,000 simulated students".

“On the advice of my thesis supervisor, I therefore used the simulation dataset. This was created by Erasmus University, VU Amsterdam and the Safe and reliable use of education data zone, and has around 2.2 million observations of over 20,000 simulated students. This means that a fictitious dataset has been created on the basis of real student data, in which the same correlations apply. In this way, the data cannot be traced back to existing students and my results can also be made public.”
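The article does not describe how the simulation dataset was generated. Purely to illustrate what "the same correlations apply" can mean, the sketch below draws fictitious pairs of standardised values that reproduce a chosen correlation without copying any real record. The target correlation of 0.6, the sample size and the function names are assumptions for the example.

```python
import math
import random


def simulate_pairs(rho, n, seed=0):
    """Draw n pairs of standard-normal values with correlation close to rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rng.gauss(0, 1)
        # Mixing z1 into the second value induces the desired correlation.
        pairs.append((z1, rho * z1 + math.sqrt(1 - rho ** 2) * z2))
    return pairs


def sample_correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


# Assumed target: a correlation of 0.6 measured in (hypothetical) real data.
synthetic = simulate_pairs(rho=0.6, n=20000)
print(round(sample_correlation(synthetic), 2))
```

A real simulation dataset has many variables of mixed types, so it needs a far richer generator than this two-variable case.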

Ethical implications

Michelle is the first external researcher to use the simulation dataset. She was in contact with Dominique van Deursen from the EUR for this: “Dominique helped me understand the dataset, use it properly and describe it in my research. I discovered a number of small errors that are being addressed; for example, multiple genders were assigned to a single student number. My recommendation is to replicate my research with real data to validate the results.”
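An inconsistency like a student number with more than one gender can be found with a small consistency check. The sketch below is an assumption about how such a check might look; the column names `student_number` and `gender` and the sample records are hypothetical and not taken from the actual dataset.

```python
from collections import defaultdict


def conflicting_values(records, key="student_number", field="gender"):
    """Map each key to its distinct field values, keeping only keys
    that have more than one distinct value."""
    seen = defaultdict(set)
    for record in records:
        seen[record[key]].add(record[field])
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}


# Hypothetical observations: several rows per simulated student.
records = [
    {"student_number": "S001", "gender": "F"},
    {"student_number": "S001", "gender": "F"},
    {"student_number": "S002", "gender": "M"},
    {"student_number": "S002", "gender": "F"},
    {"student_number": "S003", "gender": "M"},
]

print(conflicting_values(records))
```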

One outcome of the use case on admission criteria is that Michelle chose to leave the variable indicating whether a student's parents attended university out of the predictive model: “By looking at the parents you increase the risk of socio-economic inequality. But you can use this information to see which students need extra help, for example because they did not learn at home how to study or how to plan their studies. This emphasises the need for a conscious, ethical consideration of the use of variables for different purposes.”

Also use the simulation dataset?

Are you doing research and do you want to know whether you can use the simulation dataset for it? Or are you curious about the rationale behind the dataset? Please read on for more information. Please note: the simulation dataset is only available in Dutch.

Written by: Bianca Oppelaar
Image: Hunter Harritt via Unsplash
