Data handling, machine learning and ethical aspects

27.2. Data handling, machine learning and ethical aspects

The data collection (or selection) process in a machine learning pipeline has important ethical aspects. Undetected biases in the data set will propagate via the learning algorithm to the final model predictions. In physics, the model might be rather well understood, and we might have good control over the data generation process, but we should still develop a sound ethical attitude to the use, processing and analysis of data.
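The propagation of an undetected bias can be illustrated with a deliberately simple sketch. Here the "model" is nothing more than a sample mean, and the bias is a hypothetical detector threshold that silently discards small values. The function name `selection_bias_demo`, the threshold and the Gaussian population are all assumptions made for illustration; they do not come from any specific experiment.

```python
import random
import statistics


def selection_bias_demo(seed=0, n=50_000, threshold=0.5):
    """Compare an estimate from the full population with one from a biased sample.

    The population is drawn from a standard normal distribution (true mean 0).
    The hypothetical 'detector' only records values above `threshold`.
    """
    rng = random.Random(seed)
    population = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Selection step: values at or below the threshold are never recorded.
    recorded = [value for value in population if value > threshold]

    # The "model" is simply the mean of whatever data it is given.
    full_estimate = statistics.fmean(population)
    biased_estimate = statistics.fmean(recorded)
    return full_estimate, biased_estimate


full_estimate, biased_estimate = selection_bias_demo()
print(f"estimate from full population: {full_estimate:.3f}")
print(f"estimate from recorded data:   {biased_estimate:.3f}")
```

The estimator itself is identical in both cases; only the selection step differs. The second estimate lies above the threshold by construction, even though the true mean is zero, and nothing in the fitted number reveals that the data were filtered.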

Another pressing ethical aspect concerns our approach to the scientific process. In particular, it is of utmost importance that scientific results are reproducible. In fact, reproducibility should be imprinted as part of the dialectics of science. Nowadays, with version control software like Git and online repositories like GitHub, GitLab, etc., we can make our codes and data sets openly and easily accessible to a wide community. These services help, almost automagically, to make our science reproducible. The large open-source development communities involved in Scikit-Learn, TensorFlow, Keras, etc., are all excellent examples of this. The codes can be tested and improved continuously, thereby helping our scientific community at large in developing data analysis and machine learning tools. It is much easier today to gain traction and acceptance for making your science reproducible. From a societal standpoint, this is an important element, since many of the developers are employees of large public institutions like universities and research labs.
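Beyond sharing code and data, a small habit that supports reproducibility is to fix random seeds and to store a record of what was run alongside the result. The following is a minimal sketch using only the Python standard library; the function names, the synthetic straight-line data and the fields of the record are illustrative assumptions, not a prescribed workflow.

```python
import hashlib
import json
import platform
import random


def make_data(seed, n=200):
    """Generate synthetic data y = 2x + noise from a fixed seed."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares fit of a straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def run_experiment(seed):
    """Run the fit and return a record that lets others check the run."""
    xs, ys = make_data(seed)
    slope, intercept = fit_line(xs, ys)
    # Checksum of the exact data used, so a rerun can be compared byte for byte.
    payload = json.dumps([xs, ys]).encode("utf-8")
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "data_sha256": hashlib.sha256(payload).hexdigest(),
        "slope": slope,
        "intercept": intercept,
    }


record_a = run_experiment(seed=2024)
record_b = run_experiment(seed=2024)
print(json.dumps(record_a, indent=2))
print("identical reruns:", record_a == record_b)
```

Committing such a record to the same Git repository as the code gives a reader what is needed to rerun the analysis on the same interpreter and compare both the data checksum and the fitted parameters.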

Let us also add a disclaimer concerning the fantastic progress of machine learning technology. Even though we may dream of computers developing some kind of higher learning capabilities, in the end it is we (yes, you reading these lines) who end up constructing and instructing, via various algorithms, the machine learning approaches.

For self-driving vehicles, where the standard machine learning algorithms discussed here enter into the software, there are stages where the human programmer must make choices. As an example, all carmakers have the safety of the driver and the accompanying passengers as their highest priority. Consider the case where the programmer has to construct an if statement that decides, in an accident scenario, between crashing into a truck and steering into a group of bicyclists.

This raises serious ethical questions. Who is entitled to make such choices? Keep in mind that many of the algorithms you will encounter in this series of lectures, or that you will hear about later, are indeed based on simple programming instructions. And you are very likely to be one of the people who end up writing such code. Thus, developing a sound ethical attitude is much needed. The example of self-driving cars is just one of countless cases where we have to make choices. Other domains where applications might have serious ethical aspects include the financial sector, law, and medicine.

We do not have the answers here, nor will we venture into a deeper discussion of these aspects, but you should think over these topics in a more overarching way. A statistical data analysis, with its dry numbers and graphs meant to guide the eye, does not necessarily reflect the truth, whatever that is. As a scientist, and after a university education, you are supposedly a highly qualified citizen, with an improved critical view and understanding of the scientific method, and perhaps some deeper understanding of the ethics of science at large. Use these insights. You owe it to our society.