
There might be something fundamentally wrong with the data used to train the AI systems we are increasingly relying on, scientists have warned.
Much of the data used to train those systems may include images that were taken without consent, are not diverse and do not protect privacy, the researchers say. As well as being potentially unethical in itself, that could have consequences for the people who end up relying on AI systems trained on such data.
The warning comes after researchers released the Fair Human-Centric Image Benchmark, which was developed by Sony AI and aims to offer an ethically sourced set of training images. Researchers can then use that data to evaluate whether existing systems have been trained on similarly ethically collected datasets.
The research focuses on computer vision systems, which power everything from facial recognition technology to driverless cars. Those systems must be trained on real images, including images of people and their faces, so that they can recognise them reliably.
The new database looks to set best practices for training data to ensure it respects consent, diversity, and privacy. It includes 10,318 images of 1,981 people from 81 distinct countries or regions, for instance.
Because the data is labelled with demographic and physical attributes – including age, pronoun category, ancestry, and hair and skin colour – and the people pictured were able to give informed consent, it can be used to check existing systems and understand whether they also respect those ethical standards, the researchers behind it suggest.
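To illustrate the kind of check such a labelled benchmark makes possible, the sketch below computes a model's accuracy separately for each demographic group. This is only a minimal, assumed example: the article does not describe FHIBE's actual schema or tooling, so the record layout and field names (such as "ancestry" and "pronoun_category") are hypothetical.

```python
# Illustrative sketch only: the data layout and attribute names are assumptions,
# not FHIBE's real API. It shows disaggregated evaluation in general terms.
from collections import defaultdict

def accuracy_by_group(records, group_key):
    """Compute per-group accuracy for a labelled, consent-based benchmark.

    Each record is assumed to hold the model's prediction, the ground-truth
    label, and the demographic attributes attached to the image.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = r["attributes"][group_key]  # e.g. "ancestry" or "pronoun_category"
        total[group] += 1
        if r["prediction"] == r["ground_truth"]:
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}

# Example: compare how a hypothetical face-detection model performs across groups.
records = [
    {"prediction": "face", "ground_truth": "face",
     "attributes": {"ancestry": "East Asian", "pronoun_category": "she/her"}},
    {"prediction": "no_face", "ground_truth": "face",
     "attributes": {"ancestry": "African", "pronoun_category": "he/him"}},
]
print(accuracy_by_group(records, "ancestry"))
```

Large gaps between groups in a table like this are one signal that a system was trained on data that was not sufficiently diverse.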
In a paper describing the findings, the researchers note that creating the dataset was more expensive and complicated than assembling other, similar image collections. But using such ethically collected data could also help build trust in AI, they hope.
The work is reported in a new paper, ‘Fair human-centric image dataset for ethical AI benchmarking’, published in the journal Nature.