FAQ

Won’t data lakes make AI more trustworthy?

Defining data lakes

A data lake is simply a large repository of raw, unanalysed data that can receive contributions from many different sources, whether structured data from business processes or unstructured data from the Internet of Things (IoT) and social media.
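The definition above can be made concrete with a small sketch. It is only an illustration: the `lake` list, the `ingest` helper and the source names are hypothetical, and a real data lake would sit on distributed storage rather than an in-memory list. The point it tries to show is that data is stored exactly as received and is only given structure when somebody reads it (often called schema-on-read).

```python
import json

# Hypothetical in-memory "lake": every payload is stored as received,
# tagged only with its source. Nothing is parsed or validated on write.
lake = []


def ingest(source, raw):
    """Append a raw payload to the lake without interpreting it."""
    lake.append({"source": source, "raw": raw})


# Structured data from a business process (a CSV fragment)
ingest("sales_db", "order_id,amount\n1001,25.50")
# Semi-structured data from an IoT sensor (JSON text)
ingest("iot_sensor", '{"device": "t-17", "temp_c": 21.4}')
# Unstructured data from social media (free text)
ingest("social_media", "Loving the new store layout!")


def read_iot_temperatures():
    """Impose structure only at read time, and only on the records needed."""
    temps = []
    for record in lake:
        if record["source"] == "iot_sensor":
            temps.append(json.loads(record["raw"])["temp_c"])
    return temps


print(len(lake))                # 3
print(read_iot_temperatures())  # [21.4]
```

Because nothing is checked on the way in, the quality of whatever is later read out depends entirely on how well the contents are documented and overseen.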

How they are used

Users of the data can examine and sample whatever they want. Businesses that use data lakes have been reported to improve organic sales growth by 9% through better analytics on their data (Lock).

Data lakes are potentially larger and more varied, and could provide more comprehensive data sets for Machine Learning (ML). There is, however, no direct link between the size or variability of a data set and the so-called “trustworthiness” of AI.
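A small worked example may help show why size alone does not help. All of the numbers below are made up for illustration: a hypothetical population in which half the people have some attribute, a large sample drawn mostly from one group, and a much smaller sample that mirrors the population.

```python
# Hypothetical population of 1,000 people: 1 = has the attribute, 0 = does not.
population = [1] * 500 + [0] * 500

# A large sample that over-represents one group (400 records).
large_skewed = [1] * 360 + [0] * 40
# A small sample that mirrors the population (40 records).
small_balanced = [1] * 20 + [0] * 20


def rate(sample):
    """Share of records in the sample that have the attribute."""
    return sum(sample) / len(sample)


true_rate = rate(population)

print(true_rate)             # 0.5
print(rate(large_skewed))    # 0.9
print(rate(small_balanced))  # 0.5
```

The larger sample is ten times the size of the smaller one, yet it gives the more misleading picture. A model trained on it would inherit that skew, however much data the lake holds.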

[Figure: Machine Learning, Analytics, Real Time Data, Data Lake]

Trustworthy AI

The idea of trustworthiness comes from various sources, such as the EU High-Level Expert Group on AI. Such groups seek to recognise the problem of data bias in AI training sets and the need for the public to be able to trust that AI applications will be reliable and accurate.

Trust goes beyond the notion that more data means better AI.

Trust is aligned with confidence that an application will not harm aspects of human personhood such as freedom, privacy, justice and moral agency.

Data sets will always contain bias because humans are biased, and it will not be possible to remove that bias. Humans, however, are used to being able to appeal to each other or to a higher authority when bad decisions are made, even if that process is not perfect. Many AI applications do not provide such recourse. One of the key issues in the trustworthiness of AI is whether it is right to hand over moral agency to a machine, that is, to allow it to make decisions on our behalf.

Limitations

Data lakes are unlikely to be as secure as data warehouses, because data security in warehouses is more mature and usually centralised. Data lakes also have the disadvantage that, if there is no oversight of their contents, they can become data swamps, to the detriment of Machine Learning (ML).

Another challenge with data lakes is the extent to which they contain not only internal organisational data but also external “private” data collected from customers, staff or others without their knowledge. It is well known that companies such as Google and Facebook harvest masses of personal data for use in their ML systems for targeted advertising. Similarly, in the UK, the National Health Service required its users to opt out if they did not want their private data pooled and used in the way a data lake would allow. It is highly likely that this data will be used commercially in various ways, including for ML.

User control of data

There are suggestions that such data lakes could allow individual users to control their own data and who is allowed to access it.
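One way such control might look is sketched below. This is an assumption about how it could be built, not a description of any existing product: the `ConsentRegistry` class, the `read_records` function and the sample records are all hypothetical. Each record carries its owner, and a reader only sees records whose owner has granted them access.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentRegistry:
    """Hypothetical registry of which parties each data owner has allowed."""
    grants: dict = field(default_factory=dict)

    def grant(self, owner, accessor):
        self.grants.setdefault(owner, set()).add(accessor)

    def revoke(self, owner, accessor):
        self.grants.get(owner, set()).discard(accessor)

    def is_allowed(self, owner, accessor):
        return accessor in self.grants.get(owner, set())


def read_records(lake, registry, accessor):
    """Return only the records whose owner has granted access to `accessor`."""
    return [r for r in lake if registry.is_allowed(r["owner"], accessor)]


# Made-up lake contents, each record tagged with its owner.
lake = [
    {"owner": "alice", "data": "purchase history"},
    {"owner": "bob", "data": "location trace"},
]

registry = ConsentRegistry()
registry.grant("alice", "analytics_team")

visible = read_records(lake, registry, "analytics_team")
print([r["owner"] for r in visible])  # ['alice']

# The owner can withdraw access at any time.
registry.revoke("alice", "analytics_team")
print(read_records(lake, registry, "analytics_team"))  # []
```

A check like this only governs future reads. It does nothing about copies already taken, which is part of why control is hard to guarantee once data has been collected.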

When a person’s data is collected, knowingly or unknowingly, it becomes public, even when anonymised. Data sources and personal attributes can often be reconstructed from anonymised data sets, for example by cross-referencing them with other available data.
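The sketch below illustrates how such reconstruction can work in principle. It is a toy example: the records, postcodes and names are invented, and the `reidentify` function is hypothetical. Names have been stripped from the "anonymised" set, but the remaining fields (postcode, birth year, sex) can be matched against another data set that does carry names.

```python
# "Anonymised" release: names removed, other attributes kept (made-up data).
anonymised = [
    {"postcode": "AB1 2CD", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "AB1 2CD", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "EF3 4GH", "birth_year": 1975, "sex": "F", "diagnosis": "migraine"},
]

# A separate data set that includes names (also made up).
public = [
    {"name": "Person A", "postcode": "AB1 2CD", "birth_year": 1975, "sex": "F"},
    {"name": "Person B", "postcode": "EF3 4GH", "birth_year": 1975, "sex": "F"},
]

# Fields that are not names but can still single a person out in combination.
QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")


def key(record):
    """Combine the quasi-identifier fields into one lookup key."""
    return tuple(record[f] for f in QUASI_IDENTIFIERS)


def reidentify(anonymised, public):
    """Link named people to anonymised rows that match them uniquely."""
    index = {}
    for record in anonymised:
        index.setdefault(key(record), []).append(record)

    matches = {}
    for person in public:
        candidates = index.get(key(person), [])
        if len(candidates) == 1:  # only one row fits: the person is exposed
            matches[person["name"]] = candidates[0]["diagnosis"]
    return matches


print(reidentify(anonymised, public))
# {'Person A': 'asthma', 'Person B': 'migraine'}
```

The more sources a data lake pools, the more fields there are to cross-reference, so removing names alone offers limited protection.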

References

Michael Lock, Angling for Insights in Today’s Data Lake, Aberdeen.

[https://s3-ap-southeast-1.amazonaws.com/mktg-apac/Big+Data+Refresh+Q4+Campaign/Aberdeen+Research+-+Angling+for+Insights+in+Today’s+Data+Lake.pdf]
