Based in Sydney, Australia, this is the blog of Associate Professor Amandeep Sidhu. It focuses on everything Artificial Intelligence, Big Data and Biomedical Informatics.

AI without Quality Data is useless

Algorithmic Accountability Act of 2019 - April 10, 2019

Algorithms are increasingly involved in the most important decisions affecting lives - whether or not someone can buy a home, get a job or even go to jail. But instead of eliminating bias, too often these algorithms depend on biased assumptions or data that can actually reinforce discrimination.

Algorithms control various aspects of a digital economy. They determine which candidates will be interviewed and how much they will be paid; who will be targeted for or excluded from advertisements; and how much consumers will pay for goods and shipping online. Still, the public is largely in the dark about how personal data is collected, manipulated, shared, and stored.

"Automated decisions are not neutral decisions. They turn on human data. As long as humans are biased, algorithms will be biased too. This bill will force companies to reckon with that reality." - Center on Privacy & Technology at Georgetown Law

Artificial Intelligence and Data

Even though AI technologies have existed for several decades, it is the explosion of data that has allowed them to advance at incredible speed. The billions of searches done every day on Google provide a sizable real-time data set from which Google learns our typos and search preferences. Siri and Cortana would have only a rudimentary understanding of our requests without the billions of hours of spoken word now digitally available that helped them learn our language.

Each year, the amount of data we produce doubles, and it is predicted that within the next decade there will be 150 billion networked sensors (more than 20 times the number of people on Earth). This data is instrumental in helping AI systems learn how humans think and feel; it accelerates their learning curve and allows for the automation of data analysis. The more information there is to process and the more data the system is given, the more it learns and, ultimately, the more accurate it becomes.

In the past, AI's growth was stunted by limited data sets (representative samples rather than real-time, real-life data) and by the inability to analyze massive amounts of data in seconds. Today, there is real-time, always-available access to the data and the tools that enable rapid analysis. This has propelled AI and machine learning forward and allowed the transition to a data-first approach.

Quality Data for AI systems

As someone who has worked with large data sets for nearly two decades, the obvious challenge in every single case I have encountered was to effectively query heterogeneous data sources and then extract and transform data. The non-obvious challenge was the early identification of data issues, which in most cases were unknown even to the data owners.

AI systems need to become aware of data quality: they must instantly identify potential issues and avoid exposing dirty, inaccurate or incomplete data. This implies that even if a sudden problematic situation results in poor-quality entries, the AI will be able to handle the quality issue and proactively notify the right users; depending on how critical the issues are, it might also refuse to serve the data, or serve it while flagging the potential problems.

AI systems of the future should be designed on the assumption that at some point there will be problematic data feeds and unexpected quality issues.
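The behaviour described above can be sketched as a simple record-level quality gate. This is a minimal illustration, not a reference to any particular system: the field names (`id`, `age`), the severity levels and the plausibility range are all hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass, field

# Hypothetical severity levels for detected data-quality issues.
CRITICAL = "critical"
WARNING = "warning"


@dataclass
class QualityReport:
    """Collects issues found while checking a single record."""
    issues: list = field(default_factory=list)

    def add(self, severity, message):
        self.issues.append((severity, message))

    @property
    def has_critical(self):
        return any(sev == CRITICAL for sev, _ in self.issues)


def check_record(record, required_fields):
    """Run completeness and plausibility checks on one record (a dict)."""
    report = QualityReport()
    # Completeness: every required field must be present and non-empty.
    for name in required_fields:
        value = record.get(name)
        if value is None or value == "":
            report.add(CRITICAL, f"missing required field: {name}")
    # Plausibility: an example domain rule on a hypothetical 'age' field.
    age = record.get("age")
    if age is not None and not (0 <= age <= 120):
        report.add(WARNING, f"implausible age: {age}")
    return report


def serve(record, required_fields):
    """Deny serving on critical issues; otherwise serve with any flags attached."""
    report = check_record(record, required_fields)
    if report.has_critical:
        return {"served": False, "issues": report.issues}
    return {"served": True, "data": record, "issues": report.issues}
```

A record missing a required field is rejected outright, while one with a merely implausible value is served but flagged, so downstream users are notified rather than silently handed dirty data.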

Conscious Choices about Ethics and Governance of AI

Managing 163ZB of Data in 2025