Understanding the Data

Throughout the life cycle of AI systems, understanding the input and training data is essential. The quality, structure, and distribution of the data directly shape the potential performance and reliability of an AI system. As the saying goes, “Garbage in, garbage out”.

Data understanding is especially valuable at the beginning of the AI life cycle, when data is collected and prepared for training; this early stage is commonly called exploratory data analysis. It remains important during the operation of an AI system, where it helps detect data drift: a change in the distribution of the input data relative to the training data, which can cause severe performance issues. In other words, understanding data is an ongoing process that helps both detect biases and ensure that the AI model remains valid under changing conditions.
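
To make drift detection concrete, the following minimal sketch compares a reference sample (for example, a feature from the training data) with a sample observed in operation, using the two-sample Kolmogorov-Smirnov statistic. The function names `ks_statistic` and `drift_detected`, the significance level of 0.05, and the synthetic Gaussian data are assumptions made for illustration. The critical value uses the common large-sample approximation, so it is only a rough guide for small samples.

```python
import math
import random


def ks_statistic(reference, current):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(ref[i], cur[j])
        # Step past every value equal to x in both samples (handles ties).
        while i < n and ref[i] == x:
            i += 1
        while j < m and cur[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def drift_detected(reference, current, alpha=0.05):
    """Flag drift when the KS statistic exceeds the approximate critical value."""
    n, m = len(reference), len(current)
    d = ks_statistic(reference, current)
    # Large-sample approximation of the critical value (an assumption here).
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return d > critical, d, critical


# Illustrative synthetic data: the operational sample has a shifted mean.
rng = random.Random(0)
training_feature = [rng.gauss(0.0, 1.0) for _ in range(500)]
live_feature = [rng.gauss(1.0, 1.0) for _ in range(500)]

flag, d, critical = drift_detected(training_feature, live_feature)
print(f"drift={flag}, D={d:.3f}, critical={critical:.3f}")
```

In practice such a check would be run per feature and on a schedule, and a flagged feature is a prompt for investigation rather than proof that the model has degraded.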

Methods and Key Concepts for Data Understanding

  • Data visualization techniques (scatter plots, histograms, heatmaps)
  • Clustering and dimensionality reduction (e.g., PCA, t-SNE, UMAP)
  • Outlier and anomaly detection methods
  • Bias and fairness checks (e.g., summary statistics for demographic parity)
  • Methods for data drift monitoring (e.g., statistical hypothesis tests such as the Kolmogorov-Smirnov test)
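
As an illustration of the bias and fairness checks listed above, the sketch below computes a simple summary statistic for demographic parity: the rate of positive outcomes per group and the largest gap between groups. The function names, the group labels "a" and "b", and the toy data are hypothetical. Which gap counts as acceptable depends on the application and is not defined here.

```python
from collections import defaultdict


def positive_rate_by_group(groups, outcomes):
    """Share of positive outcomes (value 1) within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for group, outcome in zip(groups, outcomes):
        counts[group][0] += int(outcome == 1)
        counts[group][1] += 1
    return {g: pos / total for g, (pos, total) in counts.items()}


def demographic_parity_difference(groups, outcomes):
    """Gap between the highest and lowest group-wise positive rate."""
    rates = positive_rate_by_group(groups, outcomes)
    return max(rates.values()) - min(rates.values())


# Toy data: group membership and a binary label or model decision.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
outcomes = [1, 1, 1, 0, 1, 0, 0, 0]

rates = positive_rate_by_group(groups, outcomes)
gap = demographic_parity_difference(groups, outcomes)
print(rates, gap)
```

A gap of zero means all groups receive positive outcomes at the same rate. The same statistic can be applied to training labels during exploratory analysis and to model decisions during operation.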

Further Reading