Glossary of Terms

Core Terms
Synthetic Data
Data generated artificially rather than collected from real-world events, often used to preserve privacy or augment datasets.
Synthetic data is valuable for testing, research, and machine learning model training. It lets organizations preserve data privacy while obtaining results comparable to those achieved with real data.
Generative Adversarial Networks (GANs)
A machine learning framework that uses two neural networks (generator and discriminator) to produce data similar to a given dataset.
GANs consist of a generator model, which creates synthetic data, and a discriminator model, which evaluates its realism. The two models are trained adversarially: the generator tries to fool the discriminator while the discriminator learns to tell real samples from synthetic ones, so each improves as the other does.
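A minimal sketch of that adversarial loop, assuming PyTorch; the one-dimensional toy data, network sizes, and hyperparameters are illustrative only:

```python
import torch
import torch.nn as nn

# Toy "real" data: samples from a 1-D Gaussian. Purely illustrative.
def real_batch(n):
    return torch.randn(n, 1) * 1.25 + 4.0

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))  # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # discriminator (logits)

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    # Discriminator step: push real samples toward label 1, fakes toward 0.
    real = real_batch(64)
    fake = G(torch.randn(64, 8)).detach()     # detach: don't update G here
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: make the discriminator label fresh fakes as real.
    fake = G(torch.randn(64, 8))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

synthetic = G(torch.randn(1000, 8)).detach()  # draw 1000 synthetic samples
```

Even at this scale the key mechanics are visible: the discriminator is updated to separate real from fake, then the generator is updated to defeat the current discriminator.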
CTGAN
A GAN-based model (Conditional Tabular GAN) designed specifically for generating tabular data, with built-in handling of categorical variables and imbalanced column distributions.
CTGAN is widely used to produce high-quality synthetic versions of the structured, tabular datasets that dominate everyday data science work.
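A minimal usage sketch, assuming the open-source `ctgan` package; the file name, column names, and epoch count below are placeholders:

```python
import pandas as pd
from ctgan import CTGAN

real = pd.read_csv("customers.csv")              # placeholder dataset
discrete_cols = ["gender", "region", "churned"]  # placeholder categorical columns

model = CTGAN(epochs=300)                        # epochs chosen arbitrarily here
model.fit(real, discrete_columns=discrete_cols)  # learn the joint distribution

synthetic = model.sample(len(real))              # DataFrame with the same schema
```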
Target Column
The column or feature in a dataset that a model aims to predict or analyze.
Selecting a target column helps the model understand which variable to focus on for predictive tasks or data transformation processes.
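With pandas, for example, the target column is simply split away from the feature columns (the dataset and column name here are hypothetical):

```python
import pandas as pd

df = pd.read_csv("patients.csv")       # hypothetical dataset
X = df.drop(columns=["readmitted"])    # features
y = df["readmitted"]                   # target column the model will predict
```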
Sensitive Columns
Features in a dataset that contain sensitive information, which may need special handling for privacy reasons.
Sensitive columns may contain personally identifiable information (PII) and are often masked or modified to protect individual privacy.
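A sketch of two common handling strategies, dropping direct identifiers and pseudonymizing the rest; all column names are placeholders:

```python
import hashlib
import pandas as pd

df = pd.read_csv("patients.csv")       # placeholder dataset

# Direct identifiers are usually dropped outright.
df = df.drop(columns=["name", "ssn"])

# Quasi-identifiers can be replaced with a one-way hash when a
# stable pseudonym is still needed for joins or deduplication.
df["email"] = df["email"].map(
    lambda v: hashlib.sha256(str(v).encode()).hexdigest()[:12]
)
```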
Quality and Evaluation Terms
Utility
The usefulness of synthetic data in preserving the insights and patterns of the original dataset for practical purposes, such as training machine learning models.
Utility is often evaluated through downstream tasks, such as the performance of models trained on synthetic data compared to those trained on real data.
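A common downstream check is "train on synthetic, test on real" (TSTR). The sketch below assumes scikit-learn and that `real_train`, `real_test`, and `synthetic` are pandas DataFrames you already have, sharing a `"label"` target column:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def tstr_score(train_df, test_df, target):
    """Train on one dataset, score on a held-out slice of real data."""
    model = RandomForestClassifier(random_state=0)
    model.fit(train_df.drop(columns=[target]), train_df[target])
    preds = model.predict(test_df.drop(columns=[target]))
    return accuracy_score(test_df[target], preds)

baseline = tstr_score(real_train, real_test, "label")    # real -> real
tstr = tstr_score(synthetic, real_test, "label")         # synthetic -> real
print(f"real->real: {baseline:.3f}   synthetic->real: {tstr:.3f}")
```

The smaller the gap between the two scores, the more utility the synthetic data has retained.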
Privacy
The degree to which synthetic data protects the identities and sensitive information of individuals, typically provided and quantified through techniques such as differential privacy.
Privacy-preserving methods aim to ensure that synthetic records cannot be traced back to any individual in the original dataset, safeguarding sensitive information.
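Differential privacy itself is broader than synthetic data generation, but its core idea fits in a few lines: add calibrated noise so that no single individual's presence changes the released output much. A sketch of the classic Laplace mechanism, not tied to any particular synthesizer:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release a query answer with epsilon-differential privacy by adding
    Laplace noise scaled to sensitivity / epsilon."""
    if rng is None:
        rng = np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a counting query (sensitivity 1) released at epsilon = 0.5.
private_count = laplace_mechanism(true_value=1234, sensitivity=1.0, epsilon=0.5)
```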
Performance
The effectiveness of synthetic data in maintaining the performance of machine learning models or analytics tasks compared to using real data.
Performance is typically measured by training models on synthetic data and comparing their accuracy, F1 score, or other metrics against the same models trained on real data, as in the train-on-synthetic sketch under Utility above.
Fidelity
The extent to which synthetic data accurately represents the statistical properties and structure of the original dataset.
High fidelity ensures that synthetic data resembles real data closely enough to be useful for similar analysis and modeling tasks.
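One simple fidelity check compares each column's marginal distribution with a two-sample Kolmogorov-Smirnov test. A sketch assuming SciPy and two pandas DataFrames, `real` and `synthetic`, with matching schemas:

```python
from scipy.stats import ks_2samp

for col in real.select_dtypes("number").columns:
    stat, p = ks_2samp(real[col], synthetic[col])
    print(f"{col}: KS statistic = {stat:.3f}")   # smaller = closer marginals
```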
Diversity
The variety and range of values generated in synthetic data, ensuring coverage across different categories and ranges found in the original data.
Ensuring diversity in synthetic data prevents over-representing specific values or patterns, promoting balanced and unbiased analysis.
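A quick coverage check, assuming the same placeholder `real` and `synthetic` DataFrames and hypothetical categorical column names:

```python
for col in ["gender", "region"]:
    missing = set(real[col].unique()) - set(synthetic[col].unique())
    print(f"{col}: {len(missing)} real categories never appear in the synthetic data")
```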
Disclosure Risk
The potential for synthetic data to reveal or infer sensitive information about individuals in the original dataset.
Disclosure risk assessment helps evaluate and reduce the chance that synthetic data may inadvertently reveal private information.
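One common assessment is the distance to closest record (DCR): for each synthetic row, how far away is the nearest real row? A sketch assuming scikit-learn and numeric, pre-scaled feature matrices (`real_numeric` and `synthetic_numeric` are placeholders):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=1).fit(real_numeric)
dist, _ = nn.kneighbors(synthetic_numeric)   # distance to nearest real row

# Very small distances flag synthetic rows that sit on top of real records.
print("min DCR:", dist.min(), "  5th percentile:", np.percentile(dist, 5))
```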
Similarity Metrics
Measures used to evaluate how closely synthetic data matches the original data, such as Kullback-Leibler divergence and mean absolute error.
Similarity metrics help validate the quality of synthetic data by quantifying differences or similarities to the original dataset.
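A sketch of both metrics named above, assuming NumPy/SciPy, a pair of numeric arrays `real_col` and `synth_col` for one column, and the placeholder `real`/`synthetic` DataFrames:

```python
import numpy as np
from scipy.stats import entropy

# KL divergence between the two empirical distributions of one column,
# estimated on shared histogram bins.
bins = np.histogram_bin_edges(np.concatenate([real_col, synth_col]), bins=20)
p, _ = np.histogram(real_col, bins=bins, density=True)
q, _ = np.histogram(synth_col, bins=bins, density=True)
kl = entropy(p + 1e-10, q + 1e-10)   # D_KL(real || synthetic); epsilon avoids zeros

# Mean absolute error between per-column means across the two datasets.
mae = float(np.mean(np.abs(real.mean(numeric_only=True)
                           - synthetic.mean(numeric_only=True))))
```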
Attribute Disclosure
The risk that specific characteristics or attributes of individuals in the original data could be revealed through synthetic data.
Techniques like attribute masking or generalization can help prevent attribute disclosure in synthetic data.
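A sketch of generalization with pandas; the column names and bin edges are illustrative:

```python
import pandas as pd

df = pd.read_csv("patients.csv")                  # placeholder dataset

# Replace precise values with coarser categories.
df["age_band"] = pd.cut(df["age"], bins=[0, 18, 35, 50, 65, 120],
                        labels=["<18", "18-34", "35-49", "50-64", "65+"])
df["zip3"] = df["zip_code"].astype(str).str[:3]   # truncate ZIP to 3 digits
df = df.drop(columns=["age", "zip_code"])         # keep only generalized versions
```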
Statistical Integrity
The accuracy of statistical properties in synthetic data, such as mean, variance, and correlation, compared to the original data.
Maintaining statistical integrity ensures that synthetic data retains essential patterns and relationships needed for accurate analysis.
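A side-by-side comparison of basic statistics, assuming the placeholder `real` and `synthetic` DataFrames used above (the `numeric_only` flag assumes a reasonably recent pandas):

```python
import pandas as pd

summary = pd.DataFrame({
    "real_mean":  real.mean(numeric_only=True),
    "synth_mean": synthetic.mean(numeric_only=True),
    "real_std":   real.std(numeric_only=True),
    "synth_std":  synthetic.std(numeric_only=True),
})
print(summary)

# Correlation structure: the two matrices should be close element-wise.
gap = (real.corr(numeric_only=True) - synthetic.corr(numeric_only=True)).abs()
print("largest correlation difference:", gap.max().max())
```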
Overfitting in Synthetic Data
When synthetic data replicates the training data too closely, weakening privacy protection and generalizing poorly to new data.
Overfitting synthetic data can compromise privacy by inadvertently copying individual data points from the original dataset.
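A quick red flag for memorization is an exact-match count between the two tables; a sketch assuming the placeholder `real` and `synthetic` DataFrames share an identical schema:

```python
import pandas as pd

# Inner-merge on all shared columns finds synthetic rows that are
# verbatim copies of real records.
copies = pd.merge(real.drop_duplicates(), synthetic.drop_duplicates(), how="inner")
print(f"{len(copies)} synthetic rows exactly match a real record")
```

Exact matches are only the crudest signal; near-duplicates (see Disclosure Risk above) leak privacy too.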