Synthetic Data Generation For Semi-Supervised Image Classification AI Models
AI & Data
Semester programme: Master of Applied IT
Research group: Sustainable Data & AI Application
Project group members: Viktor Pavlov
Boyan Lazarov
Project description
This project investigates how synthetic image generation can be used to improve image-based classification systems in data-scarce and biased environments across multiple domains. The main challenge is creating synthetic images that are realistic, diverse, and useful for training, while avoiding overfitting, domain shift, and high operational cost. A diffusion-based generation pipeline is explored to produce class-conditioned images, which are then combined with real data to train and evaluate deep learning models. The work focuses on validating whether synthetic data can (1) increase performance and robustness, (2) support underrepresented classes/subgroups, and (3) enable a more scalable and cost-efficient workflow compared to relying solely on manual data collection and labeling.
Context
Many real-world computer vision applications depend on large labeled datasets, but in practice, these datasets are often limited, imbalanced, expensive to annotate, or difficult to share. This is especially true in domains where data collection is restricted by privacy, expert availability, rare classes, or seasonal/geographic constraints.
This project is positioned in the broader domain of AI-assisted image understanding, spanning both healthcare (medical image classification under bias and limited diversity) and environmental monitoring (species recognition for biodiversity support). In both contexts, model development is constrained by insufficient coverage of important subgroups (e.g., rare categories, demographic/appearance diversity, or real-world condition variability).
To address this, the project applies synthetic image generation (diffusion models) as a complementary data source and integrates it into a structured experimentation workflow. The approach includes generating class-specific samples via controlled prompting, combining synthetic and real images for training, and validating outcomes using both quantitative metrics (quality and predictive performance) and interpretability checks (ensuring models learn relevant visual features rather than artifacts).
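As a minimal sketch of the controlled-prompting step, the function below expands a class label into a list of generation prompts by varying condition metadata. The template wording and the condition fields are illustrative assumptions, not the project's actual prompts; in practice each prompt would be passed to a diffusion generator.

```python
# Sketch of class-conditioned prompt construction for a diffusion generator.
# Template and condition metadata are hypothetical examples.

def build_prompts(class_name, variations, n_per_variation=2):
    """Expand one class label into generation prompts, varying
    condition metadata to increase diversity of the synthetic set."""
    template = "a photo of a {cls}, {cond}, realistic, high detail"
    prompts = []
    for cond in variations:
        for _ in range(n_per_variation):
            prompts.append(template.format(cls=class_name, cond=cond))
    return prompts

prompts = build_prompts("red admiral butterfly",
                        ["daylight, on a leaf", "overcast, in flight"])
print(len(prompts))  # 2 variations x 2 repeats = 4
```

Varying the condition text (lighting, pose, background) rather than repeating one fixed prompt is what gives the generated samples coverage of real-world condition variability.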
Results
The project delivers two key outcomes: (1) an end-to-end prototype workflow for synthetic image generation + model training, and (2) validated insights on when synthetic data adds value and when it becomes inefficient or risky.
1) Functional prototype pipeline (product outcome)
A reproducible pipeline was developed to support:
Prompted synthetic image generation (class- or metadata-guided) using diffusion-based models.
Dataset composition experiments, comparing training on real-only, synthetic-only, and mixed datasets.
Evaluation and reporting, including structured experiment logs and automated analysis artifacts.
This workflow demonstrates practical integration of synthetic data into an applied ML development loop rather than treating generation as a standalone step.
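The dataset composition experiments can be sketched as a single function that builds one experiment arm at a time. The function name, the mode labels, and the ratio parameter are illustrative assumptions, not the pipeline's actual interface.

```python
import random

def compose_dataset(real, synthetic, mode, synth_ratio=0.5, seed=0):
    """Build the training set for one experiment arm:
    'real' only, 'synthetic' only, or 'mixed' (real plus a
    fraction of synthetic sized relative to the real set)."""
    rng = random.Random(seed)  # fixed seed keeps experiments reproducible
    if mode == "real":
        return list(real)
    if mode == "synthetic":
        return list(synthetic)
    if mode == "mixed":
        n_synth = int(len(real) * synth_ratio)
        return list(real) + rng.sample(synthetic, min(n_synth, len(synthetic)))
    raise ValueError(f"unknown mode: {mode}")

real = [f"real_{i}.jpg" for i in range(100)]
synth = [f"synth_{i}.png" for i in range(80)]
print(len(compose_dataset(real, synth, "mixed", synth_ratio=0.3)))  # 130
```

Keeping the three arms behind one seeded interface means the only variable between runs is the data composition, which is what makes the real-only vs. mixed comparison meaningful.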
2) Measured improvements in downstream classification (insight outcome)
Across experiments, adding synthetic images to real data consistently improved model performance, even when the synthetic dataset was relatively small. Mixed training setups (real + synthetic) outperformed real-only baselines, indicating that synthetic images can contribute a useful complementary signal rather than noise. Synthetic-only training showed partial generalization to real data, meaning generated images encoded transferable features but did not fully replace real-world diversity.
3) Domain shift + realism vs efficiency trade-offs (insight outcome)
Validation also highlighted key limitations:
Synthetic-to-real domain gaps remain a risk, requiring careful evaluation beyond “looks realistic.”
Stronger generators can improve similarity/realism, but may introduce high compute cost (runtime/energy) and reduce scalability for large augmentation volumes.
Results suggest synthetic generation is most effective when applied selectively, for example targeting underrepresented classes/subgroups or difficult edge cases, instead of attempting full dataset replacement.
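Selective augmentation of this kind can be sketched as a per-class generation budget: count the real examples per class and generate only enough synthetic images to bring underrepresented classes up to a target size. The helper name and target rule are illustrative assumptions.

```python
from collections import Counter

def generation_targets(labels, target_per_class=None):
    """Return how many synthetic images to generate per class so each
    class reaches the size of the largest class (or a given target)."""
    counts = Counter(labels)
    target = target_per_class or max(counts.values())
    return {cls: max(0, target - n) for cls, n in counts.items()}

# Hypothetical imbalanced label set: the rare class gets the whole budget.
labels = ["healthy"] * 90 + ["rare_lesion"] * 10
print(generation_targets(labels))  # {'healthy': 0, 'rare_lesion': 80}
```

Because the budget is zero for already well-covered classes, generation compute is spent only where the trade-off analysis above says it pays off.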
4) Stronger validation through explainability and structured metrics (TRL positioning)
Model behavior was validated using a combination of:
Performance metrics (accuracy, confusion matrices, stability checks)
Image quality metrics and operational KPIs (where applicable)
Explainable AI checks (e.g., attention focusing on relevant visual regions)
Together, this supports a proof-of-concept maturity level: the system is technically feasible, measurable, and reproducible, with clear guidance on how to scale responsibly and where additional validation is needed before deployment.
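The core performance metrics above can be computed with a few lines of plain Python; the sketch below shows accuracy and a confusion matrix (rows are true classes, columns are predicted classes), using made-up labels for illustration.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows = true class, columns = predicted class."""
    idx = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    return m

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical predictions from one experiment arm.
y_true = ["cat", "cat", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog"]
print(accuracy(y_true, y_pred))                          # 0.75
print(confusion_matrix(y_true, y_pred, ["cat", "dog"]))  # [[1, 1], [0, 2]]
```

The off-diagonal cells are what the explainability checks complement: a high accuracy with systematic confusion between two classes is exactly the case where attention maps help confirm whether the model relies on relevant features or on generation artifacts.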