Now let's think together. In The clinical-truth gap I said clinical-truth verification belongs in medical empiricism. A fair objection: why bother with real-patient infrastructure at all when synthetic data exists? Synthea, MDClone, Syntegra, mostly.ai — generate fake patients with the statistical properties of real ones, train models on those, ship.
The honest answer is to look at how synthetic data actually performs, not how it's pitched.
What synthetic does well
Three uses where it earns its place.
Pipeline testing. No PHI, no HIPAA review, no consent overhead. Engineers stress-test ingestion code, validate FHIR mappings, and exercise edge cases. Synthea, the MITRE-developed open-source generator, was built explicitly for this purpose1 and is widely used across US health-IT projects.
Training augmentation. For rare conditions where real samples are too scarce for adequate training, synthetic supplementation measurably lifts model performance. A recent study on rare thyroid cancer subtype classification used text-guided diffusion to generate synthetic images and improved subtype-classification AUC from 0.7364 to 0.84422. The gain came from hybrid training: synthetic plus real beat real alone.
Aggregate statistical research. Questions like "what's the average HbA1c trajectory?" or "what's the comorbidity prevalence?" often produce similar answers on synthetic and real data, with no individual-level exposure. A JMIR comparison study of five MDClone-generated cohorts against their real counterparts found the analyses "provide a close estimate of real data results in general," with caveats depending on the patient-to-variable ratio3.
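The aggregate-statistics point can be made concrete with a toy sketch. The values below are invented for illustration, not drawn from the MDClone study; the point is only that a population-level query can agree closely across the two cohorts even though no synthetic value copies a real record.

```python
import statistics

# Hypothetical HbA1c values for a small real cohort and a synthetic
# cohort generated to match its distribution (illustrative numbers).
real_hba1c = [6.1, 7.4, 8.0, 5.9, 6.8, 7.2, 9.1, 6.5]
synthetic_hba1c = [6.3, 7.1, 8.2, 6.0, 6.6, 7.5, 8.8, 6.4]

real_mean = statistics.mean(real_hba1c)
synthetic_mean = statistics.mean(synthetic_hba1c)

# The aggregate answers land close together, so a researcher asking a
# population-level question gets nearly the same result either way.
print(f"real mean: {real_mean:.2f}, synthetic mean: {synthetic_mean:.2f}")
print(f"absolute gap: {abs(real_mean - synthetic_mean):.2f}")
```

The JMIR study's caveat about the patient-to-variable ratio is the catch: the more variables a query conditions on relative to cohort size, the more the two answers drift apart.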
That's a real value proposition. The series doesn't dismiss it.
What the benchmarks show
Three places the numbers cut against synthetic-as-substitute.
Rare-event performance plateaus. SHEPHERD — a Harvard/Zitnik-lab model trained on 40,000+ synthetic patients across 2,134 rare diseases — achieved 40% top-1 accuracy in causal gene discovery when evaluated against the real-world Undiagnosed Diseases Network cohort4. Forty percent is useful as a triage tool. It is not clinical-grade. The gap between synthetic-trained performance and real-world ground truth is precisely the gap synthetic data can't close on its own.
Hybrid almost always wins, and hybrid needs real data. Across healthcare AI benchmarks, models trained on synthetic + real outperform either alone. An age-related macular degeneration (AMD) fundus-image study using ResNet-18 reached 85% accuracy when combined data was used, outperforming the same architecture trained on synthetic data alone by a clinically meaningful margin5. The destination is rarely synthetic-only; it's augmentation.
Privacy isn't as clean as advertised. Membership-inference attacks against synthetic health data work. A 2022 Journal of Biomedical Informatics analysis (since extended by multiple 2024 papers) demonstrated that attackers can infer with substantial confidence whether a specific real patient's record was used to generate a synthetic cohort6. The re-identification risk rises for unique cases, such as older patients and rare conditions, which is exactly the population synthetic data is most often used for. Differential privacy mitigates this, but only at meaningful utility cost.
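To see why these attacks work, here is a minimal sketch of the distance-to-closest-record family of membership tests, simplified from the cited papers. All records, feature choices, and the threshold are hypothetical; real attacks use calibrated thresholds and richer features, but the core signal is the same: a generator that memorized a unique training patient emits a near-copy of that record.

```python
import math

def min_distance(candidate, synthetic_cohort):
    """Distance from a candidate record to its nearest synthetic record."""
    return min(math.dist(candidate, s) for s in synthetic_cohort)

def infer_membership(candidate, synthetic_cohort, threshold=1.0):
    """Toy distance-to-closest-record test: memorized training records
    sit suspiciously close to some synthetic record."""
    return min_distance(candidate, synthetic_cohort) < threshold

# Hypothetical (age, HbA1c) feature vectors. The outlier synthetic
# record is a near-copy of a rare, elderly training patient.
synthetic = [(54.0, 7.1), (61.0, 6.4), (83.0, 9.3)]
training_member = (83.0, 9.2)  # unique patient the generator memorized
non_member = (40.0, 5.0)       # never seen by the generator

print(infer_membership(training_member, synthetic))  # near-copy found
print(infer_membership(non_member, synthetic))       # no close match
```

Notice that the attack succeeds precisely on the outlier, which is why the text above flags older patients and rare conditions as the highest-risk group.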
Where synthetic structurally can't go
Two categorical limits. Better generators don't fix them.
Real outcomes. A synthetic patient doesn't develop sepsis. Doesn't survive their cancer. Doesn't die from heart failure five years later. Synthetic outcome data is fictional — produced to match training distributions, not real biology. For Prometheno's longer-term horizon — paying or penalizing AI vendors when their predictions match or miss reality — the outcome side cannot be synthetic. No algorithm turns simulation into observation.
Regulatory ground truth. The FDA's Center for Devices and Radiological Health issued updated real-world-evidence guidance in December 20257. The framework rests on observational data from actual patients in actual care. Synthetic control arms have a defined pathway as supplements to real evidence, not substitutes for it. The EMA's position is similar. For any AI/ML medical device seeking clearance, the path runs through real data.
What HAVEN does that synthetic can't
The strongest argument for HAVEN comes from accepting synthetic data's strengths.
If synthetic-only training plateaus well below clinical-grade for rare events, the path forward runs through hybrid models — and hybrid needs governed real data. Consent, audit, and quality grading are what make hybrid defensible at population scale.
If real outcomes can't be synthesized, AI accountability runs on real outcome data. The infrastructure for collecting outcomes, tying them to the predictions that preceded them, and attributing value back to the contributing patients is what HAVEN's four primitives enable.
If membership-inference attacks compromise synthetic privacy claims, the answer isn't to abandon real data. It's to govern access to real data properly. Consent-attestation and hash-chained audit produce traceability that de-identification alone never did.
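The hash-chain property is worth seeing in miniature. This is not HAVEN's actual protocol, and the event strings are invented; it only shows the core guarantee: each audit entry commits to the hash of the one before it, so a retroactive edit anywhere invalidates every later hash.

```python
import hashlib
import json

def append_entry(chain, event):
    """Append an audit event, chained to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    chain.append({"event": event, "prev": prev_hash,
                  "hash": hashlib.sha256(payload.encode()).hexdigest()})
    return chain

def verify(chain):
    """Recompute every hash; one tampered entry breaks the whole chain."""
    prev_hash = "0" * 64
    for entry in chain:
        payload = json.dumps({"event": entry["event"], "prev": prev_hash},
                             sort_keys=True)
        if entry["prev"] != prev_hash or \
           entry["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, "consent granted: patient 123, research use")
append_entry(log, "record accessed: vendor A, model training")
print(verify(log))   # True: untampered

log[0]["event"] = "consent granted: patient 123, ALL uses"  # tamper
print(verify(log))   # False: every later hash now fails
```

De-identification offers no analogous check: once data leaves the releasing institution, nothing in the data itself proves what was or wasn't authorized.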
Synthetic data is complementary. It strengthens the case for a patient-sovereign protocol layer rather than replacing it.
What comes next
The next post commits to what would prove the whole argument wrong.
References
1. Walonoski, J., et al. "Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record." Journal of the American Medical Informatics Association 25, no. 3 (2018): 230-238. Open-source, MITRE-maintained, used widely for testing and demonstration.
2. Frontiers in Digital Health, "Synthetic data generation: a privacy-preserving approach to accelerate rare disease research" (2025). Text-guided diffusion produced synthetic images with a 92.2% realism rate; hybrid training improved AUC from 0.7364 to 0.8442.
3. "Analyzing Medical Research Results Based on Synthetic Data and Their Relation to Real Data Results: Systematic Comparison From Five Observational Studies." JMIR Medical Informatics 8, no. 2 (2020). https://medinform.jmir.org/2020/2/e16492/
4. Alsentzer, E., et al. "Deep learning for diagnosing patients with rare genetic diseases." Zitnik Lab, Harvard. SHEPHERD model evaluated against the NIH Undiagnosed Diseases Network real-world cohort; published results show 40% top-1 accuracy on causal gene discovery.
5. "Generating high-fidelity synthetic patient data for assessing machine learning healthcare software." npj Digital Medicine (2020). https://www.nature.com/articles/s41746-020-00353-9. ResNet-18 on AMD fundus images: 85% accuracy with combined real + synthetic data.
6. Hyeong, J., et al. "Membership inference attacks against synthetic health data." Journal of Biomedical Informatics 125 (2022). https://pmc.ncbi.nlm.nih.gov/articles/PMC8766950/. Extended by multiple 2024 papers, including work on differentially private synthetic data and re-identification attacks on tabular GANs.
7. U.S. Food and Drug Administration. Use of Real-World Evidence to Support Regulatory Decision-Making for Medical Devices. Final guidance, December 16, 2025 (supersedes 2017 guidance). Real-world-data quality criteria emphasize relevance and reliability of observational data from actual patients in actual care.