2025 IEEE International Conference on Cyber Security and Resilience

Full Program

Summary:

Synthetic data generation has become a powerful approach for producing high-quality, privacy-preserving datasets, especially in domains where data is sensitive. Large Language Models (LLMs) excel at generating tabular data but typically lack privacy safeguards. To address this limitation, this work proposes DP-Tabula, an LLM-based model for tabular data generation that integrates Differential Privacy into the training process via DP-SGD. Additionally, an outlier handling technique is employed to stabilize model performance under the noisy training conditions introduced by DP-SGD. Experiments across multiple datasets reveal a privacy-utility trade-off in which the optimal noise level depends on dataset-specific characteristics. Furthermore, an intriguing finding emerges: the order of features in the input sequence significantly influences the quality of the synthetic data produced by the LLM-based model. This research offers a framework that not only strengthens privacy in synthetic tabular data generation but also uncovers insights into the mechanics of LLM-driven data synthesis.
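The abstract's core mechanism, training an LLM under DP-SGD, can be illustrated with a short sketch. The following is a minimal, hypothetical example using the Opacus library for PyTorch; the toy model, random data, and hyperparameters are placeholders for illustration only and are not DP-Tabula's actual implementation, which this program page does not include.

# Minimal, hypothetical DP-SGD sketch with Opacus (not the paper's code).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-in for the generative model: a small network over 16 features.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.MSELoss()

# Random placeholder "tabular" data (256 rows, 16 columns, 1 target).
X, y = torch.randn(256, 16), torch.randn(256, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32)

# PrivacyEngine rewires model, optimizer, and loader so that every step
# clips per-sample gradients and adds calibrated Gaussian noise (DP-SGD).
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,  # more noise: stronger privacy, lower utility
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

model.train()
for epoch in range(3):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()  # noisy, clipped update

# Privacy budget (epsilon) spent so far at a fixed delta.
print("epsilon:", privacy_engine.get_epsilon(delta=1e-5))

The noise_multiplier knob is where the privacy-utility trade-off described above plays out: raising it strengthens the differential privacy guarantee but makes gradients noisier, which is the instability the paper's outlier handling technique is said to mitigate.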

Author(s):

Weijie Niu, Switzerland
Zehao Zhang, Switzerland
Alberto Huertas, Spain
Chao Feng, Switzerland
Jan von der Assen, Switzerland
Nasim Nezhadsistani, Switzerland
Burkhard Stiller, Switzerland

 

