2025 IEEE International Conference on Cyber Security and Resilience

Full Program

Summary:

Synthetic data generation has become a powerful approach for producing high-quality, privacy-preserving datasets, especially in domains where data is sensitive. Large Language Models (LLMs) excel at generating tabular data but typically lack privacy safeguards. To address this limitation, this work proposes DP-Tabula, an LLM-based model for tabular data generation that integrates Differential Privacy into the training process via DP-SGD. Additionally, an outlier handling technique is employed to stabilize model performance under the noisy training conditions introduced by DP-SGD. Experiments across multiple datasets reveal a privacy-utility trade-off in which the optimal noise level depends on dataset-specific characteristics. Furthermore, an intriguing finding emerges: the order of features in the input sequence significantly influences the quality of the synthetic data produced by the LLM-based model. This research offers a framework that not only strengthens privacy in synthetic tabular data generation but also uncovers insights into the mechanics of LLM-driven data synthesis.
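The abstract's core mechanism, training an LLM under DP-SGD, can be illustrated with a short sketch. The following is a minimal, hypothetical example using the Opacus library for PyTorch; the toy model, random data, and hyperparameters are placeholders for illustration only and are not DP-Tabula's actual implementation, which this program page does not include.

# Minimal, hypothetical DP-SGD sketch with Opacus (not the paper's code).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-in for the generative model: a small network over 16 features.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.MSELoss()

# Random placeholder "tabular" data (256 rows, 16 columns, 1 target).
X, y = torch.randn(256, 16), torch.randn(256, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32)

# PrivacyEngine rewires model, optimizer, and loader so that every step
# clips per-sample gradients and adds calibrated Gaussian noise (DP-SGD).
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,  # more noise: stronger privacy, lower utility
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

model.train()
for epoch in range(3):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()  # noisy, clipped update

# Privacy budget (epsilon) spent so far at a fixed delta.
print("epsilon:", privacy_engine.get_epsilon(delta=1e-5))

The noise_multiplier knob is where the privacy-utility trade-off described above plays out: raising it strengthens the differential privacy guarantee but makes gradients noisier, which is the instability the paper's outlier handling technique is said to mitigate.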

Author(s):

Weijie Niu, Switzerland
Zehao Zhang, Switzerland
Alberto Huertas, Spain
Chao Feng, Switzerland
Jan von der Assen, Switzerland
Nasim Nezhadsistani, Switzerland
Burkhard Stiller, Switzerland

 

