Random forests for generating partially synthetic, categorical data

Publication, Journal Article
Caiola, G; Reiter, JP
Published in: Transactions on Data Privacy
April 1, 2010

Several national statistical agencies are now releasing partially synthetic, public use microdata. These comprise the units in the original database with sensitive or identifying values replaced with values simulated from statistical models. Specifying synthesis models can be daunting in databases that include many variables of diverse types. These variables may be related in ways that can be difficult to capture with standard parametric tools. In this article, we describe how random forests can be adapted to generate partially synthetic data for categorical variables. Using an empirical study, we illustrate that the random forest synthesizer can preserve relationships reasonably well while providing low disclosure risks. The random forest synthesizer has some appealing features for statistical agencies: it can be applied with minimal tuning, easily incorporates numerical, categorical, and mixed variables as predictors, operates efficiently in high dimensions, and automatically fits non-linear relationships.
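The abstract does not spell out the mechanics, so the following is only a minimal sketch of one way a random forest could serve as a synthesizer for a single categorical variable. It is not the authors' implementation. It assumes that each record is run down every tree, the trees' predictions are tallied, and the replacement value is drawn with probability proportional to the vote counts. The toy predictors (`age`, region), the `rent`/`own` variable, and all function names are hypothetical and exist only for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    """Most frequent label (ties broken by first occurrence)."""
    return Counter(labels).most_common(1)[0][0]

def grow_tree(rows, labels, rng, n_try, min_size=5):
    """Grow one tree on categorical predictors.
    Internal nodes are tuples (feature, value, left, right) that test
    row[feature] == value; leaves are class labels (so labels must not be tuples)."""
    if len(rows) < min_size or len(set(labels)) == 1:
        return majority(labels)
    best = None
    # Consider only a random subset of predictors at each node.
    for j in rng.sample(range(len(rows[0])), n_try):
        for v in sorted(set(r[j] for r in rows)):
            left = [i for i, r in enumerate(rows) if r[j] == v]
            right = [i for i, r in enumerate(rows) if r[j] != v]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, v, left, right)
    if best is None:  # no usable split among the sampled predictors
        return majority(labels)
    _, j, v, left, right = best
    return (j, v,
            grow_tree([rows[i] for i in left], [labels[i] for i in left],
                      rng, n_try, min_size),
            grow_tree([rows[i] for i in right], [labels[i] for i in right],
                      rng, n_try, min_size))

def predict_tree(tree, row):
    """Follow the splits until a leaf label is reached."""
    while isinstance(tree, tuple):
        j, v, left, right = tree
        tree = left if row[j] == v else right
    return tree

def fit_forest(rows, labels, n_trees, rng):
    """Fit each tree to a bootstrap sample of the records."""
    n, p = len(rows), len(rows[0])
    n_try = max(1, int(p ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], rng, n_try))
    return forest

def synthesize(forest, rows, rng):
    """Draw one synthetic label per record from the trees' vote proportions."""
    synthetic = []
    for row in rows:
        votes = Counter(predict_tree(t, row) for t in forest)
        classes, counts = zip(*sorted(votes.items()))
        synthetic.append(rng.choices(classes, weights=counts)[0])
    return synthetic

# Hypothetical toy data: two categorical predictors, one sensitive variable.
rng = random.Random(0)
rows = [(rng.choice(["young", "old"]), rng.choice(["north", "south", "east"]))
        for _ in range(200)]
labels = ["rent" if rng.random() < (0.8 if age == "young" else 0.3) else "own"
          for age, _ in rows]

forest = fit_forest(rows, labels, n_trees=25, rng=rng)
synthetic = synthesize(forest, rows, rng)
```

In a partially synthetic release built this way, the predictor columns in `rows` would be kept as collected and only the sensitive column would be replaced by `synthetic`. A production version would more likely rely on an established random forest library than on hand-written trees.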

Published In: Transactions on Data Privacy
EISSN: 2013-1631
ISSN: 1888-5063
Publication Date: April 1, 2010
Volume: 3
Issue: 1
Start / End Page: 27 / 42
 

Citation

APA: Caiola, G., & Reiter, J. P. (2010). Random forests for generating partially synthetic, categorical data. Transactions on Data Privacy, 3(1), 27–42.
Chicago: Caiola, G., and J. P. Reiter. “Random forests for generating partially synthetic, categorical data.” Transactions on Data Privacy 3, no. 1 (April 1, 2010): 27–42.
ICMJE: Caiola G, Reiter JP. Random forests for generating partially synthetic, categorical data. Transactions on Data Privacy. 2010 Apr 1;3(1):27–42.
MLA: Caiola, G., and J. P. Reiter. “Random forests for generating partially synthetic, categorical data.” Transactions on Data Privacy, vol. 3, no. 1, Apr. 2010, pp. 27–42.
NLM: Caiola G, Reiter JP. Random forests for generating partially synthetic, categorical data. Transactions on Data Privacy. 2010 Apr 1;3(1):27–42.
