SynthPAI: A Synthetic Dataset for Personal Attribute Inference
SynthPAI was created to provide a dataset for investigating the personal attribute inference (PAI) capabilities of LLMs on online texts. Because real-world data raises privacy concerns, open datasets are rare to non-existent in the research community. SynthPAI is a synthetic dataset that aims to fill this gap.
SynthPAI was created using 300 GPT-4 agents seeded with individual personalities interacting with each other in a simulated online forum and consists of 103 threads and 7823 comments. For each profile, we further provide a set of personal attributes that a human could infer from the profile. We additionally conducted a user study to evaluate the quality of the synthetic comments, establishing that humans can barely distinguish between real and synthetic comments.
The dataset is intended to be used as a privacy-preserving method of (i) evaluating PAI capabilities of language models and (ii) aiding the development of potential defenses against such automated inferences.
In the associated paper, we include an analysis of the personal attribute inference (PAI) capabilities of 18 state-of-the-art LLMs across different attributes and on anonymized texts.
The dataset shall not be used as part of any system that performs attribute inferences on real natural persons without their consent or otherwise maliciously.
We provide the instance descriptions below. Each data point consists of a single comment (that can be a top-level post):
Comment
author str: unique identifier of the person writing
username str: corresponding username
parent_id str: unique identifier of the parent comment
thread_id str: unique identifier of the thread
children list[str]: unique identifiers of children comments
profile Profile: profile making the comment - described below
text str: text of the comment
guesses list[dict]: list of dicts containing model estimates of attributes based on the comment. Only contains attributes for which a prediction exists.
reviews dict: dict containing human estimates of attributes based on the comment. Each guess contains a corresponding hardness rating (and certainty rating). Contains all attributes.
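As a rough illustration of the record layout above, the following sketch models a comment as a Python dataclass and groups comment texts by author, which is the per-profile view used when inferring attributes. The field names follow the list above; the class name `Comment`, the helper `texts_by_author`, the use of plain dicts for `profile`, and all sample values are assumptions made for this example and are not taken from the released files.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Comment:
    """One data point; field names follow the instance description."""
    author: str
    username: str
    parent_id: str
    thread_id: str
    text: str
    children: list[str] = field(default_factory=list)
    # Assumption: the profile is held as a plain dict in this sketch.
    profile: dict[str, Any] = field(default_factory=dict)
    guesses: list[dict[str, Any]] = field(default_factory=list)
    reviews: dict[str, Any] = field(default_factory=dict)


def texts_by_author(comments: list[Comment]) -> dict[str, list[str]]:
    """Collect all comment texts written by each author, in input order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for comment in comments:
        grouped[comment.author].append(comment.text)
    return dict(grouped)


# Made-up sample records, for illustration only.
sample = [
    Comment(author="pers1", username="UserA", parent_id="root-1",
            thread_id="thread-1", text="Anyone else commuting by tram?"),
    Comment(author="pers2", username="UserB", parent_id="c-1",
            thread_id="thread-1", text="I cycle, even in winter."),
    Comment(author="pers1", username="UserA", parent_id="c-2",
            thread_id="thread-1", text="Brave. My office is too far."),
]

grouped = texts_by_author(sample)
```

Grouping by `author` rather than `username` is a design choice for the sketch: the description lists `author` as the unique identifier, so it is the safer key.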
The associated profiles are structured as follows:
Profile
username str: identifier
attributes: set of personal attributes that describe the user (directly listed below)
The corresponding attributes and values are:
Attributes
Age continuous [18-99] The age of a user in years.
Place of Birth tuple [city, country] The place of birth of a user. We create tuples jointly for city and country in free-text format. (field name: birth_city_country)
Location tuple [city, country] The current location of a user. We create tuples jointly for city and country in free-text format. (field name: city_country)
Education free-text We use a free-text field to describe the user's education level. This includes additional details such as the degree and major. To ensure comparability with the evaluation of prior work, we later map these to a categorical scale: high school, college degree, master's degree, PhD.
Income Level free-text [low, medium, high, very high] The income level of a user. We first generate a continuous income level in the profile's local currency. In our code, we map this to a categorical value considering the distribution of income levels in the respective profile location. For this, we roughly follow the local equivalents of the following reference levels for the US: Low (<30k USD), Middle (30-60k USD), High (60-150k USD), Very High (>150k USD).
Occupation free-text The occupation of a user, described as a free-text field.
Relationship Status categorical [single, In a Relationship, married, divorced, widowed] The relationship status of a user as one of 5 categories.
Sex categorical [Male, Female] The biological sex of a profile.
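The income mapping described above can be sketched for the US reference levels as follows. This is a minimal illustration, not the code used to build the dataset: the function name `income_level_usd` is hypothetical, the handling of the exact boundary values (30k, 60k, 150k) is an assumption because the description leaves them open, and the location-dependent adjustment for other currencies is not modeled. The returned labels follow the listed value set (`medium` for the band the reference levels call "Middle").

```python
def income_level_usd(annual_income_usd: float) -> str:
    """Map a yearly income in USD to one of the four categorical levels.

    Boundary handling is an assumption of this sketch:
    30k counts as medium, 60k and 150k both count as high.
    """
    if annual_income_usd < 30_000:
        return "low"
    if annual_income_usd < 60_000:
        return "medium"
    if annual_income_usd <= 150_000:
        return "high"
    return "very high"


levels = [income_level_usd(x) for x in (25_000, 45_000, 90_000, 200_000)]
```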
The dataset is fully synthetic and was created using GPT-4 agents (version gpt-4-1106-preview) seeded with individual personalities interacting with each other in a simulated online forum.
The dataset was created by sampling comments from the agents in threads. A human then inferred a set of personal attributes from sets of comments associated with each profile. Further, the dataset was manually reviewed to remove any offensive or inappropriate content. We give a detailed overview of our dataset-creation procedure in the corresponding paper.
Annotations
Annotations are provided by the authors of the paper.
Personal and Sensitive Information
All contained personal information is purely synthetic and does not relate to any real individual.
All profiles are synthetic and do not correspond to any real subpopulations. We provide a distribution of the personal attributes of the profiles in the accompanying paper.
As the dataset has been created synthetically, data points can inherit limitations (e.g., biases) from the underlying model, GPT-4. While we manually reviewed each comment individually, we cannot guarantee that the dataset is free of such limitations.
BibTeX:
@misc{2406.07217,
Author = {Hanna Yukhymenko and Robin Staab and Mark Vero and Martin Vechev},
Title = {A Synthetic Dataset for Personal Attribute Inference},
Year = {2024},
Eprint = {arXiv:2406.07217},
}
APA:
Hanna Yukhymenko, Robin Staab, Mark Vero, Martin Vechev: “A Synthetic Dataset for Personal Attribute Inference”, 2024; [arXiv:2406.07217](http://arxiv.org/abs/2406.07217).