SynthPAI

Name: SynthPAI
Published: 2024-06-13
License: cc-by-nc-sa-4.0

SynthPAI: A Synthetic Dataset for Personal Attribute Inference

Dataset Information

Modalities

Texts

Languages

English

Introduced

2024

Overview

SynthPAI was created to provide a dataset for investigating the personal attribute inference (PAI) capabilities of LLMs on online texts. Because real-world data raises privacy concerns, open datasets for this task are rare or non-existent in the research community. SynthPAI is a synthetic dataset that aims to fill this gap.

Dataset Details

Dataset Description

SynthPAI was created using 300 GPT-4 agents seeded with individual personalities interacting with each other in a simulated online forum. It consists of 103 threads and 7823 comments. For each profile, we further provide a set of personal attributes that a human could infer from that profile's comments. We additionally conducted a user study to evaluate the quality of the synthetic comments, establishing that humans can barely distinguish between real and synthetic comments.

Curated by: The dataset was created by SRILab at ETH Zurich. It was not created on behalf of any outside entity.
Funded by: Two authors of this work are supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) (SERI-funded ERC Consolidator Grant). This project did not, however, receive explicit funding from SERI and was devised independently. Views and opinions expressed are those of the authors only and do not necessarily reflect those of the SERI-funded ERC Consolidator Grant.
Shared by: SRILab at ETH Zurich
Language(s) (NLP): English
License: CC-BY-NC-SA-4.0

Dataset Sources

Repository: https://github.com/eth-sri/SynthPAI
Paper: https://arxiv.org/abs/2406.07217

Uses

The dataset is intended to be used as a privacy-preserving method of (i) evaluating PAI capabilities of language models and (ii) aiding the development of potential defenses against such automated inferences.

Direct Use

The dataset can be used as in the associated paper, which analyzes the personal attribute inference (PAI) capabilities of 18 state-of-the-art LLMs across different attributes and on anonymized texts.

Out-of-Scope Use

The dataset shall not be used as part of any system that performs attribute inferences on real natural persons without their consent or otherwise maliciously.

Dataset Structure

We provide the instance descriptions below. Each data point consists of a single comment (which may be a top-level post):

Comment

author str: unique identifier of the person writing
username str: corresponding username
parent_id str: unique identifier of the parent comment
thread_id str: unique identifier of the thread
children list[str]: unique identifiers of children comments
profile Profile: profile making the comment - described below
text str: text of the comment
guesses list[dict]: Dicts containing model estimates of attributes based on the comment. Only contains attributes for which a prediction exists.
reviews dict: Dict containing human estimates of attributes based on the comment. Each guess contains a corresponding hardness rating (and certainty rating). Contains all attributes.

The associated profiles are structured as follows:

Profile

username str: identifier
attributes: set of personal attributes that describe the user (directly listed below)
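The two record types above can be sketched as typed Python records. This is only an illustration, not the authors' loader: the class names, the `comment_from_dict` helper, and the handling of the profile attributes (either nested under an `attributes` key or listed directly in the profile) are assumptions, as is the idea that a top-level post may carry an empty `parent_id`.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Profile:
    username: str
    # Attribute name -> value. Only birth_city_country and city_country are
    # named as field names in the card; any other key names are assumptions.
    attributes: dict[str, Any] = field(default_factory=dict)


@dataclass
class Comment:
    author: str
    username: str
    parent_id: Optional[str]  # assumption: may be empty for a top-level post
    thread_id: str
    children: list[str]
    profile: Profile
    text: str
    guesses: list[dict[str, Any]] = field(default_factory=list)
    reviews: dict[str, Any] = field(default_factory=dict)


def comment_from_dict(raw: dict[str, Any]) -> Comment:
    """Build a Comment from one raw record (hypothetical helper)."""
    raw_profile = dict(raw["profile"])  # copy so the input is not mutated
    username = raw_profile.pop("username", raw["username"])
    nested = raw_profile.get("attributes")
    # Accept either a nested "attributes" dict or attributes listed directly.
    attributes = dict(nested) if isinstance(nested, dict) else raw_profile
    return Comment(
        author=raw["author"],
        username=raw["username"],
        parent_id=raw.get("parent_id"),
        thread_id=raw["thread_id"],
        children=list(raw.get("children") or []),
        profile=Profile(username=username, attributes=attributes),
        text=raw["text"],
        guesses=list(raw.get("guesses") or []),
        reviews=dict(raw.get("reviews") or {}),
    )
```

Keeping `guesses` and `reviews` as plain dicts avoids committing to an inner layout that the card does not spell out.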

The corresponding attributes and values are:

Attributes

Attribute	Type	Description
Age	continuous [18-99]	The age of a user in years.
Place of Birth	tuple [city, country]	The place of birth of a user. We create tuples jointly for city and country in free-text format. (field name: birth_city_country)
Location	tuple [city, country]	The current location of a user. We create tuples jointly for city and country in free-text format. (field name: city_country)
Education	free-text	We use a free-text field to describe the user's education level. This includes additional details such as the degree and major. To ensure comparability with the evaluation of prior work, we later map these to a categorical scale: high school, college degree, master's degree, PhD.
Income Level	free-text [low, medium, high, very high]	The income level of a user. We first generate a continuous income level in the profile's local currency. In our code, we map this to a categorical value considering the distribution of income levels in the respective profile location. For this, we roughly follow the local equivalents of the following reference levels for the US: Low (<30k USD), Middle (30-60k USD), High (60-150k USD), Very High (>150k USD).
Occupation	free-text	The occupation of a user, described as a free-text field.
Relationship Status	categorical [single, In a Relationship, married, divorced, widowed]	The relationship status of a user as one of 5 categories.
Sex	categorical [Male, Female]	Biological Sex of a profile.
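The income mapping described above can be illustrated with the US reference levels alone. The sketch below is not the authors' code: the function name `income_category` is hypothetical, it applies only the US thresholds rather than the location-specific equivalents the card mentions, and the treatment of the exact boundary values (30k, 60k and 150k) is an assumption, since the card leaves them open.

```python
def income_category(annual_income_usd: float) -> str:
    """Map a yearly income in USD to one of the four categorical levels.

    Uses only the US reference levels from the card. Boundary handling is an
    assumption: 30k counts as medium, 60k as high, 150k as high.
    """
    if annual_income_usd < 0:
        raise ValueError("income must be non-negative")
    if annual_income_usd < 30_000:
        return "low"
    if annual_income_usd < 60_000:
        return "medium"  # the reference levels call this band "Middle"
    if annual_income_usd <= 150_000:
        return "high"
    return "very high"
```

For a profile outside the US, the thresholds would need to be replaced by local equivalents, which are not reproduced here.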

Dataset Creation

Curation Rationale

SynthPAI was created to provide a dataset for investigating the personal attribute inference (PAI) capabilities of LLMs on online texts. Because real-world data raises privacy concerns, open datasets for this task are rare or non-existent in the research community. SynthPAI is a synthetic dataset that aims to fill this gap. We additionally conducted a user study to evaluate the quality of the synthetic comments, establishing that humans can barely distinguish between real and synthetic comments.

Source Data

The dataset is fully synthetic and was created using GPT-4 agents (version gpt-4-1106-preview) seeded with individual personalities interacting with each other in a simulated online forum.

Data Collection and Processing

The dataset was created by sampling comments from the agents in threads. A human then inferred a set of personal attributes from the sets of comments associated with each profile. The dataset was also manually reviewed to remove any offensive or inappropriate content. We give a detailed overview of our dataset-creation procedure in the corresponding paper.
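Since attributes are inferred from the set of comments belonging to a profile, a consumer of the dataset will typically want to regroup the flat list of comments by author. A minimal sketch, assuming each comment is a plain dict with the `author` and `text` fields described under Dataset Structure (the helper name `texts_by_author` is hypothetical):

```python
from collections import defaultdict
from typing import Any, Iterable


def texts_by_author(comments: Iterable[dict[str, Any]]) -> dict[str, list[str]]:
    """Collect comment texts per author, preserving the input order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for comment in comments:
        grouped[comment["author"]].append(comment["text"])
    return dict(grouped)
```

The resulting per-author lists can then be passed to whatever inference procedure is being evaluated, for example as one concatenated prompt per profile.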

Annotations

Annotations are provided by authors of the paper.

Personal and Sensitive Information

All contained personal information is purely synthetic and does not relate to any real individual.

Bias, Risks, and Limitations

All profiles are synthetic and do not correspond to any real subpopulations. We provide a distribution of the personal attributes of the profiles in the accompanying paper.
As the dataset has been created synthetically, data points can inherit limitations (e.g., biases) from the underlying model, GPT-4. While we manually reviewed comments individually, we cannot guarantee that no such issues remain.

Citation

BibTeX:

@misc{2406.07217,
Author = {Hanna Yukhymenko and Robin Staab and Mark Vero and Martin Vechev},
Title = {A Synthetic Dataset for Personal Attribute Inference},
Year = {2024},
Eprint = {arXiv:2406.07217},
}

APA:

Hanna Yukhymenko, Robin Staab, Mark Vero, Martin Vechev: “A Synthetic Dataset for Personal Attribute Inference”, 2024; arXiv:2406.07217 (http://arxiv.org/abs/2406.07217).

Dataset Card Authors

Hanna Yukhymenko
Robin Staab
Mark Vero

Variants: SynthPAI

Associated Benchmarks

This dataset is used in 1 benchmark:

Personality Trait Recognition - Metrics: Average accuracy in %
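The metric named above, average accuracy in %, can be sketched as follows. The scoring rule is an assumption made for illustration: a prediction counts as correct when it matches the label exactly after trimming whitespace and lowercasing, which is unlikely to match the paper's evaluation for free-text attributes. The function name `accuracy_percent` is hypothetical.

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Share of exact (case-insensitive) matches, expressed in percent."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        pred.strip().lower() == gold.strip().lower()
        for pred, gold in zip(predictions, labels)
    )
    return 100.0 * correct / len(labels)
```

Averaging this value over attributes, or over profiles, is a further design choice that the card does not specify.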

Recent Benchmark Submissions

Task	Model	Paper	Date
Personality Trait Recognition	Claude-3 Sonnet	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	Gemini 1.5 Pro	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	Qwen1.5 110B	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	Gemini 1.0 Pro	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	GPT-4	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	Llama-3 70B	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	Mixtral 8x22B	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	Claude-3 Opus	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	Claude-3 Haiku	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	Yi 34B	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	Llama-2 70B	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	GPT-3.5	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	Llama-3 8B	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	Mixtral 8x7B	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	Llama-2 13B	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	Mistral 7B	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	Gemma 7B	A Synthetic Dataset for Personal …	2024-06-11
Personality Trait Recognition	Llama-2 7B	A Synthetic Dataset for Personal …	2024-06-11

Research Papers

Recent papers with results on this dataset:

A Synthetic Dataset for Personal Attribute Inference (2024)
