PAL: Pluralistic ALignment Framework for Learning from Heterogeneous Preferences


ICML TF2M/MFHAIA workshop 2024 (Oral)
University of Wisconsin-Madison

🔥[NEW!] PAL has been accepted at 2024 ICML workshops: TF2M and MFHAIA.


PAL is a framework for learning reward models for alignment that suit diverse, heterogeneous human preferences. For a user \(i\), the probability of preferring the left item \(x_l\) over the right item \(x_r\), given the conditioning \(x_c\), is given by the reward model \(r_\theta\). PAL learns \(r_\theta\) with a mixture modeling approach over \(K\) prototypes representing \(K\) subgroups: this captures the shared structure of human preferences across subpopulations, while also accommodating the specific differences between individual people.

Abstract

We propose PAL, an alignment framework complementary to existing pretraining strategies, designed to model the diverse, heterogeneous nature (plurality) of human preferences from the ground up. In contrast to the status-quo paired-preference Bradley-Terry model used by many popular LLMs and LVMs today, we reframe the alignment problem using Coombs' ideal point model and a mixture modeling approach. PAL captures the plurality of preferences across a population of users while simultaneously learning a shared preference latent space that can few-shot generalize to new, unseen users. PAL demonstrates the efficacy of cheap and efficient reward modeling: our approach learns reward functions via simple MLP layers (on top of the penultimate-layer dense representations learned by large pretrained foundation models) that are on par with existing large state-of-the-art reward models, which are typically in the billion-parameter regime. We show that PAL achieves competitive reward model accuracy compared to strong baselines on 1) language models with OpenAI's TL;DR dataset; 2) text-to-image models with the Pick-a-Pic dataset; 3) a new semi-synthetic heterogeneous dataset generated using Anthropic Personas.

Our experiments also highlight a key shortcoming of current preference datasets that are created using rigid rubrics which wash away heterogeneity, and we make a case for more nuanced data collection approaches.


  What is the Ideal Point Model?


The ideal point model (Coombs, 1950) is a statistical model used to analyze the preferences of individuals or groups. It is widely used in political science, sociology, and economics to model the preferences of voters, legislators, and other decision-makers. The ideal point model assumes that each individual has an "ideal point" in some high-dimensional space \(\mathbb{R}^d\), and that the individual's preference for a particular alternative is a decreasing function of the distance between the alternative and the ideal point: the closer an alternative is to the ideal point, the more it is preferred. The ideal point model is well suited for heterogeneous preference learning, which we discuss below.
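
To make the distance-based rule concrete, here is a toy sketch (our own illustration, not code from the paper) of how an ideal point determines a pairwise choice:

import numpy as np

def ideal_point_preference(x_l, x_r, ideal_point):
    """Return +1 if the user prefers the left item x_l, else -1.

    Under the ideal point model, the alternative closer (in Euclidean
    distance) to the user's ideal point is the preferred one.
    """
    d_l = np.linalg.norm(x_l - ideal_point)
    d_r = np.linalg.norm(x_r - ideal_point)
    return 1 if d_l < d_r else -1

# Toy example in R^2: the user's ideal point is close to x_l.
user = np.array([0.0, 1.0])
x_l, x_r = np.array([0.2, 0.9]), np.array([2.0, -1.0])
print(ideal_point_preference(x_l, x_r, user))  # -> 1 (prefers x_l)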

  PAL: Mixture Modeling for Heterogeneous Preferences


In reality, different people can have preferences that are not just noisy perturbations of a single universal preference model; people can differ in systematic ways. At the same time, preferences are not completely unique to each person: there are shared aspects of preferences within subgroups of people, owing for example to similar demographic, educational, socio-cultural, or other characteristics. PAL is designed to suit this structure of human preferences. In particular, we use a mixture modeling approach to capture diverse individual preferences across \(K\) subgroups, where each user's preference (ideal point) is a convex combination of:


  •  Model A: \(K\) prototypical ideal points.
  •  Model B: \(K\) prototypical functions mapping input prompts to ideal points.


Here the \(K\) prototypes represent the shared structure across subpopulations, while each user's weights \(W\) over the prototypes represent their individuality.
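
To make Model A concrete, below is a minimal PyTorch-style sketch (our own illustration with hypothetical names and shapes, not the paper's reference implementation): each user's ideal point is a convex combination of \(K\) learned prototypes, and the preferred item is the one whose embedding lies closer to that ideal point.

import torch
import torch.nn as nn

class PALModelA(nn.Module):
    """Sketch of PAL Model A: K shared prototype ideal points plus
    per-user convex weights over the prototypes (shapes illustrative)."""

    def __init__(self, num_users: int, num_prototypes: int, embed_dim: int):
        super().__init__()
        # K prototypical ideal points in the shared preference space.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, embed_dim))
        # Per-user logits; a softmax turns them into convex weights over prototypes.
        self.user_logits = nn.Parameter(torch.zeros(num_users, num_prototypes))
        # Simple item map on top of (frozen) foundation-model features.
        self.item_map = nn.Linear(embed_dim, embed_dim, bias=False)

    def ideal_point(self, user_idx):
        w = torch.softmax(self.user_logits[user_idx], dim=-1)  # (B, K)
        return w @ self.prototypes                             # (B, d)

    def forward(self, user_idx, feat_l, feat_r):
        """Score > 0 means the user is predicted to prefer the left item."""
        a = self.ideal_point(user_idx)
        d_l = (self.item_map(feat_l) - a).norm(dim=-1)
        d_r = (self.item_map(feat_r) - a).norm(dim=-1)
        return d_r - d_l  # the closer item is the preferred one

model = PALModelA(num_users=100, num_prototypes=3, embed_dim=16)
score = model(torch.tensor([0]), torch.randn(1, 16), torch.randn(1, 16))

Training such a model with, e.g., a hinge loss on the score multiplied by the ±1 preference label corresponds to the setup used in the synthetic experiments below.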


  Experiments: Gaussian Synthetic Dataset

Hypothesis: if we create a synthetic dataset with a "ground truth" subpopulation structure (injecting heterogeneity), PAL should be able to learn these groups well.

Assume \( K^* \) "true" user prototypes \(\{\mathbf{p}_i\}_{i=1}^{K^*}\), where \(\mathbf{p}_i \sim \mathcal{N}(0,(1/d)I)\). We consider two settings:


  •    1) Mixture setting: each user's ideal point lies in the convex hull of the \( K^* \) true prototypes.
  •    2) Partition setting: \( N \) users are evenly assigned to the \( K^* \) true prototypes, i.e. \(\mathbf{a}_i \in \{\mathbf{p}_k\}_{k=1}^{K^*}\).


Each sample is generated as follows: we randomly draw two items \(\{\mathbf{x}_l, \mathbf{x}_r\}\) and one user \(\mathbf{a}_i\), and label the user's preference as \(\text{sign}(\|f^*(\mathbf{x}_l)-f^*(\mathbf{a}_i)\|_2-\|f^*(\mathbf{x}_r)-f^*(\mathbf{a}_i)\|_2)\). We generate a total of \( n \) samples per user to learn the user's ideal point (a sketch of this generation procedure follows the list below). We train Model A with a single-layer MLP (without bias) using the hinge loss and evaluate on a held-out test set, which we split into two disjoint sets:


  •    1) Seen user: a user who provides labeled preferences in both the train and test sets.
  •    2) Unseen user: a user who provides labeled preferences in the test set but not the train set.

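For concreteness, here is a minimal sketch of the generation procedure above in the mixture setting; the identity representation \(f^*\), the Gaussian item distribution, and the Dirichlet draw of user weights are our own assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
d, K_star, n_users, n_per_user = 3, 3, 50, 100

# K* "true" prototypes, p_i ~ N(0, (1/d) I).
prototypes = rng.normal(scale=np.sqrt(1.0 / d), size=(K_star, d))

# Mixture setting: each user's ideal point is a random convex combination of
# the true prototypes (the partition setting would pick a single prototype).
weights = rng.dirichlet(np.ones(K_star), size=n_users)
ideal_points = weights @ prototypes

def make_samples(a, n, f=lambda z: z):
    """Draw n item pairs and label each by which item is closer to ideal point a."""
    x_l = rng.normal(size=(n, d))
    x_r = rng.normal(size=(n, d))
    labels = np.sign(np.linalg.norm(f(x_l) - f(a), axis=1)
                     - np.linalg.norm(f(x_r) - f(a), axis=1))
    return x_l, x_r, labels

x_l, x_r, y = make_samples(ideal_points[0], n_per_user)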

Results:

  •  (a) Learnability: PAL can align the learned and true user ideal points in the representation space \(\mathbb{R}^3\).
  •  (b) Adapting to Plurality: a homogeneous PAL reward model (\( K = 1 \)) is suboptimal when diverse preferences exist (\( K^* > 1 \)). When we allow for plurality by setting \( K > 1 \), PAL gains a significant 5-7% in accuracy.
  •  (c) Sample Complexity: as we increase the number of training samples for seen users, PAL achieves higher test accuracy and more accurately recovers the number of true prototypes in the dataset (accuracy peaks at \( K = K^* \)).
  •  (d) Generalization: for unseen users (users not in the train set), PAL generalizes to accurately predict preferences with \( \sim 50 \) labeled examples.


  Experiments: Heterogeneous Semi-Synthetic Datasets

Hypothesis: if we synthetically inject diverse preferences into real datasets, PAL should still be able to learn these groups reasonably well.

Persona dataset

The Anthropic Personas dataset contains a collection of personalities or personas, each associated with 500 statements that align with the persona and 500 statements that do not.

We create a heterogeneous preference dataset by sampling pairs of persona statements and simulating the preference choices of subpopulation groups with diverse personality preferences.
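
As an illustration (a sketch of one plausible construction; the exact sampling protocol is described in the paper), a simulated user from a subgroup that holds a given persona prefers the persona-consistent statement in each sampled pair:

import random

def build_persona_pairs(matching, non_matching, n_pairs, seed=0):
    """Simulate a user who holds a given persona: sample statement pairs and
    mark the persona-consistent statement as preferred.

    matching / non_matching: statements that do / do not align with the
    persona (500 each per persona in Anthropic Personas).
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        pos, neg = rng.choice(matching), rng.choice(non_matching)
        # Randomize left/right placement; the label is +1 if the left item is preferred.
        if rng.random() < 0.5:
            pairs.append((pos, neg, +1))
        else:
            pairs.append((neg, pos, -1))
    return pairs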


  Results:


(a) On the heterogeneous persona dataset, as the number of learned prototypes approaches the number of true prototypes, i.e. \(K \to K^\star\), seen-user accuracy increases to \(100\%\) given a sufficient number of users and comparisons per user; (b) as the number of comparisons per seen user \(n_p\) grows, PAL saturates to \(100\%\) accuracy (when \(K \geq K^\star = 3\)).

Pick-a-Filter dataset

The Pick-a-Pic dataset is a large, crowdsourced open dataset of human preferences over text-to-image generation, designed to align pre-trained models with human preferences.

We construct the Pick-a-Filter dataset by assuming two subpopulation groups that prefer warm (red) or cool (blue) tones. We apply simple color filters to Pick-a-Pic V1 to semi-synthetically inject this heterogeneity.
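
For intuition, a warm or cool tone can be injected by boosting an image's red or blue channel; the sketch below is our own illustration of such a filter, not necessarily the exact transformation used to build Pick-a-Filter.

from PIL import Image
import numpy as np

def apply_tone_filter(img, tone="warm", strength=0.3):
    """Shift an image toward warm (red) or cool (blue) tones.

    Illustrative only; the actual Pick-a-Filter construction may differ.
    """
    arr = np.array(img.convert("RGB"), dtype=np.float32)
    channel = 0 if tone == "warm" else 2  # R channel for warm, B channel for cool
    arr[..., channel] = np.clip(arr[..., channel] * (1.0 + strength), 0, 255)
    return Image.fromarray(arr.astype(np.uint8))

# A user in the "warm" subgroup is assumed to prefer the warm-filtered version
# of an image over the cool-filtered one.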


  Results:


When the underlying population contains more than one preference group (\(K^* > 1\)), PAL learns beyond a single universal preference and identifies the diverse user preference groups. We observe that PAL significantly outperforms the homogeneous reward model in predicting user preferences: at a mixture ratio of 1, PAL achieves \(95.2\%\) test accuracy compared to \(75.4\%\) for the homogeneous reward model (\(K=1\)).


BibTeX

@article{chen2024pal,
  title={PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences},
  author={Chen, Daiwei and Chen, Yi and Rege, Aniket and Vinayak, Ramya Korlakai},
  journal={arXiv preprint arXiv:2406.08469},
  year={2024}
}

Acknowledgement

This work was supported by NSF grants NCS-FO 2219903 and NSF CAREER Award CCF 2238876.

Usage and License Notice: The data, code and model checkpoints are intended for research use.