We propose PAL, an alignment framework complementary to existing pretraining strategies, designed to model the diverse, heterogeneous nature (plurality) of human preferences from the ground up. In contrast to the status quo, the paired-preference Bradley-Terry model used by many popular LLMs and LVMs today, we reframe the alignment problem using Coombs' ideal point model and a mixture modeling approach. PAL captures the plurality of preferences across a population of users while simultaneously learning a shared preference latent space that can few-shot generalize to new, unseen users. PAL also demonstrates that reward modeling can be cheap and efficient: our approach learns reward functions via simple MLP layers (on top of the penultimate-layer dense representations of large pretrained foundation models) that are on par with existing state-of-the-art reward models, which typically operate in the billion-parameter regime. We show that PAL achieves competitive reward model accuracy compared to strong baselines on 1) language models with OpenAI's TL;DR dataset; 2) text-to-image models with the Pick-a-Pic dataset; and 3) a new semi-synthetic heterogeneous dataset generated using Anthropic Personas.
Our experiments also highlight a key shortcoming of current preference datasets: they are collected with rigid rubrics that wash away heterogeneity. We make a case for more nuanced data collection approaches.
The ideal point model (Coombs, 1950) is a statistical model used to analyze the preferences of individuals or groups. It is widely used in political science, sociology, and economics to model the preferences of voters, legislators, and other decision-makers. The model assumes that each individual has an "ideal point" in some high-dimensional space \(\mathbb{R}^d\), and that the individual's preference for a particular alternative is a function of the distance between the alternative and the individual's ideal point. This makes the ideal point model well suited for heterogeneous preference learning, which we discuss below.
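Concretely, with a map \(f\) from alternatives into the preference space and a user ideal point \(\mathbf{a}\in\mathbb{R}^d\), one common probabilistic form of the ideal point model (our notation; the exact link function may differ from the one used in the paper) is
\[ \Pr(\mathbf{x}_l \succ \mathbf{x}_r \mid \mathbf{a}) \;=\; \sigma\big(\|f(\mathbf{x}_r)-\mathbf{a}\|_2^2 - \|f(\mathbf{x}_l)-\mathbf{a}\|_2^2\big), \]
where \(\sigma\) is the logistic function: the closer an alternative lies to the ideal point, the more likely it is to be preferred.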
In reality, different people's preferences are not just noisy perturbations of a single universal preference; people can differ in systematically different ways! At the same time, preferences are not completely unique: there are shared aspects within subgroups of people, owing for example to similar demographics, education, socio-cultural background, or other commonalities. PAL is designed to suit this structure of human preferences. In particular, we use a mixture modeling approach to capture diverse individual preferences across \(K\) subgroups, where each user's preference (ideal point) is a convex combination of \(K\) shared prototype points:
\[ \mathbf{a}_i \;=\; \sum_{k=1}^{K} w_{ik}\,\mathbf{p}_k, \qquad w_{ik}\ge 0, \quad \sum_{k=1}^{K} w_{ik}=1. \]
Here the \(K\) prototypes \(\{\mathbf{p}_k\}\) represent the shared structure across subpopulations, while each user's weights \(W = [w_{ik}]\) over the prototypes represent their individuality.
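To make this concrete, below is a minimal sketch of a PAL-style reward head built on frozen foundation-model embeddings, assuming a squared-Euclidean ideal point model; the class and variable names, layer sizes, and softmax parameterization of the user weights are our illustrative choices, not the released implementation.

```python
import torch
import torch.nn as nn

class PALRewardSketch(nn.Module):
    """Illustrative PAL-style reward head (not the released implementation).

    K prototype ideal points live in a learned latent space; each user's
    ideal point is a convex combination of the prototypes (softmax weights).
    """

    def __init__(self, emb_dim, latent_dim, num_prototypes, num_users):
        super().__init__()
        # simple MLP mapping frozen foundation-model embeddings into the latent space
        self.f = nn.Sequential(
            nn.Linear(emb_dim, latent_dim), nn.ReLU(), nn.Linear(latent_dim, latent_dim)
        )
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, latent_dim))
        # unconstrained per-user logits; softmax turns them into convex-combination weights
        self.user_logits = nn.Parameter(torch.zeros(num_users, num_prototypes))

    def ideal_points(self, user_ids):
        w = torch.softmax(self.user_logits[user_ids], dim=-1)  # (B, K), rows on the simplex
        return w @ self.prototypes                             # (B, latent_dim)

    def forward(self, emb_left, emb_right, user_ids):
        a = self.ideal_points(user_ids)
        z_l, z_r = self.f(emb_left), self.f(emb_right)
        # positive logit <=> the left item is closer to the user's ideal point
        return (z_r - a).pow(2).sum(-1) - (z_l - a).pow(2).sum(-1)
```

A pairwise logistic or hinge loss on this logit against the observed choice gives a training objective; a new user can then be handled few-shot by fitting only their row of `user_logits` from a handful of comparisons while keeping `f` and the prototypes frozen.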
Assume \(K^\star\) "true" user prototypes \(\{\mathbf{p}_k\}_{k=1}^{K^\star}\), where \(\mathbf{p}_k \sim \mathcal{N}(0,(1/d)I)\). We consider two settings:
Each sample is generated as follows: we randomly draw two items \(\{\mathbf{x}_l, \mathbf{x}_r\}\) and one user \(\mathbf{a}_i\), and label the user's preference as \(\text{sign}(\|f^*(\mathbf{x}_l)-f^*(\mathbf{a}_i)\|_2-\|f^*(\mathbf{x}_r)-f^*(\mathbf{a}_i)\|_2)\). We generate a total of \(n\) samples per user to learn the user's ideal point. We use Model A with a single-layer MLP (without bias) trained with hinge loss, and evaluate on a held-out test set split into two disjoint parts: new comparisons from seen users, and comparisons from unseen users.
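A sketch of this data-generation procedure is below, under the assumption (ours) that each simulated user's ideal point is a Dirichlet-weighted convex combination of the true prototypes and that \(f^\star\) is a placeholder identity map; dimensions and counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K_star, num_users, n = 16, 3, 100, 50   # illustrative sizes

# true prototypes ~ N(0, (1/d) I); users mix them with Dirichlet weights (our assumption)
prototypes = rng.normal(scale=1.0 / np.sqrt(d), size=(K_star, d))
weights = rng.dirichlet(np.ones(K_star), size=num_users)
users = weights @ prototypes

def f_star(v):
    # placeholder for the true representation map; identity for illustration
    return v

samples = []
for i in range(num_users):
    for _ in range(n):
        x_l, x_r = rng.normal(size=(2, d))
        a = users[i]
        # label follows the sign convention in the text above
        label = np.sign(np.linalg.norm(f_star(x_l) - f_star(a))
                        - np.linalg.norm(f_star(x_r) - f_star(a)))
        samples.append((x_l, x_r, i, label))
```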
The Anthropic Personas dataset contains a collection of personalities or personas, each associated with 500 statements that align with the persona and 500 statements that do not.
We create a heterogeneous preference dataset by sampling pairs of persona statements and imitating the preference choices of subpopulation groups with diverse personality preferences.
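As a rough illustration of this construction (the persona names, group assignments, and record format below are placeholders, not the released dataset schema):

```python
import random

random.seed(0)

# toy stand-in for the Personas data: persona -> (aligned statements, misaligned statements)
personas = {
    "persona_A": (["aligned_A1", "aligned_A2"], ["misaligned_A1", "misaligned_A2"]),
    "persona_B": (["aligned_B1", "aligned_B2"], ["misaligned_B1", "misaligned_B2"]),
}

def make_pairs(persona, group_id, num_pairs):
    """Pair an aligned with a misaligned statement; the group 'prefers' the
    aligned one, which injects heterogeneity across subpopulation groups."""
    aligned, misaligned = personas[persona]
    pairs = []
    for _ in range(num_pairs):
        pos, neg = random.choice(aligned), random.choice(misaligned)
        if random.random() < 0.5:  # randomize left/right placement
            pairs.append({"left": pos, "right": neg, "group": group_id, "label": "left"})
        else:
            pairs.append({"left": neg, "right": pos, "group": group_id, "label": "right"})
    return pairs

dataset = make_pairs("persona_A", 0, 100) + make_pairs("persona_B", 1, 100)
```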
Results:
(a) On the heterogeneous persona dataset, as the number of learned prototypes approaches the number of true prototypes, i.e. \(K \to K^\star\), the seen accuracy increases to \(100\%\) given a sufficient number of users and comparisons per user; (b) as we collect more comparisons per seen user \(n_p\), PAL eventually saturates at \(100\%\) accuracy (when \(K \geq K^\star = 3\)).
The Pick-a-Pic dataset is a large, crowdsourced open dataset of human preferences over text-to-image generation, designed to align pre-trained models with human preferences.
We construct the Pick-a-Filter dataset by assuming two subpopulation groups that prefer warm (red) or cool (blue) tones. We apply simple color filters to Pick-a-Pic V1 to semi-synthetically inject this heterogeneity.
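One way such filters could be implemented (a sketch with PIL; the overlay colors, blend strength, and group assignment are our assumptions, not the exact Pick-a-Filter recipe):

```python
from PIL import Image

def apply_tone(image: Image.Image, tone: str, strength: float = 0.3) -> Image.Image:
    """Blend an image toward a warm (red) or cool (blue) overlay."""
    overlay_color = (255, 80, 0) if tone == "warm" else (0, 80, 255)
    overlay = Image.new("RGB", image.size, overlay_color)
    return Image.blend(image.convert("RGB"), overlay, strength)

# e.g. one group's preferred images get the warm filter, the other group's the cool filter:
# warm_img = apply_tone(Image.open("example.jpg"), "warm")
# cool_img = apply_tone(Image.open("example.jpg"), "cool")
```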
Results:
PAL enables learning beyond a single universal preference (\(K > 1\)) and identifies diverse user preference groups. We observe that PAL significantly outperforms the homogeneous reward model in predicting user preferences: at a mixture ratio of 1, PAL achieves \(95.2\%\) test accuracy, compared to \(75.4\%\) for the homogeneous reward model (\(K=1\)).
@article{chen2024pal,
title={PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences},
author={Chen, Daiwei and Chen, Yi and Rege, Aniket and Vinayak, Ramya Korlakai},
journal={arXiv preprint arXiv:2406.08469},
year={2024}
}
This work was supported by NSF grants NCS-FO 2219903 and NSF CAREER Award CCF 2238876.
Usage and License Notice: The data, code and model checkpoints are intended for research use.