I've long been fascinated by the task of learning personal image preferences: predicting the subjective aesthetic ratings a single user gives to a group of images. Previous content-based approaches involve learning CLIP-derived representations of image features, or training general aesthetic preference models and adapting them to individual preferences. In this work, I show that it is possible to “discover” semantic attributes that linearly separate user preferences using VLMs. A group of 5-10 carefully selected attributes can predict user preferences demonstrably better than previous approaches (predictions correlate with user ratings with a Spearman correlation coefficient 𝜌 of 0.75, versus a prior SOTA 𝜌 of 0.668). I contend that this suggests single-user aesthetic preferences can generally be understood as being linear within a single data domain.
*Figure: trained on the preferences of rater A14W0IW2KGR80K.*
My approach was broadly motivated by recent rapid improvements in the cost and quality of available Vision Language Models (VLMs) capable of taking a combination of images and detailed instructions as input. While in principle this approach is modality-agnostic, image aesthetic preferences are a convenient starting point because:
I specifically chose to use the Gemini series of models as the base VLM for this experiment because they are among the cheapest available while still remaining performant.
My algorithm can be described as follows:

While true:

1. Prompt the VLM with a sample of the user's rated images (and any attributes accepted so far) and ask it to propose a new natural-language attribute that might explain the ratings.
2. Ask the VLM to score every training and validation image against the candidate attribute.
3. Fit a linear regression from the attribute scores to the user's ratings.
4. Keep the candidate only if it improves Spearman 𝜌 on the validation set; otherwise discard it.
This process can be repeated indefinitely with a guarantee of monotonically non-decreasing performance on the validation set, since a candidate attribute is only accepted when it improves validation 𝜌.
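A minimal sketch of this loop in Python, under the assumption of a `vlm` helper object whose `propose_attribute` and `score_image` methods are hypothetical wrappers around Gemini prompt templates (not real API calls):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

def discover_attributes(vlm, train, val, n_rounds=30):
    """Greedily accumulate natural-language attributes that improve
    validation Spearman rho. `train`/`val` are lists of (image, rating)."""
    y_tr = np.array([r for _, r in train])
    y_va = np.array([r for _, r in val])
    attrs, best_rho = [], -1.0

    for _ in range(n_rounds):
        # 1. Propose a new candidate attribute, given the rated images
        #    and the attributes already accepted.
        candidate = vlm.propose_attribute(train, existing=attrs)
        trial = attrs + [candidate]

        # 2. Score every image against every attribute (in practice,
        #    scores for already-accepted attributes would be cached).
        X_tr = np.array([[vlm.score_image(im, a) for a in trial] for im, _ in train])
        X_va = np.array([[vlm.score_image(im, a) for a in trial] for im, _ in val])

        # 3. Combine attribute scores with a plain linear regression.
        reg = LinearRegression().fit(X_tr, y_tr)
        rho, _ = spearmanr(reg.predict(X_va), y_va)

        # 4. Greedy acceptance: keep the candidate only if validation
        #    rho improves, so performance never decreases.
        if rho > best_rho:
            attrs, best_rho = trial, rho

    return attrs, best_rho
```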
The Flickr-AES dataset, introduced in Personalized Image Aesthetics (Ren et al., 2017), is ideal for testing this method, and has been used by most treatments of the subject since its release. The dataset consists of 40,000 public domain images from the image-sharing site Flickr; 210 Amazon Mechanical Turk workers were asked to rate their preferences for these images on an integer scale of 1-5.
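For concreteness, pulling out one rater's labels for the experiments above might look like the sketch below. The CSV path and column names are placeholders; the actual Flickr-AES release ships in its own format.

```python
import pandas as pd

# Placeholder file/columns: one row per (worker, image, score) triple.
ratings = pd.read_csv("flickr_aes_ratings.csv")

rater = ratings[ratings["worker"] == "A14W0IW2KGR80K"]
train = rater.sample(frac=0.7, random_state=0)                   # fit attributes + weights
val = rater.drop(train.index).sample(frac=0.5, random_state=0)   # greedy acceptance
test = rater.drop(train.index).drop(val.index)                   # final Spearman rho
```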
Like others tackling this literature, I evaluate the efficacy of personalized aesthetic models by the Spearman correlation coefficient 𝜌 between predicted user ratings and actual user ratings on a held-out test set. Here's how my results compare to prior treatments:
| Method | Spearman 𝜌 on Flickr-AES |
| --- | --- |
| (prior method) | 0.516 |
| (prior method) | 0.561 |
| (prior method) | 0.667 |
| (prior method) | 0.668 |
| Mine | 0.75 |
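(The metric itself is a single SciPy call; the arrays below are invented purely for illustration.)

```python
import numpy as np
from scipy.stats import spearmanr

predicted = np.array([3.1, 4.6, 2.2, 3.9, 4.1])  # model outputs (hypothetical)
actual = np.array([3, 5, 2, 4, 3])               # the rater's real scores
rho, _ = spearmanr(predicted, actual)            # rank agreement in [-1, 1]
print(f"Spearman rho: {rho:.3f}")
```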
I'm excited by these results because they suggest that recent advancements in mixed modality language models may lead to significantly improved content-based filtering for general-purpose content recommendation. The strategy of generate natural language descriptors → evaluate content according to them → combine using simple heuristics is extremely extensible and should be applied to other data domains promptly.
It should also be noted that all prior treatments of the PIAA task have trained models that rely on learned features in latent space rather than natural language space. I find the approach of learning natural-language features compelling not least because it naturally lends itself to interpretability: the model is literally a list of natural-language descriptions of a user's personal aesthetic taste, along with coefficients that empirically describe the importance of each description.
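To make that concrete, here is what "inspecting" such a model amounts to. The attribute strings and weights below are invented for illustration, not drawn from a real rater:

```python
# Hypothetical output of the attribute-discovery loop for one rater.
attrs = ["features warm, golden-hour lighting",
         "is a posed group portrait",
         "has a clearly isolated subject"]
coefs = [0.85, -0.40, 0.31]  # weights from the fitted linear regression

# The whole "user model" is human-readable: one sentence per feature,
# with an explicit weight giving its empirical importance to this user.
for weight, attribute in sorted(zip(coefs, attrs), key=lambda p: -abs(p[0])):
    print(f"{weight:+.2f}  {attribute}")
```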
A plausible explanation for why nonlinear heads tacked onto CLIP (or other deep aesthetic encoders) have not eclipsed a simple linear combination of VLM-generated semantic attributes is that CLIP’s contrastive pre-training already warps the image manifold so that most high-level, linguistically describable concepts lie along approximately linear directions. Empirically, both linear probes and sparse linear concept decompositions recover surprisingly clean semantic axes inside CLIP’s embedding space, implying that the representation has been “pre-factorised” by the text-alignment objective (see: CLIP knows image aesthetics; Interpreting CLIP with Sparse Linear Concept Embeddings). When we then fine-tune a high-capacity nonlinear regressor on a per-user subset of only a few hundred rated images, two problems emerge: the regressor has far more capacity than a few hundred labels can constrain, so it overfits, and the extra nonlinearity buys little, because the preference signal is already linearly accessible.
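For context on the "linear directions" claim: probing frozen CLIP embeddings with a linear model takes only a few lines. This sketch uses the Hugging Face transformers CLIP wrapper and scikit-learn; `train_paths` and `train_scores` are placeholders for a single rater's images and 1-5 ratings:

```python
import torch
from PIL import Image
from sklearn.linear_model import Ridge
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    # Frozen CLIP image embeddings; no fine-tuning anywhere.
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs).numpy()

# train_paths / train_scores: placeholders for one rater's labeled data.
probe = Ridge(alpha=1.0).fit(embed(train_paths), train_scores)
```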
A final caveat: the cost of my approach likely well exceeds that of prior treatments. "Training" a regression model that achieves an SROCC (Spearman 𝜌) of >0.7 on a single user's ratings requires sending something on the order of several million tokens to the VLM, at a cost of about $1 at current Gemini API pricing. However, I believe this is acceptable: VLM API prices are constantly falling, and this project aims to serve as a proof of concept for "brute-forcing" user preferences.
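A back-of-envelope version of that estimate, where every number below is an assumption rather than a measurement:

```python
images = 500             # rated images used per user (assumed)
rounds = 20              # candidate attributes evaluated (assumed)
tokens_per_call = 400    # image + prompt tokens per scoring call (assumed)
usd_per_m_tokens = 0.15  # cheap-tier VLM input pricing (assumed)

total = images * rounds * tokens_per_call  # ~4M tokens per user
print(f"~{total / 1e6:.0f}M tokens, ~${total / 1e6 * usd_per_m_tokens:.2f}")
```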