Aesthetics are Semantically Linear: Learning Personalized Image Aesthetics with Linear Regression in Subjective Space

Jasper Gilley
April 2025

Abstract

I've long been fascinated by the task of learning personal image preferences: predicting subjective aesthetic ratings given to a group of images by a single user. Previous content-based approaches involve learning CLIP-derived representations of image features, or training general aesthetic preference models and adapting them to individual preferences. In this work, I show that it is possible to “discover” semantic attributes that linearly separate user preferences using VLMs. A group of 5-10 attributes, carefully selected, can predict user preferences demonstrably better than previous approaches (predictions are correlated with user ratings with a Spearman correlation coefficient 𝜌 of 0.75, prior SOTA 𝜌 is 0.66.) I contend that this suggests that single-user aesthetic preferences can generally be understood as being linear within a single data domain.

Example Image and Attribute Scores

Trained on preferences of rater A14W0IW2KGR80K

Example aesthetic image: landscape with trees and sky

VLM-Generated Scores for Discovered Attributes:

Attribute	Score (0-1)	Coefficient
Absence of prominent people	1.0	0.2803
Primarily landscape or nature scenes	1.0	0.8043
Depicts group activities or social events	0.0	-0.2226
High photographic quality	1.0	0.0298
Scenic or aesthetically pleasing subjects	1.0	1.6743
Absence of significant foreground obstruction	1.0	0.1397
Features natural elements or poignant human subjects	1.0	0.3020
Unconventional or dramatic perspective	0.0	0.0599
Amateurish or snapshot quality	0.4	-1.2130
Depicts a place or atmosphere	1.0	0.2263

(Intercept: 1.6623)

True User Rating: 5

Predicted Rating: 4.63

Methods

My approach was broadly motivated by recent rapid improvements in the cost and quality of available Vision Language Models (VLMs) capable of taking a combination of images and detailed instructions as input. While in principle this approach is modality-agnostic, image aesthetic preferences are a convenient starting point because:

Individual images represent a self-contained "complete" aesthetic unit expressible in a few hundred tokens, whereas complete aesthetic units in the modalities of text or video can get into the hundreds of thousands of tokens.
VLMs are capable of making detailed subjective assessments about images, in a way that they qualitatively still struggle with in the domain of e.g. music or audio.

I specifically chose to use the Gemini series of models as the base VLM for this experiment because they are among the cheapest available while still remaining performant.

My algorithm can be described as follows:

While true:

Attribute proposal (using Gemini 2.5 Pro) – given ≤ 4 liked and ≤ 4 disliked examples, return one new phrase that intuitively separates the "liked" from "disliked" examples. Ensure that the new attribute is sufficiently orthogonal to earlier ones.
Attribute scoring (using Gemini 2.5 Flash) – score every image ∈ train ∪ valid ∪ test on [0, 1].
Model fitting – append the new column to the feature matrix and refit a linear regression.
Validation check – compute Spearman 𝜌 on the validation split. If 𝜌 improves, keep the axis; otherwise discard it.
Backwards elimination (optional) – if the new axis improves 𝜌 on the validation set, try pruning previously existing axes to see if the new axis makes them redundant

This process can be repeated indefinitely with a guarantee of strictly increasing performance on the validation set.

The Flickr-AES dataset, introduced in Personalized Image Aesthetics (Ren & Foran, 2017), is ideal for testing this method, and has been used by most treatments of the subject since its release. This dataset consists of 40,000 public domain images from image sharing site Flickr. 210 Amazon Mechanical Turk workers were asked to rate their preferences for on these images, on a scale of integers 1-5.

Results

I and others tackling this literature have used the Spearman correlation coefficient 𝜌 between predicted user ratings and actual user ratings in a held-out test set when evaluating the efficacy of personalized aesthetic models at predicting individual user tastes. Here's how my results compare to prior treatments:

Method	Spearman 𝜌 on Flickr-AES
Ren & Foran (2017)	0.516
Zhu & Li (2020)	0.561
Li & Yang (2022)	0.667
Yun & Choo (2024)	0.668
Mine	0.75

Discussion

I'm excited by these results because they suggest that recent advancements in mixed modality language models may lead to significantly improved content-based filtering for general-purpose content recommendation. The strategy of generate natural language descriptors → evaluate content according to them → combine using simple heuristics is extremely extensible and should be applied to other data domains promptly.

It should also be noted that all prior treatments of the PIAA task have trained models that rely on learned features in latent space rather than natural language space. I find the approach of learning natural language features compelling not least because it naturally lends itself to interpretability (natural language descriptions of a user's personal aesthetic taste, along with coefficients that empirically describe the importance of those descriptions.)

A plausible explanation for why nonlinear heads tacked onto CLIP (or other deep aesthetic encoders) have not eclipsed a simple linear combination of VLM-generated semantic attributes is that CLIP’s contrastive pre-training already warps the image manifold so that most high-level, linguistically describable concepts lie along approximately linear directions. Empirically, both linear probes and sparse linear concept decompositions recover surprisingly clean semantic axes inside CLIP’s embedding space, implying that the representation has been “pre-factorised” by the text-alignment objective (see: CLIP knows image aesthetics, Interpreting CLIP with Sparse Linear Concept Embeddings). When we then fine-tune a high-capacity nonlinear regressor on a per-user subset of only a few hundred rated images, two problems emerge:

Data-to-parameter mismatch & overfitting. A nonlinear head can in theory model higher-order interactions, but with <1 k training points per user it mostly learns to chase noise or idiosyncrasies of the training split, hurting generalisation. A sparse linear model, by contrast, has just enough capacity to capture the dominant directions of preference without overfitting.
Semantic entanglement. Because CLIP’s dense vectors intertwine multiple visual factors, nonlinear networks often need to disentangle those factors implicitly before weighting them. Our attribute-generation stage outsources that disentanglement to the VLM: the model proposes attributes in natural language, scores them over the entire corpus, and returns a nearly orthogonal basis that is already human-interpretable. Once the space is expressed in these semantically pure coordinates, a linear regressor is not a restriction but the natural aggregator of additive preferences.

Put differently, nonlinear heads shine when signal really does live in complex interactions, but for PIAA the evidence suggests that user preference is well-approximated by a low-rank, additive model in a semantically aligned space. The modest gains reported by recent nonlinear approaches (≈ 0.56 → 0.67 𝜌) likely reflect representation drift and regularisation tricks rather than genuine exploitation of higher-order terms. Our 0.75 𝜌 therefore supports the hypothesis that, once the right semantic basis is exposed, adding depth offers diminishing—and sometimes negative—returns.

It should also be noted that the cost of my approach is likely well in excess of prior treatments. "Training" a regression model that achieves an SROCC of >0.7 on a single user's ratings requires sending something on the order of several million tokens to the VLM, at a cost of about $1 at current Gemini API pricing. However, I believe this is acceptable, as VLM API prices are constantly falling, and this project aims to serve as a proof-of-concept for "brute-forcing" user preferences.

Jasper Gilley
Twitter
Github