Salient Object Ranking via Cyclical Perception-Viewing Interaction Modeling

Guo, Rongjin; Xu, Ke; Lau, Rynson W. H.

Rongjin Guo, Ke Xu^*, Rynson W. H. Lau^*

City University of Hong Kong
ICLR 2026
^*Corresponding authors

Overview of the cyclical perception–viewing interaction framework

We model an explicit cyclical interaction between scene understanding (perception) and attention shift (viewing), enabling iterative refinement of saliency ranks guided by evolving captions.

Qualitative comparisons of SeqRank, DSGNN, PoseSOR, and our method on representative examples.

Abstract

Salient Object Ranking (SOR) aims to predict how human attention shifts across the salient objects in a scene. Although a number of methods have been proposed for this task, they typically rely on modeling the bottom-up influence of image features on attention shifts. In this work, we observe that when free-viewing an image, humans instinctively browse its objects in an order that maximizes their contextual understanding of the image. This implies a cyclical interaction between understanding the content (or story) of the image and shifting attention over it. Based on this observation, we propose a novel SOR approach that explicitly models this top-down cognitive pathway with two modules: a story prediction (SP) module and a guided ranking (GR) module. By formulating content understanding as an image caption generation task, the SP module learns to generate and complete image captions conditioned on the salient object queries of the GR module, while the GR module learns to detect salient objects and predict their viewing order, guided by the SP module. Extensive experiments on SOR benchmarks demonstrate that our approach outperforms state-of-the-art SOR methods.

Model Architecture

Our framework models a top-down cognitive pathway through a cyclical interaction between scene understanding (perception) and attention shift (viewing). It consists of two modules: (i) a Story Prediction (SP) module that performs caption generation/completion to capture high-level contextual understanding, and (ii) a Guided Ranking (GR) module that detects salient objects and predicts their viewing order, guided by the evolving captions from SP. The two modules are executed iteratively to refine both captions and saliency ranks.

Overview of our cyclical perception–viewing interaction modeling framework.
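
The cyclical interaction described in this section can be illustrated with a small toy sketch. This is not the authors' implementation: the learned SP and GR modules are replaced by hand-written stand-ins, the "caption" is just a list of object names, and the `relatedness` table, the `bottom_up` scores, and the greedy one-object-per-iteration loop are all hypothetical simplifications chosen for illustration. It only shows the control flow in which each ranking step is guided by the current caption and each caption update is conditioned on the newly viewed object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalientObject:
    name: str
    bottom_up: float  # toy stand-in for an image-feature saliency score


def relatedness(a, b, table):
    """Symmetric lookup in a hypothetical object-relatedness table."""
    return table.get((a, b), table.get((b, a), 0.0))


def story_prediction(caption, newly_viewed):
    """Toy SP step: extend the running caption with the newly viewed object."""
    return caption + [newly_viewed.name]


def guided_ranking(candidates, caption, table):
    """Toy GR step: choose the next object, guided by the current caption."""
    def score(obj):
        # Top-down term: how much this object relates to the story so far.
        context = sum(relatedness(obj.name, word, table) for word in caption)
        return obj.bottom_up + context
    return max(candidates, key=score)


def cyclical_rank(objects, table):
    """Alternate GR (viewing) and SP (perception) until all objects are ranked."""
    caption, order = [], []
    remaining = list(objects)
    while remaining:
        nxt = guided_ranking(remaining, caption, table)  # viewing
        remaining.remove(nxt)
        order.append(nxt.name)
        caption = story_prediction(caption, nxt)         # perception
    return order


# Hypothetical scene: scores and relatedness values are made up.
objects = [
    SalientObject("person", 0.9),
    SalientObject("dog", 0.5),
    SalientObject("tree", 0.45),
    SalientObject("ball", 0.4),
]
table = {("person", "ball"): 0.3, ("ball", "dog"): 0.3}

ranking = cyclical_rank(objects, table)
bottom_up_only = [
    o.name for o in sorted(objects, key=lambda o: o.bottom_up, reverse=True)
]

print(ranking)         # ['person', 'ball', 'dog', 'tree']
print(bottom_up_only)  # ['person', 'dog', 'tree', 'ball']
```

In this toy scene, the caption-guided loop promotes the ball ahead of the dog and the tree because it relates to the person already in the story, whereas ranking by the bottom-up scores alone leaves it last.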

Results

Visual comparison between the results of our method and those of eight state-of-the-art methods. Our method produces more faithful salient object rankings.