Publications | Seoyeon Kim

2023

Extending CLIP’s Image-Text Alignment to Referring Image Segmentation

Seoyeon Kim, Minguk Kang, Dongwon Kim, Jaesik Park, Suha Kwak

In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024

Abstract

Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression. Recent methods leverage large-scale pretrained unimodal models as backbones along with fusion techniques for joint reasoning across modalities. However, the inherent cross-modal nature of RIS raises questions about the effectiveness of unimodal backbones. We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS. Observing CLIP's inherent alignment between image and text features, we capitalize on this starting point and introduce simple but strong modules that enhance unimodal feature extraction and leverage rich alignment knowledge in CLIP's image-text shared-embedding space. RISCLIP exhibits outstanding results on all three major RIS benchmarks and also outperforms previous CLIP-based methods, demonstrating the efficacy of our strategy in extending CLIP's image-text alignment to RIS.

PAM: Patch Aware Matching for Vision Transformer based Self-Supervised Learning Frameworks

Seoyeon Kim, Minguk Kang, and Jaesik Park

34th Workshop on Image Processing and Image Understanding (IPIU), Bronze Award, 2023

Abstract

Self-supervised learning (SSL) learns meaningful representations from large unlabeled datasets and then fine-tunes those representations for target downstream tasks. Recently, many SSL methods that pair a Vision Transformer (ViT) based teacher-student framework with different losses have proven effective, even to the point of outperforming their supervised counterparts trained with labeled data. Building on this ViT-based teacher-student framework, we propose a new loss named “Patch Aware Matching (PAM)”, which performs patch-token feature distillation between the local-view features output by the student and the global-view features output by the teacher. We hypothesize that this approach learns “local-to-global” correspondence along with local, structural information at the patch level. When trained and evaluated under the k-NN protocol with a ViT-Small/16 backbone, our approach outperforms state-of-the-art methods on the 100 easiest classes of ImageNet-1K but falls behind on 10% of ImageNet. We also observe interesting results suggesting that the easier classes of ImageNet-1K are more sample efficient. As this is a work in progress, we aim to further develop our method and investigate the sample efficiency of different ImageNet-1K classes.