SAM-Adapter: Finetune Your Own Segment Anything Model in an Effective and Efficient Manner

AI research has witnessed a paradigm shift toward models trained on vast amounts of data at scale. These models, known as foundation models, include BERT, DALL-E, and GPT, and have shown promising results in many language and vision tasks. Among them, Segment Anything (SAM) holds a distinct position as a generic image segmentation model trained on a large visual corpus. SAM has demonstrated strong segmentation capabilities in diverse scenarios, making it a groundbreaking step for image segmentation and related fields of computer vision.
SAM has also been applied extensively to other tasks, such as relation extraction, object tracking, and image editing, which underscores its importance in computer vision.

However, computer vision encompasses a broad spectrum of problems, and SAM, like other foundation models, has evident gaps. Its training data cannot cover every kind of image, and working scenarios vary, so it may fail on some special types of images and tasks. In this study, the authors first test SAM on challenging low-level structural segmentation tasks, including camouflaged object detection (concealed scenes) and shadow detection. They find that the SAM model trained on general images cannot perfectly “Segment Anything” in these cases.

This raises a crucial research question: how can the capabilities that large models acquire from massive corpora be harnessed to benefit downstream tasks? In this work, researchers from Singapore University of Technology and Design, Motion Tech, and Zhejiang University introduce SAM-Adapter as a solution to this problem. This pioneering work is the first attempt to adapt the large pre-trained image segmentation model SAM to specific downstream tasks with enhanced performance. As its name suggests, SAM-Adapter is a simple yet effective adaptation technique that leverages internal knowledge and external control signals. It is a lightweight model that can learn alignment from a relatively small amount of data, and it serves as an additional network that injects task-specific guidance information drawn from the samples of that task. This information is conveyed to the network through visual prompts, which have been shown to be efficient and effective in adapting a frozen large foundation model to many downstream tasks with a minimal number of additional trainable parameters. The proposed method is: (1) Generalizable: SAM-Adapter can be directly applied to customized datasets of various tasks to enhance performance with the assistance of SAM. (2) Composable: It is effortless to combine multiple explicit conditions to fine-tune SAM with multi-condition control.
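The idea of training only a small adapter while the large model stays frozen can be illustrated with a toy sketch. This is not the authors' implementation and does not use SAM itself: the 2x2 "backbone" layer, the linear adapter, the synthetic data, and all names (FROZEN_W, forward, make_batch, train) are assumptions made purely for illustration. The adapter turns task-specific information into a prompt that is added to the features before the frozen layer, and gradient descent updates the adapter weights only.

```python
import random

# Hypothetical frozen "backbone" layer (stand-in for a pre-trained model).
# Its weights are never updated in this sketch.
FROZEN_W = [[0.5, -0.2], [0.1, 0.8]]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def forward(features, task_info, adapter_w):
    """Add an adapter-generated prompt to the features, then apply the frozen layer."""
    prompt = matvec(adapter_w, task_info)
    prompted = [f + p for f, p in zip(features, prompt)]
    return matvec(FROZEN_W, prompted)


def loss_and_grad(batch, adapter_w):
    """Mean squared error and its gradient with respect to the adapter weights only."""
    n_rows, n_cols = len(adapter_w), len(adapter_w[0])
    grad = [[0.0] * n_cols for _ in range(n_rows)]
    total = 0.0
    for features, task_info, target in batch:
        out = forward(features, task_info, adapter_w)
        err = [o - t for o, t in zip(out, target)]
        total += sum(e * e for e in err)
        # Backpropagate through the frozen layer: dL/dprompt = 2 * W^T err
        d_prompt = [
            2 * sum(FROZEN_W[k][i] * err[k] for k in range(len(err)))
            for i in range(n_rows)
        ]
        for i in range(n_rows):
            for j in range(n_cols):
                grad[i][j] += d_prompt[i] * task_info[j]
    n = len(batch)
    return total / n, [[g / n for g in row] for row in grad]


def make_batch(n, seed=0):
    """Synthetic task data whose targets are reachable by some adapter."""
    rng = random.Random(seed)
    true_adapter = [[1.0, -0.5], [0.3, 0.7]]  # made-up values for the toy task
    batch = []
    for _ in range(n):
        features = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        task_info = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        target = forward(features, task_info, true_adapter)
        batch.append((features, task_info, target))
    return batch


def train(batch, steps=500, lr=0.5):
    """Plain gradient descent on the adapter weights; the backbone is untouched."""
    # A zero adapter means a zero prompt, so training starts from the frozen model's output.
    adapter_w = [[0.0, 0.0], [0.0, 0.0]]
    initial_loss, _ = loss_and_grad(batch, adapter_w)
    for _ in range(steps):
        _, grad = loss_and_grad(batch, adapter_w)
        adapter_w = [
            [w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(adapter_w, grad)
        ]
    final_loss, _ = loss_and_grad(batch, adapter_w)
    return adapter_w, initial_loss, final_loss


if __name__ == "__main__":
    adapter_w, initial_loss, final_loss = train(make_batch(32))
    print("loss before:", initial_loss, "loss after:", final_loss)
```

In this toy setting the only trainable parameters are the four adapter weights, which mirrors, at a much smaller scale, the goal of adapting a frozen model with few additional parameters.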

To verify the effectiveness of SAM-Adapter, the researchers performed extensive experiments on multiple datasets and tasks, including shadow detection, camouflaged object detection, and polyp segmentation (medical image segmentation). Benefiting from the capability of SAM and the proposed SAM-Adapter, the method achieves state-of-the-art (SOTA) performance on these tasks. Some visualization results showing the effectiveness of SAM-Adapter are presented in the figure below.

Looking ahead, the authors believe that SAM-Adapter, as a universal framework, can be applied to a wide array of downstream segmentation tasks in various fields, including medical imaging diagnosis, agriculture, and industrial inspection. The authors have released the full code of SAM-Adapter and look forward to more users applying the method in their own research and work, collectively advancing the wider application of artificial intelligence technology.