Science / Thu, 11 Jun 2026 Nature

Language-assisted multimodal convolutional transformer pipeline for retinal lesions segmentation

Retinal lesion segmentation is a critical task in the analysis of retinal diseases. Many researchers have proposed deep-learning models to extract lesions from retinal scans. However, these models often rely on image features that may not be clinically meaningful for identifying retinal lesions. Additionally, they need pixel-level ground truths, which are difficult to procure in practice. To overcome these issues, we present a novel language-assisted multimodal convolutional transformer pipeline that aligns image features with text features. The text features are extracted from prompts that contain clinically meaningful information about the retinal lesions, and the image features are generated from the retinal scans. This alignment is established through one-time training with the proposed loss function. Afterward, the proposed network can robustly extract retinal lesions across different datasets at the inference stage. Moreover, because the proposed network takes its guidance from the text prompts, it does not, unlike state-of-the-art methods, require additional training rounds with pixel-level ground-truth annotations to adapt to new datasets. The proposed network is thoroughly tested on six public datasets, and it outperforms the state of the art, achieving improvements of up to 7.77% in intersection-over-union.
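
For readers unfamiliar with the reported metric, intersection-over-union (IoU) for a predicted lesion mask P and a ground-truth mask G is |P ∩ G| / |P ∪ G|. The following is a minimal sketch of that computation on binary masks, not the authors' evaluation code. The function name `intersection_over_union`, the toy masks, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
def intersection_over_union(pred, truth):
    """IoU between two binary masks given as equally sized 2D lists of 0/1."""
    same_rows = len(pred) == len(truth)
    if not same_rows or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same shape")

    intersection = 0  # pixels marked as lesion in both masks
    union = 0         # pixels marked as lesion in at least one mask
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1

    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if union == 0 else intersection / union


# Hypothetical 3x3 masks (1 = lesion pixel, 0 = background).
predicted = [[0, 1, 1],
             [0, 1, 0],
             [0, 0, 0]]
reference = [[0, 1, 0],
             [0, 1, 1],
             [0, 0, 0]]

iou = intersection_over_union(predicted, reference)
```

In this toy case the masks share two lesion pixels and cover four pixels between them, so the IoU is 0.5.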