This is an application at the intersection of vision and natural language. Whether we compare image with image, image with text, or text with text, the task ultimately reduces to measuring the similarity of feature vectors.
Computing image-text similarity amounts to scoring how accurately a piece of text describes an image, which is very useful in visual-understanding areas such as Image Caption, Video Caption, and VQA.
Source of the code in this post: https://github.com/hila-chefer/Transformer-MM-Explainability/tree/main/CLIP
Official site: https://openai.com/blog/clip/
As the flow chart in the official post shows, computing image-text similarity works by mapping images and text into a shared feature space and then, through metric learning, pulling the features of an image closer to the features of text with the same semantics.
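As a rough sketch of that idea (this is not CLIP's actual training code; the batch size, embedding dimension, and temperature below are illustrative), the symmetric contrastive objective over a batch of paired image/text embeddings can be written as:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize so the dot product equals cosine similarity
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    # [N, N] similarity matrix; entry (i, j) compares image i with text j
    logits = img_emb @ txt_emb.t() / temperature
    # matching image-text pairs sit on the diagonal
    targets = torch.arange(len(img_emb))
    # symmetric cross-entropy: pick the right text for each image, and vice versa
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# toy batch of 4 pairs with 512-dim embeddings
loss = clip_style_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
```

Minimizing this loss pulls matching image and text features together while pushing mismatched pairs apart, which is what makes the cosine similarity usable as a relevance score at inference time.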
(OK, that's already taken more than two minutes... let's skip the theory and just run the code; it works, and that's enough.)
Experiment environment:
cuda 10.0.130
torch 1.7.1
torchvision 0.8.1
ftfy, regex, tqdm
Project setup: create an empty PyCharm project and put all the files in the project directory. Download the pretrained model from https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt and put it in the project directory as well.
Run the code: create a main.py file and copy the code below into it. Barring surprises, it should run without errors.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# load the downloaded checkpoint (clip.load also accepts a model name like "ViT-B/32")
model, preprocess = clip.load("/home/cbl/caoyi/wq/ViT-B-32.pt", device=device)

image = preprocess(Image.open("/home/cbl/caoyi/wq/1.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog is running", "a people is jump", "some people are running"]).to(device)
# Alternative test cases:
#image = preprocess(Image.open("/home/cbl/caoyi/wq/2.jpeg")).unsqueeze(0).to(device)
#text = clip.tokenize(["a man is singing", "a man is eating", "a woman is eating"]).to(device)
#image = preprocess(Image.open("/home/cbl/caoyi/wq/3.jpeg")).unsqueeze(0).to(device)
#text = clip.tokenize(["a dog is runnning", "a man is running", "a man is walking the dogs"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)
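The probs printed above come from a softmax over the image-text logits. If you want the raw cosine similarities instead, you can normalize the feature vectors yourself. A minimal sketch with random stand-in tensors (in the real script you would reuse the image_features and text_features computed above; the scale factor 100 approximates CLIP's learned temperature):

```python
import torch

# stand-ins for model.encode_image / model.encode_text outputs (1 image, 3 texts)
image_features = torch.randn(1, 512)
text_features = torch.randn(3, 512)

# L2-normalize so the dot product is cosine similarity in [-1, 1]
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
cosine_sim = image_features @ text_features.t()   # shape [1, 3]

# CLIP's logits are these similarities scaled by a learned temperature (around 100),
# so the softmax turns them into relative probabilities over the candidate texts
probs = (100.0 * cosine_sim).softmax(dim=-1)
print(cosine_sim)
print(probs)
```

Unlike the softmax probabilities, the cosine values are comparable across different images and caption sets, which is handy if you want an absolute relevance score rather than a ranking.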
Test results:

Image 1 (1.jpg):
"a dog is running": 0.1931
"a people is jump": 0.29
"some people are running": 0.517

Image 2 (2.jpeg):
"a man is singing": 0.003326
"a man is eating": 0.9517
"a woman is eating": 0.045

Image 3 (3.jpeg):
"a dog is runnning": 0.11115
"a man is running": 0.001089
"a man is walking the dogs": 0.8877

Each value is the softmax probability of that caption relative to the other two candidates for the same image, not an absolute similarity score.