This is an application at the intersection of vision and natural language. Whether we compare image with image, image with text, or text with text, the task ultimately reduces to measuring the similarity of feature vectors.
Computing image-text similarity amounts to scoring how accurately a piece of text describes an image, which is very useful in visual-understanding areas such as Image Caption, Video Caption, and VQA.
Source of the code in this post: https://github.com/hila-chefer/Transformer-MM-Explainability/tree/main/CLIP
Official site: https://openai.com/blog/clip/
As the flow chart in the official post shows, computing image-text similarity works by mapping images and text into a shared feature space and then, through metric learning, pulling the features of an image closer to the features of text with the same semantics.
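As a rough sketch of that idea (this is not CLIP's actual training code; the batch size, embedding dimension, and temperature below are illustrative), the symmetric contrastive objective over a batch of paired image/text embeddings can be written as:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize so the dot product equals cosine similarity
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    # [N, N] similarity matrix; entry (i, j) compares image i with text j
    logits = img_emb @ txt_emb.t() / temperature
    # matching image-text pairs sit on the diagonal
    targets = torch.arange(len(img_emb))
    # symmetric cross-entropy: pick the right text for each image, and vice versa
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# toy batch of 4 pairs with 512-dim embeddings
loss = clip_style_contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
```

Minimizing this loss pulls matching image and text features together while pushing mismatched pairs apart, which is what makes the cosine similarity usable as a relevance score at inference time.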
(OK, that's already taken more than two minutes... let's skip the theory and just run the code; it works, and that's enough.)
Experiment environment:
cuda 10.0.130
torch 1.7.1
torchvision 0.8.1
ftfy, regex, tqdm
Project setup: create an empty PyCharm project and put all the files in the project directory. Download the pretrained model from https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt and put it in the project directory as well.
Run the code: create a main.py file and copy the code below into it. Barring surprises, it should run without errors.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# load the downloaded checkpoint (clip.load also accepts a model name like "ViT-B/32")
model, preprocess = clip.load("/home/cbl/caoyi/wq/ViT-B-32.pt", device=device)

image = preprocess(Image.open("/home/cbl/caoyi/wq/1.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog is running", "a people is jump", "some people are running"]).to(device)
# Alternative test cases:
#image = preprocess(Image.open("/home/cbl/caoyi/wq/2.jpeg")).unsqueeze(0).to(device)
#text = clip.tokenize(["a man is singing", "a man is eating", "a woman is eating"]).to(device)
#image = preprocess(Image.open("/home/cbl/caoyi/wq/3.jpeg")).unsqueeze(0).to(device)
#text = clip.tokenize(["a dog is runnning", "a man is running", "a man is walking the dogs"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)
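The probs printed above come from a softmax over the image-text logits. If you want the raw cosine similarities instead, you can normalize the feature vectors yourself. A minimal sketch with random stand-in tensors (in the real script you would reuse the image_features and text_features computed above; the scale factor 100 approximates CLIP's learned temperature):

```python
import torch

# stand-ins for model.encode_image / model.encode_text outputs (1 image, 3 texts)
image_features = torch.randn(1, 512)
text_features = torch.randn(3, 512)

# L2-normalize so the dot product is cosine similarity in [-1, 1]
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
cosine_sim = image_features @ text_features.t()   # shape [1, 3]

# CLIP's logits are these similarities scaled by a learned temperature (around 100),
# so the softmax turns them into relative probabilities over the candidate texts
probs = (100.0 * cosine_sim).softmax(dim=-1)
print(cosine_sim)
print(probs)
```

Unlike the softmax probabilities, the cosine values are comparable across different images and caption sets, which is handy if you want an absolute relevance score rather than a ranking.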
Test results:

Image 1 (1.jpg):
"a dog is running": 0.1931
"a people is jump": 0.29
"some people are running": 0.517

Image 2 (2.jpeg):
"a man is singing": 0.003326
"a man is eating": 0.9517
"a woman is eating": 0.045

Image 3 (3.jpeg):
"a dog is runnning": 0.11115
"a man is running": 0.001089
"a man is walking the dogs": 0.8877

Each value is the softmax probability of that caption relative to the other two candidates for the same image, not an absolute similarity score.