Multimodal
Grounded-SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & BLIP & Whisper - Automatically Detect, Segment, and Generate Anything with Image, Text, and Speech Inputs
https://github.com/IDEA-Research/Grounded-Segment-Anything
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
https://github.com/IDEA-Research/GroundingDINO
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
https://github.com/salesforce/BLIP
LAVIS - A Library for Language-Vision Intelligence
https://github.com/salesforce/LAVIS