feat: add BLIP support in TransformersImageToText #4912

Merged: 3 commits, May 16, 2023
15 changes: 8 additions & 7 deletions haystack/nodes/image_to_text/transformers.py
@@ -17,7 +17,11 @@

# supported models classes should be extended when HF image-to-text pipeline will support more classes
# see https://github.com/huggingface/transformers/issues/21110
- SUPPORTED_MODELS_CLASSES = ["VisionEncoderDecoderModel"]
+ SUPPORTED_MODELS_CLASSES = [
+ "VisionEncoderDecoderModel",
+ "BlipForConditionalGeneration",
+ "Blip2ForConditionalGeneration",
+ ]

UNSUPPORTED_MODEL_MESSAGE = (
f"The supported classes are: {SUPPORTED_MODELS_CLASSES}. \n"
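
For readers who want to check a model before passing it to the node, here is a minimal sketch of comparing a model's declared architecture against the extended list. It is not the node's actual validation logic: `is_supported_architecture` and the two sample config strings are hypothetical, and the only thing taken from the diff is the content of `SUPPORTED_MODELS_CLASSES`. It assumes the config is the JSON text of a Hugging Face `config.json` with an `architectures` list.

```python
import json

# Copied from the new version of the list in this diff.
SUPPORTED_MODELS_CLASSES = [
    "VisionEncoderDecoderModel",
    "BlipForConditionalGeneration",
    "Blip2ForConditionalGeneration",
]


def is_supported_architecture(config_json: str) -> bool:
    """Hypothetical helper: True if any declared architecture is in the supported list."""
    config = json.loads(config_json)
    # `architectures` holds model class names; treat a missing field as "none declared".
    architectures = config.get("architectures") or []
    return any(name in SUPPORTED_MODELS_CLASSES for name in architectures)


# Illustrative config snippets, not downloaded from the Hub.
blip_config = '{"architectures": ["BlipForConditionalGeneration"]}'
bert_config = '{"architectures": ["BertForQuestionAnswering"]}'

print(is_supported_architecture(blip_config))
print(is_supported_architecture(bert_config))
```

With the list as extended here, the BLIP config passes the check and the BERT question-answering config does not.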
@@ -33,8 +37,6 @@ class TransformersImageToText(BaseImageToText):
"""
A transformer-based model to generate captions for images using the Hugging Face's transformers framework.

- Currently, this node supports `VisionEncoderDecoderModel` models.
-
**Example**

```python
@@ -64,7 +66,7 @@ class TransformersImageToText(BaseImageToText):

def __init__(
self,
- model_name_or_path: str = "nlpconnect/vit-gpt2-image-captioning",
+ model_name_or_path: str = "Salesforce/blip-image-captioning-base",
model_version: Optional[str] = None,
generation_kwargs: Optional[dict] = None,
use_gpu: bool = True,
@@ -74,15 +76,14 @@ def __init__(
devices: Optional[List[Union[str, torch.device]]] = None,
):
"""
- Load a `VisionEncoderDecoderModel` model from transformers.
+ Load an Image-to-Text model from transformers.

:param model_name_or_path: Directory of a saved model or the name of a public model.
- Currently, only `VisionEncoderDecoderModel` models are supported.
To find these models:
1. Visit [Hugging Face image to text models](https://huggingface.co/models?pipeline_tag=image-to-text).
2. Open the model you want to check.
3. On the model page, go to the "Files and Versions" tab.
- 4. Open the `config.json` file and make sure the `architectures` field contains `VisionEncoderDecoderModel`.
+ 4. Open the `config.json` file and make sure the `architectures` field contains `VisionEncoderDecoderModel`, `BlipForConditionalGeneration`, or `Blip2ForConditionalGeneration`.
:param model_version: The version of the model to use from the Hugging Face model hub. This can be the tag name, branch name, or commit hash.
:param generation_kwargs: Dictionary containing arguments for the `generate()` method of the Hugging Face model.
See [generate()](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationMixin.generate) in Hugging Face documentation.
9 changes: 0 additions & 9 deletions test/nodes/test_image_to_text.py
@@ -91,12 +91,3 @@ def test_image_to_text_unsupported_model_after_loading():
match="The model 'deepset/minilm-uncased-squad2' \(class 'BertForQuestionAnswering'\) is not supported for ImageToText",
):
_ = TransformersImageToText(model_name_or_path="deepset/minilm-uncased-squad2")
-
-
- @pytest.mark.integration
- def test_image_to_text_unsupported_model_before_loading():
- with pytest.raises(
- ValueError,
- match=r"The model '.*' \(class '.*'\) is not supported for ImageToText. The supported classes are: \['VisionEncoderDecoderModel'\]",
- ):
- _ = TransformersImageToText(model_name_or_path="Salesforce/blip-image-captioning-base")