Documentation improvements #133
Merged

Commits (6):
- 7741586 docs improvements (capjamesg)
- 86e636b add cta to file a feature request (capjamesg)
- 79489a7 A few minor updates to index (paulguerrie)
- 6ec2e3a respond to feedback (capjamesg)
- c140b8a update table of contents (capjamesg)
- 8a27b2f consolidate video, webcam, rtsp pages (capjamesg)

@@ -0,0 +1,34 @@

Foundation models are machine learning models that have been trained on vast amounts of data to accomplish a specific task.

For example, OpenAI trained CLIP, a foundation model. CLIP enables you to classify images. You can also compare the similarity of images and text with CLIP.

The CLIP training process, which was run using over 400 million pairs of images and text, allowed the model to build an extensive range of knowledge that can be applied to a range of domains.

Foundation models are being built for a range of vision tasks, from image segmentation to classification to zero-shot object detection.

Inference supports the following foundation models:

- Gaze (L2CS-Net): Detect the direction in which someone is looking.
- CLIP: Classify images and compare the similarity of images and text.
- DocTR: Read characters in images.
- Grounding DINO: Detect objects in images using text prompts.
- Segment Anything (SAM): Segment objects in images.

All of these models can be used over an HTTP request with Inference, which means you don't need to spend time setting up and configuring each model.
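
As a quick illustration of that HTTP workflow, here is a minimal sketch that requests a CLIP text embedding from a running Inference server. It assumes the server is already running (locally on port 9001, or at infer.roboflow.com), that your Roboflow API key is available in the `API_KEY` environment variable, and uses the CLIP text-embedding endpoint documented later in this section; the prompt text is an arbitrary placeholder.

```python
import os

import requests

# Point at a running Inference server; https://infer.roboflow.com also works
base_url = "http://localhost:9001"

# Roboflow API key, assumed to be set in the environment
api_key = os.environ["API_KEY"]

# Request a CLIP text embedding over HTTP (endpoint covered in the CLIP guide below)
res = requests.post(
    f"{base_url}/clip/embed_text?api_key={api_key}",
    json={"text": "a photo of a forklift"},
)

print(res.json()["embeddings"])
```

The CLIP and DocTR guides in this section use the same pattern: a POST request with a JSON payload and an `api_key` query parameter.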

## How Are Foundation Models Used?

Use cases vary depending on the foundation model with which you are working. For example, CLIP has been used extensively in the field of computer vision for tasks such as:

1. Clustering images to identify groups of similar images and outliers;
2. Classifying images;
3. Moderating image content;
4. Identifying whether two images are too similar or too different, ideal for dataset management and cleaning;
5. Building dataset search experiences;
6. And more.

Grounding DINO, on the other hand, can be used out of the box to detect a range of objects. Or you can use Grounding DINO to automatically label data for use in training a smaller, faster object detection model that is fine-tuned to your use case.

## How to Use Foundation Models

The guides in this section walk through how to use each of the foundation models listed above with Inference. No machine learning experience is required to use any of the models. Our code snippets and accompanying reference material provide the knowledge you need to get started working with foundation models.

@@ -0,0 +1,167 @@

[CLIP](https://github.com/openai/CLIP) is a computer vision model that can measure the similarity between text and images.

CLIP can be used for, among other things:

- Image classification
- Automated labeling for classification models
- Image clustering
- Gathering images for model training that are sufficiently dissimilar from existing samples
- Content moderation

With Inference, you can calculate CLIP embeddings for images and text in real time.

In this guide, we will show:

1. How to classify video frames with CLIP in real time, and;
2. How to calculate CLIP image and text embeddings for use in clustering and comparison.

## Classify Video Frames

With CLIP, you can classify images and video frames without training a model. This is because CLIP has been pre-trained to recognize many different objects.

To use CLIP to classify video frames, you need a prompt. In the example below, we will use the prompt "an ace of spades playing card".

We can compare the similarity of the prompt to each video frame and use the result to classify each frame.

Below is a demo of CLIP classifying video frames in real time. The code for the example is below the video.

<video width="100%" autoplay loop muted>
    <source src="https://media.roboflow.com/clip-coffee.mp4" type="video/mp4">
</video>

Create a new Python file and add the following code:

```python
import cv2
import inference
from inference.core.utils.postprocess import cosine_similarity

from inference.models import Clip

clip = Clip()

prompt = "an ace of spades playing card"
text_embedding = clip.embed_text(prompt)

def render(result, image):
    # get the cosine similarity between the prompt & the image
    similarity = cosine_similarity(result["embeddings"][0], text_embedding[0])

    # scale the result to 0-100 based on heuristic (~the best & worst values I've observed)
    range = (0.15, 0.40)
    similarity = (similarity - range[0]) / (range[1] - range[0])
    similarity = max(min(similarity, 1), 0) * 100

    # print the similarity
    text = f"{similarity:.1f}%"
    cv2.putText(image, text, (10, 310), cv2.FONT_HERSHEY_SIMPLEX, 12, (255, 255, 255), 30)
    cv2.putText(image, text, (10, 310), cv2.FONT_HERSHEY_SIMPLEX, 12, (206, 6, 103), 16)

    # print the prompt
    cv2.putText(image, prompt, (20, 1050), cv2.FONT_HERSHEY_SIMPLEX, 2, (255, 255, 255), 10)
    cv2.putText(image, prompt, (20, 1050), cv2.FONT_HERSHEY_SIMPLEX, 2, (206, 6, 103), 5)

    # display the image
    cv2.imshow("CLIP", image)
    cv2.waitKey(1)

# start the stream
inference.Stream(
    source="webcam",
    model=clip,
    output_channel_order="BGR",
    use_main_thread=True,
    on_prediction=render,
)
```

Run the code to use CLIP on your webcam.

**Note:** The model will take a minute or two to load. You will not see output while the model is loading.

## Calculate a CLIP Embedding

CLIP enables you to calculate embeddings. Embeddings are numeric, semantic representations of images and text. They are useful for clustering and comparison.

You can use CLIP embeddings to compare the similarity of text and images.

There are two types of CLIP embeddings: image and text.

Below we show how to calculate, then compare, both types of embeddings.

### Image Embedding

In the code below, we calculate an image embedding.

Create a new Python file and add this code:

```python
import os

import requests

# Define the request payload
infer_clip_payload = {
    # Images can be provided as URLs or as base64-encoded strings
    "image": {
        "type": "url",
        "value": "https://i.imgur.com/Q6lDy8B.jpg",
    },
}

# Define the inference server URL (localhost:9001, infer.roboflow.com, etc.)
base_url = "https://infer.roboflow.com"

# Define your Roboflow API key
api_key = os.environ["API_KEY"]

res = requests.post(
    f"{base_url}/clip/embed_image?api_key={api_key}",
    json=infer_clip_payload,
)

embeddings = res.json()["embeddings"]

print(embeddings)
```

### Text Embedding

In the code below, we calculate a text embedding. As in the previous example, the server URL and API key are defined up front.

```python
import os

import requests

# Define the request payload
infer_clip_payload = {
    "text": "the quick brown fox jumped over the lazy dog",
}

# Define the inference server URL and your Roboflow API key
base_url = "https://infer.roboflow.com"
api_key = os.environ["API_KEY"]

res = requests.post(
    f"{base_url}/clip/embed_text?api_key={api_key}",
    json=infer_clip_payload,
)

embeddings = res.json()["embeddings"]

print(embeddings)
```

### Compare Embeddings

To compare embeddings for similarity, you can use cosine similarity.

The code you need to compare image and text embeddings is the same. In the snippet below, `image_embedding` and `text_embedding` are single embedding vectors (for example, the first element of the `embeddings` list returned in the previous two sections).

```python
from inference.core.utils.postprocess import cosine_similarity

similarity = cosine_similarity(image_embedding, text_embedding)
```

The resulting value ranges from -1 to 1. The higher the number, the more similar the image and text are.
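
To tie the pieces together, here is a minimal end-to-end sketch: it requests one image embedding and one text embedding from the HTTP endpoints shown above, then compares them with cosine similarity. It assumes a reachable Inference server and an `API_KEY` environment variable, and that the `cosine_similarity` helper accepts the embedding lists returned by the API; the image URL and prompt are arbitrary placeholders.

```python
import os

import requests

from inference.core.utils.postprocess import cosine_similarity

base_url = "https://infer.roboflow.com"  # or http://localhost:9001 for a local server
api_key = os.environ["API_KEY"]

# Embed an image
image_res = requests.post(
    f"{base_url}/clip/embed_image?api_key={api_key}",
    json={"image": {"type": "url", "value": "https://i.imgur.com/Q6lDy8B.jpg"}},
)
image_embedding = image_res.json()["embeddings"][0]

# Embed a text prompt
text_res = requests.post(
    f"{base_url}/clip/embed_text?api_key={api_key}",
    json={"text": "a forklift in a warehouse"},
)
text_embedding = text_res.json()["embeddings"][0]

# Compare the two embeddings
print(cosine_similarity(image_embedding, text_embedding))
```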

## See Also

- [What is CLIP?](https://blog.roboflow.com/openai-clip/)
- [Build an Image Search Engine with CLIP and Faiss](https://blog.roboflow.com/clip-image-search-faiss/)
- [Build a Photo Memories App with CLIP](https://blog.roboflow.com/build-a-photo-memories-app-with-clip/)
- [Analyze and Classify Video with CLIP](https://blog.roboflow.com/how-to-analyze-and-classify-video-with-clip/)

@@ -0,0 +1,55 @@

[DocTR](https://github.com/mindee/doctr) is an Optical Character Recognition model.

You can use DocTR with Inference to identify and recognize characters in images.

### How to Use DocTR

To use DocTR with Inference, you will need a Roboflow API key. If you don't already have a Roboflow account, [sign up for a free Roboflow account](https://app.roboflow.com). Then, retrieve your API key from the Roboflow dashboard. Run the following command to set your API key in your coding environment:

```
export API_KEY=<your api key>
```

Create a new Python file and add the following code:

```python
import base64
import os
from io import BytesIO

import requests
from PIL import Image

API_KEY = os.environ["API_KEY"]
IMAGE = "container1.jpeg"

# Load the image and encode it as a base64 string of the JPEG bytes
image = Image.open(IMAGE)
buffer = BytesIO()
image.save(buffer, format="JPEG")
image_base64 = base64.b64encode(buffer.getvalue()).decode("utf-8")

data = {
    "image": {
        "type": "base64",
        "value": image_base64,
    }
}

# Send the image to the DocTR OCR endpoint on the Inference server
ocr_results = requests.post(
    "http://localhost:9001/doctr/ocr?api_key=" + API_KEY,
    json=data,
).json()

print(ocr_results)
```

Above, replace `container1.jpeg` with the path to the image in which you want to read text.

Then, run the Python script you have created:

```
python app.py
```

The results of DocTR will appear in your terminal:

```
...
```

## Further Reading

- [Use DocTR as part of a two-step detection system](https://blog.roboflow.com/ocr-api/)

Review comment: At this point, have we talked about how to start the Roboflow Inference server Docker container? If not, we should mention it here and link to that part of the docs.