Andrew Ng’s VisionAgent: Streamlining Vision AI Solutions


Model                              Recall    Precision    F1 Score
Landing AI                         77.0%     82.6%        79.7% (highest)
Microsoft Florence-2               43.4%     36.6%        39.7%
Google OWLv2                       81.0%     29.5%        43.2%
Alibaba Qwen2.5-VL-7B-Instruct     26.0%     54.0%        35.1%

4. Key Takeaways

  • Landing AI’s Agentic Object Detection achieved the highest F1 score (79.7%), meaning it balances precision and recall better than the others.
  • Google OWLv2 had the highest recall (81.0%), meaning it detects the most objects but has lower precision.
  • Alibaba’s Qwen2.5-VL-7B-Instruct had relatively low recall but moderate precision.
  • Microsoft Florence-2 showed lower performance across all metrics compared to the other models.

Landing AI’s approach appears to offer the most balanced and effective performance in object detection tasks, as reflected in its highest F1 score. The other models have trade-offs, with some excelling in recall but lacking in precision, or vice versa.
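For reference, the F1 score is the harmonic mean of precision and recall, and the figures in the table can be reproduced directly from the reported precision and recall values. A quick sanity check in Python:

# F1 is the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs taken from the benchmark table above
models = {
    "Landing AI": (0.826, 0.770),
    "Microsoft Florence-2": (0.366, 0.434),
    "Google OWLv2": (0.295, 0.810),
    "Alibaba Qwen2.5-VL-7B-Instruct": (0.540, 0.260),
}

for name, (p, r) in models.items():
    print(f"{name}: F1 = {f1(p, r):.1%}")
# Landing AI: 79.7%, Florence-2: 39.7%, OWLv2: 43.2%, Qwen2.5-VL-7B-Instruct: 35.1%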

Practical Working of VisionAgent

VisionAgent follows a structured process to produce outputs effectively:

  • Upload an image you want to analyze
  • Write a prompt (e.g., “detect people with glasses”)
  • The AI agent analyzes the image thoroughly
  • Receive detection results on your image

1. Prompt: “Detect vegetables in and around the basket”

Step 1: Conversation

A user initiates a request, describing the vision task in natural language. VisionAgent interprets this prompt and confirms the task.

Input Image

Conversation

“I’ll help you write code to detect vegetables in and around the basket in the image. I’ll use object detection to identify various vegetables in the scene.”

Step 2: Planner

I need to first understand what’s in the image and get suggestions on how to detect vegetables in and around the basket. I’ll start by loading the image and using the claude35_vqa and suggestion functions.

Planning Steps

  • Understand the image contents using VQA (Visual Question Answering).
  • Generate relevant suggestions for the best detection approach.
  • Select tools that suit the task, such as object detection and color-based classification.

Example Plan

  • Identify all objects in the image.
  • Determine which objects are vegetables.
  • Differentiate vegetables inside and outside the basket.
  • Use a nested detection structure to analyze the spatial relationships.

Execution:

image = load_image('vegetablebasket.jpg')
claude35_vqa('Can you describe this image? What vegetables do you see in and around the basket?', [image])
suggestion('How can I detect vegetables in and around the basket in this image?', [image])

Observation

[claude35_vqa output]
This image shows a beautiful array of fresh vegetables arranged around and in a wicker basket. The vegetables include:

1. Fresh carrots with green tops
2. Cauliflower
3. Onions
4. Potatoes
5. Broccoli
6. Eggplant
7. Red tomatoes
8. Yellow squash/zucchini
9. Purple cabbage
10. Green bell pepper
11. Red bell pepper
12. Cucumbers
13. Fresh garlic
14. Fresh herbs (appears to be parsley)
15. Green banana peppers or chili peppers

The vegetables appear to be very fresh and vibrant in color, arranged on what looks like a wooden surface or cutting board. The composition shows a nice variety of colors, from deep purples to bright reds, oranges, and greens, making it visually appealing and suggesting a healthy, nutritious selection of produce.
[end of claude35_vqa output]

[suggestions]

1. The vegetables are both in and around a basket, so relative_position would help in identifying their locations with respect to the basket as an anchor point.

2. nested_structure applies because some vegetables are nested/contained within the basket while others are arranged outside it.

3. color would be useful since the vegetables are distinctly colored (red tomatoes, orange carrots, green broccoli, purple cabbage, etc.), making color a key identifier.

4. The problem does not involve small text, grid patterns, tracking in video, comparing sizes specifically, or dealing with extremely small objects in a large image. The depth position is visible but not critical to the detection task.


[suggestion 0]
You need to find the color of objects in the image; you can use the following code to help with this task:

import numpy as np
import cv2

color_ranges = {
    "red_lower": ((0, 100, 100), (int(179 * 20 / 360), 255, 255)),
    "orange": ((int(179 * 21 / 360), 100, 100), (int(179 * 50 / 360), 255, 255)),
    "yellow": ((int(179 * 51 / 360), 100, 100), (int(179 * 70 / 360), 255, 255)),
    "inexperienced": ((int(179 * 71 / 360), 100, 100), (int(179 * 150 / 360), 255, 255)),
    "cyan": ((int(179 * 151 / 360), 100, 100), (int(179 * 180 / 360), 255, 255)),
    "blue": ((int(179 * 181 / 360), 100, 100), (int(179 * 265 / 360), 255, 255)),
    "purple": ((int(179 * 266 / 360), 100, 100), (int(179 * 290 / 360), 255, 255)),
    "pink": ((int(179 * 291 / 360), 100, 100), (int(179 * 330 / 360), 255, 255)),
    "red_upper": ((int(179 * 331 / 360), 100, 100), (179, 255, 255)),
    "white": ((0, 0, 200), (179, 25, 255)),
    "grey": ((0, 0, 50), (179, 50, 200)),
    "black": ((0, 0, 0), (179, 255, 30)),
}

def get_color(image, color_ranges):
    hsv_image = cv2.cvtColor(image, cv2.COLOR_RGB2HSV)
    detected_colors = {}
    for color, (lower, upper) in color_ranges.items():
        upper_range = np.array(upper, dtype=np.uint8)
        lower_range = np.array(lower, dtype=np.uint8)
        mask = cv2.inRange(hsv_image, lower_range, upper_range)
        detected_pixels = cv2.countNonZero(mask)
        detected_colors[color] = detected_pixels

    if "red_lower" in detected_colors and "red_upper" in detected_colors:
        detected_colors["red"] = detected_colors["red_lower"] + detected_colors["red_upper"]
        del detected_colors["red_lower"]
        del detected_colors["red_upper"]
    return sorted(detected_colors, key=detected_colors.get, reverse=True)[0]

[end of suggestion 0]
[suggestion 1]
You are trying to count objects inside other objects, or a nested structure. You can solve this by first detecting the outer objects, then cropping the image to the bounding box of each outer object and detecting the inner objects. You can use the following code to help with this task:

all_dets = <an object detection tool>("object", image)

height, width = image.shape[:2]

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

# only check inner detections on the top 25 largest outer detections
largest_dets = sorted(all_dets, key=lambda x: area(x["bbox"]), reverse=True)[:25]
for det in largest_dets:
    x1 = int(det["bbox"][0] * width)
    y1 = int(det["bbox"][1] * height)
    x2 = int(det["bbox"][2] * width)
    y2 = int(det["bbox"][3] * height)

    crop = image[y1:y2, x1:x2]
    crop_height, crop_width = crop.shape[:2]

    inner_dets = <an object detection tool>("object", crop)
    for inner_det in inner_dets:
        x1_inner = int(inner_det["bbox"][0] * crop_width)
        y1_inner = int(inner_det["bbox"][1] * crop_height)
        x2_inner = int(inner_det["bbox"][2] * crop_width)
        y2_inner = int(inner_det["bbox"][3] * crop_height)

        bbox = [
            x1 + x1_inner,
            y1 + y1_inner,
            x1 + x2_inner,
            y1 + y2_inner,
        ]
        norm_bbox = [
            bbox[0] / width,
            bbox[1] / height,
            bbox[2] / width,
            bbox[3] / height,
        ]
        all_dets.append(
            {
                "label": inner_det["label"],
                "score": inner_det["score"],
                "bbox": norm_bbox,
            }
        )

[end of suggestion 1]
[suggestion 2]
You are trying to locate objects relative to 'anchor' objects. The 'anchor' objects can be detected fine, but there are many of the other objects and you only want to return the ones that are located relative to the 'anchor' objects as specified by the user. You can use the following code to help with this task:

# First find a model that can detect the location of the anchor objects
anchor_dets = <a model that can detect the location of the anchor objects>("anchor object", image)
# Then find a model that can detect the location of the relative objects
relative_dets = <a model that can detect the location of the relative objects>("relative object", image)

# This will give you relative objects 'above' the anchor objects since it is the
# distance between the lower left corner of the relative object and the upper left
# corner of the anchor object. The remaining functions can be used to get the other
# relative positions.
def above_distance(box1, box2):
    return (box1["bbox"][0] - box2["bbox"][0]) ** 2 + (
        box1["bbox"][3] - box2["bbox"][1]
    ) ** 2

def below_distance(box1, box2):
    return (box1["bbox"][0] - box2["bbox"][0]) ** 2 + (
        box1["bbox"][1] - box2["bbox"][3]
    ) ** 2

def right_distance(box1, box2):
    return (box1["bbox"][0] - box2["bbox"][2]) ** 2 + (
        box1["bbox"][1] - box2["bbox"][1]
    ) ** 2

def left_distance(box1, box2):
    return (box1["bbox"][2] - box2["bbox"][0]) ** 2 + (
        box1["bbox"][1] - box2["bbox"][1]
    ) ** 2

closest_boxes = []
for anchor_det in anchor_dets:
    # You can use any of the above functions to get the relative position
    distances = [
        (relative_det, above_distance(relative_det, anchor_det))
        for relative_det in relative_dets
    ]
    # You may want to grab the closest object for each of the anchors. This line will give
    # you the box directly above the anchor box (or below, left, right depending on
    # the function used)
    closest_box = min(distances, key=lambda x: x[1])[0]
    closest_boxes.append(closest_box)

[end of suggestion 2]
[end of suggestions]

<count>9</count>

Step 3: Execution

The planned steps are executed using the VisionAgent Library and Tools Library. This involves loading the image, running models, and generating detection outputs.

Planner

Based on Claude’s description and the suggestions, I need to detect vegetables both in and around the basket. This is a nested structure problem where I first need to detect the basket and then detect vegetables in relation to it. Let me first get a tool that can detect both the basket and the vegetables.

Example Execution Code

image = load_image('vegetablebasket.jpg')
claude35_vqa('Can you describe this image? What vegetables do you see in and around the basket?', [image])
suggestion('How can I detect vegetables in and around the basket in this image?', [image])
get_tool_for_task('Detect basket and vegetables in the image', [image])

Observation and Final Output

After execution, VisionAgent produces structured results:

  • Detected vegetables categorized inside and outside the basket (a simplified sketch of this split is shown below).
  • Bounding box coordinates for each detected vegetable.
  • A deployable AI model that can be used for real-world applications.
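To make the first bullet concrete, here is a minimal sketch of how the inside/outside split could be derived from detections in the normalized {"label", "score", "bbox"} format used in the suggestions above: a vegetable counts as "in the basket" if its box center falls inside the basket's box. The helper names and sample detections here are hypothetical illustrations, not VisionAgent's actual generated code.

# Minimal sketch: split vegetable detections into "in basket" / "outside basket"
# using box centers. Boxes are normalized [x1, y1, x2, y2] as in the suggestions above.

def center(bbox):
    return ((bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2)

def inside(bbox, container):
    cx, cy = center(bbox)
    return container[0] <= cx <= container[2] and container[1] <= cy <= container[3]

def split_by_basket(detections):
    baskets = [d for d in detections if d["label"] == "basket"]
    veggies = [d for d in detections if d["label"] != "basket"]
    in_basket, outside = [], []
    for det in veggies:
        target = in_basket if any(inside(det["bbox"], b["bbox"]) for b in baskets) else outside
        target.append(det)
    return in_basket, outside

# Hypothetical detections for illustration only
dets = [
    {"label": "basket", "score": 0.95, "bbox": [0.30, 0.20, 0.75, 0.80]},
    {"label": "tomato", "score": 0.88, "bbox": [0.40, 0.30, 0.50, 0.40]},
    {"label": "carrot", "score": 0.81, "bbox": [0.05, 0.60, 0.20, 0.75]},
]
in_basket, outside = split_by_basket(dets)
print([d["label"] for d in in_basket], [d["label"] for d in outside])  # ['tomato'] ['carrot']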

Output


2. Prompt: “Identify the red car in the video”

Input Video: Video Link

Step 1: Conversation

I’ll help you write code to identify and track red cars in the video. I’ll use object detection and color-based filtering to accomplish this task.

Step 2: Planner

I need to first understand what’s in the video and get suggestions on how to identify a red car. I’ll extract some frames and use claude35_vqa to describe the video content, then get suggestions.

Step 3: Execution

frames = extract_frames_and_timestamps('videoplayback(4).mp4', 1)
frames = [f["frame"] for f in frames]
claude35_vqa('Can you describe this video? Is there a red car in it?', frames[:5])
suggestion('How can I identify the red car in the video?', frames[:5])

Observation

[claude35_vqa output]
Yes, there is a red car (#21) in this video sequence, which appears to be a NASCAR race clip showing an incident between two cars - a red #21 car and a black/gold #3 car with Winchester branding. The video shows them making contact around turn 4 of the track, with the cars getting close to each other and potentially trading paint. The sequence is labeled as "Josh Berry vs Austin Dillon" and shows what appears to be a tense racing moment between the two drivers. The text overlays indicate "WOAH TURN 4", "CONTACT", "BERRY", "A LOT", and "THERE" as the sequence progresses.
[end of claude35_vqa output]

[suggestions]

Based on the user's request to identify the red car in the video and the provided video frames, this is primarily a case of:

1. finding_features_with_video_tracking - The sequence shows a NASCAR race interaction between two cars (Josh Berry and Austin Dillon) where tracking the red #21 car through the turn and contact is key
2. color - The specific request mentions identifying the "red car", making color identification relevant
3. relative_position - The interaction between the two cars requires understanding their positions relative to each other through the turn and contact

The other categories do not apply because:
- This isn't about small text or large images
- There are no grid patterns or missing elements
- This isn't about comparing sizes or nested structures
- While there is some depth perception in racing, the request isn't specifically about determining depth position


[suggestion 0]
You need to find the color of objects in the image; you can use the following code to help with this task:

import numpy as np
import cv2

color_ranges = {
    "red_lower": ((0, 100, 100), (int(179 * 20 / 360), 255, 255)),
    "orange": ((int(179 * 21 / 360), 100, 100), (int(179 * 50 / 360), 255, 255)),
    "yellow": ((int(179 * 51 / 360), 100, 100), (int(179 * 70 / 360), 255, 255)),
    "inexperienced": ((int(179 * 71 / 360), 100, 100), (int(179 * 150 / 360), 255, 255)),
    "cyan": ((int(179 * 151 / 360), 100, 100), (int(179 * 180 / 360), 255, 255)),
    "blue": ((int(179 * 181 / 360), 100, 100), (int(179 * 265 / 360), 255, 255)),
    "purple": ((int(179 * 266 / 360), 100, 100), (int(179 * 290 / 360), 255, 255)),
    "pink": ((int(179 * 291 / 360), 100, 100), (int(179 * 330 / 360), 255, 255)),
    "red_upper": ((int(179 * 331 / 360), 100, 100), (179, 255, 255)),
    "white": ((0, 0, 200), (179, 25, 255)),
    "grey": ((0, 0, 50), (179, 50, 200)),
    "black": ((0, 0, 0), (179, 255, 30)),
}

def get_color(image, color_ranges):
    hsv_image = cv2.cvtColor(image, cv2.COLOR_RGB2HSV)
    detected_colors = {}
    for color, (lower, upper) in color_ranges.items():
        upper_range = np.array(upper, dtype=np.uint8)
        lower_range = np.array(lower, dtype=np.uint8)
        mask = cv2.inRange(hsv_image, lower_range, upper_range)
        detected_pixels = cv2.countNonZero(mask)
        detected_colors[color] = detected_pixels

    if "red_lower" in detected_colors and "red_upper" in detected_colors:
        detected_colors["red"] = detected_colors["red_lower"] + detected_colors["red_upper"]
        del detected_colors["red_lower"]
        del detected_colors["red_upper"]
    return sorted(detected_colors, key=detected_colors.get, reverse=True)[0]

[end of suggestion 0]
[suggestion 1]
First try to solve the problem using a VQA tool before using the tracking approach, for a faster and simpler solution:

answer = <VQA tool to answer your question>("<your prompt here>", image)

If that does not work, you can track the objects in the video and then identify features on those objects. You need to first get a tool that can track objects in a video, and then for each object find another tool to identify the features on the object. You can use the following code to help with this task:

track_predictions = <object tracking tool>("object", video_frames)


# Step 1: go through each frame and each prediction and extract the predicted bounding boxes as crops
obj_to_info = {}
for frame, frame_predictions in zip(video_frames, track_predictions):
    for obj in frame_predictions:
        if obj["label"] not in obj_to_info:
            obj_to_info[obj["label"]] = []
        height, width = frame.shape[:2]
        crop = frame[
            int(obj["bbox"][1] * height) : int(obj["bbox"][3] * height),
            int(obj["bbox"][0] * width) : int(obj["bbox"][2] * width),
            :,
        ]
        # For each crop, use an object detection tool, VQA tool or classification tool to identify if the object contains the features you want
        output = <tool to identify your feature or multiple features>("<your feature(s) here>", crop)
        obj_to_info[obj["label"]].extend(output)

print(f"{len(obj_to_info)} objects tracked")

objects_with_info = set()
for obj_label, infos in obj_to_info.items():
    for info in infos:
        if info["label"] == "<your feature here>":
            objects_with_info.add(obj_label)
            break

print(f"{len(objects_with_info)} objects with features found")

[end of suggestion 1]
[suggestion 2]
You are trying to locate objects relative to 'anchor' objects. The 'anchor' objects can be detected fine, but there are many of the other objects and you only want to return the ones that are located relative to the 'anchor' objects as specified by the user. You can use the following code to help with this task:

# First find a model that can detect the location of the anchor objects
anchor_dets = <a model that can detect the location of the anchor objects>("anchor object", image)
# Then find a model that can detect the location of the relative objects
relative_dets = <a model that can detect the location of the relative objects>("relative object", image)

# This will give you relative objects 'above' the anchor objects since it is the
# distance between the lower left corner of the relative object and the upper left
# corner of the anchor object. The remaining functions can be used to get the other
# relative positions.
def above_distance(box1, box2):
    return (box1["bbox"][0] - box2["bbox"][0]) ** 2 + (
        box1["bbox"][3] - box2["bbox"][1]
    ) ** 2

def below_distance(box1, box2):
    return (box1["bbox"][0] - box2["bbox"][0]) ** 2 + (
        box1["bbox"][1] - box2["bbox"][3]
    ) ** 2

def right_distance(box1, box2):
    return (box1["bbox"][0] - box2["bbox"][2]) ** 2 + (
        box1["bbox"][1] - box2["bbox"][1]
    ) ** 2

def left_distance(box1, box2):
    return (box1["bbox"][2] - box2["bbox"][0]) ** 2 + (
        box1["bbox"][1] - box2["bbox"][1]
    ) ** 2

closest_boxes = []
for anchor_det in anchor_dets:
    # You can use any of the above functions to get the relative position
    distances = [
        (relative_det, above_distance(relative_det, anchor_det))
        for relative_det in relative_dets
    ]
    # You may want to grab the closest object for each of the anchors. This line will give
    # you the box directly above the anchor box (or below, left, right depending on
    # the function used)
    closest_box = min(distances, key=lambda x: x[1])[0]
    closest_boxes.append(closest_box)

[end of suggestion 2]
[end of suggestions]

<count>9</count>

Planner

Based on Claude’s description and the suggestions, I need to track cars in the video and identify the red one. Let me first get a tool for tracking cars in the video (a simplified sketch of the color-check step follows the execution call below).

Execution

get_tool_for_task('Track cars in a NASCAR race video', frames[:5])
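As a rough illustration of the second half of the plan (picking the red car out of the tracked cars), the sketch below crops each tracked box from its frame, measures the fraction of red pixels in HSV space using the same red hue ranges as the color suggestion above, and returns the track that is, on average, the most red. The tracking output format, helper names, and threshold-free "most red wins" rule are assumptions for illustration; the actual tool returned by get_tool_for_task would supply the real predictions.

import numpy as np
import cv2

# Red hue ranges taken from the color suggestion above (OpenCV hue is 0-179)
RED_RANGES = [((0, 100, 100), (int(179 * 20 / 360), 255, 255)),
              ((int(179 * 331 / 360), 100, 100), (179, 255, 255))]

def red_fraction(crop):
    # Fraction of pixels in the crop that fall inside the red HSV ranges
    hsv = cv2.cvtColor(crop, cv2.COLOR_RGB2HSV)
    red_pixels = sum(
        cv2.countNonZero(cv2.inRange(hsv, np.array(lo, np.uint8), np.array(hi, np.uint8)))
        for lo, hi in RED_RANGES
    )
    return red_pixels / max(1, crop.shape[0] * crop.shape[1])

def find_red_car(frames, track_predictions):
    # track_predictions: per-frame lists of {"label", "score", "bbox"} dicts with
    # normalized bboxes (assumed format, matching the suggestions above)
    scores = {}  # track label -> list of red fractions across frames
    for frame, preds in zip(frames, track_predictions):
        h, w = frame.shape[:2]
        for det in preds:
            x1, y1, x2, y2 = det["bbox"]
            crop = frame[int(y1 * h):int(y2 * h), int(x1 * w):int(x2 * w)]
            if crop.size:
                scores.setdefault(det["label"], []).append(red_fraction(crop))
    # The track whose crops are, on average, the most red is taken as the red car
    return max(scores, key=lambda k: sum(scores[k]) / len(scores[k])) if scores else None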

Output


Conclusion

VisionAgent is transforming the way developers build, test, and deploy AI-driven vision applications. By automating tedious processes and providing ready-to-use tools, it significantly reduces development time while ensuring high-quality results. Whether you’re an AI researcher, a developer, or a business looking to implement computer vision solutions, VisionAgent provides a fast, flexible, and scalable way to achieve your goals.

With ongoing advancements in AI, VisionAgent is expected to evolve further, incorporating even more powerful models and expanding its ecosystem to support a wider range of applications. Now is the perfect time to explore how VisionAgent can enhance your AI-driven vision projects.

Hi, I’m Pankaj Singh Negi – Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology revolutionizing our lifestyle.
