Segment Anything 2 (SAM 2) Ball Tracking and Real Time Code Demo!


Category: Tech Tutorial

Tags: AI, demo, segmentation, tracking, video

Entities: Kevin Wood, SA-V dataset, Sam 2, Segment Anything 2, YOLO


Summary

    Introduction to Segment Anything 2
    • Segment Anything 2, also known as Sam 2, is a foundation model for segmenting objects in images and videos.
    • The model was trained on a diverse dataset of 51,000 videos and 643,000 masklets.
    Key Features and Benefits
    • Sam 2 eliminates the need for manual training and annotation, unlike previous models like YOLO.
    • It can segment objects even with occlusions, maintaining accuracy where other models may fail.
    • Sam 2 is significantly faster, achieving 24.2 FPS even with the large model.
    Technical Overview
    • The architecture includes an image encoder, mask decoder, and memory attention for video tracking.
    • Memory attention links frames in a video to maintain object tracking.
    Limitations
    • Sam 2 may lose track in long occlusions, crowded scenes, or extended videos.
    • It can miss fine details in fast-moving objects.
    Setting Up and Using Sam 2
    • Set up a virtual environment and handle dependencies for running Sam 2 locally.
    • Use a series of scripts to extract frames, mask the first frame, and segment the video.
    Real-time Demo and Performance
    • The demo shows Sam 2's ability to track objects in real-time, even on weak hardware.
    • The model accurately tracks a soccer ball through occlusions and fast movements.
    Takeaways
    • Segment Anything 2 offers a robust solution for video object tracking without manual training.
    • The model excels in speed and accuracy, even in challenging conditions.
    • Setting up and running Sam 2 requires some technical steps but provides significant benefits.
    • Real-time demos highlight the practical applications and capabilities of Sam 2.
    • Sam 2's limitations can be mitigated with prompt refinement and human intervention.

    Transcript

    00:00

Meta just released Segment Anything 2. We're going to go over what Segment Anything 2 is, why to use it, how it works, its limitations, and how to prepare your code environment and handle errors; then we'll go over

    00:16

a real-time code demo, and finally we're going to test the limits with a soccer ball tracking video demo, as you can see on the right. All my code and docs will be available on my website at kevinwoodrobotics.com.

    00:33

[Music] So what is Segment Anything 2? Segment Anything 2, also known as Sam 2, is a foundation model that can segment objects in images and videos. So check it

    00:50

out: you can come over to the demo page and select the object you want by clicking on it, and then all you have to do is click "track object", and you can see it tracking this ball pretty nicely in this video. Everything here was trained on the SA-V dataset. There's a

    01:07

total of 51,000 videos and 643,000 masklets, so each video has about 12 masklets, and these videos have very high resolution. All of the videos were gathered from all over the world, so you can see that we have a lot of

    01:23

variety, and here on the right are some examples of that variety in the training dataset. So why use Segment Anything 2? I would say one of the biggest reasons is that you no longer have to do training. So

    01:39

previously, for those who have used YOLO, you know the typical process is that you have to go through a bunch of images and annotate the location of your object; here, that might mean going through several thousand images and annotating the location of the ball.

    01:54

But now we're going to see if we can use Segment Anything 2 without training, with just a simple prompt, to track the ball in the video we'll see later on. Another huge benefit of Segment Anything 2 is that it can segment objects in videos with

    02:10

occlusion. Here you can see two examples at the bottom: in these two images the foot is in front of the ball, or the boy hides behind the tree. We'll take a look at a video of this happening in action. Here is the video of the ball and soccer foot that I showed

    02:27

earlier, and even when the foot occludes the ball in some scenes, it can still segment the ball properly. Now here's the second example of the boy hiding behind the tree, and you can see that as the boy crosses behind the tree, we can still segment the boy without losing track. Not

    02:43

only that, Segment Anything 2 is much more accurate, especially compared with Sam 1. You can see here on the left that when you click a seed on this fish, it ends up segmenting another fish that it's not

    02:58

supposed to segment. But here on the right, when we click only this fish, we get only the fish, without any extra objects that would make the result less accurate. Another huge improvement of Segment Anything 2 is that it's fast.

    03:14

Just look at this chart: we have different model sizes, from tiny all the way to large, and even with the large one we're getting 24.2 FPS. That's incredibly fast compared to the old one; if you saw my previous video, you saw how long it

    03:29

takes. In our real-time demo you're going to see how fast it actually runs, even on the pretty weak computer I have, and we're going to test the limits later on. So how does Segment Anything 2 work? Here is an example of the original Sam architecture.

    03:46

Previously, what you have is an image, and these plus marks are the seeds you choose. Once you have your image, you pass it through an image encoder, and then here is the main part that does the magic: you have a mask decoder, and you have a prompt encoder that passes

    04:03

information to your mask decoder. The prompts can be things like a mask, points, or a box, and finally the output is your image with the masks of the objects you're interested in.
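To make that pipeline concrete, here is a minimal sketch of single-image prediction using the image predictor class from the sam2 repo. The checkpoint and config names, the input path, and the point coordinates are assumptions to adjust for your own setup.

```python
import cv2
import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Assumed checkpoint/config names; match them to the files you downloaded.
predictor = SAM2ImagePredictor(
    build_sam2("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt")
)  # pass device="cpu" to build_sam2 if you have no GPU

image = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)  # hypothetical input

with torch.inference_mode():
    predictor.set_image(image)  # the image encoder runs once per image
    # One positive point (label 1) is the seed; the prompt encoder feeds it
    # to the mask decoder together with the image embedding.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]], dtype=np.float32),
        point_labels=np.array([1], dtype=np.int32),
        multimask_output=False,
    )
mask = masks[0]  # boolean HxW array for the selected object
```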

    04:21

What Sam 2 does is take this a step further and apply it to video. Previously, Sam worked only with images; now we can do it with videos. Pretty much this part stays the same, but what they've added is this part called memory attention. Memory attention is what links the different frames in

    04:37

the same video so that it can keep track of the same object. Here again you have the same part in the center as we saw in Sam 1, and the main thing we'll be using is prompts such as XY coordinates, but the key addition is

    04:52

the memory encoder and memory bank. Once you segment one frame, that information is passed back through a feedback loop, so some of it can be reused for further mask decoding and the model can keep track of the object across frames.
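As a rough sketch of how that feedback loop is exposed in the sam2 repo's video predictor: you prompt a single frame, then propagation reuses the memory bank on every later frame. Paths and prompt values here are placeholders.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Assumed config/checkpoint names, as above.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt")

with torch.inference_mode():
    # init_state expects a directory of extracted frame images
    state = predictor.init_state(video_path="videos/frames")

    # One point prompt on one frame is the only human input.
    predictor.add_new_points(
        inference_state=state, frame_idx=0, obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Memory attention carries the object through the rest of the video.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # one mask per tracked object
```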

    05:08

But it's not without its limitations. One problem it has is that it can lose track under certain conditions, such as long occlusions, crowded scenes, or extended videos. Here you can see the horse running around; pay attention to these two horses

    05:24

here. If you notice, one has a white mark on its head, and later on the model ends up tracking another horse that doesn't have the white mark. So similar objects can cause a problem, and the current solution is to refine the prompt throughout the video, which requires some human intervention.

    05:41

So let's take a look at the video of this in action. You can see the horse running around, and at some point, when the occlusion lasts for an extended period of time, it tends to lose track of the horse. Another limitation is that it can miss fine details on fast-moving

    05:57

objects. You can see here on the bottom, with the wheel, that some of the spokes are missing, and in this video running in real time, because the wheel is moving so fast and the details are so small, some of them can't be captured.

    06:12

Okay, so if you're trying to run Segment Anything 2 locally in VS Code like I'm doing, you'll want to set up a virtual environment, so go ahead and do that. You might run into some challenges as you're setting up the repo, so I'll go over some of the things I ran into and the

    06:28

solutions I came up with. Once your virtual environment is set up, you can download the Segment Anything 2 repo. If you see a DLL dependency error mentioning `fbgemm.dll`, make sure you have your

    06:44

C++ compiler installed, if you haven't already. If `pip install -e .` doesn't work and you see the error `CUDA_HOME environment variable is not set`, just pip install the required modules separately instead of using `pip install -e .`. And to get

    07:02

the model weights, go to the repo, download them, and put them in the checkpoints folder. Along the way you might see the error `cannot import name '_C' from 'sam2'`; to resolve that, just run `python setup.py build_ext --inplace`. And to run all the

    07:21

code that I'll be showing, just place the files inside the Segment Anything 2 repo that you've cloned, and in your root folder have a videos folder with your video in it. So you can see here we have our real-time demo up and

    07:37

running, and right now I've set the prompt so that it'll select whatever object is at the center of the screen. You can see that as I rotate my phone a little bit, the mask updates while I'm

    07:52

rotating. This is treating each frame as an image: it's not actually tracking the object through a video, just detecting it in each frame as if it were new, but it shows the real-time performance.
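As I understand the demo, it amounts to something like the sketch below (not the author's actual script): re-run the image predictor on every webcam frame with a fixed center-point prompt, so there is no memory and no tracking. It assumes `predictor` is the SAM2ImagePredictor built earlier.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)  # assumed webcam index
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    h, w = rgb.shape[:2]

    predictor.set_image(rgb)  # re-encode every frame: no memory, no tracking
    masks, _, _ = predictor.predict(
        point_coords=np.array([[w // 2, h // 2]], dtype=np.float32),  # screen center
        point_labels=np.array([1], dtype=np.int32),
        multimask_output=False,
    )

    # Tint the detected object green and display.
    overlay = frame.copy()
    overlay[masks[0].astype(bool)] = (0, 255, 0)
    cv2.imshow("SAM 2 per-frame demo", cv2.addWeighted(frame, 0.6, overlay, 0.4, 0))
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```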

    08:09

You can see that, considering I'm running it on a pretty weak laptop right now, and considering that Sam 1 would take several minutes with a large model, this is actually pretty fast for using the large model to detect an object.

    08:24

I'm going to show you some other objects, just to see how it performs at detecting unseen objects. It's tracking my hand right now; if I move it down, you see it's tracking this lid, and I can rotate it a little bit and you can see

    08:41

that it captures this cup pretty well. This is actually a cap that I made for my mic, and it's doing pretty well. Let's give it one more object, a transparent bottle, and you can see that it's

    08:59

doing pretty well: it's getting the bottle, and when I rotate it, it picks up different parts of the bottle. Overall, I would say it does a pretty good job in terms of speed when treating each frame as an image, so I'm

    09:16

excited to see how it will perform in the video coming up. Before I jump into the real-time demo, I'd like to go over the architecture of my program. What I have here is our soccer video; from there we get the video path, and I

    09:32

created a function called extract_frames. Some of the documentation says you need ffmpeg to do the frame extraction, but you really don't, so I went ahead and made a function to get my frames. From there, we view the first frame so that we can select the location of the

    09:49

object we're trying to track; this is what's called our prompt, which is an XY coordinate. Then we mask the first frame, set up the inference state, and run the main function, called segment_video, which produces a video labeled with the object we're trying to track.
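The exact script isn't shown, but an extract_frames helper along the lines described could look like this OpenCV sketch; the zero-padded JPEG naming is an assumption that matches the numbered-frame-files layout the repo's video loader expects.

```python
import os
import cv2

def extract_frames(video_path: str, frames_dir: str) -> int:
    """Write every frame of video_path into frames_dir as 00000.jpg, 00001.jpg, ..."""
    os.makedirs(frames_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(frames_dir, f"{idx:05d}.jpg"), frame)
        idx += 1
    cap.release()
    return idx  # number of frames written

# e.g. extract_frames("videos/soccer.mp4", "videos/frames")  # hypothetical paths
```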

    10:06

So here is the video we're going to test: a ball moving around with soccer players blocking it. One thing I like about this video is that not only is there occlusion, the ball is moving fast and

    10:21

it's not very clear because the ball is small, and we're going to see how well the Segment Anything 2 model does in this example. If we take a look at our script: I ran the first part, extract_frames, and what it did was

    10:38

convert the soccer video into images, as you can see here. These are all the images, because the way the program from the Sam repo is set up, it expects all the video frames to be individual files. So I went

    10:54

ahead and converted them. Next we run view_first_frame, which lets us see the first frame. Once the first frame shows up, the main point is that we can put our cursor on the ball and read off its coordinate location to use as our prompt later on.
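A minimal sketch of that view-first-frame step (hypothetical, not the author's exact code): show the frame with matplotlib and print the (x, y) under each click, which becomes the point prompt.

```python
import cv2
import matplotlib.pyplot as plt

# Assumed path to the first extracted frame.
frame = cv2.cvtColor(cv2.imread("videos/frames/00000.jpg"), cv2.COLOR_BGR2RGB)

def on_click(event):
    if event.xdata is not None:  # ignore clicks outside the axes
        print(f"prompt point: ({event.xdata:.0f}, {event.ydata:.0f})")

fig, ax = plt.subplots()
ax.imshow(frame)
fig.canvas.mpl_connect("button_press_event", on_click)
plt.show()
```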

    11:11

Once you've extracted that information, go ahead and close the window. Next, run the mask-first-frame step. I'm going to run this, and what it will do is mask

    11:28

the first frame, just so we can verify it's doing this correctly. You can see it has now masked the first frame with the object we selected; if you look closely and zoom in, you can see the orange dot marking the ball that we're going to track throughout the video.
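Verifying the first frame can be done with the same add_new_points call shown earlier, overlaying the returned mask and the prompt point. This is a sketch assuming `predictor`, `state`, and `frame` are set up as above; the clicked coordinates are hypothetical, and the exact shape of the returned logits may differ by release.

```python
import matplotlib.pyplot as plt
import numpy as np

x, y = 210, 350  # hypothetical coordinates read off the first frame

_, obj_ids, mask_logits = predictor.add_new_points(
    inference_state=state, frame_idx=0, obj_id=1,
    points=np.array([[x, y]], dtype=np.float32),  # the clicked coordinate
    labels=np.array([1], dtype=np.int32),
)
mask = (mask_logits[0] > 0.0).squeeze().cpu().numpy()  # first object's mask

plt.imshow(frame)                        # the first frame loaded earlier
plt.imshow(mask, alpha=0.5)              # semi-transparent mask overlay
plt.scatter([x], [y], c="orange", s=20)  # the prompt point (the orange dot)
plt.show()
```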

    11:44

Now, for the final step, we run segment_video, which segments the whole video using the point prompt we chose earlier. Finally, you can

    12:01

see our output is done, and the ball is being tracked, and it's doing it pretty darn well. If I zoom in for you to see better: I've slowed it down by a few frames, but you can see that even as the people block the

    12:18

ball, it's still being tracked pretty well. Here some players are coming closer, and one is about to kick the ball, and so far so good: it has not lost the ball during tracking at all. Just imagine if you were to do this

    12:33

with YOLO, how long it would have taken. Here you can see him kicking the ball, so the ball's being blocked and moving super fast, and it still keeps track. Again, if you want my code, check it out on my website at kevinwoodrobotics.com. If you found this video

    12:48

helpful, give a like and subscribe, and I'll see you in the next one. [Music]