Video Surveillance with YOLO+llava

  • If you're interested in DIY security+AI, check out Frigate NVR (https://frigate.video/), Scrypted (https://www.scrypted.app/), and Viseron (https://viseron.netlify.app/).

  • Congrats! What hardware do you use to run the inference 24/7? I built a simpler version for low-end hardware [0] that recognizes whether there's a person on my parcel, so I know when someone has trespassed and can trigger a siren, lights, etc.

    [0] https://github.com/jmaczan/yolov3-tiny-openvino

  • This runs on a GeForce GTX 1060. A quick search says it draws 120 W. That may only be the peak power consumption, but it's still a lot. Do commercial products, if there are any, consume that much power?

  • A suggestion: I'd swap llava for Florence-2 for your open set text description. Florence-2 seems uniformly more descriptive in its outputs.

  • I'm confused about why you need yolo and llava. Can't you simply use yolo without a multimodal LLM? What does that add? You can use yolo to detect and grab screen coordinates on its own, right?

  • Hello from the privacy crowd! Please use this responsibly. Tech can be a lot of fun and I encourage you to play around with things and I appreciate it when you push the boundaries of what is technically feasible. But please be mindful that surveillance tech can also be used to oppress people and infringe on their freedoms. Use tech for good!

  • MobileNetV3 and EfficientDet are other possible alternatives to YOLO. I was able to get higher than 1.5 FPS on a Raspberry Pi Zero 2W, which draws 1 W on average. With an efficient queuing approach, one can eliminate all bottlenecks.
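
    The comment doesn't say which queuing approach was used. One common pattern, sketched here purely as an assumption, is a small bounded queue between the capture loop and the inference thread that drops the oldest frame when full, so a slow detector always works on recent frames instead of falling behind. The names `put_latest` and `run_pipeline` and the `detect` callback are hypothetical, not from any project mentioned in this thread.

```python
import queue
import threading


def put_latest(q, frame):
    """Put a frame into a bounded queue, discarding the oldest one if full."""
    while True:
        try:
            q.put_nowait(frame)
            return
        except queue.Full:
            try:
                q.get_nowait()  # drop the stalest frame to make room
            except queue.Empty:
                pass  # the consumer emptied the queue in the meantime; retry


def run_pipeline(frames, detect, maxsize=2):
    """Feed frames to `detect` on a worker thread through a drop-oldest queue.

    `frames` is any iterable of frames and `detect` is a stand-in for the
    real model call; both are placeholders for this sketch.
    """
    q = queue.Queue(maxsize=maxsize)
    stop = object()  # sentinel that tells the worker to exit
    results = []

    def worker():
        while True:
            item = q.get()
            if item is stop:
                break
            results.append(detect(item))

    t = threading.Thread(target=worker)
    t.start()
    for frame in frames:
        put_latest(q, frame)  # never blocks the capture loop
    q.put(stop)  # blocking put, so no queued frame is dropped for the sentinel
    t.join()
    return results
```

    How many frames get skipped depends on how slow `detect` is relative to the capture rate; the capture side itself never waits on inference.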

  • Can you specify ideal hardware (camera, computer) to deploy the solution? Thanks

  • >> It calculates the center of every detection box, pinpoint on screen and gives 16px tolerance on all directions. Script tries to find closest object as fallback and creates a new object in memory in last resort. You can observe persistent objects in /elements folder

    I’ve never implemented this kind of object persistence algo - is this a good approach? Seems naive but maybe that’s just because it’s simple.
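
    For anyone trying to picture it, here is a minimal sketch of the matching logic as the quoted description reads: box center, 16 px tolerance in every direction, closest object as fallback, new object as a last resort. It is not the author's code. The class names, the greedy one-track-per-box rule, and the 64 px fallback cutoff are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from itertools import count

TOLERANCE_PX = 16  # tolerance from the quoted description
FALLBACK_PX = 64   # assumed cutoff for the "closest object" fallback


@dataclass
class Track:
    track_id: int
    cx: float
    cy: float


class CentroidTracker:
    def __init__(self, tolerance=TOLERANCE_PX, fallback=FALLBACK_PX):
        self.tolerance = tolerance
        self.fallback = fallback
        self.tracks = []
        self._ids = count(1)

    def update(self, boxes):
        """Match (x1, y1, x2, y2) boxes to tracks; return one track id per box."""
        ids = []
        used = set()  # each track may be claimed by only one box per frame
        for x1, y1, x2, y2 in boxes:
            cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
            free = [t for t in self.tracks if t.track_id not in used]
            # 1. A track whose center lies within the tolerance on both axes.
            match = next(
                (t for t in free
                 if abs(t.cx - cx) <= self.tolerance
                 and abs(t.cy - cy) <= self.tolerance),
                None,
            )
            # 2. Fallback: the nearest free track, if it is close enough.
            if match is None and free:
                nearest = min(free, key=lambda t: math.hypot(t.cx - cx, t.cy - cy))
                if math.hypot(nearest.cx - cx, nearest.cy - cy) <= self.fallback:
                    match = nearest
            # 3. Last resort: create a new persistent object.
            if match is None:
                match = Track(next(self._ids), cx, cy)
                self.tracks.append(match)
            match.cx, match.cy = cx, cy  # remember the latest position
            used.add(match.track_id)
            ids.append(match.track_id)
        return ids
```

    Because the matching is greedy, the result depends on the order of the detections, so two objects that pass close to each other can swap ids.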

  • How about llama3.2 vision? Would it perform better?

  • All I see, usually, is some AI YOLO algorithm applied to an offline video.

    This is the first time that I've seen a "complete" setup. Any info to learn more on applying YOLO and similar models to real time streams (whatever the format)?

  • Could try with Florence by Microsoft instead of Yolo and Llava, though the results are not going to be as great. Florence will do the inference on CPU. This is just for fun.