Hacker News

Video Surveillance with YOLO+llava

by psychip on 10/8/2024, 12:21:03 AM with 12 comments

by 01100011 on 10/8/2024, 2:54:16 PM
If you're interested in DIY security+AI, check out Frigate NVR (https://frigate.video/), Scrypted (https://www.scrypted.app/), and Viseron (https://viseron.netlify.app/).
by yu3zhou4 on 10/8/2024, 6:13:36 AM
Congrats! What hardware do you use to run the inference 24/7? I built a simpler version that runs on low-end hardware [0]. It recognizes whether there’s a person on my parcel, so I know when someone has trespassed and can trigger a siren, lights, etc.
[0] https://github.com/jmaczan/yolov3-tiny-openvino
by pmontra on 10/8/2024, 6:38:33 AM
This runs on a GeForce GTX 1060. A quick search says it draws 120 W. That may be only the peak power consumption, but it's still a lot. Do commercial products, if there are any, consume that much power?
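To put the 120 W figure in perspective, here is a rough sketch of the daily energy use it implies for 24/7 operation. It treats the quoted wattage as a constant draw, which is an assumption: if 120 W is only the peak, the real average will be lower. The 1 W comparison is the Raspberry Pi Zero 2 W average quoted elsewhere in this thread; the function name is illustrative.

```python
HOURS_PER_DAY = 24


def daily_kwh(watts: float) -> float:
    """Energy used in one day at a constant draw, in kilowatt-hours."""
    return watts * HOURS_PER_DAY / 1000


# Upper bound: assumes the GPU sits at the quoted 120 W all day.
gtx_1060_kwh = daily_kwh(120)  # 120 W * 24 h = 2.88 kWh per day

# Average draw quoted in this thread for a Raspberry Pi Zero 2 W.
pi_zero_kwh = daily_kwh(1)  # 1 W * 24 h = 0.024 kWh per day
```

Multiplying either result by a local electricity price gives an approximate daily running cost.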
by rocauc on 10/8/2024, 5:12:22 AM
A suggestion: I'd swap llava for Florence-2 for your open set text description. Florence-2 seems uniformly more descriptive in its outputs.
by xrd on 10/8/2024, 11:53:03 AM
I'm confused about why you need yolo and llava. Can't you simply use yolo without a multimodal LLM? What does that add? You can use yolo to detect and grab screen coordinates on its own, right?
by vaylian on 10/8/2024, 9:53:26 AM
Hello from the privacy crowd! Please use this responsibly. Tech can be a lot of fun and I encourage you to play around with things and I appreciate it when you push the boundaries of what is technically feasible. But please be mindful that surveillance tech can also be used to oppress people and infringe on their freedoms. Use tech for good!
by matrik on 10/8/2024, 5:11:28 PM
MobileNetV3 and EfficientDet are other possible alternatives to YOLO. I was able to get higher than 1.5 FPS on a Raspberry Pi Zero 2 W, which draws 1 W on average. With an efficient queuing approach, one can eliminate all bottlenecks.
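The comment does not say which queuing approach was used. One common pattern, sketched here purely as an assumption, is a small bounded queue between the capture loop and the inference loop that discards the oldest frame when full, so a slow detector never blocks the camera and always works on the freshest frame. The class name `LatestFrameQueue` and the `dropped` counter are hypothetical, and the sketch assumes a single producer thread.

```python
import queue


class LatestFrameQueue:
    """Bounded queue that drops the oldest frame instead of blocking the producer."""

    def __init__(self, maxsize: int = 1):
        self._q = queue.Queue(maxsize=maxsize)
        self.dropped = 0  # number of stale frames discarded so far

    def put(self, frame) -> None:
        """Add a frame; if the queue is full, evict the oldest frame first."""
        while True:
            try:
                self._q.put_nowait(frame)
                return
            except queue.Full:
                try:
                    self._q.get_nowait()  # discard the stale frame
                    self.dropped += 1
                except queue.Empty:
                    # The consumer emptied the queue in the meantime; retry the put.
                    pass

    def get(self, timeout=None):
        """Block until a frame is available (or raise queue.Empty on timeout)."""
        return self._q.get(timeout=timeout)
```

In use, a capture thread would call `put(frame)` for every frame it reads, while the inference thread calls `get()` in a loop; with `maxsize=1` the detector only ever sees the most recent frame.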
by ferar on 10/8/2024, 4:09:30 AM
Can you specify the ideal hardware (camera, computer) for deploying the solution? Thanks
by doctorhandshake on 10/8/2024, 10:38:23 AM
>> It calculates the center of every detection box, pinpoint on screen and gives 16px tolerance on all directions. Script tries to find closest object as fallback and creates a new object in memory in last resort. You can observe persistent objects in /elements folder
I’ve never implemented this kind of object persistence algo - is this a good approach? Seems naive but maybe that’s just because it’s simple.
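The quoted matching steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class and method names are hypothetical, and so is the `max_fallback_px` cap on the closest-object fallback (the quote does not say how far away the "closest object" may be). Only the 16 px tolerance comes from the quote.

```python
import math
from dataclasses import dataclass
from itertools import count

TOLERANCE_PX = 16  # tolerance in all directions, as quoted above


@dataclass
class TrackedObject:
    object_id: int
    cx: float
    cy: float


class CentroidTracker:
    """Keeps object identities across frames by matching detection-box centers."""

    def __init__(self, tolerance=TOLERANCE_PX, max_fallback_px=64):
        self.tolerance = tolerance
        self.max_fallback_px = max_fallback_px  # assumed cap, not from the quote
        self.objects = []
        self._ids = count(1)

    def update(self, box):
        """box is (x1, y1, x2, y2). Returns the matched or newly created object."""
        x1, y1, x2, y2 = box
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2

        def dist(obj):
            return math.hypot(obj.cx - cx, obj.cy - cy)

        # Step 1: known objects whose center lies within the tolerance window.
        near = [o for o in self.objects
                if abs(o.cx - cx) <= self.tolerance and abs(o.cy - cy) <= self.tolerance]
        match = min(near, key=dist) if near else None

        # Step 2: fall back to the closest known object, if it is close enough.
        if match is None and self.objects:
            closest = min(self.objects, key=dist)
            if dist(closest) <= self.max_fallback_px:
                match = closest

        # Step 3: last resort, create a new object in memory.
        if match is None:
            match = TrackedObject(next(self._ids), cx, cy)
            self.objects.append(match)

        match.cx, match.cy = cx, cy  # remember the latest position
        return match
```

Calling `update` once per detection box per frame returns an object with a stable `object_id`. Greedy center matching like this is simple, but it can swap identities when two objects pass close to each other; matching on box overlap (IoU) is a common next step.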
by nikolayasdf123 on 10/8/2024, 6:37:23 AM
How about Llama 3.2 Vision? Would it give better performance?
by _giorgio_ on 10/8/2024, 2:53:54 AM
All I see, usually, is some AI YOLO algorithm applied to an offline video.
This is the first time that I've seen a "complete" setup. Any info to learn more on applying YOLO and similar models to real time streams (whatever the format)?
by anshumankmr on 10/8/2024, 1:28:26 PM
Could try with Florence by Microsoft instead of Yolo and Llava, though the results are not going to be as great. Florence will do the inference on CPU. This is just for fun.