Violence detection in videos is a challenging task that has attracted considerable attention in the research community. In this paper, we propose a three-stream network framework for violence detection in binocular stereo vision. To capture complementary information from the video, we use appearance, motion, and depth information. For the spatial stream, we use the RGB image as the appearance of each individual frame. For motion, we use a sparse stereo matching method to extract feature points and obtain the disparity of each point. The 3D coordinates of the points are then calculated using standard 3D measurement theory. The resulting 3D motion vectors convey the movement of both the camera and the objects and serve as the motion information. In addition, the depth information stream is the third input of the network, which can improve the recognition rate.
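
The abstract does not spell out the "standard 3D measurement theory" it relies on. As a minimal sketch, assuming a rectified binocular pair and a pinhole camera model, a matched feature point can be back-projected from its pixel position and disparity, and a 3D motion vector taken as the displacement of that point between two frames. The function names and the parameters `focal_px`, `baseline_m`, `cx`, and `cy` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]


def triangulate(u: float, v: float, disparity: float,
                focal_px: float, baseline_m: float,
                cx: float, cy: float) -> Point3D:
    """Back-project a left-image pixel (u, v) with known disparity to 3D.

    Assumes a rectified stereo pair and a pinhole model (an assumption,
    not stated in the abstract): Z = f * B / d.
    """
    if disparity <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    z = focal_px * baseline_m / disparity   # depth along the optical axis
    x = (u - cx) * z / focal_px             # horizontal offset from the axis
    y = (v - cy) * z / focal_px             # vertical offset from the axis
    return (x, y, z)


def motion_vector(p_prev: Point3D, p_curr: Point3D) -> Point3D:
    """3D displacement of one tracked feature point between two frames."""
    return (p_curr[0] - p_prev[0],
            p_curr[1] - p_prev[1],
            p_curr[2] - p_prev[2])


# Illustrative values only: 500 px focal length, 0.1 m baseline,
# principal point at the centre of a 640x480 image.
p0 = triangulate(320, 240, 10, focal_px=500, baseline_m=0.1, cx=320, cy=240)
p1 = triangulate(420, 240, 20, focal_px=500, baseline_m=0.1, cx=320, cy=240)
flow3d = motion_vector(p0, p1)
```

Because the displacement is measured in camera coordinates, it mixes camera motion with object motion, which matches the abstract's description of the motion vector as conveying both.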