Automatic recognition of semantic events from video is an important problem in computer vision and a key issue in many practical applications. To tackle this problem, we propose an approach to visual event recognition based on an extended stochastic grammar. The major contributions of this thesis are as follows:

1. By analyzing motion trajectories, we propose a semantic-point-guided approach to learning the primitives of the grammar system. First, a number of semantic points in the video scene are acquired, either by semantic scene modeling or manually. Trajectory clustering is then performed to obtain basic motion patterns that serve as the primitives.

2. We propose a rule-induction algorithm based on the Minimum Description Length (MDL) principle to obtain a set of event rules. First, we extend the original grammar with Allen's temporal logic to represent parallel relations between sub-events. Then, guided by the heuristic that events similar to each other should preferentially be combined into a new concept, we propose a multi-level induction strategy for the rule-induction process. We also present a coding scheme for encoding the rules and the input event stream when applying the MDL principle. Experimental results on gymnastic exercises and traffic events demonstrate that our method learns event structures effectively.

3. Based on the extended grammar representation, we propose a multi-thread parsing algorithm to recognize complex events in a given primitive stream. By relaxing the ID-set constraint, the parsing algorithm can handle parallel temporal relations between sub-events. Additionally, a Viterbi-like error-recovery strategy is embedded in the parsing process to correct large-scale errors, such as insertion and deletion errors. Extensive experiments on gymnastic exercises and traffic events demonstrate the effectiveness of our method.
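To illustrate the primitive-learning step in contribution 1, the following sketch resamples variable-length motion trajectories to fixed-size vectors and clusters them so that each cluster center becomes one motion-pattern primitive. This is an illustrative reconstruction, not the thesis's exact algorithm: the resampling length, the use of k-means with farthest-point initialisation, and all function names are our own assumptions.

```python
import numpy as np

def resample(traj, n=16):
    """Resample a 2-D trajectory (sequence of (x, y) points) to n points
    spaced evenly by arc length, so trajectories of different lengths
    become comparable fixed-size feature vectors."""
    traj = np.asarray(traj, dtype=float)
    seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])      # cumulative arc length
    t = np.linspace(0.0, s[-1], n)
    x = np.interp(t, s, traj[:, 0])
    y = np.interp(t, s, traj[:, 1])
    return np.stack([x, y], axis=1).ravel()          # flatten to a 2n-vector

def cluster_primitives(X, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation;
    each resulting centroid is one motion-pattern primitive
    (a prototype trajectory)."""
    C = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in C], axis=0)
        C.append(X[np.argmax(d)])                    # farthest point so far
    C = np.array(C)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                      else C[j] for j in range(k)])
    return C, labels
```

In use, every observed trajectory is mapped to its nearest centroid, and the resulting centroid index sequence is the primitive stream fed to the grammar.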
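Contribution 2 extends the grammar with Allen's temporal logic. As a minimal sketch of that ingredient (not the thesis's implementation), the function below classifies the relation between two sub-event intervals using the seven basic Allen relations; the six inverse relations are reported collectively and can be named by swapping the arguments.

```python
def allen_relation(a, b):
    """Classify interval a = (start, end) against interval b using the
    seven basic Allen relations."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "before"
    if a2 == b1:
        return "meets"
    if a1 < b1 < a2 < b2:
        return "overlaps"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1 and a2 < b2:
        return "starts"
    if b1 < a1 and a2 == b2:
        return "finishes"
    if b1 < a1 and a2 < b2:
        return "during"
    return "inverse"   # one of the six inverse relations (swap a and b)
```

With such a predicate, a grammar rule can require, for example, that two sub-events stand in an "overlaps" or "during" relation rather than strict sequential order, which is what allows parallel sub-events to be represented.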
4. To further validate the capability of the proposed method, we apply the extended stochastic grammar system to the recognition of multi-agent interactions. Unlike primitive learning in a static scene, we first attach a local relative coordinate frame to each moving object. The motion trajectories of the other moving objects are then projected into this local frame, and a set of interaction primitives (IPs) is acquired by clustering the motion points in the local frame. Finally, a multi-agent interaction is represented as a primitive stream composed of several IP strings in different local frames, so the extended grammar system can be used to model and recognize the interaction. Satisfactory experimental results in recognizing five types of interactions validate the proposed method.
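The projection into a local relative frame described in contribution 4 can be sketched as follows. This is an illustrative assumption about the frame convention (origin at the reference object, local x-axis along its heading); the function name and parameters are hypothetical.

```python
import numpy as np

def to_local_frame(ref_pos, ref_heading, points):
    """Project world-coordinate points of another moving object into the
    local frame of a reference object: origin at ref_pos, local x-axis
    aligned with the reference object's heading (in radians)."""
    c, s = np.cos(ref_heading), np.sin(ref_heading)
    R = np.array([[c, s], [-s, c]])                  # world -> local rotation
    return (np.asarray(points, dtype=float) - np.asarray(ref_pos)) @ R.T
```

Clustering the projected points then yields interaction primitives that are invariant to where and in which direction the reference object moves, which is what makes the same grammar machinery reusable for multi-agent interactions.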