Social-networking services (SNS) are a very popular class of internet application. They not only connect web users based on their real social relations or common interests, but also provide on-line communication and interactive functions. On these services, information acquisition becomes more diversified, and information diffuses faster and more widely than ever before. Among all the kinds of information that flow through an SNS platform, news is one of the most important, and mining news information on SNS platforms has already attracted the attention of researchers. In this thesis, we investigate several key problems of news mining based on a large-scale dataset from a real social-networking service. The main contributions of this thesis are as follows:

Retrospective news event detection: Retrospective detection aims to detect news events by looking back over existing news data. Previous studies of this problem mainly dealt with clean news data. However, the SNS dataset we use is very noisy: it contains many other kinds of information besides news. We therefore describe a novel framework to solve this problem. First, we use latent Dirichlet allocation (LDA) to model the dataset. Then, we propose the concept of a word-meaning unit. After that, we monitor the frequency of every word-meaning unit in different time slices, and select the most news-related data according to the characteristics of news events. Finally, a two-level hybrid clustering algorithm groups the selected data into different news events. Experimental results show that the framework is simple and effective.

On-line news event detection and tracking: On-line detection and tracking must process the data stream in real time. Traditionally, the single-pass clustering method is used to solve this problem. However, the original single-pass clustering method treats all news events in the same way, with a fixed threshold and a constant life span, which is not very reasonable. We therefore first propose an energy function that models the life cycle of a news event; its value indicates the relative hotness of the event. Then, we improve the original single-pass clustering method based on the energy function. Experimental results show that the proposed method improves both precision and recall at very little extra time cost, so it still satisfies the real-time requirement.

Trend analysis of news events and topics: First,...
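To make the on-line detection idea above more concrete, the following is a minimal sketch of single-pass clustering in which each event carries an energy value instead of a constant life span. The abstract does not give the form of the energy function, so everything specific here is an assumption made for illustration: the multiplicative decay per time slice, the fixed gain added when a document joins an event, the retirement floor, the cosine similarity measure, and all names (`EnergySinglePass`, `tick`, `add`, `sim_threshold`, `decay`, `gain`, `min_energy`). It should not be read as the method actually used in the thesis.

```python
import math
from collections import Counter
from dataclasses import dataclass


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Event:
    centroid: Counter      # summed term counts of the member documents
    energy: float          # assumed "relative hotness" score
    size: int = 1


class EnergySinglePass:
    """Single-pass clustering with an assumed decay-and-gain energy model."""

    def __init__(self, sim_threshold=0.3, decay=0.5, gain=1.0, min_energy=0.2):
        self.sim_threshold = sim_threshold  # minimum similarity to join an event
        self.decay = decay                  # assumed energy decay per time slice
        self.gain = gain                    # assumed energy added per new document
        self.min_energy = min_energy        # events below this are retired
        self.active = []
        self.retired = []

    def tick(self):
        """Advance one time slice: decay energies and retire cold events."""
        still_active = []
        for ev in self.active:
            ev.energy *= self.decay
            if ev.energy >= self.min_energy:
                still_active.append(ev)
            else:
                self.retired.append(ev)
        self.active = still_active

    def add(self, tokens):
        """Assign one document to the most similar active event, or start a new one."""
        doc = Counter(tokens)
        best, best_sim = None, 0.0
        for ev in self.active:
            sim = cosine(doc, ev.centroid)
            if sim > best_sim:
                best, best_sim = ev, sim
        if best is not None and best_sim >= self.sim_threshold:
            best.centroid.update(doc)   # fold the document into the event
            best.size += 1
            best.energy += self.gain    # a hot event stays alive longer
            return best
        ev = Event(centroid=doc, energy=self.gain)
        self.active.append(ev)
        return ev
```

In this sketch, an event that keeps receiving documents accumulates energy and therefore survives more time slices than an event that receives none, which is one simple way to avoid giving every event the same life span. A caller would invoke `add` for each incoming document and `tick` once per time slice.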