We introduce a method for finding duplicates in machine-coded event data, with potential applications in several high-profile data projects. First, we select likely duplicates based on temporal and spatial proximity. In a second step, we use Natural Language Processing tools to generate various distance metrics for the text associated with each event. These distances serve as the basis for multivariate classification models that are trained on a subset of human-coded duplicates. We apply this method to two empirical samples from the ICEWS data collection and achieve strong classification performance at a low rate of false positives.
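The first two steps can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Event` fields, the one-day and 50 km thresholds, the two text distances (token Jaccard and a `difflib` sequence distance), and the toy records are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher
from itertools import combinations
from math import asin, cos, radians, sin, sqrt


@dataclass
class Event:
    # Hypothetical record layout; real event data has more fields.
    event_id: str
    day: date
    lat: float
    lon: float
    text: str


def haversine_km(a, b):
    """Great-circle distance between two events in kilometres."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def candidate_pairs(events, max_days=1, max_km=50.0):
    """Step 1: keep pairs that are close in both time and space.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    return [
        (a, b)
        for a, b in combinations(events, 2)
        if abs((a.day - b.day).days) <= max_days and haversine_km(a, b) <= max_km
    ]


def text_distances(a, b):
    """Step 2: simple text distances for one candidate pair (0 = identical)."""
    ta, tb = set(a.text.lower().split()), set(b.text.lower().split())
    union = ta | tb
    jaccard = 1.0 - len(ta & tb) / len(union) if union else 0.0
    sequence = 1.0 - SequenceMatcher(None, a.text.lower(), b.text.lower()).ratio()
    return {"jaccard": jaccard, "sequence": sequence}


# Invented toy records, for illustration only.
events = [
    Event("e1", date(2014, 3, 1), 50.45, 30.52, "Protesters clash with police in Kyiv"),
    Event("e2", date(2014, 3, 1), 50.45, 30.52, "Police clash with protesters in Kyiv"),
    Event("e3", date(2014, 3, 20), 48.85, 2.35, "Ministers meet in Paris"),
]

pairs = candidate_pairs(events)
features = [(a.event_id, b.event_id, text_distances(a, b)) for a, b in pairs]
```

In a full pipeline, each row of `features` would be one observation for a classifier trained on human-coded duplicate labels; that step is omitted here because it depends on the labelled sample.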