Finding Duplicates in Machine-Coded Event Data

We introduce a method for finding duplicates in machine-coded event data, with potential applications in several high-profile data projects. First, we select likely duplicates based on temporal and spatial proximity. In a second step, we rely on natural language processing tools to generate various distance metrics for the text associated with each event. These distances serve as the basis for multivariate classification models that are trained on a subset of human-coded duplicates. We apply this method to two empirical samples from the ICEWS data collection and achieve strong classification performance at a low rate of false positives.
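To make the two steps concrete, the following is a minimal sketch and not the implementation used in this work. The `Event` record layout, the thresholds `max_days` and `max_km`, the two text distances (token Jaccard and a character-level sequence distance), and the toy records are all assumptions chosen for illustration. The resulting distance features would then be passed, together with human-coded labels, to a classifier of one's choice.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher
from itertools import combinations
from math import asin, cos, radians, sin, sqrt


@dataclass
class Event:
    # Hypothetical record layout for illustration; not the ICEWS schema.
    event_id: int
    day: date
    lat: float
    lon: float
    text: str


def haversine_km(a: Event, b: Event) -> float:
    """Great-circle distance between two events in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def candidate_pairs(events, max_days=1, max_km=50.0):
    """Step 1: keep only pairs that are close in both time and space.

    The thresholds are assumed values, not ones taken from the paper.
    """
    for a, b in combinations(events, 2):
        if abs((a.day - b.day).days) <= max_days and haversine_km(a, b) <= max_km:
            yield a, b


def text_distances(a: Event, b: Event) -> dict:
    """Step 2: compute simple text distance features for one candidate pair."""
    tokens_a = set(a.text.lower().split())
    tokens_b = set(b.text.lower().split())
    union = tokens_a | tokens_b
    # Jaccard distance on word sets; two empty texts count as identical.
    jaccard = 1.0 - (len(tokens_a & tokens_b) / len(union) if union else 1.0)
    # Character-level distance derived from difflib's similarity ratio.
    sequence = 1.0 - SequenceMatcher(None, a.text.lower(), b.text.lower()).ratio()
    return {"jaccard": jaccard, "sequence": sequence}


# Invented toy records, used only to exercise the functions above.
events = [
    Event(1, date(2014, 3, 1), 50.45, 30.52, "Protesters clashed with police in the capital"),
    Event(2, date(2014, 3, 1), 50.45, 30.52, "Police clashed with protesters in the capital"),
    Event(3, date(2014, 6, 9), 33.31, 44.36, "Officials met to discuss the budget"),
]

pairs = [(a.event_id, b.event_id, text_distances(a, b)) for a, b in candidate_pairs(events)]
```

Filtering by proximity first keeps the number of pairwise text comparisons manageable, since the full set of pairs grows quadratically with the number of events.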