Detecting Document Versions and Their Ordering in a Collection
Journal
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ISSN
0302-9743
Date Issued
2021-01-01
Author(s)
Modani, Natwar
Maurya, Anurag
Verma, Gaurav
Nair, Inderjeet
Patil, Vaidehi
Kanfade, Anirudh
Abstract
Given the iterative and collaborative nature of authoring and the need to adapt documents for different audiences, people end up with a large number of versions of their documents. These additional versions increase the cognitive effort humans need for various tasks (such as finding the latest version of a document, or organizing documents), and may degrade the performance of machine tasks such as clustering or recommendation of documents. To the best of our knowledge, the task of identifying and ordering the versions of documents in a collection has not been addressed in prior literature. In this paper, we propose a three-stage approach for identifying versions and ordering them correctly. We also create a novel dataset for this purpose from Wikipedia, which we are releasing to the research community (https://github.com/natwar-modani/versions). We show that our proposed approach significantly outperforms the state-of-the-art approach adapted for this task from the closest previously known task, Near Duplicate Detection, which justifies defining this problem as a novel challenge.
Volume
13081 LNCS
Subjects