The cleanup uses a greedy heuristic: start with the short text blocks on
the first page, then iterate over all other pages and try to build, for
each starting block, a sequence of the most similar TextBlocks. If a
similar TextBlock is found on all (or at least most) pages, the
information is redundant and those TextBlocks can be excluded from the
document.
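
The heuristic can be sketched roughly as follows. This is a minimal
illustration, not the project's actual implementation: pages are modelled as
plain lists of strings, and the function names, the difflib-based similarity
measure, and the thresholds (max_len, min_similarity, min_coverage) are all
assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for the real metric (assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def find_redundant_blocks(pages, max_len=80, min_similarity=0.8, min_coverage=0.8):
    """Return the set of (page_index, block_index) pairs judged redundant.

    All parameter names and default values are hypothetical.
    """
    redundant = set()
    if len(pages) < 2:
        return redundant
    for i, seed in enumerate(pages[0]):
        if len(seed) > max_len:
            continue  # only short blocks on the first page start a chain
        chain = [(0, i)]
        current = seed
        for p in range(1, len(pages)):
            # Greedy step: pick the block on this page most similar to the
            # last block added to the chain.
            best_j, best_score = None, 0.0
            for j, block in enumerate(pages[p]):
                if (p, j) in redundant:
                    continue  # already claimed by an earlier chain
                score = similarity(current, block)
                if score > best_score:
                    best_j, best_score = j, score
            if best_j is not None and best_score >= min_similarity:
                chain.append((p, best_j))
                current = pages[p][best_j]
        # The chain counts as redundant if it covers all or most pages.
        if len(chain) >= min_coverage * len(pages):
            redundant.update(chain)
    return redundant


def remove_redundant_blocks(pages, **kwargs):
    """Return a copy of pages without the blocks judged redundant."""
    redundant = find_redundant_blocks(pages, **kwargs)
    return [
        [block for j, block in enumerate(page) if (p, j) not in redundant]
        for p, page in enumerate(pages)
    ]
```

Comparing each candidate against the previously chosen block, rather than
against the first-page block, lets slowly changing content such as page
numbers stay in one chain. Whether the original code does the same is not
stated above.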
Copyright (C) 2013, 2014 Raphael Dickfelder, Jan Göpfert, Benjamin Paassen, Andreas Stöckel, licensed under the AGPL v. 3: http://openresearch.cit-ec.de/projects/scie