The cleanup uses a greedy heuristic: start with the short text blocks on
the first page, then iterate over all other pages and, for each of those
blocks, try to build a sequence of the TextBlocks most similar to it.
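
The heuristic described above could be sketched roughly as follows. This
is only an illustration under stated assumptions, not the project's
actual implementation: text blocks are modelled as plain strings, and the
names build_sequences, max_len and min_similarity as well as the use of
difflib's similarity ratio are hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; the choice of difflib is an assumption."""
    return SequenceMatcher(None, a, b).ratio()


def build_sequences(pages, max_len=60, min_similarity=0.8):
    """Greedily chain similar blocks across pages.

    pages is a list of pages, each page a list of text-block strings.
    max_len (what counts as a "short" block) and min_similarity are
    hypothetical parameters, not taken from the original text.
    """
    if not pages:
        return []
    sequences = []
    for seed in pages[0]:
        if len(seed) > max_len:
            continue  # only short blocks on the first page are seeds
        chain = [seed]
        for page in pages[1:]:
            if not page:
                continue  # nothing to match on an empty page
            # Greedy step: take the block on this page closest to the seed.
            best = max(page, key=lambda block: similarity(seed, block))
            if similarity(seed, best) >= min_similarity:
                chain.append(best)
        sequences.append(chain)
    return sequences
```

A chain that spans most pages (for example a running title that differs
only in its page number) would be a plausible candidate for cleanup; how
the resulting sequences are actually used is not specified here.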
Copyright (C) 2013, 2014 Raphael Dickfelder, Jan Göpfert, Benjamin Paassen, Andreas Stöckel, licensed under the AGPL v. 3: http://openresearch.cit-ec.de/projects/scie