Class | Description |
---|---|
App | The command-line (executable) interface of the extractor. |
DocumentBlockCleaner | |
Histogramm<H> | A convenience implementation for histograms. |
ParagraphEstimator | Estimates whether a line break also indicates a new paragraph. |
PDFStructuredTextExtractor | Takes a PDF file as input and extracts its text into an HTML-like hierarchical object structure (see the package "structure" for the classes themselves). |
PreTextBlock | A PreTextBlock represents a ThreadBead with some additional information. |
PreTextLine | Aggregates all TextPosition objects that are part of one line. |
StringSimilarity | Implements an algorithm that determines the similarity between strings using an alignment/edit-distance approach. |
TextBlockRankEstimator | Determines whether the usual font size of a TextBlock is larger than, equal to, or smaller than the usual font size of the whole page. |
VerticalAlignmentEstimator | Determines the vertical alignment of a given glyph relative to the line it is part of. |
WhiteSpaceEstimator | This is based on the work of Ben Litchfield in the PDFTextStripper of Apache PDFBox. |
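
The StringSimilarity entry above mentions an alignment/edit-distance approach without showing it. As a rough illustration of that general idea, the following sketch computes the plain Levenshtein distance and normalizes it to a score between 0 and 1. The class name `EditDistanceSimilarity`, its method names, and the normalization by the longer string's length are assumptions made for this example; the actual StringSimilarity class may use different costs, a different alignment scheme, or a different score.

```java
/**
 * Hypothetical illustration of an edit-distance based similarity.
 * This is NOT the StringSimilarity class from this package.
 */
public final class EditDistanceSimilarity {

    private EditDistanceSimilarity() {
        // static helpers only
    }

    /**
     * Levenshtein distance: the minimum number of single-character
     * insertions, deletions and substitutions that turn a into b.
     */
    public static int distance(String a, String b) {
        // Only two rows of the dynamic-programming table are kept.
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitution = previous[j - 1] + cost;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    /**
     * Assumed normalization: 1.0 for identical strings, 0.0 when every
     * position of the longer string has to be edited.
     */
    public static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) distance(a, b) / longest;
    }
}
```

Under these assumptions, `EditDistanceSimilarity.distance("kitten", "sitting")` is 3, and the corresponding similarity is 1 - 3/7. Keeping only two rows of the table limits memory use to the length of the second string, which matters when many line pairs are compared.
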
Copyright (C) 2013, 2014 Raphael Dickfelder, Jan Göpfert, Benjamin Paassen, Andreas Stöckel, licensed under the AGPL v. 3: http://openresearch.cit-ec.de/projects/scie