A B C D E F G H I L M N O P R S T U V W Y 

A

AbstractLineSegment - Interface in de.unibi.techfak.scie.pdfextractor.structure
The AbstractLineSegment interface represents a simple line in one dimensional space, given by a start position and an end position.
addAll(Histogramm<H>) - Method in class de.unibi.techfak.scie.pdfextractor.Histogramm
 
addBlock(TextBlock, PreTextBlock) - Method in class de.unibi.techfak.scie.pdfextractor.TextBlockRankEstimator
 
addDataPoint(H) - Method in class de.unibi.techfak.scie.pdfextractor.Histogramm
 
addElement(TextPosition) - Method in class de.unibi.techfak.scie.pdfextractor.PreTextLine
 
addLine(PreTextLine) - Method in class de.unibi.techfak.scie.pdfextractor.PreTextBlock
 
addTextPosition(TextPosition) - Method in class de.unibi.techfak.scie.pdfextractor.PreTextBlock
 
App - Class in de.unibi.techfak.scie.pdfextractor
This is the command-line/executable interface of the extractor.
App() - Constructor for class de.unibi.techfak.scie.pdfextractor.App
 

B

begin - Variable in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Start position of the line.
blockCleanup(Document) - Method in class de.unibi.techfak.scie.pdfextractor.DocumentBlockCleaner
The cleanup is done using a greedy heuristic as follows: start with the short text blocks on the first page, then iterate over all other pages and try to build, for each of them, a sequence of the most similar TextBlocks.
boundariesEqual(AbstractLineSegment, AbstractLineSegment) - Static method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Returns true if the boundaries of the two line segments are equal.
boundariesEqual(AbstractLineSegment) - Method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Returns true if the boundaries of the two line segments are equal.
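
The greedy heuristic described for blockCleanup(Document) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: pages are plain lists of strings rather than Page/TextBlock objects, the similarity measure is a simple positional match rather than StringSimilarity, the two constants only borrow their names from SMALLBLOCKSIZE and REMOVETHRESHOLD, and removing a chain once its average similarity reaches the threshold is a guess at what the cleanup does with the sequence it builds.

```java
import java.util.ArrayList;
import java.util.List;

final class GreedyCleanupSketch {
    // Hypothetical values; the real constants are not documented in this index.
    static final int SMALL_BLOCK_SIZE = 40;
    static final double REMOVE_THRESHOLD = 0.8;

    private GreedyCleanupSketch() {}

    /** Stand-in similarity: fraction of positions at which both strings agree. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        int same = 0;
        for (int i = 0; i < Math.min(a.length(), b.length()); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        return (double) same / longest;
    }

    /** Returns a copy of the pages with repeated short blocks removed. */
    static List<List<String>> cleanup(List<List<String>> pages) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> page : pages) {
            result.add(new ArrayList<>(page));
        }
        if (pages.size() < 2) {
            return result;
        }
        for (String seed : pages.get(0)) {
            if (seed.length() > SMALL_BLOCK_SIZE) {
                continue; // only short blocks start a chain
            }
            List<String> chain = new ArrayList<>(); // chain.get(p - 1) is the pick on page p
            String current = seed;
            double total = 0.0;
            for (int p = 1; p < pages.size(); p++) {
                String best = null;
                double bestScore = -1.0;
                // Greedy step: take the block most similar to the previous pick.
                for (String candidate : pages.get(p)) {
                    double score = similarity(current, candidate);
                    if (score > bestScore) {
                        bestScore = score;
                        best = candidate;
                    }
                }
                chain.add(best); // null if the page has no blocks
                if (best != null) {
                    total += bestScore;
                    current = best;
                }
            }
            // Assumed removal rule: drop the chain if it is similar enough on average.
            if (total / (pages.size() - 1) >= REMOVE_THRESHOLD) {
                result.get(0).remove(seed);
                for (int p = 1; p < pages.size(); p++) {
                    result.get(p).remove(chain.get(p - 1));
                }
            }
        }
        return result;
    }
}
```

With three pages that each carry a running header such as "Journal of X, p. 1" plus one long body block, this sketch strips the headers and keeps the body blocks.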

C

calculate(String, String) - Method in class de.unibi.techfak.scie.pdfextractor.StringSimilarity
This implements an algorithm to determine the similarity between Strings by utilizing an alignment/edit distance approach.
calculateAlignment(TextPosition) - Method in class de.unibi.techfak.scie.pdfextractor.VerticalAlignmentEstimator
 
content - Variable in class de.unibi.techfak.scie.pdfextractor.PreTextLine
 
content - Variable in class de.unibi.techfak.scie.pdfextractor.structure.Document
 
content - Variable in class de.unibi.techfak.scie.pdfextractor.structure.Page
 
content - Variable in class de.unibi.techfak.scie.pdfextractor.structure.Paragraph
This is the Text content of this Paragraph.
content - Variable in class de.unibi.techfak.scie.pdfextractor.structure.TextBlock
This is the actual content of the TextBlock.
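
The entry for calculate(String, String) mentions an alignment/edit distance approach without giving details. As a rough illustration only, the sketch below uses the classic Levenshtein distance and turns it into a score between 0 and 1. The class name, the method names and the normalization by the longer string are assumptions, not the documented behavior of StringSimilarity.

```java
final class EditDistanceSketch {
    private EditDistanceSketch() {}

    /** Levenshtein distance computed with two rolling rows. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty prefix takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 1.0 for identical strings, 0.0 for entirely different ones. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) distance(a, b) / longest;
    }
}
```

A normalized score like this makes a fixed threshold usable for strings of different lengths, which is one plausible reason to prefer it over the raw distance.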

D

de.unibi.techfak.scie.pdfextractor - package de.unibi.techfak.scie.pdfextractor
 
de.unibi.techfak.scie.pdfextractor.structure - package de.unibi.techfak.scie.pdfextractor.structure
 
Document - Class in de.unibi.techfak.scie.pdfextractor.structure
This represents a parsed document which is defined as a sequence of pages.
Document() - Constructor for class de.unibi.techfak.scie.pdfextractor.structure.Document
 
DocumentBlockCleaner - Class in de.unibi.techfak.scie.pdfextractor
 
DocumentBlockCleaner() - Constructor for class de.unibi.techfak.scie.pdfextractor.DocumentBlockCleaner
 
doImport() - Method in class de.unibi.techfak.scie.pdfextractor.PDFStructuredTextExtractor
 

E

end - Variable in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
End position of the line.
equals(Object) - Method in class de.unibi.techfak.scie.pdfextractor.structure.Document
 
equals(Object) - Method in class de.unibi.techfak.scie.pdfextractor.structure.Page
 
equals(Object) - Method in class de.unibi.techfak.scie.pdfextractor.structure.Paragraph
 
equals(Object) - Method in class de.unibi.techfak.scie.pdfextractor.structure.Text
 
equals(Object) - Method in class de.unibi.techfak.scie.pdfextractor.structure.TextBlock
 

F

fontHisto - Variable in class de.unibi.techfak.scie.pdfextractor.PreTextLine
 
fontSizeHisto - Variable in class de.unibi.techfak.scie.pdfextractor.PreTextLine
 

G

getAverage() - Method in class de.unibi.techfak.scie.pdfextractor.Histogramm
This only works if the given class type is a number.
getBackingMap() - Method in class de.unibi.techfak.scie.pdfextractor.Histogramm
 
getBegin() - Method in interface de.unibi.techfak.scie.pdfextractor.structure.AbstractLineSegment
Returns the start index of the word in the text.
getBegin() - Method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Returns the begin position of the line.
getEnd() - Method in interface de.unibi.techfak.scie.pdfextractor.structure.AbstractLineSegment
Returns the end index of the word in the text.
getEnd() - Method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Returns the end position of the line.
getFontName() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Text
Get the value of fontName
getFontSize() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Text
Get the value of fontSize
getMaxElement() - Method in class de.unibi.techfak.scie.pdfextractor.Histogramm
Returns the element that was counted the most.
getNumber(H) - Method in class de.unibi.techfak.scie.pdfextractor.Histogramm
 
getPageNumber() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Page
Get the value of pageNumber
getRelativeFontSize() - Method in class de.unibi.techfak.scie.pdfextractor.structure.TextBlock
 
getRelativeFontSize(TextBlock) - Method in class de.unibi.techfak.scie.pdfextractor.TextBlockRankEstimator
Returns the relative font size of this block in relation to the whole page.
getSize() - Method in class de.unibi.techfak.scie.pdfextractor.PreTextBlock
 
getText() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Text
Get the value of text
getVerticalAlignment() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Text
Get the value of verticalAlignment
getX_end() - Method in class de.unibi.techfak.scie.pdfextractor.PreTextLine
 
getX_start() - Method in class de.unibi.techfak.scie.pdfextractor.PreTextLine
 

H

hashCode() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Document
 
hashCode() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Page
 
hashCode() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Paragraph
 
hashCode() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Text
 
hashCode() - Method in class de.unibi.techfak.scie.pdfextractor.structure.TextBlock
 
hasWhiteSpace(TextPosition) - Method in class de.unibi.techfak.scie.pdfextractor.WhiteSpaceEstimator
 
Histogramm<H> - Class in de.unibi.techfak.scie.pdfextractor
A convenience implementation for histograms.
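
To make the Histogramm entries in this index easier to picture, here is a minimal sketch of a counting histogram. The method names mirror addDataPoint, getNumber, getMaxElement and getAverage from the index, but the class name, the backing HashMap and all method bodies are assumptions; the actual implementation may differ, for example in how ties or empty histograms are handled.

```java
import java.util.HashMap;
import java.util.Map;

final class HistogramSketch<H> {
    private final Map<H, Integer> counts = new HashMap<>();

    /** Counts one more occurrence of the given value. */
    void addDataPoint(H value) {
        counts.merge(value, 1, Integer::sum);
    }

    /** Returns how often the value was counted (0 if never). */
    int getNumber(H value) {
        return counts.getOrDefault(value, 0);
    }

    /** Returns the most frequent value, or null if nothing was counted. Ties are broken arbitrarily. */
    H getMaxElement() {
        H best = null;
        int bestCount = 0;
        for (Map.Entry<H, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    /** Weighted mean of all data points; only meaningful if H is a Number type. */
    double getAverage() {
        double sum = 0.0;
        long total = 0;
        for (Map.Entry<H, Integer> entry : counts.entrySet()) {
            if (!(entry.getKey() instanceof Number)) {
                throw new IllegalStateException("average requires numeric data points");
            }
            sum += ((Number) entry.getKey()).doubleValue() * entry.getValue();
            total += entry.getValue();
        }
        if (total == 0) {
            throw new IllegalStateException("histogram is empty");
        }
        return sum / total;
    }
}
```

Counting font sizes per line in such a structure and asking for the most frequent one is a simple way to obtain a "usual" font size, which is how the fontSizeHisto field is presumably meant to be used.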
Histogramm() - Constructor for class de.unibi.techfak.scie.pdfextractor.Histogramm
 

I

indexedToString(int) - Method in class de.unibi.techfak.scie.pdfextractor.structure.Document
Does the same as toString but also inserts the beginning and end index of each object's respective text representation.
indexedToString(int) - Method in class de.unibi.techfak.scie.pdfextractor.structure.Page
Does the same as toString but also inserts the beginning and end index of each object's respective text representation.
indexedToString(int) - Method in class de.unibi.techfak.scie.pdfextractor.structure.Paragraph
Does the same as toString but also inserts the beginning and end index of each object's respective text representation.
indexedToString(int) - Method in class de.unibi.techfak.scie.pdfextractor.structure.Text
Does the same as toString but also inserts the beginning and end index of each object's respective text representation.
indexedToString(int) - Method in class de.unibi.techfak.scie.pdfextractor.structure.TextBlock
Does the same as toString but also inserts the beginning and end index of each object's respective text representation.
intersection(AbstractLineSegment, AbstractLineSegment) - Static method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Returns a new line which is the intersection of the two lines.
isNewParagraph(PreTextLine) - Method in class de.unibi.techfak.scie.pdfextractor.ParagraphEstimator
 
isPartOfLine(TextPosition) - Method in class de.unibi.techfak.scie.pdfextractor.PreTextLine
 
isValid(AbstractLineSegment) - Static method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Returns true if the given line is valid (its begin is smaller than or equal to its end).
isValid() - Method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Returns true if this line is valid (its begin is smaller than or equal to its end).
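
The validity rule above combines naturally with an intersection operation: if two segments do not share any position, the computed result has begin larger than end and is therefore invalid. The sketch below shows this under the assumption that intersection means the part covered by both segments; the Segment class is a hypothetical stand-in, not the library's LineSegment.

```java
final class IntersectionSketch {
    private IntersectionSketch() {}

    /** Hypothetical stand-in for a line segment with integer bounds. */
    static final class Segment {
        final int begin;
        final int end;

        Segment(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        /** Valid means begin is smaller than or equal to end. */
        boolean isValid() {
            return begin <= end;
        }
    }

    /** Larger of the two begins and smaller of the two ends. */
    static Segment intersection(Segment a, Segment b) {
        return new Segment(Math.max(a.begin, b.begin), Math.min(a.end, b.end));
    }
}
```

For example, intersecting [0, 10] with [5, 20] gives [5, 10], while intersecting [0, 3] with [5, 8] gives begin 5 and end 3, which isValid() rejects.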

L

length() - Method in class de.unibi.techfak.scie.pdfextractor.PreTextLine
 
length(AbstractLineSegment) - Static method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Returns the length of the line segment.
length() - Method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Returns the length of the line segment.
lengthHisto - Variable in class de.unibi.techfak.scie.pdfextractor.PreTextBlock
 
lines - Variable in class de.unibi.techfak.scie.pdfextractor.PreTextBlock
 
LineSegment - Class in de.unibi.techfak.scie.pdfextractor.structure
The LineSegment class implements the AbstractLineSegment interface and adds (static) utility functions that help to compare two lines.
LineSegment() - Constructor for class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Initializes the line segment as invalid, with begin being set to INF and end being set to -INF.
LineSegment(int, int) - Constructor for class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Initializes the line segment with the given begin and end.

M

main(String[]) - Static method in class de.unibi.techfak.scie.pdfextractor.App
 
MINIMUMBLOCKSIZE - Static variable in class de.unibi.techfak.scie.pdfextractor.PreTextBlock
 
MINIMUMPARSIZE - Static variable in class de.unibi.techfak.scie.pdfextractor.PDFStructuredTextExtractor
 

N

normalize(AbstractLineSegment) - Static method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Normalizes the line segment by swapping begin and end if they are in the wrong order.
normalize() - Method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Swaps begin and end if the line is not valid.
normalizedBounds(AbstractLineSegment) - Static method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Returns the boundaries of the normalized line as a two-element array; normalization means that begin and end are swapped if begin is larger than end.
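
Normalization as described in this section can be sketched in a few lines. The class name and the choice to pass the bounds as two plain int values are assumptions made to keep the example self-contained; the library method takes an AbstractLineSegment instead.

```java
final class NormalizeSketch {
    private NormalizeSketch() {}

    /** Returns {begin, end} ordered so that the first element is never larger than the second. */
    static int[] normalizedBounds(int begin, int end) {
        return begin <= end ? new int[] {begin, end} : new int[] {end, begin};
    }
}
```

Calling normalizedBounds(7, 2) and normalizedBounds(2, 7) both yield the pair 2, 7, so later comparisons do not need to care about the original order.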

O

overlaps(AbstractLineSegment, AbstractLineSegment) - Static method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Returns true if the boundaries of the two line segments overlap.
overlaps(AbstractLineSegment) - Method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Returns true if the boundaries of the two line segments overlap.
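
A common way to test two segments for overlap is shown below. It assumes that both segments are valid and that the bounds are inclusive, so segments that merely touch at one position count as overlapping; whether LineSegment treats that boundary case the same way is not stated in this index.

```java
final class OverlapSketch {
    private OverlapSketch() {}

    /** True if [beginA, endA] and [beginB, endB] share at least one position (inclusive bounds assumed). */
    static boolean overlaps(int beginA, int endA, int beginB, int endB) {
        return beginA <= endB && beginB <= endA;
    }
}
```

The test reads as "neither segment ends before the other one starts", which also covers the case where one segment lies completely inside the other.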

P

Page - Class in de.unibi.techfak.scie.pdfextractor.structure
This represents one Page of a document, consisting of a (syntactically meaningful) sequence of TextBlock instances (e.g. columns in a two-column formatted Text).
Page() - Constructor for class de.unibi.techfak.scie.pdfextractor.structure.Page
 
Paragraph - Class in de.unibi.techfak.scie.pdfextractor.structure
This represents a paragraph of text that is defined as a sequence of Text objects that syntactically were grouped in a paragraph.
Paragraph() - Constructor for class de.unibi.techfak.scie.pdfextractor.structure.Paragraph
 
ParagraphEstimator - Class in de.unibi.techfak.scie.pdfextractor
This class is able to estimate if a line break also indicates a new paragraph.
ParagraphEstimator(PreTextBlock) - Constructor for class de.unibi.techfak.scie.pdfextractor.ParagraphEstimator
 
PDFStructuredTextExtractor - Class in de.unibi.techfak.scie.pdfextractor
This class takes a PDF file as input and extracts its text into an HTML-like hierarchical object structure (see the package "structure" for the classes themselves).
PDFStructuredTextExtractor(InputStream) - Constructor for class de.unibi.techfak.scie.pdfextractor.PDFStructuredTextExtractor
 
PreTextBlock - Class in de.unibi.techfak.scie.pdfextractor
A PreTextBlock represents a ThreadBead with some additional information.
PreTextBlock() - Constructor for class de.unibi.techfak.scie.pdfextractor.PreTextBlock
 
PreTextLine - Class in de.unibi.techfak.scie.pdfextractor
This just aggregates all TextPosition objects that are part of one line.
PreTextLine() - Constructor for class de.unibi.techfak.scie.pdfextractor.PreTextLine
 

R

REMOVETHRESHOLD - Static variable in class de.unibi.techfak.scie.pdfextractor.DocumentBlockCleaner
 

S

setBegin(int) - Method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Sets the start position of the line.
setEnd(int) - Method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Sets the end position of the line.
setFontName(String) - Method in class de.unibi.techfak.scie.pdfextractor.structure.Text
Set the value of fontName
setFontSize(float) - Method in class de.unibi.techfak.scie.pdfextractor.structure.Text
Set the value of fontSize
setPageNumber(int) - Method in class de.unibi.techfak.scie.pdfextractor.structure.Page
Set the value of pageNumber
setRelativeFontSize(double) - Method in class de.unibi.techfak.scie.pdfextractor.structure.TextBlock
 
setText(String) - Method in class de.unibi.techfak.scie.pdfextractor.structure.Text
Set the value of text
setVerticalAlignment(Text.VerticalAlignment) - Method in class de.unibi.techfak.scie.pdfextractor.structure.Text
Set the value of verticalAlignment
setX_End() - Method in class de.unibi.techfak.scie.pdfextractor.PreTextLine
 
SMALLBLOCKSIZE - Static variable in class de.unibi.techfak.scie.pdfextractor.DocumentBlockCleaner
 
split() - Method in class de.unibi.techfak.scie.pdfextractor.PreTextBlock
This is supposed to split a TextBlock representing a whole page into different blocks that might represent: columns in a two-column text, headings, footnotes, tables and figures, the document abstract, etc.
StringSimilarity - Class in de.unibi.techfak.scie.pdfextractor
This implements an algorithm to determine the similarity between Strings by utilizing an alignment/edit distance approach.
StringSimilarity() - Constructor for class de.unibi.techfak.scie.pdfextractor.StringSimilarity
 

T

Text - Class in de.unibi.techfak.scie.pdfextractor.structure
This is a wrapper class for text itself with additional information about the style of the text.
Text() - Constructor for class de.unibi.techfak.scie.pdfextractor.structure.Text
 
Text.VerticalAlignment - Enum in de.unibi.techfak.scie.pdfextractor.structure
 
TextBlock - Class in de.unibi.techfak.scie.pdfextractor.structure
This represents a syntactic block of Text, which can be a column on a page, a header or something similar.
TextBlock() - Constructor for class de.unibi.techfak.scie.pdfextractor.structure.TextBlock
 
TextBlockRankEstimator - Class in de.unibi.techfak.scie.pdfextractor
This estimator determines whether the usual font size of a TextBlock is larger than, equal to, or smaller than the usual font size of the whole page.
TextBlockRankEstimator() - Constructor for class de.unibi.techfak.scie.pdfextractor.TextBlockRankEstimator
 
toString() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Document
 
toString() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Page
 
toString() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Paragraph
 
toString() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Text
 
toString() - Method in class de.unibi.techfak.scie.pdfextractor.structure.TextBlock
 
toXML() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Document
 
toXML() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Page
 
toXML() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Paragraph
 
toXML() - Method in class de.unibi.techfak.scie.pdfextractor.structure.Text
 
toXML() - Method in class de.unibi.techfak.scie.pdfextractor.structure.TextBlock
 

U

union(AbstractLineSegment, AbstractLineSegment) - Static method in class de.unibi.techfak.scie.pdfextractor.structure.LineSegment
Returns a new line which is the union of the two lines.
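
One plausible reading of union is the smallest segment that covers both inputs, computed as the smaller begin and the larger end. Under that reading, the default LineSegment() state (begin set to INF, end set to -INF) acts as a neutral starting value. The sketch below illustrates this with two-element int arrays; the class name, the array representation and the use of Integer.MAX_VALUE and Integer.MIN_VALUE for INF and -INF are all assumptions.

```java
final class UnionSketch {
    private UnionSketch() {}

    /** Assumed "invalid" start value: begin at +INF, end at -INF. */
    static int[] empty() {
        return new int[] {Integer.MAX_VALUE, Integer.MIN_VALUE};
    }

    /** Smallest segment {begin, end} covering both a and b. */
    static int[] union(int[] a, int[] b) {
        return new int[] {Math.min(a[0], b[0]), Math.max(a[1], b[1])};
    }
}
```

Starting from empty() and repeatedly taking the union with further segments accumulates their overall extent without a special case for the first one.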

V

valueOf(String) - Static method in enum de.unibi.techfak.scie.pdfextractor.structure.Text.VerticalAlignment
Returns the enum constant of this type with the specified name.
values() - Static method in enum de.unibi.techfak.scie.pdfextractor.structure.Text.VerticalAlignment
Returns an array containing the constants of this enum type, in the order they are declared.
VerticalAlignmentEstimator - Class in de.unibi.techfak.scie.pdfextractor
This just determines the vertical alignment of a given glyph in relation to the line it is part of.
VerticalAlignmentEstimator(PreTextLine) - Constructor for class de.unibi.techfak.scie.pdfextractor.VerticalAlignmentEstimator
 

W

WhiteSpaceEstimator - Class in de.unibi.techfak.scie.pdfextractor
This is based on the work of Ben Litchfield in the PDFTextStripper of Apache PDFBox.
WhiteSpaceEstimator() - Constructor for class de.unibi.techfak.scie.pdfextractor.WhiteSpaceEstimator
 

Y

yDistHisto - Variable in class de.unibi.techfak.scie.pdfextractor.PreTextBlock
 
yHisto - Variable in class de.unibi.techfak.scie.pdfextractor.PreTextLine
 

Copyright (C) 2013, 2014 Raphael Dickfelder, Jan Göpfert, Benjamin Paassen, Andreas Stöckel, licensed under the AGPL v. 3: http://openresearch.cit-ec.de/projects/scie