sparclur.parsers package 

sparclur.parsers.Arlington 

class sparclur.parsers._arlington.Arlington(doc: Union[str, bytes], arlington_path: Optional[str] = None, version: Optional[Union[float, str]] = None, skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, temp_folders_dir: Optional[str] = None, timeout: Optional[int] = None)

Bases: sparclur._tracer.Tracer

Wrapper for the Arlington DOM TestGrammar (https://github.com/pdf-association/arlington-pdf-model)

property cleaned

Return a normalized collection of the warnings and errors with occurrence counts.

Returns: A dictionary with each normalized message as the key and the occurrence count as the value
Return type: Dict[str, int]
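
The normalize-and-count idea behind this property can be sketched in plain Python. This is a minimal illustration only, not SPARCLUR's implementation: the `clean_messages` helper and its digit-masking rule are assumptions made for the example, and each parser may normalize its messages differently.

```python
import re
from collections import Counter
from typing import Dict, List


def clean_messages(messages: List[str]) -> Dict[str, int]:
    """Collapse raw messages into normalized keys with occurrence counts."""
    counts: Counter = Counter()
    for message in messages:
        # Hypothetical normalization rule: mask runs of digits so that
        # messages differing only by an object number share one key.
        normalized = re.sub(r"\d+", "<n>", message).strip()
        counts[normalized] += 1
    return dict(counts)


raw = [
    "error: object 12 missing",
    "error: object 40 missing",
    "warning: bad xref",
]
cleaned = clean_messages(raw)
```

Here the two "object ... missing" messages fall under a single key with a count of 2, which is the shape (Dict[str, int]) that the property documents.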

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns: String of the document path or first 15 bytes of the binary
Return type: str or bytes

static get_name()

Return the SPARCLUR defined name for the parser.

Returns: Parser name
Return type: str

property messages

Return the error and warnings for the document passed into the Parser instance.

Returns: The list of all raw messages from the parser over the given document
Return type: List[str]

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns: The number of pages in the document
Return type: int
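
Because the property distinguishes None (page counting unsupported) from 0 (the parser failed to load the document), callers may want to branch on all three outcomes. The sketch below assumes only the behavior described above; `describe_page_count` is a hypothetical helper, not part of SPARCLUR.

```python
from typing import Optional


def describe_page_count(num_pages: Optional[int]) -> str:
    """Interpret a num_pages value according to the documented contract."""
    if num_pages is None:
        # The parser does not support page number extraction.
        return "unsupported"
    if num_pages == 0:
        # The parser failed to load the document or count its pages.
        return "failed"
    return "ok"


# Stand-in values in place of real parser instances.
statuses = [describe_page_count(n) for n in (None, 0, 12)]
```

Testing `is None` before comparing with 0 matters here, since a plain truthiness check would merge the two cases.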

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These hashes are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two cases: finding evidence of non-determinism in a parser, and quickly comparing differences between parser translations of a document (see the Reforge class of tools).

Returns: The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
Return type: SparclurHash
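
The idea of comparing sets of shingled hashes can be sketched as follows. This is an illustration under stated assumptions, not the SparclurHash API: `zlib.crc32` stands in for a murmur hash so that the example needs only the standard library, and the Jaccard distance used here is an assumed choice, since the exact distance measure is not specified above.

```python
import zlib
from typing import List, Set


def shingle_hashes(tokens: List[str], size: int = 4) -> Set[int]:
    """Hash every run of `size` consecutive tokens into a set of integers."""
    # Inputs shorter than `size` produce a single shingle of all tokens.
    last_start = max(len(tokens) - size + 1, 1)
    shingles = [" ".join(tokens[i:i + size]) for i in range(last_start)]
    # crc32 is a stand-in for murmur; empty shingles are skipped.
    return {zlib.crc32(s.encode("utf-8")) for s in shingles if s}


def hash_distance(a: Set[int], b: Set[int]) -> float:
    """Jaccard distance between two hash sets (assumed measure)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


run_a = shingle_hashes("a b c d e".split())
run_b = shingle_hashes("a b c d e".split())
other = shingle_hashes("v w x y z".split())

same_distance = hash_distance(run_a, run_b)
diff_distance = hash_distance(run_a, other)
```

Two runs of a deterministic parser over the same document should yield a distance of 0.0, so a nonzero distance between repeated runs is the kind of evidence of non-determinism mentioned above.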

property validate_tracer: Dict[str, Any]

Performs a validity check for this tracer.

Return type: Dict[str, Any]

property validity

Returns the validity status from each of the parser's relevant tools, along with an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.

Returns: A dictionary of dictionaries laying out the validity and statuses for the parser tools.
Return type: Dict[str, Dict[str, Any]]
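
The all-tools-must-agree rule can be sketched as below. The key names ("valid", "status") and the tool names are assumptions made for illustration; the actual dictionary layout returned by SPARCLUR may differ.

```python
from typing import Any, Dict


def overall_validity(tool_results: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Combine per-tool results: valid only if every tool reports valid."""
    flagged = sorted(
        name
        for name, result in tool_results.items()
        if not result.get("valid", False)  # hypothetical "valid" key
    )
    # An empty result set is treated as not valid rather than vacuously valid.
    return {"valid": bool(tool_results) and not flagged, "flagged_tools": flagged}


results = {
    "Renderer": {"valid": True, "status": "Valid"},
    "Tracer": {"valid": False, "status": "Warnings"},
}
overall = overall_validity(results)
```

In this example the tracer's warning alone is enough to make the overall status invalid.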

sparclur.parsers.Ghostscript 

class sparclur.parsers._ghostscript.Ghostscript(doc: str, skip_check: Optional[bool] = None, temp_folders_dir: Optional[str] = None, dpi: Optional[int] = None, size: Optional[Union[Tuple[int], int]] = None, cache_renders: Optional[bool] = None, timeout: Optional[int] = None, hash_exclude: Optional[Union[str, List[str]]] = None, page_hashes: Optional[Union[int, Tuple[Any]]] = None, validate_hash: bool = False)

Bases: sparclur._renderer.Renderer, sparclur._reforge.Reforger

Ghostscript wrapper for rendering and reforging PDFs.

property caching

Returns the caching setting for the renderer.

If caching is set to true, the collection of all rendered PIL images is retained in the object. Otherwise, the renders are regenerated every time the get_renders method is called.

Return type: bool

clear_renders()

Clears any PIL images that have been retained in the renderer object.

clear_text()

Clears any text that has already been extracted for the document.

compare(other: sparclur._renderer.Renderer, page=None, full=False)

Performs a structural similarity comparison between two renders

Parameters

other (Renderer) – The other Parser and document to compare this Parser and document to.
page (int, List[int], default=None) – Specify whether a single page or a specific collection of pages should be compared. If None, all pages are compared.
full (bool, default=False) – Return an image of the comparison of the two document renders for each page or the specified page.

Return type

Dict[int, PRCSim] or PRCSim

compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)

Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.

Parameters

other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser
page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document
shingle_size (int, default=4) – The size of the shingled n-grams

Returns

The Jaccard Similarity score

Return type

float
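
The shingling and Jaccard steps described for compare_text can be sketched in a few lines. This is a standalone illustration of the technique, not the library's code; the `shingle` and `jaccard` helpers are names chosen for the example.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingle(tokens: List[str], size: int = 4) -> Set[Shingle]:
    """Return the set of n-grams of `size` consecutive tokens."""
    if len(tokens) < size:
        # Too few tokens for a full n-gram: keep them as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


left = shingle("the quick brown fox jumps".split(), size=4)
right = shingle("the quick brown fox leaps".split(), size=4)
score = jaccard(left, right)  # 1 shared shingle out of 3 distinct ones
```

Each five-token input yields two shingles, and only one shingle is shared, so the score is 1/3. A larger shingle_size makes the comparison more sensitive to word order and to small local differences.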

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns: String of the document path or first 15 bytes of the binary
Return type: str or bytes

property dpi: Return dots per inch :rtype: int

static get_name()

Return the SPARCLUR defined name for the parser.

Returns: Parser name
Return type: str

get_renders(page: Optional[Union[int, List[int]]] = None)

Return the renders of the object document. If page is None, return the entire rendered document. Otherwise returns the specified page only.

Parameters: page (int, List[int], or None) – zero-indexed page or list of pages to be rendered. Returns the whole document if None
Return type: PngImageFile or Dict[int, PngImageFile]
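
The page argument accepts a single index, a list of indices, or None, and the return shape changes accordingly. The sketch below mimics that contract over an in-memory dictionary; `select_pages` is a hypothetical helper, and plain strings stand in for the rendered images.

```python
from typing import Dict, List, Optional, TypeVar, Union

T = TypeVar("T")


def select_pages(
    renders: Dict[int, T],
    page: Optional[Union[int, List[int]]] = None,
) -> Union[T, Dict[int, T]]:
    """Return one render for an int, a dict for a list, or everything for None."""
    if page is None:
        return dict(renders)
    if isinstance(page, int):
        return renders[page]
    return {p: renders[p] for p in page}


# Strings used in place of PngImageFile objects; pages are zero-indexed.
renders = {0: "page-0.png", 1: "page-1.png", 2: "page-2.png"}
single = select_pages(renders, 1)
subset = select_pages(renders, [0, 2])
everything = select_pages(renders)
```

Callers that pass a single int should expect a bare image back rather than a one-entry dictionary.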

get_text(page: Optional[int] = None)

Return the extracted text from the document. If page is None, return all text from the document. Otherwise, return the text for the specified page only.

Parameters: page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
Return type: str or Dict[int, str]

get_tokens(page: Optional[int] = None)

Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise, return the tokens for the specified page only.

Parameters: page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
Return type: str or Dict[int, str]

property logs: View any gathered logs. :rtype: Dict[int, Dict[str, Any]]

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns: The number of pages in the document
Return type: int

property reforge

The resulting reforged document.

Return type: bytes

save_reforge(save_path: str)

Saves the reforged document to the specified file location.

Parameters: save_path (str) – The file name and location to save the document.

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These hashes are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two cases: finding evidence of non-determinism in a parser, and quickly comparing differences between parser translations of a document (see the Reforge class of tools).

Returns: The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
Return type: SparclurHash

property validate_renderer

Performs a validity check for this renderer.

Return type: Dict[str, Any]

property validity

Returns the validity status from each of the parser's relevant tools, along with an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.

Returns: A dictionary of dictionaries laying out the validity and statuses for the parser tools.
Return type: Dict[str, Dict[str, Any]]

sparclur.parsers.MuPDF 

class sparclur.parsers._mupdf.MuPDF(doc: Union[str, bytes], skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, page_hashes: Optional[Union[int, Tuple[Any]]] = None, validate_hash: bool = False, parse_streams: Optional[bool] = None, binary_path: Optional[str] = None, temp_folders_dir: Optional[str] = None, dpi: Optional[int] = None, cache_renders: Optional[bool] = None, timeout: Optional[int] = None, ocr: Optional[bool] = None)

Bases: sparclur._tracer.Tracer, sparclur._hybrid.Hybrid, sparclur._reforge.Reforger

MuPDF parser

property caching

Returns the caching setting for the renderer.

If caching is set to true, the collection of all rendered PIL images is retained in the object. Otherwise, the renders are regenerated every time the get_renders method is called.

Return type: bool

property cleaned

Return a normalized collection of the warnings and errors with occurrence counts.

Returns: A dictionary with each normalized message as the key and the occurrence count as the value
Return type: Dict[str, int]

clear_renders(): Clears any PIL’s that have been retained in the renderer object.

clear_text(): Clear any text that has already been extracted for the document

compare(other: sparclur._renderer.Renderer, page=None, full=False)

Performs a structural similarity comparison between two renders

Parameters

other (Renderer) – The other Parser and document to compare this Parser and document to.
page (int, List[int], default=None) – Specify whether a single page or a specific collection of pages should be compared. If None, all pages are compared.
full (bool, default=False) – Return an image of the comparison of the two document renders for each page or the specified page.

Return type

Dict[int, PRCSim] or PRCSim

compare_ocr(page=None, shingle_size=4)

Method that compares the OCR result to the built-in text extraction.

Parameters

page (int) – Indicates which page the comparison should be run over. If ‘None’, all pages are compared.
shingle_size (int, default=4) – The size of the token shingles used in the Jaccard similarity comparison between the OCR and the text extraction.

Returns

The Jaccard similarity between the OCR and the text extraction (for the specified shingle size).

Return type

float

compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)

Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.

Parameters

other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser
page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document
shingle_size (int, default=4) – The size of the shingled n-grams

Returns

The Jaccard Similarity score

Return type

float

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns: String of the document path or first 15 bytes of the binary
Return type: str or bytes

property dpi: Return dots per inch :rtype: int

static get_name()

Return the SPARCLUR defined name for the parser.

Returns: Parser name
Return type: str

get_renders(page: Optional[Union[int, List[int]]] = None)

Return the renders of the object document. If page is None, return the entire rendered document. Otherwise returns the specified page only.

Parameters: page (int, List[int], or None) – zero-indexed page or list of pages to be rendered. Returns the whole document if None
Return type: PngImageFile or Dict[int, PngImageFile]

get_text(page: Optional[int] = None)

Return the extracted text from the document. If page is None, return all text from the document. Otherwise, return the text for the specified page only.

Parameters: page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
Return type: str or Dict[int, str]

get_tokens(page: Optional[int] = None)

Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise, return the tokens for the specified page only.

Parameters: page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
Return type: str or Dict[int, str]

property logs: View any gathered logs. :rtype: Dict[int, Dict[str, Any]]

property messages

Return the error and warnings for the document passed into the Parser instance.

Returns: The list of all raw messages from the parser over the given document
Return type: List[str]

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns: The number of pages in the document
Return type: int

property reforge

The resulting reforged document.

Return type: bytes

save_reforge(save_path: str)

Saves the reforged document to the specified file location.

Parameters: save_path (str) – The file name and location to save the document.

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These hashes are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two cases: finding evidence of non-determinism in a parser, and quickly comparing differences between parser translations of a document (see the Reforge class of tools).

Returns: The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
Return type: SparclurHash

property validate_renderer

Performs a validity check for this renderer.

Return type: Dict[str, Any]

property validate_text: Dict[str, Any]

Performs a validity check for this text extractor.

Return type: Dict[str, Any]

property validate_tracer: Dict[str, Any]

Performs a validity check for this tracer.

Return type: Dict[str, Any]

property validity

Returns the validity status from each of the parser's relevant tools, along with an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.

Returns: A dictionary of dictionaries laying out the validity and statuses for the parser tools.
Return type: Dict[str, Dict[str, Any]]

sparclur.parsers.PDFCPU 

class sparclur.parsers._pdfcpu.PDFCPU(doc: Union[str, bytes], skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, binary_path: Optional[str] = None, temp_folders_dir: Optional[str] = None, timeout: Optional[int] = None)

Bases: sparclur._tracer.Tracer

Wrapper for PDFCPU (https://pdfcpu.io/)

property cleaned

Return a normalized collection of the warnings and errors with occurrence counts.

Returns: A dictionary with each normalized message as the key and the occurrence count as the value
Return type: Dict[str, int]

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns: String of the document path or first 15 bytes of the binary
Return type: str or bytes

static get_name()

Return the SPARCLUR defined name for the parser.

Returns: Parser name
Return type: str

property messages

Return the error and warnings for the document passed into the Parser instance.

Returns: The list of all raw messages from the parser over the given document
Return type: List[str]

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns: The number of pages in the document
Return type: int

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These hashes are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two cases: finding evidence of non-determinism in a parser, and quickly comparing differences between parser translations of a document (see the Reforge class of tools).

Returns: The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
Return type: SparclurHash

property validate_tracer: Dict[str, Any]

Performs a validity check for this tracer.

Return type: Dict[str, Any]

property validity

Returns the validity status from each of the parser's relevant tools, along with an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.

Returns: A dictionary of dictionaries laying out the validity and statuses for the parser tools.
Return type: Dict[str, Dict[str, Any]]

sparclur.parsers.PDFium 

class sparclur.parsers._pdfium.PDFium(doc: Union[str, bytes], skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, page_hashes: Optional[Union[int, Tuple[Any]]] = None, validate_hash: bool = False, temp_folders_dir: Optional[str] = None, dpi: Optional[int] = None, cache_renders: Optional[bool] = None, timeout: Optional[int] = None)

Bases: sparclur._renderer.Renderer

PDFium renderer

property caching

Returns the caching setting for the renderer.

If caching is set to true, the collection of all rendered PIL images is retained in the object. Otherwise, the renders are regenerated every time the get_renders method is called.

Return type: bool

clear_renders()

Clears any PIL images that have been retained in the renderer object.

clear_text()

Clears any text that has already been extracted for the document.

compare(other: sparclur._renderer.Renderer, page=None, full=False)

Performs a structural similarity comparison between two renders

Parameters

other (Renderer) – The other Parser and document to compare this Parser and document to.
page (int, List[int], default=None) – Specify whether a single page or a specific collection of pages should be compared. If None, all pages are compared.
full (bool, default=False) – Return an image of the comparison of the two document renders for each page or the specified page.

Return type

Dict[int, PRCSim] or PRCSim

compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)

Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.

Parameters

other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser
page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document
shingle_size (int, default=4) – The size of the shingled n-grams

Returns

The Jaccard Similarity score

Return type

float

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns: String of the document path or first 15 bytes of the binary
Return type: str or bytes

property dpi: Return dots per inch :rtype: int

static get_name()

Return the SPARCLUR defined name for the parser.

Returns: Parser name
Return type: str

get_renders(page: Optional[Union[int, List[int]]] = None)

Return the renders of the object document. If page is None, return the entire rendered document. Otherwise returns the specified page only.

Parameters: page (int, List[int], or None) – zero-indexed page or list of pages to be rendered. Returns the whole document if None
Return type: PngImageFile or Dict[int, PngImageFile]

get_text(page: Optional[int] = None)

Return the extracted text from the document. If page is None, return all text from the document. Otherwise, return the text for the specified page only.

Parameters: page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
Return type: str or Dict[int, str]

get_tokens(page: Optional[int] = None)

Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise, return the tokens for the specified page only.

Parameters: page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
Return type: str or Dict[int, str]

property logs: View any gathered logs. :rtype: Dict[int, Dict[str, Any]]

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns: The number of pages in the document
Return type: int

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These hashes are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two cases: finding evidence of non-determinism in a parser, and quickly comparing differences between parser translations of a document (see the Reforge class of tools).

Returns: The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
Return type: SparclurHash

property validate_renderer

Performs a validity check for this renderer.

Return type: Dict[str, Any]

property validity

Returns the validity status from each of the parser's relevant tools, along with an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.

Returns: A dictionary of dictionaries laying out the validity and statuses for the parser tools.
Return type: Dict[str, Dict[str, Any]]

sparclur.parsers.PDFMiner 

class sparclur.parsers._pdfminer.PDFMiner(doc: str, temp_folders_dir: Optional[str] = None, skip_check: Optional[bool] = None, hash_exclude: Optional[str] = None, timeout: Optional[int] = None, page_delimiter: Optional[str] = None, detect_vertical: Optional[bool] = None, all_texts: Optional[bool] = None, stream_output: Optional[str] = None, suppress_warnings: Optional[bool] = None)

Bases: sparclur._text_extractor.TextExtractor, sparclur._metadata_extractor.MetadataExtractor

PDFMiner text extraction (https://pdfminersix.readthedocs.io/en/latest/)

clear_text(): Clear any text that has already been extracted for the document

compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)

Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.

Parameters

other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser
page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document
shingle_size (int, default=4) – The size of the shingled n-grams

Returns

The Jaccard Similarity score

Return type

float

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns: String of the document path or first 15 bytes of the binary
Return type: str or bytes

static get_name()

Return the SPARCLUR defined name for the parser.

Returns: Parser name
Return type: str

get_text(page: Optional[int] = None)

Return the extracted text from the document. If page is None, return all text from the document. Otherwise, return the text for the specified page only.

Parameters: page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
Return type: str or Dict[int, str]

get_tokens(page: Optional[int] = None)

Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise, return the tokens for the specified page only.

Parameters: page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
Return type: str or Dict[int, str]

property metadata: Dict[str, Any]

Return the dictionary of metadata.

Return type: Dict[str, Any]

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns: The number of pages in the document
Return type: int

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These hashes are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two cases: finding evidence of non-determinism in a parser, and quickly comparing differences between parser translations of a document (see the Reforge class of tools).

Returns: The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
Return type: SparclurHash

property validate_metadata: Dict[str, Any]

Performs a validity check for this metadata extractor.

Return type: Dict[str, Any]

property validate_text: Dict[str, Any]

Performs a validity check for this text extractor.

Return type: Dict[str, Any]

property validity

Returns the validity status from each of the parser's relevant tools, along with an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.

Returns: A dictionary of dictionaries laying out the validity and statuses for the parser tools.
Return type: Dict[str, Dict[str, Any]]

sparclur.parsers.Poppler 

class sparclur.parsers._poppler.Poppler(doc: str, skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, page_hashes: Optional[Union[int, Tuple[Any]]] = None, validate_hash: bool = False, trace: Optional[str] = None, binary_path: Optional[str] = None, temp_folders_dir: Optional[str] = None, page_delimiter: Optional[str] = None, maintain_layout: Optional[bool] = None, dpi: Optional[int] = None, size: Optional[Tuple[int]] = None, cache_renders: Optional[bool] = None, timeout: Optional[int] = None, ocr: Optional[bool] = None)

Bases: sparclur._tracer.Tracer, sparclur._hybrid.Hybrid, sparclur._font_extractor.FontExtractor, sparclur._image_data_extractor.ImageDataExtractor, sparclur._reforge.Reforger

Poppler wrapper for pdftoppm, pdftocairo, and pdftotext

property caching

Returns the caching setting for the renderer.

If caching is set to true, the collection of all rendered PIL images is retained in the object. Otherwise, the renders are regenerated every time the get_renders method is called.

Return type: bool

property cleaned

Return a normalized collection of the warnings and errors with occurrence counts.

Returns: A dictionary with each normalized message as the key and the occurrence count as the value
Return type: Dict[str, int]

clear_renders(): Clears any PIL’s that have been retained in the renderer object.

clear_text(): Clear any text that has already been extracted for the document

compare(other: sparclur._renderer.Renderer, page=None, full=False)

Performs a structural similarity comparison between two renders

Parameters

other (Renderer) – The other Parser and document to compare this Parser and document to.
page (int, List[int], default=None) – Specify whether a single page or a specific collection of pages should be compared. If None, all pages are compared.
full (bool, default=False) – Return an image of the comparison of the two document renders for each page or the specified page.

Return type

Dict[int, PRCSim] or PRCSim

compare_ocr(page=None, shingle_size=4)

Method that compares the OCR result to the built-in text extraction.

Parameters

page (int) – Indicates which page the comparison should be run over. If ‘None’, all pages are compared.
shingle_size (int, default=4) – The size of the token shingles used in the Jaccard similarity comparison between the OCR and the text extraction.

Returns

The Jaccard similarity between the OCR and the text extraction (for the specified shingle size).

Return type

float

compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)

Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.

Parameters

other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser
page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document
shingle_size (int, default=4) – The size of the shingled n-grams

Returns

The Jaccard Similarity score

Return type

float

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns: String of the document path or first 15 bytes of the binary
Return type: str or bytes

property dpi: Return dots per inch :rtype: int

property fonts

Extracts the detected fonts from the PDF file.

Return type: Dict[str, Any]

static get_name()

Return the SPARCLUR defined name for the parser.

Returns: Parser name
Return type: str

get_renders(page: Optional[Union[int, List[int]]] = None)

Return the renders of the object document. If page is None, return the entire rendered document. Otherwise returns the specified page only.

Parameters: page (int, List[int], or None) – zero-indexed page or list of pages to be rendered. Returns the whole document if None
Return type: PngImageFile or Dict[int, PngImageFile]

get_text(page: Optional[int] = None)

Return the extracted text from the document. If page is None, return all text from the document. Otherwise, return the text for the specified page only.

Parameters: page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
Return type: str or Dict[int, str]

get_tokens(page: Optional[int] = None)

Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise, return the tokens for the specified page only.

Parameters: page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
Return type: str or Dict[int, str]

property logs: View any gathered logs. :rtype: Dict[int, Dict[str, Any]]

property messages

Return the error and warnings for the document passed into the Parser instance.

Returns: The list of all raw messages from the parser over the given document
Return type: List[str]

property non_embedded_fonts

Determine whether or not there are non-embedded fonts in the PDF. Returns True if there are missing fonts.

Return type: bool

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns: The number of pages in the document
Return type: int

property reforge

The resulting reforged document.

Return type: bytes

save_reforge(save_path: str)

Saves the reforged document to the specified file location.

Parameters: save_path (str) – The file name and location to save the document.

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two specific cases: the first is finding evidence of non-determinism in a parser, and the second is quickly comparing differences between parser translations of a document (see the Reforge class of tools).

Returns: The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
Return type: SparclurHash
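
The idea of comparing two sets of shingled hashes can be illustrated with a small, self-contained sketch. This is not the SparclurHash implementation: the function names are hypothetical, blake2b from the standard library stands in for the murmur hash, and Jaccard distance is assumed as the distance measure.

```python
import hashlib
from typing import Dict, List, Set


def hashed_shingles(tokens: List[str], size: int = 4) -> Set[int]:
    """Hash each contiguous n-gram of tokens to a 64-bit integer."""
    hashes = set()
    for i in range(max(len(tokens) - size + 1, 0)):
        gram = " ".join(tokens[i:i + size]).encode("utf-8")
        # Assumption: blake2b stands in for the murmur hash named in the text.
        digest = hashlib.blake2b(gram, digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """1 - |A & B| / |A | B|; two empty sets are treated as distance 0."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def compare_hashes(first: Dict[str, Set[int]],
                   second: Dict[str, Set[int]]) -> Dict[str, float]:
    """Per-tool distances for the tools present in both collections."""
    return {tool: jaccard_distance(first[tool], second[tool])
            for tool in first.keys() & second.keys()}


# Two hypothetical runs over the same document: identical text, different trace.
run_1 = {"text": hashed_shingles("a b c d e f".split()),
         "trace": hashed_shingles("warning bad xref table found".split())}
run_2 = {"text": hashed_shingles("a b c d e f".split()),
         "trace": hashed_shingles("warning bad xref stream found".split())}
distances = compare_hashes(run_1, run_2)
```

A nonzero distance between two runs of the same parser over the same document would be the kind of evidence of non-determinism the description refers to.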

property validate_fonts

Checks whether or not fonts can be successfully extracted from a document. Any issues or errors will result in a ‘Rejected’ classification.

Returns: A dictionary containing a boolean for validity, a classification label for validity, and relevant info for the classification
Return type: Dict[str, str]

property validate_image_data

Checks whether or not image data can be successfully extracted from a document. Any issues or errors will result in a ‘Rejected’ classification.

Returns: A dictionary containing a boolean for validity, a classification label for validity, and relevant info for the classification
Return type: Dict[str, str]

property validate_renderer: Dict[str, Any]

Performs a validity check for this renderer.

Return type: Dict[str, Any]

property validate_text: Dict[str, Any]

Performs a validity check for this text extractor.

Return type: Dict[str, Any]

property validate_tracer: Dict[str, Any]

Performs a validity check for this tracer.

Return type: Dict[str, Any]

property validity

Returns the validity statuses from each of the relevant tools of the parser and an overall validity for the document. If any of the tools reports a warning or error, the overall status reflects that; otherwise, all of the tools must mark the document as valid for the overall status to be valid.

Returns: A dictionary of dictionaries laying out the validity and statuses for the parser tools.
Return type: Dict[str, Dict[str, Any]]
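
The aggregation rule described for validity can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the per-tool layout with "valid" and "status" keys, the "source" key, the tool names, and the "Valid" label are all hypothetical; only the ‘Rejected’ label appears in this documentation.

```python
from typing import Any, Dict


def overall_validity(tools: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Combine per-tool results into one overall entry.

    Assumed layout (hypothetical): each tool maps to
    {"valid": bool, "status": str}.
    """
    # The first tool that is not valid determines the overall status.
    for name, result in tools.items():
        if not result["valid"]:
            return {"valid": False, "status": result["status"], "source": name}
    # Only reached when every tool marked the document as valid.
    return {"valid": True, "status": "Valid"}


report = {
    "Renderer": {"valid": True, "status": "Valid"},
    "Text Extractor": {"valid": False, "status": "Rejected"},
}
summary = overall_validity(report)
```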

sparclur.parsers.QPDF 

class sparclur.parsers._qpdf.QPDF(doc: str, temp_folders_dir: Optional[str] = None, skip_check: Optional[bool] = None, hash_exclude: Optional[str] = None, binary_path: Optional[str] = None, timeout: Optional[int] = None)

Bases: sparclur._tracer.Tracer, sparclur._metadata_extractor.MetadataExtractor

QPDF tracer

property cleaned

Return a normalized collection of the warnings and errors with occurrence counts.

Returns: A dictionary with each normalized message as the key and the occurrence count as the value
Return type: Dict[str, int]
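
As a rough illustration of what a normalized collection with occurrence counts looks like, the sketch below collapses raw messages into a Dict[str, int]. The normalization rule (replacing digit runs with a placeholder) and the sample messages are invented for this example; they are not real QPDF output or the cleaning rules SPARCLUR applies.

```python
import re
from collections import Counter
from typing import Dict, List


def clean_messages(messages: List[str]) -> Dict[str, int]:
    """Collapse raw messages into normalized keys with occurrence counts."""
    normalized = []
    for message in messages:
        # Hypothetical rule: replace every run of digits with a placeholder
        # so messages differing only in object numbers share one key.
        normalized.append(re.sub(r"\d+", "<num>", message.strip()))
    return dict(Counter(normalized))


# Invented sample messages, not actual parser output.
raw = [
    "warning: object 12 is malformed",
    "warning: object 47 is malformed",
    "error: trailer not found",
]
counts = clean_messages(raw)
```

Here the two warnings collapse into one key with a count of 2, which is the shape the cleaned property is documented to return.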

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns: String of the document path or first 15 bytes of the binary
Return type: str or bytes

static get_name()

Return the SPARCLUR defined name for the parser.

Returns: Parser name
Return type: str

property messages

Return the error and warnings for the document passed into the Parser instance.

Returns: The list of all raw messages from the parser over the given document
Return type: List[str]

property metadata: Dict[str, Any]

Return the dictionary of metadata.

Return type: Dict[str, Any]

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns: The number of pages in the document
Return type: int
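
The three possible outcomes of num_pages (None, 0, or a positive count) are easy to conflate, since both None and 0 are falsy in Python. A minimal sketch of handling them explicitly, using a hypothetical helper name:

```python
from typing import Optional


def describe_page_count(num_pages: Optional[int]) -> str:
    """Translate a num_pages value into the situation it represents."""
    if num_pages is None:
        # The parser does not support page number extraction.
        return "page count not supported by this parser"
    if num_pages == 0:
        # The parser failed to load the document or count its pages.
        return "parser failed to determine the page count"
    return f"{num_pages} page(s)"
```

Checking `is None` before comparing against 0 keeps "unsupported" and "failed" from being treated as the same case.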

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two specific cases: the first is finding evidence of non-determinism in a parser, and the second is quickly comparing differences between parser translations of a document (see the Reforge class of tools).

Returns: The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
Return type: SparclurHash

property validate_metadata: Dict[str, Any]

Performs a validity check for this metadata extractor.

Return type: Dict[str, Any]

property validate_tracer: Dict[str, Any]

Performs a validity check for this tracer.

Return type: Dict[str, Any]

property validity

Returns the validity statuses from each of the relevant tools of the parser and an overall validity for the document. If any of the tools reports a warning or error, the overall status reflects that; otherwise, all of the tools must mark the document as valid for the overall status to be valid.

Returns: A dictionary of dictionaries laying out the validity and statuses for the parser tools.
Return type: Dict[str, Dict[str, Any]]

sparclur.parsers.XPDF 

class sparclur.parsers._xpdf.XPDF(doc: Union[str, bytes], skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, page_hashes: Optional[Union[int, Tuple[Any]]] = None, validate_hash: bool = False, binary_path: Optional[str] = None, temp_folders_dir: Optional[str] = None, page_delimiter: Optional[str] = None, maintain_layout: Optional[bool] = None, dpi: Optional[int] = None, size: Optional[Union[Tuple[int], int]] = None, cache_renders: Optional[bool] = None, timeout: Optional[int] = None, ocr: Optional[bool] = None)

Bases: sparclur._tracer.Tracer, sparclur._hybrid.Hybrid, sparclur._font_extractor.FontExtractor

XPDF wrapper for pdftoppm and pdftotext

property caching

Returns the caching setting for the renderer.

If caching is set to True, the collection of all rendered PIL images is retained in the object. Otherwise, the renders are regenerated every time the get_renders method is called.

Return type: bool
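
The trade-off behind the caching setting can be shown with a toy stand-in. RenderCache and fake_render are hypothetical names for this sketch and do not reflect how the XPDF wrapper is implemented; strings stand in for rendered images.

```python
from typing import Callable, Dict, List


class RenderCache:
    """Toy stand-in for a renderer with an optional render cache."""

    def __init__(self, render_page: Callable[[int], str],
                 num_pages: int, caching: bool):
        self._render_page = render_page
        self._num_pages = num_pages
        self.caching = caching
        self._renders: Dict[int, str] = {}

    def get_renders(self) -> Dict[int, str]:
        if self.caching and self._renders:
            # Reuse the renders retained from an earlier call.
            return self._renders
        renders = {p: self._render_page(p) for p in range(self._num_pages)}
        if self.caching:
            self._renders = renders
        return renders

    def clear_renders(self) -> None:
        """Drop any retained renders so the next call regenerates them."""
        self._renders = {}


calls: List[int] = []


def fake_render(page: int) -> str:
    calls.append(page)  # record every time a page is actually rendered
    return f"render-{page}"


cached = RenderCache(fake_render, num_pages=2, caching=True)
cached.get_renders()
cached.get_renders()
cached_calls = len(calls)

uncached = RenderCache(fake_render, num_pages=2, caching=False)
uncached.get_renders()
uncached.get_renders()
uncached_calls = len(calls) - cached_calls
```

With caching enabled, the two pages are rendered once across both calls; without it, each call renders both pages again, trading memory for repeated work.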

property cleaned

Return a normalized collection of the warnings and errors with occurrence counts.

Returns: A dictionary with each normalized message as the key and the occurrence count as the value
Return type: Dict[str, int]

clear_renders(): Clears any PIL’s that have been retained in the renderer object.

clear_text(): Clear any text that has already been extracted for the document

compare(other: sparclur._renderer.Renderer, page=None, full=False)

Performs a structural similarity comparison between two renders

Parameters

other (Renderer) – The other Parser and document to compare this Parser and document to.
page (int, List[int], default=None) – Specify whether a single page or a specific collection of pages should be compared. If ‘None’, all pages are compared.
full (bool, default=False) – Return an image of the comparison of the two document renders for each page or the specified page.

Return type

Dict[int, PRCSim] or PRCSim

compare_ocr(page=None, shingle_size=4)

Method that compares the OCR result to the built-in text extraction.

Parameters

page (int) – Indicates which page the comparison should be run over. If ‘None’, all pages are compared.
shingle_size (int, default=4) – The size of the token shingles used in the Jaccard similarity comparison between the OCR and the text extraction.

Returns

The Jaccard similarity between the OCR and the text extraction (for the specified shingle size).

Return type

float

compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)

Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.

Parameters

other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser
page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document
shingle_size (int, default=4) – The size of the shingled n-grams

Returns

The Jaccard Similarity score

Return type

float
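
The shingling and Jaccard similarity described for compare_text can be sketched in a few lines of plain Python. This is an illustration of the general technique only: the function names are hypothetical, and the handling of short or empty token lists is an assumption rather than documented behavior.

```python
from typing import List, Set, Tuple


def shingle(tokens: List[str], size: int = 4) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous n-grams (shingles) of the given size."""
    if len(tokens) < size:
        # Assumption: a short token list yields one shingle with all tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity |A & B| / |A | B|."""
    if not a and not b:
        # Assumption: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


left = "the quick brown fox jumps over the lazy dog".split()
right = "the quick brown fox leaps over the lazy dog".split()
score = jaccard(shingle(left, 4), shingle(right, 4))  # 2 shared of 10 -> 0.2
```

A single changed token affects every shingle that contains it, so larger shingle sizes make the score more sensitive to small differences in the extracted text.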

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns: String of the document path or first 15 bytes of the binary
Return type: str or bytes

property dpi: Return dots per inch :rtype: int

property fonts

Extracts the detected fonts from the PDF file.

Return type: Dict[str, Any]

static get_name()

Return the SPARCLUR defined name for the parser.

Returns: Parser name
Return type: str

get_renders(page: Optional[Union[int, List[int]]] = None)

Return the renders of the object document. If page is None, return the entire rendered document. Otherwise returns the specified page only.

Parameters: page (int, List[int], or None) – zero-indexed page or list of pages to be rendered. Returns the whole document if None
Return type: PngImageFile or Dict[int, PngImageFile]

get_text(page: Optional[int] = None)

Return the extracted text from the document. If page is None, return all text from the document. Otherwise returns the text for the specified page only.

Parameters: page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
Return type: str or Dict[int, str]

get_tokens(page: Optional[int] = None)

Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise returns the tokens for the specified page only.

Parameters: page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
Return type: str or Dict[int, str]

property logs: View any gathered logs. :rtype: Dict[int, Dict[str, Any]]

property messages

Return the error and warnings for the document passed into the Parser instance.

Returns: The list of all raw messages from the parser over the given document
Return type: List[str]

property non_embedded_fonts

Determine whether or not there are non-embedded fonts in the PDF. Returns True if there are missing fonts.

Return type: bool

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns: The number of pages in the document
Return type: int

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two specific cases: the first is finding evidence of non-determinism in a parser, and the second is quickly comparing differences between parser translations of a document (see the Reforge class of tools).

Returns: The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
Return type: SparclurHash

property validate_fonts

Checks whether or not fonts can be successfully extracted from a document. Any issues or errors will result in a ‘Rejected’ classification.

Returns: A dictionary containing a boolean for validity, a classification label for validity, and relevant info for the classification
Return type: Dict[str, str]

property validate_renderer: Dict[str, Any]

Performs a validity check for this renderer.

Return type: Dict[str, Any]

property validate_text: Dict[str, Any]

Performs a validity check for this text extractor.

Return type: Dict[str, Any]

property validate_tracer: Dict[str, Any]

Performs a validity check for this tracer.

Return type: Dict[str, Any]

property validity

Returns the validity statuses from each of the relevant tools of the parser and an overall validity for the document. If any of the tools reports a warning or error, the overall status reflects that; otherwise, all of the tools must mark the document as valid for the overall status to be valid.

Returns: A dictionary of dictionaries laying out the validity and statuses for the parser tools.
Return type: Dict[str, Dict[str, Any]]