sparclur.parsers package

sparclur.parsers.Arlington

class sparclur.parsers._arlington.Arlington(doc: Union[str, bytes], arlington_path: Optional[str] = None, version: Optional[Union[float, str]] = None, skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, temp_folders_dir: Optional[str] = None, timeout: Optional[int] = None)

Bases: sparclur._tracer.Tracer

Wrapper for the Arlington DOM TestGrammar (https://github.com/pdf-association/arlington-pdf-model)

property cleaned

Return a normalized collection of the warnings and errors with occurrence counts.

Returns

A dictionary with each normalized message as the key and the occurrence count as the value

Return type

Dict[str, int]
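As a rough illustration of the shape of this result, the sketch below collapses raw messages into templates and tallies them. The `normalize` rule (masking digit runs) and the sample messages are assumptions made for this example; the normalization each parser actually applies is not described here.

```python
import re
from collections import Counter
from typing import Dict, List


def normalize(message: str) -> str:
    """Hypothetical normalization: trim whitespace and mask digit runs."""
    return re.sub(r"\d+", "<n>", message.strip())


def clean_messages(messages: List[str]) -> Dict[str, int]:
    """Map each normalized message to the number of times it occurs."""
    return dict(Counter(normalize(m) for m in messages))


# Made-up messages, shaped like the List[str] returned by `messages`.
raw = [
    "Error: invalid key in object 12",
    "Error: invalid key in object 40",
    "Warning: missing entry",
]
counts = clean_messages(raw)
```

Grouping by a normalized form keeps object numbers and offsets from splitting one underlying problem into many distinct keys.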

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns

String of the document path or first 15 bytes of the binary

Return type

str or bytes

static get_name()

Return the SPARCLUR defined name for the parser.

Returns

Parser name

Return type

str

property messages

Return the errors and warnings for the document passed into the Parser instance.

Returns

The list of all raw messages from the parser over the given document

Return type

List[str]

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns

The number of pages in the document

Return type

int

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These hashes are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two cases: finding evidence of non-determinism in a parser, and quickly comparing parser translations of a document (see the Reforge class of tools).

Returns

The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.

Return type

SparclurHash

property validate_tracer: Dict[str, Any]

Performs a validity check for this tracer.

Return type

Dict[str, Any]

property validity

Returns the validity status from each of the parser's relevant tools and an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.

Returns

A dictionary of dictionaries laying out the validity and statuses for the parser tools.

Return type

Dict[str, Dict[str, Any]]
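The aggregation rule described above can be sketched as follows. The key names (`valid`, `status`, `info`), the tool names, and every status label other than 'Rejected' are assumptions for illustration and are not taken from the SPARCLUR source.

```python
from typing import Any, Dict


def overall_validity(tools: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Combine per-tool results into one overall entry (key names are assumed)."""
    # Collect every tool that did not mark the document as plainly valid.
    flagged = {
        name: result["status"]
        for name, result in tools.items()
        if result["status"] != "Valid"
    }
    if flagged:
        # Any warning or error is surfaced in the overall entry.
        return {"valid": False, "status": "Not valid", "info": flagged}
    return {"valid": True, "status": "Valid", "info": {}}


# Hypothetical per-tool results.
per_tool = {
    "Renderer": {"valid": True, "status": "Valid"},
    "Text Extractor": {"valid": False, "status": "Rejected"},
}
overall = overall_validity(per_tool)
```

In this sketch a single flagged tool is enough to make the overall entry not valid, which mirrors the "all tools must agree" rule in the description.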

sparclur.parsers.Ghostscript

class sparclur.parsers._ghostscript.Ghostscript(doc: str, skip_check: Optional[bool] = None, temp_folders_dir: Optional[str] = None, dpi: Optional[int] = None, size: Optional[Union[Tuple[int], int]] = None, cache_renders: Optional[bool] = None, timeout: Optional[int] = None, hash_exclude: Optional[Union[str, List[str]]] = None, page_hashes: Optional[Union[int, Tuple[Any]]] = None, validate_hash: bool = False)

Bases: sparclur._renderer.Renderer, sparclur._reforge.Reforger

Ghostscript renderer and reforger.

property caching

Returns the caching setting for the renderer.

If caching is set to true, all rendered PIL images are retained in the object. Otherwise, the renders are regenerated every time the get_renders method is called.

Return type

bool

clear_renders()

Clears any PIL images that have been retained in the renderer object.
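The interplay between the caching setting, get_renders, and clear_renders can be pictured with a toy class. This is only a sketch of the behaviour described above: the class name, the `render_calls` counter, and the use of strings in place of PIL images are all assumptions.

```python
from typing import Dict, Optional


class CachingRendererSketch:
    """Toy renderer; strings stand in for rendered PIL images."""

    def __init__(self, cache_renders: bool = False) -> None:
        self.caching = cache_renders
        self.render_calls = 0
        self._renders: Optional[Dict[int, str]] = None

    def _render(self) -> Dict[int, str]:
        self.render_calls += 1  # count how often the expensive step runs
        return {0: "page-0", 1: "page-1"}

    def get_renders(self) -> Dict[int, str]:
        if not self.caching:
            return self._render()  # regenerate on every call
        if self._renders is None:
            self._renders = self._render()
        return self._renders

    def clear_renders(self) -> None:
        self._renders = None  # drop anything that was retained


cached = CachingRendererSketch(cache_renders=True)
cached.get_renders()
cached.get_renders()    # served from the retained renders
cached.clear_renders()
cached.get_renders()    # rendered again after clearing
```

Retaining renders trades memory for speed, so clearing them is useful when many large documents are processed in one session.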

clear_text()

Clear any text that has already been extracted for the document

compare(other: sparclur._renderer.Renderer, page=None, full=False)

Performs a structural similarity comparison between two renders

Parameters
  • other (Renderer) – The other Parser and document to compare this Parser and document to.

  • page (int, List[int], default=None) – Specify whether a single page or a specific collection of pages should be compared. If ‘None’, all pages are compared.

  • full (bool, default=False) – Return an image of the comparison of the two document renders for each page or the specified page.

Return type

Dict[int, PRCSim] or PRCSim

compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)

Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.

Parameters
  • other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser

  • page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document

  • shingle_size (int, default=4) – The size of the shingled n-grams

Returns

The Jaccard Similarity score

Return type

float
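For readers unfamiliar with shingling, the following self-contained sketch shows how a Jaccard similarity over token n-grams can be computed. It is not SPARCLUR's implementation: the function names, the whitespace tokenization, and the choice to score two empty sets as 1.0 are assumptions.

```python
from typing import List, Set, Tuple


def shingle(tokens: List[str], size: int = 4) -> Set[Tuple[str, ...]]:
    """Return the set of n-grams made of `size` consecutive tokens."""
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


left = "the quick brown fox jumps over".split()
right = "the quick brown fox leaps over".split()
# Each side yields 5 bigrams; 3 are shared and 7 are distinct overall.
score = jaccard(shingle(left, 2), shingle(right, 2))
```

A larger shingle size makes the score more sensitive to word order, since a single differing token changes every n-gram that contains it.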

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns

String of the document path or first 15 bytes of the binary

Return type

str or bytes

property dpi

Return the dots per inch.

Return type

int

static get_name()

Return the SPARCLUR defined name for the parser.

Returns

Parser name

Return type

str

get_renders(page: Optional[Union[int, List[int]]] = None)

Return the renders of the object document. If page is None, return the entire rendered document. Otherwise returns the specified page only.

Parameters

page (int, List[int], or None) – zero-indexed page or list of pages to be rendered. Returns the whole document if None

Return type

PngImageFile or Dict[int, PngImageFile]
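The three forms of the page argument map onto the two return shapes roughly as in this sketch. The helper name `select_pages` is hypothetical and strings stand in for PngImageFile objects; only the None / int / list behaviour is taken from the description above.

```python
from typing import Dict, List, Optional, TypeVar, Union

T = TypeVar("T")


def select_pages(
    renders: Dict[int, T],
    page: Optional[Union[int, List[int]]] = None,
) -> Union[T, Dict[int, T]]:
    """Pick renders by zero-indexed page, mirroring the documented options."""
    if page is None:
        return dict(renders)              # whole document
    if isinstance(page, int):
        return renders[page]              # a single render
    return {p: renders[p] for p in page}  # the requested subset


renders = {0: "img-0", 1: "img-1", 2: "img-2"}
single = select_pages(renders, 1)
subset = select_pages(renders, [0, 2])
```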

get_text(page: Optional[int] = None)

Return the extracted text from the document. If page is None, return all text from the document. Otherwise, return the text for the specified page only.

Parameters

page (int or None) – zero-indexed page to extract text from. Returns the whole document if None

Return type

str or Dict[int, str]

get_tokens(page: Optional[int] = None)

Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise, return the tokens for the specified page only.

Parameters

page (int or None) – zero-indexed page to extract text from. Returns the whole document if None

Return type

str or Dict[int, str]

property logs

View any gathered logs.

Return type

Dict[int, Dict[str, Any]]

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns

The number of pages in the document

Return type

int

property reforge

The resulting reforged document.

Return type

bytes

save_reforge(save_path: str)

Saves the reforged document to the specified file location.

Parameters

save_path (str) – The file name and location to save the document.
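Since the reforge property yields bytes, saving amounts to a binary write. The sketch below shows that round trip with a stand-in payload; `save_bytes` is a hypothetical helper, not the library's method, and the payload is not a real reforged PDF.

```python
import os
import tempfile


def save_bytes(reforged: bytes, save_path: str) -> None:
    """Write the reforged document to disk in binary mode."""
    with open(save_path, "wb") as handle:
        handle.write(reforged)


payload = b"%PDF-1.4\n%%EOF\n"  # placeholder bytes for the example
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "reforged.pdf")
    save_bytes(payload, target)
    with open(target, "rb") as handle:
        round_trip = handle.read()
```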

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These hashes are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two cases: finding evidence of non-determinism in a parser, and quickly comparing parser translations of a document (see the Reforge class of tools).

Returns

The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.

Return type

SparclurHash

property validate_renderer

Performs a validity check for this renderer.

Return type

Dict[str, Any]

property validity

Returns the validity status from each of the parser's relevant tools and an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.

Returns

A dictionary of dictionaries laying out the validity and statuses for the parser tools.

Return type

Dict[str, Dict[str, Any]]

sparclur.parsers.MuPDF

class sparclur.parsers._mupdf.MuPDF(doc: Union[str, bytes], skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, page_hashes: Optional[Union[int, Tuple[Any]]] = None, validate_hash: bool = False, parse_streams: Optional[bool] = None, binary_path: Optional[str] = None, temp_folders_dir: Optional[str] = None, dpi: Optional[int] = None, cache_renders: Optional[bool] = None, timeout: Optional[int] = None, ocr: Optional[bool] = None)

Bases: sparclur._tracer.Tracer, sparclur._hybrid.Hybrid, sparclur._reforge.Reforger

MuPDF parser

property caching

Returns the caching setting for the renderer.

If caching is set to true, all rendered PIL images are retained in the object. Otherwise, the renders are regenerated every time the get_renders method is called.

Return type

bool

property cleaned

Return a normalized collection of the warnings and errors with occurrence counts.

Returns

A dictionary with each normalized message as the key and the occurrence count as the value

Return type

Dict[str, int]

clear_renders()

Clears any PIL images that have been retained in the renderer object.

clear_text()

Clear any text that has already been extracted for the document

compare(other: sparclur._renderer.Renderer, page=None, full=False)

Performs a structural similarity comparison between two renders

Parameters
  • other (Renderer) – The other Parser and document to compare this Parser and document to.

  • page (int, List[int], default=None) – Specify whether a single page or a specific collection of pages should be compared. If ‘None’, all pages are compared.

  • full (bool, default=False) – Return an image of the comparison of the two document renders for each page or the specified page.

Return type

Dict[int, PRCSim] or PRCSim

compare_ocr(page=None, shingle_size=4)

Method that compares the OCR result to the built-in text extraction.

Parameters
  • page (int) – Indicates which page the comparison should be run over. If ‘None’, all pages are compared.

  • shingle_size (int, default=4) – The size of the token shingles used in the Jaccard similarity comparison between the OCR and the text extraction.

Returns

The Jaccard similarity between the OCR and the text extraction (for the specified shingle size).

Return type

float

compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)

Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.

Parameters
  • other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser

  • page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document

  • shingle_size (int, default=4) – The size of the shingled n-grams

Returns

The Jaccard Similarity score

Return type

float

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns

String of the document path or first 15 bytes of the binary

Return type

str or bytes

property dpi

Return the dots per inch.

Return type

int

static get_name()

Return the SPARCLUR defined name for the parser.

Returns

Parser name

Return type

str

get_renders(page: Optional[Union[int, List[int]]] = None)

Return the renders of the object document. If page is None, return the entire rendered document. Otherwise returns the specified page only.

Parameters

page (int, List[int], or None) – zero-indexed page or list of pages to be rendered. Returns the whole document if None

Return type

PngImageFile or Dict[int, PngImageFile]

get_text(page: Optional[int] = None)

Return the extracted text from the document. If page is None, return all text from the document. Otherwise, return the text for the specified page only.

Parameters

page (int or None) – zero-indexed page to extract text from. Returns the whole document if None

Return type

str or Dict[int, str]

get_tokens(page: Optional[int] = None)

Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise, return the tokens for the specified page only.

Parameters

page (int or None) – zero-indexed page to extract text from. Returns the whole document if None

Return type

str or Dict[int, str]

property logs

View any gathered logs.

Return type

Dict[int, Dict[str, Any]]

property messages

Return the errors and warnings for the document passed into the Parser instance.

Returns

The list of all raw messages from the parser over the given document

Return type

List[str]

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns

The number of pages in the document

Return type

int

property reforge

The resulting reforged document.

Return type

bytes

save_reforge(save_path: str)

Saves the reforged document to the specified file location.

Parameters

save_path (str) – The file name and location to save the document.

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These hashes are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two cases: finding evidence of non-determinism in a parser, and quickly comparing parser translations of a document (see the Reforge class of tools).

Returns

The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.

Return type

SparclurHash
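To make the idea of per-tool sets of shingled hashes concrete, here is a small sketch that compares two such collections. It makes several assumptions: BLAKE2 from the standard library stands in for murmur hashing, Jaccard distance stands in for whatever distance SparclurHash computes, and the tool keys `text` and `trace` are hypothetical.

```python
import hashlib
from typing import Dict, List, Set


def shingle_hashes(tokens: List[str], size: int = 4) -> Set[str]:
    """Hash every run of `size` consecutive tokens (BLAKE2, not murmur)."""
    shingles = (" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1))
    return {hashlib.blake2b(s.encode("utf-8"), digest_size=8).hexdigest() for s in shingles}


def set_distance(a: Set[str], b: Set[str]) -> float:
    """Jaccard distance, used here as an assumed distance measure."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def compare_hashes(first: Dict[str, Set[str]], second: Dict[str, Set[str]]) -> Dict[str, float]:
    """Distance per tool, restricted to tools present in both hashes."""
    return {tool: set_distance(first[tool], second[tool]) for tool in sorted(first.keys() & second.keys())}


run_a = {"text": shingle_hashes("a b c d e".split()), "trace": shingle_hashes("w x y z".split())}
run_b = {"text": shingle_hashes("a b c d e".split()), "trace": shingle_hashes("w x y q".split())}
distances = compare_hashes(run_a, run_b)
```

For the non-determinism use case, both inputs would come from repeated runs of the same parser on the same document, and any non-zero distance is the signal of interest.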

property validate_renderer

Performs a validity check for this renderer.

Return type

Dict[str, Any]

property validate_text: Dict[str, Any]

Performs a validity check for this text extractor.

Return type

Dict[str, Any]

property validate_tracer: Dict[str, Any]

Performs a validity check for this tracer.

Return type

Dict[str, Any]

property validity

Returns the validity status from each of the parser's relevant tools and an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.

Returns

A dictionary of dictionaries laying out the validity and statuses for the parser tools.

Return type

Dict[str, Dict[str, Any]]

sparclur.parsers.PDFCPU

class sparclur.parsers._pdfcpu.PDFCPU(doc: Union[str, bytes], skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, binary_path: Optional[str] = None, temp_folders_dir: Optional[str] = None, timeout: Optional[int] = None)

Bases: sparclur._tracer.Tracer

Wrapper for PDFCPU (https://pdfcpu.io/)

property cleaned

Return a normalized collection of the warnings and errors with occurrence counts.

Returns

A dictionary with each normalized message as the key and the occurrence count as the value

Return type

Dict[str, int]

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns

String of the document path or first 15 bytes of the binary

Return type

str or bytes

static get_name()

Return the SPARCLUR defined name for the parser.

Returns

Parser name

Return type

str

property messages

Return the errors and warnings for the document passed into the Parser instance.

Returns

The list of all raw messages from the parser over the given document

Return type

List[str]

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns

The number of pages in the document

Return type

int

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These hashes are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two cases: finding evidence of non-determinism in a parser, and quickly comparing parser translations of a document (see the Reforge class of tools).

Returns

The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.

Return type

SparclurHash

property validate_tracer: Dict[str, Any]

Performs a validity check for this tracer.

Return type

Dict[str, Any]

property validity

Returns the validity status from each of the parser's relevant tools and an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.

Returns

A dictionary of dictionaries laying out the validity and statuses for the parser tools.

Return type

Dict[str, Dict[str, Any]]

sparclur.parsers.PDFium

class sparclur.parsers._pdfium.PDFium(doc: Union[str, bytes], skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, page_hashes: Optional[Union[int, Tuple[Any]]] = None, validate_hash: bool = False, temp_folders_dir: Optional[str] = None, dpi: Optional[int] = None, cache_renders: Optional[bool] = None, timeout: Optional[int] = None)

Bases: sparclur._renderer.Renderer

PDFium renderer

property caching

Returns the caching setting for the renderer.

If caching is set to true, all rendered PIL images are retained in the object. Otherwise, the renders are regenerated every time the get_renders method is called.

Return type

bool

clear_renders()

Clears any PIL images that have been retained in the renderer object.

clear_text()

Clear any text that has already been extracted for the document

compare(other: sparclur._renderer.Renderer, page=None, full=False)

Performs a structural similarity comparison between two renders

Parameters
  • other (Renderer) – The other Parser and document to compare this Parser and document to.

  • page (int, List[int], default=None) – Specify whether a single page or a specific collection of pages should be compared. If ‘None’, all pages are compared.

  • full (bool, default=False) – Return an image of the comparison of the two document renders for each page or the specified page.

Return type

Dict[int, PRCSim] or PRCSim

compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)

Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.

Parameters
  • other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser

  • page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document

  • shingle_size (int, default=4) – The size of the shingled n-grams

Returns

The Jaccard Similarity score

Return type

float

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns

String of the document path or first 15 bytes of the binary

Return type

str or bytes

property dpi

Return the dots per inch.

Return type

int

static get_name()

Return the SPARCLUR defined name for the parser.

Returns

Parser name

Return type

str

get_renders(page: Optional[Union[int, List[int]]] = None)

Return the renders of the object document. If page is None, return the entire rendered document. Otherwise returns the specified page only.

Parameters

page (int, List[int], or None) – zero-indexed page or list of pages to be rendered. Returns the whole document if None

Return type

PngImageFile or Dict[int, PngImageFile]

get_text(page: Optional[int] = None)

Return the extracted text from the document. If page is None, return all text from the document. Otherwise, return the text for the specified page only.

Parameters

page (int or None) – zero-indexed page to extract text from. Returns the whole document if None

Return type

str or Dict[int, str]

get_tokens(page: Optional[int] = None)

Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise, return the tokens for the specified page only.

Parameters

page (int or None) – zero-indexed page to extract text from. Returns the whole document if None

Return type

str or Dict[int, str]

property logs

View any gathered logs.

Return type

Dict[int, Dict[str, Any]]

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns

The number of pages in the document

Return type

int

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These hashes are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two cases: finding evidence of non-determinism in a parser, and quickly comparing parser translations of a document (see the Reforge class of tools).

Returns

The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.

Return type

SparclurHash

property validate_renderer

Performs a validity check for this renderer.

Return type

Dict[str, Any]

property validity

Returns the validity status from each of the parser's relevant tools and an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.

Returns

A dictionary of dictionaries laying out the validity and statuses for the parser tools.

Return type

Dict[str, Dict[str, Any]]

sparclur.parsers.PDFMiner

class sparclur.parsers._pdfminer.PDFMiner(doc: str, temp_folders_dir: Optional[str] = None, skip_check: Optional[bool] = None, hash_exclude: Optional[str] = None, timeout: Optional[int] = None, page_delimiter: Optional[str] = None, detect_vertical: Optional[bool] = None, all_texts: Optional[bool] = None, stream_output: Optional[str] = None, suppress_warnings: Optional[bool] = None)

Bases: sparclur._text_extractor.TextExtractor, sparclur._metadata_extractor.MetadataExtractor

PDFMiner Text Extraction https://pdfminersix.readthedocs.io/en/latest/

clear_text()

Clear any text that has already been extracted for the document

compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)

Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.

Parameters
  • other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser

  • page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document

  • shingle_size (int, default=4) – The size of the shingled n-grams

Returns

The Jaccard Similarity score

Return type

float

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns

String of the document path or first 15 bytes of the binary

Return type

str or bytes

static get_name()

Return the SPARCLUR defined name for the parser.

Returns

Parser name

Return type

str

get_text(page: Optional[int] = None)

Return the extracted text from the document. If page is None, return all text from the document. Otherwise, return the text for the specified page only.

Parameters

page (int or None) – zero-indexed page to extract text from. Returns the whole document if None

Return type

str or Dict[int, str]

get_tokens(page: Optional[int] = None)

Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise, return the tokens for the specified page only.

Parameters

page (int or None) – zero-indexed page to extract text from. Returns the whole document if None

Return type

str or Dict[int, str]

property metadata: Dict[str, Any]

Return the dictionary of metadata.

Return type

Dict[str, Any]

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns

The number of pages in the document

Return type

int

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These hashes are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two cases: finding evidence of non-determinism in a parser, and quickly comparing parser translations of a document (see the Reforge class of tools).

Returns

The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.

Return type

SparclurHash

property validate_metadata: Dict[str, Any]

Performs a validity check for this metadata extractor.

Return type

Dict[str, Any]

property validate_text: Dict[str, Any]

Performs a validity check for this text extractor.

Return type

Dict[str, Any]

property validity

Returns the validity status from each of the parser's relevant tools and an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.

Returns

A dictionary of dictionaries laying out the validity and statuses for the parser tools.

Return type

Dict[str, Dict[str, Any]]

sparclur.parsers.Poppler

class sparclur.parsers._poppler.Poppler(doc: str, skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, page_hashes: Optional[Union[int, Tuple[Any]]] = None, validate_hash: bool = False, trace: Optional[str] = None, binary_path: Optional[str] = None, temp_folders_dir: Optional[str] = None, page_delimiter: Optional[str] = None, maintain_layout: Optional[bool] = None, dpi: Optional[int] = None, size: Optional[Tuple[int]] = None, cache_renders: Optional[bool] = None, timeout: Optional[int] = None, ocr: Optional[bool] = None)

Bases: sparclur._tracer.Tracer, sparclur._hybrid.Hybrid, sparclur._font_extractor.FontExtractor, sparclur._image_data_extractor.ImageDataExtractor, sparclur._reforge.Reforger

Poppler wrapper for pdftoppm, pdftocairo, and pdftotext

property caching

Returns the caching setting for the renderer.

If caching is set to true, all rendered PIL images are retained in the object. Otherwise, the renders are regenerated every time the get_renders method is called.

Return type

bool

property cleaned

Return a normalized collection of the warnings and errors with occurrence counts.

Returns

A dictionary with each normalized message as the key and the occurrence count as the value

Return type

Dict[str, int]

clear_renders()

Clears any PIL images that have been retained in the renderer object.

clear_text()

Clear any text that has already been extracted for the document

compare(other: sparclur._renderer.Renderer, page=None, full=False)

Performs a structural similarity comparison between two renders

Parameters
  • other (Renderer) – The other Parser and document to compare this Parser and document to.

  • page (int, List[int], default=None) – Specify whether a single page or a specific collection of pages should be compared. If ‘None’, all pages are compared.

  • full (bool, default=False) – Return an image of the comparison of the two document renders for each page or the specified page.

Return type

Dict[int, PRCSim] or PRCSim

compare_ocr(page=None, shingle_size=4)

Method that compares the OCR result to the built-in text extraction.

Parameters
  • page (int) – Indicates which page the comparison should be run over. If ‘None’, all pages are compared.

  • shingle_size (int, default=4) – The size of the token shingles used in the Jaccard similarity comparison between the OCR and the text extraction.

Returns

The Jaccard similarity between the OCR and the text extraction (for the specified shingle size).

Return type

float

compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)

Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.

Parameters
  • other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser

  • page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document

  • shingle_size (int, default=4) – The size of the shingled n-grams

Returns

The Jaccard Similarity score

Return type

float

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns

String of the document path or first 15 bytes of the binary

Return type

str or bytes

property dpi

Return the dots per inch.

Return type

int

property fonts

Extracts the detected fonts from the PDF file.

Return type

Dict[str, Any]

static get_name()

Return the SPARCLUR defined name for the parser.

Returns

Parser name

Return type

str

get_renders(page: Optional[Union[int, List[int]]] = None)

Return the renders of the object document. If page is None, return the entire rendered document. Otherwise returns the specified page only.

Parameters

page (int, List[int], or None) – zero-indexed page or list of pages to be rendered. Returns the whole document if None

Return type

PngImageFile or Dict[int, PngImageFile]

get_text(page: Optional[int] = None)

Return the extracted text from the document. If page is None, return all text from the document. Otherwise, return the text for the specified page only.

Parameters

page (int or None) – zero-indexed page to extract text from. Returns the whole document if None

Return type

str or Dict[int, str]

get_tokens(page: Optional[int] = None)

Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise, return the tokens for the specified page only.

Parameters

page (int or None) – zero-indexed page to extract text from. Returns the whole document if None

Return type

str or Dict[int, str]

property logs

View any gathered logs.

Return type

Dict[int, Dict[str, Any]]

property messages

Return the errors and warnings for the document passed into the Parser instance.

Returns

The list of all raw messages from the parser over the given document

Return type

List[str]

property non_embedded_fonts

Determine whether or not there are non-embedded fonts in the PDF. Returns True if there are missing fonts.

Return type

bool

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns

The number of pages in the document

Return type

int
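The three possible outcomes of num_pages (None, 0, or a positive count) carry different meanings, so a caller may want to distinguish them explicitly. The following is a minimal sketch; describe_page_count is a hypothetical helper for illustration and is not part of SPARCLUR.

```python
from typing import Optional


def describe_page_count(num_pages: Optional[int]) -> str:
    """Interpret a num_pages value (hypothetical helper, not a SPARCLUR API)."""
    if num_pages is None:
        # The parser does not support page number extraction.
        return "page count not supported by this parser"
    if num_pages == 0:
        # The parser failed to load the document or determine its page count.
        return "parser failed to load the document"
    return f"{num_pages} page(s)"


for value in (None, 0, 3):
    print(describe_page_count(value))
```

Checking for None before checking for 0 matters here, because both values are falsy in Python and a plain truthiness test would conflate them.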

property reforge

The resulting reforged document.

Return type

bytes

save_reforge(save_path: str)

Saves the reforged document to the specified file location.

Parameters

save_path (str) – The file name and location to save the document.

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These hashes are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two specific cases: finding evidence of non-determinism in a parser, and quickly comparing differences between parser translations of a document (see the Reforge class of tools).

Returns

The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.

Return type

SparclurHash
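As a rough illustration of comparing sets of shingled hashes, the sketch below hashes each token shingle and measures the distance between two hash sets. It rests on several assumptions: blake2b from the standard library stands in for a murmur hash, Jaccard distance is used as the distance measure, and the names shingle_hashes and hash_distance are hypothetical. None of this is taken from the SparclurHash implementation.

```python
import hashlib
from typing import List, Set


def shingle_hashes(tokens: List[str], shingle_size: int = 4) -> Set[int]:
    """Hash every contiguous run of `shingle_size` tokens to a 64-bit integer."""
    hashes = set()
    for i in range(max(len(tokens) - shingle_size + 1, 0)):
        shingle = " ".join(tokens[i:i + shingle_size]).encode("utf-8")
        # Assumption: blake2b stands in for murmur, which the stdlib lacks.
        digest = hashlib.blake2b(shingle, digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def hash_distance(a: Set[int], b: Set[int]) -> float:
    """Jaccard distance between two hash sets: 0.0 identical, 1.0 disjoint."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Invented trace messages, used only to exercise the functions.
run_1 = shingle_hashes("warning missing endobj at offset".split(), shingle_size=2)
run_2 = shingle_hashes("warning missing endobj at offset".split(), shingle_size=2)
run_3 = shingle_hashes("warning unexpected token at offset".split(), shingle_size=2)

print(hash_distance(run_1, run_2))  # identical runs
print(hash_distance(run_1, run_3))  # differing runs
```

A distance of zero across repeated runs of the same parser on the same document is consistent with deterministic behaviour; a non-zero distance flags a difference worth inspecting.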

property validate_fonts

Checks whether or not fonts can be successfully extracted from a document. Any issues or errors will result in a ‘Rejected’ classification.

Returns

A dictionary containing a boolean for validity, a classification label for validity, and relevant info for the classification

Return type

Dict[str, str]

property validate_image_data

Checks whether or not image data can be successfully extracted from a document. Any issues or errors will result in a ‘Rejected’ classification.

Returns

A dictionary containing a boolean for validity, a classification label for validity, and relevant info for the classification

Return type

Dict[str, str]

property validate_renderer: Dict[str, Any]

Performs a validity check for this renderer.

Return type

Dict[str, Any]

property validate_text: Dict[str, Any]

Performs a validity check for this text extractor.

Return type

Dict[str, Any]

property validate_tracer: Dict[str, Any]

Performs a validity check for this tracer.

Return type

Dict[str, Any]

property validity

Returns the validity statuses from each of the relevant tools of the parser and an overall validity for the document. If any of the tools reports a warning or error, the overall status reflects it; the overall status is valid only if every tool marks the document as valid.

Returns

A dictionary of dictionaries laying out the validity and statuses for the parser tools.

Return type

Dict[str, Dict[str, Any]]
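The rule for the overall status can be sketched as a small aggregation over the per-tool results. This is a minimal sketch under stated assumptions: the dictionary keys ('valid', 'status'), the status labels, and the helper overall_validity are hypothetical and are not SPARCLUR's actual schema.

```python
from typing import Any, Dict


def overall_validity(tool_results: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Combine per-tool validity dictionaries into one overall entry.

    Assumed (hypothetical) schema: each tool maps to
    {'valid': bool, 'status': str}.
    """
    # Collect the status of every tool that did not mark the document valid.
    problems = {
        name: result.get("status")
        for name, result in tool_results.items()
        if not result.get("valid", False)
    }
    if problems:
        # A warning or error from any single tool makes the overall invalid.
        summary = "; ".join(f"{name}: {status}" for name, status in sorted(problems.items()))
        return {"valid": False, "status": summary}
    return {"valid": True, "status": "Valid"}


# Placeholder tool names and labels, chosen only for illustration.
results = {
    "Renderer": {"valid": True, "status": "Valid"},
    "Tracer": {"valid": False, "status": "Warnings"},
}
print(overall_validity(results))
```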

sparclur.parsers.QPDF

class sparclur.parsers._qpdf.QPDF(doc: str, temp_folders_dir: Optional[str] = None, skip_check: Optional[bool] = None, hash_exclude: Optional[str] = None, binary_path: Optional[str] = None, timeout: Optional[int] = None)

Bases: sparclur._tracer.Tracer, sparclur._metadata_extractor.MetadataExtractor

QPDF tracer

property cleaned

Return a normalized collection of the warnings and errors with occurrence counts.

Returns

A dictionary with each normalized message as the key and the occurrence count as the value

Return type

Dict[str, int]
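The shape of the returned dictionary can be illustrated with a small normalize-and-count sketch. The normalization rule used here (replacing runs of digits with a placeholder) is an assumption made for illustration, the sample messages are invented rather than real QPDF output, and clean_messages is a hypothetical helper, not the library's implementation.

```python
import re
from collections import Counter
from typing import Dict, List


def clean_messages(messages: List[str]) -> Dict[str, int]:
    """Normalize raw messages and count each normalized form."""
    normalized = []
    for message in messages:
        # Assumed rule: replace every run of digits with a placeholder so
        # messages that differ only in offsets or object numbers collapse.
        normalized.append(re.sub(r"\d+", "<n>", message.strip()))
    return dict(Counter(normalized))


# Invented sample messages, not actual parser output.
raw = [
    "warning: object 12 is damaged",
    "warning: object 37 is damaged",
    "error: cross-reference table is unreadable",
]
print(clean_messages(raw))
```

Grouping by normalized message keeps the result compact when the same problem recurs at many locations in a document.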

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns

String of the document path or first 15 bytes of the binary

Return type

str or bytes

static get_name()

Return the SPARCLUR defined name for the parser.

Returns

Parser name

Return type

str

property messages

Return the error and warnings for the document passed into the Parser instance.

Returns

The list of all raw messages from the parser over the given document

Return type

List[str]

property metadata: Dict[str, Any]

Return the dictionary of metadata.

Return type

Dict[str, Any]

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns

The number of pages in the document

Return type

int

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These hashes are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two specific cases: finding evidence of non-determinism in a parser, and quickly comparing differences between parser translations of a document (see the Reforge class of tools).

Returns

The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.

Return type

SparclurHash

property validate_metadata: Dict[str, Any]

Performs a validity check for this metadata extractor.

Return type

Dict[str, Any]

property validate_tracer: Dict[str, Any]

Performs a validity check for this tracer.

Return type

Dict[str, Any]

property validity

Returns the validity statuses from each of the relevant tools of the parser and an overall validity for the document. If any of the tools reports a warning or error, the overall status reflects it; the overall status is valid only if every tool marks the document as valid.

Returns

A dictionary of dictionaries laying out the validity and statuses for the parser tools.

Return type

Dict[str, Dict[str, Any]]

sparclur.parsers.XPDF

class sparclur.parsers._xpdf.XPDF(doc: Union[str, bytes], skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, page_hashes: Optional[Union[int, Tuple[Any]]] = None, validate_hash: bool = False, binary_path: Optional[str] = None, temp_folders_dir: Optional[str] = None, page_delimiter: Optional[str] = None, maintain_layout: Optional[bool] = None, dpi: Optional[int] = None, size: Optional[Union[Tuple[int], int]] = None, cache_renders: Optional[bool] = None, timeout: Optional[int] = None, ocr: Optional[bool] = None)

Bases: sparclur._tracer.Tracer, sparclur._hybrid.Hybrid, sparclur._font_extractor.FontExtractor

XPDF wrapper for pdftoppm and pdftotext

property caching

Returns the caching setting for the renderer.

If caching is set to True, the collection of all rendered PIL images is retained in the object. Otherwise, the renders are regenerated every time the get_renders method is called.

Return type

bool
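The effect of the caching setting can be sketched with a toy renderer that counts how often pages are actually rendered. CachingRenderer and fake_render are hypothetical stand-ins written for this illustration; strings take the place of PIL images, and nothing here is SPARCLUR's implementation.

```python
from typing import Callable, Dict, List, Optional


class CachingRenderer:
    """Toy stand-in for a renderer (not SPARCLUR's implementation)."""

    def __init__(self, render_page: Callable[[int], str], num_pages: int,
                 cache_renders: bool = False):
        self._render_page = render_page
        self._num_pages = num_pages
        self._caching = cache_renders
        self._renders: Dict[int, str] = {}

    @property
    def caching(self) -> bool:
        return self._caching

    def get_renders(self, page: Optional[int] = None):
        pages = range(self._num_pages) if page is None else [page]
        result = {}
        for p in pages:
            if self._caching and p in self._renders:
                # Reuse the retained render instead of rendering again.
                result[p] = self._renders[p]
                continue
            rendered = self._render_page(p)
            if self._caching:
                self._renders[p] = rendered
            result[p] = rendered
        # Whole document as a dict, or a single render for one page.
        return result if page is None else result[page]

    def clear_renders(self) -> None:
        """Drop every retained render."""
        self._renders.clear()


calls: List[int] = []


def fake_render(page: int) -> str:
    calls.append(page)
    return f"render-{page}"


cached = CachingRenderer(fake_render, num_pages=2, cache_renders=True)
cached.get_renders()
cached.get_renders()
print(len(calls))  # 2: the second call is served from the cache
```

Retaining renders trades memory for speed, which is why a method such as clear_renders is useful for releasing the retained images once they are no longer needed.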

property cleaned

Return a normalized collection of the warnings and errors with occurrence counts.

Returns

A dictionary with each normalized message as the key and the occurrence count as the value

Return type

Dict[str, int]

clear_renders()

Clears any PIL images that have been retained in the renderer object.

clear_text()

Clear any text that has already been extracted for the document

compare(other: sparclur._renderer.Renderer, page=None, full=False)

Performs a structural similarity comparison between two renders

Parameters
  • other (Renderer) – The other Parser and document to compare this Parser and document to.

  • page (int, List[int], default=None) – Specify whether a single page or a specific collection of pages should be compared. If ‘None’, all pages are compared.

  • full (bool, default=False) – Return an image of the comparison of the two document renders for each page or the specified page.

Return type

Dict[int, PRCSim] or PRCSim

compare_ocr(page=None, shingle_size=4)

Method that compares the OCR result to the built-in text extraction.

Parameters
  • page (int) – Indicates which page the comparison should be run over. If ‘None’, all pages are compared.

  • shingle_size (int, default=4) – The size of the token shingles used in the Jaccard similarity comparison between the OCR and the text extraction.

Returns

The Jaccard similarity between the OCR and the text extraction (for the specified shingle size).

Return type

float

compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)

Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.

Parameters
  • other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser

  • page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document

  • shingle_size (int, default=4) – The size of the shingled n-grams

Returns

The Jaccard Similarity score

Return type

float
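The shingle-and-compare procedure can be sketched in a few lines. This is a minimal sketch, assuming tokens are plain strings and that two empty shingle sets count as identical; the helper names shingle and jaccard are hypothetical and not part of SPARCLUR.

```python
from typing import List, Set, Tuple


def shingle(tokens: List[str], shingle_size: int = 4) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous n-grams (shingles) over the token list."""
    if len(tokens) < shingle_size:
        # Assumption: too few tokens for a full shingle, so the whole list
        # is treated as a single shingle (or no shingle if it is empty).
        return {tuple(tokens)} if tokens else set()
    return {
        tuple(tokens[i:i + shingle_size])
        for i in range(len(tokens) - shingle_size + 1)
    }


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


left = "the quick brown fox jumps over the lazy dog".split()
right = "the quick brown fox leaps over the lazy dog".split()
score = jaccard(shingle(left), shingle(right))
print(round(score, 3))  # 0.2
```

A single differing token changes every shingle that contains it, so larger values of shingle_size make the score more sensitive to small local differences.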

property doc

Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.

Returns

String of the document path or first 15 bytes of the binary

Return type

str or bytes

property dpi

Return the dots per inch.

Return type

int

property fonts

Extracts the detected fonts from the PDF file.

Return type

Dict[str, Any]

static get_name()

Return the SPARCLUR defined name for the parser.

Returns

Parser name

Return type

str

get_renders(page: Optional[Union[int, List[int]]] = None)

Return the renders of the object document. If page is None, return the entire rendered document. Otherwise returns the specified page only.

Parameters

page (int, List[int], or None) – zero-indexed page or list of pages to be rendered. Returns the whole document if None

Return type

PngImageFile or Dict[int, PngImageFile]

get_text(page: Optional[int] = None)

Return the extracted text from the document. If page is None, return all text from the document. Otherwise return the text for the specified page only.

Parameters

page (int or None) – zero-indexed page to extract text from. Returns the whole document if None

Return type

str or Dict[int, str]

get_tokens(page: Optional[int] = None)

Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise return the tokens for the specified page only.

Parameters

page (int or None) – zero-indexed page to extract text from. Returns the whole document if None

Return type

str or Dict[int, str]

property logs

View any gathered logs.

Return type

Dict[int, Dict[str, Any]]

property messages

Return the error and warnings for the document passed into the Parser instance.

Returns

The list of all raw messages from the parser over the given document

Return type

List[str]

property non_embedded_fonts

Determine whether or not there are non-embedded fonts in the PDF. Returns True if there are missing fonts.

Return type

bool

property num_pages

Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.

Returns

The number of pages in the document

Return type

int

property sparclur_hash

The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These hashes are collected and can then be used to compare two documents by calculating a distance measure. This is most relevant in two specific cases: finding evidence of non-determinism in a parser, and quickly comparing differences between parser translations of a document (see the Reforge class of tools).

Returns

The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.

Return type

SparclurHash

property validate_fonts

Checks whether or not fonts can be successfully extracted from a document. Any issues or errors will result in a ‘Rejected’ classification.

Returns

A dictionary containing a boolean for validity, a classification label for validity, and relevant info for the classification

Return type

Dict[str, str]

property validate_renderer: Dict[str, Any]

Performs a validity check for this renderer.

Return type

Dict[str, Any]

property validate_text: Dict[str, Any]

Performs a validity check for this text extractor.

Return type

Dict[str, Any]

property validate_tracer: Dict[str, Any]

Performs a validity check for this tracer.

Return type

Dict[str, Any]

property validity

Returns the validity statuses from each of the relevant tools of the parser and an overall validity for the document. If any of the tools reports a warning or error, the overall status reflects it; the overall status is valid only if every tool marks the document as valid.

Returns

A dictionary of dictionaries laying out the validity and statuses for the parser tools.

Return type

Dict[str, Dict[str, Any]]