sparclur.parsers package
Parsers
sparclur.parsers.Arlington
- class sparclur.parsers._arlington.Arlington(doc: Union[str, bytes], arlington_path: Optional[str] = None, version: Optional[Union[float, str]] = None, skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, temp_folders_dir: Optional[str] = None, timeout: Optional[int] = None)
Bases:
sparclur._tracer.Tracer
Wrapper for the Arlington DOM TestGrammar (https://github.com/pdf-association/arlington-pdf-model)
- property cleaned
Return a normalized collection of the warnings and errors with occurrence counts.
- Returns
A dictionary with each normalized message as the key and the occurrence count as the value
- Return type
Dict[str, int]
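The shape of this result can be illustrated with a small standalone sketch. The normalization rule (collapsing digit runs into a placeholder) and the helper name count_normalized are assumptions made for illustration; they are not SPARCLUR's actual cleaning logic.

```python
import re
from collections import Counter
from typing import Dict, List


def count_normalized(messages: List[str]) -> Dict[str, int]:
    """Hypothetical helper: collapse digit runs so that messages differing
    only by object numbers or offsets share one key, then count them."""
    normalized = (re.sub(r"\d+", "<n>", m.strip()) for m in messages)
    return dict(Counter(normalized))


# Made-up messages, not real parser output.
raw = [
    "error: object 12 missing",
    "error: object 40 missing",
    "warning: bad xref",
]
counts = count_normalized(raw)
```

Grouping by a normalized key keeps one entry per kind of problem, so a document that repeats the same fault many times does not drown out rarer messages.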
- property doc
Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.
- Returns
String of the document path or first 15 bytes of the binary
- Return type
str or bytes
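A minimal sketch of the described behavior, assuming the value is either a path string or the raw bytes of the document. The function name doc_label is hypothetical.

```python
from typing import Union


def doc_label(doc: Union[str, bytes]) -> Union[str, bytes]:
    """Hypothetical helper: return a path unchanged, or only the
    first 15 bytes when raw binary content was supplied."""
    return doc if isinstance(doc, str) else doc[:15]


path_label = doc_label("reports/sample.pdf")
binary_label = doc_label(b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>")
```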
- static get_name()
Return the SPARCLUR defined name for the parser.
- Returns
Parser name
- Return type
str
- property messages
Return the errors and warnings for the document passed into the Parser instance.
- Returns
The list of all raw messages from the parser over the given document
- Return type
List[str]
- property num_pages
Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.
- Returns
The number of pages in the document
- Return type
int
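Because None and 0 carry different meanings here, a caller should not test the value with a plain truthiness check. The sketch below separates the three documented outcomes; the helper name and the labels it returns are assumptions for illustration.

```python
from typing import Optional


def describe_page_count(num_pages: Optional[int]) -> str:
    """Hypothetical helper that labels the three documented outcomes."""
    if num_pages is None:
        return "unsupported"  # the parser cannot report a page count
    if num_pages == 0:
        return "failed"       # the parser could not load the document
    return "ok"


labels = [describe_page_count(n) for n in (None, 0, 12)]
```

Writing `if not num_pages:` instead would merge the "unsupported" and "failed" cases.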
- property sparclur_hash
The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These are collected and can then be used to compare two documents and calculate a distance measure. This is most relevant in two specific cases: the first is finding evidence of non-determinism in a parser, and the second is quickly comparing differences between parser translations of a document (see the Reforge class of tools).
- Returns
The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
- Return type
SparclurHash
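The non-determinism use case can be sketched without SPARCLUR itself: produce the same output several times, hash each result, and check whether more than one digest appears. SHA-256 from the standard library stands in for the image and murmur hashes mentioned above, and distinct_digests is a hypothetical name.

```python
import hashlib
import itertools
from typing import Callable, Set


def distinct_digests(produce: Callable[[], bytes], runs: int = 3) -> Set[str]:
    """Hypothetical check: call `produce` several times and collect the
    distinct SHA-256 digests of its output."""
    return {hashlib.sha256(produce()).hexdigest() for _ in range(runs)}


# A deterministic producer yields exactly one digest.
stable = distinct_digests(lambda: b"same output every time")

# A producer whose output changes per call yields several digests.
counter = itertools.count()
unstable = distinct_digests(lambda: str(next(counter)).encode())
```

More than one digest for the same input document is evidence that the producing tool is not deterministic.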
- property validate_tracer: Dict[str, Any]
Performs a validity check for this tracer.
- Return type
Dict[str, Any]
- property validity
Return the validity status from each of the parser's relevant tools, plus an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.
- Returns
A dictionary of dictionaries laying out the validity and statuses for the parser tools.
- Return type
Dict[str, Dict[str, Any]]
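The aggregation rule described above can be sketched as follows. The per-tool keys ("valid", "status"), the tool names, and the function overall_validity are assumptions for illustration; the actual keys in the SPARCLUR result may differ.

```python
from typing import Any, Dict


def overall_validity(tools: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical aggregation: the document is valid overall only if
    every tool reports it as valid."""
    failing = sorted(
        name for name, result in tools.items() if not result.get("valid", False)
    )
    return {"valid": not failing, "failing_tools": failing}


# Made-up per-tool results.
report = {
    "Tracer": {"valid": True, "status": "Valid"},
    "Renderer": {"valid": False, "status": "Rejected"},
}
summary = overall_validity(report)
```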
sparclur.parsers.Ghostscript
- class sparclur.parsers._ghostscript.Ghostscript(doc: str, skip_check: Optional[bool] = None, temp_folders_dir: Optional[str] = None, dpi: Optional[int] = None, size: Optional[Union[Tuple[int], int]] = None, cache_renders: Optional[bool] = None, timeout: Optional[int] = None, hash_exclude: Optional[Union[str, List[str]]] = None, page_hashes: Optional[Union[int, Tuple[Any]]] = None, validate_hash: bool = False)
Bases:
sparclur._renderer.Renderer, sparclur._reforge.Reforger
Abstract class for PDF renderers.
- property caching
Returns the caching setting for the renderer.
If caching is set to true, the collection of all rendered PIL images is retained in the object. Otherwise, the renders are regenerated every time the get_renders method is called.
- Return type
bool
- clear_renders()
Clears any PIL images that have been retained in the renderer object.
- clear_text()
Clear any text that has already been extracted for the document
- compare(other: sparclur._renderer.Renderer, page=None, full=False)
Performs a structural similarity comparison between two renders
- Parameters
other (Renderer) – The other Parser and document to compare this Parser and document to.
page (int, List[int], default=None) – Specify whether a single page or a specific collection of pages should be compared. If None, all pages are compared.
full (bool, default=False) – Return an image of the comparison of the two document renders for each page or the specified page.
- Return type
Dict[int, PRCSim] or PRCSim
- compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)
Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.
- Parameters
other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser
page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document
shingle_size (int, default=4) – The size of the shingled n-grams
- Returns
The Jaccard Similarity score
- Return type
float
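The shingling and Jaccard steps can be sketched in plain Python. This is a generic illustration, not SPARCLUR's implementation: the function names, the handling of token lists shorter than the shingle size, and the choice to score two empty sets as 1.0 are all assumptions.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(tokens: List[str], size: int = 4) -> Set[Shingle]:
    """Return the set of contiguous n-grams of the given size.
    Assumption: a list shorter than `size` becomes a single shingle."""
    if len(tokens) < size:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Size of the intersection divided by the size of the union.
    Assumption: two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


left = "the quick brown fox jumps".split()
right = "the quick brown fox sleeps".split()
score = jaccard(shingles(left, 4), shingles(right, 4))
```

Each input here yields two 4-grams, one of which is shared, so the score is one third. A larger shingle size makes the comparison stricter about word order.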
- property doc
Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.
- Returns
String of the document path or first 15 bytes of the binary
- Return type
str or bytes
- property dpi
Return the dots per inch.
- Return type
int
- static get_name()
Return the SPARCLUR defined name for the parser.
- Returns
Parser name
- Return type
str
- get_renders(page: Optional[Union[int, List[int]]] = None)
Return the renders of the object document. If page is None, return the entire rendered document. Otherwise returns the specified page only.
- Parameters
page (int, List[int], or None) – zero-indexed page or list of pages to be rendered. Returns the whole document if None
- Return type
PngImageFile or Dict[int, PngImageFile]
- get_text(page: Optional[int] = None)
Return the extracted text from the document. If page is None, return all text from the document. Otherwise, return the text for the specified page only.
- Parameters
page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
- Return type
str or Dict[int, str]
- get_tokens(page: Optional[int] = None)
Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise, return the tokens for the specified page only.
- Parameters
page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
- Return type
str or Dict[int, str]
- property logs
View any gathered logs.
- Return type
Dict[int, Dict[str, Any]]
- property num_pages
Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.
- Returns
The number of pages in the document
- Return type
int
- property reforge
The resulting reforged document.
- Return type
bytes
- save_reforge(save_path: str)
Saves the reforged document to the specified file location.
- Parameters
save_path (str) – The file name and location to save the document.
- property sparclur_hash
The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These are collected and can then be used to compare two documents and calculate a distance measure. This is most relevant in two specific cases: the first is finding evidence of non-determinism in a parser, and the second is quickly comparing differences between parser translations of a document (see the Reforge class of tools).
- Returns
The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
- Return type
SparclurHash
- property validate_renderer
Performs a validity check for this renderer.
- Return type
Dict[str, Any]
- property validity
Return the validity status from each of the parser's relevant tools, plus an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.
- Returns
A dictionary of dictionaries laying out the validity and statuses for the parser tools.
- Return type
Dict[str, Dict[str, Any]]
sparclur.parsers.MuPDF
- class sparclur.parsers._mupdf.MuPDF(doc: Union[str, bytes], skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, page_hashes: Optional[Union[int, Tuple[Any]]] = None, validate_hash: bool = False, parse_streams: Optional[bool] = None, binary_path: Optional[str] = None, temp_folders_dir: Optional[str] = None, dpi: Optional[int] = None, cache_renders: Optional[bool] = None, timeout: Optional[int] = None, ocr: Optional[bool] = None)
Bases:
sparclur._tracer.Tracer, sparclur._hybrid.Hybrid, sparclur._reforge.Reforger
MuPDF parser
- property caching
Returns the caching setting for the renderer.
If caching is set to true, the collection of all rendered PIL images is retained in the object. Otherwise, the renders are regenerated every time the get_renders method is called.
- Return type
bool
- property cleaned
Return a normalized collection of the warnings and errors with occurrence counts.
- Returns
A dictionary with each normalized message as the key and the occurrence count as the value
- Return type
Dict[str, int]
- clear_renders()
Clears any PIL images that have been retained in the renderer object.
- clear_text()
Clear any text that has already been extracted for the document
- compare(other: sparclur._renderer.Renderer, page=None, full=False)
Performs a structural similarity comparison between two renders
- Parameters
other (Renderer) – The other Parser and document to compare this Parser and document to.
page (int, List[int], default=None) – Specify whether a single page or a specific collection of pages should be compared. If None, all pages are compared.
full (bool, default=False) – Return an image of the comparison of the two document renders for each page or the specified page.
- Return type
Dict[int, PRCSim] or PRCSim
- compare_ocr(page=None, shingle_size=4)
Method that compares the OCR result to the built-in text extraction.
- Parameters
page (int) – Indicates which page the comparison should be run over. If ‘None’, all pages are compared.
shingle_size (int, default=4) – The size of the token shingles used in the Jaccard similarity comparison between the OCR and the text extraction.
- Returns
The Jaccard similarity between the OCR and the text extraction (for the specified shingle size).
- Return type
float
- compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)
Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.
- Parameters
other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser
page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document
shingle_size (int, default=4) – The size of the shingled n-grams
- Returns
The Jaccard Similarity score
- Return type
float
- property doc
Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.
- Returns
String of the document path or first 15 bytes of the binary
- Return type
str or bytes
- property dpi
Return the dots per inch.
- Return type
int
- static get_name()
Return the SPARCLUR defined name for the parser.
- Returns
Parser name
- Return type
str
- get_renders(page: Optional[Union[int, List[int]]] = None)
Return the renders of the object document. If page is None, return the entire rendered document. Otherwise returns the specified page only.
- Parameters
page (int, List[int], or None) – zero-indexed page or list of pages to be rendered. Returns the whole document if None
- Return type
PngImageFile or Dict[int, PngImageFile]
- get_text(page: Optional[int] = None)
Return the extracted text from the document. If page is None, return all text from the document. Otherwise, return the text for the specified page only.
- Parameters
page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
- Return type
str or Dict[int, str]
- get_tokens(page: Optional[int] = None)
Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise, return the tokens for the specified page only.
- Parameters
page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
- Return type
str or Dict[int, str]
- property logs
View any gathered logs.
- Return type
Dict[int, Dict[str, Any]]
- property messages
Return the errors and warnings for the document passed into the Parser instance.
- Returns
The list of all raw messages from the parser over the given document
- Return type
List[str]
- property num_pages
Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.
- Returns
The number of pages in the document
- Return type
int
- property reforge
The resulting reforged document.
- Return type
bytes
- save_reforge(save_path: str)
Saves the reforged document to the specified file location.
- Parameters
save_path (str) – The file name and location to save the document.
- property sparclur_hash
The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These are collected and can then be used to compare two documents and calculate a distance measure. This is most relevant in two specific cases: the first is finding evidence of non-determinism in a parser, and the second is quickly comparing differences between parser translations of a document (see the Reforge class of tools).
- Returns
The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
- Return type
SparclurHash
- property validate_renderer
Performs a validity check for this renderer.
- Return type
Dict[str, Any]
- property validate_text: Dict[str, Any]
Performs a validity check for this text extractor.
- Return type
Dict[str, Any]
- property validate_tracer: Dict[str, Any]
Performs a validity check for this tracer.
- Return type
Dict[str, Any]
- property validity
Return the validity status from each of the parser's relevant tools, plus an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.
- Returns
A dictionary of dictionaries laying out the validity and statuses for the parser tools.
- Return type
Dict[str, Dict[str, Any]]
sparclur.parsers.PDFCPU
- class sparclur.parsers._pdfcpu.PDFCPU(doc: Union[str, bytes], skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, binary_path: Optional[str] = None, temp_folders_dir: Optional[str] = None, timeout: Optional[int] = None)
Bases:
sparclur._tracer.Tracer
Wrapper for PDFCPU (https://pdfcpu.io/)
- property cleaned
Return a normalized collection of the warnings and errors with occurrence counts.
- Returns
A dictionary with each normalized message as the key and the occurrence count as the value
- Return type
Dict[str, int]
- property doc
Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.
- Returns
String of the document path or first 15 bytes of the binary
- Return type
str or bytes
- static get_name()
Return the SPARCLUR defined name for the parser.
- Returns
Parser name
- Return type
str
- property messages
Return the errors and warnings for the document passed into the Parser instance.
- Returns
The list of all raw messages from the parser over the given document
- Return type
List[str]
- property num_pages
Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.
- Returns
The number of pages in the document
- Return type
int
- property sparclur_hash
The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These are collected and can then be used to compare two documents and calculate a distance measure. This is most relevant in two specific cases: the first is finding evidence of non-determinism in a parser, and the second is quickly comparing differences between parser translations of a document (see the Reforge class of tools).
- Returns
The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
- Return type
SparclurHash
- property validate_tracer: Dict[str, Any]
Performs a validity check for this tracer.
- Return type
Dict[str, Any]
- property validity
Return the validity status from each of the parser's relevant tools, plus an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.
- Returns
A dictionary of dictionaries laying out the validity and statuses for the parser tools.
- Return type
Dict[str, Dict[str, Any]]
sparclur.parsers.PDFium
- class sparclur.parsers._pdfium.PDFium(doc: Union[str, bytes], skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, page_hashes: Optional[Union[int, Tuple[Any]]] = None, validate_hash: bool = False, temp_folders_dir: Optional[str] = None, dpi: Optional[int] = None, cache_renders: Optional[bool] = None, timeout: Optional[int] = None)
Bases:
sparclur._renderer.Renderer
PDFium renderer
- property caching
Returns the caching setting for the renderer.
If caching is set to true, the collection of all rendered PIL images is retained in the object. Otherwise, the renders are regenerated every time the get_renders method is called.
- Return type
bool
- clear_renders()
Clears any PIL images that have been retained in the renderer object.
- clear_text()
Clear any text that has already been extracted for the document
- compare(other: sparclur._renderer.Renderer, page=None, full=False)
Performs a structural similarity comparison between two renders
- Parameters
other (Renderer) – The other Parser and document to compare this Parser and document to.
page (int, List[int], default=None) – Specify whether a single page or a specific collection of pages should be compared. If None, all pages are compared.
full (bool, default=False) – Return an image of the comparison of the two document renders for each page or the specified page.
- Return type
Dict[int, PRCSim] or PRCSim
- compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)
Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.
- Parameters
other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser
page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document
shingle_size (int, default=4) – The size of the shingled n-grams
- Returns
The Jaccard Similarity score
- Return type
float
- property doc
Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.
- Returns
String of the document path or first 15 bytes of the binary
- Return type
str or bytes
- property dpi
Return the dots per inch.
- Return type
int
- static get_name()
Return the SPARCLUR defined name for the parser.
- Returns
Parser name
- Return type
str
- get_renders(page: Optional[Union[int, List[int]]] = None)
Return the renders of the object document. If page is None, return the entire rendered document. Otherwise returns the specified page only.
- Parameters
page (int, List[int], or None) – zero-indexed page or list of pages to be rendered. Returns the whole document if None
- Return type
PngImageFile or Dict[int, PngImageFile]
- get_text(page: Optional[int] = None)
Return the extracted text from the document. If page is None, return all text from the document. Otherwise, return the text for the specified page only.
- Parameters
page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
- Return type
str or Dict[int, str]
- get_tokens(page: Optional[int] = None)
Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise, return the tokens for the specified page only.
- Parameters
page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
- Return type
str or Dict[int, str]
- property logs
View any gathered logs.
- Return type
Dict[int, Dict[str, Any]]
- property num_pages
Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.
- Returns
The number of pages in the document
- Return type
int
- property sparclur_hash
The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These are collected and can then be used to compare two documents and calculate a distance measure. This is most relevant in two specific cases: the first is finding evidence of non-determinism in a parser, and the second is quickly comparing differences between parser translations of a document (see the Reforge class of tools).
- Returns
The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
- Return type
SparclurHash
- property validate_renderer
Performs a validity check for this renderer.
- Return type
Dict[str, Any]
- property validity
Return the validity status from each of the parser's relevant tools, plus an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.
- Returns
A dictionary of dictionaries laying out the validity and statuses for the parser tools.
- Return type
Dict[str, Dict[str, Any]]
sparclur.parsers.PDFMiner
- class sparclur.parsers._pdfminer.PDFMiner(doc: str, temp_folders_dir: Optional[str] = None, skip_check: Optional[bool] = None, hash_exclude: Optional[str] = None, timeout: Optional[int] = None, page_delimiter: Optional[str] = None, detect_vertical: Optional[bool] = None, all_texts: Optional[bool] = None, stream_output: Optional[str] = None, suppress_warnings: Optional[bool] = None)
Bases:
sparclur._text_extractor.TextExtractor, sparclur._metadata_extractor.MetadataExtractor
PDFMiner Text Extraction (https://pdfminersix.readthedocs.io/en/latest/)
- clear_text()
Clear any text that has already been extracted for the document
- compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)
Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.
- Parameters
other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser
page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document
shingle_size (int, default=4) – The size of the shingled n-grams
- Returns
The Jaccard Similarity score
- Return type
float
- property doc
Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.
- Returns
String of the document path or first 15 bytes of the binary
- Return type
str or bytes
- static get_name()
Return the SPARCLUR defined name for the parser.
- Returns
Parser name
- Return type
str
- get_text(page: Optional[int] = None)
Return the extracted text from the document. If page is None, return all text from the document. Otherwise, return the text for the specified page only.
- Parameters
page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
- Return type
str or Dict[int, str]
- get_tokens(page: Optional[int] = None)
Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise, return the tokens for the specified page only.
- Parameters
page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
- Return type
str or Dict[int, str]
- property metadata: Dict[str, Any]
Return the dictionary of metadata.
- Return type
Dict[str, Any]
- property num_pages
Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.
- Returns
The number of pages in the document
- Return type
int
- property sparclur_hash
The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These are collected and can then be used to compare two documents and calculate a distance measure. This is most relevant in two specific cases: the first is finding evidence of non-determinism in a parser, and the second is quickly comparing differences between parser translations of a document (see the Reforge class of tools).
- Returns
The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
- Return type
SparclurHash
- property validate_metadata: Dict[str, Any]
Performs a validity check for this metadata extractor.
- Return type
Dict[str, Any]
- property validate_text: Dict[str, Any]
Performs a validity check for this text extractor.
- Return type
Dict[str, Any]
- property validity
Return the validity status from each of the parser's relevant tools, plus an overall validity for the document. If any tool reports a warning or error, the overall status reflects it; the overall status is valid only when every tool marks the document as valid.
- Returns
A dictionary of dictionaries laying out the validity and statuses for the parser tools.
- Return type
Dict[str, Dict[str, Any]]
sparclur.parsers.Poppler
- class sparclur.parsers._poppler.Poppler(doc: str, skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, page_hashes: Optional[Union[int, Tuple[Any]]] = None, validate_hash: bool = False, trace: Optional[str] = None, binary_path: Optional[str] = None, temp_folders_dir: Optional[str] = None, page_delimiter: Optional[str] = None, maintain_layout: Optional[bool] = None, dpi: Optional[int] = None, size: Optional[Tuple[int]] = None, cache_renders: Optional[bool] = None, timeout: Optional[int] = None, ocr: Optional[bool] = None)
Bases:
sparclur._tracer.Tracer, sparclur._hybrid.Hybrid, sparclur._font_extractor.FontExtractor, sparclur._image_data_extractor.ImageDataExtractor, sparclur._reforge.Reforger
Poppler wrapper for pdftoppm, pdftocairo, and pdftotext
- property caching
Returns the caching setting for the renderer.
If caching is set to true, the collection of all rendered PIL images is retained in the object. Otherwise, the renders are regenerated every time the get_renders method is called.
- Return type
bool
- property cleaned
Return a normalized collection of the warnings and errors with occurrence counts.
- Returns
A dictionary with each normalized message as the key and the occurrence count as the value
- Return type
Dict[str, int]
- clear_renders()
Clears any PIL images that have been retained in the renderer object.
- clear_text()
Clear any text that has already been extracted for the document
- compare(other: sparclur._renderer.Renderer, page=None, full=False)
Performs a structural similarity comparison between two renders
- Parameters
other (Renderer) – The other Parser and document to compare this Parser and document to.
page (int, List[int], default=None) – Specify whether a single page or a specific collection of pages should be compared. If None, all pages are compared.
full (bool, default=False) – Return an image of the comparison of the two document renders for each page or the specified page.
- Return type
Dict[int, PRCSim] or PRCSim
- compare_ocr(page=None, shingle_size=4)
Method that compares the OCR result to the built-in text extraction.
- Parameters
page (int) – Indicates which page the comparison should be run over. If ‘None’, all pages are compared.
shingle_size (int, default=4) – The size of the token shingles used in the Jaccard similarity comparison between the OCR and the text extraction.
- Returns
The Jaccard similarity between the OCR and the text extraction (for the specified shingle size).
- Return type
float
- compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)
Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.
- Parameters
other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser
page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document
shingle_size (int, default=4) – The size of the shingled n-grams
- Returns
The Jaccard Similarity score
- Return type
float
- property doc
Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.
- Returns
String of the document path or first 15 bytes of the binary
- Return type
str or bytes
- property dpi
Return the dots per inch.
- Return type
int
- property fonts
Extracts the detected fonts from the PDF file.
- Return type
Dict[str, Any]
- static get_name()
Return the SPARCLUR defined name for the parser.
- Returns
Parser name
- Return type
str
- get_renders(page: Optional[Union[int, List[int]]] = None)
Return the renders of the object document. If page is None, return the entire rendered document. Otherwise returns the specified page only.
- Parameters
page (int, List[int], or None) – zero-indexed page or list of pages to be rendered. Returns the whole document if None
- Return type
PngImageFile or Dict[int, PngImageFile]
- get_text(page: Optional[int] = None)
Return the extracted text from the document. If page is None, return all text from the document. Otherwise, return the text for the specified page only.
- Parameters
page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
- Return type
str or Dict[int, str]
- get_tokens(page: Optional[int] = None)
Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise, return the tokens for the specified page only.
- Parameters
page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
- Return type
str or Dict[int, str]
- property logs
View any gathered logs.
- Return type
Dict[int, Dict[str, Any]]
- property messages
Return the errors and warnings for the document passed into the Parser instance.
- Returns
The list of all raw messages from the parser over the given document
- Return type
List[str]
- property non_embedded_fonts
Determine whether or not there are non-embedded fonts in the PDF. Returns True if there are missing fonts.
- Return type
bool
- property num_pages
Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.
- Returns
The number of pages in the document
- Return type
int
- property reforge
The resulting reforged document.
- Return type
bytes
- save_reforge(save_path: str)
Saves the reforged document to the specified file location.
- Parameters
save_path (str) – The file name and location to save the document.
- property sparclur_hash
The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These are collected and then can be used to compare two documents and a distance measure is calculated. This is most relevant in 2 specific cases: the first is trying to find evidence of non-determinism in a parser and the second is to quickly compare differences between parser translations of a document (See the Reforge class of tools).
- Returns
The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
- Return type
SparclurHash
- property validate_fonts
Checks whether or not fonts can be successfully extracted from a document. Any issues or errors will result in a ‘Rejected’ classification.
- Returns
A dictionary containing a boolean for validity, a classification label for validity, and relevant info for the classification
- Return type
Dict[str, str]
- property validate_image_data
Checks whether or not image data can be successfully extracted from a document. Any issues or errors will result in a ‘Rejected’ classification.
- Returns
A dictionary containing a boolean for validity, a classification label for validity, and relevant info for the classification
- Return type
Dict[str, str]
- property validate_renderer: Dict[str, Any]
Performs a validity check for this renderer.
- Return type
Dict[str, Any]
- property validate_text: Dict[str, Any]
Performs a validity check for this text extractor.
- Return type
Dict[str, Any]
- property validate_tracer: Dict[str, Any]
Performs a validity check for this tracer.
- Return type
Dict[str, Any]
- property validity
Returns the validity statuses from each of the relevant tools of the parser and an overall validity for the document. If any of the tools reports a warning or error, the overall status reflects that; otherwise, all of the tools must mark the document as valid for the overall status to be valid.
- Returns
A dictionary of dictionaries laying out the validity and statuses for the parser tools.
- Return type
Dict[str, Dict[str, Any]]
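The aggregation rule described for validity can be sketched as follows. This is only an illustration under stated assumptions: overall_validity is a hypothetical function, and the "valid", "status", and "overall" keys and the "Valid" label are invented for the example rather than taken from SPARCLUR.

```python
from typing import Any, Dict


def overall_validity(tool_results: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Hypothetical aggregation: overall is valid only if every tool is valid."""
    all_valid = all(result["valid"] for result in tool_results.values())
    # Collect the distinct statuses of the tools that did not mark the document valid.
    problems = sorted({r["status"] for r in tool_results.values() if not r["valid"]})
    combined = dict(tool_results)
    combined["overall"] = {
        "valid": all_valid,
        "status": "Valid" if all_valid else ", ".join(problems),
    }
    return combined


# Illustrative per-tool results; the key names are assumptions.
results = {
    "Renderer": {"valid": True, "status": "Valid"},
    "Text Extraction": {"valid": False, "status": "Rejected"},
}
report = overall_validity(results)
```

A single failing tool is enough to make the overall entry invalid, which matches the rule that every tool must accept the document.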
sparclur.parsers.QPDF
- class sparclur.parsers._qpdf.QPDF(doc: str, temp_folders_dir: Optional[str] = None, skip_check: Optional[bool] = None, hash_exclude: Optional[str] = None, binary_path: Optional[str] = None, timeout: Optional[int] = None)
Bases:
sparclur._tracer.Tracer, sparclur._metadata_extractor.MetadataExtractor
QPDF tracer
- property cleaned
Return a normalized collection of the warnings and errors with occurrence counts.
- Returns
A dictionary with each normalized message as the key and the occurrence count as the value
- Return type
Dict[str, int]
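To make the shape of the cleaned result concrete, here is a minimal sketch of normalizing messages and counting occurrences. The normalization rule (masking digit runs) and the clean_messages name are assumptions made for illustration, and the sample messages are made-up strings, not real QPDF output.

```python
import re
from collections import Counter
from typing import Dict, List


def clean_messages(messages: List[str]) -> Dict[str, int]:
    """Hypothetical normalization: mask digit runs, then count each message."""
    normalized = [re.sub(r"\d+", "<num>", message.strip()) for message in messages]
    return dict(Counter(normalized))


# Made-up messages for illustration only; not actual parser output.
raw = [
    "warning: object 12 is malformed",
    "warning: object 47 is malformed",
    "error: missing trailer",
]
counts = clean_messages(raw)
```

Messages that differ only in document-specific details collapse into one key, so the counts summarize how often each kind of problem occurs.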
- property doc
Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.
- Returns
String of the document path or first 15 bytes of the binary
- Return type
str or bytes
- static get_name()
Return the SPARCLUR defined name for the parser.
- Returns
Parser name
- Return type
str
- property messages
Return the error and warnings for the document passed into the Parser instance.
- Returns
The list of all raw messages from the parser over the given document
- Return type
List[str]
- property metadata: Dict[str, Any]
Return the dictionary of metadata.
- Return type
Dict[str, Any]
- property num_pages
Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.
- Returns
The number of pages in the document
- Return type
int
- property sparclur_hash
The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These are collected and then can be used to compare two documents and a distance measure is calculated. This is most relevant in 2 specific cases: the first is trying to find evidence of non-determinism in a parser and the second is to quickly compare differences between parser translations of a document (See the Reforge class of tools).
- Returns
The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
- Return type
SparclurHash
- property validate_metadata: Dict[str, Any]
Performs a validity check for this metadata extractor.
- Return type
Dict[str, Any]
- property validate_tracer: Dict[str, Any]
Performs a validity check for this tracer.
- Return type
Dict[str, Any]
- property validity
Returns the validity statuses from each of the relevant tools of the parser and an overall validity for the document. If any of the tools reports a warning or error, the overall status reflects that; otherwise, all of the tools must mark the document as valid for the overall status to be valid.
- Returns
A dictionary of dictionaries laying out the validity and statuses for the parser tools.
- Return type
Dict[str, Dict[str, Any]]
sparclur.parsers.XPDF
- class sparclur.parsers._xpdf.XPDF(doc: Union[str, bytes], skip_check: Optional[bool] = None, hash_exclude: Optional[Union[str, List[str]]] = None, page_hashes: Optional[Union[int, Tuple[Any]]] = None, validate_hash: bool = False, binary_path: Optional[str] = None, temp_folders_dir: Optional[str] = None, page_delimiter: Optional[str] = None, maintain_layout: Optional[bool] = None, dpi: Optional[int] = None, size: Optional[Union[Tuple[int], int]] = None, cache_renders: Optional[bool] = None, timeout: Optional[int] = None, ocr: Optional[bool] = None)
Bases:
sparclur._tracer.Tracer, sparclur._hybrid.Hybrid, sparclur._font_extractor.FontExtractor
XPDF wrapper for pdftoppm and pdftotext
- property caching
Returns the caching setting for the renderer.
If caching is set to true, the collection of all rendered PIL images is retained in the object. Otherwise, the renders will be regenerated every time the get_renders method is called.
- Return type
bool
- property cleaned
Return a normalized collection of the warnings and errors with occurrence counts.
- Returns
A dictionary with each normalized message as the key and the occurrence count as the value
- Return type
Dict[str, int]
- clear_renders()
Clears any PIL images that have been retained in the renderer object.
- clear_text()
Clear any text that has already been extracted for the document
- compare(other: sparclur._renderer.Renderer, page=None, full=False)
Performs a structural similarity comparison between two renders
- Parameters
other (Renderer) – The other Parser and document to compare this Parser and document to.
page (int, List[int], default=None) – Specify whether a single page or a specific collection of pages should be compared. If ‘None’, all pages are compared.
full (bool, default=False) – Return an image of the comparison of the two document renders for each page or the specified page.
- Return type
Dict[int, PRCSim] or PRCSim
- compare_ocr(page=None, shingle_size=4)
Method that compares the OCR result to the built-in text extraction.
- Parameters
page (int) – Indicates which page the comparison should be run over. If ‘None’, all pages are compared.
shingle_size (int, default=4) – The size of the token shingles used in the Jaccard similarity comparison between the OCR and the text extraction.
- Returns
The Jaccard similarity between the OCR and the text extraction (for the specified shingle size).
- Return type
float
- compare_text(other: sparclur._text_compare.TextCompare, page=None, shingle_size=4)
Shingles the parsed tokens into the specified n-grams and then compares the two token sets and calculates the Jaccard similarity.
- Parameters
other (TextCompare) – The Text Extraction, Renderer, or Hybrid parser to compare to this parser
page (int) – The 0-indexed page to compare. If None, use the tokens from the entire document
shingle_size (int, default=4) – The size of the shingled n-grams
- Returns
The Jaccard Similarity score
- Return type
float
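The shingling and Jaccard computation described for compare_text can be sketched in a few lines. This is a standalone illustration, not SPARCLUR's code: the shingle and jaccard helper names are hypothetical, and the handling of short or empty token lists is an assumption.

```python
from typing import List, Set, Tuple


def shingle(tokens: List[str], size: int = 4) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous n-grams (shingles) of the given size."""
    if len(tokens) < size:
        # Assumed behavior: too few tokens for a full shingle, so use the
        # whole sequence as a single shingle (or nothing if it is empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


left = "the quick brown fox jumps over the lazy dog".split()
right = "the quick brown fox leaps over the lazy dog".split()
score = jaccard(shingle(left), shingle(right))  # 2 shared of 10 distinct shingles
```

With a shingle size of 4, a single changed token affects every shingle that overlaps it, so the two sentences above share only their first and last shingles and score 0.2. A smaller shingle_size is more forgiving of isolated differences.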
- property doc
Return the path to the document that is being run through the parser instance or the first 15 bytes if a binary was passed to the parser.
- Returns
String of the document path or first 15 bytes of the binary
- Return type
str or bytes
- property dpi
Return dots per inch.
- Return type
int
- property fonts
Extracts the detected fonts from the PDF file.
- Return type
Dict[str, Any]
- static get_name()
Return the SPARCLUR defined name for the parser.
- Returns
Parser name
- Return type
str
- get_renders(page: Optional[Union[int, List[int]]] = None)
Return the renders of the object document. If page is None, return the entire rendered document. Otherwise returns the specified page only.
- Parameters
page (int, List[int], or None) – zero-indexed page or list of pages to be rendered. Returns the whole document if None
- Return type
PngImageFile or Dict[int, PngImageFile]
- get_text(page: Optional[int] = None)
Return the extracted text from the document. If page is None, return all text from the document. Otherwise returns the text for the specified page only.
- Parameters
page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
- Return type
str or Dict[int, str]
- get_tokens(page: Optional[int] = None)
Return the parsed text tokens from the document. If page is None, return all token sets from the document. Otherwise returns the tokens for the specified page only.
- Parameters
page (int or None) – zero-indexed page to extract text from. Returns the whole document if None
- Return type
str or Dict[int, str]
- property logs
View any gathered logs.
- Return type
Dict[int, Dict[str, Any]]
- property messages
Return the error and warnings for the document passed into the Parser instance.
- Returns
The list of all raw messages from the parser over the given document
- Return type
List[str]
- property non_embedded_fonts
Determine whether or not there are non-embedded fonts in the PDF. Returns True if there are missing fonts.
- Return type
bool
- property num_pages
Determine the number of pages in the PDF according to the parser. If the parser does not support page number extraction (e.g. Arlington DOM Checker) this returns None. If the parser fails to load and determine the number of pages, 0 is returned.
- Returns
The number of pages in the document
- Return type
int
- property sparclur_hash
The SPARCLUR hash attempts to distill the information from the different parser tools: image hashes for the renders and sets of shingled murmur hashes for the text extraction, metadata, trace messages, and fonts. These are collected and then can be used to compare two documents and a distance measure is calculated. This is most relevant in 2 specific cases: the first is trying to find evidence of non-determinism in a parser and the second is to quickly compare differences between parser translations of a document (See the Reforge class of tools).
- Returns
The class that holds the SPARCLUR hashes for each tool and provides an API for comparing two hashes.
- Return type
SparclurHash
- property validate_fonts
Checks whether or not fonts can be successfully extracted from a document. Any issues or errors will result in a ‘Rejected’ classification.
- Returns
A dictionary containing a boolean for validity, a classification label for validity, and relevant info for the classification
- Return type
Dict[str, str]
- property validate_renderer: Dict[str, Any]
Performs a validity check for this renderer.
- Return type
Dict[str, Any]
- property validate_text: Dict[str, Any]
Performs a validity check for this text extractor.
- Return type
Dict[str, Any]
- property validate_tracer: Dict[str, Any]
Performs a validity check for this tracer.
- Return type
Dict[str, Any]
- property validity
Returns the validity statuses from each of the relevant tools of the parser and an overall validity for the document. If any of the tools reports a warning or error, the overall status reflects that; otherwise, all of the tools must mark the document as valid for the overall status to be valid.
- Returns
A dictionary of dictionaries laying out the validity and statuses for the parser tools.
- Return type
Dict[str, Dict[str, Any]]