sparclur package

Subpackages

Astrotruther

class sparclur._astrotruther.Astrotruther(file_col: str = 0, label_col: str = 1, base_path: typing.Optional[str] = None, label_transform: typing.Optional[typing.Callable[[str], str]] = None, parsers: typing.List[sparclur._parser.Parser] = [<class 'sparclur.parsers._pdfminer.PDFMiner'>, <class 'sparclur.parsers._ghostscript.Ghostscript'>, <class 'sparclur.parsers._mupdf.MuPDF'>, <class 'sparclur.parsers._poppler.Poppler'>, <class 'sparclur.parsers._xpdf.XPDF'>, <class 'sparclur.parsers._qpdf.QPDF'>, <class 'sparclur.parsers._arlington.Arlington'>, <class 'sparclur.parsers._pdfcpu.PDFCPU'>, <class 'sparclur.parsers._pdfium.PDFium'>], parser_args: typing.Dict[str, typing.Dict[str, typing.Any]] = {}, exclude: typing.Optional[str] = None, overall_timeout: typing.Optional[int] = None, classifier: str = 'decTree', classifier_args: typing.Dict[str, typing.Any] = {}, k_folds: int = 3, max_workers: int = 1, timeout: typing.Optional[int] = None, progress_bar: bool = True)

Trains a classifier from the error and warning messages of the specified SPARCLUR Tracers and a given set of labels for a collection of PDFs.

fit(docs, doc_loading_args={}, save_training_data=None)

Fits the model to the training data.

classmethod load(path)

Load a generated model

Parameters

path (str) – The file path of a saved astrotruth model

predict(docs, file_col=None, doc_loading_args={}, unseen_ignore=None, unseen_message_default='Not enough info', prediction_column='astrotruth', save_eval_data=None)

Uses the generated model to predict the validity of the given documents.

Returns

DataFrame of the classification results

Return type

DataFrame

save(path)

Save the generated model

Parameters

path (str) – The file path to save the model in
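
The fit, save, load, and predict methods above compose into a train-then-reuse workflow. The sketch below is illustrative only: `train_and_reload` is a hypothetical helper, it relies solely on the method names listed in this section, and because the expected format of `docs` is not specified here, the documents are passed through unchanged.

```python
def train_and_reload(truther, train_docs, eval_docs, model_path):
    """Fit a model, persist it, reload it, and predict with the reloaded copy.

    `truther` is assumed to be an Astrotruther instance, or any object that
    exposes the same fit/save/load/predict methods documented above.
    """
    truther.fit(train_docs)
    truther.save(model_path)
    # load is documented as a classmethod, so call it on the class itself.
    reloaded = type(truther).load(model_path)
    return reloaded.predict(eval_docs)
```

Reloading through the class rather than reusing the fitted instance confirms that the saved file alone is enough to reproduce predictions.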

DetectChaos

class sparclur._detect_chaos.DetectChaos(parsers: str, num_comparisons: int = 5, parser_args: Dict[str, Dict[str, Any]] = {}, parser_timeout: int = 120, overall_timeout: int = 600, num_workers: int = 1, progress_bar: bool = True)

Looks for evidence of non-determinism in PDF parsers.

run(files, recurse=False, base_path=None, save_path=None)

Run the non-deterministic detection.

Parameters
  • files (str or List[str]) – Path to a directory or a text file of PDF paths or a List of paths

  • recurse (bool) – Whether or not the directory passed into the files parameter should be recursively searched for PDFs

  • base_path (str) – A base directory that should be prepended to each of the paths passed into files
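
The three accepted shapes of the files parameter can be pictured with a small path resolver. This is a sketch under assumptions, not DetectChaos's own code: `resolve_pdf_paths` is a hypothetical name, matching on a `.pdf` suffix is an assumption, and so is applying base_path only to listed paths rather than to directory results.

```python
import os
from pathlib import Path


def resolve_pdf_paths(files, recurse=False, base_path=None):
    """Turn a directory, a text file of paths, or a list into a list of paths."""
    if isinstance(files, str) and os.path.isdir(files):
        # Directory input: search it, descending into subfolders if requested.
        pattern = "**/*.pdf" if recurse else "*.pdf"
        return sorted(str(p) for p in Path(files).glob(pattern))
    if isinstance(files, str):
        # Text file input: one PDF path per non-empty line.
        with open(files) as handle:
            paths = [line.strip() for line in handle if line.strip()]
    else:
        # List input: use the paths as given.
        paths = list(files)
    if base_path is not None:
        # Assumption: base_path is joined in front of each listed path.
        paths = [os.path.join(base_path, p) for p in paths]
    return paths
```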

FloodLight

RollBack

class sparclur._roll_back.RollBack(doc: Union[str, bytes])

Checks for incremental updates and, if any are present, analyzes the differences between the versions.

compare_renders(parser='Poppler', parser_args={}, num_workers=1, versions=None, progress_bar=True, timeout=120, display_width=10, display_height=10, ncols=5)

Compares the renders between subsequent versions of incremental updates. If the versions are not specified and more than 11 versions are detected, only the latest 11 versions are compared. The rendering comparisons can be run in parallel by setting num_workers greater than 1. This can be a time-consuming process, so take care not to compare too many versions at once. The resulting comparisons are plotted per page for each pair of subsequent versions.

Parameters
  • parser (str, default='Poppler') – The parser to use to generate the renders.

  • parser_args (Dict[str, Any]) – Any additional arguments to pass to the specified parser

  • num_workers (int, default=1) – The number of workers to use if render comparisons are run in parallel. Choose 1 to run serially.

  • versions (List[int], default=None) – The specific versions to compare. If the versions are not contiguous, they will be sorted and comparisons will happen between neighbors in the sorted list.

  • progress_bar (bool, default=True) – Flag that determines if a progress bar for the render comparisons is displayed

  • timeout (int, default=120) – A timeout parameter for the rendering comparison

  • display_width (int) – The width of the resulting plot

  • display_height (int) – The height of the resulting plot

  • ncols (int) – The number of columns in the final subplot of comparisons.

Return type

PyPlot figure
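
The version selection described for compare_renders can be illustrated in a few lines. This is a minimal sketch of the documented behavior, not the library's implementation; `select_versions` and `comparison_pairs` are hypothetical names, and treating "latest" as the highest version indices is an assumption.

```python
from typing import List, Optional, Tuple

MAX_DEFAULT_VERSIONS = 11  # limit stated in the compare_renders description


def select_versions(num_versions: int, versions: Optional[List[int]] = None) -> List[int]:
    """Pick the versions to compare, keeping the latest 11 when none are given."""
    if versions is None:
        first = max(0, num_versions - MAX_DEFAULT_VERSIONS)
        return list(range(first, num_versions))
    # Non-contiguous requests are sorted before neighbors are paired.
    return sorted(versions)


def comparison_pairs(selected: List[int]) -> List[Tuple[int, int]]:
    """Pair each selected version with its neighbor in the sorted list."""
    return list(zip(selected, selected[1:]))
```

For example, requesting versions 7, 2, and 4 would compare version 2 against 4 and version 4 against 7.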

compare_text(parser='Poppler', parser_args={}, display_width=10, display_height=10)

Compares the extracted text tokens between subsequent versions and plots the number of additions and subtractions in stacked bars. The Parser needs to support text extraction.

Parameters
  • parser (str, default='Poppler') – Name of the parser to use for the text extraction

  • parser_args (Dict[str, Any]) – Any specific arguments to pass to the selected parser

  • display_width (int) – The width of the resulting plot

  • display_height (int) – The height of the resulting plot

Return type

PyPlot figure
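
One way to picture the additions and subtractions that compare_text plots is as a multiset difference of tokens between two versions. The sketch below assumes that interpretation; `token_changes` is a hypothetical helper and the library's actual counting may differ.

```python
from collections import Counter
from typing import List, Tuple


def token_changes(old_tokens: List[str], new_tokens: List[str]) -> Tuple[int, int]:
    """Return (additions, subtractions) between two token lists."""
    old_counts = Counter(old_tokens)
    new_counts = Counter(new_tokens)
    # Counter subtraction keeps only positive counts, so each direction
    # captures tokens that appear more often on that side.
    additions = sum((new_counts - old_counts).values())
    subtractions = sum((old_counts - new_counts).values())
    return additions, subtractions
```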

property contains_updates

Whether or not the document has incremental updates.

Return type

bool

get_version(version: int)

Return the specified version. Versions are 0-indexed.

Return type

bytes
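
For intuition about what a version is: a PDF incremental update appends new content to the end of the file, and each revision ends with its own `%%EOF` marker. The sketch below uses that fact as a simple heuristic to recover earlier revisions as byte prefixes. It is an assumption that RollBack works along similar lines; `split_versions` is a hypothetical helper, and a marker embedded inside a stream would mislead it.

```python
from typing import List

EOF_MARKER = b"%%EOF"


def split_versions(data: bytes) -> List[bytes]:
    """Return one byte prefix per %%EOF marker, oldest revision first."""
    versions = []
    start = 0
    while True:
        idx = data.find(EOF_MARKER, start)
        if idx == -1:
            break
        end = idx + len(EOF_MARKER)
        # Each revision is everything from the start of the file up to
        # and including this marker.
        versions.append(data[:end])
        start = end
    return versions
```

Under this heuristic, a document has incremental updates when more than one prefix is returned, and index 0 is the original revision.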

property num_versions

The number of detected updates in the document.

Return type

int

save_version(version: int, save_path: str)

Save the specified incremental update version. Versions are 0-indexed.
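
The properties and save_version above can be combined to write every version of a document to disk. This is a hedged sketch: `export_versions` is a hypothetical helper, the file naming is arbitrary, and it assumes that valid version indices run from 0 to num_versions - 1.

```python
import os
from typing import List


def export_versions(rollback, out_dir: str) -> List[str]:
    """Save each version of a RollBack-like object and return the paths written."""
    if not rollback.contains_updates:
        # Nothing to separate when the document has no incremental updates.
        return []
    saved = []
    # Assumption: versions are numbered 0 .. num_versions - 1.
    for version in range(rollback.num_versions):
        save_path = os.path.join(out_dir, f"version_{version}.pdf")
        rollback.save_version(version, save_path)
        saved.append(save_path)
    return saved
```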

Spotlight

class sparclur._spotlight.Spotlight(num_workers: int = 1, temp_folders_dir: Optional[str] = None, dpi: int = 72, page_hashes: Optional[Union[int, Tuple]] = None, parsers: Optional[List[str]] = None, parser_args: Dict[str, Dict[str, Any]] = {}, timeout: Optional[int] = None, progress_bar: bool = True)

Runs all of the selected parsers over the document and also reforges the document with all available reforgers. The results are hashed and stored for analysis.

class sparclur._spotlight.SpotlightResult(parser, version, validity, sparclur_hash)

Results from running Spotlight over a document. Provides the following methods:

  • validity_report – Provide a table of validity results for the parsers over the original document and the reforges

  • overall_validity – The overall validity classification for the document given the parsers

  • recoverable – Returns whether or not the document is unambiguously recoverable

  • sim_heatmap – A heatmap of the similarity scores for each parser over given pairs of documents

  • sim_sunburst – A sunburst of the similarity scores. Provides an interactive way to engage with the heatmap.

overall_validity(version='original', excluded_parsers=None)

Return a classification for the overall validity of the specified document

Parameters
  • version ({original, MuPDF, Poppler, Ghostscript}) – The document version to classify

  • excluded_parsers (str or List[str]) – Any parsers to exclude in the determination of the overall validity

Returns

The validity label

Return type

str

recoverable(excluded_parsers=None, sim_threshold=0.9)

Uses the Spotlight results to determine whether the document can be unambiguously recovered. The criterion is that each reforge parses successfully with each parser and that the Sparclur Hash similarity between the reforged documents meets the given threshold.

Parameters
  • excluded_parsers (str or List[str]) – List of parsers to omit from the recoverable criteria

  • sim_threshold (float) – The threshold to meet for the reforge comparisons. If a comparison falls below this threshold the recovery is said to be ambiguous

Returns

The overall report for whether the original document is unambiguously recoverable or not

Return type

str
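
The recoverability criterion can be expressed as two checks. The sketch below is an illustration only: the input dictionaries are hypothetical data structures rather than Spotlight's internal ones, and it returns a bool where the real method returns a str report.

```python
from typing import Dict, Tuple


def is_unambiguously_recoverable(
    parses_ok: Dict[str, Dict[str, bool]],
    reforge_sims: Dict[Tuple[str, str], float],
    sim_threshold: float = 0.9,
) -> bool:
    """Apply the two documented conditions to precomputed results.

    parses_ok maps reforge name -> parser name -> whether parsing succeeded.
    reforge_sims maps a pair of reforge names -> similarity score.
    """
    # Condition 1: every reforge parses successfully with every parser.
    all_parse = all(
        ok for per_parser in parses_ok.values() for ok in per_parser.values()
    )
    # Condition 2: no reforge-to-reforge comparison falls below the threshold.
    all_similar = all(score >= sim_threshold for score in reforge_sims.values())
    return all_parse and all_similar
```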

sim_heatmap(parsers: Optional[str] = None, report: str = 'sim', annotated: bool = True, detailed: bool = False, compare_orig: bool = True, height: int = 10, width: int = 10, save_display=None)

A heatmap of the similarity scores for each parser over given pairs of documents

Parameters
  • parsers (str or List[str]) – The parsers to display the similarity scores for

  • report ({'sim', 'Renderer sim', 'Text Extractor sim', 'Tracer sim'}) – The specific similarity score to run. Choose one of sim, Renderer sim, Text Extractor sim or Tracer sim

  • annotated (bool, default=True) – Flag for whether or not similarity scores should be displayed on the heatmap

  • detailed (bool, default=False) – Flag for displaying all similarity scores for each parser

  • compare_orig (bool) – Whether reforge<->original comparison scores should be displayed in the heatmap. These comparisons don’t impact the recoverability of the file and would only provide insight into a differential between the original and the reforge

  • height (int) – Height of the figure

  • width (int) – Width of the figure

  • save_display (str) – If not None, save a png of the figure to the file path specified by save_display

Return type

Seaborn heatmap

sim_sunburst(compare_orig: bool = True, full: bool = False, color: str = 'RdBu', color_range: List[float] = [0.6, 1])

Create an interactive sunburst for exploring the similarities between the documents for the Spotlight parsers

Parameters
  • compare_orig (bool) – Whether reforge<->original comparison scores should be displayed in the sunburst. These comparisons don’t impact the recoverability of the file and would only provide insight into a differential between the original and the reforge

  • full (bool) – Create the full sunburst that has all possible combinations. Turning this off removes duplicate comparisons to reduce the overall number of slices.

  • color (str) – The color range to use. See https://plotly.com/python/builtin-colorscales/

  • color_range (List[float]) – The range to base the color on. Format is [min, max]

Return type

Plotly Sunburst

validity_report(report='overall', excluded_parsers=None)

Return a table of the validity classifications for each parser over each document.

Parameters
  • report ({overall, Renderer, Text Extraction, Font Extraction, Metadata Extraction, Tracer}) – Overall takes into account all of the tools of the given parser, or a specific tool can be specified

  • excluded_parsers (str or List[str]) – Parsers to exclude from the report

Returns

A DataFrame of the resulting labels

Return type

DataFrame
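
The reporting methods of SpotlightResult can be gathered into a single summary. The helper below is a sketch: `summarize_spotlight` is a hypothetical name, it uses only the methods and defaults documented in this section, and how a SpotlightResult is obtained from Spotlight is not covered here, so the result object is taken as an argument.

```python
from typing import Any, Dict


def summarize_spotlight(result, sim_threshold: float = 0.9) -> Dict[str, Any]:
    """Collect the main validity and recoverability outputs of a result."""
    return {
        # DataFrame of per-parser validity labels over each document.
        "validity_table": result.validity_report(report="overall"),
        # Validity label for the original document.
        "overall_validity": result.overall_validity(version="original"),
        # Report on whether the document is unambiguously recoverable.
        "recoverable": result.recoverable(sim_threshold=sim_threshold),
    }
```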