sparclur package

Subpackages

Astrotruther

class sparclur._astrotruther.Astrotruther(file_col: str = 0, label_col: str = 1, base_path: typing.Optional[str] = None, label_transform: typing.Optional[typing.Callable[[str], str]] = None, parsers: typing.List[sparclur._parser.Parser] = [<class 'sparclur.parsers._pdfminer.PDFMiner'>, <class 'sparclur.parsers._ghostscript.Ghostscript'>, <class 'sparclur.parsers._mupdf.MuPDF'>, <class 'sparclur.parsers._poppler.Poppler'>, <class 'sparclur.parsers._xpdf.XPDF'>, <class 'sparclur.parsers._qpdf.QPDF'>, <class 'sparclur.parsers._arlington.Arlington'>, <class 'sparclur.parsers._pdfcpu.PDFCPU'>, <class 'sparclur.parsers._pdfium.PDFium'>], parser_args: typing.Dict[str, typing.Dict[str, typing.Any]] = {}, exclude: typing.Optional[str] = None, overall_timeout: typing.Optional[int] = None, classifier: str = 'decTree', classifier_args: typing.Dict[str, typing.Any] = {}, k_folds: int = 3, max_workers: int = 1, timeout: typing.Optional[int] = None, progress_bar: bool = True)

Trains a classifier from the error and warning messages of the specified SPARCLUR Tracers and a given set of labels for a collection of PDFs.
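As a rough illustration of how parser messages could become classifier input, the sketch below turns per-document message sets into a binary presence matrix. It is not taken from the SPARCLUR source: the function name `build_feature_matrix` and the example messages are hypothetical, and the real feature construction may differ.

```python
from typing import Dict, List, Set, Tuple


def build_feature_matrix(
    doc_messages: Dict[str, Set[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Turn per-document message sets into a binary presence matrix."""
    # Sorted vocabulary of every distinct message seen across the documents.
    vocabulary = sorted({msg for msgs in doc_messages.values() for msg in msgs})
    doc_names = sorted(doc_messages)
    # One row per document, one column per message: 1 if the message occurred.
    matrix = [
        [1 if msg in doc_messages[name] else 0 for msg in vocabulary]
        for name in doc_names
    ]
    return doc_names, vocabulary, matrix


# Hypothetical messages keyed by file name; real tracer output will differ.
messages = {
    "a.pdf": {"MuPDF: xref error", "Poppler: syntax warning"},
    "b.pdf": set(),
    "c.pdf": {"MuPDF: xref error"},
}
names, vocab, X = build_feature_matrix(messages)
```

Rows of `X` paired with the document labels could then be passed to any standard classifier, such as a decision tree.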

fit(docs, doc_loading_args={}, save_training_data=None)

Fits the model to the training data.

classmethod load(path)

Load a generated model

Parameters: path (str) – The file path of a saved astrotruth model

predict(docs, file_col=None, doc_loading_args={}, unseen_ignore=None, unseen_message_default='Not enough info', prediction_column='astrotruth', save_eval_data=None)

Uses the generated model to predict the validity of the given documents.

Returns: DataFrame of the classification results
Return type: DataFrame

save(path)

Save the generated model

Parameters: path (str) – The file path to save the model in

DetectChaos

class sparclur._detect_chaos.DetectChaos(parsers: str, num_comparisons: int = 5, parser_args: Dict[str, Dict[str, Any]] = {}, parser_timeout: int = 120, overall_timeout: int = 600, num_workers: int = 1, progress_bar: bool = True)

Looks for evidence of non-determinism in PDF parsers.
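The general idea of checking for non-determinism can be sketched as running the same operation several times and comparing hashes of the outputs. This is a minimal sketch under that assumption, not the DetectChaos implementation; `is_deterministic` and the two example callables are hypothetical stand-ins for a parser invocation.

```python
import hashlib
import itertools
from typing import Callable


def is_deterministic(produce_output: Callable[[], bytes], num_comparisons: int = 5) -> bool:
    """Return True if every run yields byte-identical output."""
    digests = {
        hashlib.sha256(produce_output()).hexdigest()
        for _ in range(num_comparisons)
    }
    # A single distinct digest means all runs agreed.
    return len(digests) == 1


def stable() -> bytes:
    # Stand-in for a parser that always produces the same output.
    return b"extracted page text"


_run_counter = itertools.count()


def unstable() -> bytes:
    # Stand-in for a parser whose output changes between runs.
    return f"extracted page text, run {next(_run_counter)}".encode()


stable_result = is_deterministic(stable)
unstable_result = is_deterministic(unstable)
```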

run(files, recurse=False, base_path=None, save_path=None)

Run the non-deterministic detection.

Parameters

files (str or List[str]) – Path to a directory or a text file of PDF paths or a List of paths
recurse (bool) – Whether or not the directory passed into the files parameter should be recursively searched for PDFs
base_path (str) – A base directory that should be prepended to each of the paths passed into files
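One way to picture the three accepted forms of files is a small resolver like the one below. It is only a sketch: `resolve_pdf_paths` is a hypothetical helper, the `*.pdf` glob pattern is an assumption, and applying base_path only to listed paths (not to a directory) is a guess about the intended behavior.

```python
import tempfile
from pathlib import Path
from typing import List, Optional, Sequence, Union


def resolve_pdf_paths(
    files: Union[str, Path, Sequence[str]],
    recurse: bool = False,
    base_path: Optional[Union[str, Path]] = None,
) -> List[Path]:
    """Resolve a directory, a text file of paths, or a list of paths."""
    if isinstance(files, (list, tuple)):
        candidates = [Path(p) for p in files]
    else:
        source = Path(files)
        if source.is_dir():
            # Search the directory, descending into subfolders only if asked.
            pattern = "**/*.pdf" if recurse else "*.pdf"
            return sorted(source.glob(pattern))
        # Otherwise treat the input as a text file with one path per line.
        candidates = [
            Path(line.strip())
            for line in source.read_text().splitlines()
            if line.strip()
        ]
    if base_path is not None:
        candidates = [Path(base_path) / p for p in candidates]
    return candidates


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.pdf").write_bytes(b"%PDF-1.7")
    (root / "nested" / "b.pdf").write_bytes(b"%PDF-1.7")
    top_level = [p.name for p in resolve_pdf_paths(root)]
    everything = [p.name for p in resolve_pdf_paths(root, recurse=True)]
    joined_exists = resolve_pdf_paths(["a.pdf"], base_path=root)[0].exists()
```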

FloodLight

RollBack

class sparclur._roll_back.RollBack(doc: Union[str, bytes])

Checks for incremental updates and if present analyzes the differences between the versions.
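In the PDF format, an incremental update appends a new body, cross-reference section and trailer to the end of the file, and each such section ends with its own `%%EOF` marker. A simple heuristic for spotting updates is therefore to look for those markers. The sketch below illustrates that heuristic only; it is not necessarily how RollBack detects versions, the helper names are hypothetical, and the count can be wrong when the marker bytes occur inside stream data.

```python
import re
from typing import List

EOF_MARKER = re.compile(rb"%%EOF")


def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count %%EOF markers as a rough proxy for the number of revisions."""
    return len(EOF_MARKER.findall(pdf_bytes))


def split_revisions(pdf_bytes: bytes) -> List[bytes]:
    """Return one byte prefix per %%EOF marker, oldest revision first."""
    return [pdf_bytes[: match.end()] for match in EOF_MARKER.finditer(pdf_bytes)]


# Synthetic, heavily simplified data: an original body plus one appended update.
original = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"
updated = original + b"2 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"

marker_count = count_eof_markers(updated)
revisions = split_revisions(updated)
```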

compare_renders(parser='Poppler', parser_args={}, num_workers=1, versions=None, progress_bar=True, timeout=120, display_width=10, display_height=10, ncols=5)

Compares the renders between consecutive versions of incremental updates. If the versions are not specified and more than 11 versions are detected, only the latest 11 versions are compared. The rendering comparisons can be run in parallel by setting num_workers greater than 1. This can be a time-consuming process, so take care not to compare too many versions at one time. The resulting comparisons are plotted per page for each pair of consecutive versions.

Parameters

parser (str, default='Poppler') – The parser to use to generate the renders.
parser_args (Dict[str, Any]) – Any additional arguments to pass to the specified parser
num_workers (int, default=1) – The number of workers to use if render comparisons are run in parallel. Choose 1 to run serially.
versions (List[int], default=None) – The specific versions to compare. If the versions are not contiguous, they will be sorted and comparisons will happen between neighbors in the sorted list.
progress_bar (bool, default=True) – Flag that determines if a progress bar for the render comparisons is displayed
timeout (int, default=120) – A timeout parameter for the rendering comparison
display_width (int) – The width of the resulting plot
display_height (int) – The height of the resulting plot
ncols (int) – The number of columns in the final subplot of comparisons.

Return type

PyPlot figure
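The version selection described above (latest 11 by default, neighbors in sorted order otherwise) can be sketched as follows. `comparison_pairs` is a hypothetical helper written for illustration; dropping duplicate version numbers is an assumption not stated in the documentation.

```python
from typing import List, Optional, Tuple


def comparison_pairs(
    num_versions: int,
    versions: Optional[List[int]] = None,
    max_versions: int = 11,
) -> List[Tuple[int, int]]:
    """Return the (older, newer) version pairs that would be compared."""
    if versions is None:
        # Default: keep only the latest `max_versions` versions (0-indexed).
        start = max(0, num_versions - max_versions)
        selected = list(range(start, num_versions))
    else:
        # Non-contiguous requests are sorted, then neighbors are paired.
        selected = sorted(set(versions))
    return list(zip(selected, selected[1:]))


small_doc = comparison_pairs(4)
large_doc = comparison_pairs(20)
explicit = comparison_pairs(10, versions=[7, 2, 4])
```

With 20 detected versions and no explicit selection, this yields 10 comparisons covering versions 9 through 19.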

compare_text(parser='Poppler', parser_args={}, display_width=10, display_height=10)

Compares the extracted text tokens between subsequent versions and plots the number of additions and subtractions in stacked bars. The Parser needs to support text extraction.

Parameters

parser (str, default='Poppler') – Name of the parser to use for the text extraction
parser_args (Dict[str, Any]) – Any specific arguments to pass to the selected parser
display_width (int) – The width of the resulting plot
display_height (int) – The height of the resulting plot

Return type

PyPlot figure
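Counting token additions and subtractions between two versions can be illustrated with multiset differences. This is a sketch under the assumption that tokens are whitespace-separated words; the tokenization and counting used by compare_text may differ, and `token_changes` is a hypothetical name.

```python
from collections import Counter
from typing import List, Tuple


def token_changes(old_text: str, new_text: str) -> Tuple[int, int]:
    """Return (additions, subtractions) between two extracted texts."""
    old_tokens = Counter(old_text.split())
    new_tokens = Counter(new_text.split())
    # Counter subtraction keeps only positive counts, i.e. the surplus tokens.
    additions = sum((new_tokens - old_tokens).values())
    subtractions = sum((old_tokens - new_tokens).values())
    return additions, subtractions


# Hypothetical extracted text for three consecutive versions of a document.
version_texts: List[str] = [
    "total due 100",
    "total due 250 paid",
    "total due 250 paid paid",
]
changes = [
    token_changes(older, newer)
    for older, newer in zip(version_texts, version_texts[1:])
]
```

Each tuple in `changes` corresponds to one stacked bar: the first entry is the number of added tokens and the second the number of removed tokens.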

property contains_updates

Whether or not the document has incremental updates.

Return type: bool

get_version(version: int)

Return the specified version. Versions are 0-indexed.

Return type: bytes

property num_versions

The number of detected updates in the document.

Return type: int

save_version(version: int, save_path: str)

Save the specified incremental update version. Versions are 0-indexed.

Spotlight

class sparclur._spotlight.Spotlight(num_workers: int = 1, temp_folders_dir: Optional[str] = None, dpi: int = 72, page_hashes: Optional[Union[int, Tuple]] = None, parsers: Optional[List[str]] = None, parser_args: Dict[str, Dict[str, Any]] = {}, timeout: Optional[int] = None, progress_bar: bool = True)

Runs all of the selected parsers over the document and also reforges the document with all available reforgers. The results are hashed and stored for analysis.

class sparclur._spotlight.SpotlightResult(parser, version, validity, sparclur_hash)

Results from running Spotlight over a document. Provides the following methods:

validity_report: Provide a table of validity results for the parsers over the original document and the reforges
overall_validity: The overall validity classification for the document given the parsers
recoverable: Returns whether or not the document is unambiguously recoverable
sim_heatmap: A heatmap of the similarity scores for each parser over given pairs of documents
sim_sunburst: A sunburst of the similarity scores. Provides an interactive way to engage with the heatmap.

overall_validity(version='original', excluded_parsers=None)

Return a classification for the overall validity of the specified document

Parameters

version ({original, MuPDF, Poppler, Ghostscript}) – The document version to classify
excluded_parsers (str or List[str]) – Any parsers to exclude in the determination of the overall validity

Returns

The validity label

Return type

str

recoverable(excluded_parsers=None, sim_threshold=0.9)

Uses the Spotlight results to determine if the document can be unambiguously recovered. The criterion is that every reforge parses successfully with every parser and that the Sparclur Hash similarity between the reforged documents is above the given threshold.

Parameters

excluded_parsers (str or List[str]) – List of parsers to omit from the recoverable criteria
sim_threshold (float) – The threshold to meet for the reforge comparisons. If a comparison falls below this threshold the recovery is said to be ambiguous

Returns

The overall report for whether the original document is unambiguously recoverable or not

Return type

str
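The recoverability criterion can be written out as a small predicate. The sketch below is an assumption-laden illustration, not the library's code: the input dictionaries, their key layout and the function name are hypothetical, and it returns a bool where the documented method returns a str report.

```python
from typing import Dict, Tuple


def is_unambiguously_recoverable(
    parse_ok: Dict[Tuple[str, str], bool],
    similarity: Dict[Tuple[str, str], float],
    sim_threshold: float = 0.9,
) -> bool:
    """Apply the two-part criterion: all parses succeed, all scores pass."""
    # parse_ok maps (reforge, parser) to whether that parse succeeded.
    if not all(parse_ok.values()):
        return False
    # similarity maps (reforge_a, reforge_b) to a score in [0, 1];
    # any score below the threshold makes the recovery ambiguous.
    return all(score >= sim_threshold for score in similarity.values())


# Hypothetical results for two reforges checked by two parsers.
parses = {
    ("MuPDF", "Poppler"): True,
    ("MuPDF", "XPDF"): True,
    ("Poppler", "Poppler"): True,
    ("Poppler", "XPDF"): True,
}
agreeing = is_unambiguously_recoverable(parses, {("MuPDF", "Poppler"): 0.97})
diverging = is_unambiguously_recoverable(parses, {("MuPDF", "Poppler"): 0.62})
```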

sim_heatmap(parsers: Optional[str] = None, report: str = 'sim', annotated: bool = True, detailed: bool = False, compare_orig: bool = True, height: int = 10, width: int = 10, save_display=None)

A heatmap of the similarity scores for each parser over given pairs of documents

Parameters

parsers (str or List[str]) – The parsers to display the similarity scores for
report ({'sim', 'Renderer sim', 'Text Extractor sim', 'Tracer sim'}) – The specific similarity score to run. Choose one of sim, Renderer sim, Text Extractor sim or Tracer sim
annotated (bool, default=True) – Flag for whether or not similarity scores should be displayed on the heatmap
detailed (bool, default=False) – Flag for displaying all similarity scores for each parser
compare_orig (bool) – Whether reforge<->original comparison scores should be displayed in the heatmap. These comparisons don’t impact the recoverability of the file and only provide insight into the differences between the original and the reforge
height (int) – Height of the figure
width (int) – Width of the figure
save_display (str) – If not None, save a png of the figure to the file path specified by save_display

Return type

Seaborn heatmap

sim_sunburst(compare_orig: bool = True, full: bool = False, color: str = 'RdBu', color_range: List[float] = [0.6, 1])

Create an interactive sunburst for exploring the similarities between the documents for the Spotlight parsers

Parameters

compare_orig (bool) – Whether reforge<->original comparison scores should be displayed in the sunburst. These comparisons don’t impact the recoverability of the file and only provide insight into the differences between the original and the reforge
full (bool) – Create the full sunburst that has all possible combinations. Turning this off removes duplicate comparisons to reduce the overall number of slices.
color (str) – The color range to use. See https://plotly.com/python/builtin-colorscales/
color_range (List[float]) – The range to base the color on. Format is [min, max]

Return type

Plotly Sunburst

validity_report(report='overall', excluded_parsers=None)

Return a table of the validity classifications for each parser over each document.

Parameters

report ({overall, Renderer, Text Extraction, Font Extraction, Metadata Extraction, Tracer}) – Overall takes into account all of the tools of the given parser, or a specific tool can be specified
excluded_parsers (str or List[str]) – Parsers to exclude from the report

Returns

A DataFrame of the resulting labels

Return type

DataFrame