sparclur package
Subpackages
Astrotruther
- class sparclur._astrotruther.Astrotruther(file_col: str = 0, label_col: str = 1, base_path: typing.Optional[str] = None, label_transform: typing.Optional[typing.Callable[[str], str]] = None, parsers: typing.List[sparclur._parser.Parser] = [<class 'sparclur.parsers._pdfminer.PDFMiner'>, <class 'sparclur.parsers._ghostscript.Ghostscript'>, <class 'sparclur.parsers._mupdf.MuPDF'>, <class 'sparclur.parsers._poppler.Poppler'>, <class 'sparclur.parsers._xpdf.XPDF'>, <class 'sparclur.parsers._qpdf.QPDF'>, <class 'sparclur.parsers._arlington.Arlington'>, <class 'sparclur.parsers._pdfcpu.PDFCPU'>, <class 'sparclur.parsers._pdfium.PDFium'>], parser_args: typing.Dict[str, typing.Dict[str, typing.Any]] = {}, exclude: typing.Optional[str] = None, overall_timeout: typing.Optional[int] = None, classifier: str = 'decTree', classifier_args: typing.Dict[str, typing.Any] = {}, k_folds: int = 3, max_workers: int = 1, timeout: typing.Optional[int] = None, progress_bar: bool = True)
Trains a classifier from the error and warning messages of the specified SPARCLUR Tracers and a given set of labels for a collection of PDFs.
- fit(docs, doc_loading_args={}, save_training_data=None)
Fits the model to the training data.
- classmethod load(path)
Load a generated model
- Parameters
path (str) – The file path of a saved astrotruth model
- predict(docs, file_col=None, doc_loading_args={}, unseen_ignore=None, unseen_message_default='Not enough info', prediction_column='astrotruth', save_eval_data=None)
Uses the generated model to predict the validity of the given documents.
- Returns
DataFrame of the classification results
- Return type
DataFrame
- save(path)
Save the generated model
- Parameters
path (str) – The file path to save the model in
DetectChaos
- class sparclur._detect_chaos.DetectChaos(parsers: str, num_comparisons: int = 5, parser_args: Dict[str, Dict[str, Any]] = {}, parser_timeout: int = 120, overall_timeout: int = 600, num_workers: int = 1, progress_bar: bool = True)
Looks for evidence of non-determinism in PDF parsers.
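The idea of checking a parser for non-determinism can be sketched without SPARCLUR itself: run the same parse several times, hash each output, and flag the document if the hashes disagree. This is only an illustration of the concept, not the DetectChaos implementation. The function name `looks_nondeterministic`, the two stand-in parsers, and the reading of `num_comparisons` as "number of repeated runs" are all assumptions.

```python
import hashlib
import itertools
from typing import Callable


def looks_nondeterministic(parse: Callable[[bytes], bytes],
                           doc: bytes,
                           num_comparisons: int = 5) -> bool:
    """Return True if repeated runs of `parse` over `doc` disagree."""
    digests = set()
    for _ in range(num_comparisons):
        output = parse(doc)
        digests.add(hashlib.sha256(output).hexdigest())
    # More than one distinct digest means at least two runs differed.
    return len(digests) > 1


# Hypothetical stand-ins for real parsers, used only for demonstration.
def stable_parser(doc: bytes) -> bytes:
    return doc.upper()


_counter = itertools.count()


def flaky_parser(doc: bytes) -> bytes:
    # Appends a value that changes from call to call.
    return doc + str(next(_counter) % 2).encode()
```

Hashing the output keeps memory use flat when the parser output is large, since only the digests need to be retained between runs.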
- run(files, recurse=False, base_path=None, save_path=None)
Run the non-deterministic detection.
- Parameters
files (str or List[str]) – Path to a directory or a text file of PDF paths or a List of paths
recurse (bool) – Whether the directory passed into the files parameter should be searched recursively for PDFs
base_path (str) – A base directory that should be prepended to each of the paths passed into files
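The three accepted shapes of the files parameter (a directory, a text file of paths, or a list of paths) can be illustrated with a small helper. This is a sketch under stated assumptions rather than the library's own logic: the name `resolve_files` is hypothetical, the text file is assumed to hold one path per line, and base_path is assumed to apply only to the text-file and list cases.

```python
from pathlib import Path
from typing import List, Optional, Union


def resolve_files(files: Union[str, List[str]],
                  recurse: bool = False,
                  base_path: Optional[str] = None) -> List[Path]:
    """Turn the `files` argument into a flat list of PDF paths."""
    if isinstance(files, str):
        target = Path(files)
        if target.is_dir():
            # Directory: collect PDFs, optionally descending into subfolders.
            # Matches a lowercase ".pdf" suffix; case handling is platform-dependent.
            pattern = "**/*.pdf" if recurse else "*.pdf"
            return sorted(target.glob(pattern))
        # Assumption: otherwise it is a text file with one path per line.
        lines = target.read_text().splitlines()
        paths = [line.strip() for line in lines if line.strip()]
    else:
        paths = list(files)
    if base_path is None:
        return [Path(p) for p in paths]
    base = Path(base_path)
    return [base / p for p in paths]
```

For example, `resolve_files(["a.pdf", "b.pdf"], base_path="docs")` yields `docs/a.pdf` and `docs/b.pdf`.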
FloodLight
RollBack
- class sparclur._roll_back.RollBack(doc: Union[str, bytes])
Checks for incremental updates and, if any are present, analyzes the differences between the versions.
- compare_renders(parser='Poppler', parser_args={}, num_workers=1, versions=None, progress_bar=True, timeout=120, display_width=10, display_height=10, ncols=5)
Compares the renders between subsequent versions of incremental updates. If the versions are not specified and more than 11 versions are detected, only the latest 11 versions will be compared. The rendering comparisons can be run in parallel by setting num_workers greater than 1. This can be a time-consuming process, so take care not to compare too many versions at one time. The render comparisons are then plotted for each page, one plot per pair of subsequent versions.
- Parameters
parser (str, default='Poppler') – The parser to use to generate the renders.
parser_args (Dict[str, Any]) – Any additional arguments to pass to the specified parser
num_workers (int, default=1) – The number of workers to use if render comparisons are run in parallel. Choose 1 to run serially.
versions (List[int], default=None) – The specific versions to compare. If the versions are not contiguous, they will be sorted and comparisons will happen between neighbors in the sorted list.
progress_bar (bool, default=True) – Flag that determines if a progress bar for the render comparisons is displayed
timeout (int, default=120) – A timeout parameter for the rendering comparison
display_width (int) – The width of the resulting plot
display_height (int) – The height of the resulting plot
ncols (int) – The number of columns in the final subplot of comparisons.
- Return type
PyPlot figure
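The version selection described for compare_renders (latest 11 versions by default, otherwise the requested versions sorted and compared neighbor to neighbor) can be sketched as follows. The helper name `version_pairs` is hypothetical, and dropping duplicate version numbers is an assumption; the actual method may select versions differently in edge cases.

```python
from typing import List, Optional, Tuple

MAX_VERSIONS = 11  # cap mentioned in the compare_renders description


def version_pairs(num_versions: int,
                  versions: Optional[List[int]] = None) -> List[Tuple[int, int]]:
    """Return the (older, newer) version pairs that would be compared."""
    if versions is None:
        # No explicit selection: keep only the latest MAX_VERSIONS versions.
        selected = list(range(num_versions))[-MAX_VERSIONS:]
    else:
        # Explicit selection: sort so that neighbors can be compared.
        selected = sorted(set(versions))
    # Pair each version with the next one in the sorted list.
    return list(zip(selected, selected[1:]))
```

With 11 versions selected this produces 10 comparisons per page, which gives a sense of why large version counts become expensive.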
- compare_text(parser='Poppler', parser_args={}, display_width=10, display_height=10)
Compares the extracted text tokens between subsequent versions and plots the number of additions and subtractions in stacked bars. The Parser needs to support text extraction.
- Parameters
parser (str, default='Poppler') – Name of the parser to use for the text extraction
parser_args (Dict[str, Any]) – Any specific arguments to pass to the selected parser
display_width (int) – The width of the resulting plot
display_height (int) – The height of the resulting plot
- Return type
PyPlot figure
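Counting token additions and subtractions between subsequent versions can be illustrated with `collections.Counter`. This is a minimal sketch, assuming whitespace tokenization and one already-extracted text string per version; the name `token_changes` is hypothetical and the real method may tokenize and count differently.

```python
from collections import Counter
from typing import List, Tuple


def token_changes(texts: List[str]) -> List[Tuple[int, int]]:
    """Return (added, removed) token counts for each pair of neighboring versions."""
    counts = [Counter(text.split()) for text in texts]
    changes = []
    for old, new in zip(counts, counts[1:]):
        # Counter subtraction keeps only positive differences.
        added = sum((new - old).values())
        removed = sum((old - new).values())
        changes.append((added, removed))
    return changes
```

Each tuple corresponds to one stacked bar: the first value is the number of tokens added and the second the number of tokens removed.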
- property contains_updates
Whether or not the document has incremental updates.
- Return type
bool
- get_version(version: int)
Return the specified version. Versions are 0-indexed.
- Return type
bytes
- property num_versions
The number of detected updates in the document.
- Return type
int
- save_version(version: int, save_path: str)
Save the specified incremental update version. Versions are 0-indexed.
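In a PDF, each incremental update is appended to the end of the file and finishes with its own `%%EOF` marker, so earlier versions can be approximated by truncating the file after each marker. The sketch below illustrates that idea with 0-indexed versions. It is a simplified heuristic and not necessarily how RollBack detects versions: real files can contain the marker in other places, and the name `split_versions` is hypothetical.

```python
from typing import List

EOF_MARKER = b"%%EOF"


def split_versions(doc: bytes) -> List[bytes]:
    """Return one byte string per %%EOF marker, each a prefix of `doc`."""
    versions = []
    start = 0
    while True:
        idx = doc.find(EOF_MARKER, start)
        if idx == -1:
            break
        end = idx + len(EOF_MARKER)
        # Version i is everything up to and including the (i + 1)-th marker.
        versions.append(doc[:end])
        start = end
    return versions
```

Under this heuristic, a document contains updates when the returned list has more than one entry, and the last entry corresponds to the most recent version.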
Spotlight
- class sparclur._spotlight.Spotlight(num_workers: int = 1, temp_folders_dir: Optional[str] = None, dpi: int = 72, page_hashes: Optional[Union[int, Tuple]] = None, parsers: Optional[List[str]] = None, parser_args: Dict[str, Dict[str, Any]] = {}, timeout: Optional[int] = None, progress_bar: bool = True)
Runs all of the selected parsers over the document and also reforges the document with all available reforgers. The results are hashed and stored for analysis.
- class sparclur._spotlight.SpotlightResult(parser, version, validity, sparclur_hash)
Results from running Spotlight over a document. Provides the following methods:
validity_report: Provide a table of validity results for the parsers over the original document and the reforges
overall_validity: The overall validity classification for the document given the parsers
recoverable: Returns whether or not the document is unambiguously recoverable
sim_heatmap: A heatmap of the similarity scores for each parser over given pairs of documents
sim_sunburst: A sunburst of the similarity scores. Provides an interactive way to engage with the heatmap.
- overall_validity(version='original', excluded_parsers=None)
Return a classification for the overall validity of the specified document
- Parameters
version ({original, MuPDF, Poppler, Ghostscript}) – The document version to classify
excluded_parsers (str or List[str]) – Any parsers to exclude in the determination of the overall validity
- Returns
The validity label
- Return type
str
- recoverable(excluded_parsers=None, sim_threshold=0.9)
Uses the Spotlight results to determine if the document can be unambiguously recovered. The criterion is that every reforge parses successfully with every parser and, further, that the Sparclur Hash similarity between the reforged documents meets the given threshold.
- Parameters
excluded_parsers (str or List[str]) – List of parsers to omit from the recoverable criteria
sim_threshold (float) – The threshold to meet for the reforge comparisons. If a comparison falls below this threshold the recovery is said to be ambiguous
- Returns
The overall report for whether the original document is unambiguously recoverable or not
- Return type
str
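The recoverability criterion described above combines two checks: every reforge parses with every parser, and no reforge comparison falls below sim_threshold. A minimal sketch of that rule follows. The function name, the dictionary layouts, and the boolean return value are assumptions made for illustration; the documented method returns a string report.

```python
from typing import Dict, Tuple


def unambiguously_recoverable(parses_ok: Dict[Tuple[str, str], bool],
                              similarities: Dict[Tuple[str, str], float],
                              sim_threshold: float = 0.9) -> bool:
    """Apply the two-part recoverability rule.

    parses_ok maps (parser, reforge) to whether that parse succeeded.
    similarities maps (reforge_a, reforge_b) to a similarity score in [0, 1].
    """
    # Every reforge must parse successfully with every parser.
    if not all(parses_ok.values()):
        return False
    # A comparison that falls below the threshold makes the recovery ambiguous.
    return all(score >= sim_threshold for score in similarities.values())
```

Raising sim_threshold makes the rule stricter, since reforges must then agree more closely before the recovery counts as unambiguous.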
- sim_heatmap(parsers: Optional[str] = None, report: str = 'sim', annotated: bool = True, detailed: bool = False, compare_orig: bool = True, height: int = 10, width: int = 10, save_display=None)
A heatmap of the similarity scores for each parser over given pairs of documents
- Parameters
parsers (str or List[str]) – The parsers to display the similarity scores for
report ({'sim', 'Renderer sim', 'Text Extractor sim', 'Tracer sim'}) – The specific similarity score to display
annotated (bool, default=True) – Flag for whether or not similarity scores should be displayed on the heatmap
detailed (bool, default=False) – Flag for displaying all similarity scores for each parser
compare_orig (bool) – Whether reforge<->original comparison scores should be displayed in the heatmap. These comparisons don’t affect the recoverability of the file; they only provide insight into the differences between the original and the reforge
height (int) – Height of the figure
width (int) – Width of the figure
save_display (str) – If not None, save a png of the figure to the file path specified by save_display
- Return type
Seaborn heatmap
- sim_sunburst(compare_orig: bool = True, full: bool = False, color: str = 'RdBu', color_range: List[float] = [0.6, 1])
Create an interactive sunburst for exploring the similarities between the documents for the Spotlight parsers
- Parameters
compare_orig (bool) – Whether reforge<->original comparison scores should be displayed in the sunburst. These comparisons don’t affect the recoverability of the file; they only provide insight into the differences between the original and the reforge
full (bool) – Create the full sunburst that has all possible combinations. Turning this off removes duplicate comparisons to reduce the overall number of slices.
color (str) – The color range to use. See https://plotly.com/python/builtin-colorscales/
color_range (List[float]) – The range to base the color on. Format is [min, max]
- Return type
Plotly Sunburst
- validity_report(report='overall', excluded_parsers=None)
Return a table of the validity classifications for each parser over each document.
- Parameters
report ({overall, Renderer, Text Extraction, Font Extraction, Metadata Extraction, Tracer}) – Overall takes into account all of the tools of the given parser, or a specific tool can be specified
excluded_parsers (str or List[str]) – Parsers to exclude from the report
- Returns
A DataFrame of the resulting labels
- Return type
DataFrame