sparclur.trawler package
sparclur.trawler.Highlight
- class sparclur.trawler._highlight.Highlight(renderers: typing.List[sparclur._parser.Parser] = [<class 'sparclur.parsers._ghostscript.Ghostscript'>, <class 'sparclur.parsers._mupdf.MuPDF'>, <class 'sparclur.parsers._poppler.Poppler'>, <class 'sparclur.parsers._xpdf.XPDF'>, <class 'sparclur.parsers._pdfium.PDFium'>], parser_args: typing.Dict[str, typing.Dict[str, typing.Any]] = {}, max_workers: int = 1, timeout: typing.Optional[int] = None, overall_timeout: typing.Optional[int] = None, progress_bar: bool = True)
Bases: object
Compares two PDFs with the same provenance and highlights regions of difference between their renders.
- spot_the_difference(file_set: str, matching_criteria: Dict[str, str], dpi: int = 72, min_region: int = 40, prc_threshold: float = 1.0, ent_threshold: float = 0.2, recurse: bool = False, extension: Optional[str] = None, base_path: Optional[str] = None, save_path: Optional[str] = None)
- Parameters
file_set (str or List[str]) – Path to a directory, a text file of PDF paths, or a list of paths (or file names if base_path is defined). These are the modified files to be compared with the originators.
matching_criteria (Dict[str, str] or Callable[[str], str]) – An explicit dictionary that maps every path in the file_set to the path of the originator file or a method that transforms each path string from file_set into the originator file path.
dpi (int) – The dots-per-inch to set each renderer to.
min_region (int) – The minimum region size for which differences are highlighted.
prc_threshold (float) – A float, x, such that 0 < x <= 1.0. Regions are only searched if the SPARCLUR PRC is less than this threshold. If it is set to 1.0, all files have their regions explored.
ent_threshold (float) – The threshold for flagging a region as having a significant difference in information between the two regions.
recurse (bool) – Whether or not the directory passed into the file_set parameter should be recursively searched for PDFs.
extension (str) – Filters out files that don’t have the matching extension.
base_path (str) – A base directory that is prepended to each of the files passed into file_set.
save_path (str) – If specified, a CSV of the run results is saved to save_path.
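The two accepted forms of matching_criteria can be sketched as follows. This is a minimal illustration, not part of SPARCLUR: the helper names `to_originator` and `build_mapping`, the `_mod` file-name suffix, and the `modified/` and `originals/` directory layout are all assumptions made for the example. The commented-out call at the end only mirrors the signature documented above and has not been run against the library.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def to_originator(modified_path: str) -> str:
    """Callable form: map a modified file's path to its originator's path.

    Assumed (hypothetical) layout:
        <root>/modified/<name>_mod.pdf  ->  <root>/originals/<name>.pdf
    """
    path = PurePosixPath(modified_path)
    stem = path.stem
    # Drop the hypothetical "_mod" suffix if it is present.
    if stem.endswith("_mod"):
        stem = stem[: -len("_mod")]
    # Swap the containing directory for a sibling "originals" directory.
    return str(path.parent.parent / "originals" / (stem + path.suffix))


def build_mapping(file_set: List[str]) -> Dict[str, str]:
    """Dictionary form: an explicit entry for every path in file_set."""
    return {path: to_originator(path) for path in file_set}


modified = ["corpus/modified/a_mod.pdf", "corpus/modified/b_mod.pdf"]
mapping = build_mapping(modified)

# Either form could then be passed as matching_criteria, for example:
#   from sparclur.trawler import Highlight
#   highlight = Highlight(max_workers=1, progress_bar=False)
#   highlight.spot_the_difference(file_set=modified,
#                                 matching_criteria=to_originator,
#                                 save_path="results.csv")
```

The callable form avoids building the full mapping up front, while the dictionary form makes every pairing explicit and easy to inspect before a long run.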