Subsetted fonts can break PDF parsers and PDF reconstruction #11

@ajkiessl

Description

A handful of ScholarSphere uploads that were sent through remediation lost letters in their titles because of "subsetted fonts". We ran into a similar issue in metadata-listener with these exact same files. See the issue in the metadata-listener repo for more info and example files: psu-libraries/metadata-listener#132.

The problem with subsetted fonts may arise at the early PDF parsing stage or at the later reconstruction stage. It could also happen at the Adobe API stage, but I expect Adobe handles these "glyphs"/fonts properly.

So, what can we do about this? Some ideas:

  • Upgrade the PDF parsing libraries
    • This could fix the problem entirely, or add an error guard that fails when subsetted fonts are detected
  • Find a custom way to raise an error and fail out when subsetted fonts are detected
  • Find a custom way to skip text that is set in a subsetted font

I would start with the first option, upgrading the PDF parsing libraries.
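
For the "raise an error and fail out" idea, here is a minimal sketch of what detection could look like. It relies on the PDF convention that a subsetted font's name starts with a tag of six uppercase letters followed by a plus sign (for example `ABCDEF+Helvetica`). The names `SubsettedFontError`, `is_subsetted`, `find_subsetted_fonts`, and `guard_against_subsetted_fonts` are hypothetical, and the sketch assumes the font names have already been pulled out of the PDF by whatever parsing library we use; it does not reflect how our pipeline reads fonts today.

```python
import re

# Subset font names carry a tag: six uppercase letters, then "+",
# e.g. "ABCDEF+Helvetica". Some parsers keep the leading "/" of the
# PDF name object, so it is allowed here as an optional prefix.
SUBSET_TAG = re.compile(r"^/?[A-Z]{6}\+")


class SubsettedFontError(Exception):
    """Hypothetical error raised when a PDF uses subsetted fonts."""


def is_subsetted(font_name: str) -> bool:
    """Return True if the font name starts with a subset tag."""
    return bool(SUBSET_TAG.match(font_name))


def find_subsetted_fonts(font_names):
    """Return the unique subsetted font names, sorted for stable output."""
    return sorted({name for name in font_names if is_subsetted(name)})


def guard_against_subsetted_fonts(font_names):
    """Fail out if any font name looks subsetted (assumed policy)."""
    flagged = find_subsetted_fonts(font_names)
    if flagged:
        raise SubsettedFontError(
            "Subsetted fonts detected: " + ", ".join(flagged)
        )
```

One caveat: subsetting is very common in embedded fonts, so a guard this strict may reject files that would otherwise parse fine. It may be worth logging the flagged names first to see how many uploads would be affected.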
