Subsetted fonts can break PDF parsers and PDF reconstruction #11

A handful of ScholarSphere uploads that were sent through remediation lost letters in their titles because of "subsetted fonts".  We had a similar issue in metadata-listener with these exact same files.  See the issue in the metadata-listener repo for more info and example files: https://github.com/psu-libraries/metadata-listener/issues/132

The subsetted fonts may be causing the problem at the early PDF parsing stage or at the later reconstruction stage.  It _might_ also happen at the Adobe API stage, but I think Adobe would handle these "glyphs"/fonts properly.

So, what can we do about this?  Some ideas:

- Upgrade the PDF parsing libraries
  - This could fix the problem entirely, or a newer version may add an error guard that fails when subsetted fonts are detected
- Find a custom way to raise an error and fail out when subsetted fonts are detected
- Find a custom way to skip text that is set in a subsetted font

I would start with option 1 (upgrading the PDF parsing libraries).
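
If we end up needing the custom detection from the second or third idea, here is a rough sketch of how it could work. It relies on the PDF naming convention for font subsets: the font's `BaseFont` name starts with a tag of six uppercase letters followed by `+` (for example `ABCDEF+Times-Roman`). Everything named here (`SubsettedFontError`, `is_subsetted`, `find_subsetted`, `guard_against_subsets`) is hypothetical and not part of our codebase or any library. It also assumes we can already get a list of font names out of whichever parser we use, and that the producer of the PDF followed the tag convention, which may not hold for every file.

```python
import re

# PDF convention for font subsets: the BaseFont name begins with a tag of
# six uppercase letters followed by "+", e.g. "ABCDEF+Times-Roman".
# The optional leading "/" covers parsers that return the raw PDF name.
SUBSET_TAG = re.compile(r"^/?[A-Z]{6}\+")


class SubsettedFontError(Exception):
    """Hypothetical error raised when a document uses subsetted fonts."""


def is_subsetted(base_font_name: str) -> bool:
    """Return True if the font name carries a subset tag."""
    return bool(SUBSET_TAG.match(base_font_name))


def find_subsetted(font_names):
    """Return the font names that look like subsets, in input order."""
    return [name for name in font_names if is_subsetted(name)]


def guard_against_subsets(font_names):
    """Fail out (idea 2) if any font name looks like a subset."""
    found = find_subsetted(font_names)
    if found:
        raise SubsettedFontError(
            "subsetted fonts detected: " + ", ".join(found)
        )
```

`guard_against_subsets` covers the "raise an error and fail out" idea. For the "skip text" idea, `is_subsetted` could instead be checked per text run, assuming the parser exposes the font name for each run.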
