A handful of ScholarSphere uploads that were sent through remediation lost letters in their titles due to "subsetted fonts". We had a similar issue in metadata-listener with these exact same files. See the issue in the metadata-listener repo for more info and example files: psu-libraries/metadata-listener#132.
The letters may be getting lost at the early PDF parsing stage or at the later reconstruction stage. It might also happen at the Adobe API stage, but I think Adobe would be able to properly handle these "glyphs"/fonts.
So, what can we do about this? Some ideas:
- Upgrade the PDF parsing libraries
  - This could fix the problem entirely, or add some error guard to fail if subsetted fonts are detected
- Find a custom way to raise an error and fail out when subsetted fonts are detected
- Find a custom way to skip text that is a subsetted font
I would start with option #1 (upgrading the PDF parsing libraries).
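
If the upgrade doesn't pan out, here is a rough sketch of what the custom detection could look like. It relies on the PDF convention that a subsetted font's name starts with a six-uppercase-letter tag followed by `+` (e.g. `ABCDEF+Helvetica`). The names `is_subsetted`, `ensure_no_subsetted_fonts`, `SubsettedFontError`, and `split_fonts` are all made up for illustration, and the sketch assumes we can already get a list of font names out of whatever parsing library we use; that part is not shown.

```python
import re

# PDF convention: a subset font's name begins with a tag of exactly six
# uppercase letters followed by "+". A leading "/" is tolerated in case
# the name arrives as a raw PDF name object.
SUBSET_TAG = re.compile(r"^/?[A-Z]{6}\+")


class SubsettedFontError(Exception):
    """Hypothetical error type for failing out of remediation."""


def is_subsetted(font_name: str) -> bool:
    """Return True if the font name carries a subset tag."""
    return bool(SUBSET_TAG.match(font_name))


def ensure_no_subsetted_fonts(font_names):
    """Raise if any font name looks subsetted (the "fail out" idea)."""
    found = sorted(name for name in font_names if is_subsetted(name))
    if found:
        raise SubsettedFontError(
            "Subsetted fonts detected: " + ", ".join(found)
        )


def split_fonts(font_names):
    """Partition names into (full, subsetted) lists (the "skip" idea)."""
    full = [name for name in font_names if not is_subsetted(name)]
    subsetted = [name for name in font_names if is_subsetted(name)]
    return full, subsetted
```

One caveat: a subsetted font does not always cause lost letters, so failing on every subsetted font may reject files that would have remediated fine. The partition helper is there in case skipping only the affected text turns out to be the better tradeoff.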