GSoC'21: Mid-Term Progress- 5 mins
Table of Contents:
Well, it’s been a while since I last wrote. But I wasn’t spending time watching Loki either! (that’s a lie.)
During this period the project took on some interesting (and stressful) curves, which I intend to talk about in this small writeup.
The first week of coding period, and I met one of my new mentors, Jouni. Without him, along with Tom and Antony, the project wouldn’t have moved an inch.
It was initially Jouni’s PR which was my starting point of the first milestone in my proposal, .
What is Font Subsetting anyway?
As was proposed by Tom, a good way to understand something is to document your journey along the way! (well, that’s what GSoC wants us to follow anyway right?)
Taking an excerpt from one of the paragraphs I wrote here:
Font Subsetting can be used before generating documents, to embed only the required glyphs within the documents. Fonts can be considered as a collection of these glyphs, so ultimately the goal of subsetting is to find out which glyphs are required for a certain array of characters, and embed only those within the output.
Now this may seem straightforward, right?
The glyph programs can call their own subprograms, for example, characters like
ä could be composed by calling subprograms for
→ could be composed by a program that changes the display matrix and calls the subprogram for
Since the subsetter has to find out all such subprograms being called by every glyph included in the subset, this is a generally difficult problem!
Something which one of my mentors said which really stuck with me:
Matplotlib isn’t a font library, and shouldn’t try to be one.
It’s really easy to fall into the trap of trying to do everything within your own project, which ends up rather hurting itself.
To give an analogy, try to do a lot of things with its platform, from basic text editing to full-fledged project management.
Compared with , whose whole niche is to provide a platform for project management (among other things), some people tend to choose the latter for their specific needs.
Since this analogy holds true even for Matplotlib, it uses external dependencies like FreeType, ttconv, and newly proposed fontTools to handle font subsetting, embedding, rendering, and related stuff.
PS: If that font stuff didn’t make sense, I would recommend going through a friendly tutorial I wrote, which is all about Matplotlib and Fonts!
Matplotlib uses an external dependency
ttconv which was initially forked into Matplotlib’s repository in 2003!
ttconv was a standalone commandline utility for converting TrueType fonts to subsetted Type 3 fonts (among other features) written in 1995, which Matplotlib forked in order to make it work as a library.
Over the time, there were a lot of issues with it which were either hard to fix, or didn’t attract a lot of attention. (See the above paragraph for a valid reason)
One major utility which is still used is
convert_ttf_to_ps, which takes a font path as input and converts it into a Type 3 or Type 42 PostScript font, which can be embedded within PS/EPS output documents. The guide I wrote (link) contains decent descriptions, the differences between these type of fonts, etc.
So we need to convert that font path input to a font buffer input.
Why do we need to? Type 42 subsetting isn’t really supported by ttconv, so we use a new dependency called fontTools, whose ‘full-time job’ is to subset Type 42 fonts for us (among other things).
It provides us with a font buffer, however ttconv expects a font path to embed that font
Easily enough, this can be done by Python’s
with tempfile.NamedTemporaryFile(suffix=".ttf") as tmp: # fontdata is the subsetted buffer # returned from fontTools tmp.write(fontdata.getvalue()) # TODO: allow convert_ttf_to_ps # to input file objects (BytesIO) convert_ttf_to_ps( os.fsencode(tmp.name), fh, fonttype, glyph_ids, )
But this is far from a clean API; in terms of separation of *reading* the file from *parsing* the data.
What we ideally want is to pass the buffer down to
convert_ttf_to_ps, and modify the embedding code of
ttconv (written in C++). And here we come across a lot of unexplored codebase, which wasn’t touched a lot ever since it was forked.
Funnily enough, just yesterday, after spending a lot of quality time, me and my mentors figured out that the whole logging system of ttconv was broken, all because of a single debugging function. 🥲
This is still an ongoing problem that we need to tackle over the coming weeks, hopefully by the next time I write one of these blogs, it gets resolved!
Again, thanks a ton for spending time reading these blogs. :D