When it comes to computers and computer science, there are still lots of things I know very little about, and I suspect this will always be the case. Instead of learning something new and keeping it to myself, I thought I'd share the learnings in case they're useful or interesting to others.
For this post, I'll be focusing on the wild world of font rasterization with tangents into Unicode and the OpenType file format. If you happen to have experience in these areas, let me know what you think I should learn next! I hope to update the post with new learnings as I hear from experienced people.
Font rasterization is the process of converting a font description into a bitmap that can be displayed on a pixel-based screen. Before we dive into what this means, we need to clarify some vocabulary that often gets used imprecisely. One reason for the imprecision is that these terms are not perfectly well defined, but I think the definitions below get us far enough.
- Character: A character is an abstract term for the smallest component of written language that carries some semantic meaning. When we talk about the letter “a” in the abstract, i.e., the character in English that usually makes an “ahh” kind of sound, we're talking about the character “a”. Because characters are abstract, properties like weight, size and italic angle don't apply to them - you can't see a character, it's an idea. The Unicode standard assigns numbers to characters, which it calls “code points”.
- Glyph: A glyph is a visual representation of a character or group of characters that is capable of standing alone. When we see the letter “a” on a page we're looking at a specific glyph of the character “a”. The glyph has a distinct weight and size. A single glyph can sometimes represent multiple characters, like “ﬀ” (notice this is one entity rather than two “f” glyphs next to each other). These are known as ligatures.
- Font: A font is a mapping of characters (or more specifically, Unicode code points) to specific glyphs. A font is usually created to have a consistent visual style (weight, size, italic angle, etc.) that all of the glyphs adhere to.
- Typeface: A typeface is a collection of fonts that all have similar design features. Usually the fonts in a typeface have glyphs with the same general look but differ in weight (i.e., degree of boldness).
To summarize, let's say you want to type the first letter of the word “Computer”. What we want is to type the character “C” (in English called “capital ‘see’”). Unicode assigns the code point U+0043 to “C”. In our document processing software, we've chosen a font - let's say Helvetica Neue bold. This font is part of the typeface Helvetica Neue. A typeface designer has at some point created a description of Helvetica Neue bold (in some specialized software) and stored that description in a file that adheres to a particular file format (more on this later). This description includes the vector graphics description of the glyph, along with other data, that is mapped to the Unicode code point U+0043.
The job of a font rasterizer is to take the vector graphics description of the glyph and the other data and turn it into an actual bitmap that programs can use to display the glyph on the screen.
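To make the character-to-code-point step of the walkthrough above concrete, here is a small sketch using Python's standard `unicodedata` module. The helper name `describe` is my own, not part of any font library, and this only covers the Unicode side of things, not the font lookup.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return a 'U+XXXX NAME' label for a single character."""
    # ord() gives the Unicode code point as an integer.
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# The character from the walkthrough.
print(describe("C"))

# The "ff" ligature has its own code point, separate from the letter "f".
print(describe("\ufb00"))
print(describe("f"))
```

Note that the code point says nothing about weight or size. Those only show up once a font maps the code point to a glyph.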
Aside: Unicode Equivalence
An interesting side note to all of this is the idea of Unicode equivalence, the part of the Unicode specification which says that some code points (or sequences of code points) are equivalent to others. Sequences that have both the same appearance and the same meaning are “canonically equivalent”. For example, the Spanish letter “ñ” has its own Unicode code point U+00F1 but can also be represented as the code point U+006E (“n”) followed by U+0303 (known as the combining tilde - which is not normally displayed on its own). These two representations are thus “canonically equivalent”.
This idea is important for font rendering. For font rasterization, it is usually necessary to perform some sort of “normalization” such that canonically equivalent code points resolve to the same glyph when mapping between Unicode code points and glyphs.
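As a rough illustration of what normalization does, here is a sketch using `unicodedata.normalize` from Python's standard library. The function name `canonically_equal` is my own, and real text shaping stacks do considerably more than this before picking glyphs.

```python
import unicodedata

PRECOMPOSED = "\u00f1"   # "ñ" as a single code point
DECOMPOSED = "n\u0303"   # "n" followed by the combining tilde

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# A raw comparison sees two different code point sequences.
print(PRECOMPOSED == DECOMPOSED)

# After normalization they compare equal.
print(canonically_equal(PRECOMPOSED, DECOMPOSED))

# NFD goes the other way and splits "ñ" into its two code points.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", PRECOMPOSED)])
```

Either form works as the normalized one, as long as the lookup from code points to glyphs uses it consistently.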
Outline Font File Formats
Above we said that there are files with vector graphics descriptions of fonts (and other data). These files associate certain Unicode code points with glyphs and also provide descriptions of how to render these glyphs.
Back in the day, computers shipped with bitmap fonts, which are arrays of pixels that describe exactly how glyphs should be displayed on the screen, pixel by pixel. Vector graphics, on the other hand, are mathematical descriptions of shapes that can be scaled without any pixelation. Of course, computer screens are still rows of pixels at the end of the day, so rendering vector graphics still causes some artifacts, but overall vector graphics are preferred since they are more flexible and can be scaled to different sizes.
These files that contain the vector graphics descriptions of glyphs are known as outline font files. The big names here are PostScript Type 1 and Type 3, TrueType and OpenType, which you'll often see with file extensions like .ttf and .otf, as well as .ttc and .otc for collections of fonts. The newest of these file formats is OpenType, which was developed by Microsoft and Adobe and has been adopted by Apple as well. All of these formats are supported on modern computers and actually share quite a bit of structure. I'm going to focus on OpenType since it's the newest standard. The OpenType spec is fairly complicated but looks to be super flexible.
An OpenType file contains one or more fonts, each of which is a series of tables that hold particular kinds of information for that font. Each font contains more data than just the vector graphics data. It also includes information about how glyphs are laid out in relation to one another, including the glyph's baseline (the “bottom” of most glyphs, ignoring any descender - e.g., the tail of a “g” or “y”), the origin (the leftmost point of a glyph's baseline - at least in left to right scripts) and the advance width (the distance from the origin of one glyph to the origin of the next). This article in the Apple docs is a super helpful overview of these concepts.
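The difference between the two representations can be sketched with a toy glyph. Everything here is made up for illustration: the 5x5 “T” bitmap, the outline coordinates and the function names do not come from any real font or library.

```python
# A hypothetical 5x5 bitmap glyph for "T": each string is one row of pixels.
BITMAP_T = [
    "#####",
    "..#..",
    "..#..",
    "..#..",
    "..#..",
]

# The same shape as an outline: polygon corners in glyph units, y pointing up.
OUTLINE_T = [(0, 5), (5, 5), (5, 4), (3, 4), (3, 0), (2, 0), (2, 4), (0, 4)]

def scale_bitmap(rows, factor):
    """Scale by an integer factor by repeating every pixel (blocky result)."""
    scaled = []
    for row in rows:
        wide = "".join(ch * factor for ch in row)
        scaled.extend([wide] * factor)
    return scaled

def scale_outline(points, factor):
    """Scale an outline by multiplying its coordinates; any factor works."""
    return [(x * factor, y * factor) for x, y in points]

for row in scale_bitmap(BITMAP_T, 2):
    print(row)

print(scale_outline(OUTLINE_T, 1.5))
```

The bitmap can only be enlarged by whole pixels, and each original pixel just becomes a bigger block. The outline can be scaled by any factor, and the decision about which pixels to fill is deferred to the rasterizer.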
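To give a feel for the “series of tables” structure, here is a minimal sketch that reads the table directory at the start of a single-font file. As far as I understand the spec, the file begins with a 12-byte header followed by one 16-byte record per table, all big-endian. The function names are my own, and the demo bytes are synthetic (the checksums are zero and the offsets and lengths are invented), so this is not a real font. Collection files (.ttc/.otc) start with a different header that this sketch does not handle.

```python
import struct

HEADER_SIZE = 12  # sfntVersion (4 bytes) + four 16-bit fields
RECORD_SIZE = 16  # tag, checksum, offset, length (4 bytes each)

def read_table_directory(data: bytes):
    """Return (sfnt_version, {tag: (offset, length)}) for a single-font file."""
    sfnt_version, num_tables = struct.unpack_from(">IH", data, 0)
    tables = {}
    for i in range(num_tables):
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sIII", data, HEADER_SIZE + RECORD_SIZE * i
        )
        tables[tag.decode("ascii")] = (offset, length)
    return sfnt_version, tables

def make_demo_header() -> bytes:
    """Build a synthetic two-table directory, for demonstration only."""
    records = [(b"cmap", 44, 100), (b"hmtx", 144, 200)]  # invented values
    # 0x00010000 marks TrueType outlines; 32, 1, 0 are the search hints
    # (searchRange, entrySelector, rangeShift) for a two-table file.
    data = struct.pack(">IHHHH", 0x00010000, len(records), 32, 1, 0)
    for tag, offset, length in records:
        data += struct.pack(">4sIII", tag, 0, offset, length)
    return data

version, tables = read_table_directory(make_demo_header())
print(hex(version))
for tag, (offset, length) in tables.items():
    print(tag, offset, length)
```

In a real font, the `cmap` table holds the mapping from code points to glyphs and the `hmtx` table holds horizontal metrics such as advance widths, which is why I picked those two tags for the demo.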
Additional interesting bits of outline font files and font rasterization include font hinting, anti-aliasing, and subpixel rendering. Font hints are ways for typeface designers to tell the rasterizer how to render better at particular screen resolutions. Hinting seems to be mostly used on lower resolution screens to line glyphs up with the pixel grid so that text appears much sharper. Anti-aliasing is the act of determining what fraction of a pixel a shape would occupy if we were able to render at a higher resolution, and then filling in that pixel with a shade of gray matching that fraction. From afar this often makes lines look less pixelated. Lastly, subpixel rendering takes advantage of the fact that each pixel on a color LCD screen is actually made of three colored subpixels that can be adjusted individually. Rasterizers can use this fact to get better effective resolution.
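Here is a minimal sketch of the coverage idea behind anti-aliasing and subpixel rendering. To keep the math exact, the “glyph” is just an axis-aligned rectangle (think of a vertical stem). Real rasterizers compute coverage for curves and usually filter subpixel values to reduce color fringing, which this ignores. All names are hypothetical, and treating the three subpixels as equal vertical stripes, left to right, is an assumption about the panel layout.

```python
def overlap(a0, a1, b0, b1):
    """Length of the overlap between intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))

def coverage(rect, x0, x1, y0, y1):
    """Fraction of the cell [x0, x1] x [y0, y1] covered by the rectangle."""
    rx0, ry0, rx1, ry1 = rect
    covered = overlap(rx0, rx1, x0, x1) * overlap(ry0, ry1, y0, y1)
    return covered / ((x1 - x0) * (y1 - y0))

def gray_pixel(rect, px, py):
    """Grayscale anti-aliasing: one 0..255 intensity for the whole pixel."""
    return round(255 * coverage(rect, px, px + 1, py, py + 1))

def subpixels(rect, px, py):
    """Subpixel rendering: one 0..255 intensity per vertical stripe."""
    return tuple(
        round(255 * coverage(rect, px + k / 3, px + (k + 1) / 3, py, py + 1))
        for k in range(3)
    )

# A stem whose left edge falls a quarter of the way into pixel (0, 0).
STEM = (0.25, 0.0, 1.0, 1.0)  # (x0, y0, x1, y1) in pixel units

print(gray_pixel(STEM, 0, 0))  # one value: the pixel is mostly covered
print(subpixels(STEM, 0, 0))   # three values: only the left stripe is partial
```

With grayscale anti-aliasing the whole pixel gets a single in-between value, while the subpixel version can say that only the leftmost third is partially covered. That extra horizontal detail is where the better effective resolution comes from.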
The Rust Ecosystem
I'm doing a lot of programming in Rust lately so naturally I decided to see what the Rust community has to offer for font rasterization. Raph Levien introduced a new renderer, font-rs, 2 years ago that apparently performed better than anything that had come before, including stb_truetype, which is a commonly used C library for font rasterization. This project doesn't seem to be very active, but it seems that a lot of the learnings have been picked up by Pathfinder, which claims to be even faster. I believe Pathfinder is slated to be integrated into Firefox at some point in the near future. Another implementation is RustType from the RedoxOS project. One project I'm interested in is font-kit from Patrick Walton, the author of Pathfinder. It seems this project is an attempt to be a one stop shop for cross platform font needs, from system font lookups to font rasterization.
There's a lot of stuff I found in my deep dive into fonts that I didn't go into here, including the actual algorithms for rendering fonts, which I believe would merit their own post. Again, if you have experience with fonts and font rasterization, let me know what I should look into next. If you are like me and just starting to get into this stuff, let me know what I should write more about. You can find me on Twitter.