Always Processing

Rust API Bindings: CFStringGetBytes Is Hard, Part 1


Designing a "correct by default" interface for CFStringGetBytes is surprisingly complex, as many of its behaviors are encoding-dependent.

When I started building Rust API bindings for Core Foundation, I thought implementing the Debug trait with CFCopyDescription would be a great place to start, and it would help validate incremental progress in implementing the crate.

Writing the string returned by CFCopyDescription to the Debug Formatter requires CFStringGetBytes, because CFStrings are canonically UTF-16 and Rust strings are UTF-8. So this ended up being the first function for which I’d write Rust API bindings.

When writing API bindings, I prefer to cover 100% of the API up front. Incomplete binding layers have bitten me in the past, so I try to avoid creating any divergent behavior in a logically equivalent interface. Insufficient API coverage may also lead to a bindings layer design gap that can be difficult to close after accumulating many dependencies.

API bindings are a domain where it’s essential to go slow at first. Coherently incorporating the entire API surface area into the design up front lets product development go fast later, without detours to fix bindings bugs or close bindings gaps.

In all interface design, I have two common goals:

  1. The interface should be idiomatic—a person proficient in the language should find the naming, design patterns, integration points, etc., intuitive and familiar.

  2. It should not be possible to represent an invalid state. In particular, assertions, preconditions, etc., are unnecessary because the compiler will reject code containing invalid states or control flow.
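As a small illustration of the second goal, consider a C-style parameter where 0 doubles as "disabled", such as the lossByte parameter of CFStringGetBytes. The sketch below is not the design this series arrives at; `LossPolicy` and `as_loss_byte` are hypothetical names, and the only point is that the type system can rule out the ambiguous value.

```rust
use std::num::NonZeroU8;

/// Hypothetical type: what to do with characters the target encoding
/// cannot represent.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LossPolicy {
    /// Stop converting at the first unrepresentable character.
    Stop,
    /// Substitute this byte. It can never be zero, so it can never be
    /// confused with `Stop` once it crosses into the C API.
    Substitute(NonZeroU8),
}

impl LossPolicy {
    /// The raw value a C-style `lossByte` parameter would expect
    /// (0 means "lossy conversion is not allowed").
    fn as_loss_byte(self) -> u8 {
        match self {
            LossPolicy::Stop => 0,
            LossPolicy::Substitute(byte) => byte.get(),
        }
    }
}
```

Because `NonZeroU8::new(0)` returns `None`, a caller cannot construct a "substitute with zero" policy, so no runtime assertion is needed to reject it.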

I had intended to discuss my Rust API design in today’s post, but I discovered a potential invalid state in the design in my final review and am still working on a fix. That works well for this blog since we can separate the problem and solution into independent posts!

CFStringGetBytes Complexity

The API doesn’t seem that complicated:

String.subproj/CFString.h lines 340-357
/* The primitive conversion routine; allows you to convert a string piece at a time
       into a fixed size buffer. Returns number of characters converted.
   Characters that cannot be converted to the specified encoding are represented
       with the byte specified by lossByte; if lossByte is 0, then lossy conversion
       is not allowed and conversion stops, returning partial results.
   Pass buffer==NULL if you don't care about the converted string (but just the convertability,
       or number of bytes required).
   maxBufLength indicates the maximum number of bytes to generate. It is ignored when buffer==NULL.
   Does not zero-terminate. If you want to create Pascal or C string, allow one extra byte at start or end.
   Setting isExternalRepresentation causes any extra bytes that would allow
       the data to be made persistent to be included; for instance, the Unicode BOM. Note that
       CFString prepends UTF encoded data with the Unicode BOM <http://www.unicode.org/faq/utf_bom.html>
       when generating external representation if the target encoding allows. It's important to note that
       only UTF-8, UTF-16, and UTF-32 define the handling of the byte order mark character, and the "LE"
       and "BE" variants of UTF-16 and UTF-32 don't.
*/
CF_EXPORT
CFIndex CFStringGetBytes(CFStringRef theString, CFRange range, CFStringEncoding encoding, UInt8 lossByte, Boolean isExternalRepresentation, UInt8 *buffer, CFIndex maxBufLen, CFIndex *usedBufLen);

But, in writing tests for my bindings, I discovered the following behaviors that were not apparent to me from the comment or the documentation:

  1. range may start or end in the middle of a surrogate pair, or the CFString may contain invalid UTF-16 (Core Foundation does not validate strings created from a UTF-16 buffer, and it allows the deletion of one surrogate code unit without deleting its counterpart). The handling of invalid surrogates depends on the encoding:

    1. For UTF-8, conversion stops. Even if the caller provides a lossByte, the function does not process the code unit as a lossy conversion.

      Corollary: Code that assumes UTF-8 conversion cannot fail will loop infinitely if the string contains invalid UTF-16 and the loop does not check that each conversion made forward progress.

    2. For UTF-32, the surrogate code unit becomes a lossy conversion.

    3. For all other encodings, including UTF-16, there is no observable effect.

  2. A code point encoded as a surrogate pair becomes two lossy code points for non-Unicode encodings.

  3. isExternalRepresentation does not encode a BOM for UTF-8 despite the implication in the comment.

  4. If encoding == kCFStringEncodingUTF16, isExternalRepresentation == true, and maxBufLen < 2, Core Foundation will overrun the buffer when writing the BOM. (The UTF-32 BOM write does validate the buffer’s capacity.)
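The forward-progress corollary under the UTF-8 case can be sketched in plain Rust without calling Core Foundation. Here `convert_prefix` is a hypothetical stand-in that only mimics the behavior described above (it stops at an unpaired surrogate and reports how many code units it consumed), and replacing the offending unit with U+FFFD is an assumption made for this example, not something CFStringGetBytes does.

```rust
/// Hypothetical stand-in for a UTF-8 conversion call: converts as many
/// UTF-16 code units from `src` as it can, appends the UTF-8 bytes to
/// `out`, and returns the number of code units consumed. It stops at an
/// unpaired surrogate instead of substituting anything.
fn convert_prefix(src: &[u16], out: &mut Vec<u8>) -> usize {
    let mut consumed = 0;
    for unit in char::decode_utf16(src.iter().copied()) {
        match unit {
            Ok(c) => {
                let mut tmp = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut tmp).as_bytes());
                consumed += c.len_utf16();
            }
            // Unpaired surrogate: stop and return partial results.
            Err(_) => break,
        }
    }
    consumed
}

/// Converts all of `src`, replacing each unpaired surrogate with U+FFFD.
/// The `consumed == 0` branch is the forward-progress check; without it,
/// the loop would never advance past the first unpaired surrogate.
fn to_utf8_lossy(src: &[u16]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < src.len() {
        let consumed = convert_prefix(&src[pos..], &mut out);
        pos += consumed;
        if consumed == 0 {
            // No progress: the unit at `pos` is an unpaired surrogate.
            out.extend_from_slice("\u{FFFD}".as_bytes());
            pos += 1;
        }
    }
    out
}
```

A caller that drops the `consumed == 0` branch and simply retries from `pos` is exactly the loop that never terminates on invalid UTF-16.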

I didn’t see any mention of buffer alignment requirements in the documentation or code. buffer's type of UInt8 * implies CFStringGetBytes supports unaligned buffers. Such support is reasonable, for example, to facilitate callers writing bytes into a persistent format where packing may create unaligned offsets. But, as far as I can tell, the implementation relies on undefined behavior (casting the buffer pointer to a type with stricter alignment) for unaligned pointer support. Precariously but fortuitously, the instructions selected by the compiler support unaligned writes.
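For contrast, here is a minimal sketch of writing a 16-bit value, such as a BOM, into a byte buffer without relying on alignment or on the buffer being large enough. `write_u16_at` is a hypothetical helper, not part of Core Foundation or of the bindings. It copies bytes instead of casting the pointer, so any offset is acceptable, and it checks capacity before writing.

```rust
/// Hypothetical helper: writes `value` into `buf` at `offset` in native
/// byte order using a byte-wise copy, so `buf` needs no particular
/// alignment. Returns `None`, leaving `buf` untouched, if the two bytes
/// would not fit.
fn write_u16_at(buf: &mut [u8], offset: usize, value: u16) -> Option<()> {
    let end = offset.checked_add(2)?;
    // `get_mut` performs the bounds check that guards against an overrun.
    buf.get_mut(offset..end)?
        .copy_from_slice(&value.to_ne_bytes());
    Some(())
}
```

With a one-byte buffer the helper returns `None` rather than writing past the end. In code that must work with raw pointers, `std::ptr::write_unaligned` provides a defined way to perform the same store.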

For the most part, these are insignificant corner cases:

  • Although a lot of code working with UTF-16 needs to handle surrogate pairs better, I suspect there are not many cases where a surrogate pair split occurs at the beginning or end of a range.

  • The primary conversion direction for non-Unicode encodings is into Unicode, so the code point inflation of surrogate pairs when converting from Unicode is likely rare.

  • In practice, BOMs are rarely used, especially for UTF-8.

  • Production code is unlikely to use a buffer smaller than two bytes.
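Rare as the split-pair case in the first bullet may be, it is cheap to detect. The following sketch works on a plain slice of UTF-16 code units; the function names are hypothetical, and the range is given in code units to match CFRange semantics.

```rust
use std::ops::Range;

fn is_high_surrogate(unit: u16) -> bool {
    (0xD800..=0xDBFF).contains(&unit)
}

fn is_low_surrogate(unit: u16) -> bool {
    (0xDC00..=0xDFFF).contains(&unit)
}

/// True if index `i` falls between the two halves of a surrogate pair.
fn splits_pair(units: &[u16], i: usize) -> bool {
    i > 0
        && i < units.len()
        && is_high_surrogate(units[i - 1])
        && is_low_surrogate(units[i])
}

/// True if either end of `range` (in UTF-16 code units) cuts a
/// surrogate pair in half.
fn range_splits_pair(units: &[u16], range: Range<usize>) -> bool {
    splits_pair(units, range.start) || splits_pair(units, range.end)
}
```

For the code units of "a😀b", the range 0..2 ends between the two halves of the emoji, while 0..3 and 1..3 do not split it.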

Fortunately, a few layers of abstraction can simplify most of this complexity. Stay tuned for an overview of how I designed a Rust API to enforce correct and predictable behavior.