Rust API Bindings: CFStringGetBytes Is Hard, Part 1
Designing a "correct by default" interface for
CFStringGetBytes is surprisingly complex, as many of its behaviors are encoding-dependent.
When I started building Rust API bindings for Core Foundation, I thought implementing the
Debug trait with
CFCopyDescription would be a great place to start, and it would help validate incremental progress in implementing the crate.
CFStringGetBytes is required to write the string value returned by
CFCopyDescription to the formatter:
CFStrings are canonically UTF-16 and Rust uses UTF-8, so this ended up being the first function for which I’d write Rust API bindings.
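To make the encoding gap concrete, here is a minimal sketch in pure Rust. It does not call Core Foundation: the hypothetical Utf16Description type stands in for the UTF-16 contents of a description string, and the standard library's char::decode_utf16 stands in for the transcoding that real bindings would delegate to CFStringGetBytes.

```rust
use std::char::{decode_utf16, REPLACEMENT_CHARACTER};
use std::fmt::{self, Debug, Formatter, Write};

/// Hypothetical stand-in for a description string held as UTF-16 code units.
pub struct Utf16Description(pub Vec<u16>);

impl Debug for Utf16Description {
    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
        // Transcode one scalar value at a time into the UTF-8 formatter.
        // Unpaired surrogates are replaced with U+FFFD in this sketch.
        for decoded in decode_utf16(self.0.iter().copied()) {
            f.write_char(decoded.unwrap_or(REPLACEMENT_CHARACTER))?;
        }
        Ok(())
    }
}
```

Replacing unpaired surrogates with U+FFFD is a choice made for this sketch only; as described later in the post, CFStringGetBytes itself behaves differently depending on the target encoding.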
In all interface design, I have two common goals:
The interface should be idiomatic—a person proficient in the language should find the naming, design patterns, integration points, etc., intuitive and familiar.
It should not be possible to represent an invalid state. In particular, assertions, preconditions, etc., are unnecessary because the compiler will reject code containing invalid states or control flow.
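As a small illustration of the second goal, consider the lossByte parameter of CFStringGetBytes (documented in the header comment quoted later in this post), where the value 0 doubles as "lossy conversion is not allowed." The sketch below uses a hypothetical LossPolicy enum, named here purely for illustration and not taken from the design this series describes, so that the sentinel can never be confused with a real replacement byte.

```rust
use std::num::NonZeroU8;

/// Hypothetical type: how to treat characters the target encoding cannot represent.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LossPolicy {
    /// Stop converting at the first unrepresentable character.
    Stop,
    /// Substitute this byte; `NonZeroU8` rules out the 0 sentinel.
    Substitute(NonZeroU8),
}

impl LossPolicy {
    /// The value that would be passed to the C API as `lossByte`.
    pub fn as_loss_byte(self) -> u8 {
        match self {
            LossPolicy::Stop => 0,
            LossPolicy::Substitute(byte) => byte.get(),
        }
    }
}
```

With this shape, a caller cannot ask to "substitute the byte 0": NonZeroU8::new(0) returns None, so the mistake surfaces where the value is constructed rather than deep inside the conversion.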
I had intended to discuss my Rust API design in today’s post, but I discovered a potential invalid state in the design in my final review and am still working on a fix. That works well for this blog since we can separate the problem and solution into independent posts!
The API doesn’t seem that complicated:
String.subproj/CFString.h lines 340-357
/* The primitive conversion routine; allows you to convert a string piece at a time
into a fixed size buffer. Returns number of characters converted.
Characters that cannot be converted to the specified encoding are represented
with the byte specified by lossByte; if lossByte is 0, then lossy conversion
is not allowed and conversion stops, returning partial results.
Pass buffer==NULL if you don't care about the converted string (but just the convertability,
or number of bytes required).
maxBufLength indicates the maximum number of bytes to generate. It is ignored when buffer==NULL.
Does not zero-terminate. If you want to create Pascal or C string, allow one extra byte at start or end.
Setting isExternalRepresentation causes any extra bytes that would allow
the data to be made persistent to be included; for instance, the Unicode BOM. Note that
CFString prepends UTF encoded data with the Unicode BOM <http://www.unicode.org/faq/utf_bom.html>
when generating external representation if the target encoding allows. It's important to note that
only UTF-8, UTF-16, and UTF-32 define the handling of the byte order mark character, and the "LE"
and "BE" variants of UTF-16 and UTF-32 don't.
*/
CFIndex CFStringGetBytes(CFStringRef theString, CFRange range, CFStringEncoding encoding, UInt8 lossByte, Boolean isExternalRepresentation, UInt8 *buffer, CFIndex maxBufLen, CFIndex *usedBufLen);
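For readers more comfortable with Rust than C, the prototype can be transcribed roughly as follows. This is a sketch, not a real binding: the type aliases are assumptions that match 64-bit Apple platforms, CFStringRef is reduced to an untyped pointer, and the signature is written as a function-pointer type (the hypothetical CFStringGetBytesFn) so that it compiles without linking against Core Foundation.

```rust
use std::os::raw::c_void;

// Core Foundation scalar types, transcribed for this sketch.
pub type CFIndex = isize; // `long` in the C headers on Apple platforms
pub type CFStringEncoding = u32;
pub type Boolean = u8;
pub type CFStringRef = *const c_void; // opaque; a real binding would use a dedicated type

/// A location and length, both counted in UTF-16 code units.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CFRange {
    pub location: CFIndex,
    pub length: CFIndex,
}

/// The shape of `CFStringGetBytes`, expressed as a function-pointer type.
pub type CFStringGetBytesFn = unsafe extern "C" fn(
    the_string: CFStringRef,
    range: CFRange,
    encoding: CFStringEncoding,
    loss_byte: u8,
    is_external_representation: Boolean,
    buffer: *mut u8,
    max_buf_len: CFIndex,
    used_buf_len: *mut CFIndex,
) -> CFIndex;
```

Every parameter is a raw scalar or pointer, so nothing in the types prevents a negative length, a null output pointer, or a maxBufLen that does not match the buffer. That is the gap a "correct by default" interface has to close.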
But, in writing tests for my bindings, I discovered the following behaviors that were not apparent to me from the comment or the documentation:
range may start or end in the middle of a surrogate pair, or the
CFString may contain invalid UTF-16 (CFString does not validate strings created from a UTF-16 buffer, and it allows the deletion of a surrogate code unit without deleting its counterpart). The handling of invalid surrogates is dependent on the encoding:
For UTF-8, conversion stops. Even if the caller provides a
lossByte, the function does not process the code unit as a lossy conversion.
For UTF-32, the surrogate code unit becomes a lossy conversion.
For all other encodings, including UTF-16, there is no observable effect.
A code point encoded as a surrogate pair becomes two lossy code points for non-Unicode encodings.
isExternalRepresentation does not encode a BOM for UTF-8 despite the implication in the comment.
If encoding == kCFStringEncodingUTF16,
isExternalRepresentation == true, and
maxBufLen < 2, Core Foundation will overrun the
buffer when writing the BOM. (The UTF-32 BOM write does validate the buffer’s capacity.)
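The first behavior is easy to trigger because a CFRange counts UTF-16 code units, not code points. The following is a minimal sketch, in pure Rust and independent of Core Foundation, of how a wrapper might detect a range boundary that falls inside a surrogate pair. The helper names are hypothetical, and the sketch assumes start <= end <= units.len().

```rust
/// True if `unit` is a leading (high) surrogate, U+D800..=U+DBFF.
pub fn is_lead(unit: u16) -> bool {
    (0xD800..=0xDBFF).contains(&unit)
}

/// True if `unit` is a trailing (low) surrogate, U+DC00..=U+DFFF.
pub fn is_trail(unit: u16) -> bool {
    (0xDC00..=0xDFFF).contains(&unit)
}

/// Reports whether the half-open code-unit range `start..end` cuts a
/// well-formed surrogate pair in half at either boundary.
pub fn splits_surrogate_pair(units: &[u16], start: usize, end: usize) -> bool {
    // A boundary at index `i` splits a pair when the unit before it is a
    // lead surrogate and the unit at it is the matching trail surrogate.
    let cut_at = |i: usize| {
        i > 0 && i < units.len() && is_lead(units[i - 1]) && is_trail(units[i])
    };
    cut_at(start) || cut_at(end)
}
```

For the four code units of "a😀b", the range 0..2 ends between the two halves of the emoji and is reported as a split, while 1..3 covers the whole pair and is not. The check only finds split pairs; a string that already contains a lone surrogate needs separate handling.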
For the most part, these are insignificant corner cases:
Although a lot of code working with UTF-16 needs to handle surrogate pairs better, I suspect there are not many cases where a surrogate pair split occurs at the beginning or end of a range.
The primary conversion direction for non-Unicode encodings is into Unicode, so the code point inflation of surrogate pairs when converting from Unicode is likely rare.
In practice, BOMs are rarely used, especially for UTF-8.
Production code is unlikely to pass a buffer smaller than two bytes.
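Even if such small buffers are rare, a safe wrapper can rule out the overrun cheaply. The sketch below shows only the bounds check, using a hypothetical write_utf16_bom helper; it is not Core Foundation's code, and writing the BOM in native byte order is an assumption made for illustration.

```rust
/// Writes the UTF-16 byte order mark (U+FEFF) at the start of `buffer`.
///
/// Returns the number of bytes written, or `None` (leaving the buffer
/// untouched) when fewer than two bytes are available.
pub fn write_utf16_bom(buffer: &mut [u8]) -> Option<usize> {
    let bom = 0xFEFFu16.to_ne_bytes();
    // `get_mut` performs the capacity check that prevents the overrun.
    let dest = buffer.get_mut(..bom.len())?;
    dest.copy_from_slice(&bom);
    Some(bom.len())
}
```

Because the destination is a slice rather than a pointer and a separate length, the capacity travels with the buffer and the check cannot be skipped by passing a mismatched maxBufLen.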
Fortunately, a few layers of abstraction can simplify most of this complexity. Stay tuned for an overview of how I designed a Rust API to enforce correct and predictable behavior.