Always Processing

Rust API Bindings: CFStringGetBytes Is Hard, Part 2

A computer monitor and keyboard with many bits arranged in a grid in the background. Are the bits an encoding of the information on screen?

Using Rust’s features, we can provide API bindings to CFStringGetBytes that prevent unsupported argument combinations at the call site to fix the problems identified in the previous post.

The last post described the complexities I encountered when building Rust bindings for CFStringGetBytes. This post shares the rationale behind my design choices for the first bindings layer. This lowest-level layer mitigates the four problems identified in the previous post.

I say "first bindings layer" because the method discussed in this post underlies up to four other Rust interfaces for calling CFStringGetBytes. Why so many layers, though? From the Rust API Guidelines:

Functions expose intermediate results to avoid duplicate work

Many functions that answer a question also compute interesting related data. If this data is potentially of interest to the client, consider exposing it in the API.

Therefore, in building what I thought was the most idiomatic Rust API, I exposed successively lower-level layers for more customizability (at the cost of complexity).

First, let’s review the Core Foundation C API:

String.subproj/CFString.h line 357
CFIndex CFStringGetBytes(CFStringRef theString, CFRange range, CFStringEncoding encoding, UInt8 lossByte, Boolean isExternalRepresentation, UInt8 *buffer, CFIndex maxBufLen, CFIndex *usedBufLen);

And compare it to the most direct Rust interface implemented in my crate:

src/ lines 605-641
impl String {
  pub fn get_bytes_unchecked(
    &self,
    range: impl RangeBounds<usize>,
    encoding: GetBytesEncoding,
    buf: Option<&mut [u8]>,
  ) -> GetBytesResult { /* ... */ }
}

  • &self represents the CFStringRef pointer, which is typical for bindings of object-oriented interfaces.

  • The CFRange parameter equivalent is an impl RangeBounds<usize>, enabling the caller to use a Rust range expression. A caller may, for example, pass .. to specify the full range of the string.

    • CFRange's fields are of type CFIndex, which is a signed type. The first post in this mini-series discussed the design choices in implementing unsigned to signed conversion.

  • GetBytesEncoding replaces CFStringEncoding and also subsumes lossByte and isExternalRepresentation. The following section provides more detail.

  • buf: Option<&mut [u8]> captures the optional UInt8 *buffer and CFIndex maxBufLen arguments. The Option type clearly expresses that a buffer is not required. If supplied, though, the slice provides the buffer’s length.

  • The GetBytesResult return type includes the C API’s return value (the number of UTF-16 code units converted) and the out parameter in the C API, usedBufLen.

  • The _unchecked suffix hints that this method has a quirk the caller must handle. The following sections elaborate on this behavior.
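As a sketch of the CFRange mapping, the `impl RangeBounds<usize>` argument can be resolved into the (location, length) pair a CFRange requires. `to_range` is a hypothetical helper for illustration, not the crate's actual internals:

```rust
use std::ops::{Bound, RangeBounds};

// Hypothetical helper: resolve a caller-supplied range against the string's
// length in UTF-16 code units, producing a (location, length) pair.
fn to_range(bounds: impl RangeBounds<usize>, len: usize) -> (usize, usize) {
    let start = match bounds.start_bound() {
        Bound::Included(&s) => s,
        Bound::Excluded(&s) => s + 1,
        Bound::Unbounded => 0,
    };
    let end = match bounds.end_bound() {
        Bound::Included(&e) => e + 1,
        Bound::Excluded(&e) => e,
        Bound::Unbounded => len,
    };
    (start, end - start)
}
```

With this shape, a caller's `..` resolves to the full string: `to_range(.., 10)` yields `(0, 10)`.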

This method implements a check to mitigate the fourth problem identified in the previous post:

If encoding == kCFStringEncodingUTF16, isExternalRepresentation == true, and maxBufLen < 2, Core Foundation will overrun the buffer when writing the BOM. (The UTF-32 BOM write does validate the buffer’s capacity.)

Without this check, the function’s use could lead to unsoundness, so it would need the unsafe qualifier.
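A minimal sketch of that check (the function name and shape are illustrative, not the crate's internals):

```rust
// Core Foundation writes the two-byte UTF-16 BOM without validating the
// buffer's capacity, so the bindings must reject this combination up front.
fn would_overrun_utf16_bom(
    is_utf16: bool,
    is_external_representation: bool,
    max_buf_len: usize,
) -> bool {
    is_utf16 && is_external_representation && max_buf_len < 2
}
```

When this predicate holds, the bindings can report the BOM's required size instead of forwarding the call to Core Foundation.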


The GetBytesEncoding struct encompasses the CFStringEncoding, lossByte, and isExternalRepresentation arguments.

src/ lines 126-161
pub enum GetBytesEncoding {
  CharacterSet {
    character_set: CharacterSet,

    /// **Note:** Core Foundation will process surrogate pairs as two individual lossy code
    /// points, so the number of output code points will equal the number of input code units.
    loss_byte: Option<NonZeroU8>,
  },
  Utf8,
  Utf16 {
    byte_order: GetBytesByteOrder,
  },
  Utf32 {
    byte_order: GetBytesByteOrder,
    loss_byte: Option<NonZeroU8>,
  },
}
  • The CharacterSet enum encompasses all the non-Unicode encodings. (The corresponding string creation function, CFStringCreateWithBytes, is quirky, too. The string construction bindings also specifically handle UTF-8, UTF-16, and UTF-32, and handle all non-Unicode representations with the CharacterSet enum.)

    • The loss_byte field is an Option<NonZeroU8> to express, through the type system, that a loss byte, if used, must have a non-zero value.

    • Unfortunately, the use of a comment was the only mitigation I could find for the second problem identified in the previous post:

      A code point encoded as a surrogate pair becomes two lossy code points for non-Unicode encodings.

  • The Utf8 encoding does not have a loss byte, mitigating potential unexpected behavior in reading the call site and documentation, identified in part a of the first problem in the previous post:

    Even if the caller provides a lossByte, the function does not process the code unit as a lossy conversion.

    The absence of a conversion fallback may imply UTF-8 cannot fail like UTF-16, which is a contributing factor to the method’s _unchecked suffix—the behavior of a straightforward call may not align with default assumptions.

  • Utf16 has a GetBytesByteOrder field, discussed below, which implements the isExternalRepresentation argument.

  • Utf32, like Utf16, also has a GetBytesByteOrder field and, similar to CharacterSet, has a loss byte to handle invalid surrogates (clarifying UTF-32 is the only Unicode target encoding that implements loss byte support, as described in part b of the first problem in the previous post).

UTF-16 and UTF-32 encode text as 16-bit and 32-bit code units, which may be stored in big- or little-endian byte order. GetBytesByteOrder enumerates the supported options.

src/ lines 109-124
pub enum GetBytesByteOrder {
  HostNative {
    include_bom: bool,
  },
  // ... big- and little-endian variants
}

Core Foundation only supports writing a byte order mark, or BOM (isExternalRepresentation = true), when using the host’s native byte order. This enum prevents callers from specifying unsupported combinations and, therefore, receiving unexpected results.
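For concreteness, the BOM written in the host-native case is U+FEFF serialized in the host's byte order; a tiny illustration:

```rust
// The UTF-16 byte order mark is U+FEFF; in an external representation, it is
// written in the host's native byte order.
fn utf16_host_bom() -> [u8; 2] {
    0xFEFFu16.to_ne_bytes()
}
```

On a little-endian host this yields `[0xFF, 0xFE]`; on a big-endian host, `[0xFE, 0xFF]`.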

GetBytesEncoding::Utf8 and the implementation of isExternalRepresentation in GetBytesByteOrder combine to mitigate the third problem identified in the previous post:

isExternalRepresentation does not encode a BOM for UTF-8 despite the implication in the comment.
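Putting the two enums together, here is a sketch of how they might lower to the C API's encoding, lossByte, and isExternalRepresentation arguments. The stub types omit the CharacterSet case, and the lowering logic is my assumption about what such bindings must do, not the crate's actual code; the constant values are from CFString.h:

```rust
use std::num::NonZeroU8;

// Stub constants from CFString.h.
const K_CF_STRING_ENCODING_UTF8: u32 = 0x0800_0100;
const K_CF_STRING_ENCODING_UTF16: u32 = 0x0100;
const K_CF_STRING_ENCODING_UTF32: u32 = 0x0c00_0100;

enum ByteOrder {
    HostNative { include_bom: bool },
}

enum Encoding {
    Utf8,
    Utf16 { byte_order: ByteOrder },
    Utf32 { byte_order: ByteOrder, loss_byte: Option<NonZeroU8> },
}

// Lower the type-safe enum to the C argument triple:
// (encoding, lossByte, isExternalRepresentation).
fn lower(encoding: &Encoding) -> (u32, u8, bool) {
    match encoding {
        // UTF-8: no loss byte, and no BOM regardless of the caller's intent.
        Encoding::Utf8 => (K_CF_STRING_ENCODING_UTF8, 0, false),
        Encoding::Utf16 { byte_order: ByteOrder::HostNative { include_bom } } => {
            (K_CF_STRING_ENCODING_UTF16, 0, *include_bom)
        }
        Encoding::Utf32 { byte_order: ByteOrder::HostNative { include_bom }, loss_byte } => {
            (K_CF_STRING_ENCODING_UTF32, loss_byte.map_or(0, NonZeroU8::get), *include_bom)
        }
    }
}
```

The point of the design is visible in the match: no arm can produce an argument combination the C function mishandles.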


The function returns a struct incorporating the C API’s return value and out parameter.

src/ lines 202-222
pub struct GetBytesResult {
  pub buf_len: usize,
  pub remaining: Option<Range<usize>>,
}

If buf was Some, the buf_len field contains the number of bytes written into the slice. Otherwise, it contains the number of bytes required to convert the processed range.

If the call converted the entire input range, the remaining field is None. Otherwise, it contains the portion of the input range not converted during the call. Returning a Range instead of the number of UTF-16 code units converted simplifies conversion loop implementations by providing a clear "done" signal and the range argument value for the next call.

If the caller does not provide a loss byte, any conversion (except to UTF-16) may fail. As this function is "unchecked," the caller is responsible for verifying forward progress; otherwise, a conversion loop risks never terminating.
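To illustrate how `remaining` drives a loop, here is a self-contained mock: `convert` stands in for one call into CFStringGetBytes (it processes at most `step` elements per call), and the forward-progress assertion is one way, an assumption rather than the crate's code, to honor the "unchecked" contract:

```rust
use std::ops::Range;

// Mock of the result type described above.
struct GetBytesResult {
    buf_len: usize,
    remaining: Option<Range<usize>>,
}

// Stand-in for one conversion call: processes at most `step` elements.
fn convert(range: Range<usize>, step: usize) -> GetBytesResult {
    let end = range.end.min(range.start + step);
    GetBytesResult {
        buf_len: end - range.start,
        remaining: if end < range.end { Some(end..range.end) } else { None },
    }
}

// Drive conversion to completion; `remaining` is both the "done" signal and
// the range argument for the next call.
fn convert_all(mut range: Range<usize>, step: usize) -> usize {
    let mut total = 0;
    loop {
        let result = convert(range.clone(), step);
        // Forward-progress check: without a loss byte, a failed conversion
        // could return the same range forever.
        assert!(result.remaining.as_ref() != Some(&range), "no forward progress");
        total += result.buf_len;
        match result.remaining {
            Some(rest) => range = rest,
            None => return total,
        }
    }
}
```

Because `remaining` carries the exact unconverted range, the loop never recomputes indices from a code-unit count.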

The next post will discuss the get_bytes method, which explicitly handles lossy conversions and prevents a potential infinite loop.