Rust API Bindings: CFStringGetBytes Is Hard, Part 2
Using Rust’s features, we can provide API bindings to CFStringGetBytes that prevent unsupported argument combinations at the call site, fixing the problems identified in the previous post.
The last post described the complexities I encountered when building Rust bindings for CFStringGetBytes. This post shares the rationale behind my design choices for the first bindings layer. This lowest-level layer mitigates the four problems identified in the previous post.
First, let’s review the Core Foundation C API:
String.subproj/CFString.h, line 357:
CFIndex CFStringGetBytes(CFStringRef theString, CFRange range, CFStringEncoding encoding, UInt8 lossByte, Boolean isExternalRepresentation, UInt8 *buffer, CFIndex maxBufLen, CFIndex *usedBufLen);
And compare it to the most direct Rust interface implemented in my crate:
src/string.rs, lines 605-641:
impl String {
    pub fn get_bytes_unchecked(
        &self,
        range: impl RangeBounds<usize>,
        encoding: GetBytesEncoding,
        buf: Option<&mut [u8]>,
    ) -> GetBytesResult { /* ... */ }
}
- &self represents the CFStringRef pointer, which is typical for bindings of object-oriented interfaces.
- The CFRange parameter equivalent is an impl RangeBounds<usize>, enabling the caller to use a Rust range expression. A caller may, for example, pass .. to specify the full range of the string.
  - CFRange's fields are of type CFIndex, which is a signed type. The first post in this mini-series discussed the design choices in implementing unsigned to signed conversion.
- GetBytesEncoding replaces CFStringEncoding and also subsumes lossByte and isExternalRepresentation. The following section provides more detail.
- buf: Option<&mut [u8]> captures the optional UInt8 *buffer and CFIndex maxBufLen arguments. The Option type clearly expresses that a buffer is not required. If supplied, though, the slice provides the buffer’s length.
- The GetBytesResult return type includes the C API’s return value (the number of UTF-16 code units converted) and the out parameter in the C API, usedBufLen.
- The _unchecked suffix hints that this method has a quirk the caller must handle. The following sections elaborate on this behavior.
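To make the parameter mapping above concrete, here is a minimal sketch of how a range expression and an optional buffer might be lowered to the C arguments. It is not the crate's implementation: CfRange, resolve_range, and buffer_args are hypothetical names, and CFIndex is assumed to be a pointer-sized signed integer (isize).

```rust
use std::convert::TryFrom;
use std::ops::{Bound, RangeBounds};

/// Hypothetical stand-in for `CFRange`; both fields are signed like `CFIndex`.
#[derive(Debug, PartialEq)]
struct CfRange {
    location: isize,
    length: isize,
}

/// Resolves a Rust range against a string of `len` UTF-16 code units.
/// Returns `None` if the range is inverted, out of bounds, or overflows.
fn resolve_range(range: impl RangeBounds<usize>, len: usize) -> Option<CfRange> {
    let start = match range.start_bound() {
        Bound::Included(&s) => s,
        Bound::Excluded(&s) => s.checked_add(1)?,
        Bound::Unbounded => 0,
    };
    let end = match range.end_bound() {
        Bound::Included(&e) => e.checked_add(1)?,
        Bound::Excluded(&e) => e,
        Bound::Unbounded => len,
    };
    if start > end || end > len {
        return None;
    }
    // The unsigned-to-signed conversion is checked rather than cast.
    Some(CfRange {
        location: isize::try_from(start).ok()?,
        length: isize::try_from(end - start).ok()?,
    })
}

/// Splits an optional buffer into a `buffer` / `maxBufLen` pair.
fn buffer_args(buf: Option<&mut [u8]>) -> (*mut u8, isize) {
    match buf {
        // A Rust slice never exceeds `isize::MAX` bytes, so the cast is lossless.
        Some(b) => (b.as_mut_ptr(), b.len() as isize),
        // No buffer: a null pointer with a zero length.
        None => (std::ptr::null_mut(), 0),
    }
}
```

With this sketch, resolve_range(.., 5) covers the whole string, while an out-of-bounds range such as 2..9 is rejected before it could reach Core Foundation.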
This method implements a check to mitigate the fourth problem identified in the previous post:
"If encoding == kCFStringEncodingUTF16, isExternalRepresentation == true, and maxBufLen < 2, Core Foundation will overrun the buffer when writing the BOM. (The UTF-32 BOM write does validate the buffer’s capacity.)"
Without this check, the function’s use could lead to unsoundness, so it would need the unsafe qualifier.
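The post does not show the check itself, so the following is only a sketch of what such a guard could look like. The function name utf16_bom_fits is hypothetical, the enum mirrors the GetBytesByteOrder type introduced later in this post, and treating a missing buffer as safe is an assumption.

```rust
/// Byte order options, mirroring the enum shown later in this post.
#[derive(Clone, Copy)]
enum GetBytesByteOrder {
    BigEndian,
    HostNative { include_bom: bool },
    LittleEndian,
}

/// Size in bytes of a UTF-16 byte order mark (U+FEFF as one 16-bit code unit).
const UTF16_BOM_LEN: usize = 2;

/// Hypothetical guard for the UTF-16 case: returns `false` when Core
/// Foundation would be asked to write a BOM into a buffer too small to hold it.
fn utf16_bom_fits(byte_order: GetBytesByteOrder, buf: Option<&[u8]>) -> bool {
    match (byte_order, buf) {
        // Only the host-native order with a BOM requested is affected.
        (GetBytesByteOrder::HostNative { include_bom: true }, Some(b)) => {
            b.len() >= UTF16_BOM_LEN
        }
        // Assumption: without a buffer, or without a BOM, nothing is overrun.
        _ => true,
    }
}
```

A safe wrapper would run a guard like this before calling into Core Foundation and decline to make the call when it returns false.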
GetBytesEncoding
The GetBytesEncoding enum encompasses the CFStringEncoding, lossByte, and isExternalRepresentation arguments.
src/string.rs, lines 126-161:
pub enum GetBytesEncoding {
    CharacterSet {
        character_set: CharacterSet,
        /// **Note:** Core Foundation will process surrogate pairs as two individual lossy code
        /// points, so the number of output code points will equal the number of input code units.
        loss_byte: Option<NonZeroU8>,
    },
    Utf8,
    Utf16 {
        byte_order: GetBytesByteOrder,
    },
    Utf32 {
        byte_order: GetBytesByteOrder,
        loss_byte: Option<NonZeroU8>,
    },
}
- The CharacterSet enum encompasses all the non-Unicode encodings. (The corresponding string creation function, CFStringCreateWithBytes, is quirky, too. The string construction bindings also specifically handle UTF-8, UTF-16, and UTF-32, and handle all non-Unicode representations with the CharacterSet enum.)
  - The loss_byte field in the bindings is an Option of NonZeroU8 to express, through the type system, that a loss byte, if used, must have a non-zero value.
  - Unfortunately, the use of a comment was the only mitigation I could find for the second problem identified in the previous post: "A code point encoded as a surrogate pair becomes two lossy code points for non-Unicode encodings."
- The Utf8 encoding does not have a loss byte, mitigating potential unexpected behavior in reading the call site and documentation, identified in part a of the first problem in the previous post: "Even if the caller provides a lossByte, the function does not process the code unit as a lossy conversion." The absence of a conversion fallback may suggest that UTF-8 conversion, like UTF-16 conversion, cannot fail, which is a contributing factor to the method’s _unchecked suffix: the behavior of a straightforward call may not align with default assumptions.
- Utf16 has a GetBytesByteOrder field, discussed below, which implements the isExternalRepresentation argument.
- Utf32, like Utf16, also has a GetBytesByteOrder field and, similar to CharacterSet, has a loss byte to handle invalid surrogates (clarifying UTF-32 is the only Unicode target encoding that implements loss byte support, as described in part b of the first problem in the previous post).
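As a small illustration of the Option<NonZeroU8> choice, the sketch below shows how the optional loss byte could be lowered to the C API's lossByte argument, where 0 means no lossy conversion. The helper name loss_byte_arg is hypothetical and not taken from the crate.

```rust
use std::num::NonZeroU8;

/// Hypothetical helper: converts the binding's optional loss byte into the
/// C API's `lossByte` argument, where 0 means "no lossy conversion".
fn loss_byte_arg(loss_byte: Option<NonZeroU8>) -> u8 {
    loss_byte.map_or(0, NonZeroU8::get)
}

/// Builds a loss byte from a raw value; a zero value yields `None`, so an
/// explicit loss byte can never collide with the "not lossy" sentinel.
fn question_mark() -> Option<NonZeroU8> {
    NonZeroU8::new(b'?')
}
```

Because NonZeroU8::new(0) returns None, a caller cannot construct a Some loss byte that Core Foundation would read as "no loss byte".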
UTF-16 and UTF-32 encode text as 16-bit and 32-bit code units, which may be serialized in big-endian or little-endian byte order. GetBytesByteOrder enumerates the supported options.
src/string.rs, lines 109-124:
pub enum GetBytesByteOrder {
    BigEndian,
    HostNative {
        include_bom: bool,
    },
    LittleEndian,
}
Core Foundation only supports writing a byte order mark, or BOM (isExternalRepresentation = true), when using the host’s native byte order. This enum prevents callers from specifying unsupported combinations and, therefore, receiving unexpected results.
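One way to picture this constraint is as a mapping from the enum to the pair of C arguments it stands for. The sketch below covers UTF-16 only. The helper name utf16_args is hypothetical, the constants carry the values of the corresponding CFStringEncoding definitions in CFString.h, and the crate's actual mapping may differ.

```rust
/// Byte order options, mirroring the enum shown above.
#[derive(Clone, Copy)]
enum GetBytesByteOrder {
    BigEndian,
    HostNative { include_bom: bool },
    LittleEndian,
}

// `CFStringEncoding` values as defined in Core Foundation's CFString.h.
const K_CF_STRING_ENCODING_UTF16: u32 = 0x0100;
const K_CF_STRING_ENCODING_UTF16_BE: u32 = 0x1000_0100;
const K_CF_STRING_ENCODING_UTF16_LE: u32 = 0x1400_0100;

/// Hypothetical helper: returns the `(encoding, isExternalRepresentation)`
/// pair for a UTF-16 conversion.
fn utf16_args(byte_order: GetBytesByteOrder) -> (u32, bool) {
    match byte_order {
        // Explicit byte orders never request a BOM.
        GetBytesByteOrder::BigEndian => (K_CF_STRING_ENCODING_UTF16_BE, false),
        GetBytesByteOrder::LittleEndian => (K_CF_STRING_ENCODING_UTF16_LE, false),
        // Only the host-native encoding may carry the BOM flag.
        GetBytesByteOrder::HostNative { include_bom } => {
            (K_CF_STRING_ENCODING_UTF16, include_bom)
        }
    }
}
```

Since include_bom exists only on the HostNative variant, a combination such as "big endian with BOM" cannot be written at the call site.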
Together, GetBytesEncoding::Utf8 and the implementation of isExternalRepresentation in GetBytesByteOrder mitigate the third problem identified in the previous post:
"isExternalRepresentation does not encode a BOM for UTF-8 despite the implication in the comment."
GetBytesResult
The function returns a struct incorporating the C API’s return value and out parameter.
src/string.rs, lines 202-222:
pub struct GetBytesResult {
    pub buf_len: usize,
    pub remaining: Option<Range<usize>>,
}
If buf was Some, the buf_len field contains the number of bytes written into the slice. Otherwise, it contains the number of bytes required to convert the processed range.
If the call converted the entire input range, the remaining field is None. Otherwise, it contains the portion of the input range not converted during the call. Returning a Range instead of the number of UTF-16 code units converted simplifies conversion loop implementations by providing a clear "done" signal and the range argument value for the next call.
If the caller does not provide a loss byte, any conversion (except to UTF-16) may fail. As this function is "unchecked," it’s the caller’s responsibility to check for forward progress, or it risks never terminating.
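The loop shape this implies can be sketched without Core Foundation. In the example below, mock_get_bytes is a hypothetical stand-in for get_bytes_unchecked that only converts ASCII and stops at the first code unit it cannot convert. It assumes the real method reports a failed conversion as a remaining range that starts at the failing code unit.

```rust
use std::ops::Range;

/// Same shape as the result struct shown above.
struct GetBytesResult {
    buf_len: usize,
    remaining: Option<Range<usize>>,
}

/// Hypothetical stand-in for `get_bytes_unchecked`: copies ASCII code units
/// into `buf` and stops at the first non-ASCII unit or when `buf` is full.
fn mock_get_bytes(units: &[u16], range: Range<usize>, buf: &mut [u8]) -> GetBytesResult {
    let mut written = 0;
    let mut pos = range.start;
    while pos < range.end && written < buf.len() && units[pos] < 0x80 {
        buf[written] = units[pos] as u8;
        written += 1;
        pos += 1;
    }
    GetBytesResult {
        buf_len: written,
        remaining: if pos < range.end { Some(pos..range.end) } else { None },
    }
}

/// Drains the whole input through a small buffer. Returns `Err` with the
/// index of the code unit at which the conversion stopped making progress.
fn convert_all(units: &[u16]) -> Result<Vec<u8>, usize> {
    let mut out = Vec::new();
    let mut buf = [0u8; 4];
    let mut range = 0..units.len();
    loop {
        let result = mock_get_bytes(units, range.clone(), &mut buf);
        out.extend_from_slice(&buf[..result.buf_len]);
        match result.remaining {
            // The "done" signal: the entire range was converted.
            None => return Ok(out),
            // No forward progress: stop instead of looping forever.
            Some(rest) if rest.start == range.start => return Err(rest.start),
            // Partial progress: continue with the unconverted tail.
            Some(rest) => range = rest,
        }
    }
}
```

The guard on rest.start is the forward-progress check: without it, an input containing an unconvertible code unit would return the same remaining range on every iteration.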
The next post will discuss the get_bytes method, which explicitly handles lossy conversions and prevents a potential infinite loop.