AttributedString’s Codable format and what it has to do with Unicode


Here’s a simple AttributedString with some formatting:

import Foundation

let str = try! AttributedString(
  markdown: "Café **Sol**",
  options: .init(interpretedSyntax: .inlineOnly)
)
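
Before we look at encoding, here’s a quick (hedged) sketch of how you could inspect the two runs this produces; the exact printed form of the attribute value may vary, but the string should split into an unstyled “Café ” run and a strongly emphasized “Sol” run:

// Iterate the runs and print each run’s text and its inline presentation intent.
for run in str.runs {
  print(String(str.characters[run.range]), run.inlinePresentationIntent as Any)
}
// "Café " nil
// "Sol" Optional(stronglyEmphasized)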

AttributedString is Codable. If your task were to design the encoding format for an attributed string, what would you come up with? Something like this seems reasonable (in JSON with comments):

{
  "text": "Café Sol",
  "runs": [
    {
      // start..<end in Character offsets
      "range": [5, 8],
      "attrs": {
        "strong": true
      }
    }
  ]
}

This stores the text alongside an array of runs of formatting attributes. Each run consists of a character range and an attribute dictionary.
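
For illustration, a minimal Codable sketch that would produce roughly this format might look like the following. The type and property names are hypothetical, purely for demonstration; this is not Foundation API:

import Foundation

// Hypothetical, naive encoding: text plus runs keyed by Character offsets.
struct NaiveAttributedString: Codable {
  struct Run: Codable {
    var range: [Int]            // start..<end in Character offsets — the brittle part
    var attrs: [String: Bool]
  }
  var text: String
  var runs: [Run]
}

let naive = NaiveAttributedString(
  text: "Café Sol",
  runs: [.init(range: [5, 8], attrs: ["strong": true])]
)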

But this format is bad and can break in various ways. The problem is that the character offsets that define the runs aren’t guaranteed to be stable. The definition of what constitutes a Character, i.e. a user-perceived character, or a Unicode grapheme cluster, can and does change in new Unicode versions. If we decoded an attributed string that had been serialized

  • on a different OS version (before Swift 5.6, Swift used the OS’s Unicode library to determine character boundaries),
  • or by code compiled with a different Swift version (since Swift 5.6, Swift uses its own grapheme-breaking algorithm, which is updated alongside the Unicode standard),

then the character ranges might no longer represent the original intent, or might even become invalid.

Update April 11, 2024: See this Swift forum post I wrote for an example where the Unicode rules for grapheme cluster segmentation changed for flag emoji. This change caused a corresponding change in how Swift counts the Characters in a string containing consecutive flags, such as "🇦🇷🇯🇵".
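
To see why Character counts aren’t stable, here’s a quick check. The counts in the comments are what current Swift versions produce; under older grapheme-breaking rules, the same string of consecutive regional indicator scalars could report a different Character count:

let flags = "🇦🇷🇯🇵"                // two flag emoji = four regional indicator scalars

print(flags.count)                 // 2 — pairs of regional indicators form one grapheme cluster each
print(flags.unicodeScalars.count)  // 4
print(flags.utf8.count)            // 16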

Normalization forms

So let’s use UTF-8 byte offsets for the ranges, I hear you say. This avoids the first issue but still isn’t safe, because some characters, such as the é in the example string, have more than one representation in Unicode: it can be either the standalone character é (Latin small letter e with acute) or the combination of e + ◌́ (Combining acute accent). The Unicode standard calls these variants normalization forms. The first form needs 2 bytes in UTF-8, whereas the second uses 3 bytes, so subsequent ranges would be off by one if the string and the ranges used different normalization forms.
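
You can observe the two normalization forms directly in Swift. Note that the two strings compare equal (Swift’s String equality uses canonical equivalence), even though their UTF-8 byte counts differ:

let precomposed = "\u{E9}"     // é as a single scalar (Latin small letter e with acute)
let decomposed  = "e\u{301}"   // e followed by a combining acute accent

print(precomposed == decomposed)   // true — canonically equivalent
print(precomposed.utf8.count)      // 2
print(decomposed.utf8.count)       // 3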

Now in theory, the string itself and the ranges should use the same normalization form upon serialization, avoiding the problem. But this is almost impossible to guarantee if the serialized data passes through other systems that may (inadvertently or not) change the Unicode normalization of the strings that pass through them.

A safer option would be to store the text not as a string but as a blob of UTF-8 bytes, because serialization/networking/storage layers generally don’t mess with binary data. But even then you’d have to be careful in the encoding and decoding code to apply the formatting attributes before any normalization takes place. Depending on how your programming language handles Unicode, this may not be so easy.
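
A sketch of that idea might look like this. The types and the byte offsets are hypothetical (offsets assume the precomposed form of “é”); the point is only that Data passes through serialization layers untouched, unlike a String:

import Foundation

// Hypothetical format: text stored as raw UTF-8 bytes so intermediate layers
// can’t re-normalize it. Not Foundation API.
struct BlobAttributedString: Codable {
  struct Run: Codable {
    var range: [Int]            // start..<end in UTF-8 byte offsets into utf8Text
    var attrs: [String: Bool]
  }
  var utf8Text: Data            // JSONEncoder serializes Data as Base64
  var runs: [Run]
}

let blob = BlobAttributedString(
  utf8Text: Data("Café Sol".utf8),
  runs: [.init(range: [6, 9], attrs: ["strong": true])]
)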

The people on the Foundation team know all this, of course, and chose a better encoding format for AttributedString. Let’s take a look.

let encoder = JSONEncoder()
encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
let jsonData = try encoder.encode(str)
let json = String(decoding: jsonData, as: UTF8.self)

This is how our sample string is encoded:

[
  "Café ",
  {

  },
  "Sol",
  {
    "NSInlinePresentationIntent" : 2
  }
]

This is an array of runs, where each run consists of a text segment and a dictionary of formatting attributes. The important point is that the formatting attributes are directly associated with the text segments they belong to, not indirectly via brittle byte or character offsets. (This encoding format is also more space-efficient and possibly better represents the in-memory layout of AttributedString, but that’s beside the point for this discussion.)
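
Decoding goes through the same Codable conformance, so the round trip is symmetric. Since the inline presentation intent attribute is codable, I’d expect the decoded value to compare equal to the original:

let decoded = try JSONDecoder().decode(AttributedString.self, from: jsonData)
print(decoded == str)   // true — the strong attribute on "Sol" survives the round trip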

There’s still a (smaller) potential problem here if the character boundary rules change for code points that span two adjacent text segments: the last character of run N and the first character of run N+1 might suddenly form a single character (grapheme cluster) in a new Unicode version. In that case, the decoding code will have to decide which formatting attributes to apply to this new character. But this is a much smaller issue because it only affects the characters in question. Unlike our original example, where an off-by-one error in run N would affect all subsequent runs, all other runs are untouched.
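
As a plain-string illustration of how scalars from adjacent segments can merge into a single Character (this is just Swift’s grapheme breaking, not AttributedString-specific):

let endOfRunN     = "e"        // last scalar of run N
let startOfRunN1  = "\u{301}"  // combining acute accent, first scalar of run N+1

print(endOfRunN.count)                      // 1
print(startOfRunN1.count)                   // 1
print((endOfRunN + startOfRunN1).count)     // 1 — the two scalars combine into a single "é"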

Related forum discussion: Itai Ferber on why Character isn’t Codable.

We can extract a general lesson from this: don’t store string indices or offsets if you can avoid it. They aren’t stable over time or across runtime environments.
