Entering non-ASCII characters in a worksheet produces warnings, misbehavior, and crashes.
This is because the codebase is vague about which positions are byte indices and which are character offsets, and it also mixes unicode and str objects. Two base premises I would make are:
- Any time we have a Python object holding text, and we have positions relative to it, we must make sure that the Python object can be sliced with those positions; that is, text[start:end] must give the text between start and end.
- The entire codebase must be consistent. We can't have byte indices in some places and character offsets in other places.
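To illustrate the two premises, here is a minimal sketch of what goes wrong when a character offset is applied to a byte string. The sample text and variable names are made up for illustration and are not taken from the codebase; the snippet assumes it is run under Python 3.

```python
# Illustrative only: the same text held as a character string and as UTF-8 bytes.
text = u"caf\u00e9 = 1"          # character string ("café = 1")
data = text.encode("utf-8")      # the same text as UTF-8 bytes

# Positions of the name "café" in each representation.
char_start, char_end = 0, 4      # character offsets
byte_start, byte_end = 0, 5      # byte indices; "é" takes two bytes in UTF-8

# Each kind of position slices its own representation correctly.
assert text[char_start:char_end] == u"caf\u00e9"
assert data[byte_start:byte_end] == u"caf\u00e9".encode("utf-8")

# Mixing them: character offsets applied to the bytes cut "é" in half,
# leaving something that is no longer valid UTF-8.
broken = data[char_start:char_end]
try:
    broken.decode("utf-8")
    mis_sliced = False
except UnicodeDecodeError:
    mis_sliced = True
```

For pure-ASCII text the two kinds of position coincide, which is presumably why the mix-up only shows up once non-ASCII characters are entered.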
So that gives two basic approaches. On one hand, we could always represent buffer text as str objects (Python-2.x; bytes objects in Python-3.x), and use byte indices. On the other hand, we could always represent the buffer text as unicode objects (Python-2.x; str objects in Python-3.x) and use character offsets.
Advantages to bytes: Less conversion at the boundary with GTK+ and hence hopefully more efficient. Possibly more efficient in general, especially for CPython. Avoids problems with UTF-16 and surrogate pairs.
Advantages to Unicode strings: Programs are text. Text should be represented as language string objects. Better if we want to do string operations (like upper/lower-casing) on buffer contents. Won't have to make all our internal string constants b"" for Python-3.0. Less danger of mis-slicing the text and getting something that isn't valid Unicode. (Though there are problems with UTF-16.) Also makes more sense if we ever do alternate frontends to the Worksheet backend.
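As a small sketch of the string-operations point, here is how upper-casing behaves on the two representations. This assumes Python 3, where bytes.upper() only touches ASCII letters; the sample word is made up.

```python
# Illustrative only: upper-casing text held as characters vs. as UTF-8 bytes.
word = u"caf\u00e9"                      # "café"
word_bytes = word.encode("utf-8")

# On a character string, upper-casing handles the accented letter.
upper_text = word.upper()                # "CAFÉ"

# On bytes, only the ASCII letters change; the two bytes of "é" are left alone.
upper_bytes = word_bytes.upper()
round_tripped = upper_bytes.decode("utf-8")   # "CAFé"
```

With the bytes representation, any such operation would need a decode and re-encode around it to behave correctly on non-ASCII text.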
The UTF-16 issue is that GtkTextBuffer works naturally with either byte indices or character offsets, but not with UTF-16 code units. Python can be compiled in UCS-4 mode, where unicode strings are indexed by character, but more commonly it is compiled in "UCS-2" mode, where unicode strings are indexed by 16-bit code units, so surrogate pairs, and even unpaired surrogates (grr!), can occur. If we go with the unicode string representation, I think the best way to handle this is to just forbid the insertion of non-BMP characters into the buffer if Python is compiled in UCS-2 mode, or maybe always, for notebook portability. Being unable to comment your code with Linear B ideograms is not a major problem.
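A minimal sketch of the kind of check this would need follows. The names NARROW_BUILD and contains_non_bmp are hypothetical and do not exist in the codebase, and where exactly the check would hook into buffer insertion is left open.

```python
import sys

# True on a "UCS-2" (narrow) build, where sys.maxunicode is 0xFFFF.
# Python 3.3 and later always report 0x10FFFF.
NARROW_BUILD = sys.maxunicode == 0xFFFF


def contains_non_bmp(text):
    """Return True if text has anything outside the BMP, or any surrogate.

    Hypothetical helper. On a narrow build a non-BMP character shows up as
    two surrogate code units, so testing for surrogates catches it there;
    on a wide build the code point itself is above 0xFFFF. Unpaired
    surrogates are rejected as well.
    """
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF or 0xD800 <= code <= 0xDFFF:
            return True
    return False
```

Insertion could then be refused when contains_non_bmp(text) is true, either only when NARROW_BUILD is set or unconditionally if notebook portability matters more.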