Ticket #74 (closed defect: fixed)

Opened 3 years ago

Last modified 5 months ago

Non-ASCII characters in worksheets

Reported by: otaylor
Assigned to:
Priority: high
Keywords:
Cc:

Description

Entering non-ASCII characters in a worksheet produces warnings, misbehavior, and crashes.

This is because the codebase is very fuzzy about which positions are byte indices and which are character offsets, and it also mixes together unicode and str objects. Two base premises I would make are:

  1. Any time we have a Python object holding text, and we have positions relative to it, we must make sure that the Python object can be sliced with the positions - that text[start:end] gives the text between start and end.
  2. The entire codebase must be consistent. We can't have byte indices in some places and character offsets in other places.
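
A minimal sketch of why the two premises matter, not taken from the Reinteract code. The sample text and variable names are made up for illustration. It shows that byte indices and character offsets diverge as soon as one non-ASCII character appears, so positions are only valid against the representation they were computed from.

```python
# -*- coding: utf-8 -*-
# Illustrative sample only; the names here are hypothetical.
text = u"a = 'ä'; b = 1"        # text as a unicode object
data = text.encode("utf-8")     # the same text as UTF-8 bytes

# Character offsets of "b = 1" within the unicode object.
char_start = text.index(u"b = 1")
char_end = char_start + len(u"b = 1")

# Byte indices of "b = 1" within the encoded bytes.
byte_start = data.index(b"b = 1")
byte_end = byte_start + len(b"b = 1")

# Each kind of position slices its own representation correctly.
assert text[char_start:char_end] == u"b = 1"
assert data[byte_start:byte_end] == b"b = 1"

# 'ä' takes two bytes in UTF-8, so the byte indices are shifted by one.
# Applying them to the unicode object silently returns the wrong span.
mixed = text[byte_start:byte_end]   # u" = 1", not u"b = 1"
```

The mis-slice raises no error, which is why mixing the two kinds of position tends to show up later as odd behavior rather than at the point of the mistake.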

So that gives two basic approaches. On one hand, we could always represent buffer text as str objects (Python-2.x; bytes objects in Python-3.x), and use byte indices. On the other hand, we could always represent the buffer text as unicode objects (Python-2.x; str objects in Python-3.x) and use character offsets.

Advantages to bytes: Less conversion at the boundary with GTK+ and hence hopefully more efficient. Possibly more efficient in general, especially for CPython. Avoids problems with UTF-16 and surrogate pairs.

Advantages to Unicode strings: Programs are text. Text should be represented as language string objects. Better if we want to do string operations (like upper/lower-casing) on buffer contents. Won't have to make all our internal string constants b"" for Python-3.0. Less danger of mis-slicing the text and getting something that isn't valid Unicode. (Though problems with UTF-16.) Also makes more sense if we ever do alternate frontends to the Worksheet backend.
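
To make the "conversion at the boundary" cost concrete, here is a rough sketch of converting between the two kinds of position. The helper names are hypothetical and are not Reinteract or GTK+ functions. It assumes UTF-8 on the byte side and a Python where unicode strings are indexed by character.

```python
# -*- coding: utf-8 -*-
def char_offset_to_byte_index(text, offset):
    """Return the UTF-8 byte index matching a character offset in text."""
    # Encode only the prefix; its encoded length is the byte index.
    return len(text[:offset].encode("utf-8"))


def byte_index_to_char_offset(data, index):
    """Return the character offset matching a UTF-8 byte index in data.

    The index must fall on a character boundary. If it lands in the
    middle of a multi-byte sequence, decoding raises UnicodeDecodeError.
    """
    return len(data[:index].decode("utf-8"))


text = u"a = 'ä'; b = 1"
data = text.encode("utf-8")

byte_index = char_offset_to_byte_index(text, 9)      # 10
char_offset = byte_index_to_char_offset(data, 10)    # 9
```

Both helpers scan the prefix, so each conversion is linear in the position. That is one way the choice of representation could affect performance, though it says nothing about how large the effect is in practice.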

The UTF-16 issue is that GtkTextBuffer works naturally with either byte indices or character offsets, but a Python unicode string is not always indexed by character. Python can be compiled in UCS-4 mode where unicode strings are indexed by character, but more common is to have it compiled in "UCS-2" mode where unicode strings are indexed by UTF-16 code units, so surrogate pairs, and even unpaired surrogates (grr!), can occur. If we go with unicode string representation, I think the best way to handle this is to just forbid the insertion of non-BMP characters into the buffer if Python is compiled in UCS-2 mode, or maybe always for notebook portability. Prevention of commenting your code with Linear B Ideograms is not a major problem.
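
One way the "forbid non-BMP characters" idea could look is sketched below. This is not the code from the ticket's fix, and the function names are hypothetical. The check rejects both code points above U+FFFF (which is how non-BMP characters appear when strings are indexed by character) and surrogate code units (which is how they appear in a "UCS-2" build).

```python
# -*- coding: utf-8 -*-
def find_non_bmp(text):
    """Return the offsets in text that hold a non-BMP character or a surrogate."""
    offsets = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        # > 0xFFFF: a non-BMP character stored as a single element.
        # 0xD800..0xDFFF: half of a surrogate pair, or an unpaired surrogate.
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            offsets.append(i)
    return offsets


def check_insertable(text):
    """Return text unchanged, or raise ValueError if it contains non-BMP content."""
    bad = find_non_bmp(text)
    if bad:
        raise ValueError("non-BMP characters at offsets %r" % (bad,))
    return text


check_insertable(u"A = 'ä'")             # fine: 'ä' is in the BMP
# check_insertable(u"x = '\U00010000'")  # would raise ValueError (Linear B syllable)
```

Applying the check always, rather than only in UCS-2 builds, matches the "notebook portability" option: a worksheet accepted on one machine would then never contain text that another build indexes differently.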

Change History

04/02/09 07:39:02 changed by akaihola

I'm currently working around this by always escaping non-ASCII in strings (e.g. u'ä' -> u'\xe4'). It would be nice if Reinteract handled non-ASCII characters in worksheets more gracefully than by crashing. My native language uses some umlaut characters, so I easily forget not to type them in Reinteract, and I easily lose my work if I then calculate before saving.

+1 for Unicode objects in the entire codebase.
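
The escaping workaround described in this comment can be automated. The snippet below is only an illustration using Python's standard unicode_escape codec; the helper name is hypothetical and nothing like it is claimed to exist in Reinteract.

```python
# -*- coding: utf-8 -*-
def escape_non_ascii(text):
    """Return an ASCII-only form of text, usable inside a u'...' literal."""
    # The unicode_escape codec turns 'ä' into the four characters \xe4.
    # It also escapes backslashes and newlines.
    return text.encode("unicode_escape").decode("ascii")


escaped = escape_non_ascii(u"päivä")   # p\xe4iv\xe4 (ASCII only)

# Decoding with the same codec recovers the original text.
restored = escaped.encode("ascii").decode("unicode_escape")
```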

04/02/09 09:34:08 changed by otaylor

I did some work over the weekend to implement the Unicode-strings approach. You can find it in the unicode-internals branch of the Reinteract git repository.

I want to understand the performance implications a bit better before I push it to the main branch - it's certainly not obviously slow, but I'd like to know "how much slower".

04/04/09 08:03:30 changed by akaihola

Fantastic! I tested the unicode-internals branch and it seems to work fine for both Unicode literals and UTF-8 bytestring literals.

I did run into one strange issue, though. If a library with non-ASCII text and without a coding declaration is imported, it won't work for the rest of the session even if the coding declaration is added. Here's how to reproduce this.

Create and save a library testlib:

A = 'ä'

Create a worksheet:

import testlib

Evaluate. Throws:

SyntaxError: Non-ASCII character '\xc3' in file
/home/akaihola/Documents/Reinteract/unicodetest/testlib.py
on line 1, but no encoding declared;
see http://www.python.org/peps/pep-0263.html
for details (testlib.py, line 1)

Edit and save testlib:

# -*- coding: utf-8 -*-
A = 'ä'

Re-evaluate the worksheet. Result:

from testlib import A
AssertionError

Close and restart Reinteract, then re-evaluate. Works fine.

04/04/09 11:43:47 changed by otaylor

09/17/11 14:13:44 changed by otaylor

  • status changed from new to closed.
  • resolution set to fixed.

This landed a long time ago; closing the ticket.