Hari's Corner

Humour, comics, tech, law, software, reviews, essays, articles and HOWTOs intermingled with random philosophy now and then

Python's unicode strings and QString gotcha

Filed under: Bits and Bytes by Hari
Posted on Fri, Mar 13, 2009 at 17:47 IST (last updated: Wed, Mar 18, 2009 @ 20:28 IST)

This is more of a small personal note than anything else, but writing a Unicode string to a UTF-8 file is a bit tricky if you're using PyQt and relying on implicit conversion between QString and Python's built-in unicode string.

What I was trying to achieve: get the contents of a text box and save them to a file as Unicode.

Here's what didn't work:
import codecs
from PyQt4 import QtGui

def exportFile (self, filename):
	"""Procedure to export the UNICODE contents to a file"""
	txtOutput = self.findChild (QtGui.QPlainTextEdit, "txtTamil")
	# The QString is handed straight to unicode() here; this is the line that goes wrong
	fcontents = unicode (txtOutput.toPlainText(), "utf-8")
	
	f = codecs.open (filename, "w", encoding="utf-8")
	f.write ( fcontents )
	f.close ()

Most confusing, as the output file ended up containing a series of question marks instead of the actual Unicode characters. Surely I was doing everything right?

After investigating the Python side fully, I turned to Qt's QString class for inspiration. It turns out that you need to convert the QString to a UTF-8 byte sequence first, using QString's toUtf8() function, before calling Python's unicode() function.

Code which works as expected:
import codecs
from PyQt4 import QtGui

def exportFile (self, filename):
	"""Procedure to export the UNICODE contents to a file"""
	txtOutput = self.findChild (QtGui.QPlainTextEdit, "txtTamil")
	# Convert the QString to UTF-8 bytes first, then decode those bytes into a unicode string
	fcontents = unicode (txtOutput.toPlainText().toUtf8(), "utf-8")
	
	f = codecs.open (filename, "w", encoding="utf-8")
	f.write ( fcontents )
	f.close ()
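
To see the difference without a Qt installation, here is a minimal sketch in plain Python. It assumes (I have not confirmed this) that the failing version pushed the text through a lossy 8-bit conversion somewhere along the way, which is imitated here by encoding to ASCII with replacement. The helper names export_file and read_back are made up for the illustration, and the sample text stands in for the text box contents.

```python
import codecs
import os
import tempfile

# Stand-in for the text box contents: the word "Tamil" in Tamil script.
text = u"\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

# Assumption: the failing version lost the characters in an 8-bit conversion.
# Encoding to ASCII with errors="replace" imitates that kind of loss.
lossy = text.encode("ascii", "replace").decode("ascii")

# The working version: go through UTF-8 bytes (the role toUtf8() plays in
# the post), then decode those bytes back into a unicode string.
restored = text.encode("utf-8").decode("utf-8")


def export_file(filename, contents):
    """Write a unicode string to filename, encoded as UTF-8."""
    f = codecs.open(filename, "w", encoding="utf-8")
    try:
        f.write(contents)
    finally:
        f.close()


def read_back(filename):
    """Read a UTF-8 file back into a unicode string."""
    f = codecs.open(filename, "r", encoding="utf-8")
    try:
        return f.read()
    finally:
        f.close()
```

Here restored is identical to text and survives a trip through export_file and read_back, while lossy is nothing but five question marks, one per character.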

5 comment(s)

  1. Although it works, this seems wasteful. The QString is probably using some 16-bit or 32-bit encoding like UCS-2 or UCS-4, which you then convert to UTF-8, then convert back to UCS-2 or UCS-4, and then finally convert to UTF-8 for writing.
    Does "f.write(txtOutput.toPlainText())" work?

    Comment by tim (visitor) on Sat, Mar 14, 2009 @ 21:19 IST #
  2. Tim no... it doesn't work... Indeed it was my first try. It outputs a bunch of question marks instead of the actual UTF-8 characters.

    Couldn't find the reason why though I searched the web. It seems very strange, but the only thing I could fathom is that the automatic conversion of QString to python string is not Unicode aware.

    I tried a lot of stuff with this, but it seems that QString's internal handling of the actual Unicode data is not friendly to Python :(

    Comment by Hari (blog owner) on Sat, Mar 14, 2009 @ 21:28 IST #
  3. That's unfortunate, since converting between UCS and UTF-8 is really rather slow.

    Both the Python unicode string and QString use either 2 or 4 bytes per character depending on compilation options, but the two do not necessarily match (according to the docs, anyway). So, if the number of question marks is double the number of codepoints expected, then Python is using a 2-byte representation and QString is using a 4-byte representation. Vice versa if the number of question marks is half. It is possible that they are using two different encodings with the same size, but not likely.

    Hmm, does writing out the output from toUtf8() work? Of course, you have to turn off Python's utf8 conversion and write it out as plain old bytes.

    Comment by tim (visitor) on Sun, Mar 15, 2009 @ 04:29 IST #
  4. I get this series of errors:

    Traceback (most recent call last):
      File "/home/hari/Projects/PyTamEditor/pytameditor_main.py", line 153, in onFileExport
        self.exportFile (filename)
      File "/home/hari/Projects/PyTamEditor/pytameditor_main.py", line 111, in exportFile
        f.write (txtOutput.toPlainText().toUtf8())
      File "/usr/lib/python2.5/codecs.py", line 638, in write
        return self.writer.write(data)
      File "/usr/lib/python2.5/codecs.py", line 303, in write
        data, consumed = self.encode(object, self.errors)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

    Comment by Hari (blog owner) on Sun, Mar 15, 2009 @ 08:03 IST #
  5. Hey, well, it works actually if I use a normal file() instead of using codecs.open()

    But is that a recommended way of writing to Unicode files? I thought using codecs.open() was the recommended way?

    Comment by Hari (blog owner) on Sun, Mar 15, 2009 @ 08:08 IST #
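
On the question in the last comment, here is a minimal sketch (plain Python, no Qt) of the two approaches side by side. A writer from codecs.open() expects a unicode string and does the encoding itself. In Python 2, handing it a byte string makes Python decode those bytes as ASCII first, which is where the UnicodeDecodeError on byte 0xe0 in the traceback comes from. Bytes that are already UTF-8 encoded belong in a file opened in binary mode. The helper names write_as_text, write_as_bytes and file_bytes are made up for the illustration, and utf8_bytes stands in for what toUtf8() returns.

```python
import codecs
import os
import tempfile

# Stand-in for the text box contents: the word "Tamil" in Tamil script.
text = u"\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

# Stand-in for the toUtf8() result: the same text as UTF-8 encoded bytes.
# The first byte is 0xe0, the byte named in the traceback.
utf8_bytes = text.encode("utf-8")


def write_as_text(filename, contents):
    """Give codecs.open() a unicode string and let it encode to UTF-8."""
    f = codecs.open(filename, "w", encoding="utf-8")
    try:
        f.write(contents)
    finally:
        f.close()


def write_as_bytes(filename, data):
    """Write already-encoded UTF-8 bytes to a file opened in binary mode."""
    f = open(filename, "wb")
    try:
        f.write(data)
    finally:
        f.close()


def file_bytes(filename):
    """Return the raw bytes stored in filename."""
    f = open(filename, "rb")
    try:
        return f.read()
    finally:
        f.close()
```

Both helpers leave exactly the same bytes on disk, so neither is wrong. The choice comes down to whether the program is holding a unicode string or UTF-8 bytes at the point of writing.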
