Python's unicode strings and QString gotcha

Filed under: Bits and Bytes by Hari
Posted at 17:47 IST (last updated: 18 Mar 2009 @ 20:28 IST)
This is more of a small personal note than anything else, but writing Unicode text to a UTF-8 file is a bit tricky if you're using PyQt and relying on implicit conversion between QString and Python's built-in unicode type.

What I was trying to achieve: get the contents of a text box and save its Unicode contents to a file.

Here's what didn't work:
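# Module-level imports the method relies on
import codecs
from PyQt4 import QtGui

def exportFile(self, filename):
    """Export the Unicode contents of the text box to a file."""
    txtOutput = self.findChild(QtGui.QPlainTextEdit, "txtTamil")
    # Problem line: the QString goes straight into unicode() with an encoding
    fcontents = unicode(txtOutput.toPlainText(), "utf-8")

    # codecs.open() gives a writer that encodes unicode text as UTF-8
    f = codecs.open(filename, "w", encoding="utf-8")
    f.write(fcontents)
    f.close()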

Most confusing: the output file ended up containing a series of question marks instead of the actual Unicode characters. Surely I was doing everything right?

After investigating the Python side fully, I turned to Qt's QString class for inspiration. It turns out that you need to convert the QString to a UTF-8 byte string first, using QString's toUtf8() function, before calling Python's unicode() function.

Code which works as expected:
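# Module-level imports the method relies on
import codecs
from PyQt4 import QtGui

def exportFile(self, filename):
    """Export the Unicode contents of the text box to a file."""
    txtOutput = self.findChild(QtGui.QPlainTextEdit, "txtTamil")
    # Get explicit UTF-8 bytes from the QString, then decode them to unicode
    fcontents = unicode(txtOutput.toPlainText().toUtf8(), "utf-8")

    # codecs.open() gives a writer that encodes unicode text as UTF-8
    f = codecs.open(filename, "w", encoding="utf-8")
    f.write(fcontents)
    f.close()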
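
Why question marks? A plausible explanation (an assumption, not something confirmed in this post) is that passing an encoding to unicode() makes Python ask the QString for a plain byte string first, and that implicit conversion goes through a narrow encoding which replaces every character it can't represent with "?". The sketch below imitates both paths in plain Python 3, with no PyQt involved. The function names are made up for illustration, and ASCII merely stands in for whatever narrow encoding is really used.

```python
# Illustration only: plain Python 3, no PyQt. ASCII stands in for the narrow
# encoding that an implicit QString-to-str conversion may use (assumption).

TAMIL = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # the word "Tamil" in Tamil script


def lossy_round_trip(text):
    """Imitate the failing path: narrow bytes first, then decode as UTF-8."""
    narrow = text.encode("ascii", errors="replace")  # unencodable chars -> b"?"
    return narrow.decode("utf-8")


def utf8_round_trip(text):
    """Imitate the working path: explicit UTF-8 bytes, then decode as UTF-8."""
    return text.encode("utf-8").decode("utf-8")


if __name__ == "__main__":
    print(lossy_round_trip(TAMIL))          # five question marks
    print(utf8_round_trip(TAMIL) == TAMIL)  # True
```

If that explanation is right, the data is already lost before unicode() ever sees it, which is why nothing on the Python side could fix it. It would also suggest that calling unicode() on the QString with no encoding argument is worth a try in PyQt4, though that is not tested here.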

5 comment(s)

  1. Although it works, this seems wasteful. The QString is probably using some 16-bit or 32-bit encoding like UCS-2 or UCS-4, which you then convert to UTF-8, then convert back to UCS-2 or UCS-4, and then finally convert to UTF-8 for writing.
    Does "f.write(txtOutput.toPlainText())" work?

    Comment by tim (visitor) on 14 Mar 2009 @ 21:19 IST #
  2. Tim no... it doesn't work... Indeed it was my first try. It outputs a bunch of question marks instead of the actual UTF-8 characters.

    I couldn't find the reason why, though I searched the web. It seems very strange, but the only thing I could fathom is that the automatic conversion of QString to a Python string is not Unicode-aware.

    I tried a lot of stuff with this, but it seems that QString's internal handling of the actual Unicode data is not friendly to Python :(

    Comment by Hari (blog owner) on 14 Mar 2009 @ 21:28 IST #
  3. That's unfortunate, since converting between UCS and UTF-8 is really rather slow.

    Both the Python Unicode string and QString use either 2 or 4 bytes per character depending on compilation options, but the two don't necessarily match (according to the docs, anyway). So, if the number of question marks is double the number of codepoints expected, then Python is using a 2-byte representation and QString is using a 4-byte representation. Vice versa if the number of question marks is half. It is possible that they are using two different encodings with the same size, but not likely.

    Hmm, does writing out the output from toUtf8() work? Of course, you have to turn off Python's utf8 conversion and write it out as plain old bytes.

    Comment by tim (visitor) on 15 Mar 2009 @ 04:29 IST #
  4. I get this series of errors:

    Traceback (most recent call last):
      File "/home/hari/Projects/PyTamEditor/pytameditor_main.py", line 153, in onFileExport
        self.exportFile (filename)
      File "/home/hari/Projects/PyTamEditor/pytameditor_main.py", line 111, in exportFile
        f.write (txtOutput.toPlainText().toUtf8())
      File "/usr/lib/python2.5/codecs.py", line 638, in write
        return self.writer.write(data)
      File "/usr/lib/python2.5/codecs.py", line 303, in write
        data, consumed = self.encode(object, self.errors)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

    Comment by Hari (blog owner) on 15 Mar 2009 @ 08:03 IST #
  5. Hey, well, it works actually if I use a normal file() instead of using codecs.open()

    But is that a recommended way of writing to Unicode files? I thought using codecs.open() was the recommended way?

    Comment by Hari (blog owner) on 15 Mar 2009 @ 08:08 IST #
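
On the last two comments: the traceback and the file() versus codecs.open() question both seem to come down to whether the file is handed text or bytes. A writer from codecs.open() expects Unicode text and does the encoding itself; given a Python 2 byte string, it first tries to decode it as ASCII, which is where byte 0xe0 trips it up. A file opened in binary mode simply writes the bytes it is given. Here is a small sketch in Python 3 (no PyQt; the helper names and temporary files are made up for illustration, and the built-in open() with an encoding stands in for codecs.open()) showing that both routes produce the same file, plus the ASCII decode failure on its own.

```python
import os
import tempfile

TAMIL = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # the word "Tamil" in Tamil script


def write_as_text(path, text):
    """Hand Unicode text to an encoding-aware writer; it encodes to UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def write_as_bytes(path, text):
    """Encode to UTF-8 ourselves and write raw bytes to a binary-mode file."""
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))


def read_bytes(path):
    """Return the raw bytes stored in the file."""
    with open(path, "rb") as f:
        return f.read()


def ascii_decode_error(data):
    """Return the error message from decoding bytes as ASCII, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    tmpdir = tempfile.mkdtemp()
    text_path = os.path.join(tmpdir, "as_text.txt")
    bytes_path = os.path.join(tmpdir, "as_bytes.txt")
    write_as_text(text_path, TAMIL)
    write_as_bytes(bytes_path, TAMIL)
    print(read_bytes(text_path) == read_bytes(bytes_path))  # True
    print(ascii_decode_error(TAMIL.encode("utf-8")))
```

So either approach should be fine as long as the two aren't mixed: give Unicode text to a codecs.open() file, or give already-encoded bytes to a plain binary file.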
