How do I convert a file's format from Unicode to ASCII using Python?
-
I use a 3rd party tool that outputs a file in Unicode format. However, I prefer it to be in ASCII. The tool does not have settings to change the file format. What is the best way to convert the entire file format using Python?
-
Answer:
You can convert the file easily enough with the unicode function alone, but you'll run into problems with Unicode characters that have no direct ASCII equivalent. http://www.peterbe.com/plog/unicode-to-ascii recommends the unicodedata module (http://www.python.org/doc/2.5.2/lib/module-unicodedata.html), which roughly converts characters that have no corresponding ASCII value. For example:

    >>> title = u"Klüft skräms inför på fédéral électoral große"

is typically converted to

    Klft skrms infr p fdral lectoral groe

which is pretty wrong. However, using the unicodedata module, the result can be much closer to the original text:

    >>> import unicodedata
    >>> unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
    'Kluft skrams infor pa federal electoral groe'
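In Python 3, where str is already Unicode, the same normalization might look like the sketch below. The helper name to_ascii is made up for illustration, and the session above is Python 2, so treat this as an assumption-laden translation and not as the answer's own code.

```python
import unicodedata

def to_ascii(text):
    """Approximate text in ASCII by dropping accents (hypothetical helper)."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # 'ignore' drops whatever still has no ASCII form: the marks, and 'ß'.
    return decomposed.encode("ascii", "ignore").decode("ascii")

title = "Klüft skräms inför på fédéral électoral große"
print(to_ascii(title))
```

Note that characters without a decomposition, such as 'ß', are silently dropped, which is why "große" comes out as "groe".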
Ray Vega at Stack Overflow
Other answers
I think this is a deeper issue than you realize. Simply changing the file from Unicode into ASCII is easy; getting all of the Unicode characters to translate into reasonable ASCII counterparts is another matter, because many characters have no ASCII equivalent. This Python Unicode tutorial may give you a better idea of what happens to Unicode strings that are translated to ASCII: http://www.reportlab.com/i18n/python_unicode_tutorial.html

Here's a useful quote from the site:

> Python 1.6 also gets a "unicode" built-in function, to which you can specify the encoding:
>
>     >>> unicode('hello')
>     u'hello'
>     >>> unicode('hello', 'ascii')
>     u'hello'
>     >>> unicode('hello', 'iso-8859-1')
>     u'hello'
>
> All three of these return the same thing, since the characters in 'Hello' are common to all three encodings.
>
> Now let's encode something with a European accent, which is outside of ASCII. What you see at a console may depend on your operating system locale; Windows lets me type in ISO-Latin-1.
>
>     >>> a = unicode('André','latin-1')
>     >>> a
>     u'Andr\202'
>
> If you can't type an acute letter e, you can enter the string 'Andr\202', which is unambiguous. Unicode supports all the common operations such as iteration and splitting. We won't run over them here.

IMHO, PConroy has a good answer. I'd vote him up, but I'm out for today :D
Pete Karl II
Like this:

    uc = open(filename).read().decode('utf8')
    ascii = uc.encode('ascii')

Note, however, that this will fail with a UnicodeEncodeError exception if there are any characters that can't be converted to ASCII.

EDIT: As Pete Karl just pointed out, there is no one-to-one mapping from Unicode to ASCII, so some characters simply can't be converted in an information-preserving way. Moreover, ASCII is a subset of UTF-8, so if the file contains only ASCII characters you don't need to do any conversion at all.
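Both points can be checked in a few lines. This is a minimal Python 3 sketch (the answer's code is Python 2), and the sample strings are made up for illustration.

```python
text = "Andr\u00e9"  # 'André'; the last character is outside ASCII

try:
    text.encode("ascii")  # the default error handler is 'strict'
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Pure ASCII text produces identical bytes under both codecs.
plain = "hello"
same_bytes = plain.encode("ascii") == plain.encode("utf-8")

print(strict_failed, same_bytes)
```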
Dan
Here's some simple (and stupid) code to do encoding translation. I'm assuming (but you shouldn't) that the input file is in UTF-16 (Windows calls this simply 'Unicode').

    input_codec = 'UTF-16'
    output_codec = 'ASCII'
    # Open in binary mode so the raw bytes reach decode() untouched.
    unicode_file = open('filename', 'rb')
    unicode_data = unicode_file.read().decode(input_codec)
    ascii_file = open('new filename', 'wb')
    ascii_file.write(unicode_data.encode(output_codec))

Note that this will not work if there are any characters in the Unicode file that are not also ASCII characters. You can do the following to turn unrecognized characters into '?'s:

    ascii_file.write(unicode_data.encode(output_codec, 'replace'))

Check out http://docs.python.org/library/stdtypes.html#str.encode for more simple choices. If you need to do anything more sophisticated, you may wish to check out http://code.activestate.com/recipes/251871/ at the Python Cookbook.
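In Python 3 the same translation can lean on the encoding and errors arguments of the built-in open. The sketch below is one way to do it; the function name convert_file and its defaults are assumptions for illustration, and the UTF-16 input codec is still only a guess about what the tool writes.

```python
def convert_file(src_path, dst_path, input_codec="utf-16", errors="replace"):
    """Re-encode a text file as ASCII (hypothetical helper)."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=input_codec, newline="") as src:
        text = src.read()
    # With errors="replace", each character ASCII cannot represent becomes '?'.
    with open(dst_path, "w", encoding="ascii", errors=errors, newline="") as dst:
        dst.write(text)
```

Passing errors="strict" instead makes the write raise UnicodeEncodeError on the first character that has no ASCII form, which may be preferable when silent data loss is not acceptable.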
giltay
By the way, there is a Linux command, iconv, that does this kind of job:

    iconv -f utf8 -t ascii <input.txt >output.txt
kev
It's important to note that there is no 'Unicode' file format. Unicode can be encoded to bytes in several different ways, most commonly UTF-8 or UTF-16. You'll need to know which one your 3rd-party tool is outputting. Once you know that, converting between different encodings is pretty easy:

    in_file = open("myfile.txt", "rb")
    out_file = open("mynewfile.txt", "wb")

    in_byte_string = in_file.read()
    unicode_string = in_byte_string.decode('UTF-16')
    out_byte_string = unicode_string.encode('ASCII')

    out_file.write(out_byte_string)
    out_file.close()
    in_file.close()

As noted in the other replies, you're probably going to want to supply an error handler to the encode method. Using 'replace' as the error handler is simple, but will mangle your text if it contains characters that cannot be represented in ASCII.
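If you don't know which encoding the tool writes, one heuristic is to look for a byte order mark (BOM) at the start of the file. The sketch below uses the BOM constants from the standard codecs module; the function name sniff_bom is made up for illustration, and a file without a BOM tells you nothing, so this is a hint and not a reliable detector.

```python
import codecs

# Checked in order: the UTF-32 LE mark begins with the UTF-16 LE mark,
# so the longer marks have to be tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec
    return None
```

The returned names are codecs that consume the BOM while decoding, so data.decode(sniff_bom(data)) yields the text without a stray U+FEFF at the front.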
Jerry Hill
As other posters have noted, ASCII is a subset of Unicode. However, if you:

- have a legacy app
- don't control the code for that app
- are sure your input falls into the ASCII subset

then the example below shows how to do it:

    >>> mystring = u'bar'
    >>> type(mystring)
    <type 'unicode'>
    >>> myasciistring = mystring.encode('ASCII')
    >>> type(myasciistring)
    <type 'str'>
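Being sure that the input falls into the ASCII subset is easier if you check it first. A small Python 3 sketch that lists the offending characters before any conversion; find_non_ascii is a hypothetical helper name, not something from the answer.

```python
def find_non_ascii(text):
    """List (index, character) pairs for characters outside ASCII (hypothetical helper)."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]

print(find_non_ascii("bar"))
print(find_non_ascii("Andr\u00e9"))
```

An empty list means a strict encode to ASCII will succeed.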
nailer
For my problem, where I just wanted to skip the non-ASCII characters and output only ASCII, the solution below worked really well:

    import unicodedata
    input = open(filename, 'rb').read().decode('UTF-16')
    output = unicodedata.normalize('NFKD', input).encode('ASCII', 'ignore')
Vijay