ASCII/Unicode in Word / VBA
-
The specific resolution to this question is less important to me than understanding the concepts involved. I am working with VBA in Word to convert extended characters in DOS files that show up as gibberish. Currently I do this with a long series of replace-alls based on trial-and-error: for example, I figured out that Chr(131) is an a with a circumflex; Chr(132) is an a with an umlaut; Chr(133) is an a with an accent grave. Word uses different values, which I believe are called Unicode. My program works fine, but I wonder if there's a more systematic way to do this. And could someone give me an explanation of this incompatibility?
-
Answer:
Hi, viseu-ga:

When developing a piece of VBA code, it is generally a useful tactic to try Record Macro first, to get a code snippet that at least does something close to what is wanted, and does it correctly.

First I used TextPad 4.6 to create a sample "ANSI" text document with some special (upper ASCII) characters, taken as it happens from a Google Answers thread (answered by Scriptor-GA) here:

[Translate Song into German]
http://answers.google.com/answers/main?cmd=threadview&id=173434

    Krieg! Ha! Paßt auf!
    Was hat er Gutes? Absolut rein gar nichts!
    Hört mir zu.
    Ah, ich hasse den Krieg,
    Weil ganz alleine der Tod nur siegt.
    Krieg heißt Tränen, und er trifft die Mütter hart,
    Denn ihre Söhne, die sind tot, vergessen und verscharrt!

Then I recorded this macro, which correctly opens the file (macro slightly edited for formatting purposes):

    Sub myOpen()
    '
    ' myOpen Macro for Word 2002
    ' Macro recorded 3/18/2003 by mathtalk-ga
    '
        Documents.Open FileName:="WordASCII.txt", _
            ConfirmConversions:=False, ReadOnly:=False, _
            AddToRecentFiles:=False, PasswordDocument:="", _
            PasswordTemplate:="", Revert:=False, _
            WritePasswordDocument:="", WritePasswordTemplate:="", _
            Format:=wdOpenFormatAuto, Encoding:=1252
    End Sub

That final "Encoding" parameter, which is supported in Word 2000 and 2002 but not in Word 97, works in combination with the "Format" parameter to control how text files are converted:

[Word 2002 Documents.Open]
(click on bolded "Documents" to reveal the syntax)
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vbawd10/html/womthOpen.asp

[Word 2000 Documents.Open]
(click on bolded "Documents" to reveal the syntax)
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/off2000/html/womthopen.asp

[Word 97 Documents.Open]
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/office97/html/output/F1/D4/S5ABE9.asp?frame=true

The mystery value 1252 shown above has a "coder friendly" equivalent, the constant msoEncodingWestern.
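Since the question concerns DOS files rather than Windows "ANSI" files, the same call can in principle be pointed at a DOS (OEM) code page instead. What follows is only a sketch under assumptions, not something I recorded: the file name is a placeholder, and the guess that the files use code page 437 (msoEncodingOEMUnitedStates) may be wrong for files written under a European code page such as 850 (msoEncodingOEMMultilingualLatinI). Both constants belong to the MsoEncoding enumeration in Office VBA, so this needs Word 2000 or later.

```vba
Sub OpenDosTextFile()
    ' "DOSFILE.TXT" is a hypothetical file name.
    ' Code page 437 is an assumption about how the file was written;
    ' try msoEncodingOEMMultilingualLatinI (850) if accents still look wrong.
    Documents.Open FileName:="DOSFILE.TXT", _
        ConfirmConversions:=False, ReadOnly:=False, _
        AddToRecentFiles:=False, _
        Format:=wdOpenFormatAuto, _
        Encoding:=msoEncodingOEMUnitedStates   ' numeric value 437
End Sub
```

If the assumption holds, Word does the whole byte-to-character conversion on open and no replace-all pass should be needed afterwards. For a Windows-created "ANSI" file like the sample above, the recorded value 1252 (msoEncodingWestern) remains the one to use.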
The value 1252 was apparently chosen to match the Windows Standard code page, ANSI 1252 (see History below for more on the "code page" concept). This was Microsoft's "improvement" on the ISO Western Latin(1) extension of ASCII known as ISO-8859-1. For details of their minor differences, see this comparison by George Hernandez:

[ANSI]
http://www.georgehernandez.com/xComputers/CharacterSets/ANSI.htm

For a list of all the MsoEncoding values in Office VBA, see here:

[Encoding Property]
(click on bolded "MsoEncoding" to reveal the list)
msdn.microsoft.com/library/en-us/vbawd10/html/woproEncoding.asp

These same enumeration constants are used in other related contexts. For example:

[ReloadAs Method]
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vbawd10/html/womthReloadAs.asp

Strangely, the "Encoding" parameter was not symmetrically added to the Save method, as discussed here:

[Ask Dr.International #5: Word Macro Recording Misses Encoding]
(first Q&A item listed)
http://www.microsoft.com/globaldev/DrIntl/columns/005/default.mspx

Instead, the way to control how Word encodes text documents during saves is to set the SaveEncoding document property:

[SaveEncoding]
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vbawd10/html/woproSaveEncoding.asp

For the sake of completeness, here's the list of possible values for the "Format" parameter:

    wdOpenFormatAllWord
    wdOpenFormatAuto [Default]
    wdOpenFormatDocument
    wdOpenFormatEncodedText
    wdOpenFormatRTF
    wdOpenFormatTemplate
    wdOpenFormatText
    wdOpenFormatUnicodeText
    wdOpenFormatWebPages

[Word 2002 Documents.Open]
(click on bolded "Documents" to reveal the syntax)
(click on bolded "WdOpenFormat" to reveal the list)
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vbawd10/html/womthOpen.asp

History
=======

Bearing in mind your desire for a conceptual understanding, let's stop and ask: exactly what does it mean for a text document to be in "ANSI" format?
Historically, ASCII (American Standard Code for Information Interchange) addressed only a set of 7-bit signals between computers and "teletype" terminals (even if they were video display terminals or "glass TTYs" that emulated the original "hardcopy" teletypes). As dialup modems became the norm for terminal-computer communications, rather than hardwiring these connections, the 7-bit character signals were "embedded" in 8-bit groups. The eighth bit was then available for additional information, such as "error detection" (e.g. requiring even or odd parity for each 8-bit group).

By the time that "personal" computers were blessed by IBM's entry into the marketplace, there were two sorts of uses for what had come to be called the "upper ASCII" characters, treated as individual values on an independent footing from their original "lower ASCII" 7-bit correspondences.

One of these uses was as graphical characters, exemplified in the IBM "PC DOS" operating system as a set of primarily line-drawing symbols (vertical, horizontal, corners, double lines, etc.).

The other use was for displaying "foreign" (from an English alphabetic perspective) characters. The PC-DOS character set includes, for example, a certain number of vowels with diacritical marks and a handful of Greek letters and mathematical symbols, though hardly enough for serious applications.

The ASCII character set was eventually incorporated into an "international" (ISO) standard as ISO-646-US-ASCII:

http://www.ascii-table.org/

In order to support "localization" of IBM PCs in a number of European countries, IBM developed what were termed "country code pages". What this involved, in its primitive formulation, was the loading of customized fonts (from disk) at "boot time" based on settings in the ubiquitous CONFIG.SYS file.
Applications (such as word processors), however, would need to be written to take cognizance of these "code page" settings, and packages such as WordPerfect did this with greater or lesser fidelity.

But now we had a classic "tower of Babel" situation, in which simple text files would display differently depending on settings external to the text files themselves. Several approaches were proposed to remedy this, eventually converging on the Unicode Standard (kept in step with ISO/IEC 10646, the Universal Character Set or UCS):

http://www.unicode.org/

which aims to simultaneously represent all character sets, even "large" ones like Chinese characters. For this the 256 possibilities allowed by 8 bits are obviously insufficient. Hence one often sees the phrase "wide character" in connection with Unicode implementations, although these are not synonyms.

A key to understanding the Unicode standard is to appreciate the difference between the abstract assignment of code points to characters (the first 65,536 of which form the BMP, or Basic Multilingual Plane) and the "encodings" of those code points in "storage" mappings like UTF-7, UTF-8, UTF-16, and UTF-32. The numbers in these designations describe the size in bits of the code units used to store characters; UTF-7 and UTF-8 in particular provide substantial backward compatibility with older 7-bit ASCII text files.

The Unicode Standard continues to evolve and to incorporate new "alphabets".
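To tie this back to the Chr(131) example in the question, here is a small sketch of why the same byte turns into different characters. It assumes the DOS files use code page 437 and that Windows is running with the Western (1252) ANSI code page; the function name Cp437ToUnicode is my own invention, and its table covers only the three characters named in the question, not the whole code page.

```vba
' Partial, hand-written table: code page 437 byte value -> Unicode character.
Function Cp437ToUnicode(ByVal dosByte As Integer) As String
    Select Case dosByte
        Case 0 To 127
            Cp437ToUnicode = ChrW(dosByte)   ' ASCII range: same values in all three schemes
        Case 131
            Cp437ToUnicode = ChrW(&HE2)      ' a with circumflex, U+00E2
        Case 132
            Cp437ToUnicode = ChrW(&HE4)      ' a with umlaut, U+00E4
        Case 133
            Cp437ToUnicode = ChrW(&HE0)      ' a with grave accent, U+00E0
        Case Else
            Cp437ToUnicode = "?"             ' not covered by this partial table
    End Select
End Function

Sub ShowCodePageMismatch()
    ' Chr() reads 131 in the system ANSI code page (assumed to be 1252 here),
    ' while Cp437ToUnicode() reads it the way DOS would have.
    Debug.Print "ANSI reading of byte 131: U+" & Hex(AscW(Chr(131)))
    Debug.Print "DOS reading of byte 131:  U+" & Hex(AscW(Cp437ToUnicode(131)))
    ' VBA strings store 16-bit code units, so LenB is twice Len.
    Debug.Print Len("abc"), LenB("abc")
End Sub
```

Run from the VBA editor with the Immediate window open. The two code points printed for byte 131 differ, which is the "gibberish" effect in miniature: the byte in the file never changed, only the table used to interpret it. Extending the table by hand is just the replace-all approach in another form, so letting Word apply the right code page at open time is likely the more systematic route.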
Other Links of Interest
=======================

For a good discussion of Microsoft's compatibility aims with Word and Unicode:

[Taking Advantage of Unicode Support]
http://www.microsoft.com/office/ork/xp/three/intd02.htm

A little-known quirk of how VBA handles passing strings into DLLs is "implicit" conversion from Unicode to ANSI:

[Anatomy of a Declare Statement]
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/odeopg/html/deovranatomyofdeclarestatement.asp

Sometimes, of course, one wishes to pass Unicode strings and needs to bypass this conversion:

[Working Around VBA String Conversion from Unicode to ANSI for DLLs]
http://www.mvps.org/vb/index2.html?tips/varptr.htm

Search Strategies
=================

recording a macro in Word 2002
consulting Office/Word VBA help files
searching MSDN Library (online and offline)

Keywords: MsoEncoding WdOpenFormat Unicode ASCII ANSI 1252
viseu-ga at Google Answers