A Brief History of Microsoft Products and Unicode – 32 Bit

COM in the 32-Bit World

COM made an interesting break with both operating systems: It supports only Unicode, and ANSI is just left out in the cold. If you cannot speak Unicode at some level (even if that only means calling MultiByteToWideChar and WideCharToMultiByte to convert strings), then you cannot speak to COM. With so much of even the basic functionality in the 32-bit Windows shell requiring Unicode, every application must do at least a little work in Unicode. Strings in 32-bit COM (OLESTRs and BSTRs) are always Unicode. Of course, most applications support it minimally by calling the conversion functions and using the system default codepage (CP_ACP), the one codepage guaranteed to always be supported.
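To make that minimal level of support concrete, here is a small Python sketch of the round trip an ANSI application performs around a COM call. It is only an illustration under stated assumptions: Python's codecs stand in for MultiByteToWideChar and WideCharToMultiByte, cp1252 stands in for CP_ACP, and the helper names ansi_to_wide and wide_to_ansi are hypothetical.

```python
# Assumption: cp1252 plays the role of the system default codepage (CP_ACP).
ASSUMED_ACP = "cp1252"


def ansi_to_wide(ansi_bytes: bytes, codepage: str = ASSUMED_ACP) -> bytes:
    """Stand-in for MultiByteToWideChar: codepage bytes -> UTF-16LE bytes."""
    return ansi_bytes.decode(codepage).encode("utf-16-le")


def wide_to_ansi(wide_bytes: bytes, codepage: str = ASSUMED_ACP) -> bytes:
    """Stand-in for WideCharToMultiByte: UTF-16LE bytes -> codepage bytes.

    In this sketch, characters missing from the codepage are replaced with '?'.
    """
    return wide_bytes.decode("utf-16-le").encode(codepage, errors="replace")


ansi = b"caf\xe9"                # "café" as cp1252 bytes
wide = ansi_to_wide(ansi)        # the form a Unicode-only interface expects
round_trip = wide_to_ansi(wide)  # converted back for the ANSI caller

print(len(ansi), len(wide))      # 4 bytes of ANSI become 8 bytes of UTF-16
print(round_trip == ansi)        # lossless here, because cp1252 contains 'é'
```

The round trip is lossless only because every character in the sample exists in the assumed codepage; text outside it would not survive the trip back.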

Windows 98

The core operating system only added a few API calls to the list that would support both Unicode and ANSI (seen in Table 6.2).

Table 6.2 The Win32 API Calls for Which Windows 98 Added Unicode Support

API Call    What It Does
lstrcat     Appends one string to another
lstrcpy     Copies a string to a buffer

However, many new interfaces were added, such as new shell extensions and integrated browsing enhancements. These are all COM interfaces and thus only support Unicode.

Windows Millennium Edition (Windows Me)

Windows Me did not really add much to the equation. According to Microsoft, it is the last version of the Win9x code base that will ever ship, but, in fairness, the company has been saying this since the OSR2 release of Windows 95. The basic issues I mentioned in connection with Windows 95 and 98 apply to Windows Me.

Two sets of APIs that have had Unicode support added in Millennium are those related to the Input Method Manager (IMM) API, which I discuss further in Chapter 8, and the Geographical Information Management (Geo) API. Geo is used by many of the Windows Me components that map locale information to a geographical location.

Data Storage Engines

The engines themselves, whether SQL Server, Jet, FoxPro, or other, initially stayed away from the Unicode world, preferring the provincial world where a single codepage is all that would be needed. Although both Jet and SQL Server could do their own string normalization in many cases (see Chapter 12 for more information on this topic), it was done only for performance reasons and not to support the notion of multiple codepages in the same file. Both products were a step beyond the operating system notion of the default codepage. You could explicitly choose to use any single codepage that the OS supported, but you were still limited to a single codepage.

The more recent versions of Jet and SQL Server, however, do support Unicode as a native format: In Jet, everything was moved to Unicode; in SQL Server, you could choose between ANSI and Unicode. Other engines (such as FoxPro) have no native Unicode support at the database engine level.

Data Access Methods

Unlike the engine itself, most of the data access methods (ADO, OLE DB, DAO, and RDO) are COM components that only support Unicode. So what do data layers do when they must speak Unicode if the underlying engine does not? Well, simply speaking, they convert to and from Unicode, using either the default system codepage, or in the case of FoxPro, Jet, and SQL Server, the codepage of their choice. Obviously there is a lot of room here for conversion errors.
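A rough Python sketch shows where those conversion errors come from. The function store_and_fetch is a hypothetical stand-in for a data layer sitting on a single-codepage engine, and the codepages cp1251 and cp1252 are chosen purely for illustration.

```python
def store_and_fetch(text: str, write_codepage: str, read_codepage: str) -> str:
    """Hypothetical data layer: Unicode in, single-codepage bytes stored, Unicode out."""
    stored = text.encode(write_codepage, errors="replace")  # Unicode -> engine bytes
    return stored.decode(read_codepage, errors="replace")   # engine bytes -> Unicode


russian = "Привет"

# Same codepage on both sides, and it contains the characters: no loss.
same = store_and_fetch(russian, "cp1251", "cp1251")

# Written with one codepage, read with another: every byte is reinterpreted.
mismatched = store_and_fetch(russian, "cp1251", "cp1252")

# The codepage has no Cyrillic at all: the characters are replaced on the way in.
unrepresentable = store_and_fetch(russian, "cp1252", "cp1252")

print(same)             # Привет
print(mismatched)       # Ïðèâåò
print(unrepresentable)  # ??????
```

The second case is recoverable if the right codepage is later discovered, because the bytes are intact. The third is not, because the original characters were discarded before they were stored.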

The move to Unicode by the data engines not only made most conversion errors go away (if everything can stay in one format, there is nothing to incorrectly convert!), it also improved performance because so many conversion calls went away! Chapter 12 has more information on why and where there are sometimes still problems in this area.

Microsoft Office

The popular Visual Basic author Bruce McKinney once stated, “Someday there will be Unicode data file formats, but it might not happen in your lifetime.” How wrong this turned out to be! Over the first three 32-bit versions of Office, all the major applications (Word, Excel, Access, and PowerPoint) have moved to both Unicode file formats and Unicode executables. Even in the world of text files (which are usually stored in ANSI format), provisions have been made to avoid assuming the codepage of the file.

To give a specific example, this very book was written, edited, and laid out by the publisher using Word 2000. Why? Because in many cases, I wanted to support multilingual text. I did not want the publisher to use QuarkXpress, a very popular program in publishing circles, because it has the exact same limitations as I have been describing in other programs. I have had to deal with the limitations of such packages for years in the articles I have written (and Quark, Inc. definitely is a standard for many publishers), but for this book it was important to be able to treat all languages as equal. By moving to Word 2000, I am able to include Hindi text such as “आप यहाँ पर क्यों आना चाहते हैं?” or Thai text such as “ทำไมคุณถึงต้องเข้ามาชมเว็บไซต์นี้?” without requiring the use of special screenshots for each bit of text. I will discuss this further in Chapter 10, “Handling Localized Resources with Satellite DLLs.”

For the curious, translations for the previous Hindi and Thai texts are given in Table 6.3, in many languages (perhaps even yours!). These translations were produced for many of the locales used on the trigeminal.com Web site.

Table 6.3 Look, Ma, No Bitmaps! Many Ways to Say the Same Phrase (Showing Off the Capabilities of My Publisher!)

Language             Phrase
Hindi                आप यहाँ पर क्यों आना चाहते हैं?
Thai                 ทำไมคุณถึงต้องเข้ามาชมเว็บไซต์นี้?
Bulgarian            или защо Ви трябва да идвате тук?
English              That is, why would you want to be here?
Simplified Chinese   即你来这儿的目的?
Traditional Chinese  即你為什麼要來這裡?
Turkish              örneğin; Neden bu sitede olmayı isteyeceginiz gibi?
German               d.h. warum lohnt es sich, hier zu sein?
French               i.e., pour quelles raisons désirez-vous explorer ce site?
Greek                δηλαδή, γιατί θέλετε να είστε εδώ;
Hebrew               כלוםר למה בכלל תרצה להיות כאן?
Dutch                d.w.z., waarom wilt u hier zijn?
Japanese             すなわち、あなたに必要なもの
Swedish              m.a.o. vad gör du här?
Portuguese           porque é que tu queres estar aqui?
Russian              возможно это то, что Вам надо
Spanish              ejemplo, ¿Porqué deseas estar aquí?
Italian              in altre parole, perché potreste voler visitare queste pagine?
Romanian             cu alte cuvinte, de ce sunteti aici?
Tamil                ஏன்நீ இஙுகு வரவேண்டு?

Windows 2000

Windows 2000, known while under development as NT5, simply continued the tradition of NT3.1, 3.51, and 4.0. It did pick up the new shell from Windows 98 and addressed many usability complaints. But from the globalization standpoint, it moved much closer to the worldwide EXE model throughout: There were no longer bug fixes that existed only for specific languages! Support for MUI (the MultiLanguage User Interface) proved that Windows 2000 was a worldwide operating system.

Some applications, unfortunately, are still stuck with ANSI, most notably Internet Information Server, but these applications have been clearly put on notice where they need to be heading: Unicode.

Windows CE

Yet another model was used for the smallest operating system: Windows CE is closest to COM in that it supports only Unicode at the API level. However, because only a limited number of applications support pure Unicode, and because a smaller device has only a limited amount of space for codepage translation tables, Windows CE applications are still limited in the number of codepages they can use. It is clearly, however, a step in the right direction.

Visual Basic in the 32-Bit World

And at last I come to the most important RAD tool in terms of this book: the 32-bit versions of Visual Basic! There are many issues that surround Unicode support in VB:

  • VB is in many ways the quintessential COM component, and COM is pure Unicode, so much of VB is indeed Unicode. Certainly its string storage is Unicode, and any calls to interfaces that are used via CreateObject/GetObject or by referenced libraries stay Unicode throughout.
  • VB’s forms package is, unfortunately, ANSI-based and has remained so for VB4, VB5, and VB6. Therefore, although all properties on VB forms have Unicode interfaces, their values must be converted to ANSI, and CP_ACP is always used.
  • Many of VB’s string-handling functions are actually wrappers around operating system calls that normalize strings. Therefore, they will support Unicode on Windows NT and Windows 2000, but will not on Windows 95 and Windows 98. This was partially mitigated in VB6, in which a Compare argument was added to many of these functions, allowing users to specify an LCID (VB would extract a codepage from the LCID to use for these operations).
  • Because VB wanted to keep all worries about matters such as ANSI or Unicode from developers, a syntax for declaring outside API calls with both types of strings was not specified. However, the entire Windows API under Windows 95/98 is ANSI! Therefore, VB developers made the “backward compatibility” decision to always convert VB’s Unicode strings to ANSI in declare statements. (This is true for both inbound and outbound parameters in the ByVal String case but only for inbound parameters in the ByRef String case, an issue described further in the next chapter.) As you have probably noticed, this is not for backwards compatibility with prior VB versions; it is more for backward compatibility with all the existing libraries you might want to call, including the Windows API.
  • VB source files are basically text files, and they are ANSI text files. The best thing I was able to do for multiple language support for samples in the book was to make sure that the files themselves and all of the information in them was in US English, as that is the only language guaranteed to work everywhere. The exception to this rule is several of the files in the next chapter.
  • Because most text and other files are saved as ANSI, all the file I/O functions in VB that deal with strings are once again handled the same way: Always convert the strings to ANSI, and, because there is no provision for choosing how conversions will happen, always use CP_ACP, the default system codepage. There is a means for completely avoiding all such conversions, however: a binary version of all the file I/O statements that does no extra conversions. To use this method, you would use InputB instead of Input, Write# instead of Print#, and, of course, open the file for binary I/O.
  • As another partially mitigating feature, Visual Basic always supports explicitly converting strings to byte arrays and byte arrays to strings. You can use byte arrays in many cases when you want to pass information that is a Unicode string but do not want Visual Basic to convert it to ANSI for you. (This feature is used quite a bit in the next two chapters.)
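The last two points can be sketched in Python, which is used here only as an analogy and not as VB code. The sketch assumes cp1252 as a stand-in for CP_ACP and uses a temporary file named sample.bin (a made-up name). It contrasts string I/O routed through a codepage with binary I/O on a byte array holding the raw UTF-16 code units.

```python
import os
import tempfile

ASSUMED_ACP = "cp1252"  # assumption: stand-in for the default system codepage

text = "Hello, Привет"

# Byte-array form of the string: raw UTF-16 code units, two bytes per character here.
as_bytes = text.encode("utf-16-le")

path = os.path.join(tempfile.mkdtemp(), "sample.bin")

# Binary I/O: the bytes are written and read back unchanged, with no codepage involved.
with open(path, "wb") as f:
    f.write(as_bytes)
with open(path, "rb") as f:
    binary_round_trip = f.read().decode("utf-16-le")

# String I/O through the assumed default codepage: anything outside it is lost.
with open(path, "w", encoding=ASSUMED_ACP, errors="replace") as f:
    f.write(text)
with open(path, "r", encoding=ASSUMED_ACP) as f:
    ansi_round_trip = f.read()

print(binary_round_trip == text)  # True
print(ansi_round_trip)            # Hello, ??????
```

The design point is the same one the list makes: once the data travels as bytes rather than as a string, nothing in the I/O layer has a chance to apply a codepage conversion to it.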

Of course, the order in which I have presented these points would lead anyone to believe the final answer to the question “Is VB Unicode?” would be “Yes, but…”, and maybe that is the best answer to give. Visual Basic is indeed Unicode with its Unicode string storage and Unicode interfaces, but as the data engines and access methods learned, there is a lot more to supporting Unicode than making sure the front door supports it, especially if you want to get the benefits of Unicode. If you think about it, Visual Basic forms gain nothing from their Unicode interfaces, nothing at all. Why is that? Well, when they are used, a single codepage is required. Therefore, the only thing that the Unicode interfaces of the forms package give VB is compatibility with COM; none of the benefits inherent in Unicode, such as being able to support many languages/locales, are available here.
