Chapter 6
VB: Is It ANSI or Unicode?
As I implied way back in Chapter 1, "Getting Started," the question of whether Visual Basic is ANSI or Unicode is a trick question. The only answer someone could ever possibly give is yes and no, or more accurately, the answer depends on the meaning of the question. Perhaps delving a little more deeply into the history would help give the question enough meaning that the answer of "It depends" will at least feel a little less unsatisfying.
One important definition has already been discussed: when I refer to Unicode here I am using the Microsoft notion of UCS-2/UTF-16. Another that I have not discussed is that of the non-Unicode strings. Although these are usually referred to as multibyte (which covers all the other codepages, including the DBCS codepages that use up to two bytes per character, UTF-8 that uses one to four bytes, and so on), many people also refer to them as being "ANSI." This is largely for historical reasons, but there is a modern basis for this misused term: All the multibyte Windows APIs have an "A" suffix on them, the "A" standing for ANSI. An example would be the FindWindow API, which takes strings for a window class, a window caption, or both. The FindWindowA API exists on all Win32 operating systems and accepts strings that can be represented by the default system codepage. The FindWindowW API exists only on Windows NT and Windows 2000 and accepts Unicode strings. My personal preference would have been to call the "A" APIs the ~Unicode APIs (meaning the "not Unicode" APIs), but the tilde character is not one that would be valid in identifiers. This would be confusing, to say the least! The issue of using and calling Unicode versus non-Unicode APIs is one that I will be discussing a little later in this chapter and extensively in Chapters 7, "Understanding the Codepage Barrier," and 8, "Handling VB Forms and Formats."
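To make the "multibyte" label concrete, the following is a minimal Python sketch that counts how many bytes a single character occupies in several encodings. It is an illustration only: Python's built-in codecs stand in for the Windows codepage tables, and the sample characters and the helper name byte_lengths are assumptions chosen for the example, not anything taken from the Win32 API.

```python
# Count the bytes a single character needs in several encodings.
# The codec names are Python's; cp1252 and cp932 correspond to the
# Windows Western European and Japanese (DBCS) codepages.
SAMPLES = ["A", "é", "日"]
ENCODINGS = ["cp1252", "cp932", "utf-8", "utf-16-le"]


def byte_lengths(ch):
    """Map each encoding name to the encoded length of ch, or None if
    the encoding has no representation for that character."""
    lengths = {}
    for name in ENCODINGS:
        try:
            lengths[name] = len(ch.encode(name))
        except UnicodeEncodeError:
            lengths[name] = None  # character is outside this codepage
    return lengths


for ch in SAMPLES:
    print(ch, byte_lengths(ch))
```

The None entries are the interesting part: a codepage is not just a different byte count, it is a limited repertoire, and a character outside that repertoire simply cannot be stored.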
A Brief History of Microsoft Products and Unicode
The following is not meant to be an exhaustive look at the topic of Unicode, or the ISO 10646 standard that Unicode now tries to match character for character. It is meant to give a historical context to help explain where Visual Basic and other Microsoft products are in relation to Unicode and, hopefully, where they are heading as well. This history will cover most of the high points, and might be more functional than chronological. Where to start? Well, at the beginning, of course...
16-Bit Windows (Windows 3.0, 3.1, and 3.1x)
Windows was developed in the United States of America, by an American company, Microsoft. Unicode was not even an understood term at the time, so it's not surprising that none of the 16-bit Windows operating systems spoke Unicode. There were some forward-looking people at Microsoft, who were looking to international markets, notably markets such as Japan. This did cause Microsoft to begin thinking beyond codepage 1252 and into the Japanese and other Asian codepages. The expense of globalization and localization was not as well understood, but the requirements of the Japanese market were very clear, clear enough that Microsoft jumped in with the intention of localization.
However, localization at this stage was very primitive and involved U.S. developers "throwing their code over the fence" to the Japanese developers and then having them make the changes the product required and testing those changes. Unfortunately, those developers and testers made as many mistakes as their original counterparts, and these versions of the products were not supersets that could handle all languages, but subsets that just as often were broken for handling data that would work in the original version.
Thus localization existed, but globalization was relatively unknown. Because each locale simply kept to itself and cross-codepage interoperability was not a requirement, this caused no major problems for anyone.
COM in the 16-Bit World
COM did not really understand Unicode much better, for much the same reasons. In fact, the Component Object Model did not really embrace the importance of language/locale at all. It worked under the same assumption that it was good enough to work within the current system's default codepage.
Visual Basic in the 16-Bit World
Early versions of Visual Basic worked under the same rules as other 16-bit applications. But, by the time Visual Basic 3.0 arrived, it was clear that a better way of dealing with software in general was needed to bring down the costs and improve the quality of products that needed to be sold in other countries. However, Visual Basic has always been very much based on the platform on which it sits, and without support from the foundation, the house is simply not going to be built in such a direction. So VB 3.0 stayed where it was, but people were looking forward, toward Windows New Technology, or Windows NT.
Windows NT
David Cutler's second operating system (the first was for Digital) was built with the future in mind, and the important aspect for us is Windows NT's full support for Unicode. The entire operating system was written from the ground up with the idea that a Unicode Kernel and Unicode support was the most important method of getting to the operating system. This was met with skepticism, as most people simply saw strings that were twice as big as a real problem in an era when memory and hard drive space were at such premiums, but Mr. Cutler really was looking to the future. The "Unicode" he chose was UCS-2, the worldwide standard encapsulated by ISO 10646 that defined every character in terms of two bytes. The US English locale still had the "positional" advantage of being in the first 128 characters, but it no longer had the advantage over the Asian languages of taking up less space.
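The "twice as big" worry, and the loss of the space advantage for English, can be checked with a short Python sketch. The sample strings and the helper name sizes are assumptions made for this illustration, and utf-16-le is used as a stand-in for UCS-2, which it matches byte for byte for characters like these.

```python
# Compare storage for English and Japanese text in a legacy codepage
# versus two-byte Unicode (utf-16-le here, identical to UCS-2 for
# these Basic Multilingual Plane characters).
def sizes(text, codepage):
    """Return (bytes in the legacy codepage, bytes in UTF-16)."""
    return len(text.encode(codepage)), len(text.encode("utf-16-le"))


english = "Hello, world"      # 12 characters
japanese = "こんにちは世界"    # 7 characters

print(sizes(english, "cp1252"))   # (12, 24): English doubles in size
print(sizes(japanese, "cp932"))   # (14, 14): Japanese stays the same
```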
At the same time, reality crept in, and it became obvious that no one could move to a platform that did not support the "old way" of doing things. Therefore, ANSI would be supported for the sake of backward compatibility, and to enable all the existing applications to keep working (mostly). All the Win32 APIs that took strings now had two versions: an "A" version for multibyte character systems such as English, Dutch, and Japanese, and a "W" version that would use Unicode. At compile time, you would choose which set of APIs to use by choosing whether to compile with the Unicode flag; this decided, for example, whether the GetWindowLong call in your C/C++ code would be calling GetWindowLongA or GetWindowLongW. You could always choose to call one or the other explicitly, but you were encouraged not to. In theory, it would be easy for you to simply flip a switch one day and be in Unicode!
And why would they do this? Well, first, there was the obvious strength of a worldwide EXE (which even Windows NT did not yet have because many of its own core applications were not yet as enlightened as the Kernel, to the extent that a Kernel can be considered enlightened!). Second, any time you were dealing directly with Unicode, all your operations would be faster because no extra translations between ANSI and Unicode would be needed. People learned very quickly that the SDK documentation claims were accurate: the MultiByteToWideChar and WideCharToMultiByte functions could slow down an application.
One of the hidden features of MultiByteToWideChar and WideCharToMultiByte was, of course, that codepage tables had to exist to assist in the conversion of strings between Unicode and any codepage, and vice versa. Supporting a given language in a world where most applications were not really using Unicode internally meant supporting the codepage, as well.
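The dependence on codepage tables can be sketched in Python. This is not the Win32 API: bytes.decode and str.encode are swapped in to play the roles of MultiByteToWideChar and WideCharToMultiByte, the two wrapper names are hypothetical, and "cp99999" is a deliberately nonexistent codepage used to show what happens when no table is installed.

```python
# Python's codecs stand in here for the Win32 conversion functions:
# bytes.decode(codepage) plays the role of MultiByteToWideChar, and
# str.encode(codepage) the role of WideCharToMultiByte.
def multibyte_to_wide(data, codepage):
    return data.decode(codepage)


def wide_to_multibyte(text, codepage):
    return text.encode(codepage)


raw = b"\xe9"  # one byte, with no codepage attached to it

# The same byte means different things under different tables.
print(multibyte_to_wide(raw, "cp1252"))  # Western European: é
print(multibyte_to_wide(raw, "cp1251"))  # Cyrillic: й
print(multibyte_to_wide(raw, "cp1253"))  # Greek: ι

# Without a table for the codepage, no conversion is possible at all.
try:
    multibyte_to_wide(raw, "cp99999")  # hypothetical, nonexistent table
except LookupError as err:
    print("no table:", err)
```

The byte itself carries no record of which table produced it, which is why every conversion call has to name a codepage and why the operating system has to ship the corresponding table.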
Windows 95
Windows 95 was born under slightly different principles: It was, in many ways, a port of the original 16-bit Win 3.x codebase (just as most of the applications were, even under Windows NT). It was originally intended to fully support the same Win32 API as Windows NT, although in most cases the "W" versions of the API functions simply return an ERROR_CALL_NOT_IMPLEMENTED or E_NOTIMPL error. The limited number of Win32 API calls designed to support both Unicode and ANSI under Windows 95/98 are seen in Table 6.1.
Table 6.1 The Win32 API Calls That Support Unicode Under All Platforms
API Call | What It Does |
EnumResourceLanguages | Enumerates the languages supported by a specified resource name/type in a given module |
EnumResourceNames | Enumerates all the resources of a specified type in a given module |
EnumResourceTypes | Enumerates all the resource types in a given module |
ExtTextOut | Writes a character string out to a specified location, optionally enabling parameters beyond what TextOut supports |
FindResource | Finds a resource of a specified name and type |
FindResourceEx | Finds a resource of a specified name and type, allowing a language to be specified |
GetCharWidth | Retrieves the width of specified characters |
GetCommandLine | Retrieves the command line string for the current process |
GetTextExtentPoint | Computes the width and height of a given text string (provided for backward compatibility, GetTextExtentPoint32 is recommended) |
GetTextExtentPoint32 | Computes the width and height of a given text string |
lstrlen | Returns the length of a null-terminated string |
MessageBox | Creates, displays, and operates a message box |
MessageBoxEx | Creates, displays, and operates a message box, allowing the user to specify a language for the predefined buttons |
MultiByteToWideChar | Converts a multibyte string to a Unicode one, given the codepage with which to do the conversion |
TextOut | Writes a character string out to a specified location |
WideCharToMultiByte | Converts a Unicode string to a multibyte one, given the codepage with which to do the conversion |
Clearly, it can be challenging to write a Unicode application for Windows 95 or 98 given such a sparse set of tools. However, at first, the idea was that you would later simply compile as a Unicode application for Windows NT and be done with it; only later did the need to support Unicode applications on Windows 95/98 become clear.
The one big exception to all of this is 32-bit COM.
COM in the 32-Bit World
COM made an interesting break with both operating systems: It only supports Unicode, and ANSI is just left out in the cold. If you cannot speak Unicode at some level (even if it only means supporting the MultiByteToWideChar and WideCharToMultiByte calls to convert it), then you cannot speak to COM. With so much of even the basic functionality in the 32-bit Windows shell requiring Unicode, every application must at least do a little work in Unicode. Strings in 32-bit COM (OLESTRs and BSTRs) are always Unicode. Of course, most applications support it minimally by handling the conversion functions and using the system default codepage (CP_ACP), the one codepage guaranteed to always be supported.
Windows 98
The core operating system only added a few API calls to the list that would support both Unicode and ANSI (seen in Table 6.2).
Table 6.2 The Win32 API Calls for Which Windows 98 Added Unicode Support
API Call | What It Does |
lstrcat | Appends one string to another |
lstrcpy | Copies a string to a buffer |
However, many new interfaces were added, such as new shell extensions and integrated browsing enhancements. These are all COM interfaces and thus only support Unicode.
Windows Millennium Edition (Windows Me)
Windows Me did not really add much to the equation. According to Microsoft, it is the last version of the Win9x code base that will ever ship, but, in fairness, the company has been saying this since the OSR2 release of Windows 95. The basic issues I mentioned in connection with Windows 95 and 98 apply to Windows Me.
Two sets of APIs that have had Unicode support added in Millennium are those related to the Input Method Manager (IMM) API, which I discuss further in Chapter 8, and the Geographical Information Management (Geo) API. Geo is used by many of the Windows Me components that map locale information to a geographical location.
Data Storage Engines
The engines themselves, whether SQL Server, Jet, FoxPro, or other, initially stayed away from the Unicode world, preferring the provincial world where a single codepage is all that would be needed. Although both Jet and SQL Server could do their own string normalization in many cases (see Chapter 12 for more information on this topic), it was done only for performance reasons and not to support the notion of multiple codepages in the same file. Both products were a step beyond the operating system notion of the default codepage. You could explicitly choose to use any single codepage that the OS supported, but you were still limited to a single codepage.
The more recent versions of Jet and SQL Server, however, do support Unicode as a native format: In Jet, everything was moved to Unicode; in SQL Server, you could choose between ANSI and Unicode. Other engines (such as FoxPro) have no native Unicode support at the database engine level.
Data Access Methods
Unlike the engine itself, most of the data access methods (ADO, OLE DB, DAO, and RDO) are COM components that only support Unicode. So what do data layers do when they must speak Unicode if the underlying engine does not? Well, simply speaking, they convert to and from Unicode, using either the default system codepage, or in the case of FoxPro, Jet, and SQL Server, the codepage of their choice. Obviously there is a lot of room here for conversion errors.
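The room for conversion errors is easy to demonstrate with a small Python sketch of a Unicode-only data layer sitting on top of a single-codepage engine. Everything here is an assumption for illustration: cp1252 as the engine's codepage, the store and load helper names, and the use of Python's "replace" error handler to imitate a conversion that substitutes a default character for anything it cannot represent.

```python
# Simulate a Unicode-only data layer sitting on a single-codepage engine.
# cp1252 is an assumed engine codepage for this illustration; "replace"
# imitates a conversion that substitutes a default character.
ENGINE_CODEPAGE = "cp1252"


def store(text):
    """Convert Unicode to the engine's codepage, as a data layer must."""
    return text.encode(ENGINE_CODEPAGE, errors="replace")


def load(data):
    """Convert the engine's bytes back to Unicode for the caller."""
    return data.decode(ENGINE_CODEPAGE)


for original in ["café", "Привет", "日本語"]:
    round_tripped = load(store(original))
    print(original, "->", round_tripped, "| intact:", original == round_tripped)
```

Only the first string survives the round trip; the Russian and Japanese strings come back as question marks, and nothing in the stored bytes records what was lost.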
The move to Unicode by the data engines not only made most of the conversion errors go away (if everything can stay in one format, there is nothing to incorrectly convert!), it also improved performance because so many conversion calls went away! Chapter 12 has more information on why and where there are sometimes still problems in this area.
Microsoft Office
The popular Visual Basic author Bruce McKinney once stated, "Someday there will be Unicode data file formats, but it might not happen in your lifetime." How wrong this turned out to be! Over the first three 32-bit versions of Office, all the major applications (Word, Excel, Access, and PowerPoint) have moved to both Unicode file formats and Unicode executables. Even in the world of text files (which are usually stored in ANSI format), provisions to not make assumptions about the codepage of the file have been made.
To give a specific example, this very book was written, edited, and laid out by the publisher using Word 2000. Why? Because in many cases, I wanted to support multilingual text. I did not want the publisher to use QuarkXPress, a very popular program in publishing circles, because it has the exact same limitations as I have been describing in other programs. I have had to deal with the limitations of such packages for years in the articles I have written (and Quark, Inc. definitely is a standard for many publishers), but for this book it was important to be able to treat all languages as equal. By moving to Word 2000, I am able to include Hindi text such as "आप यहाँ पर क्यों आना चाहते हैं?" or Thai text such as "ทำไมคุณถึงต้องเข้ามาชมเว็บไซต์นี้?" without requiring the use of special screenshots for each bit of text. I will discuss this further in Chapter 10, "Handling Localized Resources with Satellite DLLs."
For the curious, translations for the previous Hindi and Thai texts are given in Table 6.3, in many languages (perhaps even yours!). These translations were produced for many of the locales used on the trigeminal.com Web site.
Table 6.3 Look, Ma, No Bitmaps! Many Ways to Say the Same Phrase (Showing Off the Capabilities of My Publisher!)
Language | Phrase |
Hindi | आप यहाँ पर क्यों आना चाहते हैं? |
Thai | ทำไมคุณถึงต้องเข้ามาชมเว็บไซต์นี้? |
Bulgarian | или защо Ви трябва да идвате тук? |
English | That is, why would you want to be here? |
Simplified Chinese | 即你来这儿的目的? |
Traditional Chinese | 即你為什麼要來這裡? |
Turkish | örneğin; Neden bu sitede olmayı isteyeceginiz gibi? |
German | d.h. warum lohnt es sich, hier zu sein? |
French | i.e., pour quelles raisons désirez-vous explorer ce site? |
Greek | δηλαδή, γιατί θέλετε να είστε εδώ; |
Hebrew | כלומר למה בכלל תרצה להיות כאן? |
Dutch | d.w.z., waarom wilt u hier zijn? |
Japanese | すなわち、あなたに必要なもの |
Swedish | m.a.o. vad gör du här? |
Portuguese | porque é que tu queres estar aqui? |
Russian | возможно это то, что Вам надо |
Spanish | ejemplo, ¿Porqué deseas estar aquí? |
Italian | in altre parole, perché potreste voler visitare queste pagine? |
Romanian | cu alte cuvinte, de ce sunteti aici? |
Tamil | ஏன்நீ இஙுகு வரவேண்டு? |
Windows 2000
Windows 2000, known while under development as NT5, simply continued the tradition of NT3.1, 3.51, and 4.0. It did pick up the new shell from Windows 98 and addressed many usability complaints. But, from the globalization standpoint, it moved much closer to the worldwide EXE model, throughout: There were no longer bug fixes that existed only for specific languages! Support for MUI (the MultiLanguage User Interface) proved that Windows 2000 was a worldwide operating system.
Some applications, unfortunately, are still stuck with ANSI, most notably Internet Information Server, but these applications have been clearly put on notice where they need to be heading: Unicode.
Windows CE
Yet another model was used for the smallest operating system: Windows CE is closest to COM in that it only supports Unicode at the API level. However, because there are still only a limited number of applications that support pure Unicode, and only a limited amount of space on a smaller device for codepage translation tables, Windows CE applications are still limited in the number of codepages they can use. It is clearly, however, a step in the right direction.
Visual Basic in the 32-Bit World
And at last I come to the most important RAD tool in terms of this book: the 32-bit versions of Visual Basic! There are many issues that surround Unicode support in VB:
Of course, the order in which I have presented these points would lead anyone to believe the final answer to the question "Is VB Unicode?" would be "Yes, but...", and maybe that is the best answer to give. Visual Basic is indeed Unicode with its Unicode string storage and Unicode interfaces, but as the data engines and access methods learned, there is a lot more to supporting Unicode than making sure the front door supports it, especially if you want to get the benefits of Unicode. If you think about it, Visual Basic forms gain nothing from their Unicode interfaces, nothing at all. Why is that? Well, when they are used, a single codepage is required. Therefore, the only thing that the Unicode interfaces of the forms package gives VB is compatibility with COM; none of the benefits inherent in Unicode, such as being able to support many languages/locales, are available here.
Looking at Future Versions of Visual Basic
One thing that must be clear to the Visual Basic team at this point is that VB cannot be an ANSI application anymore. As the premier RAD tool from Microsoft, it is important that VB be able to deal with Office 2000 and Windows 2000 properly, supporting every scenario they do.
At the time I am writing this book, the next VB version does not even have a name yet (although many people have suggested it might not be VB 7.0!). There are many exciting features being discussed for the product, but one issue about which Microsoft has been curiously silent is Unicode and multilingual support. One thing we can be certain of: The Visual Basic team is very aware of the pressures here to bring Visual Basic to the same level as so many of Microsoft's other products.
Where We Just Were...
Sometimes the simplest of questions can lead to answers that take a long time to express, and I think this chapter was indeed one of them. We started way back in the distant past of 16-bit Windows and walked through several of the important milestones in Microsoft products. The history and the way that other products have grown up around Visual Basic can give us clues into what future versions of Visual Basic can be expected to do for Unicode support.
The rest of Part II (the next two chapters) will discuss how to work around these limitations, and support languages that are outside the current default system codepage.