README.mb
- postgresql 6.5.1 multi-byte (MB) support README July 11 1999
- Tatsuo Ishii
- t-ishii@sra.co.jp
- http://www.sra.co.jp/people/t-ishii/PostgreSQL/
- 0. Introduction
- The MB support allows PostgreSQL to handle
- multi-byte character sets such as EUC (Extended Unix Code), Unicode and
- Mule internal code. With MB enabled you can use multi-byte
- character sets in regexp, LIKE and some functions. The default
- encoding system is determined when you initialize your
- PostgreSQL installation using initdb(1). Note that this can be
- overridden when you create a database using createdb(1) or the CREATE
- DATABASE SQL command, so you can have multiple databases with
- different encoding systems.
- MB also fixes some problems concerning 8-bit single byte
- character sets including ISO8859. (I would not say that all problems
- have been fixed. I just confirmed that the regression test ran fine
- and a few French characters could be used with the patch. Please let
- me know if you find any problem while using 8-bit characters.)
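To illustrate why multi-byte awareness matters for LIKE, regexp and string functions, here is a minimal Python sketch (not PostgreSQL code) that compares the number of characters in a string with the number of bytes needed to store it. The helper name lengths and the codec names "euc_jp" and "utf-8" are Python's, chosen for illustration only.

```python
# Compare character count and byte count for the same text.
# Codec names are Python's, not PostgreSQL encoding names.

def lengths(s, codec):
    """Return (character count, byte count) of s when stored in codec."""
    return len(s), len(s.encode(codec))

text = "日本語"  # three characters

for codec in ("euc_jp", "utf-8"):
    chars, octets = lengths(text, codec)
    print(f"{codec}: {chars} characters, {octets} bytes")
```

Code that treats each byte as one character would count this string as longer than three characters, which is the kind of mismatch the MB support is meant to avoid.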
- 1. How to use
- Run configure with the mb option:
- % configure --with-mb=encoding_system
- where encoding_system is one of:
- SQL_ASCII ASCII
- EUC_JP Japanese EUC
- EUC_CN Chinese EUC
- EUC_KR Korean EUC
- EUC_TW Taiwan EUC
- UNICODE Unicode(UTF-8)
- MULE_INTERNAL Mule internal
- LATIN1 ISO 8859-1 English and some European languages
- LATIN2 ISO 8859-2 English and some European languages
- LATIN3 ISO 8859-3 English and some European languages
- LATIN4 ISO 8859-4 English and some European languages
- LATIN5 ISO 8859-5 English and some European languages
- KOI8 KOI8-R
- WIN Windows CP1251
- ALT Windows CP866
- Example:
- % configure --with-mb=EUC_JP
- If MB is disabled, nothing is changed except for better support of
- 8-bit single byte character sets.
- 2. How to set encoding
- The initdb command defines the default encoding for a PostgreSQL
- installation. For example:
- % initdb -e EUC_JP
- sets the default encoding to EUC_JP (Extended Unix Code for Japanese).
- Note that you can use "-pgencoding" instead of "-e" if you like a longer
- option string :-) If no -e or -pgencoding option is given, the encoding
- specified at compile time is used.
- You can create a database with a different encoding.
- % createdb -E EUC_KR korean
- will create a database named "korean" with EUC_KR encoding.
- Another way to accomplish this is to use an SQL command:
- CREATE DATABASE korean WITH ENCODING = 'EUC_KR';
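The precedence described in sections 1 and 2 (compile time, then initdb, then createdb or CREATE DATABASE) can be sketched as a small Python function. This is only a model of the rule as stated in this README; the function name effective_encoding and its parameters are hypothetical and do not exist in PostgreSQL.

```python
from typing import Optional

def effective_encoding(compile_time: str,
                       initdb: Optional[str] = None,
                       createdb: Optional[str] = None) -> str:
    """Pick the encoding a new database ends up with.

    compile_time: value given to configure --with-mb
    initdb:       value given to initdb -e, if any
    createdb:     value given to createdb -E or
                  CREATE DATABASE ... WITH ENCODING, if any
    """
    if createdb is not None:   # per-database choice wins
        return createdb
    if initdb is not None:     # otherwise the installation default
        return initdb
    return compile_time        # otherwise the compile-time setting

# Installation defaults to EUC_JP, one database is created as EUC_KR.
print(effective_encoding("EUC_JP", initdb="EUC_JP", createdb="EUC_KR"))
```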
- The encoding for a database is stored in the "encoding" column of the
- pg_database system catalog.
- datname      |datdba|encoding|datpath
- -------------+------+--------+-------------
- template1    |  1739|       1|template1
- postgres     |  1739|       0|postgres
- euc_jp       |  1739|       1|euc_jp
- euc_kr       |  1739|       3|euc_kr
- euc_cn       |  1739|       2|euc_cn
- unicode      |  1739|       5|unicode
- mule_internal|  1739|       6|mule_internal
- The number in the encoding column is the "encoding id" and can be translated
- to the encoding name using the pg_encoding command.
- $ pg_encoding 1
- EUC_JP
- If the argument to pg_encoding is not a number, it is regarded as
- an encoding name and pg_encoding will return the encoding id.
- $ pg_encoding EUC_JP
- 1
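The two-way lookup that pg_encoding performs can be modeled in a few lines of Python. This is a sketch, not the real command: the function name pg_encoding_lookup is hypothetical, and the id table contains only the ids that can be inferred from the pg_database listing above (assuming each example database is named after its encoding). Ids not shown there are deliberately left out.

```python
# Ids inferred from the pg_database example in this README, assuming each
# database there is named after its encoding. Not a complete table.
ENCODING_NAMES = {
    1: "EUC_JP",
    2: "EUC_CN",
    3: "EUC_KR",
    5: "UNICODE",
    6: "MULE_INTERNAL",
}
ENCODING_IDS = {name: enc_id for enc_id, name in ENCODING_NAMES.items()}

def pg_encoding_lookup(arg: str) -> str:
    """Map an id to a name, or a name to an id, like pg_encoding does.

    Raises KeyError for ids or names missing from the partial table.
    """
    if arg.isdigit():
        return ENCODING_NAMES[int(arg)]
    return str(ENCODING_IDS[arg])

print(pg_encoding_lookup("1"))
print(pg_encoding_lookup("EUC_JP"))
```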
- 3. PGCLIENTENCODING
- If an environment variable PGCLIENTENCODING is defined on the
- frontend, automatic encoding translation is done by the backend. For
- example, if the backend has been compiled with MB=EUC_JP and
- PGCLIENTENCODING=SJIS(Shift JIS: yet another Japanese encoding
- system), then any SJIS strings coming from the frontend would be
- translated to EUC_JP before going into the parser. Outputs from the
- backend would be translated to SJIS of course.
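The translation described above can be imitated outside the database with Python's standard codecs. This is only an illustration of what "SJIS in, EUC_JP inside, SJIS out" means at the byte level; it is not how the backend implements it, and the helper name recode and the codec names "shift_jis" and "euc_jp" are Python's, not PostgreSQL's.

```python
def recode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one character encoding to another."""
    return data.decode(source).encode(target)

# A string as a SJIS frontend would send it.
from_frontend = "こんにちは".encode("shift_jis")

# What reaches the parser of an EUC_JP backend.
in_backend = recode(from_frontend, "shift_jis", "euc_jp")

# What the frontend receives when the backend sends the string back.
to_frontend = recode(in_backend, "euc_jp", "shift_jis")

print(from_frontend.hex(), in_backend.hex(), to_frontend == from_frontend)
```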
- Supported encodings for PGCLIENTENCODING are:
- SQL_ASCII ASCII
- EUC_JP Japanese EUC
- SJIS Yet another Japanese encoding
- EUC_CN Chinese EUC
- EUC_KR Korean EUC
- EUC_TW Taiwan EUC
- BIG5 Traditional Chinese
- MULE_INTERNAL Mule internal
- LATIN1 ISO 8859-1 English and some European languages
- LATIN2 ISO 8859-2 English and some European languages
- LATIN3 ISO 8859-3 English and some European languages
- LATIN4 ISO 8859-4 English and some European languages
- LATIN5 ISO 8859-5 English and some European languages
- KOI8 KOI8-R
- WIN Windows CP1251
- ALT Windows CP866
- WIN1250 Windows CP1250 (Czech)
- Note that UNICODE is not supported (yet). Also note that
- translation is not always possible. Suppose you choose EUC_JP for the
- backend and LATIN1 for the frontend; then some Japanese characters cannot
- be translated into Latin. In this case, a letter that cannot be represented
- in the Latin character set is transformed as:
- (HEXA DECIMAL)
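A rough Python model of this fallback is shown below. It assumes that the hexadecimal digits are those of the character's bytes in the backend encoding; the README does not spell out the exact format, so treat that as an assumption. The function name to_frontend is hypothetical.

```python
def to_frontend(text: str,
                backend: str = "euc_jp",
                frontend: str = "latin-1") -> bytes:
    """Encode text for the frontend, writing untranslatable characters
    as their backend bytes in hexadecimal, wrapped in parentheses.

    Assumption: the hex digits come from the backend encoding.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(frontend)
        except UnicodeEncodeError:
            hex_digits = ch.encode(backend).hex().encode("ascii")
            out += b"(" + hex_digits + b")"
    return bytes(out)

print(to_frontend("aあ"))  # b'a(a4a2)'
```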
- 3. SET CLIENT_ENCODING TO command
- The frontend encoding is actually set by a
- new command:
- SET CLIENT_ENCODING TO 'encoding';
- where encoding is one of the encodings that can be set for
- PGCLIENTENCODING. You can also use the SQL92 syntax "SET NAMES" for this
- purpose:
- SET NAMES 'encoding';
- To query the current frontend encoding:
- SHOW CLIENT_ENCODING;
- To return to the default encoding:
- RESET CLIENT_ENCODING;
- This resets the frontend encoding to the same as the backend
- encoding, so no encoding translation is performed.
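A client program that lets users pick an encoding may want to check the name before sending the command. The sketch below builds the statement text only; it does not connect to a database. The function name set_client_encoding_sql is hypothetical, and the set of names is copied from the PGCLIENTENCODING list in section 3 of this README.

```python
# Client encodings listed for PGCLIENTENCODING in this README.
CLIENT_ENCODINGS = {
    "SQL_ASCII", "EUC_JP", "SJIS", "EUC_CN", "EUC_KR", "EUC_TW", "BIG5",
    "MULE_INTERNAL", "LATIN1", "LATIN2", "LATIN3", "LATIN4", "LATIN5",
    "KOI8", "WIN", "ALT", "WIN1250",
}

def set_client_encoding_sql(name: str) -> str:
    """Return the SET CLIENT_ENCODING statement for a listed encoding."""
    name = name.upper()
    if name not in CLIENT_ENCODINGS:
        raise ValueError(f"unsupported client encoding: {name}")
    return f"SET CLIENT_ENCODING TO '{name}';"

print(set_client_encoding_sql("sjis"))
```

Rejecting unknown names up front also keeps arbitrary user input out of the SQL text.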
- 4. References
- These are good sources to start learning about various kinds of encoding
- systems.
- ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
- Detailed explanations of EUC_JP, EUC_CN, EUC_KR, EUC_TW
- appear in section 3.2.
- Unicode: http://www.unicode.org/
- The homepage of UNICODE.
- RFC 2044
- UTF-8 is defined here.
- 5. History
- July 11, 1999
- * Add support for WIN1250 (Windows Czech) as a client encoding
- (contributed by Pavel Behal)
- * fix some compiler warnings (contributed by Tomoaki Nishiyama)
- Mar 23, 1999
- * Add support for KOI8(KOI8-R), WIN(CP1251), ALT(CP866)
- (thanks Oleg Broytmann for testing)
- * Fix problem with MB and locale
- Jan 26, 1999
- * Add support for Big5 as a frontend encoding
- (you need to create a database with EUC_TW to use Big5)
- * Add regression test case for EUC_TW
- (contributed by Jonah Kuo <jonahkuo@mail.ttn.com.tw>)
- Dec 15, 1998
- * Bugs related to SQL_ASCII support fixed
- Nov 5, 1998
- * 6.4 release. In this version, pg_database has "encoding"
- column that represents the database encoding
- Jul 22, 1998
- * determine encoding at initdb/createdb rather than compile time
- * support for PGCLIENTENCODING when issuing COPY command
- * support for SQL92 syntax "SET NAMES"
- * support for LATIN2-5
- * add UNICODE regression test case
- * new test suite for MB
- * clean up source files
- Jun 5, 1998
- * add support for the encoding translation between the backend
- and the frontend
- * new command SET CLIENT_ENCODING etc. added
- * add support for LATIN1 character set
- * enhance 8-bit cleanness
- April 21, 1998 some enhancements/fixes
- * character_length(), position(), substring() are now aware of
- multi-byte characters
- * add octet_length()
- * add --with-mb option to configure
- * new regression tests for EUC_KR
- (contributed by "Soonmyung. Hong" <hong@lunaris.hanmesoft.co.kr>)
- * add some test cases to the EUC_JP regression test
- * fix problem in regress/regress.sh in case of System V
- * fix toupper(), tolower() to handle 8bit chars
- Mar 25, 1998 MB PL2 is incorporated into PostgreSQL 6.3.1
- Mar 10, 1998 PL2 released
- * add regression test for EUC_JP, EUC_CN and MULE_INTERNAL
- * add an English document (this file)
- * fix problems concerning 8-bit single byte characters
- Mar 1, 1998 PL1 released
- Appendix:
- [Here is a good document from Pavel Behal explaining how to use WIN1250
- on Windows/ODBC. Please note that Installation step 1)
- is not necessary in 6.5.1 -- Tatsuo]
- Version: 0.91 for PgSQL 6.5
- Author: Pavel Behal
- Revised by: Tatsuo Ishii
- Email: behal@opf.slu.cz
- Licence: The Same as PostgreSQL
- Sorry for my English and C code, I'm not native :-)
- !!!!!!!!!!!!!!!!!!!!!!!!! NO WARRANTY !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
- Installation:
- -------------
- 1) Change three affected files in source directories
- (I don't have time to create proper patch diffs, I don't know how)
- 2) Compile with locale enabled and multibyte set to LATIN2
- 3) Set up your installation properly; do not forget to set the locale
- variables in your profile (environment). Example (may not be exactly right):
- LC_ALL=cs_CZ.ISO8859-2
- LC_COLLATE=cs_CZ.ISO8859-2
- LC_CTYPE=cs_CZ.ISO8859-2
- LC_MONETARY=cs_CZ.ISO8859-2
- LC_NUMERIC=cs_CZ.ISO8859-2
- LC_TIME=cs_CZ.ISO8859-2
- 4) You have to start the postmaster with locales set!
- 5) Try it with the Czech language; it has to sort correctly
- 6) Install the ODBC driver for PgSQL on your M$ Windows machine
- 7) Set up your data source properly. Include this line in the
- "Connect Settings:" field of your ODBC configuration dialog:
- SET CLIENT_ENCODING = 'WIN1250';
- 8) Now try it again, but in Windows with ODBC.
- Description:
- ------------
- Depends on proper system locales, tested with RH6.0 and Slackware 3.6,
- with the cs_CZ.iso8859-2 locale
- Never try to set the server multibyte database encoding to WIN1250,
- always use LATIN2 instead. There is no WIN1250 locale in Unix
- WIN1250 encoding is usable only for M$W ODBC clients. The characters are
- re-coded on the fly, to be displayed and stored back properly
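The on-the-fly re-coding between WIN1250 on the client and LATIN2 on the server can be pictured with Python's standard codecs. This is an illustration only, not the backend's conversion code; the function names are hypothetical, and "cp1250" and "iso8859_2" are Python's codec names for WIN1250 and LATIN2.

```python
def win1250_to_latin2(data: bytes) -> bytes:
    """Re-code bytes from a Windows client for storage as LATIN2."""
    return data.decode("cp1250").encode("iso8859_2")

def latin2_to_win1250(data: bytes) -> bytes:
    """Re-code stored LATIN2 bytes for display on a Windows client."""
    return data.decode("iso8859_2").encode("cp1250")

# Czech letters such as "š" and "ž" sit at different byte values in the
# two encodings, while "č" happens to share the same one.
from_client = "šžč".encode("cp1250")
stored = win1250_to_latin2(from_client)
print(from_client.hex(), stored.hex())
```

Characters that exist in CP1250 but not in ISO 8859-2 make this sketch raise UnicodeEncodeError; how the real conversion treats them is not covered here.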
-
- Important:
- ----------
- it reorders your sort order depending on your LC_... setting, so don't be
- confused by the regression tests, they don't use locale
- "ch" is correctly sorted only in some newer locales (Ex. RH6.0)
- you have to insert money as '162,50' (with comma, in apostrophes!)
- - not tested properly