AWS Database Blog
Choosing the right code page and collation for migration from mainframe Db2 to Amazon RDS for Db2
When migrating from mainframe Db2 (z/OS) to Amazon Relational Database Service (Amazon RDS) for Db2 on Linux, selecting the appropriate code page and collation sequence is critical for data compatibility. The right choice helps prevent data truncation and character expansion issues, and it maintains consistent display of foreign characters (such as accented Latin letters) and consistent sorting behavior.
While Amazon RDS for Db2 defaults to Unicode, you can achieve mainframe compatibility through careful configuration. This post guides you through selecting the appropriate code page for Amazon RDS for Db2 to help ensure seamless data migration.
Code page, collation, and territory parameters in Amazon RDS for Db2 cannot be modified after database creation. Careful selection upfront avoids the need for database recreation and remigration.
Understanding mainframe CCSID and code page
In Db2 for z/OS®, each character is mapped to a numeric value according to the CCSID (coded character set identifier) and code page associated with it. A code page is a numeric identifier for a specific byte-to-glyph mapping and nothing more; an example is code page 037, an Extended Binary Coded Decimal Interchange Code (EBCDIC) code page for the US and Canada. A CCSID is a fuller identifier that describes the character set, the encoding, and the conversion rules; it combines a code page with locale and conversion metadata. An example is CCSID 0037 (EBCDIC for the US and Canada, including conversion rules). Every CCSID implies a code page, but a code page alone does not imply a complete CCSID. Db2 for z/OS performs character conversion internally and needs the CCSID to know exactly how to convert data.
Regional code pages
North American and Western European code pages
- IBM-37 (CCSID 37, CP037)
- IBM-500 (CCSID 500, CP500)
- IBM-1047 (CCSID 1047, CP1047)
German and Austrian code pages
- IBM-273 (CCSID 273, CP273): Traditional German and Austrian EBCDIC
- IBM-1141 (CCSID 1141, CP1141): German and Austrian EBCDIC with Euro symbol support
Japanese code pages
- IBM-930 (CCSID 930): Japanese EBCDIC mixed SBCS and DBCS code page using Katakana SBCS and JIS X 0208 DBCS
- IBM-939 (CCSID 939): Japanese EBCDIC mixed SBCS and DBCS code page using Latin (Roman) SBCS and JIS X 0208 DBCS
- IBM-1390 (CCSID 1390): Japanese EBCDIC mixed SBCS and DBCS CCSID with Euro (€) symbol support, the Euro-enabled counterpart of 930
- IBM-1399 (CCSID 1399): Japanese EBCDIC mixed SBCS and DBCS CCSID with Euro (€) symbol support, the Euro-enabled counterpart of 939
- IBM-5026 (CCSID 5026): Japanese EBCDIC mixed SBCS and DBCS CCSID with Katakana SBCS, uppercase-only Latin letters, and a Katakana-oriented collation
- IBM-5035 (CCSID 5035): Japanese EBCDIC mixed SBCS and DBCS CCSID with Latin (Roman) SBCS
These EBCDIC code pages encode character sets optimized for specific regions and languages. While many share similar Latin-1 character repertoires, they differ in the assignment of specific code points to enhance compatibility with local requirements and open systems. On IBM platforms, character CCSIDs (CHAR and VARCHAR) and graphic CCSIDs (GRAPHIC and VARGRAPHIC) are distinct: DBCS-only graphic CCSIDs are intended for graphic data and cannot represent SBCS characters.
Regional code page details
North American and Western European:
- CCSID 37 (CP037): Traditional EBCDIC for the U.S., Canada, the Netherlands, and Portugal. Optimized for legacy mainframe applications with standard symbol placements. Originated in the 1960s for System/360 mainframes.
- CCSID 500 (CP500): International EBCDIC variant for Belgium, French-Canada, and Switzerland. Features adjustments for local symbols (for example, currency symbols like CHF). Shares most code points with CCSID 37 but includes swaps for multilingual support.
- CCSID 1047 (CP1047): Open Systems EBCDIC introduced in the 1990s for OS/390 USS (Unix System Services). A variant of CCSID 37 with six specific character swaps to align with POSIX/C standards.
- CCSID 1140 (CP1140): Enhanced U.S. and Canada (CP037) that includes the Euro symbol (€).
German and Austrian:
- CCSID 273 (CP273): Traditional German and Austrian EBCDIC encoding with German-specific characters such as ä, ö, ü, and ß. Widely used in German-speaking countries for legacy mainframe systems.
- CCSID 1141 (CP1141): Enhanced German and Austrian EBCDIC that includes the Euro symbol (€). Introduced to support European monetary union requirements while maintaining German character support.
Japanese:
- CCSID 930 (CP930): Japanese EBCDIC mixed SBCS and DBCS code page using Katakana SBCS and JIS X 0208 DBCS. Its SBCS portion covers Katakana, digits, and uppercase-only Latin letters alongside DBCS Kanji. It's very common on z/OS and older mainframe environments.
- CCSID 939 (CP939): Japanese EBCDIC mixed SBCS and DBCS code page using Latin (Roman) SBCS and JIS X 0208 DBCS, supporting both uppercase and lowercase Latin letters in SBCS alongside DBCS Kanji.
- CCSID 1390 (CP1390): Japanese EBCDIC mixed SBCS and DBCS code page with Euro (€) symbol support, based on 930.
- CCSID 1399 (CP1399): Japanese EBCDIC mixed SBCS and DBCS code page with Euro (€) symbol support, based on 939.
- CCSID 5026 (CP5026): Japanese EBCDIC mixed code page with Katakana SBCS, uppercase-only Latin letters, and a Katakana-oriented collation sequence.
- CCSID 5035 (CP5035): Japanese EBCDIC mixed code page with Latin (Roman) SBCS.
Migration compatibility and ISO standards
For Latin-based code pages (37, 500, 1047, 273): ISO-8859-1 (Latin-1 or CCSID 819) serves as the direct ASCII equivalent, enabling lossless conversion for most characters.
Euro symbol consideration: ISO-8859-1 doesn’t include the Euro symbol (€) because it was standardized before the Euro’s introduction. For mainframe systems using Euro-enabled code pages (1140, 1141, 1390), consider:
- ISO-8859-15 (Latin-9): Includes the Euro symbol and replaces some rarely used characters from ISO-8859-1
- UTF-8: Comprehensive Unicode support including Euro and all international characters
For Japanese code pages (930, 939, 1390, 1399): UTF-8 is the recommended target encoding because ISO-8859-1 cannot represent Japanese characters. Note that conversion from 930 or 939 to UTF-8 is effectively one-way: every character in those code pages maps into UTF-8, but most UTF-8 characters have no mapping back to 930 or 939. Because of this, you can't convert back and forth between UTF-8 and these code pages without loss.
Understanding Db2 LUW code pages
Db2 for Linux, UNIX, and Windows (Db2 LUW) doesn't use CCSIDs the way Db2 for z/OS does.
- Db2 LUW is built on a code page instead of CCSIDs.
- A code page is immutable and defined at database creation time.
- Db2 LUW relies on client/server code page conversion.
- Db2 LUW doesn't have a CCSID catalog column; the platform uses International Components for Unicode (ICU) and operating system locales rather than IBM CCSID definitions.
Code page support: Db2 LUW supports various code pages including UTF-8 (Unicode) for comprehensive character support.
Character compatibility: ISO-8859-1 serves as the direct ASCII counterpart to the character repertoires of IBM-37, IBM-500, and IBM-1047, enabling lossless conversion for Latin-1 characters.
Collation behavior:
- Standard collations are locale-based (for example, en_US)
- Custom 256-byte sequences can simulate EBCDIC sorting for single-byte databases
- Unicode databases use CLDR (Common Locale Data Repository) or UCA (Unicode Collation Algorithm) collations, making EBCDIC simulation more complex
Character display consistency: Foreign characters (non-ASCII) display correctly when data is properly converted during transfer. Without conversion, EBCDIC bytes appear as invalid characters in LUW (for example, 0xC1 represents A in EBCDIC but is invalid in ASCII). Db2’s DRDA protocol and tools like db2move handle automatic conversion.
Code page selection strategy for Amazon RDS for Db2
In this section, we compare code page flexibility on the mainframe with Amazon RDS for Db2.
Mainframe Db2 allows code page definition at multiple levels:
- Subsystem level: Defining CCSIDs (SBCS_CCSID, MIXED_CCSID, and DBCS_CCSID) establishes defaults for objects and applications that don’t explicitly specify a CCSID
- Table space and table level: A CCSID clause during creation sets the default encoding for all character columns unless overridden at the column level
- Column level: A specific CCSID for individual columns (for example, `VARCHAR(50) CCSID 1208` for UTF-8), as in the sketch below
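The following is a minimal, hypothetical Db2 for z/OS DDL sketch of the table-level and column-level options (the table ORDERS is invented, and the column-level CCSID 1208 override requires a Db2 for z/OS level that supports Unicode columns in EBCDIC tables):

```sql
-- Hypothetical DDL: EBCDIC table with one Unicode (UTF-8) column
CREATE TABLE ORDERS (
  ORDER_ID INTEGER NOT NULL,
  NOTE     VARCHAR(50) CCSID 1208  -- column-level override: UTF-8
) CCSID EBCDIC;                    -- table-level default encoding scheme
```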
Db2 LUW, by contrast, defines the code page only at database creation time, and the code page cannot be modified afterward.
Key considerations
When creating an Amazon RDS for Db2 instance, avoid specifying an initial database name during instance creation, because a database created that way defaults to the UTF-8 code set with US territory. Instead, create the database afterward (see the implementation examples that follow) so you can choose the code set, territory, and collation explicitly.
Code page equivalency:
- IBM code page 819 (ISO-8859-1) is the exact equivalent of mainframe code pages 37, 500, 1047, and 273 (excluding the Euro symbol).
- For Euro symbol support, ISO-8859-15 (Latin-9) provides compatibility with code pages 1141 and other Euro-enabled variants.
- Japanese code pages (930, 939, 1390, and 1399) require UTF-8 encoding for proper character representation. However, the Db2 LUW counterpart of Japanese code pages 930 and 939 is IBM-943, and most Japanese customers appear to choose IBM-943 when migrating from those code pages.
Decision framework
Choose the appropriate code page based on your specific migration requirements.
Option 1a: ISO-8859-1 code set
Use ISO-8859-1 when:
- Using only one CCSID (37, 500, or 1047) for all tables and columns in mainframe Db2.
- Zero data truncation is required.
- Exact mainframe sorting order must be preserved. The sort order in Db2 is set at database creation time through the collation parameter.
- Multi-language support is not needed.
- The Euro symbol isn’t required (ISO-8859-1 predates Euro introduction).
Implementation:
Replace <MasterUserName>, <MasterUserPassword>, and <DBNAME> with your actual values.
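The following is a minimal sketch using the Db2 command line processor. The parameter order (database name, code set, territory, page size, collation) is inferred from this post's description of the create_database call; verify the exact signature in the Amazon RDS for Db2 documentation.

```
# Connect to the rdsadmin maintenance database with the master user
db2 connect to rdsadmin user <MasterUserName> using <MasterUserPassword>

# Assumed parameter order: name, code set, territory, page size, collation
db2 "call rdsadmin.create_database('<DBNAME>', 'ISO8859-1', 'US', 32768, 'EBCDIC_819_037')"
```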
Option 1b: ISO-8859-15 code set (Euro support)
Use ISO-8859-15 when:
- Migrating from Euro-enabled Latin mainframe code pages (for example, 1140 and 1141)
- Euro symbol (€) support is required
- Maintaining single-byte character encoding—SBCS (Single Byte Character Set)
- Limited multi-language requirements
Implementation:
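A sketch under the same assumed signature as Option 1a. ISO8859-15 is the Db2 code set name for Latin-9; the territory and collation values here are placeholders, so confirm which territories and collations Amazon RDS for Db2 supports for ISO-8859-15 before choosing.

```
# Assumed signature: name, code set, territory, page size, collation
db2 "call rdsadmin.create_database('<DBNAME>', 'ISO8859-15', 'DE', 32768, 'SYSTEM')"
```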
Collation sequence options:
The fifth parameter specifies the collation sequence. Choose based on your mainframe CCSID:
| Collation value | Target code page | Mainframe CCSID | Description |
| --- | --- | --- | --- |
| EBCDIC_819_037 | ISO-8859-1 (819) | 37 | EBCDIC US English |
| EBCDIC_819_500 | ISO-8859-1 (819) | 500, 1047 | EBCDIC International |
| EBCDIC_850_037 | ASCII Latin (850) | 37 | EBCDIC US English |
| EBCDIC_850_500 | ASCII Latin (850) | 500 | EBCDIC International |
| EBCDIC_932_5026 | Japanese (932) | 5026 | EBCDIC Japanese (Katakana) |
| EBCDIC_932_5035 | Japanese (932) | 5035 | EBCDIC Japanese (Latin) |
| EBCDIC_1252_037 | Windows Latin (1252) | 37 | EBCDIC US English |
| EBCDIC_1252_500 | Windows Latin (1252) | 500 | EBCDIC International |
Selection guide:
- Mainframe CCSID 37: Use EBCDIC_819_037
- Mainframe CCSID 500 or 1047: Use EBCDIC_819_500
- Mainframe CCSID 273: Use EBCDIC_819_500
- Mainframe CCSID 930: Use EBCDIC_932_5026 (Katakana collation)
- Mainframe CCSID 939: Use EBCDIC_932_5035 (Latin collation)
- Euro-enabled CCSIDs: For Latin code pages (1140, 1141), consider ISO-8859-15; for Japanese code pages (1390, 1399), use UTF-8
The create_database call creates a Db2 database with:
- Recommended 32 K page size. The default page size is 8 K.
- ISO-8859-1 code set (equivalent to IBM code page 819)
- US territory
- EBCDIC collation for the specified mainframe code page
Supported territory codes for ISO-8859-1 and ISO-8859-15: AL, AU, AT, BE, BR, CA, CH, CN, DE, DK, ES, ET, FI, GB, ID, IE, IN, IS, IT, JP, KE, KR, Lat, MY, NL, NZ, NO, PH, PT, TW, TZ, US, and ZA
Consult the IBM documentation for valid territory and code page combinations.
Option 2: UTF-8 code page
Use UTF-8 when:
- Multi-language data support is required
- Migrating from Japanese mainframe code pages (930, 939, or 1390)
- Euro symbol (€) and comprehensive international character support needed
- Future proofing for global character handling is important
- Mainframe sorting order preservation is not critical
Important considerations:
- Character expansion: UTF-8 might require more storage space
- Accented vowels (à, é, î), currency symbols (¢, £, ¥), fractions (¼, ½, ¾), and special characters (ß, ¬, µ) require 1 byte in mainframe code pages but 2 bytes in UTF-8
- DDL modifications required: `CHAR` and `VARCHAR` lengths must be adjusted, which is not trivial
Implementation:
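A sketch under the same assumed signature. UTF-8 databases use locale-based or Unicode collations (for example, SYSTEM or IDENTITY) rather than the EBCDIC collations listed earlier.

```
# Assumed signature: name, code set, territory, page size, collation
db2 "call rdsadmin.create_database('<DBNAME>', 'UTF-8', 'US', 32768, 'SYSTEM')"
```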
UTF-8 is supported in all territories. Consult the IBM documentation for available territory options.
Understanding CODEUNITS32 in UTF-8 Db2 databases
When using UTF-8 in Amazon RDS for Db2, mainframe migrations inevitably encounter data truncation issues because of character expansion: some characters grow from 1 byte on the mainframe to 2 or more bytes in UTF-8.
To avoid DDL changes, Db2 provides a mechanism to change the default string measurement from OCTETS to CODEUNITS32 by modifying the STRING_UNITS database configuration parameter.
This change requires an instance restart because STRING_UNITS isn't a dynamic parameter. Tables must be created after the parameter is updated for the new default to take effect.
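On a self-managed Db2 system, the change looks like the following sketch. On Amazon RDS for Db2, database configuration changes might need to go through the rdsadmin procedures or the DB parameter group, so verify the supported mechanism first.

```
# Change the default string-length unit from OCTETS to CODEUNITS32
db2 update db cfg for <DBNAME> using string_units CODEUNITS32

# Tables created after this point interpret CHAR(n)/VARCHAR(n) as n code units
```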
Alternatively, you can specify CODEUNITS32 at the column level when creating objects, without changing the STRING_UNITS parameter.
Practical example: ISO-8859-1 database
Create an ISO-8859-1 database:
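For example (LATINDB is a hypothetical database name, using the same assumed create_database signature as in Option 1a):

```
db2 "call rdsadmin.create_database('LATINDB', 'ISO8859-1', 'US', 32768, 'EBCDIC_819_037')"
```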
Test with sample data:
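A minimal sketch, assuming a single accented Latin-1 character as the test value:

```
db2 connect to LATINDB user <MasterUserName> using <MasterUserPassword>
db2 "create table t1 (c1 char(2))"
db2 "insert into t1 values ('é')"
db2 "select c1, length(c1) as len from t1"   # LENGTH returns the length in bytes
```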
Result:
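Illustrative output, consistent with the observation below:

```
C1  LEN
--- -----------
é             2
```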
Notice that each column uses exactly 2 bytes in the ISO-8859-1 database. Do the same test with a UTF-8 database.
Create a UTF-8 database:
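For example (UTF8DB is a hypothetical name; same assumed signature):

```
db2 "call rdsadmin.create_database('UTF8DB', 'UTF-8', 'US', 32768, 'SYSTEM')"
```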
Test with CODEUNITS32:
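The same test value against a CODEUNITS32 column:

```
db2 connect to UTF8DB user <MasterUserName> using <MasterUserPassword>
db2 "create table t1 (c1 char(2 codeunits32))"
db2 "insert into t1 values ('é')"
db2 "select c1, length(c1) as len from t1"   # LENGTH returns the length in bytes
```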
Result:
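Illustrative output, consistent with the observation below:

```
C1   LEN
---- -----------
é              3
```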
Key observation: Each column uses 3 bytes in the UTF-8 database, despite being defined as CHAR(2 CODEUNITS32). The CODEUNITS32 specification allows for two characters with up to 4 bytes each (8 bytes total allocation).
CODEUNITS32 trade-offs and recommendations
Impact of database level CODEUNITS32:
- Default `CHAR` and `VARCHAR` allocation changes from 1 byte to 4 bytes per character
- Reduces maximum lengths: `CHAR` from 255 to 63 characters, and `VARCHAR` from 32,704 to 8,174 (32 KB page size)
- For databases with 90% ASCII characters, size can expand to 3.8 times the original
We recommend avoiding database-level CODEUNITS32. One option is selective column-level CODEUNITS32, but use it with caution, because it can waste significant space.
For example, CREATE TABLE t1 (c1 CHAR(2 CODEUNITS32)) allocates 8 bytes.
A better approach is to use OCTETS with an adjusted length. OCTETS measures string length in bytes; in non-Unicode code pages, each character maps one-to-one to a byte. `CREATE TABLE t1 (c1 CHAR(4 OCTETS))` allocates exactly 4 bytes.
Best practice: Use CODEUNITS32 only when you're certain that you'll store 3–4 byte characters (for example, East Asian languages). For most international characters, use OCTETS and adjust the length accordingly.
Collation differences: EBCDIC compared to SYSTEM
The two collations sort the same data in markedly different orders:
- EBCDIC collation order: special characters, lowercase letters, uppercase letters, then numerals
- SYSTEM collation order: numerals, uppercase letters, lowercase letters, then special characters
Practical examples
EBCDIC collation test:
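A minimal sketch (table name and values are invented), run against a database created with an EBCDIC collation such as EBCDIC_819_037:

```
db2 "create table sort_test (c1 char(1))"
db2 "insert into sort_test values ('#'), ('a'), ('A'), ('1')"
db2 "select c1 from sort_test order by c1"
```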
Result:
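Illustrative output:

```
C1
--
#
a
A
1
```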
The result shows the expected sort order of special characters, lowercase, uppercase, and numerals.
SYSTEM collation test:
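The same sketch, run against a database created with the default SYSTEM collation:

```
db2 "create table sort_test (c1 char(1))"
db2 "insert into sort_test values ('#'), ('a'), ('A'), ('1')"
db2 "select c1 from sort_test order by c1"
```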
Result:
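Illustrative output:

```
C1
--
1
A
a
#
```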
The result shows the expected sort order of numerals, uppercase, lowercase, and special characters.
Handling unsupported characters in ISO-8859-1
When applications attempt to insert characters not available in ISO-8859-1 (CCSID 819), the database performs silent character substitution.
Example test:
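A minimal sketch against the ISO-8859-1 database (the table name and the Japanese test character are invented):

```
db2 "create table subst_test (c1 char(4))"
db2 "insert into subst_test values ('あ')"
db2 "select hex(c1) from subst_test"
```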
Result:
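Illustrative output, consistent with the explanation below:

```
1A202020
```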
What happened:
- No error or warning was generated
- The Japanese character was replaced by the SUB (substitute) character (0x1A)
- The remaining bytes are filled with spaces (0x20)
Key takeaway: ISO-8859-1 databases cannot store unsupported characters. Unknown characters are silently replaced with 0x1A followed by spaces (0x20) for remaining byte positions.
Ensuring character consistency across platforms
Data migration best practices
Conversion tools: Use Db2 federation (catalog z/OS database in LUW) or tools like db2move/EXPORT with ASC/DEL formats for automatic character conversion.
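For example, a minimal db2move round trip (MYDB and TARGETDB are placeholder database names):

```
# Export all tables from the source database (PC/IXF format)
db2move MYDB export

# Import into the target database; code page conversion happens during import
db2move TARGETDB import
```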
Check the mainframe CCSID configuration
Table-level CCSID:
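A sketch of a catalog query on the Db2 for z/OS side (MYSCHEMA is a placeholder; verify column names against your catalog documentation):

```sql
-- Encoding scheme per table: E = EBCDIC, U = Unicode, A = ASCII
SELECT NAME, CREATOR, ENCODING_SCHEME
FROM SYSIBM.SYSTABLES
WHERE CREATOR = 'MYSCHEMA';
```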
Column-level CCSID:
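A sketch for per-column CCSIDs (MYSCHEMA and MYTABLE are placeholders):

```sql
-- CCSID is 0 for columns without character data
SELECT NAME, COLTYPE, CCSID
FROM SYSIBM.SYSCOLUMNS
WHERE TBCREATOR = 'MYSCHEMA'
  AND TBNAME = 'MYTABLE';
```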
Example result: each character column is listed with its CCSID (for example, 37 for an EBCDIC SBCS column).
Validation and testing
Character testing: Insert test data with international characters (for example, café, niño) on the mainframe, export it, import it into LUW, and verify display consistency using GUI clients such as:
- DBeaver
- DataGrip
- IBM Data Studio
- IBM Database Management Console
Conversion risks: Round-trip conversions (EBCDIC → ASCII → EBCDIC) might lose character variants without exact mappings. Verify the results using IBM’s official CCSID tables.
Conclusion
Code page, collation, and territory parameters in Db2 are immutable after database creation, unlike in many other database systems. This makes the initial selection critical for migration success.
Key considerations:
- ISO-8859-1: Choose for exact mainframe compatibility and zero data loss. Use an Amazon RDS for Db2 defined collation sequence to preserve mainframe sorting behavior
- ISO-8859-15: Consider for Euro symbol support when migrating from Euro-enabled Latin mainframe code pages (1140, 1141), and verify Amazon RDS support
- UTF-8: Select for multi-language support, but be aware of character expansion and potential DDL modifications
- Early planning: Address code page and collation requirements early in the migration cycle to prevent data quality issues
Proper code page selection prevents data truncation, manages character expansion effectively, and helps ensure consistent application behavior across platforms.
Acknowledgements
Thanks to Chiranjeev Mukherjee, Hajime Minagawa, and Akira Shimosako for carefully reviewing this blog post from the mainframe migration perspective and especially for Japanese character sets.