Choosing the right code page and collation for migration from mainframe Db2 to Amazon RDS for Db2

When migrating from Db2 mainframe (z/OS) to Amazon Relational Database Service (Amazon RDS) for Db2 on Linux, selecting the appropriate code page and collation sequence is critical for ensuring data compatibility. The right choice will help prevent data truncation, character expansion issues, and maintain consistent display of foreign characters (such as accented Latin letters) and sorting behavior.

While Amazon RDS for Db2 defaults to Unicode, you can achieve mainframe compatibility through careful configuration. This post guides you through selecting the appropriate code page for Amazon RDS for Db2 to help ensure seamless data migration.

Code page, collation, and territory parameters in Amazon RDS for Db2 cannot be modified after database creation. Careful selection upfront avoids the need for database recreation and remigration.

Understanding mainframe CCSID and code page

In Db2 for z/OS®, each character is mapped to a number depending upon the CCSID (coded character set identifier) and code page associated with that character. A code page is a numeric identifier for a specific mapping of bytes and is only the byte-to-glyph mapping. An example is code page 037, also known as Extended Binary Coded Decimal Interchange Code (EBCDIC) for the US and Canada. The CCSID is a full identifier describing the character set, encoding, and conversion rules. The CCSID includes code page and locale with conversion metadata. An example is CCSID 0037 (EBCDIC for the US and Canada including conversion rules). Every CCSID implies a code page, but a code page does not imply a complete CCSID. Db2 for z/OS performs character conversion internally and needs CCSID to know exactly how to convert data.

Regional code pages

North American and Western European code pages

IBM-37 (CCSID 37, CP037)
IBM-500 (CCSID 500, CP500)
IBM-1047 (CCSID 1047, CP1047)

German and Austrian code pages

IBM-273 (CCSID 273, CP273): Traditional German and Austrian EBCDIC
IBM-1141 (CCSID 1141, CP1141): German and Austrian EBCDIC with Euro symbol support

Japanese code pages

IBM-930 (CCSID 930): Japanese EBCDIC mixed SBCS and DBCS code page using Latin (Roman) SBCS and JIS X 0208 DBCS
IBM-939 (CCSID 939): Japanese EBCDIC mixed SBCS and DBCS code page using Katakana SBCS and JIS X 0208 DBCS
IBM-1390 (CCSID 1390): It uses Katakana SBCS with Euro (€) symbol support
IBM-1399 (CCSID 1399): Mixed byte character set whose SBCS is defined in IBM 1027 and DBCS is defined in IBM-300 Euro symbol (€)
IBM-5026 (CCSID 5026): Japanese EBCDIC Katakana SBCS code page with uppercase-only Latin letters and Katakana-oriented collation
IBM-5035 (CCSID 5035): Mixed byte character set, whose SBCS is defined in IBM-1027 and DBCS is defined in IBM-300.

These EBCDIC code pages encode character sets optimized for specific regions and languages. While many share similar Latin-1 character repertoires, they differ in the assignment of specific code points to enhance compatibility with local requirements and open systems. On IBM platforms, character CCSIDs (CHAR and VARCHAR) and graphic CCSIDs (GRAPHIC and VARGRAPHIC) are distinct. Both 1399 and 5035 are mixed byte character set.

Regional code page details

North American and Western European:

CCSID 37 (CP037): Traditional EBCDIC for the U.S., Canada, the Netherlands, and Portugal. Optimized for legacy mainframe applications with standard symbol placements. Originated in the 1960s for System/360 mainframes.
CCSID 500 (CP500): International EBCDIC variant for Belgium, French-Canada, and Switzerland. Features adjustments for local symbols (for example, currency symbols like CHF). Shares most code points with CCSID 37 but includes swaps for multilingual support.
CCSID 1047 (CP1047): Open Systems EBCDIC introduced in the 1990s for OS/390 USS (Unix System Services). A variant of CCSID 37 with six specific character swaps to align with POSIX/C standards.
CCSID 1140 (CP1140): Enhanced U.S. and Canada (CP037) that includes the Euro symbol (€).

German and Austrian:

CCSID 273 (CP273): Traditional German and Austrian EBCDIC encoding with German-specific characters such as ä, ö, ü, and ß. Widely used in German-speaking countries for legacy mainframe systems.
CCSID 1141 (CP1141): Enhanced German and Austrian EBCDIC that includes the Euro symbol (€). Introduced to support European monetary union requirements while maintaining German character support.

Japanese:

CCSID 930 (CP930): Japanese EBCDIC mixed SBCS and DBCS code page using Latin (Roman) SBCS and JIS X 0208 DBCS.
CCSID 939 (CP939): Japanese EBCDIC mixed SBCS and DBCS code page using Katakana SBCS and JIS X 0208 DBCS. It supports both SBCS (Katakana, Latin, and digits) and DBCS Kanji. It’s very common on z/OS and older mainframe environments.
CCSID 1390 (CP1390): Japanese EBCDIC mixed SBCS and DBCS code with Euro (€) symbol support.
CCSID 1399 (CP1399): Japanese EBCDIC DBCS graphic CCSOID (no SBCS).
CCSID 5026 (CP5026): Japanese EBCDIC Katakana SBCS code page with uppercase-only Latin letters and a Katakana-oriented collation sequence.
CCSID 5035 (CP5035): Japanese EBCDIC DBCS graphic CCSID (DBCS only).

Migration compatibility and ISO standards

For Latin-based code pages (37, 500, 1047, 273): ISO-8859-1 (Latin-1 or CCSID 819) serves as the direct ASCII equivalent, enabling lossless conversion for most characters.

Euro symbol consideration: ISO-8859-1 doesn’t include the Euro symbol (€) because it was standardized before the Euro’s introduction. For mainframe systems using Euro-enabled code pages (1140, 1141, 1390), consider:

ISO-8859-15 (Latin-9): Includes the Euro symbol and replaces some rarely used characters from ISO-8859-1
UTF-8: Comprehensive Unicode support including Euro and all international characters

For Japanese code pages (930, 939, 1390, 1399): UTF-8 is the recommended target encoding because ISO-8859-1 cannot represent Japanese characters. Note that either 930 or 939 can be converted to UTF-8 because conversion is one direction. If you’re trying to switch back from UTF-8 to either 930 or 939, almost half of the characters in UTF-8 won’t work in 930 and 939. Because of this, you won’t be able to convert back and forth without issues.

Understanding Db2 LUW code pages

Db2 LUW (Linux, Unix, and Windows) doesn’t use CCSIDs the same way Db2 on z/OS uses.

Db2 LUW is built on a code page instead of CCSIDs.
A code page is immutable and defined at the database creation time.
DB2 LUW relies on client/server code page conversion.
Db2 LUW doesn’t have a CCSID catalog column—the platform uses an International Components for Unicode (ICU) and US locales rather than IBM CCSID definitions.

Code page support: Db2 LUW supports various code pages including UTF-8 (Unicode) for comprehensive character support.

Character compatibility: ISO-8859-1 serves as the direct ASCII counterpart to IBM 37, IBM 500, and IBM-1047’s character repertoire, enabling lossless conversion for Latin-1 characters.

Collation behavior:

Standard collations are locale-based (for example, en_US)
Custom 256-byte sequences can simulate EBCDIC sorting for single-byte databases
Unicode databases use CLDR (Common Locale Data Repository) or UCA (Unicode Collation Algorithm) collations, making EBCDIC simulation more complex

Character display consistency: Foreign characters (non-ASCII) display correctly when data is properly converted during transfer. Without conversion, EBCDIC bytes appear as invalid characters in LUW (for example, 0xC1 represents A in EBCDIC but is invalid in ASCII). Db2’s DRDA protocol and tools like db2move handle automatic conversion.

Code page selection strategy for Amazon RDS for Db2

In this section, we review mainframe compared to Amazon RDS code page flexibility.

Mainframe Db2 allows code page definition at multiple levels:

Subsystem level: Defining CCSIDs (SBCS_CCSID, MIXED_CCSID, and DBCS_CCSID) establishes defaults for objects and applications that don’t explicitly specify a CCSID
Tablespace and Table level: A CCSID clause during creation sets the default encoding for all character columns unless overridden at the column level
Column level: Specific CCSID for individual columns (for example, VARCHAR(50) CCSID 1208 for UTF-8)

Db2 defines the code page only at database creation time and cannot be modified afterward.

Key considerations

When creating an Amazon RDS for Db2 instance, avoid specifying the default database name during instance creation, because it defaults to UTF-8 with US territory.

Code page equivalency:

IBM code page 819 (ISO-8859-1) is the exact equivalent of mainframe code pages 37, 500, 1047, and 273 (excluding the Euro symbol).
For Euro symbol support, ISO-8859-15 (Latin-9) provides compatibility with code pages 1141 and other Euro-enabled variants.
Japanese code pages (930, 939, 1390, and 1399) require UTF-8 encoding for proper character representation. However the counterpart of Japanese code pages 930 and 939 in Db2 LUW is IBM-943. It appears that most Japanese customers choose IBM-943 for Japanese code pages 930 and 939.

Decision framework

Choose the appropriate code page based on your specific migration requirements.

Option 1a: ISO-8859-1 Code set

Use ISO-8859-1 when:

Using only one CCSID (37, 500, or 1047) for all tables and columns in mainframe Db2.
Zero data truncation is required.
Exact mainframe sorting order must be preserved. The sorting order in Db2 is enabled at the time of the database creation through the collation parameter.
Multi-language support is not needed.
The Euro symbol isn’t required (ISO-8859-1 predates Euro introduction).

Implementation:

$ db2 connect to rdsadmin user <MasterUserName> using <MasterUserPassword>
$ db2 "call rdsadmin.create_database('<DBNAME>',32768,'ISO-8859-1','US','EBCDIC_819_037')"

Replace <MasterUserName>, <MasterUserPassword>, and <DBNAME> with your actual values.

Option 1b: ISO-8859-15 code set (Euro support)

Use ISO-8859-15 when:

Migrating from Euro-enabled mainframe code pages (1141, and 1390)
Euro symbol (€) support is required
Maintaining single-byte character encoding—SBCS (Single Byte Character Set)
Limited multi-language requirements

Implementation:

$ db2 "call rdsadmin.create_database('<DBNAME>',32768,'ISO-8859-15','US','SYSTEM')"

Collation sequence options:

The fifth parameter specifies the collation sequence. Choose based on your mainframe CCSID:

Collation value	Target code page	Mainframe CCSID	Description
EBCDIC_819_037	ISO-8859-1	37	EBCDIC US English
EBCDIC_819_500	ISO-8859-1	500, 1047	EBCDIC International
EBCDIC_850_037	ASCII Latin	37	EBCDIC US English
EBCDIC_850_500	ASCII Latin	500	EBCDIC International
EBCDIC_932_5026	Japanese	932 (collation 5026)	EBCDIC US English
EBCDIC_932_5035	Japanese	932 (collation 5035)	EBCDIC International
EBCDIC_1252_037	Windows Latin	37	EBCDIC US English
EBCDIC_1252_500	Windows Latin	500	EBCDIC International

Selection guide:

Mainframe CCSID 37: Use EBCDIC_819_037
Mainframe CCSID 500 or 1047: Use EBCDIC_819_500
Mainframe CCSID 273: Use EBCDIC_819_500
Mainframe CCSID 930: Use EBCDIC_932_5026 (Katakana collation)
Mainframe CCSID 939: Use EBCDIC_932_5035 (Latin collation)
Euro-enabled CCSID (1141, 1390, 1399): Consider UTF-8 or ISO-8859-15

The CREATE_DATABASE command creates a Db2 database with:

Recommended 32 K page size. The default page size is 8 K.
ISO-8859-1 code set (equivalent to IBM code page 819)
US territory
EBCDIC collation for the specified mainframe code page

Supported territory codes for ISO-8859-1 and ISO-8859-15: AL, AU, AT, BE, BR, CA, CH, CN, DE, DK, ES, ET, FI, GB, ID, IE, IN, IS, IT, JP, KE, KR, Lat, MY, NL, NZ, NO, PH, PT, TW, TZ, US, and ZA

Consult the IBM documentation for valid territory and code page combinations.

Option 2: UTF-8 code page

Use UTF-8 when:

Multi-language data support is required
Migrating from Japanese mainframe code pages (930, 939, or 1390)
Euro symbol (€) and comprehensive international character support needed
Future proofing for global character handling is important
Mainframe sorting order preservation is not critical

Important considerations:

Character expansion: UTF-8 might require more storage space
- Accented vowels (à, é, î), currency symbols (¢, £, ¥), fractions (¼, ½, ¾), and special characters (ß, ¬, µ) require 1 byte in mainframe code pages but 2 bytes in UTF-8
- DDL modifications required: CHAR and VARCHAR lengths must be adjusted, which is not trivial

Implementation:

$ db2 connect to rdsadmin user <MasterUserName> using <MasterUserPassword>
$ db2 "call rdsadmin.create_database('<DBNAME>',32768,'UTF-8','US','SYSTEM')"

UTF-8 is supported in all territories. Consult the IBM documentation for available territory options.

Understanding CODEUNITS32 in UTF-8 Db2 databases

When using UTF-8 in Amazon RDS for Db2, mainframe migration inevitably encounters data truncation issues because character expansion—some characters grow from 1 byte on mainframe to 2 bytes or more in UTF-8.

To avoid DDL changes, Db2 provides a mechanism to change the default string measurement from OCTETS to CODEUNITS32 by modifying the STRING_UNITS database configuration parameter.

$ db2 connect to rdsadmin user <MasterUserName> using <MasterUserPassword>
$ db2 "call rdsadmin.update_db_param('<DBNAME>','STRING_UNITS', 'CODEUNITS32', 'NO')"

This change requires an instance restart because STRING_UNITS isn’t a dynamic parameter. The DDLs must be created after updating the parameter for this parameter to take effect.

Practical example: ISO-8859-1 database

Alternatively, you can specify CODEUNITS32 at the column level when creating objects without changing the STRING_UNITS parameter.

Create an ISO-8859-1 database:

$ db2 connect to rdsadmin user <MasterUserName> using <MasterUserPassword>
$ db2 "call rdsadmin.create_database('MYDB',32768, 'ISO-8859-1','US','EBCDIC_819_037')"

Test with sample data:

db2 connect to mydb user <MasterUserName> using <MasterUserPassword>
db2 "create table t1 (c1 char(2))"
db2 "insert into t1 values ('ßA')"
db2 "insert into t1 values ('ßB')"
db2 "insert into t1 values ('ßC')"
db2 "insert into t1 values ('¬I')"
db2 "insert into t1 values ('µ5')"
db2 "insert into t1 values ('¼O')"
db2 "insert into t1 values ('¼Q')"
db2 "insert into t1 values ('¼S')"
db2 "select c1, character_length(c1, octets) LENGTH from t1"

Result:

C1 LENGTH          
-- -----------
ßA           2
ßB           2
ßC           2
¬I           2
µ5           2
¼O           2
¼Q           2
¼S           2
 
  8 record(s) selected.

Notice that each column uses exactly 2 bytes in the ISO-8859-1 database. Do the same test with a UTF-8 database.

Create a UTF-8 database:

$ db2 connect to rdsadmin user <MasterUserName> using <MasterUserPassword>
$ db2 "call rdsadmin.create_database('MYUTFDB',32768, 'UTF-8','US','SYSTEM')"

Test with CODEUNITS32:

db2 connect to myutfdb user <MasterUserName> using <MasterUserPassword>
db2 "create table t1 (c1 char(2 CODEUNITS32))"
db2 "insert into t1 values ('ßA')"
db2 "insert into t1 values ('ßB')"
db2 "insert into t1 values ('ßC')"
db2 "insert into t1 values ('¬I')"
db2 "insert into t1 values ('µ5')"
db2 "insert into t1 values ('¼O')"
db2 "insert into t1 values ('¼Q')"
db2 "insert into t1 values ('¼S')"
db2 "select c1, character_length(c1, octets) LENGTH from t1"

Result:

C1 LENGTH          
-- -----------
ßA           3
ßB           3
ßC           3
¬I           3
µ5           3
¼O           3
¼Q           3
¼S           3
 
  8 record(s) selected.

Key observation: Each column uses 3 bytes in the UTF-8 database, despite being defined as CHAR(2 CODEUNITS32). The CODEUNITS32 specification allows for two characters with up to 4 bytes each (8 bytes total allocation).

CODEUNITS32 trade-offs and recommendations

Impact of database level CODEUNITS32:

Default CHAR and VARCHAR allocation changes from 1 byte to 4 bytes per character
Reduces maximum lengths:
- CHAR: from 255 to 63 characters
- VARCHAR: from 32,704 to 8,174 bytes (32 KB page size)
For databases with 90% ASCII characters, size can expand to 3.8 times the original

It’s recommended to avoid database level CODEUNITS32. One option is to use selective column-level CODEUNITS32, but do so with caution, because it can waste significant space.

For example, CREATE TABLE t1 (c1 CHAR(2 CODEUNITS32)) allocates 8 bytes.

A better approach is to use OCTETS with adjusted length. OCTETS is a representation of a byte. It is a one-to-one mapping between a character and a number associated in non-Unicode code pages. CREATE TABLE t1 (c1 CHAR(4 OCTETS)) allocates exactly 4 bytes.

Best practice: Use CODEUNITS32 only when certain that you’ll store 3–4 byte characters (for example, East Asian languages). For most international characters, use and adjust OCTETS length accordingly.

Collation differences: EBCDIC compared to SYSTEM

Sort order comparison.

EBCDIC collation order: Special characters, lowercase letters, uppercase letters, then numerals

SYSTEM collation order: Numerals, uppercase letters, lowercase letters, then special characters

Practical examples

EBCDIC collation test:

db2 "with s(v) as (values '-','0','A','a','ß','¬','µ','¼') select v as VALUE from s order by v"

Result:

VALUE
-----
ß    
¬    
-    
a    
µ    
¼    
A    
0    
  8 record(s) selected.

The result shows the expected sort order of special characters, lowercase, uppercase, and numerals.

SYSTEM collation test:

$ db2 "with s(v) as (values '-','0','A','a','ß','¬','µ','¼') select v as VALUE from s order by v"

Result:

VALUE
-----
-
0
A
a
¬
µ
¼
ß
  8 record(s) selected.

The result shows the expected sort order of numerals, uppercase, lowercase, and special characters.

Handling unsupported characters in ISO-8859-1

When applications attempt to insert characters not available in ISO-8859-1 (CCSID 819), the database performs silent character substitution.

Example test:

db2 "create table t1(c1 char(4))"
db2 "insert into t1 values ('常')" -- Japanese character
db2 "select c1, hex(c1) hex, character_length(c1, octets) length from t1"

Result:

C1   HEX      LENGTH     
---- -------- -----------
     1A202020           4

What happened:

No error or warning was generated
The Japanese character was replaced by SUB (substitute) character (0x1A)
The remaining bytes are filled with spaces (0x20)

Key takeaway: ISO-8859-1 databases cannot store unsupported characters. Unknown characters are silently replaced with 0x1A followed by spaces (0x20) for remaining byte positions.

Ensuring character consistency across platforms

Data migration best practices

Conversion tools: Use Db2 federation (catalog z/OS database in LUW) or tools like db2move/EXPORT with ASC/DEL formats for automatic character conversion.

Check the mainframe CCSID configuration

Table-level CCSID:

SELECT NAME, ENCODING_SCHEME
FROM SYSIBM.SYSTABLES
WHERE NAME = '<TABLE_NAME>' AND CREATOR = '<CREATOR_NAME';
Example Result:
NAME             |ENCODING_SCHEME|
-----------------+---------------+
TABLE_NAME.      |E              | -- E = EBCDIC

Column-level CCSID:

SELECT NAME, CCSID
FROM SYSIBM.SYSCOLUMNS WHERE TBNAME = '<TABLE_NAME>' AND TBCREATOR = '<CREATOR_NAME>';

Example result:

NAME                |CCSID|
--------------------+-----+
COL1.               |   37|
COL2                |    0| -- 0 = subsystem default
COL3                |    0|
COL4                |   37|

Validation and testing

Character testing: Insert test data with international characters (for example, café, niño) on mainframe, export, and import to LUW and verify display consistency using GUI clients such as:

DBeaver
DataGrip
IBM Data Studio
IBM Database Management Console

Conversion risks: Round-trip conversions (EBCDIC → ASCII → EBCDIC) might lose character variants without exact mappings. Verify the results using IBM’s official CCSID tables.

Conclusion

Code page, collation, and territory parameters in Db2 are immutable after database creation—unlike other database systems. This makes the initial selection critical for migration success.Key considerations:

ISO-8859-1: Choose for exact mainframe compatibility and zero data loss. Use an Amazon RDS for Db2 defined collation sequence to preserve mainframe sorting behavior
ISO-8859-15: Consider for Euro symbol support when migrating from Euro-enabled mainframe code pages (1141, 1390) and verify Amazon RDS support
UTF-8: Select for multi-language support, but be aware of character expansion and potential DDL modifications
Early planning: Address code page and collation requirements early in the migration cycle to prevent data quality issues

Proper code page selection prevents data truncation, manages character expansion effectively, and helps ensure consistent application behavior across platforms.

Acknowledgements

Thanks to Chiranjeev Mukherjee, Hajime Minagawa, and Akira Shimosako for carefully reviewing this blog post from the mainframe migration perspective and especially for Japanese character sets.

AWS Database Blog

Choosing the right code page and collation for migration from mainframe Db2 to Amazon RDS for Db2

Understanding mainframe CCSID and code page

Regional code pages

Regional code page details

Migration compatibility and ISO standards

Understanding Db2 LUW code pages

Code page selection strategy for Amazon RDS for Db2

Key considerations

Decision framework

Option 1a: ISO-8859-1 Code set

Option 1b: ISO-8859-15 code set (Euro support)

Option 2: UTF-8 code page

Understanding CODEUNITS32 in UTF-8 Db2 databases

Practical example: ISO-8859-1 database

CODEUNITS32 trade-offs and recommendations

Collation differences: EBCDIC compared to SYSTEM

Practical examples

Handling unsupported characters in ISO-8859-1

Ensuring character consistency across platforms

Data migration best practices

Validation and testing

Conclusion

Acknowledgements

About the authors

Resources

Blog Topics

Follow

Learn

Resources

Developers

Help