AWS Database Blog

Choosing the right code page and collation for migration from mainframe Db2 to Amazon RDS for Db2

When migrating from Db2 mainframe (z/OS) to Amazon Relational Database Service (Amazon RDS) for Db2 on Linux, selecting the appropriate code page and collation sequence is critical for data compatibility. The right choice helps prevent data truncation and character expansion issues, and it maintains consistent display of foreign characters (such as accented Latin letters) and consistent sorting behavior.

While Amazon RDS for Db2 defaults to Unicode, you can achieve mainframe compatibility through careful configuration. This post guides you through selecting the appropriate code page for Amazon RDS for Db2 to help ensure seamless data migration.

Code page, collation, and territory parameters in Amazon RDS for Db2 cannot be modified after database creation. Careful selection upfront avoids the need for database recreation and remigration.

Understanding mainframe CCSID and code page

In Db2 for z/OS, each character maps to a number according to the CCSID (coded character set identifier) and code page associated with it. A code page is a numeric identifier for a specific byte-to-character mapping, and nothing more. An example is code page 037, also known as Extended Binary Coded Decimal Interchange Code (EBCDIC) for the US and Canada. A CCSID is a fuller identifier that describes the character set, the encoding, and the conversion rules; it combines a code page with locale and conversion metadata. An example is CCSID 0037 (EBCDIC for the US and Canada, including conversion rules). Every CCSID implies a code page, but a code page alone does not determine a complete CCSID. Db2 for z/OS performs character conversion internally and needs the CCSID to know exactly how to convert data.
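For example (a quick illustration, not specific to Amazon RDS), the letter A is byte X'C1' in EBCDIC code page 037 but X'41' in ASCII-based code pages such as ISO-8859-1 and UTF-8. On Db2 LUW, you can inspect stored bytes with the HEX function:

$ db2 "values hex('A')"   # returns 41 on an ASCII/UTF-8 database; EBCDIC CCSID 37 encodes A as C1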

Regional code pages

North American and Western European code pages

  • IBM-37 (CCSID 37, CP037)
  • IBM-500 (CCSID 500, CP500)
  • IBM-1047 (CCSID 1047, CP1047)
  • IBM-1140 (CCSID 1140, CP1140)

German and Austrian code pages

  • IBM-273 (CCSID 273, CP273): Traditional German and Austrian EBCDIC
  • IBM-1141 (CCSID 1141, CP1141): German and Austrian EBCDIC with Euro symbol support

Japanese code pages

  • IBM-930 (CCSID 930): Japanese EBCDIC mixed SBCS and DBCS code page using Katakana SBCS and JIS X 0208 DBCS
  • IBM-939 (CCSID 939): Japanese EBCDIC mixed SBCS and DBCS code page using Latin (Roman) SBCS and JIS X 0208 DBCS
  • IBM-1390 (CCSID 1390): Japanese EBCDIC mixed SBCS/DBCS character CCSID with Euro (€) symbol support
  • IBM-1399 (CCSID 1399): Japanese EBCDIC mixed SBCS and DBCS CCSID using Latin SBCS, with Euro (€) symbol support
  • IBM-5026 (CCSID 5026): Japanese EBCDIC Katakana SBCS code page with uppercase-only Latin letters and Katakana-oriented collation
  • IBM-5035 (CCSID 5035): Japanese EBCDIC mixed SBCS and DBCS code page using Latin SBCS and JIS X 0208 DBCS

These EBCDIC code pages encode character sets optimized for specific regions and languages. While many share similar Latin-1 character repertoires, they differ in the assignment of specific code points to enhance compatibility with local requirements and open systems. On IBM platforms, character CCSIDs (CHAR and VARCHAR) and graphic CCSIDs (GRAPHIC and VARGRAPHIC) are distinct. DBCS-only graphic CCSIDs (such as 300 and 16684, the DBCS components of the mixed Japanese CCSIDs above) are intended for graphic data and cannot represent SBCS characters.

Regional code page details

North American and Western European:

  • CCSID 37 (CP037): Traditional EBCDIC for the U.S., Canada, the Netherlands, and Portugal. Optimized for legacy mainframe applications with standard symbol placements. Originated in the 1960s for System/360 mainframes.
  • CCSID 500 (CP500): International EBCDIC variant for Belgium, Canada (French), and Switzerland. Features adjustments for local symbols (for example, currency symbols like CHF). Shares most code points with CCSID 37 but includes swaps for multilingual support.
  • CCSID 1047 (CP1047): Open Systems EBCDIC introduced in the 1990s for OS/390 USS (Unix System Services). A variant of CCSID 37 with six specific character swaps to align with POSIX/C standards.
  • CCSID 1140 (CP1140): Enhanced U.S. and Canada (CP037) that includes the Euro symbol (€).

German and Austrian:

  • CCSID 273 (CP273): Traditional German and Austrian EBCDIC encoding with German-specific characters such as ä, ö, ü, and ß. Widely used in German-speaking countries for legacy mainframe systems.
  • CCSID 1141 (CP1141): Enhanced German and Austrian EBCDIC that includes the Euro symbol (€). Introduced to support European monetary union requirements while maintaining German character support.

Japanese:

  • CCSID 930 (CP930): Japanese EBCDIC mixed SBCS and DBCS code page using Katakana SBCS and JIS X 0208 DBCS.
  • CCSID 939 (CP939): Japanese EBCDIC mixed SBCS and DBCS code page using Latin (Roman) SBCS and JIS X 0208 DBCS. Both combine SBCS characters (Katakana or Latin, plus digits) with DBCS Kanji, and both are very common on z/OS and older mainframe environments.
  • CCSID 1390 (CP1390): Japanese EBCDIC mixed SBCS and DBCS code page with Euro (€) symbol support.
  • CCSID 1399 (CP1399): Japanese EBCDIC mixed SBCS and DBCS CCSID using Latin SBCS, with Euro (€) symbol support.
  • CCSID 5026 (CP5026): Japanese EBCDIC Katakana SBCS code page with uppercase-only Latin letters and a Katakana-oriented collation sequence.
  • CCSID 5035 (CP5035): Japanese EBCDIC mixed SBCS and DBCS code page using Latin SBCS and JIS X 0208 DBCS.

Migration compatibility and ISO standards

For Latin-based code pages (37, 500, 1047, 273): ISO-8859-1 (Latin-1 or CCSID 819) serves as the direct ASCII equivalent, enabling lossless conversion for most characters.

Euro symbol consideration: ISO-8859-1 doesn’t include the Euro symbol (€) because it was standardized before the Euro’s introduction. For mainframe systems using Euro-enabled code pages (1140, 1141, 1390), consider:

  • ISO-8859-15 (Latin-9): Includes the Euro symbol and replaces some rarely used characters from ISO-8859-1
  • UTF-8: Comprehensive Unicode support including Euro and all international characters

For Japanese code pages (930, 939, 1390, 1399): UTF-8 is the recommended target encoding because ISO-8859-1 cannot represent Japanese characters. Note that conversion from 930 or 939 to UTF-8 works in one direction only: every character in 930 and 939 exists in UTF-8, but most UTF-8 characters have no equivalent in 930 or 939. Because of this, you can't convert back and forth without losing data.
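You can see this asymmetry outside of Db2 with iconv (a sketch assuming a Linux host whose iconv build includes the IBM-939 converter; the encoding name can vary by build, for example IBM939 or IBM-939):

# EBCDIC 939 to UTF-8: every character maps, so this succeeds
$ iconv -f IBM939 -t UTF-8 input.ebc > output.txt

# UTF-8 back to EBCDIC 939: fails (or, with -c, silently drops data)
# for any character outside the 939 repertoire
$ iconv -f UTF-8 -t IBM939 output.txt > roundtrip.ebc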

Understanding Db2 LUW code pages

Db2 LUW (Linux, Unix, and Windows) doesn’t use CCSIDs the same way Db2 on z/OS does.

  • Db2 LUW is built on a code page instead of CCSIDs.
  • A code page is immutable and defined at the database creation time.
  • Db2 LUW relies on client/server code page conversion.
  • Db2 LUW doesn’t have a CCSID catalog column; the platform uses International Components for Unicode (ICU) and operating system locales rather than IBM CCSID definitions.

Code page support: Db2 LUW supports various code pages including UTF-8 (Unicode) for comprehensive character support.

Character compatibility: ISO-8859-1 serves as the direct ASCII counterpart to the character repertoires of IBM-37, IBM-500, and IBM-1047, enabling lossless conversion for Latin-1 characters.

Collation behavior:

  • Standard collations are locale-based (for example, en_US)
  • Custom 256-byte sequences can simulate EBCDIC sorting for single-byte databases
  • Unicode databases use CLDR (Common Locale Data Repository) or UCA (Unicode Collation Algorithm) collations, making EBCDIC simulation more complex

Character display consistency: Foreign characters (non-ASCII) display correctly when data is properly converted during transfer. Without conversion, EBCDIC bytes appear as invalid characters in LUW (for example, 0xC1 represents A in EBCDIC but is invalid in ASCII). Db2’s DRDA protocol and tools like db2move handle automatic conversion.
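As a minimal sketch (table and file names are illustrative), exporting to a delimited file and importing on the target lets Db2 convert the data to the target code page:

# On the source system, connected to the source database
$ db2 "export to t1.del of del select * from t1"

# On the target, connected to the Amazon RDS for Db2 database;
# the file is converted to the database code page during import
$ db2 "import from t1.del of del insert into t1"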

Code page selection strategy for Amazon RDS for Db2

In this section, we compare code page flexibility on the mainframe with Amazon RDS for Db2.

Mainframe Db2 allows code page definition at multiple levels:

  • Subsystem level: Defining CCSIDs (SBCS_CCSID, MIXED_CCSID, and DBCS_CCSID) establishes defaults for objects and applications that don’t explicitly specify a CCSID
  • Table space and table level: A CCSID clause during creation sets the default encoding for all character columns unless overridden at the column level
  • Column level: Specific CCSID for individual columns (for example, VARCHAR(50) CCSID 1208 for UTF-8)
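For example, on Db2 for z/OS a single table can combine the table-level default with an explicit Unicode column (illustrative DDL; table and column names are placeholders):

-- Table-level CCSID clause sets the default encoding scheme
CREATE TABLE ORDERS
  (ORDER_ID  INTEGER NOT NULL,
   NOTE      VARCHAR(50),             -- inherits the table's EBCDIC CCSID
   NOTE_UTF8 VARCHAR(50) CCSID 1208)  -- explicit UTF-8 (Unicode) column
  CCSID EBCDIC;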

Db2 LUW, in contrast, defines the code page only at database creation time; it cannot be modified afterward.

Key considerations

When creating an Amazon RDS for Db2 instance, avoid specifying an initial database name during instance creation, because a database created that way defaults to the UTF-8 code set with the US territory. Instead, create the database afterward with the rdsadmin.create_database stored procedure so that you can choose the code set, territory, and collation.

Code page equivalency:

  • IBM code page 819 (ISO-8859-1) is the exact equivalent of mainframe code pages 37, 500, 1047, and 273 (excluding the Euro symbol).
  • For Euro symbol support, ISO-8859-15 (Latin-9) provides compatibility with code pages 1141 and other Euro-enabled variants.
  • Japanese code pages (930, 939, 1390, and 1399) require UTF-8 encoding for proper character representation. However, the counterpart of Japanese code pages 930 and 939 in Db2 LUW is IBM-943 (a Shift-JIS-based code page), and it appears that most Japanese customers choose IBM-943 when migrating from 930 and 939.

Decision framework

Choose the appropriate code page based on your specific migration requirements.

Option 1a: ISO-8859-1 code set

Use ISO-8859-1 when:

  • Using only one CCSID (37, 500, or 1047) for all tables and columns in mainframe Db2.
  • Zero data truncation is required.
  • Exact mainframe sorting order must be preserved. The sort order in Db2 is set at database creation time through the collation parameter.
  • Multi-language support is not needed.
  • The Euro symbol isn’t required (ISO-8859-1 predates Euro introduction).

Implementation:

$ db2 connect to rdsadmin user <MasterUserName> using <MasterUserPassword>
$ db2 "call rdsadmin.create_database('<DBNAME>',32768,'ISO-8859-1','US','EBCDIC_819_037')"

Replace <MasterUserName>, <MasterUserPassword>, and <DBNAME> with your actual values.
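After the procedure completes, you can verify the settings (a quick check; the database configuration output includes the code set, territory, and collating sequence):

$ db2 connect to <DBNAME> user <MasterUserName> using <MasterUserPassword>
$ db2 get db cfg for <DBNAME> | grep -iE "code set|territory|collating"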

Option 1b: ISO-8859-15 code set (Euro support)

Use ISO-8859-15 when:

  • Migrating from Euro-enabled mainframe code pages (1140 and 1141)
  • Euro symbol (€) support is required
  • Maintaining single-byte character encoding—SBCS (Single Byte Character Set)
  • Limited multi-language requirements

Implementation:

$ db2 "call rdsadmin.create_database('<DBNAME>',32768,'ISO-8859-15','US','SYSTEM')"

Collation sequence options:

The fifth parameter specifies the collation sequence. Choose based on your mainframe CCSID:

Collation value   Target code page   Mainframe CCSID   Description
EBCDIC_819_037    ISO-8859-1         37                EBCDIC US English
EBCDIC_819_500    ISO-8859-1         500, 1047         EBCDIC International
EBCDIC_850_037    ASCII Latin        37                EBCDIC US English
EBCDIC_850_500    ASCII Latin        500               EBCDIC International
EBCDIC_932_5026   Japanese (932)     5026              EBCDIC Japanese (Katakana collation)
EBCDIC_932_5035   Japanese (932)     5035              EBCDIC Japanese (Latin collation)
EBCDIC_1252_037   Windows Latin      37                EBCDIC US English
EBCDIC_1252_500   Windows Latin      500               EBCDIC International

Selection guide:

  • Mainframe CCSID 37: Use EBCDIC_819_037
  • Mainframe CCSID 500 or 1047: Use EBCDIC_819_500
  • Mainframe CCSID 273: Use EBCDIC_819_500
  • Mainframe CCSID 930: Use EBCDIC_932_5026 (Katakana collation)
  • Mainframe CCSID 939: Use EBCDIC_932_5035 (Latin collation)
  • Euro-enabled CCSIDs: For 1140 or 1141, consider ISO-8859-15 or UTF-8; for the Japanese 1390 or 1399, use UTF-8

The CREATE_DATABASE command creates a Db2 database with:

  • Recommended 32 K page size. The default page size is 8 K.
  • ISO-8859-1 code set (equivalent to IBM code page 819)
  • US territory
  • EBCDIC collation for the specified mainframe code page

Supported territory codes for ISO-8859-1 and ISO-8859-15: AL, AU, AT, BE, BR, CA, CH, CN, DE, DK, ES, ET, FI, GB, ID, IE, IN, IS, IT, JP, KE, KR, Lat, MY, NL, NZ, NO, PH, PT, TW, TZ, US, and ZA

Consult the IBM documentation for valid territory and code page combinations.

Option 2: UTF-8 code page

Use UTF-8 when:

  • Multi-language data support is required
  • Migrating from Japanese mainframe code pages (930, 939, or 1390)
  • Euro symbol (€) and comprehensive international character support needed
  • Future proofing for global character handling is important
  • Mainframe sorting order preservation is not critical

Important considerations:

  • Character expansion: UTF-8 might require more storage space
    • Accented vowels (à, é, î), currency symbols (¢, £, ¥), fractions (¼, ½, ¾), and special characters (ß, ¬, µ) require 1 byte in mainframe code pages but 2 bytes in UTF-8
    • DDL modifications required: CHAR and VARCHAR lengths must be adjusted, which is not trivial
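You can measure the expansion for any character directly, using the same CHARACTER_LENGTH function as the tests later in this post (run against a UTF-8 database):

$ db2 "values character_length('é', octets)"   # returns 2 in a UTF-8 database, 1 in ISO-8859-1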

Implementation:

$ db2 connect to rdsadmin user <MasterUserName> using <MasterUserPassword>
$ db2 "call rdsadmin.create_database('<DBNAME>',32768,'UTF-8','US','SYSTEM')"

UTF-8 is supported in all territories. Consult the IBM documentation for available territory options.

Understanding CODEUNITS32 in UTF-8 Db2 databases

When using UTF-8 in Amazon RDS for Db2, a mainframe migration inevitably encounters data truncation issues because of character expansion: some characters grow from 1 byte on the mainframe to 2 or more bytes in UTF-8.

To avoid DDL changes, Db2 provides a mechanism to change the default string measurement from OCTETS to CODEUNITS32 by modifying the STRING_UNITS database configuration parameter.

$ db2 connect to rdsadmin user <MasterUserName> using <MasterUserPassword>
$ db2 "call rdsadmin.update_db_param('<DBNAME>','STRING_UNITS', 'CODEUNITS32', 'NO')" 

This change requires an instance restart because STRING_UNITS isn’t a dynamic parameter. The DDL must be run after updating the parameter for it to take effect; existing columns keep their original string units.
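After the restart, you can confirm the new value through the SYSIBMADM.DBCFG administrative view (a quick check; parameter names in this view are lowercase):

$ db2 connect to <DBNAME> user <MasterUserName> using <MasterUserPassword>
$ db2 "select name, value from sysibmadm.dbcfg where name = 'string_units'"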

Alternatively, you can specify CODEUNITS32 at the column level when creating objects, without changing the STRING_UNITS parameter.

Practical example: ISO-8859-1 database

Create an ISO-8859-1 database:

$ db2 connect to rdsadmin user <MasterUserName> using <MasterUserPassword>
$ db2 "call rdsadmin.create_database('MYDB',32768, 'ISO-8859-1','US','EBCDIC_819_037')" 

Test with sample data:

db2 connect to mydb user <MasterUserName> using <MasterUserPassword>
db2 "create table t1 (c1 char(2))"
db2 "insert into t1 values ('ßA')"
db2 "insert into t1 values ('ßB')"
db2 "insert into t1 values ('ßC')"
db2 "insert into t1 values ('¬I')"
db2 "insert into t1 values ('µ5')"
db2 "insert into t1 values ('¼O')"
db2 "insert into t1 values ('¼Q')"
db2 "insert into t1 values ('¼S')"
db2 "select c1, character_length(c1, octets) LENGTH from t1"

Result:

C1 LENGTH          
-- -----------
ßA           2
ßB           2
ßC           2
¬I           2
µ5           2
¼O           2
¼Q           2
¼S           2
 
  8 record(s) selected.

Notice that each value uses exactly 2 bytes in the ISO-8859-1 database. Now run the same test with a UTF-8 database.

Create a UTF-8 database:

$ db2 connect to rdsadmin user <MasterUserName> using <MasterUserPassword>
$ db2 "call rdsadmin.create_database('MYUTFDB',32768, 'UTF-8','US','SYSTEM')"

Test with CODEUNITS32:

db2 connect to myutfdb user <MasterUserName> using <MasterUserPassword>
db2 "create table t1 (c1 char(2 CODEUNITS32))"
db2 "insert into t1 values ('ßA')"
db2 "insert into t1 values ('ßB')"
db2 "insert into t1 values ('ßC')"
db2 "insert into t1 values ('¬I')"
db2 "insert into t1 values ('µ5')"
db2 "insert into t1 values ('¼O')"
db2 "insert into t1 values ('¼Q')"
db2 "insert into t1 values ('¼S')"
db2 "select c1, character_length(c1, octets) LENGTH from t1"

Result:

C1 LENGTH          
-- -----------
ßA           3
ßB           3
ßC           3
¬I           3
µ5           3
¼O           3
¼Q           3
¼S           3
 
  8 record(s) selected.

Key observation: Each value uses 3 bytes in the UTF-8 database, despite the column being defined as CHAR(2 CODEUNITS32). The CODEUNITS32 specification allows for two characters of up to 4 bytes each (8 bytes total allocation).

CODEUNITS32 trade-offs and recommendations

Impact of database level CODEUNITS32:

  • Default CHAR and VARCHAR allocation changes from 1 byte to 4 bytes per character
  • Reduces maximum lengths:
    • CHAR: from 255 to 63 characters
    • VARCHAR: from 32,704 bytes to 8,174 characters (32 KB page size)
  • For databases with 90% ASCII characters, size can expand to 3.8 times the original

It’s recommended to avoid database-level CODEUNITS32. One option is selective column-level CODEUNITS32, but use it with caution, because it can waste significant space.

For example, CREATE TABLE t1 (c1 CHAR(2 CODEUNITS32)) allocates 8 bytes.

A better approach is to use OCTETS with an adjusted length. An octet is 1 byte; in non-Unicode code pages there is a one-to-one mapping between a character and a byte. CREATE TABLE t1 (c1 CHAR(4 OCTETS)) allocates exactly 4 bytes.

Best practice: Use CODEUNITS32 only when you’re certain you’ll store 3–4 byte characters (for example, East Asian languages). For most international characters, use OCTETS and adjust the length accordingly.
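To see how a given column definition was recorded, you can query the catalog (a sketch assuming Db2 11.1 or later, where SYSCAT.COLUMNS includes the STRINGUNITS and STRINGUNITSLENGTH columns):

# LENGTH shows allocated bytes (8 for CHAR(2 CODEUNITS32));
# STRINGUNITSLENGTH shows the declared length in string units (2)
$ db2 "select colname, typename, length, stringunitslength, stringunits from syscat.columns where tabname = 'T1'"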

Collation differences: EBCDIC compared to SYSTEM

The two collations sort characters in different orders.

EBCDIC collation order: Special characters, lowercase letters, uppercase letters, then numerals

SYSTEM collation order: Numerals, uppercase letters, lowercase letters, then special characters

Practical examples

EBCDIC collation test:

db2 "with s(v) as (values '-','0','A','a','ß','¬','µ','¼') select v as VALUE from s order by v"

Result:

VALUE
-----
ß    
¬    
-    
a    
µ    
¼    
A    
0    
  8 record(s) selected.

The result shows the expected sort order of special characters, lowercase, uppercase, and numerals.

SYSTEM collation test:

$ db2 "with s(v) as (values '-','0','A','a','ß','¬','µ','¼') select v as VALUE from s order by v"

Result:

VALUE
-----
-
0
A
a
¬
µ
¼
ß
  8 record(s) selected.

The result shows the expected sort order of numerals, uppercase, lowercase, and special characters.

Handling unsupported characters in ISO-8859-1

When applications attempt to insert characters not available in ISO-8859-1 (CCSID 819), the database performs silent character substitution.

Example test:

db2 "create table t1(c1 char(4))"
db2 "insert into t1 values ('常')" -- Japanese character
db2 "select c1, hex(c1) hex, character_length(c1, octets) length from t1"

Result:

C1   HEX      LENGTH     
---- -------- -----------
     1A202020           4

What happened:

  • No error or warning was generated
  • The Japanese character was replaced by SUB (substitute) character (0x1A)
  • The remaining bytes are filled with spaces (0x20)

Key takeaway: ISO-8859-1 databases cannot store unsupported characters. Unknown characters are silently replaced with 0x1A followed by spaces (0x20) for remaining byte positions.
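After loading data, you can scan for the substitution character to find values that were silently replaced (illustrative; CHR(26) produces the 0x1A SUB character):

$ db2 "select count(*) from t1 where locate(chr(26), c1) > 0"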

Ensuring character consistency across platforms

Data migration best practices

Conversion tools: Use Db2 federation (catalog the z/OS database in Db2 LUW) or tools like db2move or EXPORT with ASC/DEL formats for automatic character conversion.
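As a sketch of the federation route (node, host, and database names are placeholders; connecting to Db2 for z/OS from LUW requires the appropriate Db2 Connect entitlement):

# Catalog the z/OS subsystem from a Db2 LUW client or gateway
$ db2 catalog tcpip node zosnode remote mainframe.example.com server 446
$ db2 catalog dcs database zosdb as zosdb
$ db2 catalog database zosdb as zosdb at node zosnode authentication dcs

# Connect and export for transfer to Amazon RDS for Db2
$ db2 connect to zosdb user <user> using <password>
$ db2 "export to t1.del of del select * from t1"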

Check the mainframe CCSID configuration

Table-level CCSID:

SELECT NAME, ENCODING_SCHEME
FROM SYSIBM.SYSTABLES
WHERE NAME = '<TABLE_NAME>' AND CREATOR = '<CREATOR_NAME>';

Example result:

NAME             |ENCODING_SCHEME|
-----------------+---------------+
TABLE_NAME       |E              | -- E = EBCDIC

Column-level CCSID:

SELECT NAME, CCSID
FROM SYSIBM.SYSCOLUMNS
WHERE TBNAME = '<TABLE_NAME>' AND TBCREATOR = '<CREATOR_NAME>';

Example result:

NAME                |CCSID|
--------------------+-----+
COL1                |   37|
COL2                |    0| -- 0 = subsystem default
COL3                |    0|
COL4                |   37|

Validation and testing

Character testing: Insert test data with international characters (for example, café, niño) on the mainframe, export it, import it to LUW, and verify display consistency using GUI clients such as:

  • DBeaver
  • DataGrip
  • IBM Data Studio
  • IBM Database Management Console

Conversion risks: Round-trip conversions (EBCDIC → ASCII → EBCDIC) might lose character variants without exact mappings. Verify the results using IBM’s official CCSID tables.

Conclusion

Code page, collation, and territory parameters in Db2 are immutable after database creation, unlike in many other database systems. This makes the initial selection critical for migration success.

Key considerations:

  • ISO-8859-1: Choose for exact mainframe compatibility and zero data loss. Use an EBCDIC collation sequence defined in Amazon RDS for Db2 to preserve mainframe sorting behavior
  • ISO-8859-15: Consider for Euro symbol support when migrating from Euro-enabled mainframe code pages (1141, 1390) and verify Amazon RDS support
  • UTF-8: Select for multi-language support, but be aware of character expansion and potential DDL modifications
  • Early planning: Address code page and collation requirements early in the migration cycle to prevent data quality issues

Proper code page selection prevents data truncation, manages character expansion effectively, and helps ensure consistent application behavior across platforms.

Acknowledgements

Thanks to Chiranjeev Mukherjee, Hajime Minagawa, and Akira Shimosako for carefully reviewing this blog post from the mainframe migration perspective, especially the Japanese character set details.


About the authors

Vikram S Khatri


Vikram is a Sr. DBE for Amazon RDS for Db2. Vikram has over 20 years of experience in Db2. He enjoys developing new products from the ground up. In his spare time, he practices meditation and enjoys listening to podcasts.

Sumit Kumar


Sumit is a Senior Solutions Architect at AWS and enjoys solving complex problems. He has been helping customers across various industries to build and design their workloads on the AWS Cloud. He enjoys cooking, playing chess, and spending time with his family.

Rajib Sarkar


Rajib is a Senior Database Engineer for Amazon RDS for Db2. Rajib has over 20 years of Db2 experience.

Kshitij Sanghoi


Kshitij is a Sr. Software Development Engineer and has over 15 years of IT experience including more than 11 years at AWS.