When the character set of MySQL column is utf8 and the SQL mode (sql_mode) is not strict mode (i.e. sql_mode does not include STRICT_ALL_TABLES nor STRICT_TRANS_TABLES), setting a character that will be 4 bytes when encoded with utf-8 (such as an emoji like 😁) will truncate the remainder of the characters (with a warning).
To support 4 bytes UTF-8 characters, the columns with utf8mb4 for CHARACTER SET (and utf8mb4_xxx such as utf8mb4_unicode_520_ci, etc for COLLATE) must be used, and the connection character set must also use utf8mb4.
Incidentally, on Rails, if you try to set 4 bytes UTF-8 characters when the character set of MySQL column is utf8, the following error occurs, so the string will not be truncated unnoticed.
An ActiveRecord::StatementInvalid occurred in news#update:
Mysql2::Error: Incorrect string value: 'xF0x9Fx98x80x0Dx0A' for column 'description' at row 1: UPDATE `news` SET `description` = '😀rn' WHERE `news`.`id` = 2
app/controllers/news_controller.rb:98:in `update'
This is because, unless otherwise specified, AbstractMysqlAdapter#configure_connection adds STRICT_ALL_TABLES to the session's SQL_MODE. (NO_AUTO_VALUE_ON_ZERO is also added.)
You can confirm it by doing the following.
- With the mysql client
mysql> show variables like 'sql_mode'; +---------------+------------------------+ | Variable_name | Value | +---------------+------------------------+ | sql_mode | NO_ENGINE_SUBSTITUTION | +---------------+------------------------+ 1 row in set (0.00 sec)
- With the Rails console against the same database
> con = ActiveRecord::Base.connection > con.select_all("SHOW VARIABLES LIKE 'sql_mode'") (0.8ms) SHOW VARIABLES LIKE 'sql_mode' => #<ActiveRecord::Result:0x00007fc6ca533728 @columns=["Variable_name", "Value"], @rows=[["sql_mode", "NO_AUTO_VALUE_ON_ZERO,STRICT_ALL_TABLES,NO_ENGINE_SUBSTITUTION"]], @hash_rows=nil, @column_types={}>
Workaround
You can convert the CHARACTER SET of the column to utf8mb4 and COLLATE to utf8mb4_xxx and use utf8mb4 for the connection character set, but if you can't convert the column to utf8mb4 for some reason, you'll probably want to reject 4 bytes UTF-8 characters with validation because it's not good to just shut up and truncate the 4 bytes UTF-8 characters and beyond.
The range of Unicode characters that result in 4 bytes when encoded in UTF-8 is U+10000 to U+10FFFF.
- https://en.wikipedia.org/wiki/UTF-8#Description
- https://en.wikipedia.org/wiki/Unicode#Code_point_planes_and_blocks
On PHP
if (preg_match('/[x{10000}-x{10FFFF}]/u', $s) { /* ... */ }
if (preg_match('/[xF0-xF7][x80-xBF][x80-xBF][x80-xBF]/', $s)) { /* ... */ }
preg_match_all('/[x{10000}-x{10FFFF}]/u', $s, $matches); // An array of 4-bytes UTF-8 characters is stored in `$matches[0]`.
On Ruby
if /[u{10000}-u{10FFFF}]/ =~ s # ... end
chars = s.scan(/[u{10000}-u{10FFFF}]/) # An array of 4-bytes UTF-8 characters is stored in `chars`.