Category Archives: Ruby

Validating 4 bytes UTF-8 characters

When the character set of MySQL column is utf8 and the SQL mode (sql_mode) is not strict mode (i.e. sql_mode does not include STRICT_ALL_TABLES nor STRICT_TRANS_TABLES), setting a character that will be 4 bytes when encoded with utf-8 (such as an emoji like 😁) will truncate the remainder of the characters (with a warning).

To support 4 bytes UTF-8 characters, the columns with utf8mb4 for CHARACTER SET (and utf8mb4_xxx such as utf8mb4_unicode_520_ci, etc for COLLATE) must be used, and the connection character set must also use utf8mb4.

Incidentally, on Rails, if you try to set 4 bytes UTF-8 characters when the character set of MySQL column is utf8, the following error occurs, so the string will not be truncated unnoticed.

An ActiveRecord::StatementInvalid occurred in news#update:

Mysql2::Error: Incorrect string value: 'xF0x9Fx98x80x0Dx0A' for column 'description' at row 1: UPDATE `news` SET `description` = '😀rn' WHERE `news`.`id` = 2
app/controllers/news_controller.rb:98:in `update'

This is because, unless otherwise specified, AbstractMysqlAdapter#configure_connection adds STRICT_ALL_TABLES to the session's SQL_MODE. (NO_AUTO_VALUE_ON_ZERO is also added.)

You can confirm it by doing the following.

  • With the mysql client
    mysql> show variables like 'sql_mode';
    +---------------+------------------------+
    | Variable_name | Value                  |
    +---------------+------------------------+
    | sql_mode      | NO_ENGINE_SUBSTITUTION |
    +---------------+------------------------+
    1 row in set (0.00 sec)
    
  • With the Rails console against the same database
    > con = ActiveRecord::Base.connection
    > con.select_all("SHOW VARIABLES LIKE 'sql_mode'")
       (0.8ms)  SHOW VARIABLES LIKE 'sql_mode'
     => #<ActiveRecord::Result:0x00007fc6ca533728 @columns=["Variable_name", "Value"], @rows=[["sql_mode", "NO_AUTO_VALUE_ON_ZERO,STRICT_ALL_TABLES,NO_ENGINE_SUBSTITUTION"]], @hash_rows=nil, @column_types={}>
    

Workaround

You can convert the CHARACTER SET of the column to utf8mb4 and COLLATE to utf8mb4_xxx and use utf8mb4 for the connection character set, but if you can't convert the column to utf8mb4 for some reason, you'll probably want to reject 4 bytes UTF-8 characters with validation because it's not good to just shut up and truncate the 4 bytes UTF-8 characters and beyond.

The range of Unicode characters that result in 4 bytes when encoded in UTF-8 is U+10000 to U+10FFFF.

On PHP

if (preg_match('/[x{10000}-x{10FFFF}]/u', $s) { /* ... */ }
if (preg_match('/[xF0-xF7][x80-xBF][x80-xBF][x80-xBF]/', $s)) { /* ... */ }
preg_match_all('/[x{10000}-x{10FFFF}]/u', $s, $matches);
// An array of 4-bytes UTF-8 characters is stored in `$matches[0]`.

On Ruby

if /[u{10000}-u{10FFFF}]/ =~ s
  # ...
end
chars = s.scan(/[u{10000}-u{10FFFF}]/)
# An array of 4-bytes UTF-8 characters is stored in `chars`.

Installing Ruby-2.5.0 on CentOS6

Attempting to install Ruby-2.5.0 from the source on CentOS6 causes an error in `make` and it can not be installed.
It is an error because gcc on CentOS6 is old.

$ ./configure --prefix=/opt/ruby-2.5.0 --disable-install-doc
$ make
...(略)
prelude.c: In function ‘prelude_eval’:
prelude.c:204: error: #pragma GCC diagnostic not allowed inside functions
prelude.c:205: error: #pragma GCC diagnostic not allowed inside functions
prelude.c:221: error: #pragma GCC diagnostic not allowed inside functions
トップレベル:
cc1: 警告: unrecognized command line option "-Wno-self-assign"
cc1: 警告: unrecognized command line option "-Wno-constant-logical-operand"
cc1: 警告: unrecognized command line option "-Wno-parentheses-equality"
cc1: 警告: unrecognized command line option "-Wno-tautological-compare"
make: *** [prelude.o] エラー 1

Bug #14234: Failed to build on CentOS 6.9 - Ruby trunk - Ruby Issue Tracking System

It will be fixed on the next release (Ruby-2.5.1), but you can install Ruby-2.5.0 using scl's devtoolset for the time being.

Installing scl devtoolset on CentOS6

Create under /etc/profile.d with the following content(example for devtoolset-4 collection)and enable scl's devtoolset for users after reboot. So you can install Passenger by `passenger-install-apache2-module` or install gems which need native build with Capistrano.

$ cat /etc/profile.d/enabledevtoolset-4.sh
#!/bin/bash
source scl_source enable devtoolset-4

Setting timeout for open-uri

require 'open-uri'
require 'resolv-replace'
require 'timeout'
TIMEOUT = 3
begin
  timeout(TIMEOUT) do
    open(url) do |f|
      # Do something
    end
  end
rescue TimeoutError => e
  # Do something on timeout
rescue => e
  # Do something on other errors
end
  • Interrupt with timeout is implemented by Ruby's Thread and it is not effective againt the process implemented by C, so DNS name resolution cannot be timeout. resolv-replace overwrites the methods to use Ruby's resolve library for DNS name resolution and enables timeout.
  • TimeoutError is not subclass of StandardError, so you have to catch TimeoutError explicitly.

Source: open-uriにtimeoutを設定する方法 | やむにやまれず