All posts by Pistolfly

Software Engineer in Tokyo, Japan

Validating 4-byte UTF-8 characters

2020-04-26 Pistolfly

When the character set of a MySQL column is utf8 and the SQL mode (sql_mode) is not strict (i.e. sql_mode includes neither STRICT_ALL_TABLES nor STRICT_TRANS_TABLES), storing a character that takes 4 bytes when encoded in UTF-8 (such as an emoji like 😁) truncates the string at that character: the 4-byte character and everything after it are dropped, with only a warning.

To support 4-byte UTF-8 characters, the column must use utf8mb4 as its CHARACTER SET (with a utf8mb4_xxx COLLATE such as utf8mb4_unicode_520_ci), and the connection character set must also be utf8mb4.

Incidentally, in Rails, trying to store a 4-byte UTF-8 character in a MySQL column whose character set is utf8 raises the following error, so the string is never truncated unnoticed.

An ActiveRecord::StatementInvalid occurred in news#update:
Mysql2::Error: Incorrect string value: '\xF0\x9F\x98\x80\x0D\x0A' for column 'description' at row 1: UPDATE `news` SET `description` = '😀\r\n' WHERE `news`.`id` = 2
app/controllers/news_controller.rb:98:in `update'

This is because, unless otherwise specified, AbstractMysqlAdapter#configure_connection adds STRICT_ALL_TABLES to the session's SQL_MODE. (NO_AUTO_VALUE_ON_ZERO is also added.)

You can confirm it by doing the following.

With the mysql client

mysql> show variables like 'sql_mode';
+---------------+------------------------+
| Variable_name | Value                  |
+---------------+------------------------+
| sql_mode      | NO_ENGINE_SUBSTITUTION |
+---------------+------------------------+
1 row in set (0.00 sec)

With the Rails console against the same database

> con = ActiveRecord::Base.connection
> con.select_all("SHOW VARIABLES LIKE 'sql_mode'")
   (0.8ms)  SHOW VARIABLES LIKE 'sql_mode'
 => #<ActiveRecord::Result:0x00007fc6ca533728 @columns=["Variable_name", "Value"], @rows=[["sql_mode", "NO_AUTO_VALUE_ON_ZERO,STRICT_ALL_TABLES,NO_ENGINE_SUBSTITUTION"]], @hash_rows=nil, @column_types={}>
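The difference between the two sessions comes down to whether a strict flag appears in the sql_mode value. As a minimal sketch (the helper name `strict_sql_mode?` is made up for illustration and is not part of Rails or mysql2), the check described above could look like this:

```ruby
# Hypothetical helper: true when the given sql_mode string enables strict mode,
# i.e. it contains STRICT_ALL_TABLES or STRICT_TRANS_TABLES.
def strict_sql_mode?(sql_mode)
  modes = sql_mode.split(",").map { |mode| mode.strip.upcase }
  modes.include?("STRICT_ALL_TABLES") || modes.include?("STRICT_TRANS_TABLES")
end

# Value seen from the mysql client in the example above
puts strict_sql_mode?("NO_ENGINE_SUBSTITUTION")
# Value seen from the Rails console in the example above
puts strict_sql_mode?("NO_AUTO_VALUE_ON_ZERO,STRICT_ALL_TABLES,NO_ENGINE_SUBSTITUTION")
```

The first call prints false and the second prints true, which matches the behavior described: truncation with a warning in the mysql client session, an error in the Rails session.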

Workaround

You can convert the column's CHARACTER SET to utf8mb4 and its COLLATE to utf8mb4_xxx, and use utf8mb4 for the connection character set. If you can't convert the column to utf8mb4 for some reason, you'll probably want to reject 4-byte UTF-8 characters with validation, because silently truncating the string at the first 4-byte character is not acceptable.

The range of Unicode characters that result in 4 bytes when encoded in UTF-8 is U+10000 to U+10FFFF.
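A quick way to convince yourself of this range is to encode the boundary code points and count the bytes. The following is only a small sketch; `utf8_length` is a name chosen here for illustration.

```ruby
# Return the number of bytes a single Unicode code point occupies in UTF-8.
def utf8_length(codepoint)
  [codepoint].pack("U").bytesize
end

puts utf8_length(0xFFFF)    # => 3 (last code point below the range)
puts utf8_length(0x10000)   # => 4 (first code point in the range)
puts utf8_length(0x10FFFF)  # => 4 (last valid Unicode code point)
puts "\u{1F601}".bytesize   # => 4 (the emoji 😁)
```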

On PHP

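// Match by code point (requires the /u modifier).
if (preg_match('/[\x{10000}-\x{10FFFF}]/u', $s)) { /* ... */ }

// Match by raw bytes: a 4-byte lead byte followed by three continuation bytes.
if (preg_match('/[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]/', $s)) { /* ... */ }

preg_match_all('/[\x{10000}-\x{10FFFF}]/u', $s, $matches);
// An array of the 4-byte UTF-8 characters is stored in `$matches[0]`.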

On Ruby

if /[\u{10000}-\u{10FFFF}]/ =~ s
  # ...
end

chars = s.scan(/[\u{10000}-\u{10FFFF}]/)
# An array of the 4-byte UTF-8 characters is stored in `chars`.
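Putting the Ruby pattern to work as a validation, here is a minimal sketch in plain Ruby. The constant `FOUR_BYTE_CHARS`, the helper `utf8mb3_errors`, and the message text are all hypothetical names for illustration, and the input is assumed to be a UTF-8 encoded string. In a Rails application the same check could be moved into a model validation.

```ruby
# Characters that need 4 bytes in UTF-8 (U+10000 to U+10FFFF).
FOUR_BYTE_CHARS = /[\u{10000}-\u{10FFFF}]/

# Hypothetical helper: returns an array of error messages,
# empty when the value contains no 4-byte UTF-8 characters.
def utf8mb3_errors(value)
  offending = value.scan(FOUR_BYTE_CHARS).uniq
  return [] if offending.empty?

  ["contains characters that cannot be stored in a utf8 column: #{offending.join(' ')}"]
end

puts utf8mb3_errors("plain text").inspect
puts utf8mb3_errors("Hello \u{1F600}").inspect
```

Reporting the offending characters, rather than only a yes/no result, makes it easier for the user to see what to remove.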

Windows

Change file timestamp on Windows

2018-10-13 Pistolfly

Run Set-ItemProperty with PowerShell.

Change last modified datetime

> Set-ItemProperty "<PATH TO FILE>" -Name LastWriteTime -Value "<DATETIME STRING>"

Change creation datetime

> Set-ItemProperty "<PATH TO FILE>" -Name CreationTime -Value "<DATETIME STRING>"

The DATETIME STRING given to -Value appears to accept any standard date and time string that .NET can parse, such as "2018/06/01 12:27:59".
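For comparison, the last-modified half of this can also be done from a script. The sketch below uses Ruby's `File.utime` rather than PowerShell, so it is a different technique from the one above. The helper name `set_mtime` is made up, the date format is assumed to be the one from the example, and the creation time is not handled.

```ruby
require "time"
require "tempfile"

# Hypothetical helper: set the last-modified time of `path` from a
# "YYYY/MM/DD HH:MM:SS" string, leaving the access time unchanged.
def set_mtime(path, datetime_string)
  mtime = Time.strptime(datetime_string, "%Y/%m/%d %H:%M:%S")
  File.utime(File.atime(path), mtime, path)
  File.mtime(path)
end

# Demonstrate on a temporary file so no real file is touched.
Tempfile.create("timestamp-demo") do |file|
  puts set_mtime(file.path, "2018/06/01 12:27:59")
end
```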

Emacs

Display the character code of the character at the cursor position on Emacs

2018-10-12 Pistolfly

C-x = (M-x what-cursor-position)

The Unicode code point of the character at the cursor position is displayed in the minibuffer.
For example, if you place the cursor on the letter "あ" in a Shift_JIS file and press 'C-x =', the following is displayed in the minibuffer.

Char: あ (12354, #o30102, #x3042, file ...) point=1 of 2 (0%) column=0

`12354, #o30102, #x3042` are the decimal, octal, and hexadecimal notations of the Unicode code point of "あ".

C-u C-x = (M-x describe-char)

Displays detailed information about the character at the cursor position in a split window.
For example, if you place the cursor on the letter "あ" in a Shift_JIS file and press 'C-u C-x =', the following is displayed in the split window.

             position: 1 of 2 (0%), column: 0
            character: あ (displayed as あ) (codepoint 12354, #o30102, #x3042)
    preferred charset: japanese-jisx0208 (JISX0208.1983/1990 Japanese Kanji: ISO-IR-87)
code point in charset: 0x2422
               script: kana
               syntax: w  which means: word
             category: .:Base, H:2-byte Hiragana, L:Left-to-right (strong), c:Chinese, h:Korean, j:Japanese, |:line breakable
             to input: type "C-x 8 RET 3042" or "C-x 8 RET HIRAGANA LETTER A"
          buffer code: #xE3 #x81 #x82
            file code: #x82 #xA0 (encoded by coding system japanese-shift-jis-dos)
              display: terminal code #xE3 #x81 #x82

`code point in charset: 0x2422` represents the code point of "あ" in character set JIS X 0208,
`buffer code: #xE3 #x81 #x82` represents the encoding in the buffer (UTF-8),
`file code: #x82 #xA0` represents the encoding in the file (Shift_JIS).
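The same numbers can be reproduced outside Emacs, which is handy for checking what a given value means. This is only a sketch in Ruby, assuming its built-in Shift_JIS and EUC-JP converters. The JIS X 0208 value is derived from EUC-JP, which stores the two JIS X 0208 bytes with the high bit set.

```ruby
ch = "\u3042" # HIRAGANA LETTER A ("あ")

# Unicode code point in decimal, octal, and hexadecimal
code = ch.ord
puts format("%d, #o%o, #x%X", code, code, code)

# Format an array of bytes the way Emacs prints them
hex = ->(bytes) { bytes.map { |b| format("#x%02X", b) }.join(" ") }

puts "buffer code: #{hex.call(ch.encode('UTF-8').bytes)}"
puts "file code:   #{hex.call(ch.encode('Shift_JIS').bytes)}"

# JIS X 0208 code point: clear the high bit of each EUC-JP byte
jis = ch.encode("EUC-JP").bytes.map { |b| b & 0x7F }
puts format("code point in charset: 0x%02X%02X", *jis)
```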

Windows

How to `tail` on Windows

2018-10-11 Pistolfly

On Windows PowerShell

> Get-Content <PATH> -Wait -Tail 1000

is equivalent to:

$ tail -f -n 1000 <PATH>

on *nix.

("1000" in the above example is the number of lines to display.)
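If neither command is available, the "last N lines" part is easy to approximate in a script. The following Ruby sketch (the name `last_lines` is made up) covers only what `-Tail` / `-n` does. It reads the whole file into memory and does not keep following the file the way `-Wait` / `-f` does.

```ruby
require "tempfile"

# Hypothetical helper: return the last `count` lines of the file at `path`.
def last_lines(path, count)
  File.readlines(path, chomp: true).last(count)
end

# Demonstrate on a temporary five-line file.
Tempfile.create("tail-demo") do |file|
  file.puts((1..5).map { |i| "line #{i}" })
  file.flush
  puts last_lines(file.path, 2)
end
```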

Ruby

Installing Ruby-2.5.0 on CentOS6

2018-02-14 Pistolfly

Attempting to build Ruby-2.5.0 from source on CentOS6 fails with an error in `make`, so it cannot be installed.
The error occurs because the gcc on CentOS6 is too old.

$ ./configure --prefix=/opt/ruby-2.5.0 --disable-install-doc
$ make
...(omitted)
prelude.c: In function ‘prelude_eval’:
prelude.c:204: error: #pragma GCC diagnostic not allowed inside functions
prelude.c:205: error: #pragma GCC diagnostic not allowed inside functions
prelude.c:221: error: #pragma GCC diagnostic not allowed inside functions
At top level:
cc1: warning: unrecognized command line option "-Wno-self-assign"
cc1: warning: unrecognized command line option "-Wno-constant-logical-operand"
cc1: warning: unrecognized command line option "-Wno-parentheses-equality"
cc1: warning: unrecognized command line option "-Wno-tautological-compare"
make: *** [prelude.o] Error 1

Bug #14234: Failed to build on CentOS 6.9 - Ruby trunk - Ruby Issue Tracking System

It will be fixed in the next release (Ruby-2.5.1), but for the time being you can install Ruby-2.5.0 using scl's devtoolset.

Installing scl devtoolset on CentOS6

Create a file under /etc/profile.d with the following content (example for the devtoolset-4 collection) to enable scl's devtoolset for users after a reboot. With this in place you can also install Passenger with `passenger-install-apache2-module`, or install gems that need a native build with Capistrano.

$ cat /etc/profile.d/enabledevtoolset-4.sh
#!/bin/bash
source scl_source enable devtoolset-4

Linux

Setting the system time zone on Ubuntu (16.04)

2018-01-17 Pistolfly

Use `timedatectl`.

List available time zones

$ timedatectl list-timezones

Set the system time zone

$ sudo timedatectl set-timezone <time zone>

Example

$ sudo timedatectl set-timezone Asia/Tokyo

Mac

Show logs on macOS (10.12 Sierra or later)

2017-11-08 Pistolfly

Overview

Use `log`.

Streaming (like tail command)

`log stream`

Search past logs

`log show`

See `man log` for details.

Example

cron

$ log stream --info --predicate 'process == "cron"'

$ log show --info --predicate 'process == "cron"' --start '2017-05-25'

postfix

$ log stream --info --predicate '(process == "smtp") || (process == "smtpd")'

$ log show --info --predicate '(process == "smtp") || (process == "smtpd")' --start '2017-05-25'

If you don't know which process name to specify, run

$ log show --info --start '2017-11-08' | grep 'xxx'

and guess the process name from the output.

Mac

Restart services installed with MacPorts

2017-11-08 Pistolfly

You can restart services installed with MacPorts using `port reload <portname>`.

$ port help load
...
If you want to restart a daemon, you can use port reload, which is a convenience wrapper around port unload followed by a short delay and port load.
...

Example

Apache

$ sudo port reload apache2

MySQL

$ sudo port reload mysql56-server

Linux

Set the system hostname on Ubuntu

2017-11-08 Pistolfly

$ sudo hostnamectl set-hostname <hostname>

Linux, Mac

Where are `crontab -e` settings saved?

2017-10-11 Pistolfly

Linux

/var/spool/cron

macOS, Mac OS X

/var/at/tabs
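If a script needs to pick the right directory, the two locations above can be selected by platform. This is just a sketch: the helper name `crontab_spool_dir` is made up, and it encodes only the two paths listed in this post, so other systems or distributions that use a different layout are not covered.

```ruby
require "rbconfig"

# Hypothetical helper: map the host OS to the crontab directory listed above.
# Returns nil for platforms this post does not cover.
def crontab_spool_dir(host_os = RbConfig::CONFIG["host_os"])
  case host_os
  when /darwin/ then "/var/at/tabs"     # macOS, Mac OS X
  when /linux/  then "/var/spool/cron"  # Linux
  end
end

puts crontab_spool_dir.inspect
```

Reading these directories usually requires root privileges, so `crontab -l` remains the simpler way to view your own entries.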

Developer Blog