How to work with non-BMP characters

ECMAScript and Duktape support for non-BMP

ECMAScript standard strings are 16-bit only

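ECMAScript strings are sequences of 16-bit code units, so the standard itself has no direct representation for non-BMP characters: all codepoints are strictly 16-bit. Non-BMP characters are intended to be represented using surrogate pairs, i.e. two 16-bit code units per character.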
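
The surrogate pair arithmetic comes from UTF-16. As a minimal sketch, the hypothetical helper `toSurrogatePair()` below (not part of ECMAScript or Duktape) splits a codepoint above U+FFFF into its high and low surrogates:

```javascript
// Hypothetical helper: split a non-BMP codepoint (U+10000..U+10FFFF)
// into a UTF-16 surrogate pair, returned as a two-character string.
function toSurrogatePair(codepoint) {
    if (codepoint < 0x10000 || codepoint > 0x10FFFF) {
        throw new RangeError('not a non-BMP codepoint: ' + codepoint);
    }
    var offset = codepoint - 0x10000;     // 20-bit value
    var high = 0xD800 + (offset >> 10);   // top 10 bits
    var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
    return String.fromCharCode(high, low);
}

var pair = toSurrogatePair(0x1F50D);
// pair.charCodeAt(0) is 0xD83D, pair.charCodeAt(1) is 0xDD0D
```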

ES2015 RegExp patterns with the u flag support non-BMP characters by interpreting the string data as UTF-16.
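
As an illustration of the standard behavior, assuming an engine that implements the ES2015 `u` flag (this sketch makes no claim about which Duktape versions do), `.` matches one 16-bit code unit without the flag and one whole codepoint with it:

```javascript
var magnifyingGlass = '\uD83D\uDD0D';  // U+1F50D as a surrogate pair

// Without the u flag, '.' matches a single 16-bit code unit, so the
// two-unit string does not match a one-character pattern.
var matchesWithoutFlag = /^.$/.test(magnifyingGlass);

// With the u flag, the pair is interpreted as one UTF-16 codepoint.
// The constructor form is used so that engines without the flag fail
// at runtime rather than at parse time.
var matchesWithFlag = new RegExp('^.$', 'u').test(magnifyingGlass);
```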

ES2015 String.prototype.normalize() also has special handling for non-BMP characters (again interpreting the string as UTF-16).

Duktape strings support up to 32-bit codepoints

Duktape represents strings in an extended UTF-8 format which allows both the arbitrary 16-bit codepoints required by ECMAScript and extended codepoints covering the full 32-bit range. Arbitrary byte sequences (which are invalid UTF-8) are also allowed.

As a result, Duktape supports characters in the non-BMP range directly.
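
To make the byte-level difference concrete, the sketch below encodes single codepoints using the standard UTF-8 bit layout. The helper names `encodeUtf8()` and `toHex()` are hypothetical. The sketch stops at U+10FFFF rather than covering Duktape's extended range, and it deliberately does not reject surrogate codepoints so that the direct and surrogate pair representations can be compared:

```javascript
// Hypothetical helper: encode one codepoint (0..0x10FFFF) as an array of
// UTF-8 bytes. Surrogates (U+D800..U+DFFF) are encoded like any other
// 3-byte codepoint, which strict UTF-8 would forbid.
function encodeUtf8(cp) {
    if (cp < 0x80) {
        return [ cp ];
    } else if (cp < 0x800) {
        return [ 0xC0 | (cp >> 6), 0x80 | (cp & 0x3F) ];
    } else if (cp < 0x10000) {
        return [ 0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F) ];
    } else if (cp <= 0x10FFFF) {
        return [ 0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F) ];
    }
    throw new RangeError('codepoint out of range for this sketch');
}

// Hypothetical helper: format a byte array as lowercase hex.
function toHex(bytes) {
    return bytes.map(function (b) {
        return (b < 16 ? '0' : '') + b.toString(16);
    }).join('');
}

// U+1F50D encoded directly as one 4-byte sequence.
var direct = toHex(encodeUtf8(0x1F50D));  // 'f09f948d'

// The same character as a surrogate pair: two 3-byte sequences.
var surrogates = toHex(encodeUtf8(0xD83D).concat(encodeUtf8(0xDD0D)));  // 'eda0bdedb48d'
```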

Main approaches for dealing with non-BMP characters

The main choice is between using surrogate pairs and using Duktape's extended UTF-8 directly.

Using surrogate pairs

For example, to represent LEFT-POINTING MAGNIFYING GLASS U+1F50D:

// http://www.russellcottrell.com/greek/utilities/surrogatepaircalculator.htm
var magnifyingGlass = '\uD83D\uDD0D';

print(magnifyingGlass.length);  // prints 2
print(Duktape.enc('hex', magnifyingGlass));  // prints eda0bdedb48d, surrogate codepoints eda0bd edb48d

Note in particular that the \u escape accepts exactly 4 hex digits, so a longer codepoint is silently split, which may be misleading:

// This gets parsed as U+1F50 followed by ASCII 'D'.
var magnifyingGlass = '\u1F50D';

If you want your C code to see UTF-8, you'll need to encode/decode the surrogate pairs as appropriate. It's probably best to write helpers that convert in each direction.
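
The conversion such a helper performs can be sketched at the byte level. The example below is written in ECMAScript for consistency with the other examples and operates on plain byte arrays; a C helper would apply the same logic to the string data it reads from Duktape. The function name `cesu8ToUtf8()` is hypothetical, and the sketch assumes well-formed input (it does not validate continuation bytes):

```javascript
// Hypothetical helper: rewrite surrogate pairs encoded as two 3-byte
// sequences (ED A0..AF xx, ED B0..BF xx) into one 4-byte UTF-8 sequence.
// All other bytes are copied through unchanged, including unpaired
// surrogates.
function cesu8ToUtf8(bytes) {
    var out = [];
    var i = 0;
    while (i < bytes.length) {
        var isHigh = bytes[i] === 0xED && (bytes[i + 1] & 0xF0) === 0xA0;
        var isLowNext = bytes[i + 3] === 0xED && (bytes[i + 4] & 0xF0) === 0xB0;
        if (isHigh && isLowNext && i + 5 < bytes.length) {
            // Decode both 3-byte sequences back to 16-bit surrogates.
            var high = ((bytes[i] & 0x0F) << 12) |
                       ((bytes[i + 1] & 0x3F) << 6) | (bytes[i + 2] & 0x3F);
            var low = ((bytes[i + 3] & 0x0F) << 12) |
                      ((bytes[i + 4] & 0x3F) << 6) | (bytes[i + 5] & 0x3F);
            // Combine the pair and re-encode as a 4-byte sequence.
            var cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
            out.push(0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                     0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
            i += 6;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    return out;
}

var utf8 = cesu8ToUtf8([ 0xed, 0xa0, 0xbd, 0xed, 0xb4, 0x8d ]);
// utf8 is [ 0xf0, 0x9f, 0x94, 0x8d ]
```

The opposite direction (splitting each 4-byte sequence into two 3-byte surrogate sequences) follows the same pattern in reverse.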

Using Duktape UTF-8

This approach is convenient for C code, because strings can be expressed directly as UTF-8 with no conversion or dealing with surrogate pairs.

One limitation is that there is no ECMAScript syntax for non-BMP characters, so you can't use them in literals. There are a few workarounds:

// Decode from Duktape JX format which supports UTF-8.
var magnifyingGlass = Duktape.dec('jx', '"\\U0001f50d"');
print(magnifyingGlass.length);  // prints 1
print(Duktape.enc('hex', magnifyingGlass));  // prints f09f948d, direct UTF-8 for U+1F50D

// Enter UTF-8 data directly as hex.
var magnifyingGlass = Duktape.dec('hex', 'f09f948d');
print(magnifyingGlass.length);
print(Duktape.enc('hex', magnifyingGlass));

// Enter UTF-8 data into a buffer and coerce to string.
var magnifyingGlass = String(Duktape.Buffer(new Uint8Array([ 0xf0, 0x9f, 0x94, 0x8d ])));
print(magnifyingGlass.length);
print(Duktape.enc('hex', magnifyingGlass));

// Use String.fromCharCode(); since Duktape 1.2.0 fromCharCode() has default
// non-standard behavior (which can be disabled) to accept non-BMP codepoints.
var magnifyingGlass = String.fromCharCode(0x1f50d);
print(magnifyingGlass.length);
print(Duktape.enc('hex', magnifyingGlass));
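
For comparison, on an engine that follows the standard strictly, `String.fromCharCode()` truncates each argument to 16 bits, so the last workaround is not portable. The sketch below assumes such an engine with ES2015 `String.fromCodePoint()` available; with Duktape's default non-standard behavior enabled, the first result would differ:

```javascript
// Standard behavior: the argument is reduced modulo 2^16, so 0x1f50d
// becomes U+F50D rather than U+1F50D.
var truncated = String.fromCharCode(0x1f50d);

// ES2015 String.fromCodePoint() accepts the full codepoint but returns
// it as a surrogate pair (two 16-bit code units).
var asPair = String.fromCodePoint(0x1f50d);
```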