SQL Tidbits #3 – Finding Unicode Characters

In SQL Server you can declare string columns as either varchar or nvarchar. One main difference is the characters the column can accept. Nvarchar can accept all unicode characters where varchar can only accept characters in an 8-bit codepage. This means the character Ɛ won’t work with the varchar datatype. Fortunately, and unfortunately, SQL Server will automatically convert unsupported characters when inserting into a varchar field. But the results can be strange.

Starting with a simple table we can see the effects of various actions.

CREATE TABLE SimpleTest (
	AnsiValue varchar(10),
	UnicodeValue nvarchar(10)
);

INSERT INTO SimpleTest
VALUES ('A', 'A'),('B','B');

SELECT * FROM SimpleTest;

unicodecharaters_1

At this point there is no difference between the two fields. Any character that is acceptable in the varchar datatype also works in the nvarchar datatype. But what happens when the character is outside of the varchar acceptable range?

INSERT INTO SimpleTest
VALUES (N'ē', N'ē'),(N'Ɛ', N'Ɛ');

SELECT * FROM SimpleTest;

unicodecharaters_2

Now we can see the problem with the varchar datatype when using Unicode characters. If you look too fast, row 3 might appear the same but the conversion stripped off the bar over the e. Row 4 is even worse. Instead of the actual letter, the conversion results in a question mark. This is what happens to most Unicode characters when converting to varchar. So the good is that you won’t get errors with the implicit conversion. The bad is that you don’t know what characters you’re losing.

An area where errors can happen is outside of SQL Server when trying to write characters from an nvarchar datatype to an ASCII file. There are numerous tools for writing to files so I won’t go into that, but some tools will fail if, and usually only if, they encounter non-ASCII characters. It can be extremely frustrating when trying to write out 200,000 rows and the process fails because 1 row has an unsupported character. There are ways to resolve this. The first way is to switch to UTF-8 or UTF-16 for your file encoding. But we all know that failures happen in the middle of the night, when we least expect them, and you have to just get it working again…right now.

The other way to find unsupported characters may be apparent in the result set showing the side-by-side of AnsiValue and UnicodeValue. The varchar value that SQL Server uses in the conversion will not match the original nvarchar value. Using this we can compare the value to its converted self. If they are not equal then the column contains unsupported characters.

INSERT INTO SimpleTest
VALUES (N'Wērner', N'Wērner'),(N'FOƐ', N'FOƐ');

SELECT * FROM SimpleTest;

unicodecharaters_3

Now to find the rows where the UnicodeValue column contains unsupported characters use the following code.

SELECT * FROM SimpleTest
WHERE UnicodeValue <> CONVERT(NVARCHAR(10), CONVERT(VARCHAR(10), UnicodeValue));

unicodecharaters_4

Notice that the first two rows with standard ASCII characters don’t return. The rest contain values with the unsupported characters we expect.