Booking.com customers learn the hard way that Unicode is tricky

It's easy to mistake an "l" for a "1" or an "I" with a poorly designed typeface. (Ahem.) Fortunately, modern fonts tend to use a variety of techniques to disambiguate those easily confused alphanumeric characters. But those designs rarely account for ambiguity that results from similarities across different character sets, as a recent phishing campaign targeting Booking.com users demonstrates.

BleepingComputer reported that "the attack, first spotted by security researcher JAMESWT, abuses the Japanese hiragana character 'ん' (Unicode U+3093), which closely resembles the Latin letter sequence '/n' or '/~', at a quick glance in some fonts." The attacker's hope is that people will gloss over the funky character, follow the malicious link, and then fall prey to the malware they're distributing via this campaign.
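To make the trick concrete, here is a minimal Python sketch that lists every non-ASCII character in a URL along with its Unicode name. The URL is a made-up stand-in built on reserved example domains, not the one used in the campaign, and `non_ascii_chars` is a hypothetical helper name.

```python
import unicodedata
from urllib.parse import urlsplit


def non_ascii_chars(url):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(url)
        if ord(ch) > 127
    ]


# Hypothetical lookalike URL: the hiragana character stands in for a slash.
url = "https://example.com\u3093login.attacker.example/reset"

findings = non_ascii_chars(url)
host = urlsplit(url).hostname  # everything before the first real slash

print(findings)
print(host)
```

Because the hiragana character is not a real path separator, everything up to the first genuine slash is the hostname. In this made-up example, the link leads to a host under attacker.example rather than to example.com.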

Unicode has been exploited like this many times before—this is a relatively common way for spammers to make it past email filters, for example, or for particularly dedicated trolls to harass people online despite the prevalence of profanity filters. Yet it remains a difficult problem to solve because text rendering, much like DNS, is more cursed than most people realize. So let's go through a crash course on characters.

Computers originally supported the minimal American Standard Code for Information Interchange—or, as sane people call it, ASCII—standard. That was relatively simple: it allowed computers to deal with the 26 letters of the English alphabet in both their lowercase and uppercase forms, a smattering of critical punctuation, and various control codes that told the computer when to draw a new line, indent text, etc.

But it turns out that even the British Empire couldn't make the English alphabet the only character set on the planet, and some of the people who use other writing systems wanted to use computers, too. That led to the creation of the Unicode standard, which is used to encode characters on every modern device. (Let's not get into the actual encoding being UTF-8 on all sensible systems, or, more specifically, non-Windows systems.)

The Unicode Consortium says that Unicode "can encode up to roughly 1.1 million characters, allowing it to support all of the world’s languages and scripts in a single, universal standard" and that "all modern operating systems, computing environments, programming languages, and applications support the core of the Unicode Standard." So we can have cool things like emojis, punctuation, and non-English letters.
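For the curious, the "roughly 1.1 million" figure falls out of simple arithmetic on the Unicode code space: 17 planes of 65,536 code points each, minus the 2,048 surrogate code points reserved for UTF-16. A quick sketch, with purely illustrative variable names:

```python
import sys

PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536

total_code_points = PLANES * CODE_POINTS_PER_PLANE
surrogates = 0xDFFF - 0xD800 + 1  # reserved for UTF-16, never characters
encodable = total_code_points - surrogates

print(f"{total_code_points:,} code points, {encodable:,} usable for characters")

# Python's own upper limit is the last code point, U+10FFFF.
highest = sys.maxunicode
```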

We can also have attacks like the one targeting Booking.com users, though, and preventing them is non-trivial. An operating system, browser, etc., knows how to handle Unicode characters, but that doesn't mean it can determine when a character is being used deceptively. Sometimes people want to use mixed character sets to communicate effectively; sometimes they just want to make something look cool.
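A rough sketch of why automated detection is awkward: the hypothetical `scripts_used` helper below guesses each letter's script from the first word of its Unicode name (a crude heuristic, since Python's standard library doesn't expose the script property directly). It flags a spoofed word and a perfectly innocent phrase alike.

```python
import unicodedata


def scripts_used(text):
    """Crude guess at the scripts in text, based on Unicode character names."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
                scripts.add(name.split()[0])
    return scripts


spoofed = "p\u0430ypal"            # second letter is Cyrillic, not Latin
innocent = "Tokyo \u6771\u4eac"    # a city name written in two scripts

for sample in (spoofed, innocent):
    found = scripts_used(sample)
    print(sample, sorted(found), "mixed" if len(found) > 1 else "single")
```

Both strings come back as "mixed," which is the problem in a nutshell: a filter strict enough to catch the first would also trip over the second.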

Just to drive home the point about this not being an easy problem to solve: Unicode makes it difficult to do seemingly basic things like counting the number of characters in a given text snippet, for example, or determining whether two characters are visually aligned. That isn't to say that addressing this problem is impossible, but I suspect it's a lot more complicated than most people would expect.
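Here is a small Python illustration of the counting problem, using only the standard library. The same visible "é" can be stored two ways, and the answer to "how long is it?" depends on both the representation and the unit you count in.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

counts = {
    "composed code points": len(composed),
    "decomposed code points": len(decomposed),
    "composed UTF-8 bytes": len(composed.encode("utf-8")),
    "decomposed UTF-8 bytes": len(decomposed.encode("utf-8")),
}

# The two strings render identically but do not compare equal...
same_before = composed == decomposed
# ...until both are normalized to the same form (NFC here).
same_after = unicodedata.normalize("NFC", decomposed) == composed

print(counts, same_before, same_after)
```

A reader sees one character in both cases, yet the counts range from one to three depending on how you ask.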

As for what people can do to avoid falling victim to schemes like this one targeting Booking.com users, my official recommendation is to never read your email or click links. Unless they're to even more thorough explanations of why text rendering (and editing!) is cursed. Then, by all means, click away. Nothing wrong with a little cursed knowledge, or at least that's what I tell myself when I try to go to sleep at night.
