When UTF-8 Isn't Just UTF-8

November 22, 2025 · by ArnoX



Last week while testing some localized builds, I ran into a familiar but subtle encoding issue: a French .po file that looked perfectly fine in VS Code, but rendered strange characters like Ã© instead of é once loaded into the product.

This post serves as a quick exploration of UTF-8, BOM, and how they impact accented or non-ASCII characters in real localization workflows.

1. The Basics: What UTF-8 Actually Means

UTF-8 (Unicode Transformation Format – 8-bit) is a universal encoding standard that can represent any character from any writing system. Its biggest advantage is backward compatibility with ASCII — the first 128 Unicode characters map directly to ASCII, which means English text looks identical in UTF-8 and ASCII.

But the magic happens with multi-byte characters:
letters like é, ü, or ñ are encoded using multiple bytes (usually two or three).
That's where things get interesting and where encoding mismatches start to appear.
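To make the byte sequences concrete, here is a minimal Python sketch that prints the UTF-8 encoding of a few characters:

```python
# ASCII characters stay one byte; accented Latin letters take two bytes in UTF-8.
for ch in "Aéüñ":
    print(ch, "->", ch.encode("utf-8").hex(" "))
# A  -> 41
# é  -> c3 a9
```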

2. UTF-8 Without Signature vs With Signature

When you save a file in UTF-8, you might notice two options:
UTF-8 (without BOM) and UTF-8 (with BOM).

Let's decode what that really means:

| Type | Description | Byte Order Mark (BOM) | Example (Hex) |
| --- | --- | --- | --- |
| UTF-8 without signature | Pure UTF-8 text. No extra bytes at the start. | None | Starts directly with text bytes |
| UTF-8 with signature | UTF-8 + a "signature" (3 extra bytes: EF BB BF) at the beginning | Yes | EF BB BF + text |

That BOM (Byte Order Mark) is a small invisible header that tells a system,

"Hey, this file uses UTF-8 encoding."

It's not harmful, but not all systems handle it well. Some parsers interpret those three bytes as part of the text, leading to mysterious characters at the start of your file, something every localization engineer has probably seen at least once in .srt, .json, or .csv files.
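Python's built-in `utf-8-sig` codec makes the difference easy to observe. A minimal sketch:

```python
text = "Welcome"

# "utf-8-sig" prepends the three signature bytes (EF BB BF) on encode.
data = text.encode("utf-8-sig")
print(data[:3].hex())        # efbbbf

# A parser that ignores the signature sees an invisible extra character:
naive = data.decode("utf-8")
print(repr(naive))           # '\ufeffWelcome'

# Decoding with "utf-8-sig" strips it transparently:
print(data.decode("utf-8-sig"))  # Welcome
```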

3. Why BOM Matters in Localization Engineering

From a localization workflow perspective, BOM can silently break things:

  • In CAT tools: some TMS parsers may treat BOM bytes as text, creating ghost segments like "ï»¿Welcome".
  • In web apps: JavaScript may misinterpret those bytes, especially if a BOM-encoded JSON file is fetched dynamically.
  • In software builds: resource compilers or scripts might fail to match string IDs because the invisible BOM alters the first few bytes.

That's why many localization engineers prefer UTF-8 without BOM for most assets.
It's cleaner, safer, and consistent across platforms.

However, there are exceptions:
Windows-based environments (like Excel exports or certain .NET resource files) often require a BOM to detect the encoding properly. In other words: BOM is neither good nor bad. It's contextual.
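Since BOM is contextual, one defensive pattern is to read with `utf-8-sig` (which accepts files with or without the signature) and write plain UTF-8 unless a consumer explicitly needs the BOM. A sketch in Python; the function names are illustrative, not from any particular library:

```python
def read_text(path: str) -> str:
    # "utf-8-sig" decodes files with or without the signature;
    # the BOM, if present, is stripped transparently.
    with open(path, encoding="utf-8-sig") as f:
        return f.read()

def write_text(path: str, text: str, with_bom: bool = False) -> None:
    # Opt into the BOM only for consumers that need it (e.g. Excel).
    encoding = "utf-8-sig" if with_bom else "utf-8"
    with open(path, "w", encoding=encoding) as f:
        f.write(text)
```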

4. The "Accent" Problem: Why é Becomes Ã©

Here's the most common symptom of a UTF-8 mismatch:

You open a file and see FranÃ§ais instead of Français.

What's happening?

The file was encoded in UTF-8, but the program displaying it interpreted it as another encoding (usually Latin-1 or Windows-1252).

Let's look closer:

| Character | Correct UTF-8 Bytes | Misread as Latin-1 | Result |
| --- | --- | --- | --- |
| é | C3 A9 | Ã© | Garbled output |
| ç | C3 A7 | Ã§ | Garbled output |
| ñ | C3 B1 | Ã± | Garbled output |

In short:

The bytes are fine, but the interpretation is wrong.
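The symptom is easy to reproduce (and, conveniently, to reverse) in Python, because Latin-1 maps every possible byte to a character:

```python
s = "Français"

# Encode correctly as UTF-8, then decode with the wrong codec:
garbled = s.encode("utf-8").decode("latin-1")
print(garbled)   # FranÃ§ais

# The mistake is lossless, so round-tripping repairs it:
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # Français
```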

5. Lessons for Localization Engineers

Here are some practices I've learned (sometimes the hard way):

  • Check your encoding explicitly. Always confirm whether your file is UTF-8 with or without BOM. Don't simply rely on your editor's default.
  • Test in your target environment. Encoding issues often appear only once files are loaded by a CMS, game engine, or build pipeline.
  • Automate encoding checks. Use small scripts to validate encodings before delivery, for example with the third-party chardet library:

      # Requires the third-party package: pip install chardet
      import chardet

      # Read raw bytes so the detector sees the BOM, if any
      with open('strings.po', 'rb') as f:
          print(chardet.detect(f.read()))

  • Normalize consistently. Ensure all team members (and translators) use the same export format. One rogue Excel sheet in UTF-16 can corrupt your entire batch.
  • Educate non-engineers. Many localization managers or linguists don't realize how a "Save As" option can affect encoding integrity. It's always worth explaining once.
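A BOM check itself needs nothing but the standard library. Here is a small sketch (the function names are my own) that could be wired into a pre-delivery step:

```python
from pathlib import Path

BOM = b"\xef\xbb\xbf"  # the UTF-8 signature bytes

def bom_report(paths):
    """Yield (path, has_bom) pairs, reading only the first 3 bytes of each file."""
    for p in paths:
        with open(p, "rb") as f:
            yield p, f.read(3) == BOM

def strip_bom(path):
    """Rewrite the file without its UTF-8 signature, if it has one."""
    data = Path(path).read_bytes()
    if data.startswith(BOM):
        Path(path).write_bytes(data[len(BOM):])
```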
6. Wrapping Up

Character encoding might sound like an old problem, but in localization, it's a daily routine.

The key takeaway:
UTF-8 without signature is pure UTF-8.
UTF-8 with signature is UTF-8 plus a BOM header.
And knowing the difference can save hours of debugging invisible characters.