When UTF-8 Isn't Just UTF-8

November 22, 2025 · by ArnoX



Last week while testing some localized builds, I ran into a familiar but subtle encoding issue: a French .po file that looked perfectly fine in VS Code, but rendered strange characters like Ã© instead of é once loaded into the product.

This post serves as a quick exploration of UTF-8, BOM, and how they impact accented or non-ASCII characters in real localization workflows.

1. The Basics: What UTF-8 Actually Means

UTF-8 (Unicode Transformation Format – 8-bit) is a universal encoding standard that can represent any character from any writing system. Its biggest advantage is backward compatibility with ASCII — the first 128 Unicode characters map directly to ASCII, which means English text looks identical in UTF-8 and ASCII.

But the magic happens with multi-byte characters:
letters like é, ü, or ñ are encoded using multiple bytes (usually two or three).
That's where things get interesting and where encoding mismatches start to appear.
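To make the byte sequences concrete, here is a minimal Python sketch that prints the UTF-8 encoding of a few characters:

```python
# ASCII characters stay one byte; accented Latin letters take two bytes in UTF-8.
for ch in "Aéüñ":
    print(ch, "->", ch.encode("utf-8").hex(" "))
# A  -> 41
# é  -> c3 a9
```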

2. UTF-8 Without Signature vs With Signature

When you save a file in UTF-8, you might notice two options:
UTF-8 (without BOM) and UTF-8 (with BOM).

Let's decode what that really means:

| Type | Description | Byte Order Mark (BOM) | Example (Hex) |
| --- | --- | --- | --- |
| UTF-8 without signature | Pure UTF-8 text. No extra bytes at the start. | None | Starts directly with text bytes |
| UTF-8 with signature | UTF-8 + a "signature" (3 extra bytes: EF BB BF) at the beginning | Yes | EF BB BF + text |

That BOM (Byte Order Mark) is a small invisible header that tells a system,

"Hey, this file uses UTF-8 encoding."

It's not harmful, but not all systems handle it well. Some parsers interpret those three bytes as part of the text, leading to mysterious characters at the start of your file, something every localization engineer has probably seen at least once in .srt, .json, or .csv files.
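Python's built-in `utf-8-sig` codec makes the difference easy to observe. A minimal sketch:

```python
text = "Welcome"

# "utf-8-sig" prepends the three signature bytes (EF BB BF) on encode.
data = text.encode("utf-8-sig")
print(data[:3].hex())        # efbbbf

# A parser that ignores the signature sees an invisible extra character:
naive = data.decode("utf-8")
print(repr(naive))           # '\ufeffWelcome'

# Decoding with "utf-8-sig" strips it transparently:
print(data.decode("utf-8-sig"))  # Welcome
```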

3. Why BOM Matters in Localization Engineering

From a localization workflow perspective, BOM can silently break things:

  • In CAT tools: some TMS parsers may treat BOM bytes as text, creating ghost segments like "ï»¿Welcome".
  • In web apps: JavaScript may misinterpret those bytes, especially if a BOM-encoded JSON file is fetched dynamically.
  • In software builds: resource compilers or scripts might fail to match string IDs because the invisible BOM alters the first few bytes.

That's why many localization engineers prefer UTF-8 without BOM for most assets.
It's cleaner, safer, and consistent across platforms.

However, there are exceptions:
Windows-based environments (like Excel exports or certain .NET resource files) often require a BOM to detect the encoding properly. In other words: BOM is neither good nor bad. It's contextual.
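Since BOM is contextual, one defensive pattern is to read with `utf-8-sig` (which accepts files with or without the signature) and write plain UTF-8 unless a consumer explicitly needs the BOM. A sketch in Python; the function names are illustrative, not from any particular library:

```python
def read_text(path: str) -> str:
    # "utf-8-sig" decodes files with or without the signature;
    # the BOM, if present, is stripped transparently.
    with open(path, encoding="utf-8-sig") as f:
        return f.read()

def write_text(path: str, text: str, with_bom: bool = False) -> None:
    # Opt into the BOM only for consumers that need it (e.g. Excel).
    encoding = "utf-8-sig" if with_bom else "utf-8"
    with open(path, "w", encoding=encoding) as f:
        f.write(text)
```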

4. The "Accent" Problem: Why é Becomes Ã©

Here's the most common symptom of a UTF-8 mismatch:

You open a file and see FranÃ§ais instead of Français.

What's happening?

The file was encoded in UTF-8, but the program displaying it interpreted it as another encoding (usually Latin-1 or Windows-1252).

Let's look closer:

| Character | Correct UTF-8 Bytes | Misread as Latin-1 | Result |
| --- | --- | --- | --- |
| é | C3 A9 | Ã© | Garbled output |
| ç | C3 A7 | Ã§ | Garbled output |
| ñ | C3 B1 | Ã± | Garbled output |

In short:

The bytes are fine, but the interpretation is wrong.
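The symptom is easy to reproduce (and, conveniently, to reverse) in Python, because Latin-1 maps every possible byte to a character:

```python
s = "Français"

# Encode correctly as UTF-8, then decode with the wrong codec:
garbled = s.encode("utf-8").decode("latin-1")
print(garbled)   # FranÃ§ais

# The mistake is lossless, so round-tripping repairs it:
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # Français
```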

5. Lessons for Localization Engineers

Here are some practices I've learned (sometimes the hard way):

  • Check your encoding explicitly. Always confirm whether your file is UTF-8 with or without BOM. Don't simply rely on your editor's default.
  • Test in your target environment. Encoding issues often appear only once files are loaded by a CMS, game engine, or build pipeline.
  • Automate encoding checks. Use small scripts to validate encodings before delivery, for example with the third-party chardet library:

      # Requires the third-party package: pip install chardet
      import chardet

      # Read raw bytes so the detector sees the BOM, if any
      with open('strings.po', 'rb') as f:
          print(chardet.detect(f.read()))

  • Normalize consistently. Ensure all team members (and translators) use the same export format. One rogue Excel sheet in UTF-16 can corrupt your entire batch.
  • Educate non-engineers. Many localization managers or linguists don't realize how a "Save As" option can affect encoding integrity. It's always worth explaining once.
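A BOM check itself needs nothing but the standard library. Here is a small sketch (the function names are my own) that could be wired into a pre-delivery step:

```python
from pathlib import Path

BOM = b"\xef\xbb\xbf"  # the UTF-8 signature bytes

def bom_report(paths):
    """Yield (path, has_bom) pairs, reading only the first 3 bytes of each file."""
    for p in paths:
        with open(p, "rb") as f:
            yield p, f.read(3) == BOM

def strip_bom(path):
    """Rewrite the file without its UTF-8 signature, if it has one."""
    data = Path(path).read_bytes()
    if data.startswith(BOM):
        Path(path).write_bytes(data[len(BOM):])
```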
6. Wrapping Up

Character encoding might sound like an old problem, but in localization, it's a daily routine.

The key takeaway:
UTF-8 without signature is pure UTF-8.
UTF-8 with signature is UTF-8 plus a BOM header.
And knowing the difference can save hours of debugging invisible characters.