Not everything is UTF-8

Raphaël Gomès
2020-06-05

Over the past few weeks I've helped a new developer get started with both Mercurial and Rust, exposing them to somewhat niche subjects that they've had (understandably) little experience with.

One of them is the encoding (or lack thereof) in Mercurial and how it affects how we write code in both Python and Rust. As easy as it was to explain the issue to said developer, in the few instances of asking around for help on implementation details (mostly to get information about what had already been done and what I needed to do myself) I've noticed that not everyone I'd interacted with outside of our circle of VCS developers even understood the problem I was trying to solve.

Please note that I am not pointing fingers or accusing anyone of being disingenuous: just about everyone I talked to was very much trying to help me and to understand what it was that I wanted to solve in the first place. I usually don't have that much trouble explaining things to people in those situations, so I figured this warranted a full blog post.

The core issue

There Ain’t No Such Thing As Plain Text

This is a quote from Joel Spolsky, most notably known as the co-founder and (until recently) CEO of Stack Overflow. It's from an article of his from 2003 called The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Read that one first and then come back, because it covers a lot of the "general not-VCS-related" encoding stuff that serves as a basis for the rest of this post, and it is still relevant today.

In version control software like Mercurial, we have to make no assumptions about what the contents of tracked files are and their encoding. For all we know, file foo could be a binary file, a latin-1 file, or even a mixed encoding file: it is a very real and relevant need for a VCS to be able to track and manipulate data without assuming it to be text (of any encoding).

Take the following example:

$ hg init test-repo
$ cd test-repo
$ echo -n "Raphaël Gomès" > foo  # assuming UTF-8 default
$ hg commit -Am "UTF-8"
$ iconv -f UTF-8 -t WINDOWS-1252 foo > foo2
$ mv foo2 foo
$ hg commit -Am "WINDOWS-1252"

Here, we create a new empty repository, create the (UTF-8) foo file containing my name, commit it, then convert it from UTF-8 to WINDOWS-1252, then commit that.

Running HGPLAIN= hg export here (HGPLAIN= ensures you are not customizing output with a separate diff tool; export is like git show) will show the correct bytes in each "half" of the diff if your terminal encoding is set to UTF-8 or CP1252: no bytes are lost by Mercurial. Even without changing encodings in a commit, simply using an encoding other than UTF-8, like KOI-8, would be unusable if the diff algorithm were not encoding-agnostic. Because the bytes are sent as-is by Mercurial, all the user has to do is have a terminal that has the right encoding, and everything will be fine: nowhere did the user need to provide encoding information.
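To make the byte-level view concrete, here is a minimal Rust sketch of what the two committed versions of foo look like. The constant names and the common_prefix_len helper are made up for illustration and have nothing to do with Mercurial's actual diff code; the point is only that comparing the two versions never requires knowing which encoding either one is in.

```rust
/// "Raphaël Gomès" encoded as UTF-8: `ë` and `è` take two bytes each.
const NAME_UTF8: &[u8] = b"Rapha\xc3\xabl Gom\xc3\xa8s";

/// The same name encoded as Windows-1252: one byte per character.
const NAME_CP1252: &[u8] = b"Rapha\xebl Gom\xe8s";

/// Hypothetical helper: length of the common prefix of two byte slices.
/// It compares raw bytes and never asks which encoding they are in.
fn common_prefix_len(old: &[u8], new: &[u8]) -> usize {
    old.iter().zip(new.iter()).take_while(|(a, b)| a == b).count()
}
```

The two versions share the five bytes of "Rapha" and diverge at the first accented character, which is all a byte-oriented diff needs to know.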

But forget binary files for a minute, their diffs are usually useless compared to a hexdump and we could also use LFS for them, right? Couldn't users just convert the rest of their repositories to UTF-8 and be done with it? I think that every developer including myself would be much happier if they didn't have to consider multiple encodings and that text were UTF-8 everywhere... but the world is unfortunately more complicated than that.

Say you're designing a new VCS from scratch in Rust or, in my case, rewriting core parts of a VCS in Rust: which type do you use to manipulate file contents? If your answer was String, you've just disqualified any file that isn't UTF-8 from being tracked by your VCS at any point in the history. That means that anyone converting from Mercurial to your shiny new system will lose at least part of their history if not all of it: for example, you can't convert the nginx repo losslessly because early revisions used ISO/IEC 8859-5, not to mention any binary or mixed-encoding files (common in translation files).

What type do you use to represent a file path? If your answer was String, you've made valid UNIX and Windows MBCS paths impossible to represent in your software. If your answer was PathBuf (or OsString), good guess, but it is also wrong in our use-case: file paths tracked by Mercurial need to be abstracted away from the current OS, otherwise you open yourself up to normalization and cross-OS/cross-FS compatibility issues that stem from the distributed nature of Mercurial.
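As a rough sketch of both points, the snippet below shows String refusing non-UTF-8 file contents, and a tracked path stored as raw bytes instead of an OS-specific type. RepoPath and load_as_string are hypothetical names invented for this example, and the `/` separator is an assumption of the sketch; this is not a description of Mercurial's real Rust types.

```rust
/// Trying to hold file contents in a `String` fails as soon as the bytes
/// are not valid UTF-8 (for example a Windows-1252 or binary file).
fn load_as_string(raw: Vec<u8>) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(raw)
}

/// Hypothetical OS-independent tracked path: raw bytes, with `/` assumed
/// as the separator regardless of the platform the code runs on.
#[derive(Debug, Clone, PartialEq, Eq)]
struct RepoPath(Vec<u8>);

impl RepoPath {
    fn new(bytes: &[u8]) -> Self {
        RepoPath(bytes.to_vec())
    }

    /// Split the path on `/` without ever decoding the components.
    fn components(&self) -> Vec<&[u8]> {
        self.0.split(|&b| b == b'/').collect()
    }
}
```

A path such as b"dir/caf\xe9.txt" (a Latin-1 file name) cannot be a String, but it fits in RepoPath and splits into its two components without any decoding step.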

EDIT 2020-06-09: The unusual reality is: most of our output has to be mixed-encoding. As mentioned in https://www.mercurial-scm.org/wiki/EncodingStrategy, hg log --patch will contain internal strings in local encoding to mark fields, UTF-8 metadata, and file contents in an unknown encoding.

Whatever the user puts in, the user gets back. It is their responsibility to have a compatible codepage/terminal encoding.

An ecosystem issue

I will be using Rust as the reference language, but this applies to all programmers of all languages, from embedded to web developers. Most of the time you might not have to take encoding into account because you're interacting with only UTF-8, as you have for the past 10 years: if that's the case, I'm happy for you.

But if you're doing anything that may handle text (or data) of unknown origin, I urge you to ask yourself "should there be a bytes API?". Too many times I've stumbled across a library that provides interesting functionality but assumes everything to be UTF-8 when there is no real need for it.

I think part of the reason is that Rust is one of the few languages that actually handles string types correctly. String, OsString, CString all play a distinct role that is needed to properly represent strings: String is for UTF-8 data, OsString for strings in your OS's representation (that may not be UTF-8), and CString for compatibility with C. This last one could die in theory in a world where C didn't exist, but Free Pascal didn't win so here we are. Because Rust makes it easy to properly handle UTF-8 data through String, developers are empowered to... sometimes do the wrong thing: in my opinion this is absolutely not a flaw in Rust, but merely a side-effect of how misunderstood encoding issues are. The decision not to have dedicated types and APIs for bytestrings in the Rust stdlib was probably made for the same reason as many others: to keep the stdlib minimal.
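Here is a small sketch of how those three types behave at their boundaries, using only stdlib calls. The function names are invented for the example, and the last function is gated on Unix because building an OsString from arbitrary bytes goes through a platform-specific extension trait.

```rust
use std::ffi::{CString, OsString};

/// `CString` appends a trailing NUL and rejects interior NUL bytes, but it
/// does not require UTF-8: any other byte value is accepted.
fn to_c_bytes(raw: &[u8]) -> Option<Vec<u8>> {
    CString::new(raw).ok().map(|c| c.as_bytes_with_nul().to_vec())
}

/// An `OsString` can always be built from a `str`; converting back to
/// `String` only succeeds when the OS string is valid Unicode.
fn os_round_trip(s: &str) -> Option<String> {
    OsString::from(s).into_string().ok()
}

/// Unix only: an `OsString` can wrap arbitrary bytes, UTF-8 or not.
#[cfg(unix)]
fn os_from_raw(raw: &[u8]) -> OsString {
    use std::os::unix::ffi::OsStrExt;
    std::ffi::OsStr::from_bytes(raw).to_os_string()
}
```

None of these is a general-purpose bytestring: CString cannot contain NUL bytes, and OsString is tied to the platform the program runs on.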

Even well-known, widely used crates like regex or clap, made by programmers that definitely understand the underlying issue, did not have a non-String interface until a few versions in, and only got one because an issue was opened (regex#85, clap#262). There probably are other reasons why this feature wasn't implemented earlier, but to me this underlines the lack of attention that this problem receives.

Please, look at your crates/packages/gems/whathaveyou and try to think for a minute if that UTF-8/Unicode restriction is really necessary.

Bytestring formatting

Because "There Ain’t No Such Thing As Plain Text", we do a lot of bytestring manipulation in Mercurial; in Python that would be b"this is a bytestring!", and in Rust you would use a Vec<u8> or maybe the bstr crate.

The initial question I had for the people I mentioned at the beginning of the article was as follows: is there a crate that allows me to do bytestring formatting like we use the format!() macro for String formatting? I wasn't able to find anything online in a good hour or so of searching, but I might have missed something. A particular person I interacted with was adamant that "implementing Display is enough", but Display uses std::fmt::Formatter, which only handles UTF-8 string data. So all the format!-related macros in the Rust stdlib understandably produce String, because Rust is voting for a UTF-8 future, which I am all for.
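The following sketch illustrates the problem, with function names made up for the example. To push arbitrary bytes through format!, they first have to become a str, and the only infallible way to do that is a lossy conversion; the byte-level alternative is a series of manual writes to a Vec<u8>.

```rust
/// Going through `format!` forces a conversion to UTF-8 first. The lossy
/// conversion replaces every invalid sequence with U+FFFD, so the original
/// bytes cannot be recovered from the result.
fn format_lossy(name: &[u8]) -> Vec<u8> {
    format!("user: {}", String::from_utf8_lossy(name)).into_bytes()
}

/// The byte-preserving alternative: append each piece to a `Vec<u8>`.
fn format_by_hand(name: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(b"user: ".len() + name.len());
    out.extend_from_slice(b"user: ");
    out.extend_from_slice(name);
    out
}
```

With a Windows-1252 name such as b"Rapha\xebl", format_by_hand keeps the 0xEB byte intact, while format_lossy emits the three UTF-8 bytes of U+FFFD in its place.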

That, however, does not help me solve my issue. Even Python, which had bytestring formatting in Python 2, removed it in Python 3.0 and only reintroduced it in 3.5 after it was made clear that it is a very real need, albeit a somewhat niche one.

I'm planning on writing a macro soon, probably called format_bytes!, for that very purpose, and putting it in a crate.

EDIT 2020-06-09: I should have put a more thorough explanation here, so here goes:

I don't intend the format_bytes! macro to have all the bells and whistles of the original one, but to use it more as a mixed-encoding concatenation helper, so as not to have to write multiple writes to a Vec<u8> all the time, with maybe a few formatting tricks. format_bytes!(b"ascii text {} other ascii text", &vec_of_bytes) could very well end up being the syntax. This assumes an ASCII bytestring as the format string (as is the policy in Mercurial, see EncodingStrategy), and any slice of bytes as argument(s).
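To give an idea of the shape this could take, here is a deliberately naive sketch and not the final design: the format string is scanned at runtime rather than at compile time, the only placeholder is `{}`, there is no way to escape a literal `{}`, and a mismatch between placeholders and arguments panics. The format_bytes_impl helper is a name invented for this example.

```rust
/// Hypothetical runtime helper: copies `fmt` to the output, replacing each
/// `{}` with the next argument, byte for byte.
fn format_bytes_impl(fmt: &[u8], args: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::with_capacity(fmt.len());
    let mut args = args.iter();
    let mut i = 0;
    while i < fmt.len() {
        if fmt[i..].starts_with(b"{}") {
            let arg = args.next().expect("not enough arguments for format string");
            out.extend_from_slice(arg);
            i += 2;
        } else {
            out.push(fmt[i]);
            i += 1;
        }
    }
    assert!(args.next().is_none(), "too many arguments for format string");
    out
}

/// Sketch of the macro: every argument only needs to be viewable as bytes.
macro_rules! format_bytes {
    ($fmt:expr $(, $arg:expr)* $(,)?) => {
        format_bytes_impl($fmt, &[$(AsRef::<[u8]>::as_ref($arg)),*])
    };
}
```

With this sketch, format_bytes!(b"user: {} <{}>", &name, b"mail") returns a Vec<u8> in which the bytes of name are copied untouched, whatever their encoding.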

If anyone already has similar functionality somewhere, I'd be happy to not do this work, otherwise I'll keep you posted.