Migrating the i18n approach

09.04.2018 | Arne Babenhauserheide

“In a clean code-base you can change the internationalization method, even if it’s all over your code.”

Internationalization (i18n) refers to approaches which provide appropriate messages for people from different cultures. In most cases that means providing different strings for different language codes (e.g. de_DE, en_GB, or en_US), some of them with parameters.

In our project we’ve used the native Eclipse internationalization: Messages classes with static variables that get filled from properties files by reflection via a static initializer. From now on, I’ll call this the NLS method. Eclipse provides great support for the NLS method by creating the required static variables and properties entries via the Externalize Strings Wizard. IntelliJ provides much less convenience for internationalization with the NLS method, and since many in our team use IntelliJ, this started to hurt. We therefore decided to change the project to the ResourceBundle approach. All conversion code discussed here is available on GitHub and licensed freely under Apache 2.0. A big thanks to Disy!

Let’s get down to the reality of bits on disk. This is what we have done with the NLS method:

// .../MyFile.java
class MyFile {
  public void something () {
      String message = Messages.MyFile_some_identifier;
  }
}

// .../Messages.java
package <some.package>;

import org.eclipse.osgi.util.NLS;

public class Messages extends NLS {
  private static final String BUNDLE_NAME = "net.disy.cadenza.accessmanager.messages"; //$NON-NLS-1$
  public static String MyFile_some_identifier;
  // ... many more of these, duplicating information from the property files ...

  static {
    NLS.initializeMessages(BUNDLE_NAME, Messages.class);
  }

  private Messages() {
  }
}

// .../messages.properties
MyFile_some_identifier=some message

Below you find the ResourceBundle approach we eventually came up with (without the iterations it took to get there):

// .../MyFile.java
class MyFile {
  public void something () {
      String message = Messages.getString("MyFile_some_identifier");
  }
}

// .../Messages.java
package <some.package>;

import net.disy.commons.core.locale.IMessageResolver;
import net.disy.commons.core.locale.ResourceBundleMessageResolver;

public class Messages { // no longer extends NLS
  private static final String BUNDLE_NAME = "net.disy.cadenza.accessmanager.messages"; //$NON-NLS-1$
  // no static variables, no static initializer, no reflection

  private static final IMessageResolver MSG = new ResourceBundleMessageResolver(BUNDLE_NAME);

  private Messages() {
  }
  public static String getString(String key) {
    return MSG.getString(key);
  }
}

// .../messages.properties
MyFile_some_identifier=some message

I started the project by investigating the whole code-base for a day, checking all imports of NLS and all usages of Messages classes. I was happy to see that NLS was used consistently, with few cross-project dependencies and only very rare occurrences of using multiple Messages classes in the same file. Later I learned that a colleague had already worked on cleaning up our internationalization from time to time in the past year (Shout out: You’re great! You know who you are).

Investigating may sound like complicated parsing, but the actual work boiled down to installing The Silver Searcher (ag) and ripgrep (or one of the other code search tools), and then grepping through. I used The Silver Searcher, but if I had to do it again, I’d use ripgrep because it is somewhat faster, and I did not need the more complex regular expressions that ripgrep does not support. Here’s an example of searching for all classes which extend NLS:

ag 'class.*extends.*NLS' -G '.*java$' . -l

Then I checked the support in Eclipse for ResourceBundles, since it was necessary to ensure that Eclipse users would continue to have good IDE support. I found the JInto plugin, which provides good support for message completion and only requires a one-time setup step.

Fun intersection: The creators of JInto work in the town Horb am Neckar, less than two hours by train from Disy. It’s a small world after all.

Now I had an idea of the requirements, and it was clear that the project could proceed well, so it was time to implement the migration. I looked into using IntelliJ scripting or structured replacement, but decided to go with a plain Python tool, because I already knew it by heart, and the code-base is clean enough that I could just go with string replacement (mind the shout out: you need pretty clean code for this).

So the first step was finding all Messages classes from Python — a simple check for all classes with extends NLS. Let’s strip away the last layer of illusion. I just did this:

import subprocess
import shlex

def get_paths_to_all_NLS_classes(module_path):
    # list all Java files below module_path which declare a class extending NLS
    return str(subprocess.check_output(
        shlex.split("ag 'class.*extends.*NLS' -G '.*java$' {} -l".format(
            module_path))),
        encoding="utf-8").split()

Yep, that’s right: I did not implement anything new, but simply called out to The Silver Searcher (ag), which I had also used to investigate the code-base. Using shlex.split ensured that I would not have to worry about any shell replacements, but could still write the command almost like I would on the shell. Praised be laziness (where it helps to avoid mistakes from translating between different domains).
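To make the shlex.split part concrete, here is a small sketch of what it does with that command line. The path src/module is a made-up placeholder. The quoted patterns arrive as single arguments with their quotes removed, and since no shell is involved, nothing tries to expand .* or $.

```python
import shlex

# Hypothetical module path, only for illustration.
command = "ag 'class.*extends.*NLS' -G '.*java$' {} -l".format("src/module")

# shlex.split tokenizes like a POSIX shell would, but never starts a shell,
# so no globbing or variable substitution happens.
args = shlex.split(command)
print(args)
# ['ag', 'class.*extends.*NLS', '-G', '.*java$', 'src/module', '-l']
```

One caveat with this shortcut: a module path containing spaces would be split into several arguments. Wrapping the path in shlex.quote() before formatting it into the command would guard against that.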

Now I’d love to say that I knew I would need the packages of all Messages variables, and therefore I directly extracted them. But I did not. I had missed inter-package dependencies in some classes. I hoped I would not need them and implemented a simple replacement: Just get all public static variables in the Messages class and then grep for them. That would be completely disk-bound even with the SSDs we have at work, so no need for worrying about execution speed of my code – except for minimizing disk access.

But complexity strikes back! When I came upon the first import of a Messages class from a different package and then started searching, a glimpse of horror hit me: There were so many imports that I’d never be able to fix all of them manually, so I’d end up with incorrect replacements. Even if 95% of your i18n usage is wonderfully clean, working on a >1 million line project means that 5% unclean usages adds up to around 400 manual corrections. If I could fix one message in just 5 seconds, I’d still be spending more than a week fixing dependencies. That didn’t go according to plan.

(Image: dead end)

That’s when I realized how easy it is to parse Java. And why Java IDEs can so easily do safe refactorings which would be a horror-trip of guesswork in other languages or would require a complete syntax parser. Every single Java file contains the absolute path to its package. And with package and class name I can (almost) always find the exact import statement which matches a given class.

In Java, simple string replacements can do things which need a complete language parser in many other languages, and writing such string replacements is much more fun than writing even a simple parser for a language like Scheme. Take this from someone who tackled both challenges: parsing string literals is no fun, but when your language has multi-line strings you cannot avoid it, otherwise you end up with hard-to-trace errors.

So let’s rejoice in the simplicity of Java while I state that adding package information to every single message variable took less than an hour. I now understand why Java IDEs have such powerful refactoring capabilities.

Wow, three paragraphs of explanation, let’s spice that up with code. This is what I’ve done:

def build_replacement_patterns(filesandlines, filesandpackages):
    """ Create a list of tuples (FROM, TO, class, variable, package) which provide the information for replacing.
    >>> fal = {"foo/Bah.Java": ["   public static String FOO_thing;\\n"]}
    >>> fap = {"foo/Bah.Java": "foo"}
    >>> build_replacement_patterns(fal, fap)
    [('Bah.FOO_thing', 'Bah.getString("FOO_thing")', 'Bah', 'FOO_thing', 'foo')]
    """
    ...

If you don’t speak Python: that’s the doctest at the top of the function build_replacement_patterns. It basically states that I’ve used two prepared dictionaries: files with their lines of code, and the same files with their respective package. From these I created a list of simple replacement patterns: the string checking the simple case (FROM: class.variable), the replacement string (TO: class.getString("variable")), the class, the raw variable and the package.
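The function body is not shown above, so the following is only a minimal sketch of how such a function could look, not the code from the actual conversion tool. It assumes that the class name equals the file name without its extension and that every message variable is declared on its own line as public static String NAME; the regular expression MESSAGE_VARIABLE is my own illustration.

```python
import os.path
import re

# Assumption: one declaration per line, e.g. "  public static String FOO_thing;"
MESSAGE_VARIABLE = re.compile(r"^\s*public\s+static\s+String\s+(\w+)\s*;")

def build_replacement_patterns(filesandlines, filesandpackages):
    """Create tuples (FROM, TO, class, variable, package) for each message variable."""
    patterns = []
    for path, lines in filesandlines.items():
        # Assumption: the class is named like its file, "foo/Bah.Java" -> "Bah"
        classname = os.path.splitext(os.path.basename(path))[0]
        package = filesandpackages[path]
        for line in lines:
            match = MESSAGE_VARIABLE.match(line)
            if match:
                variable = match.group(1)
                patterns.append((
                    "{}.{}".format(classname, variable),
                    '{}.getString("{}")'.format(classname, variable),
                    classname, variable, package))
    return patterns

fal = {"foo/Bah.Java": ["   public static String FOO_thing;\n"]}
fap = {"foo/Bah.Java": "foo"}
print(build_replacement_patterns(fal, fap))
# [('Bah.FOO_thing', 'Bah.getString("FOO_thing")', 'Bah', 'FOO_thing', 'foo')]
```

Lines that do not declare a message variable, such as the private BUNDLE_NAME constant, simply do not match and are skipped.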

Initially I had just grepped for all public static String variables in the Messages classes. Now I also kept their packages to safely replace the variables in any file.
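How the package makes a replacement safe can be sketched roughly like this. Again, this is an illustration under assumptions and not the code of the conversion tool: the helper names package_of, pattern_applies and apply_pattern are hypothetical, and the sketch only handles the same-package case and explicit single-class imports (no wildcard imports, no fully qualified usage).

```python
import re

PACKAGE_DECLARATION = re.compile(r"^\s*package\s+([\w.]+)\s*;", re.MULTILINE)

def package_of(java_source):
    """Return the declared package, or "" for the default package."""
    match = PACKAGE_DECLARATION.search(java_source)
    return match.group(1) if match else ""

def pattern_applies(java_source, classname, package):
    """True if classname in this source refers to the class from package."""
    if package_of(java_source) == package:
        return True  # same package: the class is visible without an import
    return "import {}.{};".format(package, classname) in java_source

def apply_pattern(java_source, pattern):
    """Replace Class.variable by Class.getString("variable") where it is safe."""
    old, new, classname, variable, package = pattern
    if not pattern_applies(java_source, classname, package):
        return java_source
    # \b keeps Bah.FOO_thing from matching inside Bah.FOO_thing_long or MyBah.FOO_thing
    return re.sub(r"\b{}\b".format(re.escape(old)), lambda m: new, java_source)

source = """package bar;

import foo.Bah;

class User {
  String text() {
    return Bah.FOO_thing + Bah.FOO_thing_long;
  }
}
"""
pattern = ('Bah.FOO_thing', 'Bah.getString("FOO_thing")', 'Bah', 'FOO_thing', 'foo')
print(apply_pattern(source, pattern))
```

In the example only the first usage gets replaced, and a file that neither lives in package foo nor imports foo.Bah is left untouched, because its Bah must be a different class.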

But this came at a price: instead of checking every single Java file against the few tens of variables in its module, I had to check each Java file against the several thousand messages in the whole project. And this was no longer disk-bound: it took several hours to run. Talk about adding two orders of magnitude in cost - whatever you do, something always¹ ends up being performance critical (¹: this isn’t actually true, but much closer to the reality I see now than I would have expected a decade ago).

As I didn’t want to spend time re-writing all this code in a faster language for dubious gain, I moved to multiprocessing instead. Python is actually so fast at string manipulation that, with efficient code, the gain to be expected from going to a faster language would only be about a factor of two. To parallelize the code in a handful of lines, I started out with the single process approach and replaced it with multiprocessing by simply swapping out one function:

import concurrent.futures

def process_single_process(sublists, patterns):
    # process one sublist of files after the other in this process
    changedusage = []
    for sub in sublists:
        changedusage.extend(replace_patterns_in_filelist(sub, patterns))
    return changedusage

def process_multiprocessing(sublists, patterns, usecpus):
    # same job, but each sublist is handed to a worker process
    with concurrent.futures.ProcessPoolExecutor(max_workers=usecpus) as e:
        futures = []
        for sub in sublists:
            futures.append(e.submit(replace_patterns_in_filelist, sub, patterns))
        changedusage = []
        for fut in futures:
            # give up on a sublist that takes longer than 15 minutes
            changedusage.extend(fut.result(timeout=900))
    return changedusage

This resulted in a healthy factor 12 speedup, fast enough that I could actually run these tests on the entire repository until everything worked smoothly.
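Both functions expect the file list to be cut into sublists already. How that was done is not shown here, so the following is just one simple way to do it: a hypothetical helper split_into_sublists which deals the files out round-robin.

```python
def split_into_sublists(items, count):
    """Distribute items round-robin over count sublists and drop empty ones."""
    sublists = [items[i::count] for i in range(count)]
    return [sub for sub in sublists if sub]

files = ["A.java", "B.java", "C.java", "D.java", "E.java"]
print(split_into_sublists(files, 2))
# [['A.java', 'C.java', 'E.java'], ['B.java', 'D.java']]
```

Since every sublist becomes one task for the process pool, creating a few more sublists than worker processes is a plausible choice: a worker that finishes early can then pick up another sublist instead of idling.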

Once I had everything in place that I could automate without prohibitive effort, I ran the full build and test suite while adding some pre-patches (to run before the conversion) and post-patches (to run afterwards):

# prepare the source to be easier to convert
cd $PROJ && for i in $I18NtoRB/patches/*; do patch -p0 < "$i" ; done

# convert i18n for a million lines of code from NLS to ResourceBundles
cd $I18NtoRB && ./convert_project.py $PROJ

# cleanup what could not be automated easily
cd $PROJ && for i in $I18NtoRB/post-patches/*; do patch -p0 < "$i" ; done

To create the patches, I simply made the required changes and then stored them with svn diff --patch-compatible $FILE > $PATCH. That allowed me to quickly undo and redo changes without messing with recorded history. With a decentralized version tracking system, I’d have used a local branch, but since we’re still on Subversion, I did not want automated scripts to mess with history.

Almost done! After checking and re-checking everything, my team lead and I made a 30k-line commit:

rXXXXXX | babenhauserheide | 2018-02-07 18:15:45 +0100 (Wed, 07 Feb 2018) | 2 lines

Small change to Internationalization: from NLS to ResourceBundle

Happy Ending! Everything ran smoothly! No bugs! No bugs!?

… well, almost no bugs. A week later our technical writers told us that a specialized launcher was broken: it had used reflection to display the message keys instead of the translations, and that no longer worked. It took me about two days to get it fixed, mostly because I researched whether the message resolution method could be replaced at runtime with reflection or with a class loader, going far down the rabbit hole. We finally decided to pull the plug. To keep it simple, we added a switch to the resolver, which turned out to be an improvement compared to what we had used before. I also created a tarball with the setup for JInto, so people would not have to do it manually for every single project (shout out to the Eclipse JInto devs: Thank you for using a simple additional config file we could just drop in!).

                       '
   *''*             .'.:.'.
  *_\/_*  bugfree!  -=:o:=-
  * /\ *            '.':'.'
   */ *                '
    |       '.\|/.'    |
    |       (\   /)    |
    |       - -O- -    |
    |       (/   \)    |
    |       ,'/|\'.    "
    "          |

Lesson learned: No code that has grown for more than a decade can be reshaped without complications. However, by taking the ResourceBundle internationalization approach, the IntelliJ users are happy, the Eclipse users are - at least - not worse off, and the world is a slightly happier place.

I hope you’ve enjoyed my summary and its happy ending. Feel free to grab the code - let me know what you do with it and if it’s helpful. Fork, merge and enjoy!

…and while I’m at it: Since you’ve followed this blog post ‘til the end, you might be interested in working with us. We’re looking for people who enjoy a code dive, improving old code without breaking it. I love it here and I hope you will, too! So head over to our open positions page and find out if there’s a job for you!

The title image was published on 2017-08-22 by Delano Balten, the dead end image was published on 2018-02-19 by Linus Sandvide, both under the Unsplash License.

Disy Tech-Blog