galimatias — The URL Parser For Java

Today I released version 0.1.0 of galimatias, the URL parser for Java. Now that I have left the 0.0.x world behind, it's a good time for a post.

Parsing a URL is supposed to be an easy task, something the standard library of most programming languages does out of the box. In practice, support is far from universal.

Over the years, web authors have produced all kinds of URLs using different features: funky characters, unescaped non-ASCII characters, partially broken constructs, internationalized domain names (AKA IDNA Hell), etcetera. StackOverflow is full of cases illustrating this.
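As a small taste of the internationalized domain name part, here is a sketch using the JDK's own java.net.IDN class (not galimatias). It converts the host 日本.jp, written with Unicode escapes so the source stays ASCII-only, to its Punycode form and back. The class name IdnDemo and its helper methods are made up for illustration. Note that java.net.IDN follows the older IDNA 2003 rules (RFC 3490), so it is not guaranteed to agree with browsers on every host.

```java
import java.net.IDN;

class IdnDemo {
    /** Converts a Unicode host name to its ASCII (Punycode) form. */
    static String toAsciiHost(String host) {
        return IDN.toASCII(host);
    }

    /** Converts an ASCII (Punycode) host name back to Unicode. */
    static String toUnicodeHost(String host) {
        return IDN.toUnicode(host);
    }

    public static void main(String[] args) {
        // The two escapes are the two CJK characters of the example host.
        String host = "\u65E5\u672C.jp";
        String ascii = toAsciiHost(host);
        System.out.println(ascii); // xn--wgv71a.jp
        // The round trip gives back the original Unicode host.
        System.out.println(toUnicodeHost(ascii).equals(host)); // true
    }
}
```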

Web browsers have kept adding special cases in order to cope with all of them. Now there is finally a proper URL Standard by the WHATWG, and it is gaining traction among web browser vendors. But non-browser libraries are falling behind.
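To make "falling behind" concrete, here is a minimal sketch that feeds a browser-tolerated URL to java.net.URI from the JDK. This is plain JDK code, not galimatias, and the class name StrictParserDemo and its helper are hypothetical. The input is the funky URL from the example further down, with the non-ASCII characters written as Unicode escapes.

```java
import java.net.URI;
import java.net.URISyntaxException;

class StrictParserDemo {
    /** Returns true if java.net.URI accepts the input, false if it rejects it. */
    static boolean acceptedByJavaNetUri(String input) {
        try {
            new URI(input);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Leading and trailing spaces, a single slash after the scheme,
        // a non-ASCII host and unescaped spaces in the path.
        String funky = " http:/\u65E5\u672C.jp:80//.././[ F\u00DCNKY ] ";

        System.out.println(acceptedByJavaNetUri("http://example.com/")); // true
        System.out.println(acceptedByJavaNetUri(funky));                 // false
    }
}
```

java.net.URI is a strict RFC-style parser, so rejecting this input is the documented behaviour rather than a bug. The point is only that strict parsing and "whatever a browser accepts" are different goals.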

galimatias was born to fill this gap in the JVM world. Some highlights:

  • If a web browser can parse a URL as found in the wild, galimatias can parse it too.
  • It provides efficient tools for further normalization of URLs beyond parsing, which are particularly useful for web crawlers.
  • It's used by the HTML5 validator (thanks to Michael Smith!).
  • It's the first implementation of the URL Standard in Java.
  • To the best of my knowledge, it's the second independent implementation of the URL Standard (the first being the JavaScript implementation by Anne van Kesteren).

And finally, an example:

// Let's get a funky URL
String urlString = " http:/日本.jp:80//.././[ FÜNKY ] ";

// Parse it
URL url = URL.parse(urlString);

System.out.println(url);
// OUTPUT: http://xn--wgv71a.jp/[%20F%C3%9CNKY%20]

System.out.println(url.toHumanString());
// OUTPUT: http://日本.jp/[ FÜNKY ]


// URLs can be modified with a fluent API
URL modifiedURL = url.withQuery(" let's do some query about 日本 ").withFragment(" and a fragment");

System.out.println(modifiedURL);
// OUTPUT: http://xn--wgv71a.jp/[%20F%C3%9CNKY%20]?let's%20do%20some%20query%20about%20%E6%97%A5%E6%9C%AC#and a fragment

System.out.println(modifiedURL.toHumanString());
// OUTPUT: http://日本.jp/[ FÜNKY ]?let's do some query about 日本#and a fragment


// And there are convenient canonicalizers to get URLs to a standard form

String differentUrlString = "http:/日本.jp/[%20%46%c3%9c%4e%4B%59%20]";
URL differentURL = URL.parse(differentUrlString);
System.out.println(differentURL);
// OUTPUT: http://xn--wgv71a.jp/[%20%46%C3%9C%4E%4B%59%20]

URLCanonicalizer canonicalizer = new DecodeUnreservedCanonicalizer();
URL canonicalizedURL = canonicalizer.canonicalize(differentURL);
System.out.println(canonicalizedURL.toString());
// OUTPUT: http://xn--wgv71a.jp/[%20F%C3%9CNKY%20]
System.out.println(canonicalizedURL.toHumanString());
// OUTPUT: http://日本.jp/[ FÜNKY ]
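
The idea behind decoding unreserved characters can be sketched in plain Java. This is not galimatias's implementation, only an illustration under the assumption that "unreserved" means the RFC 3986 set (ASCII letters, digits, '-', '.', '_' and '~'). The class DecodeUnreservedSketch and its methods are hypothetical names. Escapes for anything outside that set, such as %20 or the bytes of a non-ASCII character, are left exactly as they were.

```java
class DecodeUnreservedSketch {
    /** RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~". */
    static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    /** Value of an ASCII hex digit, or -1 if the character is not one. */
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Decodes %XX escapes only when they stand for an unreserved character. */
    static String decodeUnreserved(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char ch = s.charAt(i);
            if (ch == '%' && i + 2 < s.length()) {
                int hi = hexValue(s.charAt(i + 1));
                int lo = hexValue(s.charAt(i + 2));
                if (hi >= 0 && lo >= 0) {
                    int value = hi * 16 + lo;
                    if (isUnreserved(value)) {
                        out.append((char) value);   // e.g. %46 becomes F
                    } else {
                        out.append(s, i, i + 3);    // keep the escape untouched
                    }
                    i += 3;
                    continue;
                }
            }
            out.append(ch);                         // ordinary char or malformed escape
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeUnreserved("/[%20%46%c3%9c%4e%4B%59%20]"));
        // prints /[%20F%c3%9cNKY%20]
    }
}
```

Two URLs that differ only in such escapes refer to the same resource, which is why a crawler would want to collapse them into one form before comparing or storing them.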

That's it! Go get it on GitHub or Maven Central!
