Today I released version 0.1.0 of galimatias, the URL parser for Java. Now that the 0.0.x world is behind me, it's a good time for a post.
Parsing a URL is supposed to be an easy task, one that the standard library of most programming languages handles out of the box. In practice, robust URL parsing is not widely supported.
Over the years, web authors have produced all kinds of URLs using different features, funky characters, unescaped non-ASCII characters, partially broken constructs, internationalized domain names (AKA IDNA Hell), and so on. Stack Overflow is full of questions illustrating this.
Web browsers have accumulated more and more special cases in order to handle all of these URLs. Now there is finally a proper URL Standard by the WHATWG, which is gaining traction among browser vendors, but non-browser libraries are falling behind.
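To make the gap concrete, here is a minimal sketch using only the JDK's `java.net.URI`. The class name, the helper method, and the `example.com` URLs are made up for illustration. It simply checks whether the strict parser accepts an input at all.

```java
import java.net.URI;
import java.net.URISyntaxException;

class StrictParserDemo {
    /** Returns true if java.net.URI accepts the input, false if it rejects it. */
    static boolean acceptedByJavaNetUri(String input) {
        try {
            new URI(input);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A tidy URL is accepted.
        System.out.println(acceptedByJavaNetUri("http://example.com/path")); // prints "true"
        // Unescaped spaces and square brackets in the path are rejected outright.
        System.out.println(acceptedByJavaNetUri("http://example.com/[ FÜNKY ]")); // prints "false"
    }
}
```

A strict parser is not wrong to reject the second input, but it is of little help when the URL comes from a real web page and has to be fetched anyway.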
galimatias was born to fill this gap in the JVM world. Some highlights:
- If a web browser can parse a URL as found in the wild, galimatias can parse it too.
- It provides efficient tools for further normalization of URLs beyond parsing, which are particularly useful for web crawlers.
- It's used by the HTML5 validator (thanks to Michael Smith!).
- It's the first implementation of the URL Standard in Java.
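As an illustration of the kind of normalization a crawler cares about, here is a rough, self-contained sketch that decodes percent-encoded octets only when they stand for RFC 3986 "unreserved" characters, so that equivalent URLs compare equal. This is not galimatias code: the `UnreservedDecoder` class and its methods are hypothetical names, and the library's own canonicalizers may behave differently.

```java
class UnreservedDecoder {
    /** True for RFC 3986 "unreserved" characters: ALPHA / DIGIT / "-" / "." / "_" / "~". */
    static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    /** Decodes %XX sequences that encode unreserved characters; leaves everything else untouched. */
    static String decodeUnreserved(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char ch = s.charAt(i);
            if (ch == '%' && i + 2 < s.length()) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0 && isUnreserved(hi * 16 + lo)) {
                    // Safe to decode: the octet is a plain unreserved ASCII character.
                    out.append((char) (hi * 16 + lo));
                    i += 3;
                    continue;
                }
            }
            // Not a decodable escape: copy the character as-is.
            out.append(ch);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeUnreserved("/[%20%46%c3%9c%4e%4B%59%20]"));
        // prints: /[%20F%c3%9cNKY%20]
    }
}
```

Escapes for spaces, reserved characters, and non-ASCII bytes are deliberately left alone, since decoding them could change the meaning of the URL.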
And finally, an example:
    // Let's get a funky URL
    String urlString = " http:/日本.jp:80//.././[ FÜNKY ] ";

    // Parse it
    URL url = URL.parse(urlString);
    System.out.println(url);
    // OUTPUT: http://xn--wgv71a.jp/[%20F%C3%9CNKY%20]
    System.out.println(url.toHumanString());
    // OUTPUT: http://日本.jp/[ FÜNKY ]

    // URLs can be modified with a fluent API
    URL modifiedURL = url.withQuery(" let's do some query about 日本 ").withFragment(" and a fragment");
    System.out.println(modifiedURL);
    // OUTPUT: http://xn--wgv71a.jp/[%20F%C3%9CNKY%20]?let's%20do%20some%20query%20about%20%E6%97%A5%E6%9C%AC#and a fragment
    System.out.println(modifiedURL.toHumanString());
    // OUTPUT: http://日本.jp/[ FÜNKY ]?let's do some query about 日本#and a fragment

    // And there are convenient canonicalizers to get URLs to a standard form
    String differentUrlString = "http:/日本.jp/[%20%46%c3%9c%4e%4B%59%20]";
    URL differentURL = URL.parse(differentUrlString);
    System.out.println(differentURL);
    // OUTPUT: http://xn--wgv71a.jp/[%20%46%C3%9C%4E%4B%59%20]
    URLCanonicalizer canonicalizer = new DecodeUnreservedCanonicalizer();
    URL canonicalizedURL = canonicalizer.canonicalize(differentURL);
    System.out.println(canonicalizedURL.toString());
    // OUTPUT: http://xn--wgv71a.jp/[%20%46%C3%9C%4E%4B%59%20]
    System.out.println(canonicalizedURL.toHumanString());
    // OUTPUT: http://日本.jp/[ FÜNKY ]