Über jars are a type of reusable Java library that applications sometimes (knowingly or not) incorporate into their systems. Über jars are particularly challenging for software composition analysis (SCA) tools to understand because their structure and organization are complex. In this blog post I explain what über jars are and why they exist, and I provide a mini-benchmark to see how current SCA tools deal with this type of Java library.
- MergeBase is the only SCA tool I observed with comprehensive support for über jars.
- Sonatype also provides decent support but misses a few obvious cases. (Strangely, their legal analysis completely ignores their über jar results.)
- OWASP Dependency-Check and JFrog Xray are not bad, but their scanning is based solely on metadata files found inside über jars.
- Snyk, WhiteSource, and GitHub Dependabot currently have no ability to understand über jars at all.
I did not benchmark Black Duck (I don’t have access to that tool).
Über jars are the Java equivalent of taking everything in your fridge, throwing it all into your largest pot, and giving it a good stir. From an SCA (Software Composition Analysis) perspective they are a bit of a nightmare.
Recall: normally you point your SCA tool at a Java jar file (a reusable Java library) and your SCA tool responds by telling you the jar file’s name, version, and known vulnerabilities.
But what if that single Java jar file is actually an agglomeration of dozens of Java jar files? What if some maniac cracked open all your jar files and poured all their contents into a single mega jar? That’s exactly what an über jar is and your SCA tool is going to need to reverse-engineer the contents accurately before it can say anything.
Why would anyone create an über jar in the first place?
Java programs are awkward to invoke. You have to tell Java the exact locations of all the jar files your program is using. Über jars are a way around that problem.
For example, a typical Java program is started like this:
java -classpath lib1.jar:lib2.jar:main.jar name.of.MainEntry
With an über jar it’s less typing, since “lib1.jar” and “lib2.jar” have been blended directly into a single “main-uber.jar” file:
java -classpath main-uber.jar name.of.MainEntry
In this way über jars make Java programs easier to distribute and easier to start. That’s the main reason why they exist.
Recall that jar files are actually just zip files. You can rename them to “.zip” and then double-click on them if you ever want to see what’s inside them. Über jars are what you get if you unzip all of your jar files and combine all the contents into a single zip file instead.
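To make the "jars are just zip files" point concrete, here is a minimal Python sketch that builds three tiny in-memory jars and pours their contents into one. The class paths and byte contents are made up for illustration, and `make_jar`, `merge_jars`, and `list_entries` are hypothetical helpers rather than part of any real build tool. Real build tooling also has to deal with manifests and conflicting files, which this sketch ignores by simply keeping the first copy of each path.

```python
import io
import zipfile


def make_jar(entries):
    """Build an in-memory jar (a zip archive) from a {path: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as jar:
        for path, data in entries.items():
            jar.writestr(path, data)
    return buf.getvalue()


def merge_jars(jars):
    """Pour the contents of several jars into one; the first copy of a path wins."""
    buf = io.BytesIO()
    seen = set()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as uber:
        for jar_bytes in jars:
            with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
                for info in jar.infolist():
                    # Skip directory entries and paths already copied.
                    if info.is_dir() or info.filename in seen:
                        continue
                    seen.add(info.filename)
                    uber.writestr(info.filename, jar.read(info.filename))
    return buf.getvalue()


def list_entries(jar_bytes):
    """Return the sorted entry names of an in-memory jar."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        return sorted(jar.namelist())


# Hypothetical stand-ins for lib1.jar, lib2.jar and main.jar.
lib1 = make_jar({"com/example/lib1/A.class": b"lib1 bytes"})
lib2 = make_jar({"com/example/lib2/B.class": b"lib2 bytes"})
main = make_jar({"name/of/MainEntry.class": b"main bytes"})

uber = merge_jars([lib1, lib2, main])
print(list_entries(uber))
```

After the merge, nothing in the archive layout records which original jar each class came from. That is the information an SCA tool has to reconstruct.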
Über jars are a challenge for SCA
Most SCA tools are geared towards providing a single succinct answer for each library they scan. An über jar breaks that assumption, because a single jar file may contain many libraries, each with its own name, version, and vulnerabilities.
At MergeBase we analyze every jar file against our master database for this possibility. For example, consider “apacheds-all-1.5.5.jar”, a large über jar containing over 500,000 lines of code coming from dozens of libraries. When we compare this jar file against all known versions of “slf4j-api”, here are the results:
|Match Ratio||Known Library Version|
In the “slf4j-api” case there is also another hint inside the über jar. If I grep the jar’s contents for “slf4j-api” I see these two entries:
Opening the latter, I see this:
#Generated by Maven
#Fri Nov 21 14:48:07 CET 2008
version=1.5.6
groupId=org.slf4j
artifactId=slf4j-api
This gives me further confidence that my binary analysis is correct: version 1.5.6 aligns with my MergeBase result. Some SCA scanners only consider this metadata when examining über jars, but philosophically I don’t agree with that approach, since metadata is not always present, as in the bouncy-castle example below. Metadata is also vulnerable to transcription mistakes and tampering.
You might be curious why this metadata is even present in the first place.
My own theory: it was probably present in the original “slf4j-api” jar. Über jars don’t just combine the software files – they combine all the files! And so if a metadata file is present in the original “slf4j-api” file, it will be dutifully copied into the über jar. I can download the original and see for myself. Sure enough, running “unzip -l slf4j-api-1.5.6.jar” shows both of those metadata files were in the original.
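A metadata-only scan of the kind described above can be sketched in a few lines of Python. This is an illustration, not how any particular scanner is implemented. It assumes the conventional Maven layout of `META-INF/maven/<groupId>/<artifactId>/pom.properties`, and the demo jar, including its `org/bouncycastle/Example.class` entry, is a hypothetical stand-in for a real über jar.

```python
import io
import zipfile


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def embedded_maven_coordinates(jar_file):
    """Return sorted (groupId, artifactId, version) tuples from pom.properties files."""
    found = set()
    with zipfile.ZipFile(jar_file) as jar:
        for name in jar.namelist():
            if name.startswith("META-INF/maven/") and name.endswith("/pom.properties"):
                props = parse_properties(jar.read(name).decode("latin-1"))
                found.add((
                    props.get("groupId", ""),
                    props.get("artifactId", ""),
                    props.get("version", ""),
                ))
    return sorted(found)


# Hypothetical demo jar: one library with metadata, one class without any.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as jar:
    jar.writestr(
        "META-INF/maven/org.slf4j/slf4j-api/pom.properties",
        "#Generated by Maven\n#Fri Nov 21 14:48:07 CET 2008\n"
        "version=1.5.6\ngroupId=org.slf4j\nartifactId=slf4j-api\n",
    )
    jar.writestr("org/bouncycastle/Example.class", b"no metadata for this one")
demo.seek(0)

print(embedded_maven_coordinates(demo))
```

The sketch reports only the library that carried its own metadata. The second entry in the demo jar is invisible to it, which is the weakness of relying on metadata alone.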
Moving on to an example without metadata, here are the results when we compare our über jar against “bcprov-jdk15”:
|Match Ratio||Known Library Version|
There is no metadata available to warn consumers that the highly vulnerable version 1.40 of bcprov-jdk15 was copied into apacheds-all-1.5.5.jar. Unfortunately, that version of bcprov-jdk15 contains over 15 known vulnerabilities. Scanners that rely on metadata (such as JFrog Xray and OWASP Dependency-Check) will miss this. And of course scanners that lack über jar handling (such as WhiteSource and Snyk) will also miss this.
Using our high-confidence matches we then query our known-vulnerability database for any corresponding vulnerabilities. Our technique is based on binary analysis – no metadata is involved at all, since metadata can be inaccurate. Using this technique we are able to identify dozens of sub-components encapsulated by the apacheds-all-1.5.5 über jar. Here’s a partial listing based on MergeBase’s analysis:
- 100.0% – firstname.lastname@example.org
- 100.0% – email@example.com
- 100.0% – firstname.lastname@example.org
- 100.0% – email@example.com
- 100.0% – firstname.lastname@example.org
- 100.0% – email@example.com-M6
- 100.0% – firstname.lastname@example.org
- 100.0% – email@example.com
(Etc… 25 more sub-components identified!)
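One simple way to approximate a match ratio like the ones above is to hash every class file in a known library version and count how many of those hashes also appear in the über jar. The Python sketch below does that. It is not MergeBase's algorithm, which this post does not describe beyond "binary analysis". The library name `demo-lib`, its versions, and the class contents are all hypothetical, and exact byte hashing is an assumption that would fail if classes were recompiled or renamed during packaging.

```python
import hashlib
import io
import zipfile


def class_digests(jar_file):
    """Map each .class entry name to the SHA-256 of its bytes."""
    with zipfile.ZipFile(jar_file) as jar:
        return {
            name: hashlib.sha256(jar.read(name)).hexdigest()
            for name in jar.namelist()
            if name.endswith(".class")
        }


def match_ratio(uber_digests, library_digests):
    """Fraction of the library's class files found byte-for-byte in the über jar."""
    if not library_digests:
        return 0.0
    hits = sum(
        1
        for name, digest in library_digests.items()
        if uber_digests.get(name) == digest
    )
    return hits / len(library_digests)


def jar_of(entries):
    """Build an in-memory jar from a {path: bytes} mapping (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        for name, data in entries.items():
            jar.writestr(name, data)
    buf.seek(0)
    return buf


# Hypothetical library with two versions that differ in one class.
v1_0 = {"demo/A.class": b"A v1.0", "demo/B.class": b"B shared"}
v1_1 = {"demo/A.class": b"A v1.1", "demo/B.class": b"B shared"}

# Hypothetical über jar: version 1.1 plus an unrelated class.
uber = class_digests(jar_of({**v1_1, "other/C.class": b"C"}))

ratios = {
    "demo-lib@1.0": match_ratio(uber, class_digests(jar_of(v1_0))),
    "demo-lib@1.1": match_ratio(uber, class_digests(jar_of(v1_1))),
}
for name, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{ratio:.1%} - {name}")
```

Because neighbouring versions of a library often share many unchanged classes, several versions can score well. Only a ratio at or near 100% gives a high-confidence answer about which version was actually bundled.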
Quick Competitive Check
We were curious to see if competing SCA tools are able to handle über jars. What follows is a quick benchmark against a half-dozen popular SCA tools.
For each SCA tool (MergeBase, OWASP Dependency-Check, Snyk, WhiteSource, Sonatype, etc…):
- Git clone: https://github.com/mergebase/vuln-example-apacheds-all
- Run “mvn install”.
- Apply each SCA tool against the built “vuln-example-apacheds-all”.
- Observe and compare the scan results.
As of February 2021, the apacheds-all-1.5.5 über jar contains two vulnerable sub-components. One of these (firstname.lastname@example.org) can only be identified using binary approaches since it had no metadata in the first place, and one of these (email@example.com) can be identified either via binary approaches or via metadata scanning.
We group the benchmark results into 3 categories:
1. Scanners that do not support über jars at all.
Snyk and WhiteSource appear to have no idea that “firstname.lastname@example.org” is made by combining many jar files together.
Similarly, GitHub’s Dependabot also has no idea about this.
2. Scanners that support a metadata-based understanding of über jars.
OWASP Dependency-Check and JFrog Xray both detect the “email@example.com” metadata inside the über jar.
3. Scanners that support deep understanding of über jars.
Sonatype fails to identify any known-vulnerabilities with respect to firstname.lastname@example.org, and yet it does correctly identify that email@example.com contains firstname.lastname@example.org! This is a lopsided result: Sonatype clearly has a deep understanding here (otherwise it would be impossible to identify bcprov-jdk15), and yet somehow Sonatype is failing to spot the easy one. We also noted that Sonatype reported the license as Apache 2.0, when bcprov-jdk15 uses the MIT license.
MergeBase identifies all vulnerabilities correctly in this case. 🙂
Über jars are a special type of Java software component made by combining several jars into a single jar. Aside from MergeBase, most SCA scanners currently provide sub-par or even zero support for this component type.
Special thanks to Dr. Ken Warkentyne, our principal engineer, who built MergeBase’s über jar scanning capability.