Software Composition Analysis (SCA) vs. Java Über Jars

Why are über jars a challenge for SCA tools?

Introduction

Über jars are a type of reusable Java library that applications sometimes (knowingly or not) incorporate. They are particularly challenging for software composition analysis (SCA) tools to understand because a single über jar blends the contents of many separate libraries into one file.

In this blog post, I explain what über jars are and why they exist, and I provide a mini-benchmark to see how current SCA tools deal with this type of Java library.

1. TLDR
2. Background
3. Why would anyone create an über jar in the first place?
4. Über jars are a challenge for SCA
5. Quick Competitive Check

TLDR

As of February 2021, both MergeBase and Sonatype use deep binary analysis to measure software composition within über jars.

  • MergeBase is the only SCA tool I observed with comprehensive support for über jars.
  • Sonatype also provides decent support but misses a few obvious cases. (Strangely, their legal analysis completely ignores their über jar results.)
  • OWASP Dependency-Check and JFrog Xray are not bad, but their scanning is based solely on metadata files found inside über jars.
  • Snyk, Mend (formerly WhiteSource) and GitHub Dependabot currently have no ability to understand über jars at all.

P.S. I did not benchmark Black Duck (I don’t have access to that tool).

Background

Über jars are the Java equivalent of taking everything in your fridge, throwing it all into your largest pot, and giving it a good stir. From an SCA (Software Composition Analysis) perspective, they are a bit of a nightmare.

Recall: normally, you point your SCA tool at a Java jar file (a reusable Java library), and your SCA tool responds by telling you the jar file’s name, version, and known vulnerabilities.

But what if that single Java jar file is actually an agglomeration of dozens of Java jar files? What if some maniac cracked open all your jar files and poured all their contents into a single mega jar? That’s exactly what an über jar is, and your SCA tool is going to need to reverse-engineer the contents accurately before it can say anything.

Why would anyone create an über jar in the first place?

Java programs are awkward to invoke. You have to tell Java the exact locations of all the jar files your program is using. Über jars are a way around that problem.

For example, a typical Java program is started like this:

java -classpath lib1.jar:lib2.jar:main.jar name.of.MainEntry

With an über jar it’s less typing, since “lib1.jar” and “lib2.jar” have been blended directly into a single “main-uber.jar” file:

java -classpath main-uber.jar name.of.MainEntry

In this way über jars make Java programs easier to distribute and easier to start. That’s the main reason why they exist.

Recall that jar files are actually just zip files. You can rename them to “.zip” and then double-click on them if you ever want to see what’s inside them. Über jars are what you get if you unzip all of your jar files and combine all the contents into a single zip file instead.
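To make the "unzip everything and re-zip it" idea concrete, here is a minimal sketch in Java (9 or later) using only the standard library's java.util.zip classes. The class name UberJarSketch and its merge method are hypothetical, made up for this illustration. Real build tools do considerably more (merging manifests, for instance), whereas this sketch simply keeps the first copy of any entry name that appears in more than one input jar.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical illustration: pour the contents of several jars into one. */
final class UberJarSketch {

    /**
     * Copies every entry of each input jar into a single output jar.
     * Returns the entry names that were written, in order.
     */
    static Set<String> merge(List<Path> inputJars, Path uberJar) throws IOException {
        Set<String> written = new LinkedHashSet<>();
        try (OutputStream out = Files.newOutputStream(uberJar);
             ZipOutputStream zipOut = new ZipOutputStream(out)) {
            for (Path jar : inputJars) {
                try (InputStream in = Files.newInputStream(jar);
                     ZipInputStream zipIn = new ZipInputStream(in)) {
                    ZipEntry entry;
                    while ((entry = zipIn.getNextEntry()) != null) {
                        // Assumption: the first copy of a name wins; duplicates are skipped.
                        if (!written.add(entry.getName())) {
                            continue;
                        }
                        zipOut.putNextEntry(new ZipEntry(entry.getName()));
                        if (!entry.isDirectory()) {
                            // Reads up to the end of the current entry only.
                            zipIn.transferTo(zipOut);
                        }
                        zipOut.closeEntry();
                    }
                }
            }
        }
        return written;
    }
}
```

Note that nothing in the output records which input jar an entry came from. That lost provenance is exactly what an SCA tool has to reconstruct.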

Über jars are a challenge for SCA

Most SCA tools are geared towards providing a single succinct answer for each library they scan.

Identified Library:
Apache Commons-Collections 3.2.1

Vulnerabilities:
CVE-2017-15708, CVE-2015-7501, and CVE-2015-6420

With über jars the answer is more complicated. “Well, actually… this library is a combination of many libraries.”

At MergeBase we analyze every jar file against our master database for this possibility. For example, consider “apacheds-all-1.5.5.jar”, a large über jar containing over 500,000 lines of code coming from dozens of libraries. When we compare this jar file against all known versions of “slf4j-api” here are the results:

Match Ratio   Known Library Version
81.0%         slf4j-api@1.5.11
90.5%         slf4j-api@1.5.8
100.0%        slf4j-api@1.5.6
90.5%         slf4j-api@1.5.5
90.5%         slf4j-api@1.5.4

These results show that version 1.5.6 of slf4j-api is contained inside the apacheds-all-1.5.5 über jar file.
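This post does not describe how MergeBase actually computes its match ratios, so the following is only a rough sketch of one way such a ratio could be calculated: hash every .class file in a known library version, then count what fraction of those hashes also appear in the über jar. The class MatchRatioSketch, its method names, and the choice of SHA-256 over raw class bytes are all assumptions made for illustration (Java 11 or later).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical illustration, not MergeBase's actual algorithm. */
final class MatchRatioSketch {

    /** Returns the SHA-256 hashes of all .class entries in the given jar. */
    static Set<String> classHashes(Path jar) throws IOException {
        Set<String> hashes = new HashSet<>();
        try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(jar))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().endsWith(".class")) {
                    // readAllBytes() stops at the end of the current entry.
                    hashes.add(sha256(zip.readAllBytes()));
                }
            }
        }
        return hashes;
    }

    /** Percentage of the known library's class hashes that are present in the über jar. */
    static double matchRatio(Set<String> knownLibrary, Set<String> uberJar) {
        if (knownLibrary.isEmpty()) {
            return 0.0;
        }
        long found = knownLibrary.stream().filter(uberJar::contains).count();
        return 100.0 * found / knownLibrary.size();
    }

    private static String sha256(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }

    // Usage: java MatchRatioSketch.java known-library.jar uber.jar
    public static void main(String[] args) throws IOException {
        double ratio = matchRatio(classHashes(Path.of(args[0])), classHashes(Path.of(args[1])));
        System.out.printf("%.1f%%%n", ratio);
    }
}
```

Running this once per known version of a library would produce a table like the one above. A sketch this simple breaks down if the build that produced the über jar rewrote the class files (for example by relocating packages), since the byte-level hashes would no longer match.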

In the “slf4j-api” case there is also another hint inside the über jar. If I grep the jar’s contents for “slf4j-api” I see these two entries:

META-INF/maven/org.slf4j/slf4j-api/pom.xml
META-INF/maven/org.slf4j/slf4j-api/pom.properties

Opening the latter, I see this:

#Generated by Maven
#Fri Nov 21 14:48:07 CET 2008
version=1.5.6
groupId=org.slf4j
artifactId=slf4j-api

This gives me further confidence that my binary analysis is correct: version 1.5.6 aligns with my MergeBase result. Some SCA scanners only consider this metadata when examining über jars, but philosophically I don’t agree with that approach, since metadata is not always present, as in the bouncy-castle example below. Metadata is also vulnerable to transcription mistakes and tampering.
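For comparison, the metadata-only approach is easy to sketch. The following Java example walks a jar's entries, picks out every META-INF/maven/…/pom.properties file, and reads the groupId, artifactId, and version from each. The class name PomPropertiesScan and its method are hypothetical. This is not how any particular scanner is implemented, just an illustration of the general idea.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Properties;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

/** Hypothetical illustration of metadata-based component detection. */
final class PomPropertiesScan {

    /** Returns "groupId/artifactId@version" for each pom.properties found in the jar. */
    static List<String> embeddedCoordinates(Path jar) throws IOException {
        List<String> found = new ArrayList<>();
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            Enumeration<JarEntry> entries = jarFile.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                String name = entry.getName();
                if (name.startsWith("META-INF/maven/") && name.endsWith("/pom.properties")) {
                    Properties props = new Properties();
                    try (InputStream in = jarFile.getInputStream(entry)) {
                        props.load(in); // lines starting with '#' are comments
                    }
                    found.add(props.getProperty("groupId") + "/"
                            + props.getProperty("artifactId") + "@"
                            + props.getProperty("version"));
                }
            }
        }
        return found;
    }

    // Usage: java PomPropertiesScan.java some-uber.jar
    public static void main(String[] args) throws IOException {
        embeddedCoordinates(Path.of(args[0])).forEach(System.out::println);
    }
}
```

The weakness is visible in the code itself: if a pom.properties file is absent or has been edited, this scan has nothing else to fall back on.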

You might be curious why this metadata is even present in the first place.

My own theory: it was probably present in the original “slf4j-api” jar. Über jars don’t just combine the software files – they combine all the files! And so if a metadata file is present in the original “slf4j-api” file, it will be dutifully copied into the über jar. I can download the original and see for myself. Sure enough, running “unzip -l slf4j-api-1.5.6.jar” shows both of those metadata files were in the original.

Moving on to an example without metadata, here are the results when we compare our über jar against “bcprov-jdk15”:

Match Ratio   Known Library Version
84.7%         bcprov-jdk15@1.44
91.3%         bcprov-jdk15@1.43
100.0%        bcprov-jdk15@1.40
82.0%         bcprov-jdk15@1.38
48.6%         bcprov-jdk15@1.32

There is no metadata available to warn consumers that the highly vulnerable version 1.40 of bcprov-jdk15 was copied into apacheds-all-1.5.5.jar. Unfortunately, bcprov-jdk15@1.40 contains more than 15 known vulnerabilities. Scanners that rely on metadata (such as JFrog Xray and OWASP Dependency-Check) will miss this. And of course scanners that lack über jar handling (such as WhiteSource and Snyk) will also miss this.

Using our high-confidence matches we then query our known-vulnerability database for any corresponding vulnerabilities. Our technique is based on binary analysis – no metadata is involved at all, since metadata can be inaccurate. Using this technique we are able to identify dozens of sub-components encapsulated by the apacheds-all-1.5.5 über jar. Here’s a partial listing based on MergeBase’s analysis:

  1. 100.0% – antlr/antlr@2.7.7
  2. 100.0% – commons-io/commons-io@1.4
  3. 100.0% – commons-lang/commons-lang@2.4
  4. 100.0% – org.apache.directory.server/apacheds-core-jndi@1.5.5
  5. 100.0% – org.apache.directory.shared/shared-ldap@0.9.15
  6. 100.0% – org.apache.mina/mina-core@2.0.0-M6
  7. 100.0% – org.bouncycastle/bcprov-jdk15@1.40
  8. 100.0% – org.slf4j/slf4j-api@1.5.6

(Etc… 25 more sub-components identified!)
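The "match first, then query the vulnerability database" step described above can be sketched in a few lines of Java. Everything here is hypothetical: the class HighConfidenceLookup, the idea of a numeric threshold, and the in-memory map standing in for a real vulnerability database. It is meant only to show the shape of the logic, not MergeBase's implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration of picking a match and looking up its vulnerabilities. */
final class HighConfidenceLookup {

    /** Returns the version with the highest match ratio, provided it reaches the threshold. */
    static Optional<String> bestVersion(Map<String, Double> ratioByVersion, double threshold) {
        return ratioByVersion.entrySet().stream()
                .filter(e -> e.getValue() >= threshold)
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    /** Looks up known vulnerabilities for a coordinate; the map is a stand-in database. */
    static List<String> knownVulnerabilities(Map<String, List<String>> database, String coordinate) {
        return database.getOrDefault(coordinate, List.of());
    }
}
```

Fed the bcprov-jdk15 ratios from the table earlier in this post with a threshold of 100.0, bestVersion selects 1.40. The resulting coordinate would then be passed to knownVulnerabilities, which returns an empty list for components with no recorded issues.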

Quick Competitive Check

We were curious to see if competing SCA tools are able to handle über jars. What follows is a quick benchmark against a half-dozen popular SCA tools.

Methodology

For each SCA tool (MergeBase, OWASP Dependency-Check, Snyk, WhiteSource, Sonatype, etc…):

  1. Git clone: Repository
  2. Run “mvn install”.
  3. Apply each SCA tool against the built “vuln-example-apacheds-all”.
  4. Observe and compare the scan results.

Mini-Benchmark Results

As of February 2021, the apacheds-all-1.5.5 über jar contains two vulnerable sub-components. One of them (bcprov-jdk15@1.40) can only be identified using binary approaches, since it had no metadata in the first place. The other (commons-collections@3.2.1) can be identified either via binary approaches or via metadata scanning.

We group the benchmark results into 3 categories:

1. Scanners that do not support über jars at all.

Snyk and WhiteSource appear to have no idea that “apacheds-all@1.5.5” is made by combining many jar files together. Similarly, GitHub’s Dependabot also has no idea about this.

2. Scanners that support a metadata-based understanding of über jars.

OWASP Dependency-Check and JFrog Xray both detect the “commons-collections@3.2.1” metadata inside the über jar.

3. Scanners that support deep understanding of über jars.

Sonatype fails to identify any known vulnerabilities with respect to commons-collections@3.2.1, and yet it does correctly identify that apacheds-all@1.5.5 contains bcprov-jdk15@1.40! This is a lopsided result: Sonatype clearly has a deep understanding here (otherwise it would be impossible to identify bcprov-jdk15), and yet somehow Sonatype is failing to spot the easy one. We also noted that Sonatype reported the license as Apache 2.0, even though bcprov-jdk15 uses the MIT license.

MergeBase identifies all vulnerabilities correctly in this case. 🙂

Conclusion

Über jars are a special type of Java software component made by combining several jars into a single jar. Aside from MergeBase, most SCA scanners currently provide sub-par or even zero support for this component type.

Last piece of advice: Have Über jars? Give MergeBase a closer look!



Special Thanks
Special thanks to Dr. Ken Warkentyne, our principal engineer, who built MergeBase’s über jar scanning capability.

Julius Musseau

About the Author

Julius Musseau

Co-founder & Advisor. Senior architect and developer with strong academic background and roots in the open source community. Contributor to a number of important open source projects.