Surveying open source licenses [LWN.net]

By Michael Kerrisk
April 17, 2013

In adjacent slots at the 2013 Free Software Legal and Licensing Workshop, Daniel German and Walter van Holst presented complementary talks on the topic of measuring the use of free software and open source (FOSS) licenses. Daniel's talk considered the challenges inherent in trying to work out which are the most widely used FOSS licenses, while Walter's talk described his attempts to measure license proliferation. Both talks served to illustrate just how hard it can be to produce measurements of FOSS licenses.

Toward a census of free and open source software licenses

Daniel German is a Professor in the Department of Computer Science at the University of Victoria in Canada. One of his areas of research is a topic that interests many other people also: which are the most widely used FOSS licenses? His talk considered the methodological challenges of answering that question. He also looked at how those challenges were addressed in studying license usage in a subset of the FOSS population, namely, Linux distributions.

Finding the license of a file or project can be difficult, Daniel said. This is especially so when trying to solve the problem for a large population of files or projects. "I'm one of the people who has probably seen the most different licenses in his lifetime, and it's quite a mess." The problem is that projects may indicate their license in a variety of ways: via comments in source code files, via README or COPYING files, via project metadata (e.g., on SourceForge or Launchpad), or possibly other means. Other groups then abstract that license data. For example, Red Hat and Debian both do this, although they do it in different ways that are quite labor intensive.

"I really want to stress the distinction between being empirical and being anecdotal". Here, Daniel pointed at the widely cited statistics on FOSS license usage provided by Black Duck Software. One of the questions that Daniel asks himself, is, can he replicate those results? In short, he cannot. From the published information, he can't determine the methodology or tools that were used. It isn't possible to determine the accuracy of the tools used to determine the licenses. Nor is it possible to determine the names of the licenses that Black Duck used to develop its data reports. (For example, one can look at the Black Duck license list and wonder whether "GPL 2.0" means GPLv2-only, GPLv2 or later, or possibly both?)

Daniel then turned to consider the challenges that are faced when trying to take a census of FOSS licenses. When one looks at the question of what licenses are most used, one of the first questions to answer is: what is "the universe of licenses" that are considered to be FOSS? For example, the Open Source Initiative has approved a list of around 70 licenses. But, there are very many more licenses in the world that could be broadly categorized as free, including rather obscure and little-used licenses such as a Beerware license. "I don't think anyone knows what the entire universe is." Thus, one must begin by defining the subset of licenses that one considers to be FOSS.

Following on from those questions is the question of what is "an individual" for the purpose of a census? Should different versions of the same project be counted individually? (What if the license changes between versions?) Are forks individual? What about "like" forks on GitHub? Do embedded copies of source code count as individuals? Here, Daniel was referring to the common phenomenon of copying source code files in order to simplify dependency management. And then, is an individual defined at the file level or at the package level? It's very important from a methodological point of view that we are told what is being counted, not just the numbers, Daniel said.

Having considered the preceding questions, one then has to choose a corpus on which to perform the census. Any corpus will necessarily be biased, Daniel said, because the fact that the corpus was gathered for some purpose implies some trade-offs.

Two corpuses that Daniel likes are the Red Hat and Debian distributions. One reason that he likes these distributions is that they provide a clearly defined data set. "I can say, I went to Debian 5.0, and I determined this fact." Another positive aspect of these corpuses is that they are proxies for "successful" projects. The fact that a project is in one of those distributions indicates that people find it useful. That contrasts with a random project on some hosting facility that may have no users at all.

While presence in a Linux distribution can be taken as a reasonable proxy of a successful project, a repository such as Maven Central is, by contrast, a "big dumpster" of Java code, but "it's Java code that is actually being used by someone". On the other hand, Daniel called SourceForge the "cemetery for open source". In his observation, there is a thin layer of life on SourceForge, but nobody cares about most of the code there.

Then there are domain-specific repositories such as CPAN, the Perl code archive. There is clearly an active community behind such repositories, but, for the purpose of taking a FOSS license census, one must realize that the contents of such repositories often have a strong bias in favor of particular licenses.

Having chosen a corpus, the question is then how to count the licenses in the corpus. Daniel considered the example of the Linux kernel, which has thousands of files. Those files are variously licensed GPLv2, GPLv2+, LGPLv2, LGPLv2.1, BSD 3 clause, BSD 2 clause, MIT/X11, and more. But the kernel as a whole is licensed GPLv2-only. Should one count the licenses on each file individually, or just the individual license of the project as a whole, Daniel asked. A related question comes up when one looks at the source code of the FreeBSD kernel. There, one finds that the license of some source files is GPLv2. By default, those files are not compiled to produce the kernel binary (if they were, the resulting kernel binary would need to be licensed GPL). So, do binaries play a role in a license census, Daniel asked.

When they started their work on studying FOSS licenses, Daniel and his colleagues used FOSSology, but they found that it was much too slow for studying massive amounts of source code. So they wrote their own license-identification tool, Ninka. "It's not user-friendly, but some people use it."

Daniel and his colleagues learned a lot writing Ninka. They found it was not trivial to identify licenses. The first step is to find the license statement, which may or may not be in the source file header. Then, it is necessary to separate comments from any actual license statement. Then, one has to identify the license; Ninka uses a sentence-based matching algorithm for that task.
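The steps above can be illustrated with a toy sketch. This is not Ninka's implementation: the comment-stripping regular expression, the `KNOWN_SENTENCES` table, and the function names are all hypothetical, and real license texts need far more normalization than is shown here. The sketch only conveys the general shape of sentence-based matching that answers "Unknown" unless a sentence matches exactly.

```python
import re

# Hypothetical lookup table: normalized license sentence -> license name.
KNOWN_SENTENCES = {
    "this program is free software; you can redistribute it and/or modify "
    "it under the terms of the gnu general public license as published by "
    "the free software foundation; either version 2 of the license, or "
    "(at your option) any later version": "GPLv2+",
    "this program is free software; you can redistribute it and/or modify "
    "it under the same terms as perl itself": "Same as Perl",
}


def extract_comment_text(source: str) -> str:
    """Keep only comment lines and strip common comment markers from them."""
    lines = []
    for line in source.splitlines():
        match = re.match(r"\s*(?:#|//|/\*+|\*+/?)\s?(.*)", line)
        if match:
            lines.append(match.group(1).replace("*/", ""))
    return " ".join(lines)


def split_sentences(text: str) -> list[str]:
    """Split on sentence-ending periods; normalize case and whitespace."""
    parts = re.split(r"\.(?:\s+|$)", text)
    return [" ".join(part.lower().split()) for part in parts if part.strip()]


def identify_license(source: str) -> str:
    """Name the matched license(s), or "Unknown" if no sentence matches."""
    sentences = split_sentences(extract_comment_text(source))
    found = {KNOWN_SENTENCES[s] for s in sentences if s in KNOWN_SENTENCES}
    return ", ".join(sorted(found)) if found else "Unknown"


sample = """\
#!/usr/bin/perl
# Copyright (C) 2010 Example Author.
# This program is free software; you can redistribute it and/or modify
# it under the same terms as Perl itself.
print "hello";
"""
print(identify_license(sample))  # Same as Perl
```

Because the match is exact, even a small rewording of a license sentence falls through to "Unknown" in this sketch, which is the conservative behavior the toy is meant to show.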

Daniel then talked about some results that he and his colleagues have obtained using Ninka, although he emphasized repeatedly that his numbers are very preliminary. In any case, one of the most interesting points that the results illustrate is the difficulty of getting accurate license numbers.

One set of census results was obtained by scanning the source code of Debian 6.0. The scan covered source code files for just four of the more popular programming languages that Daniel found particularly interesting: C, Java, Perl, and Python.

In one of the scans, Ninka counted the number of source files per license. Unsurprisingly, GPLv2+ was the most common license. But what was noteworthy, he said, is that somewhat more than 25% of the source code files have no license, although there might be a license file in the same directory that allows one to infer what the license is.

In addition, Ninka said "Unknown" for just over 15% of the files. This is because Ninka has been consciously designed to have a strong bias against mis-identifying licenses. If it has any doubt about the license, Ninka will return "Unknown" rather than trying to make a guess; the 15% number is an indication of just how hard it can be to identify the licenses in a file. Ninka does still occasionally make mistakes. The most common reason is that a source file has multiple licenses and Ninka does not identify them all; Daniel has seen a case where one source code file had 30 licenses.

The other set of results that Daniel presented for Debian 6.0 measured packages per license. In this case, if at least one of the source files in a package uses a license, then that use is counted as an individual for the census. Again, the GPLv2+ is the most common of the identified licenses, but comparing this result against the "source files per license" measure showed some interesting differences. Whereas the Eclipse Public License version 1 (EPLv1) easily reached the list of top twenty most popular source file licenses, it did not appear in the top twenty package licenses. The reason is that there are some packages (for example, Eclipse itself) that consist of thousands of files that use the EPLv1 license. However, the number of packages that make any use of the EPLv1 as a license is relatively small. Again, this illustrated Daniel's point about methodology when it comes to measuring FOSS license usage: what is being measured?
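The difference between the two measures can be seen in a small sketch. The scan data below is invented purely for illustration (the package and file names are hypothetical, and this is not Daniel's data or tooling); it only shows how one large package can dominate a per-file count while barely registering in a per-package count.

```python
from collections import Counter

# Hypothetical scan results: (package, file, license) triples.
SCAN = [
    ("big-ide", "a.java", "EPLv1"),
    ("big-ide", "b.java", "EPLv1"),
    ("big-ide", "c.java", "EPLv1"),
    ("big-ide", "d.java", "EPLv1"),
    ("big-ide", "e.java", "EPLv1"),
    ("tool-a", "main.c", "GPLv2+"),
    ("tool-b", "main.c", "GPLv2+"),
    ("tool-b", "util.c", "MIT/X11"),
    ("tool-c", "main.c", "GPLv2+"),
]


def files_per_license(scan):
    """Count every source file once under its license."""
    return Counter(lic for _pkg, _file, lic in scan)


def packages_per_license(scan):
    """Count a package once for each distinct license used by its files."""
    distinct = {(pkg, lic) for pkg, _file, lic in scan}
    return Counter(lic for _pkg, lic in distinct)


print(files_per_license(SCAN).most_common(1))     # [('EPLv1', 5)]
print(packages_per_license(SCAN).most_common(1))  # [('GPLv2+', 3)]
```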

Daniel then looked at a few other factors that illustrated how a FOSS license census can be biased. In one case, he looked at the changes in license usage in Debian between version 5.0 and 6.0. Some licenses showed increased usage that could be reasonably explained. The GPLv3 was one such license: as a new, well-publicized license, the reasons for its usage are easily understood. On the other hand, the EPLv1 license also showed significant growth. But, Daniel explained, that was at least in part because, for legal reasons, Java code that uses that license was for a long time under-represented in Debian.

Another cause of license bias became evident when Daniel turned to look at per-file license usage broken down across three languages: Java, Perl, and Python. Notably, around 50% of Perl and Python source files had no license; for Java, that number was around 12%. "Java programmers seem to be more proactive about specifying licenses." Different programming language communities also show biases towards particular licenses: for example, the EPLv1 and Apache v2 licenses are much more commonly used with Java than with Python or Perl; unsurprisingly, the "Same as Perl" license is used only with Perl.

In summary, Daniel said: "every time you see a census of licenses, take it with a grain of salt, and ask how it is done". Any license census will be biased, according to the languages, communities, and products that it targets. Identifying licenses is hard, and tools will make mistakes, he said. Even a tool such as Ninka that tries to very carefully identify licenses cannot do that job for 15% of source files. For a census, 15% is a huge amount of missing data, he said.

License proliferation: a naive quantitative analysis

Walter van Holst is a legal consultant at the Dutch IT consulting company mitopics. His talk presented what he describes as "an extremely naive quantitative analysis" of license proliferation.

The background to Walter's work is that in 2009 his company sponsored a Master's thesis on license proliferation that produced some contradictory results. The presumption going into the research was that license proliferation was a problem. But some field interviews conducted during the research found that the people in free software communities didn't seem to consider license proliferation to be much of a problem. Four years later, it seemed to Walter that it was time for a quantitative follow-up to the earlier research, with the goal of investigating the topic of license proliferation further.

In trying to do a historical analysis of license proliferation, one problem that Walter encountered is that there were few open repositories that could be used to obtain historical license data. Thus, trying to use one of the now popular FOSS project-hosting facilities would not allow historical analysis. Therefore, Walter instead chose to use data from a software index, namely Freecode (formerly Freshmeat, before an October 2011 name change). Freecode provides project licensing information that is available for download from FLOSSmole, which acts as a repository for dumps of metadata from other repositories. FLOSSmole commenced adding Freecode data in 2005, but Walter noted that the data from before 2009 was of very low quality. On the other hand, the data from 2009 onward seemed to be of high enough quality to be useful for some analysis.

How does one measure license proliferation? One could, Walter said, consider the distribution of license choices across projects, as Daniel German has done. Such an analysis may provide a sign of whether license proliferation is a problem or not, he said.

Another way of defining license proliferation is as a compatibility problem, Walter said. In other words, if there is proliferation of incompatible licenses, then projects can't combine code that technically could be combined. Such incompatibility is, in some sense, a loss in the value of that FOSS code. This raises a related question, Walter said: "is one-way license compatibility enough?" (For example, there is one-way compatibility between the BSD and GPL licenses, in the direction of the GPL: code under the two licenses can be combined, but the resulting work must be licensed under the GPL.) For his study, Walter presumed that one-way compatibility is sufficient for two projects to be considered compatible.

Going further, how can one assign a measure to compatibility, Walter asked. This is, ultimately, an economic question, he said. "But, I'm still not very good at economics", he quipped. So, he instead chose an "extremely naive" measure of compatibility, based on the following assumptions:

Treat all open source projects in the analysis as nodes in a network.

Consider all possible links between pairs of nodes (i.e., combinations of pairs of projects) in the network.

Treat each possible combination as equally valuable.

This is, of course, a rather crude approach that treats combinations between say the GNU C library (glibc) and some obscure project with few users as being equal in importance to (say) the combination of glibc and gcc. This approach also completely ignores language incompatibilities, which is questionable, since it seems unlikely that one would want to combine Lisp and Java code, for example.

Given a network of N nodes, the potential "value" of the network is the maximum number of possible combinations of two nodes. The number of those combinations is N*(N-1)/2. From a license-compatibility perspective, that potential value would be fully realized if each node was license-compatible with every other node. So, for example, Walter's 2009 data set consisted of 38,674 projects, and, following the aforementioned formula, the total possible interconnections would be approximately 747.8 million.
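As a quick check of that arithmetic, the formula can be evaluated directly. The function name `possible_pairs` is just an illustrative choice.

```python
def possible_pairs(n: int) -> int:
    """Number of distinct unordered pairs among n nodes: n*(n-1)/2."""
    return n * (n - 1) // 2


print(possible_pairs(38_674))  # 747819801, i.e. about 747.8 million
```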

Walter's measure of license incompatibility in a network is then based on asking two questions:

For each license in the network, how many combinations of two nodes in the network can produce a derived work under that license? For example, how many pairs of projects under GPL-compatible licenses can be combined in the network?

Considering the license that produces the largest number of possible connections for a derived work, how does the number of connections for that license measure up against the total number of possible combinations?
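The two questions above can be sketched on a toy network. Everything in this example is an assumption made for illustration: the ten projects, the placeholder license "Other", and the `ACCEPTS` map, which is a simplified stand-in for one-way compatibility and not a statement about real license compatibility beyond the BSD-into-GPL case mentioned earlier. It is not Walter's code.

```python
from math import comb

# Hypothetical data set: one license label per project.
PROJECT_LICENSES = ["GPL"] * 5 + ["BSD"] * 3 + ["Other"] * 2

# Hypothetical one-way compatibility: for each target license, the set of
# licenses whose code is assumed combinable into a work under that target.
ACCEPTS = {
    "GPL": {"GPL", "BSD"},
    "BSD": {"BSD"},
    "Other": {"Other", "BSD"},
}


def pairs_per_target(project_licenses, accepts):
    """For each target license, count pairs of projects combinable under it."""
    result = {}
    for target, sources in accepts.items():
        compatible = sum(1 for lic in project_licenses if lic in sources)
        result[target] = comb(compatible, 2)
    return result


counts = pairs_per_target(PROJECT_LICENSES, ACCEPTS)
best = max(counts, key=counts.get)
total = comb(len(PROJECT_LICENSES), 2)

print(counts)                     # {'GPL': 28, 'BSD': 3, 'Other': 10}
print(best, counts[best], total)  # GPL 28 45
```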

Perhaps unsurprisingly, the license that allows the largest number of derived work combinations is "any version of the GPL". By that measure, 38,171 projects in the data set were compatible, yielding 728.5 million interconnections.

Walter noted that the absolute numbers don't matter in and of themselves. What does matter is the (proportional) difference between the size of the "best" compatible network and the theoretically largest network. For 2009, that loss is the difference between the two numbers given above, which is 19.3 million. Compared to the total potential connections, that loss is not high (expressed as a proportion, it is 2.5%). Or to put things another way, Walter said, these figures suggest that in 2009, license proliferation appears not to have been too much of a problem.
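The 2009 figures can be reproduced from the two project counts reported in the talk. This is only a sketch of the arithmetic, with illustrative variable names, not Walter's own code.

```python
from math import comb

total = comb(38_674, 2)  # all possible pairs of projects in the 2009 data
best = comb(38_171, 2)   # pairs among projects compatible with "any version of the GPL"
loss = total - best

print(f"{total / 1e6:.1f} million")  # 747.8 million
print(f"{best / 1e6:.1f} million")   # 728.5 million
print(f"{loss / 1e6:.1f} million")   # 19.3 million
print(f"loss as a share of potential value: {loss / total:.4f}")
```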

Walter showed corresponding numbers for subsequent years, which are tabulated below. (The percentage values in the "Value loss" column are your editor's addition, to try and make it easier for the reader to get a feel for the "loss" value.)

Year   Potential value   Value loss       GPL market
       (millions)        (millions)       share

2009   747.8             19.3 (2.5%)      72%
2010   534.6             30.8 (5.7%)      63%
2011   565.9             56.4 (9.9%)      61%
2012   599.6             79.8 (13.3%)     59%
2013   621.6             60.3 (9.7%)      58%

The final column in the table shows the proportion of projects licensed under "any version of the GPL". In addition, Walter presented pie charts that showed the proportion of projects under various common licenses. Notable in those data sets was that, whereas in 2009 the proportion of projects licensed GPLv2-only and GPLv3 was respectively 3% and 2%, by 2013, those numbers had risen to 7% and 5%.

Looking at the data in the table, Walter noted that the "loss" value rises from 2010 onward, suggesting that incompatibility resulting from license proliferation is increasing.

Walter then drew some conclusions that he stressed should be treated very cautiously. In 2009, license proliferation appears not to have been much of a problem. But looking at the following years, he suggested that the increased "loss" value might be due to the rise in the number of projects licensed GPLv2-only or GPLv3-only. In other words, incompatibility rose because of a licensing "rift" in the GPL community. The "loss" value decreased in 2013, which he suggested may be due to an increase in the number of projects that have moved to Apache License version 2 (which has better license compatibility with the GPL family of licenses).

Concluding remarks

In questions at the end of the session, Daniel and Walter both readily acknowledged the limitations of their methodologies. For example, various people raised the point that the Freecode license information used by Walter tends to be out of date and inaccurate. In particular, the data does not seem to be too precise on which version of the GPL a project is licensed under; the license for many projects is just defined as "GPL", which provided Walter's "any version of the GPL" license measure above. Walter agreed that his source data is dirty, but pointed out that the real question is how to get better data.

As Walter also acknowledged, his measure of license incompatibility is "naive". However, his goal was not to present highly accurate numbers. Instead, he wants to get some clues about possible trends and suggest some ideas for future study. It is easy to see other ways in which his results might be improved. Comparing his presentation with Daniel's, one can immediately come up with ideas that could lead to improvements. For example, approaches that consider compatibility at the file level or bring programming languages into the equation might produce some interesting results.

Inasmuch as one can find faults in the methodologies used by Daniel and Walter, that is only possible because, unlike the widely cited Black Duck license census, they have actually published their methodologies. In revealing their methodologies and the challenges they faced, they have shown that any FOSS licensing survey that doesn't publish its methodology should be treated with considerable suspicion. Clearly, there is room for further interesting research in the areas of FOSS license usage, license proliferation, and license incompatibility.
