By Michael Kerrisk
April 17, 2013
In adjacent slots at the 2013 Free Software Legal and Licensing
Workshop, Daniel German and Walter van Holst presented complementary
talks that related to the topic of measuring the use of free software and
open source (FOSS) licenses. Daniel's talk considered the challenges
inherent in trying to work out which are the most widely used FOSS
licenses, while Walter's talk described his attempts to measure
license proliferation. Both of those talks served to illustrate just how
hard it can be to produce measurements of FOSS licenses.
Toward a census of free and open source software licenses
Daniel German is a Professor in the Department of Computer Science at
the University of Victoria in Canada. One of his areas of research is a
topic that interests many other people also: which are the most widely used
FOSS licenses? His talk considered the
methodological challenges of answering that question. He also looked at how
those challenges were addressed in studying license usage in a subset of the
FOSS population, namely, Linux distributions.
Finding the license of a file or project can be difficult, Daniel
said. This is especially so when trying to solve the problem for a large
population of files or projects. "I'm one of the people who has probably
seen the most different licenses in his lifetime, and it's quite a mess."
The problem is that projects may indicate their license in a variety of
ways: via comments in source code files, via README or COPYING files, via
project metadata (e.g., on SourceForge or Launchpad), or possibly other
means. Other groups then abstract that license data. For example, Red Hat
and Debian both do this, although they do it in different ways that are
quite labor intensive.
"I really want to stress the distinction between being empirical and
being anecdotal". Here, Daniel pointed at the widely cited
statistics on FOSS license usage provided by Black Duck Software. One
of the questions that Daniel asks himself is: can he replicate those
results? In short, he cannot. From the published information, he can't
determine the methodology or tools that were used. It isn't possible to
determine the accuracy of the tools used to determine the licenses. Nor is
it possible to determine the names of the licenses that Black Duck used to
develop its data reports. (For example, one can look at the Black Duck
license list and wonder whether "GPL 2.0" means GPLv2-only, GPLv2 or later,
or possibly both?)
Daniel then turned to consider the challenges that are faced when
trying to take a census of FOSS licenses. When one looks at the question
of what licenses are most used, one of the first questions to answer is:
what is "the universe of licenses" that are considered to be FOSS? For
example, the Open Source Initiative has approved a list of around 70
licenses. But, there are very many more licenses in the world that could be
broadly categorized as free, including rather obscure and little-used
licenses such as a Beerware
license. "I don't think anyone knows what the entire universe is."
Thus, one must begin by defining the subset of licenses that one considers to be
FOSS.
Following on from those questions is the question of what is "an
individual" for the purpose of a census? Should different versions of the
same project be counted individually? (What if the license changes between
versions?) Are forks individual? What about "like" forks on GitHub? Do
embedded copies of source code count as individuals? Here, Daniel was
referring to the common phenomenon of copying source code files in order to
simplify dependency management. And then, is an individual defined at the
file level or at the package level? It's very important from a
methodological point of view that we are told what is being
counted, not just the numbers, Daniel said.
Having considered the preceding questions, one then has to choose a
corpus on which to perform the census. Any corpus will necessarily be
biased, Daniel said, because the fact that the corpus was gathered for some
purpose implies some trade-offs.
Two corpuses that Daniel likes are the Red Hat
and Debian distributions. One reason that he likes these distributions is
that they provide a clearly defined data set. "I can say, I went to Debian
5.0, and I determined this fact." Another positive aspect of these corpuses
is that they are proxies for "successful" projects. The fact that a project
is in one of those distributions indicates that people find it useful. That
contrasts with a random project on some hosting facility that may have
no users at all.
Presence in a Linux distribution can be taken as a reasonable proxy for
a successful project. A repository such as Maven Central is, by contrast, a "big
dumpster" of Java code, but "it's Java code that is actually being used by
someone". On the other hand, Daniel called SourceForge the "cemetery for
open source". In his observation, there is a thin layer of life on
SourceForge, but nobody cares about most of the code there.
Then there are domain-specific repositories such as CPAN, the Perl code
archive. There is clearly an active community behind such repositories,
but, for the purpose of taking a FOSS license census, one must realize that
the contents of such repositories often have a strong bias in favor of
particular licenses.
Having chosen a corpus, the question is then how to count the licenses
in the corpus. Daniel considered the example of the Linux kernel, which has
thousands of files. Those files are variously licensed GPLv2, GPLv2+,
LGPLv2, LGPLv2.1, BSD 3 clause, BSD 2 clause, MIT/X11, and more. But the
kernel as a whole is licensed GPLv2-only. Should one count the licenses on
each file individually, or just the individual license of the project as a
whole, Daniel asked. A related question comes up when one looks at the
source code of the FreeBSD kernel. There, one finds that the license of
some source files is GPLv2. By default, those files are not compiled to
produce the kernel binary (if they were, the resulting kernel binary would
need to be licensed GPL). So, do binaries play a role in a license census,
Daniel asked.
When they started their work on studying FOSS licenses, Daniel and his
colleagues used FOSSology, but they
found that it was much too slow for studying massive amounts
of source code. So they wrote their own license-identification
tool, Ninka. "It's not
user-friendly, but some people use it."
Daniel and his colleagues learned a lot writing Ninka. They found it was
not trivial to identify licenses. The first step is to find the license
statement, which may or may not be in the source file header. Then, it is
necessary to separate comments from any actual license statement. Then, one
has to identify the license; Ninka uses a sentence-based matching algorithm
for that task.
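The three steps described above can be illustrated with a minimal sketch. This is not Ninka's actual implementation: the rule table, the function names, and the choice to report a single license per file are all assumptions made for illustration. It shows only the general idea of stripping comment markers, splitting the text into sentences, and matching whole normalized sentences, while answering "Unknown" whenever there is doubt.

```python
import re

# Hypothetical rule table (NOT Ninka's rules): license name -> one normalized
# sentence that identifies it.
RULES = {
    "GPLv2+": ("this program is free software you can redistribute it and or modify "
               "it under the terms of the gnu general public license as published by "
               "the free software foundation either version 2 of the license or at "
               "your option any later version"),
    "GPLv2-only": ("this program is free software you can redistribute it and or "
                   "modify it under the terms of the gnu general public license "
                   "version 2 as published by the free software foundation"),
}

# Leading comment markers for C-style and script-style comments.
COMMENT_MARKERS = re.compile(r"^\s*(/\*+|\*+/|\*|//|#)\s?")


def header_comment_text(source):
    """Steps 1 and 2: collect the leading comment lines, minus their markers."""
    lines = []
    for line in source.splitlines():
        if not line.strip():
            continue  # skip blank lines inside the header
        if not COMMENT_MARKERS.match(line):
            break  # the first line of real code ends the header
        lines.append(COMMENT_MARKERS.sub("", line).strip())
    return " ".join(lines).strip()


def normalize(sentence):
    """Lowercase the sentence and drop punctuation and extra whitespace."""
    return " ".join(re.findall(r"[a-z0-9]+", sentence.lower()))


def identify(source):
    """Step 3: match whole sentences; refuse to guess when in doubt."""
    text = header_comment_text(source)
    if not text:
        return "NONE"  # no header comment at all
    sentences = {normalize(s) for s in re.split(r"\.(?:\s+|$)", text)}
    matches = [name for name, rule in RULES.items() if rule in sentences]
    # Anything other than exactly one match is reported as "Unknown".
    return matches[0] if len(matches) == 1 else "Unknown"
```

A real tool would also have to handle license statements outside the file header and files carrying several licenses, both of which this sketch ignores.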
Daniel then talked about some results that he and his colleagues have
obtained using Ninka, although he emphasized repeatedly that his numbers
are very preliminary. In any case, one of the most interesting points that
the results illustrate is the difficulty of getting accurate license
numbers.
One set of census results was obtained by scanning the source code of
Debian 6.0. The scan covered source code files for just four of the more
popular programming languages that Daniel found particularly interesting:
C, Java, Perl, and Python.
In one of the scans, Ninka counted the number of source files per
license. Unsurprisingly, GPLv2+ was the most common license. But what was
noteworthy, he said, is that somewhat more than 25% of the source code
files have no license, although there might be a license file in the same
directory that allows one to infer what the license is.
In addition, Ninka said "Unknown" for just over 15% of the files. This
is because Ninka has been consciously designed to have a strong bias against
mis-identifying licenses. If it has any doubt about the license, Ninka will
return "Unknown" rather than trying to make a guess; the 15% number is an
indication of just how hard it can be to identify the licenses in a file.
Ninka does still occasionally make mistakes. The most common reason is that
a source file has multiple licenses and Ninka does not identify them all;
Daniel has seen a case where one source code file had 30 licenses.
The other set of results that Daniel presented for Debian 6.0 measured
packages per license. In this case, if at least one of the source files in
a package uses a license, then that use is counted as an individual for the
census. Again, the GPLv2+ is the most common of the identified licenses,
but comparing this result against the "source files per license" measure showed
some interesting differences. Whereas the Eclipse Public License version 1
(EPLv1) easily reached the list of top twenty most popular source file
licenses, it did not appear in the top twenty package licenses. The reason is
that there are some packages—for example, Eclipse itself—that
consist of thousands of files that use the EPLv1 license. However, the number of
packages that make any use of the EPLv1 as a license is relatively
small. Again, this illustrated Daniel's point about methodology when it
comes to measuring FOSS license usage: what is being measured?
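The difference between the two measures can be shown with a small sketch. The scan data below is entirely made up (package names, file names, and counts are hypothetical); the point is only that the same corpus yields different rankings depending on whether the "individual" is a file or a package.

```python
from collections import Counter

# Hypothetical scan output: (package, file, license) triples.
SCAN = [
    ("big-ide", "a.java", "EPLv1"),
    ("big-ide", "b.java", "EPLv1"),
    ("big-ide", "c.java", "EPLv1"),
    ("big-ide", "d.java", "EPLv1"),
    ("tool-one", "main.c", "GPLv2+"),
    ("tool-two", "main.c", "GPLv2+"),
    ("tool-three", "main.c", "GPLv2+"),
]


def files_per_license(scan):
    """Every source file counts as one individual."""
    return Counter(lic for _pkg, _file, lic in scan)


def packages_per_license(scan):
    """A license counts once per package if at least one file there uses it."""
    unique_pairs = {(pkg, lic) for pkg, _file, lic in scan}
    return Counter(lic for _pkg, lic in unique_pairs)
```

In this toy corpus the file-level count puts EPLv1 first (four files against three), while the package-level count puts GPLv2+ first (three packages against one).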
Daniel then looked at a few other factors that illustrated how a FOSS
license census can be biased. In one case, he looked at the changes in
license usage in Debian between version 5.0 and 6.0. Some licenses showed
increased usage that could be reasonably explained. The GPLv3 was one such
license: as a new, well-publicized license, the reasons for its usage are
easily understood. On the other hand, the EPLv1 license also showed
significant growth. But, Daniel explained, that was at least in part
because, for legal reasons, Java code that uses that license was for a long
time under-represented in Debian.
Another cause of license bias became evident when Daniel turned to
look at per-file license usage broken down across three languages: Java,
Perl, and Python. Notably, around 50% of Perl and Python source files had
no license; for Java, that number was around 12%. "Java programmers seem to
be more proactive about specifying licenses." Different programming
language communities also show biases towards particular
licenses: for example, the EPLv1 and Apache v2 licenses are much more
commonly used with Java than with Python or Perl; unsurprisingly the
"Same as Perl" license is used only with Perl.
In summary, Daniel said: "every time you see a census of licenses, take it
with a grain of salt, and ask how it is done". Any license census will be
biased, according to the languages, communities, and products that it
targets. Identifying licenses is hard, and tools will make mistakes,
he said. Even a tool such as Ninka that tries to very carefully identify
licenses cannot do that job for 15% of source files. For a census, 15% is a
huge amount of missing data, he said.
License proliferation: a naive quantitative analysis
Walter van Holst is a legal consultant at the Dutch IT consulting
company mitopics. His talk presented what he describes as "an
extremely naive quantitative analysis" of license proliferation.
The background to Walter's work is that in 2009 his company sponsored a
Master's thesis on license proliferation that produced some contradictory
results. The presumption going into the research was that license
proliferation was a problem. But some field interviews conducted during the
research found that the people in free software communities didn't seem to
consider license proliferation to be much of a problem. Four years later,
it seemed to Walter that it was time for a quantitative follow-up to the
earlier research, with the goal of investigating the topic of license
proliferation further.
In trying to do a historical analysis of license proliferation, one
problem that Walter encountered is that there were few open repositories
that could be used to obtain historical license data. Thus, trying to use
one of the now popular FOSS project-hosting facilities would not allow
historical analysis. Therefore, Walter instead chose to use data from a
software index, namely Freecode (formerly Freshmeat, before an October 2011
name change). Freecode provides project licensing information that is
available for download from FLOSSmole,
which acts as a repository for dumps of metadata from other repositories.
FLOSSmole commenced adding Freecode data in 2005, but Walter noted that the
data from before 2009 was of very low quality. On the other hand, the
data from 2009 onward seemed to be of high enough quality to be
useful for some analysis.
How does one measure license proliferation? One could, Walter said,
consider the distribution of license choices across projects, as Daniel
German has done. Such an analysis may provide a sign of whether license
proliferation is a problem or not, he said.
Another way of defining license proliferation is as a compatibility
problem, Walter said. In other words, if there is proliferation
of incompatible licenses, then projects can't combine code that
technically could be combined. Such incompatibility is, in some sense, a loss
in the value of that FOSS code. This raises a related question, Walter said: "is
one-way license compatibility enough?" (For example, there is one-way
compatibility between the BSD and GPL licenses, in the direction of the
GPL: code under the two licenses can be combined, but the resulting work
must be licensed under the GPL.) For his study, Walter presumed that one-way
compatibility is sufficient for two projects to be considered compatible.
Going further, how can one assign a measure to compatibility, Walter asked.
This is, ultimately, an economic question, he said. "But, I'm still not very
good at economics", he quipped. So, he instead chose an "extremely naive" measure of
compatibility, based on the following assumptions:
- Treat all open source projects in the analysis as nodes in a network.
- Consider all possible links between pairs of nodes (i.e., combinations
  of pairs of projects) in the network.
- Treat each possible combination as equally valuable.
This is, of course, a rather crude approach that treats a combination of,
say, the GNU C library (glibc) and some obscure project with few users as
being equal in importance to the combination of glibc and gcc. This
approach also completely ignores language incompatibilities, which is
questionable, since it seems
unlikely that one would want to combine Lisp and Java code, for example.
Given a network of N nodes, the potential "value" of the network
is the maximum number of possible combinations of two nodes. The
number of those combinations is N*(N-1)/2. From a
license-compatibility perspective, that potential value would be fully
realized if each node was license-compatible with every other node. So, for
example, Walter's 2009 data set consisted of 38,674 projects, and,
following the aforementioned formula, the total
possible interconnections would be approximately 747.8 million.
Walter's measure of license incompatibility in a network is then based on asking
two questions:
- For each license in the network, how many combinations of two nodes in
  the network can produce a derived work under that license? For example,
  how many pairs of projects under GPL-compatible licenses can be
  combined in the network?
- Considering the license that produces the largest number of possible
  connections for a derived work, how does the number of connections for
  that license measure up against the total number of possible
  combinations?
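The two questions can be turned into a short sketch. The compatibility table here is hypothetical and uses abstract license names ("copyleft-A", "copyleft-B", "permissive-P") rather than real licenses; it is not Walter's data, and the function names are likewise assumptions. One-way compatibility is treated as sufficient, as in the talk.

```python
from collections import Counter


def pair_count(n):
    """Number of distinct pairs among n nodes: N*(N-1)/2."""
    return n * (n - 1) // 2


def best_network(project_licenses, can_combine_under):
    """Return (best license, its pair count, loss against the full network).

    can_combine_under maps a target license to the set of project licenses
    whose code may be combined into a derived work under that target.
    """
    counts = Counter(project_licenses)
    best_license, best_value = None, -1
    for target, accepted in can_combine_under.items():
        # Question 1: how many pairs can yield a derived work under `target`?
        nodes = sum(counts[lic] for lic in accepted)
        value = pair_count(nodes)
        if value > best_value:
            best_license, best_value = target, value
    # Question 2: compare the best license against all possible pairs.
    potential = pair_count(len(project_licenses))
    return best_license, best_value, potential - best_value


# Hypothetical toy network of ten projects.
PROJECTS = ["copyleft-A"] * 6 + ["permissive-P"] * 3 + ["copyleft-B"]
TABLE = {
    "copyleft-A": {"copyleft-A", "permissive-P"},
    "copyleft-B": {"copyleft-B", "permissive-P"},
    "permissive-P": {"permissive-P"},
}
```

For this toy network, `best_network(PROJECTS, TABLE)` picks "copyleft-A": nine of the ten projects can be combined under it, giving 36 of the 45 possible pairs and a loss of 9.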
Perhaps unsurprisingly, the license that allows the largest number of
derived work combinations is "any version of the GPL". By that measure, 38,171
projects in the data set were compatible, yielding 728.5 million
interconnections.
Walter noted that the absolute numbers don't matter in and of
themselves. What does matter is the (proportional) difference between the
size of the "best" compatible network and the theoretically largest
network. For 2009, that loss is the difference between the two
numbers given above, which is 19.3 million.
Compared to the total potential connections, that loss is not high
(expressed as a proportion, it is 2.5%). Or to put things another way, Walter
said, these figures suggest that in 2009, license proliferation appears not
to have been too much of a problem.
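The 2009 arithmetic can be reproduced in a few lines. The project counts come from the figures quoted above; the function and variable names are illustrative only.

```python
def pair_count(n):
    """Number of distinct pairs among n nodes: N*(N-1)/2."""
    return n * (n - 1) // 2


TOTAL_PROJECTS = 38_674   # projects in the 2009 data set
GPL_COMPATIBLE = 38_171   # projects compatible with "any version of the GPL"

potential = pair_count(TOTAL_PROJECTS)   # theoretically largest network
best = pair_count(GPL_COMPATIBLE)        # "best" compatible network
loss = potential - best
proportion = loss / potential

print(f"potential value: {potential / 1e6:.1f} million")
print(f"best network:    {best / 1e6:.1f} million")
print(f"value loss:      {loss / 1e6:.1f} million ({proportion:.2%})")
```

With these inputs the loss comes to about 19.3 million pairs, or roughly 2.58% of the potential value, which is in line with the 2.5% quoted above if that figure was truncated rather than rounded.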
Walter showed corresponding numbers for subsequent years, which are
tabulated below. (The percentage values in the "Value loss" column are your
editor's addition, to try and make it easier for the reader to get a feel
for the "loss" value.)
| Year | Potential value (millions) | Value loss (millions) | GPL market share |
| 2009 | 747.8 | 19.3 (2.5%) | 72% |
| 2010 | 534.6 | 30.8 (5.7%) | 63% |
| 2011 | 565.9 | 56.4 (9.9%) | 61% |
| 2012 | 599.6 | 79.8 (13.3%) | 59% |
| 2013 | 621.6 | 60.3 (9.7%) | 58% |
The final column in the table shows the proportion of projects licensed
under "any version of the GPL". In addition, Walter presented pie charts
that showed the proportion of projects under various common
licenses. Notable in those data sets was that, whereas in 2009 the
proportion of projects licensed GPLv2-only and GPLv3 was respectively 3%
and 2%, by 2013, those numbers had risen to 7% and 5%.
Looking at the data in the table, Walter noted that the "loss" value
rises from 2010 onward, suggesting that incompatibility resulting from
license proliferation is increasing.
Walter then drew some conclusions that he stressed should be treated
very cautiously. In 2009, license proliferation appears not to have been
much of a problem. But looking at the following years, he suggested
that the increased "loss" value might be due to the rise in the number of
projects licensed GPLv2-only or GPLv3-only. In other words, incompatibility
rose because of a licensing "rift" in the GPL community. The "loss" value
decreased in 2013, which he suggested may be due to an increase in the
number of projects that have moved to Apache License version 2 (which has
better license compatibility with the GPL family of licenses).
Concluding remarks
In questions at the end of the session, Daniel and Walter both readily
acknowledged the limitations of their methodologies. For example, various
people raised the point that the Freecode license information used by
Walter tends to be out of date and inaccurate. In particular, the data does
not seem to be too precise on which version of the GPL a project is
licensed under; the license for many projects is just defined as "GPL",
which provided Walter's "any version of the GPL" license measure
above. Walter agreed that his source data is dirty, but pointed out that
the real question is how to get better data.
As Walter also acknowledged, his measure of license incompatibility is
"naive". However, his goal was not to present highly accurate
numbers. Instead, he wants to get some clues about possible trends and
suggest some ideas for future study. It is easy to see other ways in which
his results might be improved. Comparing his presentation with Daniel's,
one can immediately come up with ideas that could lead to improvements. For
example, approaches that consider compatibility at the file level or bring
programming languages into the equation might produce some interesting
results.
Inasmuch as one can find faults in the methodologies used by Daniel and
Walter, that is only possible because, unlike the widely cited Black
Duck license census, they have actually published their methodologies. In
revealing their methodologies and the challenges they faced, they have
shown that any FOSS licensing survey that doesn't publish its methodology
should be treated with considerable suspicion. Clearly, there is room for
further interesting research in the areas of FOSS license usage, license
proliferation, and license incompatibility.