Do What The Fuck You Want To: a brief look at open source licenses.

Open source software - commercial open source software (COSS) in particular - has had a lot of attention in recent years. It has come in from the cold since its hippy, formative years, when nobody thought "free" software could build a big company, and open source products have since shown they can create large enterprises: MySQL, Red Hat, and GitLab, to name a few. This article, however, is not about building the next open source unicorn. Oh no, this is much more interesting.
This is about licenses.
Strap in!

For the uninitiated, an open source license is essentially a file included in a project's source code repository (like this one) which sets out the commitments you take on should you wish to make use of the code. These vary, some licenses requiring more commitments and others fewer.

These licenses are very important. If you are publishing source code under the MIT license, for instance, it will be a different proposition to anyone wanting to use your code than it would be if you published it under a GPL license.
In the world of open source licenses, these two, along with their respective peers, fall under either a permissive family of licenses or a copyleft one. These can be generalised as licenses that let you do a lot with fewer obligations (permissive) or those that require you to share your modifications under the same terms when you distribute them (copyleft).

This division of licensing can leave you in a conflicted state.
On the one hand, you may like to build products that rely on permissive projects such that you can focus on monetisation without potentially running afoul of licensing restrictions; yet you would like your own product to be protected and suddenly copyleft feels much more comfortable.
And when your software product relies on hundreds, if not thousands, of open source dependencies, this becomes both complex and important to understand.
Quickly this can get messy...

A walk down License Lane

I had no thesis upon delving into the upcoming analysis; I was simply curious and took it from there. I have on occasion been asked things like "what's the most common open source license you see used by startups?" - and I actually don't know. I just instinctively say something like "probably the MIT license...".
Not happy with such a mealy-mouthed response, I decided to write some code to educate myself. I hope to pass some of this education on to you, the reader!

So what are we dealing with?

  • 3.5k organisations
  • 85k projects
  • 47k projects once dupes (e.g. tensorflow) and other unhelpful ones were removed.
  • 35 different licenses

Now for the first couple of charts:

Projects and their licenses by year.
Here are the project licenses cumulatively.

These are really about contextualising the data. You can see I looked at newer projects (past decade or so) and they're skewed to more recent years. Mostly startup-ish, I think. The year in question is the year the license was added to the repo (okay, not perfect). Most of these projects are found in repos that likely would interest an investor so it's not like my various half-finished shopping list apps in different languages and frameworks or anything. These are commercially interesting repos - at least in theory.

You have likely also noticed colours kind of run out so I'll walk you through some of the interesting stuff as I see it.

MIT is #1

I was correct! Yes, the MIT license is the most common across the projects in question. This is likely a good sign. We are not seeing paranoia undermine the open source philosophy, and there hasn't been some mad dash towards highly restrictive (copyleft) licensing, which we might have expected with prominent companies like Mongo and Elastic migrating away from their earlier open source licenses.

Mongo changed its license in 2018 so you might have expected that to be the year in which a change in distribution across licenses emerged as startups considered this defensive stance from inception. The data doesn't support this, however. There is a legitimate question of what this would look like over a larger set which includes startups across 20 or 30 years, not just 10. What I've done is a few cups of tea worth of work so it's not a proper research project - I urge someone else to leap to this challenge, though!

The top ten

MIT might be #1 but what are the top ten most popular licenses?

  1. mit (54%)
  2. apache-2.0 (27%)
  3. gpl-3.0 (4%)
  4. bsd-3-clause (4%)
  5. agpl-3.0 (2%)
  6. gpl-2.0 (1%)
  7. lgpl-3.0 (1%)
  8. mpl-2.0 (1%)
  9. bsd-2-clause (1%)
  10. isc (1%)

As you can see, it's really a tale of two licenses, MIT and Apache 2.
The General Public License (GPL) creeps in third with a lowly 4% but is still the major copyleft license in the rankings.
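
For anyone wanting to reproduce a ranking like this on their own list of repos, here is a minimal sketch. It assumes you already have one lowercase-able license key per project (the keys and the sample below are made up for illustration); it is not the code behind the numbers above.

```python
from collections import Counter


def license_shares(license_keys, top_n=10):
    """Return the top_n (license, percent) pairs, rounded to whole percent."""
    # Normalise case so "MIT" and "mit" are counted together.
    counts = Counter(key.lower() for key in license_keys)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (key, round(100 * count / total))
        for key, count in counts.most_common(top_n)
    ]


# Hypothetical sample, not the real data set: one license key per project.
sample = ["mit"] * 6 + ["apache-2.0"] * 3 + ["gpl-3.0"]

for rank, (key, pct) in enumerate(license_shares(sample), start=1):
    print(f"{rank}. {key} ({pct}%)")
# 1. mit (60%)
# 2. apache-2.0 (30%)
# 3. gpl-3.0 (10%)
```

Rounding to whole percentages is why several entries in a long tail can all show up as 1%.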

Here is a distribution of the licenses across the data set by year (yes, it is still hard to read):

Remember most of the data is in the later years.

Permissive vs. Copyleft

Perhaps unsurprisingly then we see the following division between permissive licenses and copyleft ones across the entire data set of 47k projects:

I didn't categorise "weak copyleft", "strong copyleft", etc, so bear that in mind.
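
As a rough illustration of that kind of two-way split, the sketch below sorts the top ten license keys into the two families. The grouping is an assumption on my part: weak copyleft licenses (LGPL, MPL) are simply lumped in with the strong ones, and anything outside the two sets is left unclassified rather than guessed at.

```python
# Assumed grouping of the lowercase license keys from the top ten.
# Weak copyleft (lgpl, mpl) is deliberately not split out from strong copyleft.
PERMISSIVE = {"mit", "apache-2.0", "bsd-3-clause", "bsd-2-clause", "isc"}
COPYLEFT = {"gpl-3.0", "gpl-2.0", "agpl-3.0", "lgpl-3.0", "mpl-2.0"}


def license_family(key):
    """Map a license key to 'permissive', 'copyleft', or 'unclassified'."""
    key = key.strip().lower()
    if key in PERMISSIVE:
        return "permissive"
    if key in COPYLEFT:
        return "copyleft"
    # Long-tail licenses need a human decision, so do not guess here.
    return "unclassified"


for key in ["MIT", "lgpl-3.0", "odbl-1.0"]:
    print(key, "->", license_family(key))
```

Keeping an explicit "unclassified" bucket makes it obvious how much of the long tail has been left out of any permissive vs. copyleft chart.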

Actually, if anything, you can see that copyleft has dwindled post 2012.

I don't really know why this might be. It could simply be the fewer data points back then in this data set.
There is, of course, a tonne of qualitative information that would be very arduous to analyse across so many projects. One possibility is the growing number of repos for things like templates, examples, and the like, which I've noticed are increasingly prevalent and are often permissively licensed.
I'm sure someone reading this might have a more sophisticated suggestion.
Answers on a postcard, please.
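
One way to keep the "fewer data points" caveat in view is to report the number of projects next to the copyleft share for each year. A minimal sketch, assuming each project has already been reduced to a (year, family) pair; the rows below are invented for illustration:

```python
from collections import defaultdict


def copyleft_share_by_year(rows):
    """rows: iterable of (year, family) pairs, e.g. (2020, "copyleft").

    Returns {year: (project_count, copyleft_percent)}, ordered by year.
    """
    totals = defaultdict(int)
    copyleft = defaultdict(int)
    for year, family in rows:
        totals[year] += 1
        if family == "copyleft":
            copyleft[year] += 1
    return {
        year: (totals[year], round(100 * copyleft[year] / totals[year], 1))
        for year in sorted(totals)
    }


# Hypothetical rows: a thin early year and a busier recent one.
rows = (
    [(2012, "copyleft"), (2012, "permissive")]
    + [(2020, "permissive")] * 9
    + [(2020, "copyleft")]
)

for year, (count, pct) in copyleft_share_by_year(rows).items():
    print(f"{year}: {pct}% copyleft across {count} projects")
```

In this made-up sample the early year shows 50% copyleft, but from only two projects, which is exactly the kind of figure that should be read with care.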

The long tail

I was (but perhaps shouldn't have been) surprised that there are 35 different licenses in the data set. My curiosity piqued, I read through some of the more obscure licenses to see what was there.

The "cc" licenses

  • There are a number of licenses that start with "cc". These are the Creative Commons licenses.
  • Creative Commons is an organisation I associate more with photos, and similar stuff, than source code.
  • Taking a quick look, it does in fact appear to be more creative works, such as project docs, that use these licenses: https://github.com/littledata/segment-docs

The "Unlicense"

  • Not one I'd ever come across before but it was created by Arto Bendiken in 2010.
  • It seems to be derived from the SQLite project and is aiming to promote the Public Domain approach as distinct from copyleft and even permissive open source licensing.
  • Worth reading: https://ar.to/2010/01/set-your-code-free

Company specific licenses

  • There are postgresql and ms-pl (Microsoft) licenses in the data set and, unsurprisingly, these are found in projects providing extensions to postgres and things of that nature.
  • My educated guess is that this is mostly represented through project inheritance. Code derived from ms-pl projects is therefore carrying that license.

Databases

  • Want to share a database? You can use The Open Database License (ODbL).
  • ODbL is focused on the database itself, though: it clearly states that the contents of the database can be subject to other licenses.
  • You'd use this in combination with other licensing for the images, or whatever is within the database. Perhaps, one of the CC licenses.

European Union Public Licence

  • The EU wants its own license? * shocked face *

RAIL

  • The Responsible AI License - and the lack of it.
  • I really thought this license would be in the data set and it wasn't.
  • This license can be found in, unsurprisingly, AI-based projects these days. It is used for CompVis's Stable Diffusion project, for example.
  • Where this license steps in is really around how people behave when using open source AI projects. It's a bit like the "don't be evil" of licenses.
  • Worth reading: https://www.licenses.ai/

WTFPL

  • The "Do What The Fuck You Want To Public License."
  • Having only just discovered this license in this analysis, it is now my favourite license and I'll default to using it instead of MIT going forward.
What a masterpiece.

There are many more licenses but these are some that took my interest on reading through the list. I encourage you to go read through some yourself and you'll quickly appreciate that open source licensing rivals the Amazon rainforest in its diversity.

Some final thoughts

I'm not sure if this is especially interesting but I enjoyed building the data set and poking around it. It felt right to "open source" the findings and maybe it'll spark some interest in the topic which is no bad thing.

There are actually so many complexities in open source software licensing that perhaps a whole article is needed to wade through its nuances. The interplay between software components (and the licenses!) can quickly spin a web of obligations that can bring you out in a sweat.

It also does not escape me that there's so much further analysis that could be done on this data. A lot of it is qualitative; you need to look into the projects to see what they do. I have done this a bit but it's a drop in the ocean.
For instance, a question that I'm mulling over is the distribution of invested cash across the organisations in the data set, and whether there are any patterns around money raised and the license composition. Is there relatively more cash in the few GPL projects because they are perceived to be "safer" to invest in? Maybe the ones really gunning for commercial success are heavily weighted towards the copyleft licenses?
Another day, maybe.

To my peers in the investment industry, I urge you to put open source licensing on the agenda when you're speaking with founders. It needs to be on the DD checklist. If the company is planning to monetise an open source project then this is even more important. All it takes is one LGPL in there and you need to really think about how the application is constructed to avoid potential obstacles down the road.
It's concerning how, over the years, this topic has rarely, if ever, been mentioned when I hear someone excitedly pursuing an investment into the latest project to get 10k GitHub stars.
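
As a starting point for that checklist item, here is a small sketch that flags dependencies worth a closer look. It assumes you can already produce a mapping of package name to license key from your own tooling; the package names, the set of licenses to flag, and the function itself are all hypothetical.

```python
# Assumed set of license keys that should prompt a closer look; adjust to taste.
REVIEW_LICENSES = {"gpl-2.0", "gpl-3.0", "agpl-3.0", "lgpl-3.0", "mpl-2.0"}


def flag_dependencies(dependencies):
    """dependencies: mapping of package name -> license key (None if unknown).

    Returns a mapping of package name -> reason it was flagged.
    """
    flagged = {}
    for name, key in dependencies.items():
        if key is None:
            # A missing license is a question for the founders in its own right.
            flagged[name] = "unknown license"
        elif key.lower() in REVIEW_LICENSES:
            flagged[name] = f"copyleft ({key.lower()})"
    return flagged


# Hypothetical dependency list for illustration only.
deps = {
    "web-framework": "MIT",
    "pdf-renderer": "LGPL-3.0",
    "internal-fork": None,
}

for name, reason in flag_dependencies(deps).items():
    print(f"{name}: {reason}")
# pdf-renderer: copyleft (lgpl-3.0)
# internal-fork: unknown license
```

A flag here is only a prompt for a conversation about how the application is put together, not a verdict on the dependency.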

As for the charts/data/advice/ramblings in this article:
Do What The Fuck You Want To (with it).