Hi Hadrian,
I found actual GPL code in many of these projects.
Felix:
Here we found GPL in the following locations:
/src/compiler/cil/ocamlutil/intmap.ml*
/src/compiler/cil/src/ext/pta/setp.ml*
/src/judy/JudySL/JudySL.c
/src/tre/*.c, *.h (many files)
ActiveMQ:
Many files in this directory:
http://svn.apache.org/repos/asf/activemq/sandbox/amazon/ltmain.sh
Hadoop:
It appears that the Hadoop code contains GPL in ltmain.sh and related files, similar to ActiveMQ. However, I'm having trouble verifying this: it looks like the Hadoop SVN URL has changed recently. Do you know the new URL?
ServiceMIX:
Here we did falsely detect LGPL. Ironically, our detector triggers on a comment in the root pom.xml file, which excludes a portion of the xtream lib precisely because it is LGPL. It's conceivable (but a long shot) that this could be fixed by ignoring any comments in a pom.xml file that occur within an exclude element. However, I have to imagine this is the only such case on Ohloh.
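For what it's worth, here is a rough sketch of what that pom.xml idea could look like. This is not Ohcount code. It is a standalone Python illustration, and the function names, the regular expression, and the assumption that the "exclude element" means Maven's <exclusion>/<exclusions> tags are all mine.

```python
import re
from xml.dom import minidom

# Hypothetical pattern: a bare mention of GPL or LGPL in a comment.
LICENSE_PATTERN = re.compile(r"\bL?GPL\b", re.IGNORECASE)

# Assumption: "exclude element" refers to Maven's exclusion tags.
EXCLUDE_TAGS = {"exclusion", "exclusions"}


def inside_exclude(node):
    """Return True if any ancestor of node is an exclusion element."""
    parent = node.parentNode
    while parent is not None:
        if parent.nodeType == parent.ELEMENT_NODE and parent.tagName in EXCLUDE_TAGS:
            return True
        parent = parent.parentNode
    return False


def license_comments(pom_text):
    """Collect license-mentioning comments that are NOT inside an exclusion."""
    doc = minidom.parseString(pom_text)
    hits = []

    def walk(node):
        for child in node.childNodes:
            if child.nodeType == child.COMMENT_NODE:
                if LICENSE_PATTERN.search(child.data) and not inside_exclude(child):
                    hits.append(child.data.strip())
            else:
                walk(child)

    walk(doc)
    return hits
```

With something like this, a comment explaining why a dependency is excluded would be skipped, while a license comment elsewhere in the file would still be reported. It only covers comments nested inside the exclusion element itself, so a comment placed just above the element would still trigger.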
Camel:
The Camel example is interesting. In general, it is difficult to distinguish between comments that invoke a license and comments that merely mention one. In this way, Camel is like the Ohcount library itself, which tests positive for many licenses simply because it contains code that deals with licenses. I have to admit I don't see any easy fix for the case in Camel.
Frustratingly, the problems with Camel, Ohcount, and John's code are all the same: The Ohcount software can't actually read and understand a license agreement. That's the core of this whole rambling thread.
Our software is not a lawyer. Flawless license detection is not our goal.
What we can do is say, "Hey, there's something in this file that needs looking at," and that turns out to be very helpful. We've turned up hidden GPL in a lot of projects, and false positives are quite rare.
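To make the "needs looking at" idea concrete, here is a toy sketch in Python. It is not how Ohcount is implemented. The pattern and the function name are made up for illustration, and it deliberately does nothing smarter than point a human at the first suspicious line in each file.

```python
import re

# Hypothetical hint pattern; a real detector would know many more phrasings.
GPL_HINT = re.compile(
    r"GNU (Lesser |Library )?General Public License|\bL?GPL\b",
    re.IGNORECASE,
)


def flag_for_review(files):
    """Given a mapping of path -> file text, return (path, line_no, line)
    for the first line in each file that mentions the GPL or LGPL."""
    flagged = []
    for path, text in sorted(files.items()):
        for line_no, line in enumerate(text.splitlines(), start=1):
            if GPL_HINT.search(line):
                flagged.append((path, line_no, line.strip()))
                break  # one hit per file is enough to warrant a look
    return flagged
```

The output is a list of places to look, not a verdict. A file that merely talks about the GPL gets flagged just like a file that is licensed under it, which is exactly the Camel and Ohcount situation.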
There are always going to be projects with a few stray detections. As time goes on, we find patterns in these, and we get better at ruling them out. However, it's an inefficient use of our time to dwell on individual cases unless they are symptomatic of a lot of similar cases. That's simply not the case here.