Some Mozilla-related projects are complaining about being classified as shell-script projects because of a single file (configure).
Configure (typically generated from configure.in) is interesting in that it's likely to get thousands of small changes. As such, counting total commits instead is not going to solve the problem of "total lines of code misdetermines the language".
For reference:
http://bonsai.mozilla.org/cvslog.cgi?file=mozilla/configure
1.1916 cltbld 2007-10-13 14:21 Automated update from host egg.build.mozilla.org
Configure, of course, tends to be massively long.
Unfortunately, counting files instead has its own problem: a Java project which commits its javadoc (or a Perl project that commits its pod-generated HTML) will have its sources dwarfed in the total number of files (documentation > sources).
I think the trick is to try to come up with an algorithm that detects "copying". If configure seems to be a derivative copy of configure.in (good luck), or foopy.html seems to be a derivative copy of foopy.pl / foopy.java, then it should be discounted.
I'm not sure how well that'll work. A project with heavy manual documentation will probably still be penalized, but overall it might enable you to count by number of files.
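To make the "copying" idea a bit more concrete, here's a rough sketch. It is not anything Ohloh actually does; the function names, the crude tag stripping, and the 0.8 threshold are all my own assumptions. It compares the sets of word-like tokens in two files using the overlap coefficient (shared tokens divided by the size of the smaller set), so a small generated doc page that mostly reuses words from its source scores high.

```python
import re

# Crude, assumed heuristics: strip things that look like markup tags,
# then keep word-like tokens of three or more characters.
TAG = re.compile(r"</?[A-Za-z][^<>\n]*>")
WORD = re.compile(r"[A-Za-z_]{3,}")

def tokens(text):
    """Return the set of lowercase word-like tokens in text."""
    return set(WORD.findall(TAG.sub(" ", text).lower()))

def overlap(source_text, candidate_text):
    """Overlap coefficient: shared tokens / size of the smaller token set."""
    a, b = tokens(source_text), tokens(candidate_text)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def looks_derived(source_text, candidate_text, threshold=0.8):
    """Guess whether candidate_text is a derivative copy of source_text.

    The threshold is an arbitrary starting point, not a tuned value.
    """
    return overlap(source_text, candidate_text) >= threshold
```

Files flagged this way (say foopy.html against foopy.pl) would be left out of the count. I'd expect it to need a lot of tuning before it does anything sensible for configure vs. configure.in.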
Another thing you could do is drop languages used in only one file.
I was wondering if this would penalize the Mozilla despot project (which I believe is not yet imported into Ohloh; I might do that just to see what happens):
http://mxr.mozilla.org/webtools/source/despot/
From memory, despot basically has one file (despot.cgi) and a help file (help.html), but it does actually have a couple of other .pl files, so today it'd probably be counted as Perl. Once enough of despot is converted to use .templ's, it would probably be called HTML.
And I don't know that I'd mind.
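The "drop languages used in only one file" rule could look something like this. It's a minimal sketch under my own assumptions: the input is a list of (path, language, line count) tuples that some other tool has already produced, and primary_language and min_files are made-up names.

```python
from collections import defaultdict

def primary_language(files, min_files=2):
    """Pick the language with the most lines, ignoring any language
    that appears in fewer than min_files files.

    files is an iterable of (path, language, line_count) tuples.
    """
    lines = defaultdict(int)
    count = defaultdict(int)
    for _path, lang, n in files:
        lines[lang] += n
        count[lang] += 1

    kept = [lang for lang in lines if count[lang] >= min_files]
    if not kept:
        # Every language is a one-file language: fall back to all of them.
        kept = list(lines)
    return max(kept, key=lines.get) if kept else None
```

With made-up numbers, a 20000-line configure next to two small .cpp files comes out as cpp rather than shell, since shell only shows up in one file.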
Probably the simplest solution is to list the major language and the second language, plus a note about which calculation would make the second language the major one.
Of course, that only works for two languages and would fail for perl+html+js, but it should probably help for configure :).
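For what it's worth, the "major language plus second language plus a note" report might be computed roughly like this. Again this is only a sketch: the (path, language, line count) input and the two metrics (lines of code, number of files) are my assumptions, and language_summary is a hypothetical name.

```python
from collections import defaultdict

def language_summary(files):
    """Return (major, second, flips) for a list of
    (path, language, line_count) tuples.

    major and second are ranked by total lines of code; flips lists the
    metrics under which the second language would beat the major one.
    """
    lines = defaultdict(int)
    count = defaultdict(int)
    for _path, lang, n in files:
        lines[lang] += n
        count[lang] += 1

    ranked = sorted(lines, key=lines.get, reverse=True)
    if not ranked:
        return None, None, []
    major = ranked[0]
    second = ranked[1] if len(ranked) > 1 else None

    metrics = {"lines of code": lines, "number of files": count}
    flips = [name for name, totals in metrics.items()
             if second is not None and totals[second] > totals[major]]
    return major, second, flips
```

For a project with a huge configure and a couple of .cpp files, this would report shell as the major language and cpp as the second, with a note that counting by number of files would put cpp first.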