Perl revision history 4+ months old

pjf

about 1 year ago

G'day awesome ohloh team!

It looks like the enlistments for the perl project aren't in great shape. The last update for most of the enlistments is 4 months ago, and some of them appear to have failed entirely.

I know the ninja ohloh spiders run automatically when they find themselves with free capacity, but I fear that the Perl repository may have become stuck.

A lot has changed in the Perl source in the last four months. I know, because I'm coordinating the release notes for 5.8.9. ;)

Many thanks,

Paul


Robin Luckey

about 1 year ago

Hi Paul,

It looks like we never successfully cloned the Perl source after the switch to Git. I've started fresh pulls of the Git repositories, and I'll keep an eye on it today.

Thanks, Robin


pjf

about 1 year ago

You're awesome, Robin. Thanks for handling all my enlistment problems for me. ;)


Robin Luckey

about 1 year ago

Hi Paul,

The massive Perl recount has finally completed, and unfortunately it looks like something fishy is going on in the line count totals.

For each of these branches, we are consistently finding a negative net line count for "dos batch script". This probably means that we are miscounting all of the other languages as well, but only dos batch script happened to turn out negative.

Typically, when this happens it's because the source control history is inconsistent. There are probably duplicate commits, or commits that delete code that didn't actually get deleted.

If this were a smaller repository, I could drill in and try to help find the problem, but with 30K commits, it's a lot to wade through. I might be able to rig up a query that isolates the guilty commits, but it would take some time.

Based on past experience, I'm guessing there was an error during conversion to git (you did convert this from something else, right?). A common error we see is a phantom commit that claims to delete the entire code tree.

You may or may not care enough to get this fixed; I'm sure this repository was no cakewalk to create the first time around. The Ohloh report is probably more correct than not, and the repository is evidently working for your team in their day-to-day work.

Let me know if there's something I can do to help,

Robin


pjf

about 1 year ago

Aha! All the dos batch scripts that other people no longer have to write are being included in Perl's metrics. ;)

On a more serious note, Perl is currently using perforce as a source control system, but we're in the process of moving over to git. The git repos that ohloh are indexing are courtesy of Sam V, who has magic that tracks the perforce commits and turns them into git commits.

When the move to git is completed (and I don't have a date for this), the location of the "official" Perl 5 repos may change again, so I'm not going to ask you to go digging through the history to see what's happened.

I will let Sam know of the funny results, in case he feels inclined to pursue things further.

Many thanks again for your always quick and very helpful responses!

Paul


mugwump

about 1 year ago

Yes. Well, the problem here is that this comment simply makes no sense:

"the source control history is inconsistent. There are probably duplicate commits, or commits that delete code that didn't actually get deleted."

What does "inconsistent" mean?

There are no duplicate commits; they all have unique identifiers (the commit ID). It is possible that there are multiple commits that introduce an identical change; this is called "cherry picking" and is an inevitable consequence of working with branches.

That being said, there are multiple enlistments which share history because OHLOH currently only allows tracking of one branch in a given repository.

The final part of the comment is a non sequitur.

I am forced to draw the conclusion that your conceptual model of git is deficient. I suggest reading chapter 7 of the user manual which outlines the object model, and consider where in your code the error lies.


Robin Luckey

about 1 year ago

Hi mugwump,

My apologies if I've come across as flip, or seemed to point the finger elsewhere rather than inspect our own code. My comments are deliberately vague because I admit I don't know exactly what has gone wrong. Hopefully this is not from some personal deficiency, but simply an attempt to economize my time across the thousands of repositories we are monitoring.

I'll be the last one to defend our Git importer. I've been struggling with its limitations for a long time. The Ohloh data model was originally designed to work with CVS and Subversion, which have linear commit sequences. Our attempts to graft Git support onto this model have been mixed at best. The need to reinvent our internal model is one of the reasons we have hesitated to add support for other DVCSes.

However, I've crammed a lot of repositories through our meat grinder, and I've become familiar with the flavors of sausage that come out. While it's not unusual for our system to get things wrong (and cherry-picking is indeed one way to get us confused), when reports go wrong in this particular way, the underlying repository almost always turns out to have been created by some kind of conversion tool. And lo, that is again the case.

I'm happy to help get to the bottom of this. It may be that the problem is entirely on our end, and while I will ultimately debug this problem, I cannot fix it immediately. In the meantime, I do suspect that the conversion from Perforce was not flawless, and thought you may be interested.

I'm not sure what you meant about the non sequitur, but I accidentally the whole fleshlight.

Robin


mugwump

about 1 year ago

Oh, those dreadful, dreadful conversion tools. How dare they test the edge cases of your code.

Look, at the end of the day, the repository is simply not corrupted in any way at all, and your insistence that this implies some kind of error in the conversion introduced by the conversion tool is categorically wrong. Like I say, this will be obvious to you if you simply read the chapter that describes the object model.

If you can point me to the source of the importing tool, perhaps I can have a look.


mugwump

11 months ago

Robin, I'm just wondering if the import might be choking on changes happening in parallel branches. If so perhaps passing '--first-parent' to whichever 'git log'-type command you're using to get the list of commits might be useful; it will linearize the history and make sure that you don't get any confusing changes.


Robin Luckey

11 months ago

Hi mugwump,

I think you may be exactly right. I've been thinking about this off and on for a while, and I think this might be a repro case:

  1. Add a file on the master branch.

  2. Create branch "B" from the master.

  3. Delete the file from the master.

  4. Delete the file from branch "B" also.

  5. Merge "B" back into the master.

  6. Submit to Ohloh.

Ohloh now sees only one create commit, but two delete commits. This leads to negative line counts.
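
The repro steps above can be sketched as a small simulation. This is only an illustration, not Ohloh's actual importer: the commit ids, the 10-line file size, and the `naive_total` helper are all made up for the example. It assumes the importer flattens every commit it sees into one linear sequence and sums the per-commit line deltas.

```python
# Hypothetical flattened history for the repro: (id, branch, net line delta).
# The file is assumed to be 10 lines long; the ids are invented.
commits = [
    ("c1", "master", +10),  # step 1: add the file on master
    ("c2", "master", -10),  # step 3: delete the file from master
    ("c3", "B", -10),       # step 4: delete the same file from branch B
    ("c4", "master", 0),    # step 5: merge B into master, nothing left to change
]


def naive_total(commits):
    """Sum every delta as if the history were a single linear sequence."""
    return sum(delta for _commit_id, _branch, delta in commits)


print(naive_total(commits))  # -10: one create counted, two deletes counted
```

The real tree is empty after the merge, so the correct total is 0. The extra -10 comes from counting the delete on "B" against a file that master had already removed.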

The --first-parent flag would prevent the double-deletion bug (and other kinds of similar bugs), but it would also cause us to miss a lot of commits. Either way, we have unsatisfactory reports. At least --first-parent would be internally consistent.
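
A rough sketch of that trade-off, using a toy commit graph rather than real git output. The graph, the commit names, and the helper functions are hypothetical; each delta is assumed to be measured against the commit's first parent, which is roughly what a first-parent walk of the history would report.

```python
# Hypothetical commit graph for the repro: id -> (parents, net line delta).
# For a merge, the first parent is the branch that was merged into.
graph = {
    "add":      ([], +10),
    "del_main": (["add"], -10),
    "del_b":    (["add"], -10),
    "merge":    (["del_main", "del_b"], 0),
}


def first_parent_chain(graph, head):
    """Follow only the first parent from head back to the root."""
    chain = []
    node = head
    while node is not None:
        chain.append(node)
        parents = graph[node][0]
        node = parents[0] if parents else None
    return list(reversed(chain))


def all_reachable(graph, head):
    """Collect every commit reachable from head through any parent."""
    seen, stack = set(), [head]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph[node][0])
    return seen


def total(graph, ids):
    """Sum the line deltas of the given commits."""
    return sum(graph[commit_id][1] for commit_id in ids)


linear = first_parent_chain(graph, "merge")
everything = all_reachable(graph, "merge")

print(linear)                          # ['add', 'del_main', 'merge']
print(total(graph, linear))            # 0
print(total(graph, everything))        # -10
print(sorted(everything - set(linear)))  # ['del_b']
```

The first-parent walk gives a consistent total of 0, but the commit on "B" never appears in it, which is the "miss a lot of commits" cost.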

I haven't moved on the solution yet (we're trying to add Mercurial and possibly Bazaar first), but it's clear that we have to correct our branching/merging code. I suspect that the same problems are going to affect pretty much all DVCSes.

Our source control adapter code will in fact be published pretty soon (probably sometime this week).

Thanks, Robin


Adam Kennedy

11 months ago

"Ohloh now sees only one create commit, but two delete commits. This leads to negative line counts."

Oh dear, this sounds completely familiar to me :)

I wrote a tool years ago called cvsmonitor (that FishEye later "cloned" and made much better).

I had a fundamental bad assumption in my math that could be described as the following.

"The line total at a desired point in time is equal to the known line totals at a given point, plus the sum of the deltas between the known time and the desired time"

This works just fine on the trunk, but leaves you in a quandary of two choices.

EITHER

a) Track "line total over time" for only the trunk, ignoring all work occurring on branches.

b) Track "line total over time" including the branches but have incorrect totals.

Most people choose b) because it gives maximum credit for work done, and if it artificially inflates the apparent size of a project, well then most people are happy to quietly turn a blind eye.

It only really gets noticed when the totals go blatantly out of sync, or during branched deletes.

The solution, really, is that you need to track contributions, rates of change and total change DIFFERENTLY to the running line totals, because in a situation with multiple timelines the two unhook from each other.

You keep tracking change over time on branches and crediting them to contributors, but you derive the project line totals over time ONLY from the subset of changesets that occur on the trunk.
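
A minimal sketch of that split, assuming each changeset records its author, whether it sits on the trunk, and its added and removed line counts. The tuple layout, the author names, and the `summarize` function are invented for illustration; this is not code from cvsmonitor or Ohloh.

```python
from collections import defaultdict

# Hypothetical changesets: (author, on_trunk, lines_added, lines_removed).
changesets = [
    ("alice", True, 10, 0),   # add a 10-line file on the trunk
    ("alice", True, 0, 10),   # delete it on the trunk
    ("bob", False, 0, 10),    # delete the same file on a branch
    ("alice", True, 0, 0),    # merge the branch; no net change on the trunk
]


def summarize(changesets):
    """Credit churn from every changeset, but total lines from the trunk only."""
    churn = defaultdict(int)
    trunk_total = 0
    for author, on_trunk, added, removed in changesets:
        # Contribution tracking: every changeset counts, on any branch.
        churn[author] += added + removed
        # Running line total: only trunk changesets move it.
        if on_trunk:
            trunk_total += added - removed
    return dict(churn), trunk_total


churn, trunk_total = summarize(changesets)
print(churn)        # {'alice': 20, 'bob': 10}
print(trunk_total)  # 0
```

Bob still gets credit for the branch delete, while the project total stays at 0 instead of going negative.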