Ohloh causes DOS to my site

Bob Friesenhahn

8 months ago

It seems that ohloh is quite aggressive when accessing the CVS server. This may be OK for high-bandwidth sites, but my site has a usable outbound bandwidth of 60 kilobytes per second ("business class" ADSL).

For 24 hours now my site has experienced extremely high latencies and has been essentially unusable due to aggressive access by the ohloh bots as they access the rather large GraphicsMagick CVS repository. At times there has been access from three machines at once. The pounding continues so I can barely access this web site.

Is this normal? Is there a way to tune down the transfer rates so that ohloh is non-invasive like most 'bots'?

Thanks,

Bob


Robin Luckey

8 months ago

Hi Bob,

I'm sorry, but we don't easily have a way to throttle back our servers. On the contrary, we've spent most of our energy trying to figure out how to become faster. We work on the assumption that the vast majority of the code we track is hosted on major forges with large infrastructure.

The one consolation I can offer right now is that the initial downloads of the full history are the hard part, and after this, the incremental updates should represent a much lighter burden on your server.

It looks like our system is almost finished with the downloads, with only a single CVS module still in progress.

If this is a critical issue, let me know and I will kill the download, but be aware that this means that Ohloh won't be able to track this repository.


Bob Friesenhahn

8 months ago

A strange thing is that most of the project was absorbed fairly quickly but just one fairly small module seemed to take all day as if there was something special about it. So I de-registered the module and killed off the CVS processes.

I know that I need to buy a better router which can better load-share the outbound traffic.


Bob Friesenhahn

8 months ago

I decided to add back the enlistment which was causing the problem (VisualMagick). We shall see if the problem resumes. There is apparently something strange about this module which caused ohloh to churn on it for maybe 10 hours straight. The other modules updated quickly.


Bob Friesenhahn

8 months ago

My site is getting pounded again due to this module. I wonder what is so strange about it that it requires so many accesses and bandwidth to evaluate?


Bob Friesenhahn

8 months ago

This is absolutely nuts. At the current rate it will take days to update for this tiny bit of the source code repository. I am not sure what it is trying to do but it is now at "Step 1 of 3: Downloading source code history (Running 195/535)" and is keeping the ping times pegged to 16 seconds. It has been like this for 14 hours already, entirely blocking any other access to my site.

I will attempt to shut the access down now.


Robin Luckey

8 months ago

Hi Bob,

I spent some time digging into this problem this morning, and I think I was able to get to the bottom of it.

First, the big picture. Our CVS importer is fairly brute force: we request each revision, in chronological order, and we convert the local CVS working tree into a Git repository as we go.

For efficiency reasons, we do not do a CVS checkout of the entire tree with each revision. Instead, we do individual CVS update commands on the individual directories that are affected, according to the log.

Sometimes something goes wrong (usually a network hiccup or other transient effect), and we get an error of some kind during an update. When this happens, we wipe our working directory clean and do a full checkout of the entire tree. This should only happen rarely.

However, with the VisualMagick enlistment it seems to be happening with every single update, which is why this download has degenerated to an enormous amount of wasted time and bandwidth.

The issue seems to be that the CVS module name you provided, VisualMagick, is not the actual directory path of this module. Instead, it looks like there is some kind of symlink or redirect which occurs to reach this module name.

In order to do updates of individual directories, our CVS code internally does some string operations on the CVS paths, assuming that the provided module name is the same as the actual path in the CVS repository. But in this case, they aren't, and so we end up requesting some non-existent directories during CVS update. When the update fails, we wipe and checkout the full tree. This happens every time.
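The failure mode described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Ohloh's actual importer code: the function names, the `working_dirs` set, and the file `bin/foo.c` are hypothetical, and the real string handling may differ.

```python
import posixpath


def directory_to_update(module_name, repo_file_path):
    """Guess which working-tree directory a changed file lives in.

    Assumption (hypothetical): the importer strips the module name from the
    front of the repository path and treats the rest as a working-tree path.
    """
    prefix = module_name + "/"
    if repo_file_path.startswith(prefix):
        repo_file_path = repo_file_path[len(prefix):]
    # Files at the top of the module belong to the working-tree root.
    return posixpath.dirname(repo_file_path) or "."


def plan_step(module_name, repo_file_path, working_dirs):
    """Pick a cheap per-directory update, or fall back to a full checkout."""
    target = directory_to_update(module_name, repo_file_path)
    if target in working_dirs:
        return f"cvs update {target}"
    # The computed directory does not exist locally, so the update would fail.
    return "wipe working tree, full checkout"


# Directories present in the local working tree (hypothetical layout).
working_dirs = {".", "bin"}
changed_file = "platforms/VisualMagick/bin/foo.c"

# Module name equals the physical path: prefix is stripped, update is cheap.
print(plan_step("platforms/VisualMagick", changed_file, working_dirs))

# Module name is only an alias: nothing is stripped, so every step falls back.
print(plan_step("VisualMagick", changed_file, working_dirs))
```

With the alias name, the computed directory never matches anything in the working tree, so each revision would trigger the expensive fallback, which is consistent with the behavior reported in this thread.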

Ironically, all of this fancy string operating and update juggling is an attempt to dramatically reduce our burden on CVS servers, but in your case it has backfired.

I did a local test on my development box this morning using the actual CVS path platforms/VisualMagick, and the download worked correctly, with very quick updates of the deltas between revisions.

I imagine that this same problem afflicts modules TclMagick and BCBMagick, since the module names given for these also do not match their actual paths in CVS.

So the remedy is to remove the current enlistments from Ohloh, and replace them with ones in which the module name is the actual CVS path.

Out of curiosity, since I've done very little CVS hosting myself, how are these CVS redirects or symlinks or whatever they are accomplished? After 18,000 CVS modules, these are the first that have stumped us.

Thanks, Robin


Bob Friesenhahn

8 months ago

Module entries are defined via the CVSROOT/modules file and that is why they are called 'modules'. Use of physical directories is only a fallback and not actually recommended.
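As a rough illustration of how a tool could map a module name to its physical path before doing any path arithmetic, here is a minimal sketch. It assumes a simplified subset of the modules file syntax (`name directory` and `name -a path ...`) and skips every other form. The sample entry is hypothetical and was not copied from the real GraphicsMagick repository; only the path platforms/VisualMagick comes from this thread.

```python
def parse_modules(text):
    """Map each module name to the repository paths it expands to.

    Handles only two simple forms:
        name directory [files...]   (regular module, directory is kept)
        name -a path1 path2 ...     (alias module)
    Lines that use other options are skipped in this sketch.
    """
    modules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, *rest = line.split()
        if rest and rest[0] == "-a":
            paths = rest[1:]
        elif rest and not rest[0].startswith("-"):
            paths = rest[:1]
        else:
            continue
        modules[name] = paths
    return modules


def resolve(module_name, modules):
    """Return the physical paths for a module, or the name itself if undefined."""
    return modules.get(module_name, [module_name])


# Hypothetical modules file content, for illustration only.
SAMPLE = """\
# module name    physical directory
VisualMagick     platforms/VisualMagick
"""

modules = parse_modules(SAMPLE)
print(resolve("VisualMagick", modules))
print(resolve("platforms/VisualMagick", modules))
```

Both lookups end at the same physical directory, which is the path a per-directory update would need to work from.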


James Ross

8 months ago

The modules file can actually define multiple paths for a module, and even path exclusions, IIRC.

However, the modules file isn't versioned, which significantly reduces its usefulness for certain tasks/projects, and may explain the sparse use of it encountered by Ohloh.


Bob Friesenhahn

8 months ago

CVS itself does not version directories. If most projects do not use a modules file then it is likely because they have not read the documentation or they are lazy.

Regardless, I re-registered the problem module using a physical directory path rather than a module name and it took just a few minutes rather than a day.


Thorsten Glaser

8 months ago

You might want to get a mirror outside of your home which you update via, say, rsync. Ohloh can then pull from there. The MirOS Project does it that way (my home is also on ADSL).


Glenn Randers-Pehrson

8 months ago

I noticed that the libpng project on SourceForge got hammered to the tune of 800 pages per hour for about 24 hours, total about 24000 pages, right after I enlisted libpng several days ago, and then for another several hours a day or so later. I don't know if that is due to ohloh or if it was some robot traversing the repository via gitweb (which would result in 1.3GB of downloads instead of the 5MB that it takes to clone the git repository).


Bob Friesenhahn

8 months ago

When Ohloh was indexing correctly, it caused no problem for my site. When it went haywire, it maxed out the transfer rate. Any site which has less outbound bandwidth than Ohloh has inbound bandwidth would have been brought to its knees.

It seems wise for Ohloh to estimate the available bandwidth from the site it is sucking from and throttle back if the remote site is slower.
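One way such throttling could look is sketched below. This is a generic rate limiter and not anything Ohloh is known to implement: the `Throttle` class, the `polite_rate` helper, and the 50% fraction are all assumptions chosen for illustration.

```python
import time


def polite_rate(observed_bytes_per_sec, fraction=0.5):
    """Cap our own rate at a fraction of the throughput observed so far.

    The 0.5 default is an arbitrary assumption, not a measured value.
    """
    return observed_bytes_per_sec * fraction


class Throttle:
    """Keep the average transfer rate at or below max_bytes_per_sec."""

    def __init__(self, max_bytes_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(max_bytes_per_sec)
        self.clock = clock
        self.sleep = sleep
        self.start = clock()
        self.sent = 0

    def account(self, nbytes):
        """Record nbytes transferred; pause if we are ahead of schedule.

        Returns the number of seconds slept (0.0 if no pause was needed).
        """
        self.sent += nbytes
        # Earliest moment at which this many bytes fits under the cap.
        earliest = self.start + self.sent / self.rate
        delay = earliest - self.clock()
        if delay > 0:
            self.sleep(delay)
            return delay
        return 0.0


# Example: a remote site that appears to deliver about 60,000 bytes/second.
throttle = Throttle(polite_rate(60_000))
# After each chunk is received, the caller would run: throttle.account(len(chunk))
```

The client calls `account()` after each chunk it reads, so a slow remote site is never asked for more than the chosen share of what it appears able to deliver.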