Posted about 17 hours ago
Here at Team Rubber, we pride ourselves on working in a fairly agile manner. For the viral ad network in particular, this means that the most important thing at the end of an iteration is to ship working features, rather than to wait until everything is perfect before shipping.
As a side effect of this, optimisation is generally left until it becomes a noticeable issue. Obviously we worry about the complexity of our algorithms, and not designing ourselves into a corner – but I’m generally happy to take a constant speed reduction in exchange for faster development.
The code I’ve been working on this iteration was a different story, however, and I thought this story might be useful to people in the same situation.
My first implementation took about six CPU hours (in userspace alone!) to process just 1.5 GB of data. Sure, it scales sub-linearly, but I don’t want to have to bring loads of extra hardware on-line just to support this program.
The first thing to look at was the profiling information. pstats showed me that we were spending over 35 minutes looking up entries in my custom cache class – which is shared between different applications. This has a knock-on effect on the rest of the system, as our cached items have to expire within a set time – this 35 minute delay means at least 10,000 extra (expensive) cache misses during this run. Each cache miss takes an average of 0.03 CPU seconds, so that’s an extra five minutes (10,000 × 0.03 s = 300 s) on top.
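For readers who have not used pstats before, here is a minimal sketch of how such a profile can be collected and read. The workload (lookup_many) and the helper (profile_report) are hypothetical stand-ins, not the code from this project; only the cProfile and pstats calls are from the standard library.

```python
import cProfile
import io
import pstats


def lookup_many(cache, keys):
    """Hypothetical workload: count how many keys are present in the cache."""
    hits = 0
    for key in keys:
        if cache.get(key) is not None:
            hits += 1
    return hits


def profile_report(func, *args, top=5):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and print only the `top` most expensive entries.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


cache = {n: n * 2 for n in range(1000)}
hits, report = profile_report(lookup_many, cache, list(range(2000)))
print(report)
```

The report lists one row per function with call counts and cumulative time, which is where a hot spot such as a slow cache lookup would show up.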
Writing a C extension module to behave in the same way as my Python class proved relatively painless – and reduced memory overhead at the same time (a single empty dict takes up about 140 bytes of space). As I was working in C, I could easily optimise the structure for looking up the most common items, and tune the memory management for my use case.
Since I know this code isn’t going anywhere near Windows, I could also inline the gettimeofday() calls into my code – which saves a few Python function calls.
I tried two different methods – a binary tree (which let me drop entire sections of the cache at once), and a prioritised array (i.e. the most requested are highest up). Here are my results (tested over a mixture of lookups and assignment – a simulation of my real data):
Time for 9,000,000 entries:
Python: 44.56 seconds
C (tree): 31.20 seconds
C (array): 15.74 seconds
That’s about 15 minutes of CPU time saved already on my test data.
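To make the "prioritised array" idea concrete, here is a pure-Python sketch of one way such a cache could behave: a linear scan in which each hit moves the entry one slot towards the front, plus a fixed expiry time. It is an illustration under my own assumptions (the class name, the one-slot promotion rule and the injectable clock are all hypothetical), not the C extension described above.

```python
import time


class PrioritisedArrayCache:
    """Linear-scan cache in which frequently requested keys drift to the front."""

    def __init__(self, ttl_seconds, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so that expiry can be tested
        self._entries = []   # each entry is [key, value, expires_at]

    def set(self, key, value):
        expires_at = self._clock() + self._ttl
        for entry in self._entries:
            if entry[0] == key:
                entry[1], entry[2] = value, expires_at
                return
        self._entries.append([key, value, expires_at])

    def get(self, key, default=None):
        now = self._clock()
        for index, entry in enumerate(self._entries):
            if entry[0] != key:
                continue
            if entry[2] <= now:
                # Expired: drop the entry and report a miss.
                del self._entries[index]
                return default
            if index > 0:
                # Promote one slot so popular keys are found sooner next time.
                previous = self._entries[index - 1]
                self._entries[index - 1], self._entries[index] = entry, previous
            return entry[1]
        return default
```

A plain dict will beat this in Python; the layout only pays off in a language like C, where scanning a small contiguous array is very cheap.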
Next on my list of possible improvements was this little method:
def batchRow(self, rowdata):
    self._batchlist.append(rowdata)
    self._batchsize += 1
    if self._batchsize > self._batchmax:
        self.sendBatch()
It turns out that this single method was taking over 26 minutes of CPU time.
Why does this take so much time? Well I tend to write analysis code in a map and reduce fashion so it can be easily distributed (note: I’m not using a framework like Hadoop for this – I mean that in the functional way). This method is where the mappers add their results to a batch list ready for a reduce step, and it gets called many times for each entry being processed.
I tried re-writing this as such:
def batchRow(self, rowdata):
    self._batchlist.append(rowdata)
    i = self._batchsize + 1
    self._batchsize = i
    if i > self._batchmax:
        self.sendBatch()
This did give a noticeable speed improvement (due to one fewer property lookup), but not a significant enough one.
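The two pure-Python variants can be compared with a small self-contained harness like the one below. The surrounding class is an assumption made for illustration: the constructor, the sent list and the body of sendBatch are hypothetical, since the real class is not shown here.

```python
import timeit


class Batcher:
    def __init__(self, batchmax):
        self._batchmax = batchmax
        self._batchlist = []
        self._batchsize = 0
        self.sent = []  # hypothetical sink standing in for the reduce step

    def sendBatch(self):
        # Assumed behaviour: hand the batch on and start a new one.
        self.sent.append(self._batchlist)
        self._batchlist = []
        self._batchsize = 0

    def batchRow(self, rowdata):
        self._batchlist.append(rowdata)
        self._batchsize += 1
        if self._batchsize > self._batchmax:
            self.sendBatch()


class LocalVarBatcher(Batcher):
    def batchRow(self, rowdata):
        self._batchlist.append(rowdata)
        i = self._batchsize + 1  # keep the new size in a local variable
        self._batchsize = i
        if i > self._batchmax:
            self.sendBatch()


def time_batcher(cls, runs=100000):
    """Return the seconds taken for `runs` calls to batchRow on a fresh instance."""
    batcher = cls(batchmax=1000)
    return timeit.timeit(lambda: batcher.batchRow(None), number=runs)


if __name__ == "__main__":
    for cls in (Batcher, LocalVarBatcher):
        print(cls.__name__, time_batcher(cls))
```

Absolute timings will differ by machine and Python version, so treat the output as a relative comparison only.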
Clearly the easiest things to optimise were the property access, and the integer comparison – so I turned to Cython to (painlessly) re-write this method on a mixin class as such:
cdef class _MixinClass:
    cdef long _batchmax
    cdef long _batchsize
    cdef object _batchlist
    def batchRow(object self, object rowdata):
        self._batchlist.append(rowdata)
        self._batchsize += 1
        if self._batchsize > self._batchmax:
            self.sendBatch()
How this works:
The “cdef” statements serve two purposes – they provide the (C) type of the variables, and they move the class properties into a C struct.
For example, this generates C that looks something like this:
struct _MixinClass {
    PyObject_HEAD
    long _batchmax;
    long _batchsize;
    PyObject *_batchlist;
};
This means that property access becomes very cheap. The integer comparison part above becomes something far more like:
self->_batchsize += 1;
if (self->_batchsize > self->_batchmax) {
    // Do method call
}
Results (1,000,000 runs):
Python: 0.995 seconds
_MixinClass: 0.187 seconds
Subclass of _MixinClass: 0.223 seconds
That’s roughly a fivefold speed-up, which I’m more than happy with.
Posted about 21 hours ago
It took a while to restart our podcast “Data Without Borders”, but we are back on track with the third episode after the break, in which we talk about what happened at last week’s Internet Identity Workshop (IIW).
We also got two new people on board: Eve Maler from PayPal, who is also the chair of the Kantara Initiative’s User Managed Access Work Group, and Elias Bizannes, who like myself is a board member of the DataPortability Project.
Of course we also still have Trent Adams from the Internet Society (the organization behind the IETF) and another member of the DataPortability Project board: Steve Greenberg.
And last but not least, we are now an official podcast of the DataPortability Project. What that actually means in terms of content is probably nothing, and it should be noted that the opinions expressed in the podcast are our personal opinions and not some official opinion of the DataPortability Project unless stated otherwise.
The Internet Identity Workshop
I have heard a lot of very good things about the Internet Identity Workshop, which happens twice a year in Mountain View, CA. Unfortunately I have never made it there. But at least I know people who go, and this time all of my podcast co-hosts were there, so we talk about some of the topics discussed there, like the future of OpenID, Webfinger, LRDD, XRD and a lot of other upcoming and established standards.
But give it a listen yourself:
(Download MP3, 45 min, 46 MB)
For the shownotes check the blog entry.
You can also subscribe to us on iTunes.
We are also streaming the recording of new episodes live on Fridays at 1800 UTC. More information will be posted to the Data Without Borders Twitter account.
Posted 1 day ago by nor...@blogger.com (pyDanny)
When I'm evaluating a package to use in my work or play I tend to look at five things. I think many of my on-line colleagues look at a similar list. If it's missing too many of these things then odds are I'll go somewhere else for my needs or roll my own.
Documentation
Did the author bother with a README file? How about some Sphinx documentation? How complete is it? Does it get me started and give a few basic examples?
I'm okay with typos and mistakes. These happen. But I want to see
Licensing
Everyone has their own idea of what they like for licenses. I like the MIT/BSD thing. I can understand the attraction to LGPL and GPL although they aren't for me. What I can't stand and won't use are monstrosities like GPL/Commercial used by such libraries as ExtJs.
Want to make money off your software? Easy... let anyone use it and charge for support. Worked damn well for communities and companies like Python, Django, Plone, various Linux distributions (Redhat anyone?), etc...
Eggification
Is your software constructed so that it can be installed via easy_install or pip? And yes, this is a bit of mild embarrassment for me, so I'm happy enough to eggify other people's work.
Tests
Do you have tests? Even a nearly empty tests file or folder? How about a test application? If you have no tests then your package is suspect. How do I know it will work independently of your personal computer?
Code Quality
Does the code smell bad? Can it be easily extended? If it's innovative but the code needs work, is it on a DVCS so more people can easily contribute?
22 more posts to go!
Posted 1 day ago
We were getting along so well...
Posted 1 day ago
Plone was named best non-PHP-based open-source CMS by Packt Publishing for a second year in a row, winning a $2000 award for the Plone Foundation.
Posted 1 day ago by nor...@blogger.com (Schlepp)
And the winner is... Plone!
For the second year running Plone makes it to the winner's circle and walks away with $2000. Second place went to dotCMS and third place to mojoPortal.
I still think the category should be Best Python Open Source CMS. Best Other OS CMS would then include PHP and dotNET frameworks. ;-)
Remember that this is very much a black-box award, with community voting contributing to the nomination process and then final votes factoring in with Packt's judging panel in some mysterious way. Nonetheless, I don't recommend that the Plone Foundation turn down the prize money.
Also remember that Packt is in the business of selling books and their judging panel would be foolish if they didn't take book sales into account. Right now there are four main Plone titles from Packt and a fifth expected any day now.
By comparison (hey, I've got to have some metrics in this post) Drupal and Joomla have 16 titles each, while WordPress only has 8.
Meanwhile Plonistas, kick back and have a well-deserved cup o' joe (for those of us in the Western Hemisphere). The rest of you in Europe and Africa can go straight for other celebratory beverages. For Australia, New Zealand and the Far East, get out of bed and set off some fireworks!
Posted 1 day ago
Introduction
Earlier this year I was at PyCon in the US. I had an interesting
experience there: people were talking about the problem of packaging
and distributing Python libraries. People had the impression that this
was an urgent problem that hadn't been solved yet. I detected a vibe
asking for the Python core developers to please come and solve our
packaging problems for us.
I felt like I had stepped into a parallel universe. I've been using
powerful tools to assemble applications from Python packages
automatically for years now. Last summer at EuroPython, when this
discussion came up again, I maintained that packaging and distributing
Python libraries is a solved problem. I put the point strongly, to
make people think. I fully agree that the current solutions are
imperfect and that they can be improved in many ways. But I also
maintain that the current solutions are indeed solutions.
There is now a lot of packaging infrastructure in the Python
community, a lot of technology, and a lot of experience. I think that
for a lot of Python developers the historical background behind all
this is missing. I will try to provide one here. It's important to
realize that progress has been made, step by step, for more than a
decade now, and we have a fine infrastructure today.
I've named some important contributors to the Python packaging story,
but undoubtedly I've also failed to mention a lot of other important
names. My apologies in advance to those I missed.
The dawn of Python packaging
The Python world has been talking about solutions for packaging and
distributing Python libraries for a very long time. I remember when I
was new in the Python world about a decade ago in the late 90s, it was
considered important and urgent that the Python community implement
something like Perl's CPAN. I'm sure too that this debate had started
long before I started paying attention.
I've never used CPAN, but over the years I've seen it held up by many
as something that seriously contributes to the power of the Perl
language. With CPAN, I understand, you can search and browse Perl
packages and you can install them from the net.
So, lots of people were talking about a Python equivalent to CPAN with
some urgency. At the same time, the Python world didn't seem to move
very quickly on this front...
Distutils
The Distutils SIG (special interest group) was started in late 1998.
Greg Ward in the context of this discussion group started to create
Distutils about this time. Distutils allows you to structure your
Python project so that it has a setup.py. Through this
setup.py you can issue a variety of commands, such as creating a
tarball out of your project, or installing your project. Distutils
importantly also has infrastructure to help compiling C extensions for
your Python package. Distutils was added to the Python standard
library in Python 1.6, released in 2000.
Metadata
We now had a way to distribute and install Python packages, if we did
the distribution ourselves. We didn't have a centralized index (or
catalog) of packages yet, however. To work on this, the Catalog SIG
was started in the year 2000.
The first step was to standardize the metadata that could be cataloged
by any index of Python packages. Andrew Kuchling drove the effort on
this, culminating in PEP 241 in 2001, later updated by PEP 314.
Distutils was modified so it could work with this standardized
metadata.
PyPI
In late 2002, Richard Jones started work on the Python Package Index,
PyPI. PyPI is also known as the Cheeseshop, a name I prefer but which
apparently has been deprecated. The first work on an implementation
started, and PEP 301 that describes PyPI was also created
then. Distutils was extended so the metadata and packages themselves
could be uploaded to this package index. By 2003, the Python package
index was up and running.
The Python world now had a way to upload packages and metadata to a
central index. If we then manually downloaded a package we could
install it using setup.py thanks to Distutils.
Setuptools
Phillip Eby started work on Setuptools in 2004. Setuptools is a whole
range of extensions to Distutils, such as a binary installation
format (eggs), an automatic package installation tool, and the
definition and declaration of scripts for installation. Work continued
throughout 2005 and 2006, and feature after feature was added to
support a whole range of advanced usage scenarios.
By 2005, you could install packages automatically into your Python
interpreter using easy_install. Dependencies would be
automatically pulled in. If packages contained C code it would pull
in the binary egg, or if not available, it would compile one
automatically.
The sheer number of features that Setuptools brings to the table must
be stressed: namespace packages, optional dependencies, automatic
manifest building by inspecting version control systems, web scraping
to find packages in unusual places, recognition of complex version
numbering schemes, and so on, and so on. Some of these features
perhaps seem esoteric to many, but complex projects use many of them.
The problems of shared packages
The problem remained that all these packages were installed into your
Python interpreter. This is icky. People's site-packages
directories became a mess of packages. You also need root access to
easy_install a package into your system Python. Sharing all packages
in a directory in general, even locally, is not always a good idea: one
version of a library needed by one application might break another
one.
Solutions for this emerged in 2006.
Virtualenv
Ian Bicking drove one line of solutions: virtual-python, which evolved
into workingenv, which evolved into virtualenv in 2007. The concept
behind this approach is to allow the developer to create as many fully
working Python environments as they like from a central system
installation of Python. When the developer activates the virtualenv,
easy_install will install all packages into the virtualenv's
site-packages. This allows you to create a virtualenv per project
and thus isolate the projects from each other.
Buildout
In 2006 as well, Jim Fulton created Buildout, building on Setuptools
and easy_install. Buildout can create an isolated project environment
like virtualenv does, but is more ambitious: the goal is to create a
system for repeatable installations of potentially very complex
projects. Instead of writing an INSTALL.txt that tells others how
to install the prerequisites for a package (Python or not), with
Buildout these prerequisites can be installed automatically.
The brilliance of Buildout is that it is easily extensible with new
installation recipes. These recipes themselves are also installed
automatically from PyPI. This has spawned a whole ecosystem of
Buildout recipes that can do a whole range of things, from generating
documentation to installing MySQL.
Since Buildout came out of the Zope world, Buildout for a long time
was seen as something only Zope developers would use, but the
technology is not Zope-specific at all, and more and more developers
are picking up on it.
In 2008, Ian Bicking created an alternative for easy_install called
pip, also building on Setuptools. Less ambitious than Buildout, it
aimed to fix some of the shortcomings of easy_install. I haven't used
it myself yet, so I will leave it to others to go into details.
Setuptools and the standard library
The many improvements that Setuptools brought to the Python packaging
story hadn't made it into the Python Standard Library, where Distutils
was stagnating. Attempts had been made to bring Setuptools into the
standard library at some point during its development, but for one
reason or another these efforts had foundered.
Setuptools probably got where it is so quickly because it worked
around the often very slow process of adopting something into the standard
library, but that approach also helped confuse the situation for
Python developers.
Last year Tarek Ziade started looking into the topic of bringing
improvements into Distutils. There was a discussion just before PyCon
2009 about this topic between various Python developers as well, which
probably explains why the topic was in the air. I understood that some
decisions were made:
let the people with extensive packaging experience (such as Tarek)
drive this process.
free the metadata from Distutils and Setuptools so that other
packaging tools can make use of it more easily.
Distribute
By 2008, Setuptools had become a vital part of the Python development
infrastructure. Unfortunately the Setuptools development process has
some flaws. It is very centered around Phillip Eby. While he had been
extremely active before, by that time he was spending a lot less
energy on it. Because of the importance of the technology to the wider
community, various developers had started contributing improvements
and fixes, but these were piling up.
This year, after some period of trying to open up the Setuptools project itself, some of these developers led by Tarek Ziade decided to fork Setuptools. The fork is named Distribute. The aim is to develop the technology with a larger community of developers. One of the first big improvements of the Distribute project is Python 3 support.
Quite understandably this fork led to some friction between Tarek,
Phillip and others. I trust that this friction will resolve itself and
that the developers involved will continue to work with each other, as
all have something valuable to contribute.
Operating system packaging
One point that always comes up in discussions about Python packaging
tools is operating system packaging. In particular Linux distributions
have developed extremely powerful ways to distribute and install
complex libraries and applications, manage versions and dependencies
and so on.
Naturally when the topic of Python packaging comes up, people think
about operating system packaging solutions like this. Let me start off
by saying that I fully agree that Python packaging solutions can learn a lot
from operating system packaging solutions.
Why don't we just use a solution like that directly, though? Why is a
Python specific packaging solution necessary at all?
There are a number of answers to this. One is that operating system
packaging solutions aren't universal: if we decided to use Debian's
system, what would we do on Windows?
The most important answer however is that there are two related but
also very different use cases for packaging:
system administration: deploying and administrating existing software.
development: combining software to develop new software.
The Python packaging systems described above primarily try to solve
the development use case: I'm a Python developer, and I'm developing
multiple projects at the same time, perhaps in multiple versions, that
have different dependencies. I need to reuse packages created by other
developers, so I need an easy way to depend on such packages. These
packages are sometimes in a rather early state of development, or
perhaps I'm even creating a new one. If I want to improve such a
package I depend on, I need an easy way to start hacking on it.
Operating system packaging solutions as I've seen them used are ill
suited for the development use case. They are aimed at creating a
single consistent installation that is easy to upgrade with an eye on
security. Backwards compatibility is important. Packages tend to be
relatively mature.
For all I know it might indeed be possible to use an operating system
packaging tool as a good development packaging tool. But I've heard
very little about such practices. Please enlighten me if you
have.
It's also important to note that the Python world isn't as good as it
should be at supporting operating system packaging solutions. The
freeing up of package metadata from the confines of the setup.py
file into a more independently reusable format as was decided at PyCon
should help here.
Conclusions
We are now in a time of consolidation and opening up. Many of the
solutions pioneered by Setuptools are going to be polished to go into
the Python Standard Library. At the same time, the community
surrounding these technologies is opening up. By making metadata used
by Distutils and Setuptools more easily available to other systems,
new tools can also more easily be created.
The Python packaging story had many contributors over the years. We
now have a powerful infrastructure. Do we have an equivalent to CPAN?
I don't know enough about CPAN to be sure. But what we have is
certainly useful and valuable. In my parallel universe, I use advanced
Python packaging tools every day, and I recommend all Python
programmers to look into this technology if they haven't already. Join
me in my parallel universe!
Posted 1 day ago
To have some local control over pypi's availability, we're using an internal
proxy in front of pypi using collective.eggproxy. (I've written about that
before in collective.eggproxy improvements).
How do we release our (= The Health Agency) internal packages
that aren't on pypi? The quickest way was to use an
html page with svn url links, one per release. You can put the url to that
page in your buildout's find-links setting. Originally we maintained that
list by hand. I quickly wrote a script to handle it once the number of
packages got out of hand: tha.taglist.
We hit some problems with that approach due to subversion 1.6. An updated
setuptools/distribute that could handle those newer subversion checkouts
wasn't available everywhere. A real sdist tarball would be way better. The
solution:
I already had a script that finds all the tags in an svn structure
(including filtering, blacklists, etc.). The actual finding is extracted in
a small library: tha.tagfinder. I used the same tag to
create the html page with svn urls.
I added a script that goes through all those svn tag urls and that makes a
checkout + sdist tarball out of them. And copies it to a directory
structure, just like the simple pypi index:
libraryname/library-versionnumber.tar.gz.
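As a rough illustration of the second step, the sketch below maps an svn tag url to its place in such a directory structure and lists the commands a script might run. It is not the actual tha.taglist code: the function names, the example url and the assumed ".../library/tags/version" layout are all hypothetical.

```python
import posixpath


def parse_tag_url(tag_url):
    """Split '<...>/<library>/tags/<version>' into (library, version).

    The '<library>/tags/<version>' layout is an assumption of this sketch.
    """
    parts = tag_url.rstrip("/").split("/")
    if len(parts) < 3 or parts[-2] != "tags":
        raise ValueError("not a tag url: %r" % tag_url)
    return parts[-3], parts[-1]


def index_path(library, version):
    """Relative path of the tarball inside the simple-index-like directory."""
    return posixpath.join(library, "%s-%s.tar.gz" % (library, version))


def build_commands(tag_url, checkout_dir):
    """Commands a script could run: export the tag, then build an sdist."""
    return [
        ["svn", "export", tag_url, checkout_dir],
        # To be run with checkout_dir as the working directory.
        ["python", "setup.py", "sdist", "--formats=gztar"],
    ]


library, version = parse_tag_url(
    "https://svn.example.org/repo/tha.tagfinder/tags/0.3/")
print(index_path(library, version))
```

The resulting tarball would then be copied from the checkout's dist directory to the path returned by index_path.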
I changed the eggproxy's apache config to only pass certain requests through
to the proxy, using the information on
http://wiki.python.org/moin/PyPiImplementations .
The relevant portions from the apache config:
DocumentRoot /server/dir/var/private
# Allow indexing, but make it fancy and sort
# on version numbers
Options +Indexes
IndexOptions FancyIndexing VersionSort
RewriteEngine On
RewriteRule ^/icons/.* - [L]
# Catch all index.html, index.cgi: we handle it
RewriteRule ^/index\..* - [L]
# Use our var/private/project if available, proxy to
# the eggproxy at port 8080 otherwise.
# "!-f" means "if there's no file", "!-d" is about
# directories.
RewriteCond /server/dir/var/private/$1 !-f
RewriteCond /server/dir/var/private/$1 !-d
RewriteRule ^/([^/]+)/?$ http://localhost:8080/$1/ [P,L]
# Use our var/private/project/project-0.1.tar.gz if available,
# otherwise proxy to the eggproxy at 8080
RewriteCond /server/dir/var/private/$1 !-d
RewriteRule ^/([^/]+)/([^/]+)$ http://localhost:8080/$1/$2 [P,L]
Pretty luxurious. Just mention this one proxy (index = my.proxy.org) in
buildout and you're set for both your own packages and for pypi packages.
Just make sure the package names don't overlap :-)
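For anyone who finds mod_rewrite hard to read, the decision made by the rewrite rules in the config above can be paraphrased in Python roughly as follows. This is only a reading aid under my own assumptions: the function name route and the "local" return value are hypothetical, and Apache itself does the real work.

```python
import os
import re

PROXY = "http://localhost:8080"  # the eggproxy address used in the config


def route(request_path, document_root):
    """Return "local" when Apache serves the request itself, else the proxy url."""
    # Icons and index.* requests are always handled locally.
    if re.match(r"^/icons/.*", request_path):
        return "local"
    if re.match(r"^/index\..*", request_path):
        return "local"
    # /project/ : proxy only if there is no local file or directory.
    match = re.match(r"^/([^/]+)/?$", request_path)
    if match:
        target = os.path.join(document_root, match.group(1))
        if not os.path.isfile(target) and not os.path.isdir(target):
            return "%s/%s/" % (PROXY, match.group(1))
        return "local"
    # /project/file : proxy only if there is no local project directory.
    match = re.match(r"^/([^/]+)/([^/]+)$", request_path)
    if match:
        if not os.path.isdir(os.path.join(document_root, match.group(1))):
            return "%s/%s/%s" % (PROXY, match.group(1), match.group(2))
        return "local"
    return "local"
```

With a local directory called mypkg under the document root, route("/mypkg/", root) stays local while route("/other/", root) goes to the proxy.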
Posted 1 day ago
After the great Plone Conference in Budapest and its sprint, we took a huge step forward towards a working AGX transformation chain. Last Friday, November the 6th, we sprinted internally on the AGX engine and now have a working XMI to UML transformation. Some tiny bits are missing, but the overall transformation works. The UML to Python transformation is also about 40% finished.
We invite everybody interested in writing handlers for code generation to a three day development-sprint in Innsbruck.
Goals are:
Getting more people into AGX and showing them how easy it is to write AGX transforms.
Finish the Python code generation.
Invent a domain-specific UML language (profile) for Dexterity.
Generate Dexterity types for Plone 4.
It will be held at the office of Klein & Partner KEG. We start on November 26th at 9:00 am. Our office is available around the clock for all 3 days until Nov 28th. We can help with booking accommodation and recommend Hotel Zillertal (where we will probably get a 10% reduction - waiting for confirmation). Snacks, a lunch buffet and drinks are available. For those still needing x-mas presents I can recommend the famous Christkindl-Market.
Contact: jens@bluedynamics.com
Posted 4 days ago
Short summary about RuPy - a strongly dynamic conference. The philosophy of RuPy is to put together Python & Ruby experts with young programmers and to support a good communication channel for East-West exchange of prospective ideas.
Copyright © 2009 Geeknet, Inc., All Rights Reserved.