How Subtext’s Lucene.net index is structured

Lucene.net tutorial

How to get started with Lucene.net
Lucene.net: the main concepts
Lucene.net: your first application
Dissecting Lucene.net storage: Documents and Fields
Lucene - or how I stopped worrying, and learned to love unstructured data
How Subtext Lucene.net index is structured

In the last part of the tutorial about Lucene.net we talked about how to organize a Lucene index, and how important it is to have a well-planned strategy for it. In this post I’m going to show you how I applied those concepts and Nic’s tips during the design of the index for Subtext.

Requirements
Here are the requirements we are designing the index for:

Free-text searches using the search box
When someone comes from a search engine, show more results related to the search they did
Show more posts related to a post

The first two requirements are the usual ones: being able to search for some terms in the index. The last one requires more than just the list of terms: it’s a MoreLikeThis search, and it also needs the Term Vector to be stored.

Then there are other “hidden requirements”: a post can just be a draft (and I don’t want it to appear in searches), or it can be scheduled for future publishing (and again, I don’t want it to appear in search results). Then we also have the “aggregated blog”, which is a collection of all the blogs of the site. To make things even more complex, it’s not just one “wall”: blogs can be grouped in different “walls” (for example all blogs talking about Silverlight and all the ones talking about ASP.NET MVC). And last, users can decide not to push their posts to their group.

Structure of the Index
With all these requirements in mind, here is how Subtext’s index is structured:

| Name | Index | Store | TV | Boost | Description |
|---|---|---|---|---|---|
| Title | TOKENIZED | YES | YES | 2 | The title of the post |
| Body | TOKENIZED | NO | YES | - | Body of the post |
| Tags | TOKENIZED | NO | YES | 4 | List of tags |
| PubDate | UN_TOKENIZED | YES | NO | - | The publishing date |
| BlogID | UN_TOKENIZED | NO | NO | - | The id of the blog |
| Published | UN_TOKENIZED | NO | NO | - | Is post draft or not? |
| GroupID | UN_TOKENIZED | NO | NO | - | The group id (0 if not pushed to aggregator) |
| PostURL | NO | YES | NO | - | The URL of the post |
| BlogName | NO | YES | NO | - | The name of the blog |
| PostID | UN_TOKENIZED | YES | NO | - | The id of the post |

Explaining why
Let’s explain it a bit more. The only fields that need to be full-text searched are the ones that contain some kind of real content: so Title, Body and Tags are the only ones that need to be analyzed and tokenized.
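To make the roles of the fields easier to see, the index structure can also be written down as plain data. This is only an illustrative sketch in Python, not Lucene.net code: `FieldSpec`, `SCHEMA` and the three helper functions are hypothetical names, and the values are copied from the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FieldSpec:
    """One row of the index design (hypothetical helper type)."""
    name: str
    index: str                 # "TOKENIZED", "UN_TOKENIZED" or "NO"
    store: bool                # is the original value kept for display?
    term_vector: bool          # is the term vector stored (for MoreLikeThis)?
    boost: Optional[float] = None


# Values copied from the index structure table.
SCHEMA: List[FieldSpec] = [
    FieldSpec("Title", "TOKENIZED", True, True, 2.0),
    FieldSpec("Body", "TOKENIZED", False, True),
    FieldSpec("Tags", "TOKENIZED", False, True, 4.0),
    FieldSpec("PubDate", "UN_TOKENIZED", True, False),
    FieldSpec("BlogID", "UN_TOKENIZED", False, False),
    FieldSpec("Published", "UN_TOKENIZED", False, False),
    FieldSpec("GroupID", "UN_TOKENIZED", False, False),
    FieldSpec("PostURL", "NO", True, False),
    FieldSpec("BlogName", "NO", True, False),
    FieldSpec("PostID", "UN_TOKENIZED", True, False),
]


def full_text_fields(schema: List[FieldSpec]) -> List[str]:
    """Fields that are analyzed and can be searched as free text."""
    return [f.name for f in schema if f.index == "TOKENIZED"]


def exact_match_fields(schema: List[FieldSpec]) -> List[str]:
    """Fields indexed as a single term, usable only as exact criteria."""
    return [f.name for f in schema if f.index == "UN_TOKENIZED"]


def stored_fields(schema: List[FieldSpec]) -> List[str]:
    """Fields whose value can be read back from a search hit."""
    return [f.name for f in schema if f.store]
```

Written this way, it is easy to check that the fields carrying a term vector are exactly the tokenized ones, which is what a "more like this" search would rely on.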

But to comply with all the other requirements, when we do a search we also have to search using other criteria:

PubDate must be less than Now
Published must be true
BlogID must be the one of the blog I’m searching from (when searching inside a single blog)
GroupID must be the one of the aggregated site I’m searching from (when searching inside an aggregated site)

So I also needed to index the fields above, but since they are single terms I don’t need to tokenize them.
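The extra criteria above amount to a simple visibility rule. The sketch below models that rule in plain Python rather than as Lucene.net queries. `IndexedPost` and `is_visible` are hypothetical names, and it assumes that real group ids are never 0, since 0 marks a post that was not pushed to the aggregator.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class IndexedPost:
    """The non-content fields of an indexed post (hypothetical type)."""
    post_id: int
    blog_id: int
    group_id: int        # 0 means "not pushed to the aggregator"
    published: bool
    pub_date: datetime


def is_visible(post: IndexedPost,
               now: datetime,
               blog_id: Optional[int] = None,
               group_id: Optional[int] = None) -> bool:
    """Return True if the post may appear in search results.

    Pass blog_id when searching inside a single blog, or group_id when
    searching inside an aggregated site.
    """
    if not post.published:          # drafts never show up
        return False
    if not post.pub_date < now:     # PubDate must be less than Now
        return False
    if blog_id is not None and post.blog_id != blog_id:
        return False
    if group_id is not None and post.group_id != group_id:
        return False
    return True
```

In the real index each of these checks would presumably become a required clause next to the free-text part of the query. The function only shows which combinations should pass.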

And I also need the PostID, since when we use the MoreLikeThis query I have to supply Lucene with the id of the document for which I want to find similar documents.

And finally, a row in the results will be like:

Dissecting Lucene.net storage: Documents and Fields – Sept 4th, 2009 (CodeClimber)

So the only fields I need to retrieve, and thus store, are Title, PubDate, BlogName (shown in case I’m doing a search from the aggregated site) and obviously the URL to link to the complete post.

What do you think? Am I missing something? Would you have done something differently? Please answer with your comments.

The next step
Now that the index has been designed, in the next post we’ll cover some infrastructural code, and show how the search engine service works inside Subtext.

Disclaimer: This is all work in progress and might be (and probably will be) different from the final version of the search engine service that will be included into the next version of Subtext.

Tags: Lucene.net,Document,Field,Subtext

How to get started with Lucene.net

A few weeks ago I expressed my intention of introducing Lucene.net into Subtext, and said that I would write about the journey. In this post I’m going to give some hints on how to get started with Lucene.net.

Download the bits
Unfortunately Lucene.net has not officially released a new version since March 2007, when Lucene.net 2.0 came out. But since 2007 a lot of new features and bug fixes were introduced, so the best way to get the latest bits is to download the latest tag from the SVN repository. At the time of writing the latest tag is version 2.3.2. This is also the recommendation of the developers, since “bureaucracy” is preventing the official release from happening.

And then, once you have the latest bits from the Subversion repository, you have to start the build process. The solution file available in the repository is still in VS2005 format, so either you use that version, or you run the file through the upgrade wizard and use VS2008.

The structure of the repository is quite deep, so it’s worth saying a few words about it before going on:

src: contains 3 subfolders with the source for the core library, the unit tests and a few demo projects that show the main usage scenarios
contrib: contains a few external libraries that enhance the features of Lucene. There is a small utility class that helps highlight the matching terms, another lib that helps build similarity queries, a spellchecker, stemmers (a “thing” that normalizes the text to be indexed) and also a distributed search engine

I hope the developers of Lucene.net find a way to publish an official updated release, because seeing a 2-year-old release might frighten people and make them think that this project is not maintained any more.

Or alternatively you can skip to the end of this post and download the assembly I built to make life easier for you.

Now that you’ve got a binary version of Lucene.net, the next step is learning how to use it. I’m going to write some more posts on this in the next weeks, but if you want to start reading directly from the original documents, in the next section you’ll find where to go.

Find the documentation
Documentation is probably the thing that is lacking the most in this project (and this is one of the reasons why I decided to write this series of posts). But since it is a class-by-class port of the Java version, you can find a bit more information on the Lucene for Java website (but just a bit).

I would suggest starting from the overview available in the JavaDoc, which explains the main packages and a basic usage scenario. And then go for the MSDN-style documentation for Lucene.net (but be careful since it’s related to Lucene.net 2.1). And some more unstructured documents are available in the wiki.

Finally, if you want to get deeper into Lucene, a book from Manning is available: Lucene In Action. It’s a bit old and talks about Lucene 1.4. But most of the key concepts are still the same. There is also the second edition of the book, still in MEAP: Lucene in Action, Second Edition. But it talks about the forthcoming version 3.0, so a bit too much if you are interested in Lucene.net which is still at version 2.3.

That reminds me of how important good documentation is for opensource projects: even if you built the best OSS library in the world, if it’s not well documented all your efforts are useless. But that’s probably a topic for another post.

Next
Now you know how to get Lucene.net and where to look to find more information. In a future post I’ll write more about how I integrated Lucene.net into Subtext, starting from the bootstrapping of the index.

If you trust me, and want to avoid all the hassle of getting the source from SVN, migrating the solution to VS2008 and building it, I did it for you, and you can download the main Lucene.net 2.3.2 library.

Download Lucene.net 2.3.2

Tags: lucene.net,lucene

CKEditor 3.0 is out

After a long period of development, the new major release of FCKEditor has shipped.

The most notable change is the new name: it went from FCKEditor, which many people didn’t like because of its similarity to an English bad word (it was made from the initials of the project founder, Frederico Caldeira Knabben) to CKEditor, where CK stands for “content & knowledge”.

But obviously this is not the only change: it’s amazingly fast, it has a completely new UI, it uses JS modal dialogs instead of popups, it produces valid XHTML code (and hopefully it will not screw up the code snippets written with WLW) and much more. Read the official release note to know more: CKEditor 3.0 is here!

I also recommend playing with the demo for a more complete experience. It’s impressive!

At the moment there is no server-side integration like there was with FCKEditor 2.x, but given the very easy JavaScript API this is not a big deal. The feature that is really missing is an integrated file/image browser: if you are building a closed source project you can use the other product built by CKSource, CKFinder, which is a very powerful file browser. But unfortunately it doesn’t come with an opensource license: this prevents the usage of CKEditor in opensource projects like Subtext and DotNetNuke.

They say there will be a trimmed-down CKFinder with CKEditor 3.1 but until then, don’t expect to see CKEditor integrated with Subtext.

Tags: CKEditor,Subtext,FCKEditor

Scheduling post while on vacation: how did it go?

As you might have noticed if you followed my twitter stream, I spent the last two weeks in Hokkaido, doing some trekking and enjoying the onsens. But still 5 new posts appeared on my blog. This happened thanks to a feature of Subtext that allows posting in the future.

Before leaving for Japan I was undecided about scheduling posts while I was on vacation for two weeks, but then I decided to try this experiment.

But now I’m back and I wanted to ask you what you think of what I did: did you like the fact that new posts were appearing? Or do you think my vacation would have gone unnoticed? Or maybe you think I should have just stopped writing posts since I was not available to respond to comments. Please let me know your thoughts by answering the following poll, and by writing a comment. Thank you.

Tags: poll,blogging

Reducing the bounce rate of tech blogs with Subtext and Lucene.net

In this post I’m going to explain the reason behind my decision to introduce Lucene.net into Subtext to power the internal search engine.

The problem: high bounce rate
It all started a few weeks ago, when I noticed that I get a lot of visitors from search engines (around 70%) but that they rarely look at more than one page (only 15% read a second page).

I was interested in knowing if this was just a problem of my blog, or a general problem of all tech/dev oriented blogs. So I ran a quick poll over twitter, and I found out that I’m not alone: 66% of the people that responded state that they have an average of less than 2 page views per visit on their developer-oriented blog. The remaining 33% state that they have more than 2 pages per visit (actually 24% even more than 3). But this is probably due to different metrics or ways of counting pages per visit (I get 7.3 pages per visit if I look at the stats provided by my provider with AwStats).

In a few words, developer-focused blogs don’t retain the reader for more than one or two pages.

The possible options
I started thinking about ways to reduce the bounce rate, both as a pure SEO exercise and because this will help readers that come from search engines to find more posts about the keywords they were interested in. Some possible solutions are:

Show the best or latest posts of the blog more prominently
Show a list of posts similar to the one the visitor is reading
If the visitor comes from a search engine, show other posts that match the same keywords

The first option is not something that can be incorporated into a blogging engine as it “only” requires an update in the design of the blog skin. And it won’t give a lot of benefits to the readers because the “best” or “latest” posts might not be about what the reader is looking for.

The solution
So I decided to focus on the other two upgrades, which can be easily introduced into Subtext (or any other blog engine) and that will give a bigger benefit to the reader. And I’m going to develop the following “widgets”:

More posts like this, taking inspiration from the Similar Posts plugin from WordPress
More search results, taking inspiration from the Search Relevancy Extension written by Keyvan Nayyeri for Graffiti

Why Lucene.net?
Subtext already has an internal search engine that I could have leveraged to power those two widgets, so some of you might ask why I’m planning to use Lucene.net. The reason is quite simple: Lucene.net is a powerful full-text search engine, with advanced features that allow fast and consistent searches, and that allows the kind of “more like this” search that I need for one of the two widgets. Subtext already has a “Similar Posts” control, but it relies on the categories of posts, so it is not really that accurate.

And last but not least, Lucene.net will make the internal search engine more accurate, so I’m planning to completely change the search implementation of Subtext.

What’s next
I have never used Lucene.net in a real application, so this will also be an interesting journey that I’ll take with you, my dear readers. I’m planning on writing on this blog about how the integration of Lucene.net in Subtext is going, and I’m also going to write a series of posts about how to use Lucene.net. So, stay tuned!!

Tags: Lucene.net,SEO,Subtext

Subtext goes to Google Code

Subtext finally leaves the SourceForge Hell and moves source code and issue tracking to Google Code.

What does that mean for users?
First of all that means that you won't have to fight with the bad user experience of SourceForge any more.

Then, if you were reading and asking questions in the SF Forums, now you have to subscribe to the Subtext group on Google Groups.

For the moment the latest stable version on SourceForge is still version 2.1. The very latest version (2.1.1) is available for download and install from the Windows Web App Gallery.

I'm pretty excited by this move, since I was really annoyed by the lack of new features and usability of SourceForge.

Google, Subtext

Subtext 2.1 released

Taking advantage of the Thanksgiving Day’s holiday, Phil wrapped up a few changes and bug fixes and released the first update of Subtext 2: Subtext 2.1.

This new version also fixes one flaw that could allow a potential XSS attack via a comment.

To read more about the release, go to Phil’s post: Subtext 2.1 Released! Contains Security Update.

If you are on Subtext 2.0 there are no database schema changes, so just replace the dlls and merge the web.config file, and you are done.

Download the latest version here.

Technorati Tags: subtext

Subtext 2.0: Bugs, Features, and Patches

Plenty of other folks have already announced that the Subtext 2.0 bits finally dropped this past Sunday afternoon, hot off the CI server. And by finally, I mean: over a year after the last official release, and four months after we said it was just around the corner.

Good things come to those who wait
At least that’s what the Heinz company says. Or maybe they just had a brilliant marketing department.

Anyhow, Simone Chiaretta and several other folks have already hit the release highlights, so I’ll just steal their summaries:

Top notch support for Windows Live Writer thanks to some patches and check-ins from Tim Heuer
New CSS-based admin design that makes better use of space
Support for mobile skins (and a default mobile skin if your favorite skin doesn’t have mobile support built in)
Streamlined installation process
Support for Enclosures
CSS and JS optimizations
Setting a date in the future for publishing posts
Login to your blog using OpenID, as well as use your blog as an OpenID Delegate

Ch-ch-cha-changes!
As you can see, the new bits are packed with a metric crap-load of bug fixes, new features, and patches. And yes, metric crap-load is a technical term and a real unit of measure… or not, whatever.

But hey, you don’t have to take my word for it. You can get a full list of the changes here:

Bug Fixes
Delivered Feature Requests
Accepted Patches

Go get you some
Oh yeah, and I should probably link to the new bits: DOWNLOAD

Technorati Tags: subtext, open source, openid

Subtext 2.0 released

As I already anticipated a few days ago, Subtext was on its way to be released. And today Phil just announced it: Subtext 2.0 has been released, one year and a few months after the previous version 1.9.5.

I already explained some of the new features of Subtext 2.0:

Publish in the future
JS and CSS performance optimization
Enclosures

but Subtext 2.0 also brings to the table:

Enhanced MetaWeblog API implementation
Enhanced WLW implementation
New CSS-based admin layout
Mobile-skin support
OpenID support, both to log in to the admin and to use your blog as an OpenID delegate
and many bug fixes

Read Phil’s blog for more details on the release notes and for the future plans for Subtext (MVC anyone?).

And if you want to upgrade your blog, or give Subtext a try, here is the URL for the download: Subtext 2.0 on SourceForge.

Technorati Tags: Subtext

New feature in Subtext 2.0: publish in the future

In the last two days I wrote about two new features of Subtext 2.0: Enclosures and performance optimizations for skins.

Today I’ll talk about another new feature of Subtext 2.0: Future Posting.

I already anticipated this feature while I was testing it last week: posting in the future allows the blog’s author to write a post either with the post editor online or with any offline blogging application (like Windows Live Writer), set the date and time, save it on the server as published (not as draft), and have it automatically published both on the blog and the RSS feed at the specified moment.

This is useful when writing series of posts: you write all your posts at once, and then schedule them to appear online one per week or per day or every couple of days. (Which is what I’ve been doing with this series of posts.)

It’s also useful when you go on holiday but you still want to publish something on your blog: you can write some posts before you leave and set their publish dates in the future, so that they appear while you are relaxing on some Maldivian beach.

How to set the publishing date of a post?
If you are using the online post editor, you have to set the Post Date (in en-US format mm/dd/yyyy hh:mm:ss AM/PM) in the Advanced Options panel. The time must be in the blog’s time zone. Expect some enhancements in a future release of Subtext, like an Ajax datetime picker, or similar UI elements.

If you are using Windows Live Writer, you can set the date of the post with the date picker on the bottom toolbar.
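As a small illustration of the en-US format the online editor expects, here is how such a date string could be parsed in Python. This is only a sketch, not Subtext code: `parse_post_date` and `is_scheduled` are hypothetical helpers, and the result is a naive date-time that is assumed to be in the blog’s time zone.

```python
from datetime import datetime

# mm/dd/yyyy hh:mm:ss AM/PM, as required by the online post editor.
POST_DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_post_date(text: str) -> datetime:
    """Parse a post date typed in the en-US format.

    Raises ValueError if the text does not match the format.
    """
    return datetime.strptime(text.strip(), POST_DATE_FORMAT)


def is_scheduled(post_date: datetime, now: datetime) -> bool:
    """A post is scheduled (not yet visible) while its date is in the future.

    Both values are assumed to be in the blog's time zone.
    """
    return post_date > now
```

Typing the date in any other layout, for example the ISO style `2009-09-04 14:30`, makes the parser raise an error instead of guessing, which is why a date picker would be a welcome enhancement.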

How do I know that a post will be published in the future?
A post published in the future will not be accessible online until the publish date has been reached, but you can see the scheduled posts by looking at the list of posts in the admin. If the post is scheduled for future publishing, the list will tell you when this will happen.

How dates relate to URLs
Subtext stores 3 different dates for each post:

the date on which the post was created
the last modified date
and the publish date

Until Subtext 2.0 the permalink of the post was generated using the date of creation, but now that it’s possible to post in the future this no longer makes sense: if you write 10 posts and schedule them for future publishing, they would all appear as created on the same day.

So now the permalink will be generated using the publish date, so that also the url shows the real publish date.

But to keep backward compatibility, and not change the URL of past posts, the upgrade procedure to Subtext 2.0 also includes a little update script that copies the date of creation into the publish date.
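The idea behind the upgrade step can be sketched in a few lines of Python. This is not the actual update script: `Post`, `backfill_publish_date` and `permalink` are hypothetical names, and the `/archive/yyyy/mm/dd/slug.aspx` URL pattern is assumed here only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Post:
    """The three dates stored for each post (hypothetical type)."""
    slug: str
    date_created: datetime
    date_modified: datetime
    date_published: Optional[datetime] = None   # missing on pre-2.0 posts


def backfill_publish_date(posts: List[Post]) -> None:
    """Upgrade step: copy the creation date into a missing publish date."""
    for post in posts:
        if post.date_published is None:
            post.date_published = post.date_created


def permalink(post: Post) -> str:
    """Build the permalink from the publish date (URL pattern is assumed)."""
    if post.date_published is None:
        raise ValueError("post has no publish date; run the upgrade step first")
    return f"/archive/{post.date_published:%Y/%m/%d}/{post.slug}.aspx"
```

After the backfill, an old post keeps the same URL it had when permalinks were based on the creation date, while a post written today and scheduled for next week gets next week’s date in its URL.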

Feedback? Questions?
As usual, if you have questions, bug reports and feedback just write on the forums and report bugs.

That’s all for the features I implemented in Subtext 2.0: tomorrow I’ll write about how the frontend performance optimization works behind the scenes. Please subscribe to the feed if you want to get it automatically in your feedreader.

Technorati Tags: subtext,future post