bts (BigTable Search) aims to bring scalable full-text search to Python applications hosted on Google's App Engine. bts will hopefully be unnecessary once Google includes full-text search with App Engine, but there's no specific date for that feature and we haven't seen it show up on the roadmap yet.
bts grew out of the search implementation used on http://www.bidtective.com and its features are currently limited to what that particular application requires, namely:
- Stemming and stopword removal
- Multi-term queries with optional matching (i.e. not all words need to be present to match a document)
- Faceted search (filtering search results by one or more low-cardinality facets, as well as retrieving hit counts for individual facets)
- Indexing multiple fields, without needing to update index.yaml
- Weighting of fields to determine relevancy of search results
- Native indexing of data types such as dates and numbers to support range queries
- Sorting of results based on relevance
- Starts-with searching (useful for type-ahead completion)
- Support for materialized properties (properties that are derived and saved at time of indexing)

bts currently has the following limitations:
- Supports only English language text.
- Scalability has not yet been verified. We're constantly testing and improving performance, but we're not yet at a point where we can confidently say to what level it scales. We do know that the asynchronous approach to index updates keeps bts from noticeably impacting online transactions. We also know that updating the inverted index is somewhat costly, so if you're doing a lot of adds/updates to the index, you'll want plenty of CPU quota.
- Right now, only relevancy sorting is supported.
- Because indexing is performed asynchronously and because App Engine places a fixed upper limit on task queue insertions, there is a fixed upper limit to how many objects can be indexed in one day. Realistically, we try to minimize task queue insertions, so you won't run into this problem unless you're updating upwards of 50,000 objects per day.

The basic architectural approach is as follows:
- Store terms in an inverted index on a per-field basis
- Include a derived 'bts_all' field to support searching across all full-text fields
- Store non-text fields in their native datatype (e.g. integer, date, reference, etc.)
- To query:
  1. Retrieve keys from inverted index
  2. Merge keys in memory
  3. Sort results by cumulative relevancy of matching terms
  4. Retrieve objects corresponding to the keys
- Indexing is performed asynchronously using task queues and broken into multiple steps to ensure that indexing operations complete within the 30 second app engine time window (even for large objects)
- Indexing is a two-step process:
  1. Inspect the object to determine what terms have been added or removed and write these to a TermUpdates queue table
  2. Merge the TermUpdates into the inverted index
- The merge process is partitioned into parallel processes by term type (e.g. integer, string, etc.) and for strings is partitioned by letter of the alphabet
- Each merge process is handled as a task in a task queue that re-queues itself until complete, thereby allowing merge processing to continue indefinitely but stay within the 30-second app engine processing window.
- To avoid stressing the system, the queue for merge tasks has a low rate and bucket size.
- The tasks are initially queued and periodically re-queued by a cron job.
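The four query steps above can be illustrated with a minimal sketch in plain Python. This is not the bts code: it uses in-memory dictionaries in place of the datastore, skips stemming and stopword removal, and the names `build_index`, `search`, and the sample `docs` and weights are hypothetical. Only the 'bts_all' field name comes from the description above.

```python
from collections import defaultdict

ALL_FIELD = "bts_all"  # derived all-fields index, named as in the description


def build_index(docs, field_weights):
    """Return index[field][term] -> {doc_key: accumulated weight}.

    Hypothetical in-memory stand-in for a per-field inverted index.
    """
    index = defaultdict(lambda: defaultdict(dict))
    for key, fields in docs.items():
        for field, text in fields.items():
            weight = field_weights.get(field, 1)
            for term in text.lower().split():
                # Post to the field's own index and to the derived all-fields index.
                for target in (field, ALL_FIELD):
                    postings = index[target][term]
                    postings[key] = postings.get(key, 0) + weight
    return index


def search(index, docs, query, field=ALL_FIELD):
    """Optional matching: a document scores if it contains any query term."""
    scores = defaultdict(int)
    # Steps 1 and 2: retrieve the postings for each term and merge them in memory.
    for term in query.lower().split():
        for key, weight in index[field].get(term, {}).items():
            scores[key] += weight
    # Step 3: sort by cumulative relevancy (key used as a deterministic tie-break).
    ranked = sorted(scores, key=lambda k: (-scores[k], k))
    # Step 4: retrieve the objects corresponding to the ranked keys.
    return [(key, scores[key], docs[key]) for key in ranked]


docs = {
    "d1": {"title": "red bicycle", "body": "a fast red road bicycle"},
    "d2": {"title": "blue car", "body": "a fast car"},
}
index = build_index(docs, {"title": 3, "body": 1})
results = search(index, docs, "red fast")
print([(key, score) for key, score, _ in results])  # [('d1', 5), ('d2', 1)]
```

Here "d2" still matches although it lacks "red", which is the optional-matching behaviour listed among the features, and the title weight of 3 is what pushes "d1" to the top.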
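The two-step indexing process and the partitioned merge described above might look roughly like the following sketch. It is an assumption-laden illustration, not the bts implementation: a dictionary of lists stands in for the TermUpdates queue table, `batch_size` stands in for the 30-second time budget, and all function names and the partition key format are hypothetical.

```python
from collections import defaultdict


def diff_terms(old_terms, new_terms):
    """Step 1: work out which terms were added or removed for one object."""
    old, new = set(old_terms), set(new_terms)
    return new - old, old - new


def partition_for(term):
    """Partition by term type; strings are further split by first letter."""
    if isinstance(term, str):
        return "string:" + term[:1].lower()
    return type(term).__name__


def queue_updates(queue, doc_key, old_terms, new_terms):
    """Write the term changes to the (stand-in) TermUpdates queue."""
    added, removed = diff_terms(old_terms, new_terms)
    for term in added:
        queue[partition_for(term)].append(("add", term, doc_key))
    for term in removed:
        queue[partition_for(term)].append(("remove", term, doc_key))


def merge_partition(queue, index, partition, batch_size):
    """Step 2: merge up to batch_size pending updates into the inverted index.

    Returns True if work remains; a real task would re-queue itself then.
    """
    pending = queue[partition]
    batch, queue[partition] = pending[:batch_size], pending[batch_size:]
    for op, term, doc_key in batch:
        if op == "add":
            index[term].add(doc_key)
        else:
            index[term].discard(doc_key)
    return bool(queue[partition])


queue = defaultdict(list)  # partition -> pending updates
index = defaultdict(set)   # term -> set of document keys
queue_updates(queue, "d1", old_terms=[], new_terms=["apple", "avocado", "banana", 2013])

# Each loop iteration plays the role of one short-lived task run.
while merge_partition(queue, index, "string:a", batch_size=1):
    pass
```

Keeping each run to a small batch is what lets a merge of any size finish as a chain of short tasks. Because partitions do not share terms, they can be merged in parallel without touching the same index entries.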