On April 12, Amazon launched a new beta service, CloudSearch, which:
is a fully-managed search service in the cloud that allows customers to easily integrate fast and highly scalable search functionality into their applications.
This latest offering is based on search techniques developed by A9 and provides three features to configure and customize search results, as summarized by Dr. Werner Vogels, CTO at Amazon.com:
Filtering: Conceptually, this is using a match in a document field to restrict the match set. For example, if documents have a "color" field, you can filter the matches for the color "red".
Ranking: Search has at least two major phases: matching and ranking. The query specifies which documents match, generating a match set. After that, scores are computed (or direct sort criterion is applied) for each of the matching documents to rank them best to worst. Amazon CloudSearch provides the ability to have customized ranking functions to fine tune the search results.
Faceting: Faceting allows you to categorize your search results into refinements on which the user can further search. For example, a user might search for ‘umbrellas’, and facets allow you to group the results by price, such as $0-$10, $10-$20, $20-$40, etc. Amazon CloudSearch also allows for result counts to be included in facets, so that each refinement has a count of the number of documents in that group. The example could then be: $0-$10 (4 items), $10-$20 (123 items), $20-$40 (57 items), etc.
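Taken together, these three features map onto query-time parameters of a domain's search endpoint. The following is a minimal sketch in Python, assuming a hypothetical domain endpoint and illustrative field names (`color` as a text field, `price` as a numeric field); the `q`, `bq`, `rank`, and `facet` parameters follow the 2011-02-01 API version the service launched with:

```python
import requests  # third-party HTTP library (pip install requests)

# Hypothetical endpoint; the real one is generated when a domain is created.
SEARCH = "http://search-products-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com"

params = {
    "q": "umbrellas",                         # matching: free-text query
    "bq": "(and 'umbrellas' color:'red')",    # filtering: restrict on a field
    "rank": "-text_relevance",                # ranking: relevance, descending
    "facet": "price",                         # faceting: request counts by price
    "facet-price-constraints": "0..10,10..20,20..40",  # price buckets
    "return-fields": "title,price",
    "size": 10,
}

resp = requests.get(SEARCH + "/2011-02-01/search", params=params)
results = resp.json()
for hit in results["hits"]["hit"]:
    print(hit["id"], hit.get("data"))
print(results.get("facets"))  # e.g. document counts per price bucket
```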
Jeff Barr, Evangelist at Amazon.com, has shared a blog post introducing the service offering, in which he outlines three relatively straightforward steps to setting up search for an application:
1. Create and configure a Search Domain. This is a data container and a related set of services. It exists within a particular Availability Zone of a single AWS Region (initially US East).
2. Upload your documents. Documents can be uploaded as JSON or XML that conforms to our Search Data Format (SDF). Uploaded documents will typically be searchable within seconds. You can, if you'd like, send data over an HTTPS connection to protect it while it is in transit.
3. Perform searches.
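For step 2, an SDF batch is a JSON array of add and delete operations posted to the domain's document service. A minimal sketch, again assuming a hypothetical endpoint and illustrative fields:

```python
import json
import time

import requests

# Hypothetical endpoint; the AWS console shows the real one per domain.
# Posting over HTTPS protects the batch in transit, as Barr notes above.
DOCS = "https://doc-products-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com"

batch = [
    {
        "type": "add",                # or "delete" to remove a document
        "id": "umbrella_42",          # ids use lowercase letters, digits, _
        "version": int(time.time()),  # versions must increase on each update
        "lang": "en",
        "fields": {
            "title": "Compact travel umbrella",
            "color": "red",
            "price": 19,
        },
    },
]

resp = requests.post(
    DOCS + "/2011-02-01/documents/batch",
    data=json.dumps(batch),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.json())  # per-document results and warnings
```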
The API ecosystem around CloudSearch for various languages is also growing with contributions from early customers. Social music discovery website ex.fm, one of the early customers, created a CloudSearch module for boto that will soon be committed to the boto master branch. Search Technologies, like ex.fm, was an early customer; they used CloudSearch to index and search Wikipedia through a Java API they built, which is now available for download.
From a performance perspective, CloudSearch scales both horizontally (up to 50 instances) and vertically as data is added and request frequency increases. Primary indexes are stored in memory to reduce turnaround time. The pricing model accordingly tracks these two scaling dimensions:
You'll be billed based on the number of running search instances. There are three search instance sizes (Small, Large, and Extra Large) at prices ranging from $0.12 to $0.68 per hour (these are US East Region prices, since that's where we are launching CloudSearch).
There's a modest charge for each batch of uploaded data. If you change configuration options and need to re-index your data, you will be billed $0.98 for each Gigabyte of data in the search domain.
There's no charge for in-bound data transfer, data transfer out is billed at the usual AWS rates, and you can transfer data to and from your Amazon EC2 instances in the Region at no charge.
One will invariably run into the tagline "Start Searching in One Hour for Less Than $100 / Month", which corresponds to a month's usage of a single small search instance (at $0.12 per hour, roughly $86 for a 30-day month) with no additional data uploaded and indexed during that period. However, there is no clear guidance on request throughput per instance size across a spectrum of data set sizes to help estimate real-world pricing. A tweet from Marco Arment, creator of Instapaper, caught my eye:
My experiment with Amazon CloudSearch was going very well until I imported half of subscribers' bookmarks and saw it would cost $10k/month.
Lucid Imagination, creator of LucidWorks, the cloud search platform based on Apache Solr, was quick to comment on the lack of features for enterprise apps:
But back to Amazon…their initial offering lacks a lot of basic capabilities in the areas of security, data acquisition and business rules.
Search Technologies and ex.fm addressed data acquisition and transformation to Amazon's Search Data Format (SDF) in different ways. Ex.fm used custom scripts:
When a song or user is modified and we want to reindex the document in CloudSearch, we add the document's id to a redis set. We then have a celery periodic task that collects all of the ids from the redis set, builds an SDF with hundreds of document updates and commits them all at once.
It has been really handy to export entire mongodb collections to SDFs and store them on S3 for even faster iteration and potential disaster recovery. We have a script that creates thousands of celery tasks, 1000 document ids at a time. Each celery task then pulls those 1000 objects from mongodb, creates the SDF and stores it on S3. We can then run another script that creates thousands more tasks to grab each SDF from S3 and directly post it to the document service. We've found this to be extremely effective for indexing millions of documents in a very timely fashion.
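The first pattern ex.fm describes (a dirty-id set plus a periodic batch flush) is straightforward to reproduce. Below is a minimal sketch of that pattern, with hypothetical names throughout; the endpoint, the Redis key, and the `load_fields` helper are stand-ins, not ex.fm's actual code:

```python
import json
import time

import redis
import requests
from celery import Celery

app = Celery("indexer", broker="redis://localhost:6379/0")
r = redis.StrictRedis()

# Hypothetical document-service endpoint and Redis key.
DOCS = "https://doc-music-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com"
PENDING = "cloudsearch:pending_ids"

def mark_for_reindex(doc_id):
    """Called whenever a song or user is modified: queue its id."""
    r.sadd(PENDING, doc_id)

def load_fields(doc_id):
    """Stand-in for fetching the document's fields from MongoDB."""
    return {"title": "..."}

@app.task
def flush_pending():
    """Run periodically (e.g. via celery beat): drain the set and
    commit all pending updates to CloudSearch as one SDF batch."""
    ids = r.smembers(PENDING)
    if not ids:
        return
    r.srem(PENDING, *ids)
    batch = [
        {
            "type": "add",
            "id": i.decode(),
            "version": int(time.time()),  # versions must increase per update
            "lang": "en",
            "fields": load_fields(i.decode()),
        }
        for i in ids
    ]
    requests.post(
        DOCS + "/2011-02-01/documents/batch",
        data=json.dumps(batch),
        headers={"Content-Type": "application/json"},
    )
```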
and Search Technologies leveraged their Aspire content processor:
Just to be super cool, we fetch Wikipedia dump files directly from Wikipedia, and stream them through our Aspire content processing framework. At no point do we actually have to download files to disk. It just magically flows from Wikipedia, through Aspire, and into CloudSearch. We index the whole thing with a push of a single button.
Indexing into Amazon CloudSearch was super easy. We already had an Aspire “Post XML” stage which did exactly what was needed. We just wrote an XSLT transform to map Aspire metadata into Amazon CloudSearch index fields and it pretty much worked first time.
Most beta users have reported positive experiences with the scalability and availability of the service. Ex.fm said:
In terms of performance, CloudSearch is much faster and more reliable than our previous Apache Solr install. The best feature by far is scaling on demand. Data is horizontally partitioned on the fly as the number of documents you're indexing grows. The instance size and the number of instances scale up and down depending on how many bits you're pushing and pulling.
In 4 months, we've had a total of 15 minutes of downtime, which is very good for a private beta. We're also not concerned with lock-in because it is so similar to the way we were using Solr. We were even able to swap in CloudSearch for Solr without anyone really noticing, except that search results were twice as fast and had much better relevance.
Paul Nelson, Chief Architect at Search Technologies, on unleashing more performance from CloudSearch:
Batching helps indexing performance a lot, and co-locating your indexer in the same Availability Zone as your Amazon CloudSearch search domain also helps. You can send documents to Amazon CloudSearch really fast (nearly 500 docs/second in my test run).
Chris Moyer tweeted:
Over 1 million documents indexed in #CloudSearch and still getting responses in under 100ms on any search.
Amazon.com is conducting a webinar on May 10th, which will be an opportunity to address topics around security for multi-tenant services and other features relevant to enterprise applications.