On Lily, HBase, Hadoop and SOLR

During a conference call last week, I got a request for a high-level description of the differentiators between Lily and its underlying components: what sets Lily apart against HBase basically. Here's a rough take at it before I fold it into the product documentation: your comments are appreciated!

The foundations of Lily are well understood: HBase, the Hadoop BigTable implementation, the Hadoop distributed filesystem - distributed in a convenient fashion under the Cloudera Hadoop Distribution, and SOLR, the enterprise search platform based on the popular full-text search engine Lucene. In this text, we look at what sets Lily apart, or more specifically, what Lily adds on top of these powerful technologies.

major-blocks

The Lily Content Model

HBase is modeled after Google BigTable, a well-documented yet surprisingly uncommon datamodel described as a sparse, distributed, multi-dimensional sorted map. Uncommon in the sense that classically trained business software developers might encounter difficulties in mapping their domain models into the BigTable/HBase model, more specifically because of the lack of entity relationships, indexed access and a tendency towards heavy denormalization. Please mind that these are exactly the reasons behind the power of HBase, however sometimes badly understood.

In our area of expertise, i.e. content management and publishing, other common storage traits are missing from the BigTable model as well, namely support of common datatypes (time, date, but also more sophisticated types such as multi-value fields, hierarchical data, and more) and the ability to model entity schemas using these core types. Also, many if not most content repositories have intrinsic support for content versioning, preferably customizable to the infra-entity-level.

With this in mind, the current use of HBase is naturally centered around mass data-ingestion and -processing, where the data structures are often really simple (log files, user profiles). We however choose HBase as a random access storage engine replacing a relational database, admittedly with a new and interesting data model and a promise to scale well in a busy work environment.

Lily offers a content model on top of HBase, which we believe to be of value for many layman content application developers, in the sense that if offers slightly higher-level concepts to shape a datatier with: records and fields, a set of data types, link, multi-value and hierarchical fields, a flexible versioning scheme, schema validation. Beyond that, we took a couple of popular content exchange models and verified if a mapping from these into Lily would be possible: HTML, RDF, the CMIS model, NewsML. We found out that a useful mapping was possible, so we're confident that a broad range of content applications can be build on top of Lily.

With these foundations, Lily is ideally suited for storing and managing semi-structured content, a mixture of simple and rich media types, hypertext data - data with a life-cycle (hence versioning) and with relationships (hence link fields). Obviously, people who want low-level access to the BigTable model should use HBase directly instead. Technically, all Lily content is mapped onto HBase tables.

Flexible index management, powerful search

Having a flexible content model is one thing, but what often is cited as the key missing feature of many NoSQL stores is an ability to locate data without primary key index (there's a pun in that, but I'll pass). With our experience in content apps, we also had a strong need for full-text search and facet browsing, scaling all this with the volume of data HBase accommodates (which is a lot).

To this end, we do a full-on integration with the leading enterprise search platform - SOLR - in a way that is scalable (by sharding index data and distributing search), flexible (by providing full access to SOLR functionality) and robust. That last aspect is really important due to the inherently asynchronous nature of SOLR index updating, compared with the synchronous put/get operations allowed for HBase records.

We bridged SOLR with HBase for this using a HBase-backed RowLog mechanism we implemented in a fully-distributable message queue, meaning we can now reliably pass content updates from one distributed system (HBase) to another (SOLR).

The indexing mechanism allows to configure what data needs to be indexed, and to describe the mapping between the Lily and the SOLR data model. It also supports data denormalization and link dereferencing, needed to replace the flexibility of the SQL query language.

For search, we offer access SOLR as-is, a familiar environment to many. In the future, also keeping in mind SOLR might not be the only index/search solution that is supported by Lily, we could wrap the SOLR query language into a contraption of our own, but we didn't deem this necessary so far.

A distributed architecture

All components of the Lily architecture are fully distributable, also the client connect points (the Lily server node), and we take special care in not inserting SPOFs in the design by unclever tool reuse. This is one of the reasons why we decided to implemented our own queuing mechanism, as other candidates had no reliable, distributed storage.

The Lily remote interface

Lily can be used through its Java API (with remoting happening through Avro), and also using a REST HTTP/JSON interface.

Integration hooks

It is important to realize that a Lily server node consists of a set of Kauri modules deployed inside a Kauri runtime, which ensures a very flexible, pluggable component architecture already. You can easily develop your own indexer and replace the default one. We, the people behind Lily, have been developing open source software for almost 10 years now, working mainly for a technical audience (i.e. developers) so we know the importance of sensible interfaces and replaceability of default behavior.

Conclusion

Lily offers a flexible, higher-level content repository model on top of HBase, HDFS and SOLR, and combines store and search in a scalable and distributed architecture. It is suited for management of semi-structured, content-centric media and data objects that need versioning, sophisticated search and complex metadata. Lily has an open and extensible architecture and provides low- and higher-level access APIs.

categories: lily open source hbase
8/27/10 | Comments

Daisy 2.4 and Kauri 0.4 on track for new releases

With all the noise going to Lily these days, lest not forget our continuing development of Daisy and Kauri, the former an inspiration for the content model behind Lily, the latter providing the runtime environment.

Daisy and Kauri both saw a release candidate in the past few weeks, to which you're cordially invited to download, install and play around. There's lots of new stuff in both releases.

Daisy

Two highlights in Daisy 2.4 M1: the point-in-time publish and search, which is a time machine feature (non-DeLorean, I'm afraid) allowing you to view and search (!) a site as if it was published on a given point in time. If you have a fancy legal publishing requirement, this is a killer feature which you will find nowhere.

On a technical level, we've (finally) switched the build tool of Daisy from Maven 1 to Maven 2, and also provide some nice plug-ins to manage running Daisy instances.

Kauri

Deep work on the forms module, which should finally be really usable by now, a centralized configuration mechanism, and an upgrade to Restlet 2. That, and loads of work on the documentation as well.
This must have been the longest inter-release spell we experienced, part of which is to be blamed on customer project work (for which we're grateful), and part of indeed is caused by lots of work on Lily as well. That said, we believe there's more than enough new goodies to be found in Daisy 2.4 and Kauri 0.4 to compensate for this.
So hit your surf boards and give them a spin!
Here's a direct link to the respective release notifications:

Enjoy!

categories: kauri daisy open source news community
8/26/10 | Comments

Lily "Proof of Architecture" is OUT!

Hi,

slightly over a year ago, we set out on a course to investigate what content applications would encounter in this new era where data has moved from a liability and cost to an opportunity - if you have the infrastructure to scale. We looked at the architecture of our own content management product, to that of competitors, and decided to move ahead of the curve and design and build the first NoSQL-based content repository that provides cloud-scalable content storage and search.

Lily is the result of that effort, and its "Proof of Architecture" release is available today from www.lilycms.org - licensed under the liberal Apache open source license.

Lily fuses Apache HBase, the Google BigTable-inspired NoSQL column-oriented database, and SOLR, the industry-standard search engine running on top of Apache Lucene, and provides infinitely scalable storage and search for large content collections.

The Lily content repository offers a rich and flexible content model, with strong versioning support, and a queue system that keeps SOLR indexes up to date with repository updates. The Lily content model has been academically validated and accommodates data mapped from various domains, such as rich hypermedia, HTML5, NewsML, MXF, CMIS, RDF and many more.

This Proof of Architecture release is made specifically for the audience Lily has been designed for: content technologists, developers of content applications such as WCMS, CMS, DAM, DMS and RM, which are being confronted with the lack of scale and reliability a relational DBMS back-end often exhibits when data and usage volumes explode. Lily has been specifically architected to be fully distributable, allowing it to run on large server farms or in the cloud, making use of such large-scale infrastructure to provide room for growth.

Outerthought, the company behind Lily, has been building and nurturing open source content management technology for more than 6 years now, and has been leveraging this experience throughout the Lily design. We were first in realizing that NoSQL could be a tremendous asset for content application builders, and to make use of these new and exciting concepts and technology in designing and building a fully-integrated content repository solution.

With this Proof of Architecture, we invite content technologists to check Lily out, and to consider not building their own NoSQL content repository in the years to come, but rather to join us in a open software community around Lily and focus on front-end and UI differentiators instead.

A first Lily distribution release is planned for November 2010, and Lily 1.0 will be there March 2011. We're recruiting technology partners as we speak.

Lily is made available from www.lilycms.org starting now. Follow us on Twitter @outerthought to be updated as we proceed. There's a FAQ on Lily available as well: http://www.lilycms.org/lily/about/faq.html.

I'd especially like to thank the HBase community for their effort in accommodating us, the SOLR/Lucene community for their great software, Michael Stack and Lars George for their enthusiasm along the way, and Bruno and Evert, our lead Lily engineers, for the past year of hard work.

I'm thrilled to see where we are at now, and invite anyone to join us in this excitement.

Thanks for your attention,

Steven.

categories: lily community business events open source news
7/22/10 | Comments

On Lily, HBase and SOLR

I just stumbled upon (gee, another HBase user) the video of my talk on some of the technology decisions we made during Lily's design phase. You'll have to live with my lack of tempo during the first 10 minutes, and you don't have to take my word for everything I say: on July 22nd, we're opening the Lily source tree and you'll be able to find out yourself what Lily is all about.

categories: lily community events hbase
7/14/10 | Comments

Bending time in HBase

HBase, like BigTable, “is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.” (ref) For completeness, we could consider "table" and "column family" as other dimensions offered by HBase.

The dimensions are not all equal: they each have their own behavior. For example, the row dimension is the one that gets sharded, so can grow very big. Columns are not sharded, but in contrast to rows, multiple columns within a row can be atomically put or deleted (fine print: doing puts and deletes as one atomic operation is not possible). In this article, we will focus on the specifics of the time dimension.

You could consider the time dimension to be some automatic versioning done by HBase, which you can ignore if you do not need it. Indeed, you can do operations like get, put or delete without specifying a time, and it will usually work like you would expect. However, some particularities of the current implementation make that you are better of having a good understanding of the time dimension anyway.

Basics

Terminology

In the BigTable paper, the terminology used is that a {row key, column key} pair addresses a cell, and that each cell can contain multiple versions, indexed by timestamp.

Keys in the time/version dimension

While rows keys and column keys (a.k.a. qualifiers) are bytes, in the time dimension, the key is a long integer. Typically this contains time instants like the ones returned by java.util.Date.getTime() or System.currentTimeMillis(), that is: “the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC”.

The time dimension is stored in decreasing order, so that when reading from a store file, the most recent values are found first.

We will now look at the behavior of the time dimension for each of the core operations: get, put and delete.

Get

By default, when doing a get, the version with the biggest timestamp of each cell is returned (which may or may not be the latest one written, see later).

The default behavior can be changed in two ways, you can ask:

  • to return more than one version: see Get.setMaxVersions()

  • to return other versions than the latest ones, see Get.setTimeRange()

One interesting option that is missing is the ability to retrieve the latest version less than or equal to a given timestamp, thus giving the 'latest' state of the record at a certain point in time. Update: this is (obviously) possible: just use a range from 0 to the desired timestamp and set the max versions to 1 (thanks Jonathan).

Put

Doing a put always creates a new version of a cell, at a certain timestamp. By default the system uses currentTimeMillis, but you can specify the timestamp (= the long integer) yourself, on a per-column level. This means you could assign a time in the past or the future, or use the long value for non-time purposes.

Without going too much into the storage architecture, HBase basically never overwrites data but only appends. The data files are rewritten once in while by a compaction process. A data file is basically a list of key-value pairs, where the key is the composite {row key, column key, time}. Each time you do a put that writes a new value for an existing cell, a new key-value pair gets appended to the store. Even if you would specify an existing timestamp. Doing lots of updates to the same row in a short time span will lead to a lot of key-value pairs being present in the store. Depending on the garbage collection settings (see next), these will be removed during the next compaction.

Delete

There is a lot to tell about deletes and the time dimension.

Garbage collection

First of all, there are two ways of automatic pruning of versions:

  • you can specify a maximum number of versions, if more versions are added the oldest ones are deleted. The default is 3, and is configured when creating a column family via HColumnDescriptor.setMaxVersions(int versions). The actual deletion of the excess versions is done upon major compaction, though when performing gets or scans the results will already be limited to the maximum versions configured. Thus setting maximum-versions to 1 does not really disable versioning; each put still creates a new version, but only the latest one is kept.

  • you can specify a time-to-live (TTL), if versions get older than this TTL they are deleted. The default TTL is "forever", and is configured via HColumnDescriptor.setTimeToLive(int seconds). Again, the actual removal of versions is done upon major compaction, but gets and scans will stop returning versions whose TTL is passed immediately. Note that when the TTL has passed for all cells in a row, the row ceases to exist (HBase has no explicit create or delete of a row: it exists if there are cells with values in them).

An interesting behavior I noticed while empirically verifying these behaviors is the following: suppose you create three cell versions at t1, t2 and t3, with a maximum-versions setting of 2. So when getting all versions, only the values at t2 and t3 will be returned. But if you delete the version at t2 or t3, the one at t1 will appear again. Obviously, once a major compaction has run, such behavior will not be the case anymore (thus major compactions are not completely transparent to the user).

Manual delete

When performing a delete operation in HBase, there are two ways to specify the versions to be deleted:

  • delete all versions older than a certain timestamp

  • delete the version at a specific timestamp

A delete can apply to a complete row, a complete column family, or to just one column. It is only in the last case that you can delete versions at a specific timestamp. For the deletion of a row or all the columns within a family, it always works by deleting all cells older than a certain time.

Deletes create tombstone markers

For example, let's suppose we want to delete a row. For this you can specify a timestamp, or else by default the currentTimeMillis is used. What this means is “delete all cells where the timestamp is less than or equal to this timestamp”. HBase never modifies data in place, so for example a delete will not immediately delete (or mark as deleted) the entries in the storage file that correspond the delete condition. Rather, a so-called “tombstone” is written, which will mask the deleted values. When HBase does a major compaction, the tombstones are processed to actually remove the dead values, together with the tombstones themselves.

If the timestamp you specified when deleting a row is larger than the timestamp of any value in the row, then you can consider the complete row to be deleted.

Uses of timestamps

While the time dimension is primarily intended for versioning, you could consider to use it as a just another dimension similar to columns, with the difference that they key is a long. Due to some bugs, currently this cannot be recommended (see below).

Browsing through Jira issues, I have found some interesting mentions of the timestamp dimension:

  • HBase multi data center replication makes use of the timestamps to avoid conflicts. See HBASE-1295 (page 4 of the attached PDF) or this mail.

  • A comment in HBASE-2406 mentions: If you want a consistent version of some data that spans multiple tables (i.e. secondary index), you may want to use the same timestamp to insert into both tables so that you can use the exact timestamp as part of a get() after reading it out of one table.

Limitations

There are still some bugs (or at least 'undecided behavior') with the time dimension that needs to be cleared out. These are things that could get you into trouble:

Overwriting values at existing timestamps. (HBASE-1485 , HBASE-2406) In other words, update the value at an exact {row, column, time} key (I find this the desired behavior, though you could also store multiple unordered values at the same time). It might seem logical that there is trouble here, as you could imagine that HBase needs the timestamp to decide what value is most recent, and if two values have the same timestamp, it can't make this decision. However, there is another source of order which is the order in which things are written to the HFile.

Deletes mask puts, even puts that happened after the delete (HBASE-2256). Remember that a delete writes a tombstone, which only disappears after then next major compaction. Suppose you do a delete of everything <= T. After this you do a new put with a timestamp <= T. This put, even if it happened after the delete, will be masked by the delete tombstone. Performing the put will not fail, but when you do a get you will notice the put did have no effect. It will start working again after the major compaction has run.

These issues should not be a problem if you use always-increasing timestamps for new puts to a row. But they can occur even if you do not care about time: just do delete and put immediately after each other, and there is some chance they happen within the same millisecond (testimony).

categories: hbase coding
7/13/10 | Comments