Tuesday, October 2, 2012

Cassandra Composite Types - An Overview [with CQL & Cassandra-Cli examples]

Just felt like sharing what a composite type in Cassandra is and how we can make use of it.
After many discussions on Stack Overflow and the PHPCassa forums, I hope I have a clear picture of the topic.

What are composite types?
Composite types are like structures in C or C++

For example:
The way we define a [basic] linked list node
struct LLNode {
 int data;
 char name[20];
 struct LLNode *next;
};
This means every LLNode has data of type int, name of type char array, and a next pointer of type struct LLNode *.
The struct is a composite of basic data types [not exactly in this case, since next points to another struct].
Also, you can't assign a value to any of these members when you define the struct.

In the same way, a Cassandra composite type is a data type derived from the existing basic supported data types.

Why do we need this?
Cassandra initially had and still has the concept of super columns.
The basic use of them is maintaining an inverted index.

Consider a data model in cassandra
ColumnFamily: UserMaster
ColumnModel:
userID: {name, prevCompany, experience} 
Now, in case we need to support a query like

Select name from UserMaster where prevCompany = xyz;

This is impossible until we have a secondary index created.

To overcome this issue, Cassandra gave developers an option of creating their own index using super columns
ColumnFamily: UserIndex
ColumnModel: 
Date: {company1: {rowID1, .. , rowIDN },..,companyN: {rowID1, .. , rowIDN } } //SuperColumn

Now we can answer the above query via
allUsers = UserIndex[Date][prevCompany];
for i in allUsers
 echo UserMaster[i][name];
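
The same idea can be sketched in runnable form. This is only an illustration: plain Python dicts stand in for the two column families, and the names add_user and names_by_company are made up for this sketch, not part of any Cassandra client.

```python
# Plain dicts standing in for the two column families (illustration only).
user_master = {}   # userID -> {name, prevCompany, experience}
user_index = {}    # date -> {company -> [userID, ...]}, i.e. the "super column"


def add_user(user_id, name, prev_company, experience, date):
    # Write 1: the user row itself.
    user_master[user_id] = {
        "name": name,
        "prevCompany": prev_company,
        "experience": experience,
    }
    # Write 2: the hand-maintained index entry. Nothing ties the two writes
    # together, so a failure between them leaves the index stale.
    user_index.setdefault(date, {}).setdefault(prev_company, []).append(user_id)


def names_by_company(date, prev_company):
    # Read 1: the index row. Then one more lookup per matching user.
    user_ids = user_index.get(date, {}).get(prev_company, [])
    return [user_master[uid]["name"] for uid in user_ids]


add_user("u1", "alice", "xyz", 2, "20120901")
add_user("u2", "bob", "abc", 3, "20120901")
add_user("u3", "carol", "xyz", 5, "20120901")
print(names_by_company("20120901", "xyz"))
```

Every add_user call touches both structures, and every lookup goes through the index first and the master rows second.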

But this is a bit messy, as every insert leads to two inserts with no transactional guarantees.
Also, every read results in a minimum of two reads across CFs.
Also, what if we have one more column of interest?
Say:
Select name from UserMaster where prevCompany = xyz and experience = 2yrs;
And there are many other issues with super columns themselves.

To overcome this [inverted index] problem, Ed Anuff came up with the concept of composite types

How would composite types save time?
Point to remember:
1. Composite types preserve the type of each component
Example:
CompositeType(ascii, int, ascii);
a:1:user1
a:10:user2
aa:2:user1
ab:0:user2

You can see the columns are sorted first on component 1, then on 2, and then on 3.
The sorting uses the exact type of each component, so component 2 is compared as an int [1 before 10], not as text.
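
To see why the component types matter, here is a small sketch in plain Python. It is an assumption-laden illustration, not Cassandra's comparator: tuples stand in for composite column names, and the helper render is made up for this example.

```python
# Composite column names modelled as (ascii, int, ascii) tuples.
# Assumption: Python tuple comparison stands in for CompositeType ordering.
columns = [
    ("ab", 0, "user2"),
    ("a", 10, "user2"),
    ("aa", 2, "user1"),
    ("a", 1, "user1"),
]


def render(col):
    """Join the components with ':' the way the post writes them."""
    return ":".join(str(part) for part in col)


# Typed sort: component 2 is compared as an integer, so 1 < 10.
typed_order = [render(c) for c in sorted(columns)]

# Flattened to plain strings, component 2 is compared as text instead,
# so "a:10:..." lands before "a:1:..." ('0' sorts before ':').
string_order = sorted(render(c) for c in columns)

print(typed_order)   # ['a:1:user1', 'a:10:user2', 'aa:2:user1', 'ab:0:user2']
print(string_order)  # ['a:10:user2', 'a:1:user1', 'aa:2:user1', 'ab:0:user2']
```

The typed order matches the listing above, while the string order does not, which is the practical meaning of "type preserved".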

So, we model our data in the following way
ColumnFamily: UserMaster
ColumnModel:
rowID: {prevCompany:exp:userName} //Notice that the column values in model 1 become the column names

Sample:
20120901: {prevCompany1:2:user1 => {null}, prevCompany1:2:user2 => {null}, prevCompany1:3:user3 => {null}, ...}
Query:
select * from UserMaster where prevCompany = 'prevCompany1' and exp > 2
Result:
prevCompany1:3:user3 => ''

Note:
I have kept the column value null as we don't need anything in there.
Also, I have kept the user as the last component because our read pattern is:
given a company, get all users within a given experience range

Now we can query for
All employees whose prevCompany is 'x'
All employees whose prevCompany is 'x' and exp is > '2yrs'
All employees whose prevCompany is 'x' and exp is = '2yrs' and name = 'xyz'
and so on
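
A rough sketch of how such a query maps onto one sorted row follows. It assumes the columns are kept in comparator order and uses Python's bisect module on tuples as a stand-in; the function slice_by_company_and_min_exp is hypothetical and only covers the "company = x and exp > n" case.

```python
from bisect import bisect_left

# One wide row: composite column names in sorted order, modelled as
# (prevCompany, exp, userName) tuples. Column values are empty, so omitted.
row = sorted([
    ("prevCompany1", 2, "user1"),
    ("prevCompany1", 2, "user2"),
    ("prevCompany1", 3, "user3"),
    ("prevCompany2", 1, "user4"),
])


def slice_by_company_and_min_exp(columns, company, min_exp_exclusive):
    """Columns where prevCompany == company and exp > min_exp_exclusive."""
    # A shorter tuple sorts before every longer tuple that starts with it,
    # so (company, n + 1) marks the first column with exp >= n + 1.
    lo = bisect_left(columns, (company, min_exp_exclusive + 1))
    # (company, inf) sorts after every column of this company.
    hi = bisect_left(columns, (company, float("inf")))
    return columns[lo:hi]


print(slice_by_company_and_min_exp(row, "prevCompany1", 2))
# [('prevCompany1', 3, 'user3')]
```

The query is answered by finding a start and an end position and returning everything in between, with no per-column filtering.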

Note:
Your query should always fetch a contiguous slice in case of composite types

Means:
Sample Data [assume]:
20120901: {y:2:123, y:2:124, y:2:125, y:3:123, y:3:126}

All employees whose prevCompany is 'y' and exp is >= '2yrs' and userName = '123' will not work
Why?
Filter1: prevCompany is 'y' => contiguous
{y:2:123, y:2:124, y:2:125, y:3:123, y:3:126}
Filter2: exp is >= 2yrs => still contiguous
{y:2:123, y:2:124, y:2:125, y:3:123, y:3:126}
Filter3: userName = '123' => non contiguous
{y:2:123, y:3:123} [y:2:124 and y:2:125 sit between the two matches]

All employees whose prevCompany is 'y' and exp is = '2yrs' and userName > '123' will work
Why?
Filter1: prevCompany is 'y' => contiguous
{y:2:123, y:2:124, y:2:125, y:3:123, y:3:126}
Filter2: exp is = 2yrs => still contiguous
{y:2:123, y:2:124, y:2:125}
Filter3: userName > '123' => still contiguous
{y:2:124, y:2:125}
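
The contiguity rule can be checked mechanically. Below is a minimal sketch, assuming the sample row above is already in comparator order; is_contiguous and the two predicates are illustrative names, not part of any Cassandra API.

```python
# Sample row from above, already in comparator order.
row = [
    ("y", 2, "123"), ("y", 2, "124"), ("y", 2, "125"),
    ("y", 3, "123"), ("y", 3, "126"),
]


def is_contiguous(columns, predicate):
    """True if the columns matching predicate form one unbroken run."""
    hits = [i for i, col in enumerate(columns) if predicate(col)]
    return bool(hits) and hits == list(range(hits[0], hits[-1] + 1))


def query_not_ok(col):
    # prevCompany = 'y' and exp >= 2 and userName = '123'
    return col[0] == "y" and col[1] >= 2 and col[2] == "123"


def query_ok(col):
    # prevCompany = 'y' and exp = 2 and userName > '123'
    return col[0] == "y" and col[1] == 2 and col[2] > "123"


print(is_contiguous(row, query_not_ok))  # False: matches at positions 0 and 3
print(is_contiguous(row, query_ok))      # True: matches at positions 1 and 2
```

As a rule of thumb from these two cases: equality on the leading components followed by a range on the last restricted component yields a single slice, while a range followed by a further restriction does not.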

Finally,
Composite types in Cassandra are a nice concept, but the number of components is fixed when you define the column family.
Cassandra has support for Dynamic Composite Columns [to overcome the previous issue], but in my opinion it is not safe [as type safety is the tradeoff].
Syntax to Create composite type via cassandra-cli
create column family TestComposite 
 with comparator='CompositeType(UTF8Type, UTF8Type, LongType)'
 and key_validation_class = 'UTF8Type'
 and default_validation_class = 'UTF8Type';
 
Remember, you can't query composite types via cassandra-cli [unless it is a point query]

Syntax to create composite type via CQL
CREATE TABLE UserMaster (
   day ascii,
   preCompany ascii,
   experience int,
   userName ascii,
   PRIMARY KEY (day, preCompany, experience, userName)
 );
 
You can only query on the components listed in the PRIMARY KEY above.
Check this post on SO.
Specifying composite fields via CQL is a bit different; check this detailed post to get a clear picture. Remember, both ways use the same storage pattern.

Hope the whole blog post made some sense of composite columns.