Use of Vertical Partitioning Algorithms to Create Column Families in HBase

Posted on May 11th, 2017 in HBase by Rana Faisal Munir

HBase is a NoSQL database. It stores data as a key-value pair where key is used to uniquely identify a row and each value is further divided into multiple column families. HBase reads subset of columns by reading the entire column families (which has referred columns) into memory and then discards unnecessary columns. To reduce number of column families read in a query, it is very important to define HBase schema carefully. If it is created without considering workload then it might create inefficient layout which can increase the query execution time. There are already many approaches available for HBase schema design but in this post, I will present to use of vertical partitioning algorithms to create HBase column families.

There are two main challenges in defining a table in HBase:

Defining a row key which can cover most of the queries in the workload
Defining column families which help in reading less data as much as possible

Defining Row Key:

For defining a row key, I propose to extract all columns which are used in FILTER and rank them according to their usage. Take top 1 or 2 which covers majority of the queries and combine them with primary key of the table. For example, in TPCH benchmark, lineitem table is used in "Q1,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q12,Q14,Q15,Q17,Q18,Q19,Q20,Q21" and in Filter, L_SHIPDATE is used 8 times and L_COMMITDATE, L_RECIPETDATE, L_QUANTITY are used for 2 times each. According to my proposal, I choose L_SHIPDATE by combining them with Primary key (L_ORDERKEY , L_LINENUMBER) to make a composite row key for HBase table. By doing this, we can take advantage of key base filtering in 8 queries.

Defining Column Families:

There are already many vertical partitioning algorithms in database world to define vertical partitions. Vertical partitions are similar to column groups and it can be taken as column families in HBase. There is a paper titled as "A Comparison of Knives for Bread Slicing" in VLDB 2013. They compared already existing vertical partitioning algorithms. However, they have not evaluated these algorithms for HBase. It might be interesting to compare there performance in HBase to see how useful these algorithms are in HBase settings. I have applied five algorithms (AutoPart, HillClimb, O2P, HYRISE, NAVATHE) by considering only Lineitem table and all TPCH queries as workload which involve Lineitem table. The below table is showing the column groups for every algorithm.

Algorithm Name	Columns Groups
AutoPart	G1: {L_LINESTATUS, L_TAX} G2: {L_ORDERKEY} G3: {L_PARTKEY} G4: {L_SUPPKEY} G5: {L_LINENUMBER, L_COMMENT} G6: {L_QUANTITY} G7: {L_RECEIPTDATE, L_COMMITDATE} G8: {L_RETURNFLAG} G9: {L_SHIPDATE} G10: {L_DISCOUNT, L_EXTENDEDPRICE} G11: {L_SHIPINSTRUCT} G12: {L_SHIPMODE}
HillClimb	G1: {L_ORDERKEY} G2: {L_PARTKEY} G3: {L_SUPPKEY} G4: {L_LINENUMBER} G5: {L_QUANTITY} G6: {L_DISCOUNT,L_EXTENDEDPRICE} G7: {L_LINESTATUS, L_TAX} G8: {L_RETURNFLAG} G9: {L_SHIPDATE} G10: {L_RECEIPTDATE, L_COMMITDATE} G11: {L_SHIPINSTRUCT} G12: {L_SHIPMODE} G13: {L_COMMENT}
HYRISE	G1: {L_RECEIPTDATE, L_COMMITDATE} G2: {L_PARTKEY} G3: {L_SHIPMODE} G4: {L_SHIPINSTRUCT} G5: {L_SHIPDATE} G6: {L_SUPPKEY} G7: {L_DISCOUNT} G8: {L_EXTENDEDPRICE} G9: {L_RETURNFLAG} G10: {L_COMMENT, L_LINENUMBER} G11: {L_LINESTATUS, L_TAX} G12: {L_ORDERKEY} G13: {L_QUANTITY}
NAVATHE	G1: {L_COMMENT, L_LINESTATUS} G2: {L_RETURNFLAG} G3: {L_SHIPMODE} G4: {L_RECEIPTDATE, L_COMMITDATE} G5: {L_SUPPKEY} G6: {L_DISCOUNT, L_EXTENDEDPRICE } G7: {L_SHIPDATE} G8: {L_QUANTITY } G9: {L_PARTKEY} G10: {L_SHIPINSTRUCT} G11: {L_TAX} G12: {L_LINENUMBER} G13: {L_ORDERKEY}
O2P	G1: {L_COMMENT, L_LINESTATUS } G2: {L_RETURNFLAG} G3: {L_SHIPMODE } G4: {L_COMMITDATE } G5: {L_RECEIPTDATE } G6: {L_SUPPKEY } G7: {L_DISCOUNT, L_EXTENDEDPRICE } G8: {L_SHIPDATE } G9: {L_QUANTITY L_PARTKEY } G10: {L_SHIPINSTRUCT } G11: {L_TAX} G12: {L_LINENUMBER } G13: {L_ORDERKEY}

Theoretical Evaluation:

Whenever, we read subset of columns then we have to read all columns families which contain these columns. We can evaluate by seeing how many column families (column groups) that we have to read in different queries.

Query	AutoPart	HillClimb	O2P	HYRISE	NAVATHE
Q3 of TPCH select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue, from LINEITEM where l_shipdate > date '1995-03-13' group by l_orderkey limit 10;	2	2	2	3	2
Q12 of TPCH select l_shipmode from LINEITEM where l_shipmode in ('RAIL', 'FOB') and l_commitdate < l_receiptdate and l_shipdate < l_commitdate and l_receiptdate >= date '1997-01-01' and l_receiptdate < date '1997-01-01' + interval '1' year group by l_shipmode order by l_shipmode;	3	3	4	3	3

Query

AutoPart

HillClimb

O2P

HYRISE

NAVATHE

Q3 of TPCH

select

l_orderkey,

sum(l_extendedprice * (1 - l_discount)) as revenue,

from LINEITEM

where

l_shipdate > date '1995-03-13'

group by l_orderkey

limit 10;

Q12 of TPCH

select

l_shipmode

from LINEITEM

where l_shipmode in ('RAIL', 'FOB')

and l_commitdate < l_receiptdate

and l_shipdate < l_commitdate

and l_receiptdate >= date '1997-01-01'

and l_receiptdate < date '1997-01-01' + interval '1' year

group by l_shipmode

order by l_shipmode;

The above table presents two queries of TPCH and I manually evaluate their performance for different algorithms. I am not favoring any algorithm and it is up to you to decide which one is better in your environment. However, I am recommending to use one of these algorithms while designing schema of a HBase table. It can be helpful and can propose an efficient schema based on your workload.