Indexing HDFS Data in PDW: Splitting the data from index
1
Vinitha Gankidi#,Nikhil Teletia*,Jignesh M. Patel#,Alan Halverson*, David J. DeWitt*#University of Wisconsin-Madison*Microsoft Jim Gray Systems Lab
Hybrid SQL-On-Hadoop solutions (MicrosoftPolyBase, TeradataQueryGrid, IBM Big SQL etc.)
RDBMS
Motivation
2
HDFS
Data lives in two worlds
Cheap and scalable data store
Cold Data
Load first, schema later
Familiar SQL interface
Decades of research and optimization
Hot Data
SQL Server PDW with Polybase
SQL
Result
Query Execution over External Data
3
Import HDFS files into PDW
IMPORT PATH
PUSH-DOWN PATH
Run the rest of the query inside PDW
1
2
Import the result of theMapjobintoPDW
Run the rest of the query inside PDW
2
3
Runa Mapjobtofilter
1
SELECT * FROMhdfs_EmployeeWHEREDeptID= 1
HDFS
The HDFS files have to be entirely imported
Significant startup overhead for MAP task
All the HDFS files are scanned entirely
What is a Split-Index?
Index is stored inRDBMS,while the data is in HDFSIndex is stored as aRDBMS tableHash-partitioned across multiple nodeEach partition has clustered B+ tree
4
RDBMS
Index
Split-Index is similar to a materialized view(with an external pointer)
Split-Index canbe out-of-sync withthe data
Query Execution usingSplit-Index
5
SELECT name FROMhdfs_EmployeeWHEREDeptID= 1
Index_Emp(Index onDeptID)
1
SELECT [HDFS File Name],[HDFS Offset], [Rec Len]FROMindex_EmployeeWHEREDeptID= 1
2
Qualifying Tuples
Retrievequalifyingtuples from HDFS files.
3
Return the result
4
HDFS
Using index, we can answer queries without having to sequentially scan each HDFS file.
Incremental Index Update
Given theappend-onlyproperty of the HDFS data, index can beupdated incrementallyA new HDFS file is addedAppend the rows of the new file to the existing indexAn HDFS file is deletedDelete the rows of the deleted file from the existing index
6
Hybrid Scan
AstaleSplit-Index can still be used during query executionExamples:An HDFS file is addedScan thenew fileusingnon-indexapproachProcessexisting filesusingindexAn HDFS file is deletedWhen probing the index, remove the rows associated with the deleted file
7
Experiments
8
Split-IndexPerformance
Cluster9 Node SQL Server PDW cluster (8 compute nodes + 1 controlnode)29 Node Windows HDP 2.0 cluster (28 data nodes + 1 name node)Data Set10 TB ScaleLineitemtableCompare Push-Down approach with Split-Index approach
9
Push-Down Approach
Map Cost
Data Import Cost
Split-Index Approach
RID Materialize Cost
Data Import Cost
COST
Split-Index Performance
10
SELECT * FROMlineitemWHEREl_orderkey<= [Variable]
Data Size:~800GBIndex Size:~80GB
Split-Index onl_orderkey
Push-Down Approach
MapCost
Data Import Cost
Split-IndexApproach
RID Materialize Cost
Data Import Cost
Index performance is sensitive to the accesspattern.
Space vs. Time Trade off
Cost of storing the data in RDBMS is higher compared to HDFSSplit-Index can be used as a covering indexQuantify the performance and space trade-off as we move columns from HDFS to PDWExperiment Setup1 TB ScaleLineitemModified Query 6
11
SELECTSUM(l_extendedprice*l_discount) ASREVENUEFROMlineitemWHEREl_shipdate>= '1994-01-01'ANDl_shipdate<dateadd(mm, 1, cast('1994-01-01' as date))ANDl_discountBETWEEN .06 - 0.01 AND.06 + 0.01 ANDl_quantity<24
Space vs. Time Trade off
12
Conclusions and Future Work
13
0
Embed
Upload