
Data Generation for Application-Specific Benchmarking

Y.C. Tay
National University of Singapore
Background
benchmarks help research and development
the dominant database benchmark is TPC
SIGMOD Conference 2011
* research track: 87 papers, 17 use TPC (20%)
* industry track: 14 papers, 6 use TPC (43%)
Problem: a few TPC benchmarks, but many, many applications
TPC becoming irrelevant?
Vision
a paradigm shift in database benchmark development
from: top-down, committee consensus, domain-specific package (data generator + queries)
to: bottom-up, community collaboration, application-specific tools (dataset scaling)
synthetically scale up/down application data
application already has queries
Challenge
Dataset Scaling Problem:
Given a set of relational tables D and a scale factor s, generate a database state D’ that is similar to D but s times its size.
E.g. What would DBLP look like in 2020?
s > 1
* why: scalability testing
* difficulty: copying doesn’t work (e.g. social network data)
s < 1
* why: application testing
* difficulty: sampling not straightforward (similar to web crawling)
s = 1
* why: privacy/proprietary reasons
* difficulty: encryption is risky
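A minimal sketch of why copying fails for s > 1 on social network data, assuming a toy friendship edge list. The helper names `scale_by_copying` and `count_components` and the `id_offset` parameter are hypothetical, chosen for illustration only. Copying with shifted keys doubles the row count, but the copies never interact, so the scaled graph splits into disconnected islands that the original does not have.

```python
from collections import defaultdict


def count_components(edges):
    """Count connected components among the users that appear in `edges`."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        stack = [start]
        seen.add(start)
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return components


def scale_by_copying(edges, s, id_offset):
    """Naive scale-up: emit s copies, shifting user ids so keys stay unique."""
    return [(u + k * id_offset, v + k * id_offset)
            for k in range(s) for u, v in edges]


friends = [(1, 2), (2, 3), (3, 1), (3, 4)]  # toy data, one connected community
doubled = scale_by_copying(friends, s=2, id_offset=100)

original_components = count_components(friends)
doubled_components = count_components(doubled)
print(len(friends), original_components)   # rows and components in D
print(len(doubled), doubled_components)    # rows and components in the copy
```

The copy has the right size, yet a query such as "who is reachable from user 1?" returns the same answer as before scaling, which is not what a network twice the size would be expected to look like.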
Challenge
Dataset Scaling Problem (continued):
“similar to D” is judged by query results
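One possible reading of similarity by query results, sketched under assumptions: size-type queries (row counts) should grow by roughly the factor s, while ratio-type queries (comments per photo) should stay roughly the same. The slides do not define the measure; the table layout, the two queries, and the helper `run_query` are illustrative assumptions.

```python
import sqlite3


def run_query(rows, sql):
    """Load (commenter, photo_id) rows into an in-memory table, return one scalar."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (commenter INTEGER, photo_id INTEGER)")
    conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)
    value = conn.execute(sql).fetchone()[0]
    conn.close()
    return value


COUNT_SQL = "SELECT COUNT(*) FROM comments"
RATIO_SQL = "SELECT COUNT(*) * 1.0 / COUNT(DISTINCT photo_id) FROM comments"

s = 2
D = [(1, 10), (2, 10), (1, 11), (3, 12)]                 # toy empirical data
D_scaled = D + [(4, 13), (5, 13), (4, 14), (6, 15)]      # toy candidate for D'

# A size query should scale by about s.
count_ratio = run_query(D_scaled, COUNT_SQL) / run_query(D, COUNT_SQL)
# A ratio query should be about the same on D and D'.
ratio_gap = abs(run_query(D_scaled, RATIO_SQL) - run_query(D, RATIO_SQL))

print(count_ratio, ratio_gap)
```

In practice the application's own queries would take the place of the two toy queries, since the application already has them.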
difficulty:data correlation
E.g. database = {photos, owners, comments, tags}
inter-column correlation
* foreign keys
* age and gender
* user likely to comment on own photos
* gardener likely to tag photos of flowers
inter-row correlation
* photo dimensions (same camera)
* tags used by gardener (“rose”, “bee”, “beetle”)
inter-column + inter-row
* 2 users comment on each other’s photos (social network)
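The inter-column case "user likely to comment on own photos" can be made concrete with a small sketch. The data is invented, and `decorrelate` is a hypothetical stand-in for a generator that fills the commenter column without looking at the photo column: it rotates the commenter column by one position, which keeps each column's value frequencies intact but breaks the pairing between them.

```python
# photo_id -> owner (toy data)
photos = {10: "ann", 11: "ann", 12: "bob", 13: "cat"}
# (commenter, photo_id) pairs (toy data)
comments = [("ann", 10), ("ann", 11), ("bob", 12),
            ("cat", 10), ("bob", 13), ("cat", 13)]


def self_comment_fraction(comments, photos):
    """Fraction of comments written by the owner of the commented photo."""
    own = sum(1 for commenter, photo_id in comments
              if photos[photo_id] == commenter)
    return own / len(comments)


def decorrelate(comments):
    """Rotate the commenter column by one row; per-column frequencies are unchanged."""
    commenters = [c for c, _ in comments]
    photo_ids = [p for _, p in comments]
    rotated = commenters[1:] + commenters[:1]
    return list(zip(rotated, photo_ids))


empirical = self_comment_fraction(comments, photos)
independent = self_comment_fraction(decorrelate(comments), photos)
print(empirical, independent)
```

Both versions have the same number of comments per user and per photo, yet a join query asking for self-comments returns different results, which is the sense in which correlation makes scaling difficult.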
Challenge
scaling a social network:
D = empirical dataset, D~ = synthetic dataset
G = empirical social graph, G~ = synthetic social graph
* extract: D -> G (use join query)
* scale by s: G -> G~ (use graph theory: #edges? #triangles? path lengths?)
* inject: G~ -> D~ (any database theory?)
E.g. how to inject into D~
* correlation from G~ indicating X and Y comment on each other’s photos
* correlation between Alice’s birthday and wallposts by her classmates
* correlation among tags used by bird watchers
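The extract step and the graph statistics named above can be sketched as follows. The schema (photos, comments), the rule that two users are connected when they comment on each other's photos, and the toy rows are all assumptions for illustration; the slides do not specify the actual join query.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE photos (photo_id INTEGER PRIMARY KEY, owner TEXT);
CREATE TABLE comments (commenter TEXT, photo_id INTEGER);
""")
conn.executemany("INSERT INTO photos VALUES (?, ?)",
                 [(1, "x"), (2, "y"), (3, "z"), (4, "w")])
conn.executemany("INSERT INTO comments VALUES (?, ?)", [
    ("x", 2), ("y", 1),   # x and y comment on each other's photos
    ("y", 3), ("z", 2),   # y and z comment on each other's photos
    ("x", 3), ("z", 1),   # x and z comment on each other's photos
    ("w", 1),             # one-way comment only: no edge
])

# Join query: pairs (u, v) where u commented on a photo of v and v commented
# on a photo of u. The u < v filter keeps each undirected edge once and
# drops self-comments.
MUTUAL_SQL = """
SELECT DISTINCT a.commenter, pa.owner
FROM comments a
JOIN photos pa ON a.photo_id = pa.photo_id
JOIN comments b ON b.commenter = pa.owner
JOIN photos pb ON b.photo_id = pb.photo_id
WHERE pb.owner = a.commenter AND a.commenter < pa.owner
"""
edges = {tuple(row) for row in conn.execute(MUTUAL_SQL)}
conn.close()

# Graph statistics a scaled graph G~ might be asked to respect.
nodes = sorted({u for edge in edges for u in edge})
triangles = sum(1 for a, b, c in combinations(nodes, 3)
                if {(a, b), (a, c), (b, c)} <= edges)
print(len(edges), triangles)
```

The brute-force triangle count is only suitable for toy graphs; it is here to show what the statistic measures, not how to compute it at scale.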
Challenge
Attribute Value Correlation Problem for Social Networks:
Suppose a dataset D records data from a social network. How do the social interactions affect the correlation among attribute values in D?
* online social networks are here to stay
* their datasets can be huge
* their datasets have commercial value
where is the database theory?
Vision (for the next 25 years):
a paradigm shift from a top-down design of domain-specific benchmarks by committee consensus to a bottom-up collaborative development of tools for application-specific dataset scaling
Challenges:
* Dataset Scaling Problem
* Attribute Value Correlation Problem for Social Networks
Payoff:
* commercial value in dataset scaling tools
* new database research areas (social network data, schema design, vertical/horizontal partition, query optimization, business intelligence, …)
Start:
UpSizeR (http://www.comp.nus.edu.sg/~upsizer)
* single-server version
* Hadoop version
