Big Data around UW-Madison
Brian S. Yandell, UW-Madison
www.stat.wisc.edu/~yandell
10/21/2011
1
Big Data © Brian Yandell
Our networks are awash in data. A little of it is information. A smidgen of this shows up as knowledge. Combined with ideas, some of that is actually useful. Mix in experience, context, compassion, discipline, humor, tolerance, and humility, and perhaps knowledge becomes wisdom.
Cliff Stoll (1995, Silicon Snake Oil)
PBS NOVA (1990), "The KGB, the Computer, and Me"
Why do we care?
- Comply with federal, private grants (audit risks)
  - security, confidentiality, access, IRB, FERPA
  - demands by NSF for data plans, NIH to publish data
- Make efficient use of scarce resources
  - save time, money, people
  - reduce errors: detection, correction
  - reduce duplication of effort in separate research projects
- Facilitate reproducible research
  - redo calculations years later
  - compare old to new data/methods
  - document data study steps in detail
- Share data within and among project groups
- Visualization: move quickly from data to insight
- Keep up with growing size: tera/peta/exa/zetta-bytes
Wikipedia definition of “big data”
- datasets that grow so large that they become awkward to work with
  - capture, storage, search, sharing, analytics, visualizing
- on-hand database management tools are inadequate
  - relational databases
  - desktop statistics/visualization packages
  - requires "massively parallel software running on" 10-1000 servers
- current limits: terabytes, exabytes and zettabytes
- increasingly gathered by ubiquitous information-sensing mobile devices
  - aerial sensory technologies, wireless sensor networks, software logs
  - cameras, microphones, RFID readers
- benefits of working with larger and larger datasets
  - allows analysts to "spot business trends, prevent diseases, combat crime."
- Subject areas (now)
  - meteorology, genomics, connectomics, complex physics simulations
  - biological and environmental research, Internet search
  - finance, business informatics
  - (social) media, publications, audio, video, interactive gaming
What types of data matter? (everything)
- Metadata: data about data
  - data description
  - study plan, experimental design
  - diagnostics and analytics: plans, tools, scripts
- Molecular
  - DNA, RNA, protein sequences and 2/3/4-D structure
  - transcriptomic, proteomic, metabolomic
  - interactomes, pathways and networks
- Spatial/temporal
  - images at all scales, static and dynamic
  - geospatial alliance, biomedical imaging, networks
  - point/line/object data
- Population-based
  - socioeconomic, cultural, political, health
  - transportation, financial, linguistic
- Methodology
  - code, algorithms, pipelines, workflows, user interfaces
  - visualization: static, dynamic, interactive
  - publication instruments: papers, graphs, audio, video, interactive
  - reproducible research tools
Enterprise Storage System (Confluence use at Biomedical Computing Group)
- direct access to "snapshots" of data
- 12TB total with quick expansion to 24TB usable
- institutional cost model
- Work-spaces
  - bring some sanity and structure
  - customize specifically for user needs
Enterprise Work-Spaces
- User Work-Spaces
  - user specific, access limited
  - disk quota: 25GB for "home directory" work-spaces
- Project Work-Spaces
  - shared work-spaces for data sets/files shared among team members/co-workers
  - data retention, backups, archiving and access controls are strictly controlled
  - Example: SDAC drug study
- Computational Work-Spaces
  - shared work-spaces for high throughput computational usage
  - less strict data retention and no archiving need
  - Example: many different users all accessing and updating many shared files
- Data Warehousing Work-Spaces
  - large data sets generally written once and read many times
  - local repository for extremely large (multi-terabyte) genetics or statistical data sets
  - no backup or retention requirements
  - can be re-fetched from another location
What do we need?
inference methods for data structures
"Computer Science has historically been strong on data structures and … Statistics has historically been … strong on inference from data. One way to draw on the strengths of both disciplines is to pursue the study of 'inferential methods for data structures'; i.e., methods that update probability distributions on recursively-defined objects such as trees, graphs, grammars and function calls."
Michael Jordan, UC-Berkeley (2010 UW lecture)
The translation gap between structure and inference
- Many tools emerging for data structure
  - Genomic, geospatial, ...
  - GMOD.org, .NET bio, … platforms
- Basic production inference being added
  - T-tests with FDR, enrichment
- Glue to bind resources (GenomeSpace)
  - UCSC genome browser, Cytoscape, Galaxy, …
- But state-of-the-art collaboration tools lag
  - Translate one-off code to pipeline
  - Build, maintain, enhance new workflows
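As a rough illustration of the "T-tests with FDR" step mentioned above, the following is a minimal sketch of one common false discovery rate method, the Benjamini-Hochberg step-up procedure. The slide does not say which FDR method the platforms use, so the choice of Benjamini-Hochberg, the function name `benjamini_hochberg`, and the example p-values are all assumptions made for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Illustrative sketch of the Benjamini-Hochberg step-up procedure
    (an assumed choice of FDR method, not taken from the slides).
    p_values: p-values from many tests (e.g. one t-test per gene).
    q: target false discovery rate.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank

    # Reject every hypothesis whose rank is at most k_max (step-up rule).
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values, made up for this example.
    example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(example, q=0.05))
```

In a pipeline, the input list would come from the per-feature t-tests, and the returned flags would select features for downstream steps such as enrichment analysis.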
What works and what doesn’t at UW? (the people dimension)
- What has been successful?
  - Cancer Informatics Shared Resource (30+ years)
    - Biostat & Med Info with Comp Cancer Ctr
  - other BMI collaborative research across campus (30+ years)
  - Biometry Program (30+ years)
    - Stat with CALS and later L&S (Bot, Zoo), VetMed (off and on)
  - Tech Partners (25 years)
  - Geospatial Alliance (25 years)
  - CIBM, GSTP, Biophysics training grants (10 years)
    - CS, Math, BMI with multiple collaborators
- What is missing?
  - link from Gene Expression Center to data analytics
  - “free” quantitative consulting across campus
    - experimental design, data analysis
    - informatics, workflows/pipelines
Who should be involved at UW-Madison?
- Chief Information Officers (CIOs) in all organizations
- Librarians
  - Academic and general library system
- Statisticians and biostatisticians
  - Develop methods for design and analysis
- Computer scientists
  - Design and build computers, databases, analytics
- Other data analytics fields
  - Departments: Stat, BMI, CS, ECE, ISyE, SLIS, BusInfo
  - Informatics experts in general
- Subject matter scientists
  - Omics, spatial, networks, languages
- Both faculty and staff
  - Build communication to foster ideas, collaboration
Who specifically at UW-Madison?
- Stat/BMI: Brian Yandell, Zhiguang Qian, Mark Craven
- CS: Myron Livny (CHTC), Michael Gleicher, Michael Ferris (ISyE, WID)
- Libraries: Dorothea Salo (RDS, SLIS), Lee Konrad (GLS)
- DoIT: Jan Cheetham, Alan Wolf
- CIOs: Phil Barak (Soils/CALS), Umberto Tachinardi (SMPH)
- Discipline scientists
  - Sandra Splinter BonDurant (Gene Expression Ctr)
  - George Phillips (CIBM, Biochem & CS)
  - Juan de Pablo (Biol Chem Engr)
  - Edgar Spalding (Botany)
  - Howard Veregin (State Cartographer, Geospatial Alliance)
  - Corinna Gries (LTER, Limnology)
  - Tom Mish (BCG/SMPH)
- Ex-officio: Bruce Maas (UW CIO), Katrina Forest (ITC)
Data Science at UW-Madison
who thinks about data for its own sake?
- Academic Programs
  - Statistics Department, L&S
  - Biostatistics & Medical Informatics Department, SMPH
  - Computer Science Department, L&S
  - Electrical & Computer Engineering Department, CoE
  - Industrial & Systems Engineering Department, CoE
  - Operations & Information Management Department, Business School
  - Library and Information Studies Department, SLIS
  - Biometry Program, CALS
  - Mathematics Department, L&S
- Research Groups
  - Cancer Informatics Shared Resource, SMPH
  - Computing & Biometry, CALS
  - Geospatial Alliance (formerly SIAC)
  - Gene Expression Center, Genome Center
  - Data & Information Service Center, Demography, Social Sciences (formerly DACC, DPLS)
- Administrative Groups
  - General Library System (GLS)
  - Biomedical Computing Group (BCG), SMPH
  - Division of Information Technology (DoIT)
  - Research Data Services (RDS)
  - Information Technology Committee (ITC)
  - Wisconsin Institutes of Discovery (WID/MIR)