Computing for Research I
Spring 2014
Lecture 1: January 8
Primary Instructor: Elizabeth Garrett-Mayer
Description: Students learn to use the primary statistical software packages for data manipulation and analysis, including (but not limited to) R, R Bioconductor, SAS, and Stata. Additionally, students will learn how to use the division's high-speed cluster-computing environment, how to practice the principles of reproducible research using Sweave in R, how to use LaTeX and BibTeX for manuscript and presentation development, and how to create and maintain a website. This is a three-credit course.
Course Organization: This course is given by faculty and students in the Department of Public Health Science. Instructors will take turns giving lectures in their areas of expertise. Dr. Garrett-Mayer is the primary instructor and the course director.
Textbooks: No textbook. Reading material (primarily found on the web) will be provided as necessary.
Prerequisites: Biometry 700
Grading: Instructors will give short exercises to be completed and turned in to the primary instructor by the Wednesday of the week following the one in which they were assigned (e.g., assignments given on Mon Feb 3 and Wed Feb 5 are both due on Wed Feb 12). The assignments count equally and together make up 75% of the course grade. There will be a final project, which will account for 20% of the course grade. The remaining 5% of the course grade will reflect class participation.
Homework Policy: Homework is due by 5pm on the due date. All homework should be emailed to the primary instructor (firstname.lastname@example.org) or turned in at lecture time. Asking for extensions on homework is discouraged. However, it is expected that, on occasion, extenuating circumstances may arise. Therefore, the policy is that each student may request an extension on homework twice, and each extension is to be no more than 2 days. After using two extensions, no more extensions will be granted except with a medical note.
Attention to material: Laptops are permitted in class, but it is expected that if they are used, it is to follow along with the lecture. Email and web browsers should not be used during class time. Checking phones during lecture is not acceptable. The instructors are giving their time and expertise. Be respectful and give them your attention.
Classroom disruptions: Many of us have small children and others whom we need to be able to be in contact with during lectures. It is acceptable to bring pagers or cell phones to class. Please be sure they are on silent mode. If you need to leave during lecture to take or make a phone call, please do so. However, this should be a relatively rare occurrence. Texting and emailing during lecture time is not acceptable.
Violations of classroom etiquette policies will result in a 0 for class participation.
Office Hours: The primary instructor will be available by appointment. Liqiong Fan will also have office hours. However, given the nature of the course, the primary instructor (or TA) may not be knowledgeable regarding all of the topics covered. As a result, you may need to seek additional help from the lecturers to complete assignments. Be considerate and responsible in scheduling time with course instructors and recognize that they all have busy schedules.
Upon successful completion of the course, the student will be able to:
  Import data, perform simple analyses, and produce graphical displays in Stata, SAS and R
  Create new functions or commands in each of R, Stata and SAS
  Generate professional-quality scientific manuscripts and presentations using LaTeX along with statistical software
  Perform standard power and sample size calculations using available software and simulations
  Operate the division’s cluster computer with batch computing
SAS
Data management
Stata
Website design
R
Sample size/power calculations
Batch processing
LaTeX + Sweave
We are meeting in a regular classroom
Bringing laptops is allowed
Data, code, etc. needed for class will be on the website prior to class
For optimal interface, install packages ASAP:
  R (http://cran.r-project.org/)
  Stata (DPHS helpdesk request)
  SAS (DPHS helpdesk request)
Create a bookmark to the course website:
  http://people.musc.edu/~elg26/teaching/statcomputing.2014/statcomputingI.2014.htm
Every lecturer will have his/her own style
Notes may be:
  Prepared ahead of time and posted
  Prepared and posted after the lecture
  Nonexistent
Lecture notes will NOT be printed by the instructors prior to lecture. If they are available and you would like a paper copy, it is your responsibility to print them out.
2014: to be a successful biostatistician/epidemiologist, you MUST be competent on the computer.
Historically: students learned in labs from (older) students
Moving forward: many options for analysis and generation of results
Efficiency in computing is essential.
Your computer IS your lab!
Data analysis software
In this course:
  R
  Stata
  SAS
Many other options:
SAS was conceived by Anthony J. Barr in 1966.
As a North Carolina State University graduate student from 1962 to 1964, Barr had created an analysis of variance modeling language. From 1966 to 1968, Barr developed the fundamental structure and language of SAS.
In January 1968, Barr and James Goodnight collaborated, integrating new multiple regression and analysis of variance routines developed by Goodnight into Barr's framework.
By 1971, SAS was gaining popularity within the academic community. One strength of the system was analyzing experiments with missing data, which was useful to the pharmaceutical and agricultural industries, among others.
In 1976, SAS Institute, Inc. was incorporated.
The latest version, SAS version 9.4, was released in July 2013.
SAS consists of a number of components, which organizations separately license and install as required.
Licenses expire! Software cannot be used after expiration (unless renewed)
Why (or why not) SAS?
Most commonly used in pharma (although that may be changing!)
“FDA likes SAS”: truth or myth?
Many jobs for MS statisticians and/or epidemiologists require SAS expertise
The most common language
Becoming less the choice of academia
Updates are less frequent than freeware
‘Pros’ of competitors are starting to outweigh the ‘pros’ of SAS
Licensing costs
Slow to add new functionality
Lack of consistency with syntax
Learning curve is slower than other programs that now have similar capability
Stata is a general-purpose statistical software package created in 1985 by StataCorp.
Most of its users work in research, especially in the fields of economics, sociology, political science, biomedicine and epidemiology.
Relatively simple to learn yet powerful
Latest version is Stata 13 (released June 2013).
Lots of add-ons for epi users
Why (or why not) Stata?
Relatively inexpensive (especially as student or single-user)
Biomedical focus: output and functions are tailored to medical research
Fast and big: can handle and manipulate large datasets
Sophisticated with wide range of tools
Easy to learn language with consistent syntax
Graphics are not as good as other packages (although that has improved)
Programming (simulations, loops, etc.) is more challenging
R is a programming language and software environment for statistical computing and graphics.
The R language has become a de facto standard among statisticians for the development of statistical software, and is widely used for data analysis.
R is an implementation of the S programming language. S was created by John Chambers while at Bell Labs. R was created by Ross Ihaka and Robert Gentleman, and is now developed by the R Development Core Team. R is named partly after the first names of the first two R authors, and partly as a play on the name of S.
R source code is freely available under the GNU General Public License.
The capabilities of R are extended through user-submitted packages, which allow specialized statistical techniques, graphical devices, as well as import/export capabilities to many external data formats.
A core set of packages is included with the installation of R, with more than 5000 (as of January 2013) available at the Comprehensive R Archive Network (CRAN).
The most recent version is R 3.0.2, released September 2013.
Freeware: latest version can be installed anywhere at any time
Packages (a.k.a. libraries) that are user-contributed allow additional features/commands
Relatively simple interface
RStudio provides a nicer interface and is gaining in popularity.
Why (or why not) R?
Great for programming and simulations
Handles looping well
Flexible language
FREE!
User-contributed packages included in real time (i.e., no delay in their availability)
Most PhD Biostatistics programs teach their students R and many/most academic statisticians in top programs use R.
Interfaces nicely with other programs such as LaTeX (Sweave), WinBUGS, C, Emacs.
Can be clunky for data management.
Memory handling is not as good as in SAS and Stata
Quality control on user-contributed packages not evident
Not a question of which one.
Question is “for my current problem, which package makes the most sense to use?”
Each has strengths and weaknesses
Analysis of clean data is easy!
The real world: you will get messy data most of the time from your colleagues
Data management tools will help you:
  Deal with messy data
  Set up data capture approaches for your colleagues to minimize messiness
Excel, REDCap and general principles of data management for statistical analysis will be covered
LaTeX is a document markup language and document preparation system for the TeX typesetting program.
The term LaTeX refers only to the language in which documents are written, not to the editor used to write those documents. In order to create a document in LaTeX, a .tex file must be created using some form of text editor (e.g., WinEdt).
LaTeX is most widely used by mathematicians, scientists, engineers, philosophers, lawyers, linguists, economists, researchers, and other scholars in academia.
LaTeX is used because of the high quality of typesetting achievable by TeX.
The typesetting system offers extensive facilities for automating most aspects of typesetting and desktop publishing, including numbering and cross-referencing, tables and figures, page layout and bibliographies.
Sweave is a function in R that enables integration of R code into LaTeX documents. The purpose is "to create dynamic reports, which can be updated automatically if data or analysis change".
The data analysis is performed at the moment of writing the report, or more exactly, at the moment of compiling the Sweave code with Sweave (i.e., essentially with R) and subsequently with LaTeX. This can facilitate the creation of up-to-date reports for the author.
Because the Sweave files, together with any external R files that might be sourced from them and the data files, contain all the information necessary to trace back all steps of the data analyses, Sweave also has the potential to make research more transparent and reproducible to others. However, this is only the case to the extent that the author makes the data and the R and Sweave code available.
New this year: reproducible research lecture (by Dr. Hill) will be modified. Sweave may or may not be covered.
Sample size and power
We don’t really use textbook formulas anymore to do simple power calculations (just like we don’t really invert matrices by hand when we analyze data).
There are a number of packages that quickly and easily perform simple power calculations
R, SAS and Stata can do some.
But, packages like nQuery, EAST and PASS do a lot more.
In some non-standard settings, simulations are required to determine power.
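The mechanics of a simulation-based power calculation can be sketched in a few lines. The example below is written in Python, which is not one of the course packages, and is only an illustration under stated assumptions: two equal-sized groups, normally distributed outcomes with a known SD of 1, and a two-sided z-test. The function names (`simulated_power`, `analytic_power`) and the default seed and number of simulations are hypothetical choices made for this sketch. In such a standard setting the closed-form answer is available, so the code computes it as well to check the simulation against it.

```python
import random
from statistics import NormalDist, fmean


def analytic_power(n_per_group, effect_size, alpha=0.05):
    """Closed-form power of a two-sided two-sample z-test (SD assumed known, = 1)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic under the alternative.
    ncp = effect_size / (2 / n_per_group) ** 0.5
    return std_normal.cdf(ncp - z_crit) + std_normal.cdf(-ncp - z_crit)


def simulated_power(n_per_group, effect_size, alpha=0.05, n_sims=2000, seed=2014):
    """Estimate the same power by simulating many trials and counting rejections."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = (2 / n_per_group) ** 0.5  # SE of the difference in means when SD = 1
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    # 64 per group and a 0.5 SD difference: analytic power is about 0.81.
    print("analytic :", round(analytic_power(64, 0.5), 3))
    print("simulated:", simulated_power(64, 0.5))
```

In a non-standard setting only the data-generating step and the test inside the loop would change; the counting of rejections stays the same.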
It is important in this day and age to “market” yourself.
A personal website:
  allows you to show your best attributes
  makes you multidimensional (e.g., hobbies, background, etc.)
It will be important for gaining recognition and opportunities in your field and for making your own work available.
It isn’t hard, but you do need to learn some skills to set up and maintain your own site.
Before getting started…
Types of files involved in statistical computing
  Data files
  Results files
  Command/batch files
  Function files
  Graphics files
  + more(?)
TIPS:
  Develop a common nomenclature for naming files and folders
  Organize projects within folders
Organization is key!
DO NOT overwrite old files (especially data files)
Save with a new name
  Mousedata.xls (file sent from colleague)
  Mousedata.clean.xls (your clean version of the data)
Use a consistent approach, but think ahead
Naming files *.new.* is not a good idea. You may have a new ‘new’ next week
Numerics are good, but if you think you may need more than 9 versions, consider how data2 and data10 would be alphabetized.
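The data2/data10 problem is easy to see by sorting a few names. The sketch below uses Python purely for illustration; the file names and the helper `padded_name` are hypothetical, and zero-padding is one possible remedy, not the only one.

```python
def padded_name(stem, version, ext, width=2):
    """Build a versioned file name with a zero-padded version number."""
    return f"{stem}{version:0{width}d}.{ext}"


# Plain numerals: alphabetical order puts version 10 before version 2.
plain = [f"data{v}.csv" for v in (1, 2, 10)]
print(sorted(plain))   # ['data1.csv', 'data10.csv', 'data2.csv']

# Zero-padded numerals: alphabetical order matches version order.
padded = [padded_name("data", v, "csv") for v in (1, 2, 10)]
print(sorted(padded))  # ['data01.csv', 'data02.csv', 'data10.csv']
```

A width of 2 covers up to 99 versions; choosing the width up front is the "think ahead" part.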
For each Principal Investigator I work with, I have a folder
Within the PI folder, for each project, I have a folder
For each time I get a new dataset (or work on a new grant) for that project, I have a folder named with month and year
Example:
  I:\MUSC Oncology\Kraft, Andrew\VelcadeTrial\May2008
  I:\MUSC Oncology\Kraft, Andrew\R01 June 2007
Within each folder of data analysis or grant development calculations, I use the same naming conventions for files:
  Rbatch.R: a set of R commands that implement all of the computation or analyses
  Rfunctions.R: a set of R functions that are used by the batch file
I always save the original data file from the investigator before making any changes
I add ‘clean’ to the data filename and save it as a .csv before use (e.g. mousedata.clean.csv)
My Rbatch.R files always include a line sourcing in the data, including the folder where the data resides.
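The ‘clean’ naming convention can be applied mechanically, so the original file is never the one being written to. The following is a minimal sketch in Python, not the R batch file described above. The helper `clean_csv_path` is hypothetical, and placing Mousedata.xls inside the May2008 project folder is an assumption made only for this example.

```python
from pathlib import PureWindowsPath


def clean_csv_path(original):
    """Return the path of the cleaned .csv that sits beside the original file.

    Hypothetical helper: 'Mousedata.xls' becomes 'Mousedata.clean.csv' in the
    same folder, so the investigator's file is left untouched.
    """
    original = PureWindowsPath(original)
    return original.with_name(original.stem + ".clean.csv")


# Assumed location of the colleague's file within the project folder.
raw = PureWindowsPath(r"I:\MUSC Oncology\Kraft, Andrew\VelcadeTrial\May2008\Mousedata.xls")
clean = clean_csv_path(raw)

print(clean.name)                  # Mousedata.clean.csv
print(clean.parent == raw.parent)  # True
```

Keeping the full folder in the path mirrors the habit of sourcing the data with its folder included, so the batch file documents where its input came from.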
Friends in Statistical Computing
Google is your friend
‘Help’ functions and ‘see also’ links are your friends
‘Examples’ are your friends
Your fellow students are your friends
Friends help friends figure out statistical computing!
Example 1: SPSS is not included in this curriculum.
  Can you ever use it? YES!
  Will you be able to learn it better and faster after having taken this course? YES!
Example 2: We will probably not cover the R package nnc (Nearest Neighbor Autocovariates)
  Does that mean you need to find someone to teach it to you? NO!
  Will you be able to teach it to yourself? YES!
Example 3: None of your instructors are computer scientists.
  Does this mean that they are not qualified to teach you? NO!
  Most of them are self-taught with regard to these techniques
Final Thoughts for Today
PhD PROGRAMS: THE TRAINING OF INDEPENDENT RESEARCHERS!
THIS COURSE WILL POINT YOU IN THE RIGHT DIRECTION AND PROVIDE A SET OF TOOLS
IT IS YOUR JOB TO MAKE THEM FIT TOGETHER AND USE THEM AS A LAUNCHING PAD TO SOLVE PROBLEMS
Next up: Intro to SAS on Monday!
Some background info on R, SAS, Stata, LaTeX and Sweave was all pilfered from Wikipedia.