Warning

This document is based on 이광춘 (2019-12-28), “Computational Documents - 개요서(Compendium) 시작하며”.

1 Starting Point

The formal starting point can be traced to the papers that Robert Gentleman and Duncan Temple Lang published in 2004.¹ ²

¹ “Statistical Analyses and Reproducible Research” Bioconductor Project Working Papers

² “Reproducible Research: A Bioinformatics Case Study” in Statistical Applications in Genetics and Molecular Biology

Tips

We introduce the concept of a compendium as both a container for the different elements that make up the document and its computations (i.e. text, code, data,…), and as a means for distributing, managing and updating the collection. - Gentleman, R. and Temple Lang, D. (2004)

R packages can serve as research compendia (including code, data, and outputs) for reproducible data analysis projects.

2 The Paper and `ropensci` Progress

The paper “Packaging Data Analytical Work Reproducibly Using R (and Friends)”, published online in 2018 by Ben Marwick, Carl Boettiger, and Lincoln Mullen, lays out step by step how to write a reproducible paper (Marwick et al., 2018). There was also a community call on how to pursue reproducible research within the R ecosystem. ³

³ ropensci, “Community Call: Reproducible Research with R”

3 R Packages and Compendia

R projects, R packages, and compendia share the same goal, but they also differ in some respects. In data science work, what matters most is bundling code, data, and results under a single compendium and using it to advance reproducible science and technology. ⁴ ⁵

⁴ jennybc, “Use of an R package to facilitate reproducible research”

⁵ Francisco Rodriguez-Sanchez, “Structuring data analysis projects as R packages”

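Organizing a good data science project
- All files are kept tidy under a single directory.
- Raw data is stored in its own folder.
- Cleaned data is produced by R scripts.
- Functions are stored separately from analysis scripts.
- Functions are well documented and (unit) tested.
- Outputs are kept apart from the code and treated as disposable.
- A Makefile runs the analysis in the proper order.
- A README file gives an overview of the project.
- R code and Rmd documents are under version control with git.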
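
As a minimal sketch of the rule that cleaned data should come from an R script, the base R snippet below writes a tiny made-up raw file, cleans it, and saves the result. The file names (`survey.csv`, `survey.rda`), the column names, and the temporary project directory are assumptions for illustration only; in a real compendium the script would read the actual file under `data-raw/`.

```r
# Hypothetical project root; a temporary directory keeps the sketch self-contained.
project <- file.path(tempdir(), "compendium-demo")
dir.create(file.path(project, "data-raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data"), showWarnings = FALSE)

# Made-up raw data standing in for a real file under data-raw/.
raw_path <- file.path(project, "data-raw", "survey.csv")
writeLines(c("site,count", "A,10", "B,NA", "C,7"), raw_path)

# Cleaning step: read the raw file and drop rows with a missing count.
raw <- read.csv(raw_path, stringsAsFactors = FALSE)
survey <- raw[!is.na(raw$count), ]

# The raw file is never modified; the cleaned copy goes to data/.
clean_path <- file.path(project, "data", "survey.rda")
save(survey, file = clean_path)
```

Because the cleaned file can always be regenerated from the raw file and the script, it can be deleted and rebuilt at any time.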


project/
|-+ data-raw/   # raw data
|-+ data/       # cleaned data (generated by R scripts)
|-+ R/          # functions
|-+ man/        # function documentation (generated by Roxygen)
|-+ tests/      # tests (functions, Rmd)
|-+ vignettes/  # analyses, manuscripts, reports, etc. (written in Rmd)
|
|- Makefile    # master script that automates the analysis
|- DESCRIPTION # metadata and dependencies
|- README      # project overview
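
The layout above can be created by hand, but a short base R sketch shows how little is involved. `create_compendium_skeleton()` is a hypothetical helper written for this illustration, not a function from any package, and it only creates empty placeholder files.

```r
# Create the compendium skeleton shown above under `root`.
# Hypothetical helper for illustration; not part of any package.
create_compendium_skeleton <- function(root) {
  dirs <- c("data-raw", "data", "R", "man", "tests", "vignettes")
  files <- c("Makefile", "DESCRIPTION", "README")
  for (d in dirs) {
    dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
  }
  for (f in files) {
    path <- file.path(root, f)
    if (!file.exists(path)) file.create(path)  # empty placeholder
  }
  invisible(root)
}

# Build the skeleton in a temporary location and list what was created.
root <- create_compendium_skeleton(file.path(tempdir(), "project"))
list.files(root)
```

An empty `DESCRIPTION` is not yet a valid package; in practice a helper such as `usethis::create_package()` fills in the required fields.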

Let us look at the benefits of implementing a compendium as an R package with the layout above.

- Reproducibility
- Consistent, standard, streamlined organisation
- Promotes modular, well-documented, and tested code
- Easy to share (zip, GitHub repo)
- Easy to install and run (dependencies)
- Use of the R package development machinery:
  - R CMD check
  - Continuous integration (Travis-CI)
  - Automatic code review with goodpractice
  - Easy creation of project websites with pkgdown

3.1 Well Begun Is Half Done

Well begun is half done: start small.


project
|- DESCRIPTION
|- README.md  
|- Metadata.txt
|
|- data/                
|   +- 2014ParasiteSurveyJustBrood.csv
|   +- CedarBPLifeTable2014.csv
|   +- NorthBPLifeTable2013.txt
|   +- NorthBPLifeTable2014.csv
|
|- analysis/
|   +- CodeforBPpaper.R
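
The one file in this small layout that has a fixed format is `DESCRIPTION`. As a hedged sketch, base R's `write.dcf()` can produce it; every field value below (package name, title, version, license) is a placeholder assumption, not metadata from the project shown above.

```r
# Hypothetical metadata; every field value here is a placeholder.
desc <- data.frame(
  Package = "mycompendium",
  Title = "Compendium for a Small Reproducible Analysis",
  Version = "0.0.1",
  Description = "Data and analysis script for a small reproducible project.",
  License = "GPL-3",
  stringsAsFactors = FALSE
)

# Write into a temporary directory so the sketch is self-contained.
project <- file.path(tempdir(), "start-small")
dir.create(project, showWarnings = FALSE)
desc_path <- file.path(project, "DESCRIPTION")

# write.dcf() writes the "Field: value" format that DESCRIPTION files use.
write.dcf(desc, file = desc_path)

# Read it back to confirm that the file parses.
read.dcf(desc_path, fields = c("Package", "Version"))
```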

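3.2 Package Development

Package development: the next step adds `NAMESPACE`, `R/`, `man/`, `vignettes/`, and `inst/` to the project.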


project
|- DESCRIPTION
|- NAMESPACE
|- README.md
|- LakeTrophicModelling.Rproj
|
|- R/
|   +- LakeTrophicModelling-package.r
|   +- class_prob_rf.R
|   +- condAccuracy.R
|   +- crossval_rf.R
|   +- density_plot.R
|   +- ecdf_ks_ci.R
|   +- ecor_map.R
|   +- getCyanoAbund.R
|   +- getLakeIDs.R
|   +- importancePlot.R
|
|- man/
|   +- class_prob_rf.Rd
|   +- condAccuracy.Rd
|   +- crossval_rf.Rd
|   +- density_plot.Rd
|   +- ecdf_ks_ci.Rd
|   +- ecor_map.Rd
|   +- getCyanoAbund.Rd
|   +- getLakeIDs.Rd
|   +- importancePlot.Rd
|
|- data/                
|   +- LakeTrophicModelling.rda
|
|- vignettes/
|
|- inst/
|   +- doc/
|      +- manuscript.pdf
|   +- extdata/
|      +- ltmData.csv
|      +- data_def.csv

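3.3 CI/CD and Docker

Adding a Dockerfile makes the computing environment reproducible as well, and adding .travis.yml sets up a CI/CD environment. Adding tests/ makes it possible to try test-driven development (TDD), which automates manual verification and can substantially improve both productivity and quality.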


project
|- DESCRIPTION          # project metadata and dependencies 
|- README.md            # top-level description of content and guide to users
|- NAMESPACE            # exports R functions in the package for repeated use
|- LICENSE              # specify the conditions of use and reuse of the code, data & text
|- .travis.yml          # continuous integration service hook for auto-testing at each commit
|- Dockerfile           # makes a custom isolated computational environment for the project
|
|- data/                # raw data, not changed once created
|  +- my_data.csv       # data files in open formats such as TXT, CSV, TSV, etc.
|
|- analysis/            # any programmatic code
|  +- my_report.Rmd     # R markdown file with narrative text interwoven with code chunks 
|  +- makefile          # builds a PDF/HTML/DOCX file from the Rmd, code, and data files
|  +- scripts/          # code files (R, shell, etc.) used for data cleaning, analysis and visualisation 
|
|- R/                     
|  +- my_functions.R    # custom R functions that are used more than once throughout the project
|
|- man/
|  +- my_functions.Rd   # documentation for the R functions (auto-generated when using devtools)
|
|- tests/
|  +- testthat.R        # unit tests of R functions to ensure they perform as expected
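
To make the `tests/` entry concrete, here is a minimal sketch of a unit test written with testthat. The `standardize()` function is a hypothetical example of something that could live in `R/my_functions.R`; it does not come from the original project.

```r
# Hypothetical function that could live in R/my_functions.R.
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

# Hypothetical test that could live under tests/testthat/.
# Guarded so the sketch still runs where testthat is not installed.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("standardize() centres and scales", {
    z <- standardize(c(2, 4, 6, 8))
    testthat::expect_equal(mean(z), 0)
    testthat::expect_equal(sd(z), 1)
  })
}
```

In the conventional testthat setup, `tests/testthat.R` only calls `testthat::test_check()` with the package name, and the individual tests sit in `tests/testthat/test-*.R` files, so a CI service can run them on every commit.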

4 Templates by Use Case

5 `rrtools` Workshop

“Reproducible Research in R with rrtools”, 31st October, Northwest Universities R Day
- Common: devtools
- Windows: Rtools
- macOS: Xcode
- Linux: r-devel or r-base-dev
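
Reading the list above as the workshop's setup requirements (an interpretation, since the source gives no further detail), the base R sketch below reports which toolchain applies to the current machine. `toolchain_hint()` is a hypothetical helper written for this illustration.

```r
# Map the operating system to the build toolchain listed above.
# Hypothetical helper for illustration; not part of any package.
toolchain_hint <- function(sysname = Sys.info()[["sysname"]]) {
  switch(sysname,
    Windows = "Rtools",
    Darwin = "Xcode",
    Linux = "r-devel or r-base-dev",
    "unknown platform: check your system's R build tools"
  )
}

# Report whether devtools is available and which toolchain to install.
message("devtools installed: ", requireNamespace("devtools", quietly = TRUE))
message("Toolchain to install: ", toolchain_hint())
```

This only names the toolchain; to check whether a working one is actually installed, `pkgbuild::has_build_tools()` is one option.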

References

Marwick, B., Boettiger, C., & Mullen, L. (2018). Packaging data analytical work reproducibly using R (and friends). The American Statistician, 72(1), 80–88. https://doi.org/10.1080/00031305.2017.1375986
