Course Identification

Principles and practice of large scale data analysis using R

Lecturers and Teaching Assistants

Prof. Emmanuel Levy, Prof. Schraga Schwartz, Prof. Igor Ulitsky
Dr. Benjamin Dubreuil, Dr. Miguel Angel Garcia Campos

Course Schedule and Location

First Semester
Sunday, 14:15 - 16:00, FGS, Rm C

Field of Study, Course Type and Credit Points

Chemical Sciences: Lecture; Elective; 2.00 points
Life Sciences: 2.00 points
Life Sciences (Systems Biology Track): 2.00 points
Life Sciences (Molecular and Cellular Neuroscience Track): 2.00 points
Life Sciences (Brain Sciences: Systems, Computational and Cognitive Neuroscience Track): 2.00 points







Language of Instruction


Attendance and participation

Expected and Recommended

Grade Type

Numerical (out of 100)

Grade Breakdown (in %)


Evaluation Type

Final assignment

Scheduled date 1


Estimated Weekly Independent Workload (in hours)

Syllabus



  1. Introduction to the R programming environment and to genomic data (DNA-seq, RNA-seq, ChIP-seq, CLIP, and the kinds of data they yield). Introduction to the specific dataset that will be followed throughout the course.

  2. Introduction to variables, data structures and their representation in R (vectors, lists, hash tables, and more complex data structures, with a brief look at computational complexity). Saving and loading files.

  3. Visualization of one- and two-dimensional data using base R graphics and ggplot2 (bar plots, scatter plots, box plots, customization). Checking the distribution of your data with Q-Q plots.

  4. Introduction to the UCSC genome browser and IGV - loading and visualizing your data vis-a-vis public data and annotations.

  5. First steps in data analysis: loading, filtering, subsetting, and examining correlations. Statistical tests for differences (t-test, Wilcoxon, Kolmogorov-Smirnov) and how to run them in R. Application to the RNA-seq data.

  6. Control flow: conditionals, loops and the apply family (for, apply, lapply, sapply, tapply, mapply), and functions

  7. Text and regular expressions, grep, basic sequence analysis (seqLogo package)

  8. Merging datasets, e.g., how to combine ChIP-seq peaks with RNA-seq data in various ways.

  9. Multi-dimensional data: normalization, clustering (hierarchical/biclustering), PCA, and visualization.

  10. Bioconductor: introduction and sample packages (e.g., seqLogo, edgeR). Models for the significance of differential expression in RNA-seq data.

  11. Building interactive interfaces using Shiny

  12. Machine learning: what it means and what classifiers are. Feature selection and common pitfalls. Training a simple SVM and applying it to the RNA-seq data. Cross-validation. ROC curves and their visualization.

Learning Outcomes

Upon successful completion of this course students will be able to:

  1. Use the R programming language and supporting tools proficiently to analyze genomic datasets such as RNA-seq and extract biological insights from them.

Reading List