Course Identification

Principles and practice of large scale data analysis using R

Lecturers and Teaching Assistants

Prof. Emmanuel Levy, Prof. Schraga Schwartz, Prof. Igor Ulitsky
Dr. Miguel Angel Garcia Campos, Dr. Yaara Finkel, Dr. Benjamin Dubreuil

Course Schedule and Location

First Semester
Monday, 14:15 - 16:00, FGS, Rm C

Thursday, 14:00 - 15:45, FGS, Rm C

Field of Study, Course Type and Credit Points

Chemical Sciences: Lecture; Elective; Regular; 3.00 points
Life Sciences: Lecture; Elective; Regular; 3.00 points
Life Sciences (Molecular and Cellular Neuroscience Track): Lecture; Elective; Regular; 3.00 points
Life Sciences (Brain Sciences: Systems, Computational and Cognitive Neuroscience Track): Lecture; Elective; Regular; 3.00 points
Life Sciences (Computational and Systems Biology Track): Lecture; Elective; Core; 3.00 points


* On Nov 4th, 2019, the lecture will begin at 14:30.
* On Feb 3rd, 2020, the lecture will be held from 13:10 to 14:55.





Language of Instruction


Registration by


Attendance and participation

Expected and Recommended

Grade Type

Pass / Fail

Grade Breakdown (in %)


Evaluation Type

Final assignment

Scheduled date 1


Estimated Weekly Independent Workload (in hours)



Syllabus

  1. Introduction to the R programming environment and to genomic data (DNA-seq, RNA-seq, ChIP-seq and CLIP, and the kinds of data they yield). Introduction to the specific dataset that will be followed throughout the course.

  2. Introduction to variables, data structures and their representation in R (vectors, lists, hash tables, more complex data structures, and a brief look at computational complexity). Saving and loading files.

  3. Visualization of one- and two-dimensional data using the base graphics package and ggplot2 (barplot, scatterplot, boxplot, customization). Checking the distribution of your data with Q-Q plots.

  4. Introduction to the UCSC genome browser and IGV - loading and visualizing your data vis-a-vis public data and annotations.

  5. First steps in data analysis - loading, filtering, subsetting, and looking at correlations. Statistical tests for differences (t-test, Wilcoxon, Kolmogorov-Smirnov) and how they are performed in R. Application to the RNA-seq data.

  6. Control flow, conditionals, loops and the apply family (for, apply, lapply, tapply, sapply, mapply), and functions.

  7. Text and regular expressions, grep, and basic sequence analysis (seqLogo package).

  8. Merging datasets - e.g. how to combine peaks from ChIP-seq with RNA-seq data in various ways.

  9. Multi-dimensional data: normalization, clustering (hierarchical/biclustering), PCA, and visualization.

  10. Bioconductor - introduction and some sample packages (such as seqLogo and edgeR). Models for the significance of differential expression in RNA-seq data.

  11. Building interactive interfaces using Shiny.

  12. Machine learning - what it means and what classifiers are. Feature selection and common pitfalls to be aware of. How to train a simple SVM, with application to the RNA-seq data. Cross-validation. ROC curves and their visualization.

Learning Outcomes

Upon successful completion of this course students will be able to:

  1. Use the R programming language and supporting tools proficiently to analyze genomic datasets such as RNA-seq and extract biological insights from them.

Reading List