Course Identification

Principles and practice of large scale data analysis using R
20193051

Lecturers and Teaching Assistants

Prof. Schraga Schwartz, Prof. Igor Ulitsky, Prof. Emmanuel Levy
Dr. Benjamin Dubreuil, Dr. Miguel Angel Garcia Campos

Course Schedule and Location

2019
First Semester
Sunday, 14:15 - 16:00, FGS, Rm C

Tutorials
Tuesday, 09:15 - 11:00, FGS, Rm C
04/11/2018

Field of Study, Course Type and Credit Points

Life Sciences: Lecture; Elective; 3.00 points
Chemical Sciences: Lecture; Elective; 3.00 points
Life Sciences (Molecular and Cellular Neuroscience Track): Lecture; Elective; 3.00 points
Life Sciences (Brain Sciences: Systems, Computational and Cognitive Neuroscience Track): Lecture; Elective; 3.00 points
Life Sciences (Computational and Systems Biology Track): Lecture; Elective; Core; 3.00 points
Mathematics and Computer Science (Systems Biology / Bioinformatics): Lecture; Elective; 3.00 points

Comments

N/A

Prerequisites

None

Restrictions

60

Language of Instruction

English

Attendance and participation

Expected and Recommended

Grade Type

Numerical (out of 100)

Grade Breakdown (in %)

50%
50%

Evaluation Type

Final assignment

Scheduled date 1

N/A

Estimated Weekly Independent Workload (in hours)

4

Syllabus

  1. Introduction to the R programming environment and to genomic data (DNA-seq, RNA-seq, ChIP-seq, CLIP, and the kinds of data they yield).

  2. Introduction to variables, data structures, and their representation in R (vectors, lists, hash tables, more complex data structures, and a brief look at complexity); saving and loading files.

  3. Visualization of one- and two-dimensional data using base R graphics and ggplot2 (barplots, scatterplots, boxplots, customization). Checking the distribution of your data with Q-Q plots.

  4. First steps in data analysis: loading, filtering, subsetting, and examining correlations. Statistical tests for differences (t-test, Wilcoxon, Kolmogorov-Smirnov) and how to run them in R. Application to the RNA-seq data.

  5. Control flow: conditionals, loops and the apply family (for, apply, lapply, tapply, sapply, mapply), and functions.

  6. Text and regular expressions, grep, and basic sequence analysis (seqLogo package).

  7. Merging datasets, e.g. how to combine peaks from ChIP-seq with RNA-seq data in various ways.

  8. Multi-dimensional data: normalization, clustering (hierarchical/biclustering), PCA, and visualization.

  9. Bioconductor: introduction and some sample packages (e.g. seqLogo, edgeR). Models for the significance of differential expression in RNA-seq data.

  10. Building interactive interfaces using Shiny.

  11. Machine learning: what it means and what classifiers are. Feature selection and what to be aware of. How to train a simple SVM, with application to the RNA-seq data. Cross-validation, etc. ROC curves and their visualization.

Learning Outcomes

Upon successful completion of this course students will be able to:

Use the R programming language to analyze and extract insights from a multidimensional dataset, such as one produced by an RNA-seq or ChIP-seq experiment.

Reading List

N/A

Website

N/A