Course Identification

Title:

Principles and practice of large scale data analysis using R

Code:

20182101

Lecturers and Teaching Assistants

Lecturers:

Prof. Emmanuel Levy, Prof. Schraga Schwartz, Prof. Igor Ulitsky

TAs:

Dr. Benjamin Dubreuil, Dr. Miguel Angel Garcia Campos

Course Schedule and Location

Year:

2018

Semester:

First Semester

When / Where:

Sunday, 14:15 - 16:00, WSoS, Rm C

First Lecture:

29/10/2017

Field of Study, Course Type and Credit Points

Chemical Sciences: Lecture; Elective; 2.00 points
Life Sciences: 2.00 points
Life Sciences (Systems Biology Track): 2.00 points
Life Sciences (Molecular and Cellular Neuroscience Track): 2.00 points
Life Sciences (Brain Sciences: Systems, Computational and Cognitive Neuroscience Track): 2.00 points

Comments

N/A

Prerequisites

None

Restrictions

Participants:

32

Language of Instruction

English

Attendance and participation

Expected and Recommended

Grade Type

Numerical (out of 100)

Grade Breakdown (in %)

Assignments:

50%

Final:

50%

Evaluation Type

Final assignment

Scheduled date 1

Date / due date

N/A

Location

N/A

Time

N/A

Remarks

N/A

Estimated Weekly Independent Workload (in hours)

2

Syllabus

Introduction to the R programming environment and to genomic data (DNA-seq, RNA-seq, ChIP-seq, CLIP, and the kinds of data they yield). Introduction to the specific dataset that will be followed throughout the course.
Introduction to variables, data structures, and their representation in R (vectors, lists, hash tables, more complex data structures, and a brief look at complexity); saving and loading files.
Visualization of one- and two-dimensional data using base R graphics and ggplot2 (barplot, scatterplot, boxplot, customization). Checking the distribution of your data with Q-Q plots.
Introduction to the UCSC genome browser and IGV - loading and visualizing your data vis-a-vis public data and annotations.
First steps in data analysis: loading, filtering, subsetting, and looking at correlations. Statistical tests for differences (t-test, Wilcoxon, Kolmogorov-Smirnov) and how they are done in R. Application to the RNA-seq data.
Control flow, conditionals, loops (for, apply, lapply, tapply, sapply, mapply), and functions.
Text and regular expressions, grep, basic sequence analysis (seqLogo package).
Merging datasets - e.g. how to combine peaks from ChIP-seq with RNA-seq data in various ways.
Multi-dimensional data: normalization, clustering (hierarchical/biclustering), PCA, and visualization.
Bioconductor: introduction and some sample packages (possibly seqLogo and edgeR). Models for significance of differential expression in RNA-seq data.
Building interactive interfaces using Shiny.
Machine learning: what it means and what classifiers are. ROC curves. Feature selection. What to be aware of. How to train a simple SVM. Application to the RNA-seq data. Cross-validation. Visualization of ROC curves.

Learning Outcomes

Upon successful completion of this course students will be able to:

Use the R programming language and supporting tools proficiently to analyze genomic datasets such as RNA-seq and extract biological insights from them.

Reading List

N/A

Website

N/A