Course Identification

Title:

Principles and practice of large scale data analysis using R

Code:

20202081

Lecturers and Teaching Assistants

Lecturers:

Prof. Emmanuel Levy, Prof. Schraga Schwartz, Prof. Igor Ulitsky

TAs:

Dr. Miguel Angel Garcia Campos, Dr. Yaara Finkel, Dr. Benjamin Dubreuil

Course Schedule and Location

Year:

2020

Semester:

First Semester

When / Where:

Monday, 14:15 - 16:00, WSoS, Rm C

Tutorials
Thursday, 14:00 - 15:45, WSoS, Rm C

First Lecture:

04/11/2019

Field of Study, Course Type and Credit Points

Chemical Sciences: Lecture; Elective; Regular; 3.00 points
Life Sciences: Lecture; Elective; Regular; 3.00 points
Life Sciences (Molecular and Cellular Neuroscience Track): Lecture; Elective; Regular; 3.00 points
Life Sciences (Brain Sciences: Systems, Computational and Cognitive Neuroscience Track): Lecture; Elective; Regular; 3.00 points
Life Sciences (Computational and Systems Biology Track): Lecture; Elective; Core; 3.00 points

Comments

* On Nov 4th, 2019, the lecture will begin at 14:30.
* On Feb 3rd, 2020, the lecture will be held from 13:10 to 14:55.

Prerequisites

None

Restrictions

Participants:

32

Language of Instruction

English

Registration by

27/10/2019

Attendance and participation

Expected and Recommended

Grade Type

Pass / Fail

Grade Breakdown (in %)

Assignments:

50%

Final:

50%

Evaluation Type

Final assignment

Scheduled date 1

Date / due date

N/A

Location

N/A

Time

-

Remarks

N/A

Estimated Weekly Independent Workload (in hours)

2

Syllabus

Introduction to the R programming environment and to genomic data (DNA-seq, RNA-seq, ChIP-seq, CLIP and what kind of data they yield). Introduction to the specific dataset that will be followed throughout the course.
Introduction to variables, data structures and their representation in R (vectors, lists, hash tables, more complex data structures, a bit on complexity), and saving/loading files.
Visualization of one- and two-dimensional data using base R graphics and ggplot2 (barplot, scatterplot, boxplot, customization). Checking the distribution of your data with Q-Q plots.
Introduction to the UCSC genome browser and IGV - loading and visualizing your data vis-a-vis public data and annotations.
First steps in data analysis: loading, filtering, subsetting, and looking at correlations. Statistical tests for differences (t-test, Wilcoxon, KS) and how they are done in R. Application to the RNA-seq data.
Control flow, conditionals, loops and the apply family (for, apply, lapply, tapply, sapply, mapply), and functions.
Text and regular expressions, grep, basic sequence analysis (seqLogo package)
Merging datasets - e.g. how to combine peaks from ChIP-seq with RNA-seq data in various ways.
Multi-dimensional data: normalization, clustering (hierarchical/biclustering), PCA, and visualization.
Bioconductor - introduction and some sample packages (possibly seqLogo and edgeR). Models for the significance of differential expression in RNA-seq data.
Building interactive interfaces using Shiny
Machine learning - what it means and what classifiers are. ROC curves and their visualization. Feature selection. What to be aware of. How to train a simple SVM. Cross-validation. Application to the RNA-seq data.

Learning Outcomes

Upon successful completion of this course students will be able to:

Proficiently use the R programming language and supporting tools to analyze genomic datasets such as RNA-seq and to extract biological insights from them.

Reading List

N/A

Website

N/A