- Introduction to the R programming environment and to genomic data (DNA-seq, RNA-seq, ChIP-seq, and CLIP, and the kinds of data they yield). Introduction to the specific dataset that will be used throughout the course.
- Introduction to variables, data structures, and their representation in R (vectors, lists, hash tables, more complex data structures, a bit on complexity); saving and loading files.
- Visualization of one- and two-dimensional data using base graphics and ggplot2 (bar plots, scatter plots, box plots, customization). How to check the distribution of your data: Q-Q plots.
- Introduction to the UCSC Genome Browser and IGV: loading and visualizing your data vis-a-vis public data and annotations.
- First steps in data analysis: loading, filtering, subsetting, and looking at correlations. Statistical tests for differences (t-test, Wilcoxon, Kolmogorov-Smirnov) and how to run them in R. Application to the RNA-seq data.
- Control flow: conditionals, loops and the apply family (for, apply, lapply, sapply, tapply, mapply), and functions.
- Text and regular expressions, grep, basic sequence analysis (seqLogo package).
- Merging datasets, e.g. how to combine ChIP-seq peaks with RNA-seq data in various ways.
- Multi-dimensional data: normalization, clustering (hierarchical/biclustering), PCA, and visualization.
- Bioconductor: introduction and some sample packages (seqLogo? edgeR?). Models for the significance of differential expression in RNA-seq data.
- Building interactive interfaces using Shiny.
- Machine learning: what it means and what classifiers are. ROC curves. Feature selection. What to be aware of. How to train a simple SVM. Application to the RNA-seq data. Cross-validation, etc. Visualization of ROC curves.
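
As a rough illustration of the ROC material in the last item, here is a minimal base-R sketch. It is an assumption-laden example, not course material: the labels and scores are simulated rather than taken from the course RNA-seq dataset, no SVM is trained (the score simply stands in for a classifier's output), and all variable names are made up for the example.

```r
# Simulated data only: 200 hypothetical samples, half per class.
set.seed(1)
n <- 200
label <- rep(c(0, 1), each = n / 2)
# Stand-in for a classifier score: positives are shifted upward by 1.
score <- rnorm(n, mean = ifelse(label == 1, 1, 0))

# Sort samples by decreasing score; each prefix of this ordering
# corresponds to one decision threshold.
ord <- order(score, decreasing = TRUE)
tpr <- c(0, cumsum(label[ord] == 1) / sum(label == 1))  # true positive rate
fpr <- c(0, cumsum(label[ord] == 0) / sum(label == 0))  # false positive rate

# Area under the ROC curve by the trapezoidal rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# Plot the curve against the diagonal expected for a random classifier.
plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate",
     main = sprintf("ROC curve (AUC = %.2f)", auc))
abline(0, 1, lty = 2)
```

When the scores have no ties, the AUC computed this way should equal the Wilcoxon rank-sum statistic divided by the product of the two class sizes, which ties this item back to the statistical tests covered earlier.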