SpatialTranscriptomicsResearch · jfnavarro · Oct 11, 2017
diff --git a/CHANGELOG b/CHANGELOG
@@ -9,12 +9,12 @@ Version 0.3.0
 * Bug fixes
 * Optimized the normalization
 * Unsupervised allows to center size factors by mean
-* Unsupervised allows to computed adjuted log normalized counts
+* Unsupervised allows to compute adjusted log normalized counts
 * Unsupervised allows to compute the number of clusters automatically
 * Supervised allows to normalize the data
 * Supervised allows to input train/classes with different spots
   than in the train/test data and in different order
-* st_data_plotter allows to highlith selected spots
+* st_data_plotter allows to highlight selected spots
 * st_data_plotter only plots the spots where the gene is present
   when a gene reg-exp is given
 * st_data_plotter allows to normalize the data
@@ -35,4 +35,10 @@ Version 0.4.1
 * Fixed a bug in the noise filtering function
 
 Version 0.4.2
-* Added compatibility with Python 3
+* Added compatibility with Python 3
+
+Version 0.4.5
+* Added merge_replicates.py script
+* Added slice_regions_matrix.py script
+* Optimized and improved differential_analysis.py
+* Added compatibility with R 3.4 and rpy2 latest versions
diff --git a/LICENSE b/LICENSE
@@ -1,5 +1,5 @@
 The MIT License (MIT)
-Copyright (c) 2016 Jose Fernandez Navarro. 
+Copyright (c) 2017 Jose Fernandez Navarro, KTH. 
 
 Permission is hereby granted, free of charge, to any person obtaining 
 a copy of this software and associated documentation files (the "Software"), 

diff --git a/README.md b/README.md
@@ -1,10 +1,14 @@
 # Spatial Transcriptomics Analysis 
 
-Different tools for visualization, data processing and analysis (supervised and un-supervised learning, differential expression analysis, etc..) of Spatial Transcriptomics data (can also be used for single cell data).
+Different tools for visualization, data processing and analysis (supervised and un-supervised learning,
+differential expression analysis, etc.) of Spatial Transcriptomics datasets (can also be used for single cell data).
 
-The package is compatible with the output format of the data generated with the ST Pipeline (https://github.com/SpatialTranscriptomicsResearch/st_pipeline) and give full support to plot the data onto the tissue images but it is compatible with any single cell datasets where the data is stored as a matrix of counts (genes as columns and spot/cells as rows). 
+The package is compatible with the output format of the data generated with the
+ST Pipeline (https://github.com/SpatialTranscriptomicsResearch/st_pipeline) and gives full
+support to plot the data onto the tissue images, but it is also compatible with any single cell dataset
+where the data is stored as a matrix of counts (genes as columns and spots/cells as rows).
 
-This package makes use of the following tools:
+This package makes use of the following R packages:
 
 t-SNE
 https://github.com/lvdmaaten/bhtsne
@@ -27,9 +31,10 @@ See AUTHORS file.
 ### Contact
 For bugs, feedback or help you can contact Jose Fernandez Navarro <jose.fernandez.navarro@scilifelab.se>
 
-### Note
+### Input Format
 The referred matrix format is the ST data format, a matrix of counts where spot coordinates are row names
-and the genes are column names.
+and the genes are column names. This matrix format (.TSV) is generated with the
+[ST Pipeline](https://github.com/SpatialTranscriptomicsResearch/st_pipeline).
 
 The scripts that allow you to pass the tissue HE image can optionally take a 3x3 alignment file.
 If the images are cropped to the exact array boundaries the alignment file is not needed
@@ -44,12 +49,51 @@ Where each a correspondonds to a cell of the affine transformation matrix.
 
 ### Installation
 
-Note that the ST Analysis package requires R (https://cran.r-project.org/) installed in your system.
-To install the ST Analsysis packate just clone or download the repository, cd into the cloned folder and type:
+The ST Analysis package requires R (https://cran.r-project.org/). We recommend that you install the latest 3.4.x version. Once you have installed R you can open
+an R terminal or RStudio and type the following:
 
-    python setup.py install
+    source("https://bioconductor.org/biocLite.R")
+    biocLite("monocle")
+    biocLite("scran")
+    biocLite("DESeq2")
+    biocLite("Rtsne")
+    biocLite("edgeR")
 
-A bunch of scripts will then be available in your system.
+Before you install the ST Analysis package we recommend that you create a Python 3 virtual
+environment. We recommend [Anaconda](https://anaconda.org/anaconda/python).
+The latest versions of rpy2 (R binder for Python) are only compatible with Python 3.
+
+#### OSX
+The following instructions are for installing the ST Analysis package with Python 3.4 and Anaconda
+(should be the same for Python 3.6).
+Note: we advise updating Xcode to the latest version.
+
+    conda create -n python3.4 python=3.4
+    source activate python3.4
+    brew install freetype
+    brew install gcc
+    export CC=/usr/local/Cellar/gcc/7.2.0/bin/gcc-7
+    pip install rpy2
+    export CC=/usr/bin/clang
+    conda install matplotlib
+    conda install pandas
+    conda install scikit-learn
+    python setup.py install
+
+#### Linux
+The following instructions are for installing the ST Analysis package with Python 3.4 and Anaconda
+(should be the same for Python 3.6).
+Note: we advise installing and updating the developer tools packages.
+
+    conda create -n python3.4 python=3.4
+    source activate python3.4
+    pip install rpy2
+    conda install matplotlib
+    conda install pandas
+    conda install scikit-learn
+    python setup.py install
+
+A bunch of scripts (described below) will then be available in your system.
 Note that you can always type script_name.py --help to get more information
 about how the script works. 
 The ST Analysis package is compatible with Python 2 and 3 and we recomend to use
@@ -60,50 +104,79 @@ a virtual environment to make the installation of the dependencies easier.
 ## Analysis tools
 
 ### To do un-supervised learning
-To see how spots cluster together based on their expression profiles you can run : 
+To see how spots cluster together based on their expression profiles you can run:
 
     unsupervised.py --counts-table-files matrix_counts.tsv --normalization DESeq2 --num-clusters 5 --clustering KMeans --dimensionality tSNE --image-files tissue_image.JPG --use-log-scale 
 
-  The script can be given one or serveral datasets (matrices with counts). It will perform dimesionality reduction
-  and then cluster the spots together based on the dimesionality reduced coordinates. 
-  It generates a scatter plot of the clusters. It also generates an image for
-  each dataset of the predicted classes on top of the tissue image (tissue image for each dataset must be given and optionally 
-  an alignment file to convert to pixel coordiantes).
-  It also generate a file with the predicted classes for each spot that can be used in other analysis.
-
-  To know more about the parameters you can type --help 
+The script can be given one or several datasets (matrices with counts). It will perform dimensionality reduction
+and then cluster the spots together based on the dimensionality-reduced coordinates.
+It generates a scatter plot of the clusters. It also generates an image for
+each dataset of the predicted classes on top of the tissue image (a tissue image for each dataset must be given and optionally
+an alignment file to convert to pixel coordinates).
+It also generates a file with the predicted classes for each spot that can be used in other analyses.
+To know more about the parameters you can type --help
 
 ### To do supervised learning
 You can train a classifier with the expression profiles of a set of spots
-where you know the class (cell type) and then predict on a new dataset
-of the same tissue. For that you can use the following script :
+where you know the class (spot type) and then predict on a new dataset
+of the same tissue. For that you can use the following script:
 
     supervised.py --train-data data_matrix.tsv --test-data data_matrix.tsv --train-casses train_classes.txt --test-classes test_classes.txt --image tissue_image.jpg
 
-  This will generate some statistics, a file with the predicted classes for each spot and a plot of the predicted spots on top of the tissue image (if the image and the alignment matrix are given). 
-  The script can take several datasets for the training set and it allows to normalize the training and testing data.
-
-  To know more about the parameters you can type --help
+This will generate some statistics, a file with the predicted classes for each spot and a plot of
+the predicted spots on top of the tissue image (if the image and the alignment matrix are given).
+The script can take several datasets for the training set and it can normalize the training and testing data.
+The test/train classes file should look like:
+
+    XxY 1
+    XxY 1
+    XxY 2
+
+Where X is the spot X coordinate, Y is the spot Y coordinate, and 1, 1 and 2 are
+spot classes (regions).
+To know more about the parameters you can type --help
 
 ### To visualize ST data (output from the ST Pipeline) 
-Use the script st_data_plotter.py. It can plot ST data, it can use
-filters (counts or genes) it can highlight spots with reg. expressions
+Use the script st_data_plotter.py to plot ST data. It can use
+filters (counts or genes), it can highlight spots with regular expressions
 of genes and it can highlight spots by giving a file with spot coordinates
-and labels. You need a matrix with the gene counts by spot and optionally
-the a tissue image and an alignment matrix. A example run would be : 
+and labels. You can also normalize the data for visualization.
+You need a matrix of gene counts per spot and, optionally,
+a tissue image and an alignment matrix. An example run would be:
 
-    st_data_plotter.py --cutoff 2 --filter-genes Actb* --image tissue_image.jpg --alignment alignment_file.txt data_matrix.tsv
+    st_data_plotter.py --cutoff 2 --show-genes Actb* --image tissue_image.jpg data_matrix.tsv
 
-  This will generate a scatter plot of the expression of the spots that contain a gene Actb and with higher expression than 2 and it will use the tissue image as background. You could optionally pass a list of spots with their classes (Generated with unsupervised.py) to highlight spots in the scatter plot. More info if you type --help
+This will generate a scatter plot of the expression of the spots that contain a gene matching Actb
+with expression higher than 2, and it will use the tissue image as background.
+You could optionally pass a list of spots with their classes (generated with unsupervised.py)
+to highlight spots in the scatter plot. Type --help for more information.
 
+### To slice a matrix of counts based on regions of interest
+You can slice a dataset based on regions of interest (spots) obtained
+manually or with unsupervised.py. You need a file defining classes for each spot
+(unsupervised.py generates such files):
+
+    XxY 1
+    XxY 1
+    XxY 2
+
+Where X is the spot X coordinate, Y is the spot Y coordinate, and 1, 1 and 2 are
+spot classes (regions).
+An example run would be:
+
+    slice_regions_matrix.py --counts-matrix dataset.tsv --spot-classes classes.txt --regions 1 3
+
 ### To perform Differential Expression Analysis (DEA)
-You can perform a D.E.A using the output from unsupervised.py and a list of groups to where the D.E.A will be performed.
-The scripts generates different plots and the list of D.E genes in a text file. Basically the script
-needs one or more matrices of counts with ST data (genes as columns), a tab delimited file with two columns where
-the first column is a class and the second is a spot (for each input matrix) and finally the list of comparisions to be made
-from the classes present in the data (for example: 0:1-0:2 0:1-0:5). Where 0 refers to the first input dataseet and 1,2,5 refers to
-the classes defined the classes file.
-
-    differential_analysis.py --input-data stdata.tsv --data-classes spot_classes.txt --condition-tuples 1-2 1-3
+You can perform a D.E.A. between ST datasets (most likely regions of interest).
+The script generates different plots and the list of D.E. genes in a text file for each comparison.
+Basically the script needs one or more matrices of counts with ST data (genes as columns) and a list
+of comparisons to make:
+
+DATASET0-DATASET2 DATASET1-DATASET3 ...
+
+Where 0 refers to the first input dataset. The script allows for different normalization methods and
+different D.E.A. algorithms (see --help). An example run would be:
+
+    differential_analysis.py --input-data stdata_region1.tsv stdata_region2.tsv --comparisons 0-1
 
-  To know more about the parameters you can type --help
+To know more about the parameters you can type --help
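
Note (illustrative, not part of the patch): the README text in this diff describes a spot classes file with one `XxY class` pair per line and a counts matrix with spots as rows and genes as columns. The Python sketch below shows one way such a file could be parsed and used to keep only the spots of selected regions. It is not the code of slice_regions_matrix.py. The function names `parse_spot_classes` and `slice_rows`, the whitespace separator, the empty first header cell, and the sample data are all assumptions made for illustration.

```python
import csv
import io


def parse_spot_classes(lines):
    """Return {spot_id: class_label} from lines such as '10x12 1'."""
    classes = {}
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated (space or tab)
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        spot, label = fields
        classes[spot] = label
    return classes


def slice_rows(tsv_text, classes, regions):
    """Return the header and the rows whose spot belongs to one of `regions`."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    kept = [row for row in reader if row and classes.get(row[0]) in regions]
    return header, kept


# Hypothetical sample data: spots as rows (XxY), genes as columns.
CLASSES_TEXT = "10x12 1\n10x13 1\n11x12 2\n"
MATRIX_TEXT = "\tActb\tGapdh\n10x12\t5\t0\n10x13\t2\t7\n11x12\t0\t3\n"

if __name__ == "__main__":
    spot_classes = parse_spot_classes(CLASSES_TEXT.splitlines())
    header, rows = slice_rows(MATRIX_TEXT, spot_classes, {"1"})
    print(header)
    for row in rows:
        print(row)
```

Class labels are kept as strings so that they compare directly with region names given on a command line. A real implementation would more likely load the matrix with pandas and select rows by index.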