Sebastian Raschka, 2015  
Python Machine Learning

# References & Resources

A list of references as they appear throughout the chapters.

A BibTeX version for your favorite reference manager is available [here](./pymle.bib).

<br>
<br>

## Chapter 1: Machine Learning - Giving Computers the Ability to Learn from Data

##### Literature

- F. Galton. [Regression towards mediocrity in hereditary stature](http://www.jstor.org/stable/2841583). Journal of the Anthropological Institute of Great Britain and Ireland, pages 246–263, 1886.

##### Links

- Python: https://www.python.org

- Installing Python: https://docs.python.org/3/installing/index.html

- Anaconda Scientific Python Distribution: https://store.continuum.io/cshop/anaconda/

##### Additional Resources & Further Reading

<br>
<br>


## Chapter 2: Training Simple Machine Learning Algorithms for Classification

##### Literature

- W. S. McCulloch and W. Pitts. [A logical calculus of the ideas immanent in nervous activity](http://link.springer.com/article/10.1007/BF02459570). The bulletin of mathematical biophysics, 5(4):115–133, 1943.

- F. Rosenblatt. The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory, 1957.

- B. Widrow. Adaptive "Adaline" neuron using chemical "memistors". Technical Report 1553-2. Stanford Electron. Labs., Stanford, CA, October 1960.


##### Links

- NumPy Tutorial: http://wiki.scipy.org/Tentative_NumPy_Tutorial

- Pandas Tutorial: http://pandas.pydata.org/pandas-docs/stable/tutorials.html

- Matplotlib Tutorial: http://matplotlib.org/users/beginner.html

- IPython Notebook: https://ipython.org/ipython-doc/3/notebook/index.html

- BLAS (Basic Linear Algebra Subprograms): http://www.netlib.org/blas/

- LAPACK — Linear Algebra PACKage: http://www.netlib.org/lapack/

- UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/

- Iris dataset: https://archive.ics.uci.edu/ml/datasets/Iris

##### Additional Resources & Further Reading

- Z. Kolter. [Linear algebra review and reference](http://www.cs.cmu.edu/~zkolter/course/linalg/linalg_notes.pdf), 2008.

<br>
<br>


## Chapter 3: A Tour of Advanced Machine Learning Classifiers Using Scikit-Learn

##### Literature

- D. H. Wolpert and W. G. Macready. [No free lunch theorems for optimization](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=585893&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D585893). Evolutionary Computation, IEEE Transactions on, 1(1):67–82, 1997.

- D. H. Wolpert. [The supervised learning no-free-lunch theorems](http://link.springer.com/chapter/10.1007/978-1-4471-0123-9_3#page-1). In Soft Computing and Industry, pages 25–42. Springer, 2002.

- S. Menard. [Logistic regression: From introductory to advanced concepts and applications](https://books.google.com/books?hl=en&lr=&id=JSJzAwAAQBAJ&oi=fnd&pg=PP1&dq=Logistic+regression:+From+introductory+to+advanced+concepts+and+applications&ots=u7tB-9qcZT&sig=FiW0ejcCYxrne--73Vs5giobJZ0). Sage Publications, 2009.

- V. Vapnik. [The nature of statistical learning theory](https://books.google.com/books?hl=en&lr=&id=EqgACAAAQBAJ&oi=fnd&pg=PR7&dq=The+nature+of+statistical+learning+theory&ots=g2HZeBcX50&sig=dettzMHf6X1pLD-JMlvhpNQpaws#v=onepage&q=The%20nature%20of%20statistical%20learning%20theory&f=false). Springer Science & Business Media, 2013.

- C. J. Burges. [A tutorial on support vector machines for pattern recognition](http://link.springer.com/article/10.1023/A:1009715923555). Data mining and knowledge discovery, 2(2):121–167, 1998.

- J. H. Friedman, J. L. Bentley, and R. A. Finkel. [An algorithm for finding best matches in logarithmic expected time](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.319.4027). ACM Transactions on Mathematical Software (TOMS), 3(3):209–226, 1977.

##### Links

- scikit-learn: http://scikit-learn.org/stable/

- LIBLINEAR -- A Library for Large Linear Classification: http://www.csie.ntu.edu.tw/~cjlin/liblinear/

- LIBSVM -- A Library for Support Vector Machines: https://www.csie.ntu.edu.tw/~cjlin/libsvm/

- Graphviz - Graph Visualization Software: http://www.graphviz.org

##### Additional Resources & Further Reading

- L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and regression trees. Wadsworth, Belmont, CA, 1984.

- L. Breiman. [Random forests](http://link.springer.com/article/10.1023/A:1010933404324). Machine learning, 45(1):5–32, 2001.

- P. Cunningham and S. J. Delany. [k-nearest neighbour classifiers](http://www.researchgate.net/profile/Sarah_Delany/publication/228686398_k-Nearest_neighbour_classifiers/links/0fcfd50d0c1d1f41ad000000.pdf). Multiple Classifier Systems, pages 1–17, 2007.


<br>
<br>


## Chapter 4: Building Good Training Sets – Data Pre-Processing 


##### Literature

- T. Hastie, J. Friedman, and R. Tibshirani. [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/), volume 2. Springer, 2009. Section 3.4.

- F. Ferri, P. Pudil, M. Hatef, and J. Kittler. [Comparative study of techniques for large-scale feature selection](https://books.google.com/books?hl=en&lr=&id=sbajBQAAQBAJ&oi=fnd&pg=PA403&dq=Comparative+Study+of+Techniques+for+Large+Scale+Feature+Selection&ots=KdHKVsEatk&sig=PS_jPNFzSYPvFuhfXabBZxqt6UM#v=onepage&q=Comparative%20Study%20of%20Techniques%20for%20Large%20Scale%20Feature%20Selection&f=false). Pattern Recognition in Practice IV, pages 403–413, 1994.


##### Links

- Wine Data Set: https://archive.ics.uci.edu/ml/datasets/Wine

##### Additional Resources & Further Reading

- One-hot encoding: https://en.wikipedia.org/wiki/One-hot

- M. Y. Park and T. Hastie. [L1-regularization path algorithm for generalized linear models](http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2007.00607.x/abstract;jsessionid=315EA2A9E59E2A7E79041A04F4FE065D.f02t01?userIsAuthenticated=false&deniedAccessCustomisedMessage=). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):659–677, 2007.

- A. Y. Ng. [Feature selection, L1 vs. L2 regularization, and rotational invariance](http://dl.acm.org/citation.cfm?id=1015435). In Proceedings of the twenty-first international conference on Machine learning, page 78. ACM, 2004.

- D. W. Aha and R. L. Bankert. [A comparative evaluation of sequential feature selection algorithms](http://link.springer.com/chapter/10.1007/978-1-4612-2404-4_19#page-1). In Learning from Data, pages 199–206. Springer, 1996.

- C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn. [Bias in random forest variable importance measures: Illustrations, sources and a solution](http://www.biomedcentral.com/1471-2105/8/25). BMC bioinformatics, 8(1):25, 2007.


<br>
<br>


## Chapter 5: Compressing Data via Different Dimensionality Reduction Techniques

##### Literature

- R. A. Fisher. [The use of multiple measurements in taxonomic problems](http://onlinelibrary.wiley.com/store/10.1111/j.1469-1809.1936.tb02137.x/asset/j.1469-1809.1936.tb02137.x.pdf?v=1&t=idj76237&s=91cde2700c80c8e9db3270f122008a4629fe8887). Annals of eugenics, 7(2):179–188, 1936.

- C. R. Rao. [The utilization of multiple measurements in problems of biological classification](http://www.jstor.org/stable/2983775). Journal of the Royal Statistical Society. Series B (Methodological), 10(2):159–203, 1948.


- A. M. Martinez and A. C. Kak. [PCA versus LDA](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=908974&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D908974). Pattern Analysis and Machine Intelligence, IEEE Transactions on, 23(2):228–233, 2001.

- R. O. Duda, P. E. Hart, and D. G. Stork. [Pattern classification](http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471056693.html). 2nd. Edition. New York, 2001.

- B. Schoelkopf, A. Smola, and K.-R. Mueller. [Kernel principal component analysis](http://link.springer.com/chapter/10.1007/BFb0020217). pages 583–588, 1997.


##### Links

##### Additional Resources & Further Reading

- I. Jolliffe. [Principal component analysis](http://onlinelibrary.wiley.com/doi/10.1002/9781118445112.stat06472/abstract?userIsAuthenticated=false&deniedAccessCustomisedMessage=). Wiley Online Library, 2002.

- Manifold learning algorithms: https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Manifold_learning_algorithms

- J. Shawe-Taylor and N. Cristianini. [Kernel methods for pattern analysis](https://books.google.com/books?hl=en&lr=&id=9i0vg12lti4C&oi=fnd&pg=PR8&dq=Kernel+Methods+for+Pattern+Analysis&ots=okAEjd1H3R&sig=hNqlaqWsF4YB_2O4PWl1AjveplY#v=onepage&q=Kernel%20Methods%20for%20Pattern%20Analysis&f=false). Cambridge university press, 2004.


<br>
<br>


## Chapter 6: Learning Best Practices for Model Evaluation and Hyperparameter Optimization 

##### Literature

- R. Kohavi et al. [A study of cross-validation and bootstrap for accuracy estimation and model selection](http://frostiebek.free.fr/docs/Machine%20Learning/validation-1.pdf). In IJCAI, volume 14, pages 1137–1145, 1995.

- M. Markatou, H. Tian, S. Biswas, and G. M. Hripcsak. [Analysis of variance of cross-validation estimators of the generalization error](http://academiccommons.columbia.edu/catalog/ac:173902). Journal of Machine Learning Research, 6:1127–1168, 2005.

- B. Efron and R. Tibshirani. [Improvements on cross-validation: the 632+ bootstrap method](http://www.stat.washington.edu/courses/stat527/s13/readings/EfronTibshirani_JASA_1997.pdf). Journal of the American Statistical Association, 92(438):548–560, 1997.

- S. Varma and R. Simon. [Bias in error estimation when using cross-validation for model selection](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1397873/). BMC bioinformatics, 7(1):91, 2006.


##### Links

- Breast Cancer Wisconsin dataset: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)


##### Additional Resources & Further Reading

- Y. Bengio and Y. Grandvalet. [No unbiased estimator of the variance of k-fold cross-validation](http://dl.acm.org/citation.cfm?id=1044695). The Journal of Machine Learning Research, 5:1089–1105, 2004.

- S. Raschka. [An overview of general performance metrics of binary classifier systems](http://arxiv.org/pdf/1410.5330.pdf). Computing Research Repository (CoRR), abs/1410.5330, 2014.

- J. A. Hanley and B. J. McNeil. [The meaning and use of the area under a receiver operating characteristic (ROC) curve](http://pubs.rsna.org/doi/abs/10.1148/radiology.143.1.7063747). Radiology, 143(1):29–36, 1982.

- J. Davis and M. Goadrich. [The relationship between precision-recall and ROC curves](http://dl.acm.org/citation.cfm?id=1143874). In Proceedings of the 23rd international conference on Machine learning, pages 233–240. ACM, 2006.

- J. Bergstra and Y. Bengio. [Random search for hyper-parameter optimization](http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf). The Journal of Machine Learning Research, 13(1):281–305, 2012.


<br>
<br>


## Chapter 7: Combining Different Models for Ensemble Learning

##### Literature

- D. H. Wolpert. [Stacked generalization](http://www.sciencedirect.com/science/article/pii/S0893608005800231). Neural networks, 5(2):241–259, 1992.

- L. Breiman. [Bagging predictors](http://link.springer.com/article/10.1007/BF00058655#page-1). Machine learning, 24(2):123–140, 1996.

- R. E. Schapire. [The strength of weak learnability](http://link.springer.com/article/10.1007/BF00116037). Machine learning, 5(2):197–227, 1990.

- Y. Freund, R. E. Schapire, et al. [Experiments with a new boosting algorithm](http://www.public.asu.edu/~jye02/CLASSES/Fall-2005/PAPERS/boosting-icml.pdf). In ICML, volume 96, pages 148–156, 1996.

- L. Breiman. [Bias, variance, and arcing classifiers](http://oz.berkeley.edu/~breiman/arcall96.pdf). 1996.

- G. Raetsch, T. Onoda, and K. R. Mueller. [An improvement of AdaBoost to avoid overfitting](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.9074). In Proc. of the Int. Conf. on Neural Information Processing. Citeseer, 1998.

- A. Toescher, M. Jahrer, and R. M. Bell. [The BigChaos solution to the Netflix grand prize](http://www.stat.osu.edu/~dmsl/GrandPrize2009_BPC_BigChaos.pdf). Netflix prize documentation, 2009.


##### Links

- Netflix Recommendations: Beyond the 5 stars (Part 1): http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html

##### Additional Resources & Further Reading

- K. M. Ting and I. H. Witten. [Issues in stacked generalization](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.1519). J. Artif. Intell. Res.(JAIR), 10:271–289, 1999.

- J. H. Friedman. [Stochastic gradient boosting](http://www.sciencedirect.com/science/article/pii/S0167947301000652). Computational Statistics & Data Analysis, 38(4):367–378, 2002.


<br>
<br>


## Chapter 8: Applying Machine Learning to Sentiment Analysis 

##### Literature

- A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. [Learning word vectors for sentiment analysis](http://dl.acm.org/citation.cfm?id=2002491). In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

- I. Kanaris, K. Kanaris, I. Houvardas, and E. Stamatatos. [Words versus character n-grams for anti-spam filtering](http://www.worldscientific.com/doi/abs/10.1142/S0218213007003692). International Journal on Artificial Intelligence Tools, 16(06):1047–1067, 2007.

- S. Raschka. [Naive Bayes and text classification I - introduction and theory](http://arxiv.org/pdf/1410.5329.pdf). Computing Research Repository (CoRR), abs/1410.5329, 2014.

- S. Bird, E. Klein, and E. Loper. [Natural language processing with Python](http://www.nltk.org/book/). O’Reilly Media, Inc., 2009.

- M. F. Porter. [An algorithm for suffix stripping](http://www.emeraldinsight.com/doi/abs/10.1108/eb046814). Program: electronic library and information systems, 14(3):130–137, 1980.

- M. Toman, R. Tesar, and K. Jezek. [Influence of word normalization on text classification](http://repository.essex.ac.uk/4019/). Proceedings of InSciT, pages 354–358, 2006.

- A. Appleby. [murmurhash3](https://sites.google.com/site/murmurhash/), 2011.

- T. Mikolov, K. Chen, G. Corrado, and J. Dean. [Efficient estimation of word representations in vector space](http://arxiv.org/abs/1301.3781). arXiv preprint arXiv:1301.3781, 2013.


##### Links

- Review dataset: http://ai.stanford.edu/~amaas/data/sentiment/

- Google regex Tutorial: https://developers.google.com/edu/python/regular-expressions

- Natural Language Toolkit: http://www.nltk.org

- Google Word2Vec: https://code.google.com/p/word2vec/

##### Additional Resources & Further Reading

- A. Aizawa. [An information-theoretic perspective of tf–idf measures](http://www.sciencedirect.com/science/article/pii/S0306457302000213). Information Processing & Management, 39(1):45–65, 2003.

- M. F. Porter. [Snowball: A language for stemming algorithms](http://snowball.tartarus.org/texts/introduction.html), 2001.

- C. D. Paice. [Method for evaluation of stemming algorithms based on error counting](http://link.springer.com/article/10.3103/S0005105509060041). Journal of the American Society for Information Science, 47(8):632–649, 1996.


<br>
<br>


## Chapter 9: Embedding a Machine Learning Model into a Web Application 

##### Literature

##### Links

- Flask: http://flask.pocoo.org

- SQLite: http://www.sqlite.org

- SQLite Manager Add-on: https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager/

- WTForms: https://wtforms.readthedocs.org/en/latest/

- Jinja2: http://jinja.pocoo.org

- Webapp example: http://raschkas.pythonanywhere.com

- pythonanywhere: https://www.pythonanywhere.com

##### Additional Resources & Further Reading

- HTTP Methods: GET vs. POST: http://www.w3schools.com/tags/ref_httpmethods.asp


<br>
<br>


## Chapter 10: Predicting Continuous Target Variables with Regression Analysis 

##### Literature

- A. I. Khuri. [Introduction to linear regression analysis](http://onlinelibrary.wiley.com/doi/10.1111/insr.12020_10/abstract), by Douglas C. Montgomery, Elizabeth A. Peck, G. Geoffrey Vining. International Statistical Review, 81(2):318–319, 2013.

- D. S. G. Pollock. [The Classical Linear Regression Model](http://www.le.ac.uk/users/dsgp1/COURSES/MESOMET/ECMETXT/06mesmet.pdf).

- R. Toldo and A. Fusiello. [Automatic estimation of the inlier threshold in robust multiple structures fitting](http://link.springer.com/chapter/10.1007/978-3-642-04146-4_15). In Image Analysis and Processing–ICIAP 2009, pages 123–131. Springer, 2009.


##### Links

- Housing dataset: https://archive.ics.uci.edu/ml/datasets/Housing

- Seaborn: http://stanford.edu/~mwaskom/software/seaborn/

##### Additional Resources & Further Reading

- J. W. Tukey. [Exploratory data analysis](http://xa.yimg.com/kq/groups/16412409/1159714453/name/exploratorydataanalysis.pdf). 1977.

- L. I.-K. Lin. [A concordance correlation coefficient to evaluate reproducibility](http://www.jstor.org/stable/2532051?seq=1#page_scan_tab_contents). Biometrics, pages 255–268, 1989.

- N. J. Nagelkerke. [A note on a general definition of the coefficient of determination](http://www.cesarzamudio.com/uploads/1/7/9/1/17916581/nagelkerke_n.j.d._1991_-_a_note_on_a_general_definition_of_the_coefficient_of_determination.pdf). Biometrika, 78(3):691–692, 1991.

- P. Meer, D. Mintz, A. Rosenfeld, and D. Y. Kim. [Robust regression methods for computer vision: A review](http://link.springer.com/article/10.1007/BF00127126). International journal of computer vision, 6(1):59–70, 1991.

- A. E. Hoerl and R. W. Kennard. [Ridge regression: Biased estimation for nonorthogonal problems](http://amstat.tandfonline.com/doi/abs/10.1080/00401706.1970.10488634). Technometrics, 12(1):55–67, 1970.

- R. Tibshirani. [Regression shrinkage and selection via the lasso](http://www.jstor.org/stable/2346178). Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

- A. Liaw and M. Wiener. [Classification and regression by randomForest](ftp://131.252.97.79/Transfer/Treg/WFRE_Articles/Liaw_02_Classification%20and%20regression%20by%20randomForest.pdf). R news, 2(3):18–22, 2002.

- G. Louppe. [Understanding random forests: From theory to practice](http://arxiv.org/abs/1407.7502). arXiv preprint arXiv:1407.7502, 2014.

- G. Louppe, L. Wehenkel, A. Sutera, and P. Geurts. [Understanding variable importances in forests of randomized trees](http://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees). In Advances in Neural Information Processing Systems, pages 431–439, 2013.

- S. R. Gunn et al. [Support vector machines for classification and regression](http://ce.sharif.ir/courses/85-86/2/ce725/resources/root/LECTURES/SVM.pdf). ISIS technical report, 14, 1998.


<br>
<br>


## Chapter 11: Working with Unlabeled Data – Clustering Analysis  

##### Literature

- D. Arthur and S. Vassilvitskii. [k-means++: The advantages of careful seeding](http://dl.acm.org/citation.cfm?id=1283494). In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.

- J. C. Dunn. [A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters](http://www.tandfonline.com/doi/abs/10.1080/01969727308546046#.VdUUQnhh1AY). 1973.

- J. C. Bezdek. [Pattern recognition with fuzzy objective function algorithms](https://books.google.com/books?hl=en&lr=&id=z6XqBwAAQBAJ&oi=fnd&pg=PR14&dq=Pattern+recognition+with+fuzzy+objective+function+algorithms&ots=0g_HoTDhDo&sig=LWYQJJL8usKeVvPY_q1DLnJ4P70#v=onepage&q=Pattern%20recognition%20with%20fuzzy%20objective%20function%20algorithms&f=false). Springer Science & Business Media, 2013.

- S. Ghosh and S. K. Dubey. [Comparative analysis of k-means and fuzzy c-means algorithms](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.403.7600&rep=rep1&type=pdf). IJACSA, 4:35–38, 2013.

- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. [A density-based algorithm for discovering clusters in large spatial databases with noise](http://www.aaai.org/Papers/KDD/1996/KDD96-037). In KDD, volume 96, pages 226–231, 1996.


##### Links

##### Additional Resources & Further Reading

- Z. Huang. [Extensions to the k-means algorithm for clustering large data sets with categorical values](http://link.springer.com/article/10.1023/A:1009769707641). Data mining and knowledge discovery, 2(3):283–304, 1998.


- C. Ding and X. He. [K-means clustering via principal component analysis](http://dl.acm.org/citation.cfm?id=1015408). In Proceedings of the twenty-first international conference on Machine learning, page 29. ACM, 2004.

- Y. Ding, Y. Zhao, X. Shen, M. Musuvathi, and T. Mytkowicz. [Yinyang k-means: A drop-in replacement of the classic k-means with consistent speedup](http://machinelearning.wustl.edu/mlpapers/paper_files/icml2015_ding15.pdf). In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 579–587, 2015.

- P. J. Rousseeuw. [Silhouettes: a graphical aid to the interpretation and validation of cluster analysis](http://www.sciencedirect.com/science/article/pii/0377042787901257). Journal of computational and applied mathematics, 20:53–65, 1987.

- J.-P. Rasson and T. Kubushishi. [The gap test: an optimal method for determining the number of natural classes in cluster analysis](http://link.springer.com/chapter/10.1007/978-3-642-51175-2_21). In New approaches in classification and data analysis, pages 186–193. Springer, 1994.

- S. C. Johnson. [Hierarchical clustering schemes](http://link.springer.com/article/10.1007/BF02289588). Psychometrika, 32(3):241–254, 1967.


<br>
<br>


## Chapter 12: Training Artificial Neural Networks for Image Recognition

##### Literature

- D. E. Rumelhart, G. E. Hinton, and R. J. Williams. [Learning representations by back-propagating errors](http://lia.disi.unibo.it/Courses/SistInt/articoli/nnet1.pdf). Nature, 323(6088):533–536, 1986.

- Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. [DeepFace: Closing the gap to human-level performance in face verification](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6909616&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6909616). In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1701–1708. IEEE, 2014.

- A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. [Deep Speech: Scaling up end-to-end speech recognition](http://arxiv.org/pdf/1412.5567.pdf). arXiv preprint arXiv:1412.5567, 2014.

- T. Unterthiner, A. Mayr, G. Klambauer, and S. Hochreiter. [Toxicity prediction using deep learning](http://arxiv.org/pdf/1503.01445.pdf). arXiv preprint arXiv:1503.01445, 2015.

- T. Hastie, J. Friedman, and R. Tibshirani. [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/), volume 2. Springer, 2009.

- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. [Gradient-based learning applied to document recognition](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf). Proceedings of the IEEE, 86(11):2278–2324, 1998.

- A. G. Baydin and B. A. Pearlmutter. [Automatic differentiation of algorithms for machine learning](http://arxiv.org/pdf/1404.7456v1.pdf). arXiv preprint arXiv:1404.7456, 2014.

- Y. Bengio. [Learning deep architectures for AI](http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf). Foundations and trends in Machine Learning, 2(1):1–127, 2009.

- P. Y. Simard, D. Steinkraus, and J. C. Platt. [Best practices for convolutional neural networks applied to visual document analysis](http://www.computer.org/csdl/proceedings/icdar/2003/1960/02/196020958.pdf). In Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR), page 958. IEEE, 2003.

- S. Hochreiter and J. Schmidhuber. [Long short-term memory](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6795963&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6795963). Neural computation, 9(8):1735–1780, 1997.

- C. M. Bishop. [Neural networks for pattern recognition](https://books.google.com/books?hl=en&lr=&id=T0S0BgAAQBAJ&oi=fnd&pg=PP1&dq=Neural+networks+for+pattern+recognition&ots=jL6TqGbBld&sig=fiLrMg-RJx22cgQ7zd2CiwUqNqI#v=onepage&q=Neural%20networks%20for%20pattern%20recognition&f=false). Oxford university press, 1995.


##### Links

- "How Google Translate squeezes deep learning onto a phone": http://googleresearch.blogspot.com/2015/07/how-google-translate-squeezes-deep.html

- Article about Endianness: https://en.wikipedia.org/wiki/Endianness


##### Additional Resources & Further Reading

- Automatic differentiation: https://en.wikipedia.org/wiki/Automatic_differentiation


<br>
<br>

## Chapter 13: Parallelizing Neural Network Training with Theano

##### Literature

- J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. [Theano: A CPU and GPU math compiler in Python](https://projects.scipy.org/scipy2010/slides/james_bergstra_theano.pdf). In Proc. 9th Python in Science Conf, pages 1–7, 2010.


##### Links

- LISA Lab: http://lisa.iro.umontreal.ca

- Theano: http://deeplearning.net/software/theano/

- Pylearn2: http://deeplearning.net/software/pylearn2/

- Lasagne: https://lasagne.readthedocs.org/en/latest/

- Keras: http://keras.io

- SymPy: http://www.sympy.org/en/index.html


##### People

- Geoff Hinton: http://www.cs.toronto.edu/~hinton/

- Andrew Ng: http://www.andrewng.org

- Yann LeCun: http://yann.lecun.com

- Juergen Schmidhuber: http://people.idsia.ch/~juergen/

- Yoshua Bengio: http://www.iro.umontreal.ca/~bengioy

##### Additional Resources & Further Reading

- Symbolic Computation: https://en.wikipedia.org/wiki/Symbolic_computation

- What Every Programmer Should Know About Floating-Point Arithmetic: http://floating-point-gui.de