diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
new file mode 100644
index 0000000..b735373
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,35 @@
+---
+name: Bug report
+about: Create a report to help us improve
+
+---
+
+**Describe the bug**
+A clear and concise description of what the bug is.
+
+**To Reproduce**
+Steps to reproduce the behavior:
+1. Go to '...'
+2. Click on '....'
+3. Scroll down to '....'
+4. See error
+
+**Expected behavior**
+A clear and concise description of what you expected to happen.
+
+**Screenshots**
+If applicable, add screenshots to help explain your problem.
+
+**Desktop (please complete the following information):**
+ - OS: [e.g. iOS]
+ - Browser [e.g. chrome, safari]
+ - Version [e.g. 22]
+
+**Smartphone (please complete the following information):**
+ - Device: [e.g. iPhone6]
+ - OS: [e.g. iOS8.1]
+ - Browser [e.g. stock browser, safari]
+ - Version [e.g. 22]
+
+**Additional context**
+Add any other context about the problem here.
diff --git a/.github/ISSUE_TEMPLATE/documentation-issue.md b/.github/ISSUE_TEMPLATE/documentation-issue.md
new file mode 100644
index 0000000..a009408
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/documentation-issue.md
@@ -0,0 +1,17 @@
+---
+name: Documentation Issue
+about: Use this template for documentation related issues
+
+---
+
+<em>Please make sure that this is a documentation issue.</em>
+
+
+**System information**
+- PocketFlow version:
+- Doc Link:
+
+
+**Describe the documentation issue**
+
+**We welcome contributions by users. Will you be able to submit a PR to fix the documentation issue?**
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
new file mode 100644
index 0000000..066b2d9
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,17 @@
+---
+name: Feature request
+about: Suggest an idea for this project
+
+---
+
+**Is your feature request related to a problem? Please describe.**
+A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
+
+**Describe the solution you'd like**
+A clear and concise description of what you want to happen.
+
+**Describe alternatives you've considered**
+A clear and concise description of any alternative solutions or features you've considered.
+
+**Additional context**
+Add any other context or screenshots about the feature request here.
diff --git a/.github/ISSUE_TEMPLATE/other-issues.md b/.github/ISSUE_TEMPLATE/other-issues.md
new file mode 100644
index 0000000..2a9fb28
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/other-issues.md
@@ -0,0 +1,12 @@
+---
+name: Other Issues
+about: Use this template for any other non-support related issues
+
+---
+
+This template is for miscellaneous issues not covered by the other issue categories.
+
+For questions on how to work with PocketFlow, or support for problems that are not verified bugs in PocketFlow, please go to [StackOverflow](https://stackoverflow.com).
+
+
+For high-level discussions about PocketFlow, please post to the [discussion group](https://groups.google.com/forum/#!forum/pocketflow).
diff --git a/.gitignore b/.gitignore
index bb3c14f..4b01c45 100644
--- a/.gitignore
+++ b/.gitignore
@@ -9,5 +9,6 @@ automl_output_*
 start_multi.sh
 ratio.list
 path.conf
-*_at_*_run.py
 nvidia-smi-dump
+ssd_outputs
+dump
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 8bbc0d2..152cdf1 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,38 +1,41 @@
-# Contributing to PocketFlow
-[腾讯开源激励计划](https://opensource.tencent.com/contribution) 鼓励开发者的参与和贡献，期待你的加入。我们欢迎[report Issues](https://github.com/Tencent/PocketFlow/issues) 或者 [pull requests](https://github.com/Tencent/PocketFlow/pulls)。 在贡献代码之前请阅读以下指引。
+# 为PocketFlow做出贡献
+
+[腾讯开源激励计划](https://opensource.tencent.com/contribution)鼓励所有开发者的参与和贡献，我们期待你的加入。你可以报告 [Issues](https://github.com/Tencent/PocketFlow/issues) 或者提交 [Pull Requests](https://github.com/Tencent/PocketFlow/pulls)。在贡献代码之前，请阅读以下指引。
 
 ## 问题管理
-我们用 Github Issues 去跟踪 public bugs 和 feature requests。
 
-### 查找已知的issue 优先
-请查找已存在或者相类似的issue，从而保证不存在冗余。
+我们使用 Github Issues 以收集问题和功能需求。
+
+### 查找已有的 Issues
+
+在创建新 Issue 之前，请先搜索是否存在已有的或者类似的 Issue ，以避免重复。
+
+### 创建新 Issue
+
+当创建新 Issue 时，请提供详细的描述、截屏以及/或者短视频来帮助我们定位和复现问题。
 
-### 新建 Issues
-新建issues 时请提供详细的描述、截屏或者短视频来辅助我们定位问题
+## 分支管理
 
-### 分支管理
+目前，为简便起见，我们仅有一个分支：
 
-有两个主分支：
+1. `master` 分支：
+   1. 这是稳定分支。高度稳定的版本将会被标注特定的版本号。
+   2. 请向该分支提交包含问题修复或者新功能的 Pull Requests。
 
-1. `master` 分支
-    1. **注意不要提交PR到此分支**
-2. `dev` 分支. 
-    1. **这是稳定的开发分支，经过完成测试后，`dev`分支的内容会在下次发布时合并到 `master`分支。**
-    2. **建议提交PR到`dev`分支。**
+## Pull Requests
 
-###  Pull Requests
+我们欢迎所有人向 PocketFlow 贡献代码。我们的代码团队会监控 Pull Requests, 进行相应的代码测试与检查，通过测试的 PR 将会被合并至 `master` 分支。
 
-我们欢迎大家贡献代码来使我们的PocketFlow更加强大
-代码团队会监控pull request, 我们会做相应的代码检查和测试，测试通过之后我们就会接纳PR ，但是不会立即合并到master分支。
+在提交 PR 之前，请确认:
 
-在完成一个pr之前请做一下确认:
+1. 从主项目中 fork 代码
+2. 与主项目保持同步
+3. 在代码变动后，对应地修改注释与文档
+4. 在新文件中加入协议与版权声明
+5. 确保一致的代码风格（可使用 `run_pylint.sh`）
+6. 充分测试你的代码
+7. 向 `master` 分支发起 PR 请求
 
-1. 从 `master`  fork 你自己的分支。
-2. 在修改了代码之后请修改对应的文档和注释。
-3. 在新建的文件中请加入licence 和copy right申明。
-4. 确保一致的代码风格，可运行脚本run_pylint.sh进行一致性检查。
-5. 做充分的测试。
-6. 然后，你可以提交你的代码到 `dev` 分支。
+## 协议
 
-## 代码协议
-[BSD 3-Clause License](https://github.com/Tencent/PocketFlow/blob/master/LICENSE.TXT) 为PocketFlow的开源协议，您贡献的代码也会受此协议保护。
+[BSD 3-Clause License](https://github.com/Tencent/PocketFlow/blob/master/LICENSE.TXT)是PocketFlow的开源协议，你贡献的代码也会受此协议保护。
diff --git a/CONTRIBUTING_en.md b/CONTRIBUTING_en.md
new file mode 100644
index 0000000..3d27599
--- /dev/null
+++ b/CONTRIBUTING_en.md
@@ -0,0 +1,41 @@
+# Contributing to PocketFlow
+
+[Tencent Open Source Incentive Program](https://opensource.tencent.com/contribution) encourages the participation and contribution of all developers, and we look forward to you joining us. You are welcome to report [issues](https://github.com/Tencent/PocketFlow/issues) or submit [pull requests](https://github.com/Tencent/PocketFlow/pulls). Before contributing, please read the following guidelines.
+
+## Issue Management
+
+We use GitHub Issues to track public bugs and feature requests.
+
+### Search for Existing Issues First
+
+Please search for existing or similar issues before opening a new one, in order to avoid duplicates.
+
+### Creating a New Issue
+
+When creating a new issue, please provide detailed descriptions, screenshots, and/or short videos to help us locate and reproduce the problem(s).
+
+## Branch Management
+
+For the moment, we have only one branch for simplicity:
+
+1. `master` branch:
+    1. This is the stable branch. Highly-stable versions will be tagged with certain version numbers.
+    2. Please submit hotfixes or new features as PRs to this branch.
+
+## Pull Requests (PR)
+
+We welcome everyone to contribute to PocketFlow. Our development team will monitor pull requests and perform the related tests and code reviews. PRs that pass will be accepted and merged into the `master` branch.
+
+Before submitting a PR, please confirm the following:
+
+1. Fork the main repo
+2. Keep updated with the main repo
+3. Update comments and documentation after code changes
+4. Add license and copyright notices to new files
+5. Keep the code style consistent (use `run_pylint.sh`)
+6. Extensively test your code
+7. Make a pull request to the `master` branch
+
+## License
+
+[BSD 3-Clause License](https://github.com/Tencent/PocketFlow/blob/master/LICENSE.TXT) is PocketFlow's open source license. Your contributed code will also be protected by this license.
diff --git a/README.md b/README.md
index 2fb863f..0bafe39 100644
--- a/README.md
+++ b/README.md
@@ -4,13 +4,20 @@ PocketFlow is an open-source framework for compressing and accelerating deep lea
 
 PocketFlow aims at providing an easy-to-use toolkit for developers to improve the inference efficiency with little or no performance degradation. Developers only needs to specify the desired compression and/or acceleration ratios and then PocketFlow will automatically choose proper hyper-parameters to generate a highly efficient compressed model for deployment.
 
+PocketFlow was originally developed by researchers and engineers on the machine learning team within Tencent AI Lab for the purpose of compacting deep neural networks with industrial applications.
+
 For full documentation, please refer to [PocketFlow's GitHub Pages](https://pocketflow.github.io/). To start with, you may be interested in the [installation guide](https://pocketflow.github.io/installation/) and the [tutorial](https://pocketflow.github.io/tutorial/) on how to train a compressed model and deploy it on mobile devices.
 
+For general discussions about PocketFlow development and directions, please refer to the [PocketFlow Google Group](https://groups.google.com/forum/#!forum/pocketflow). If you need general help, please go to [Stack Overflow](https://stackoverflow.com/). You can report issues, bugs, and feature requests on the [GitHub Issues page](https://github.com/Tencent/PocketFlow/issues).
+
+**News: we have created a QQ group (ID: 827277965) for technical discussions. Welcome to join us!**
+<img src="docs/qr_code.jpg" alt="qr_code" width="256"/>
+
 ## Framework
 
 The proposed framework mainly consists of two categories of algorithm components, *i.e.* learners and hyper-parameter optimizers, as depicted in the figure below. Given an uncompressed original model, the learner module generates a candidate compressed model using some randomly chosen hyper-parameter combination. The candidate model's accuracy and computation efficiency is then evaluated and used by hyper-parameter optimizer module as the feedback signal to determine the next hyper-parameter combination to be explored by the learner module. After a few iterations, the best one of all the candidate models is output as the final compressed model.
 
-![Framework Design](docs/framework_design.png)
+![Framework Design](docs/docs/pics/framework_design.png)
 
 ## Learners
 
@@ -44,12 +51,12 @@ For complete evaluation results, please refer to [here](https://pocketflow.githu
 
 We adopt the DDPG algorithm as the RL agent to find the optimal layer-wise pruning ratios, and use group fine-tuning to further improve the compressed model's accuracy:
 
-| Model        | Pruning Ratio | Uniform | RL-based      | RL-based + Group Fine-tuning |
-|:------------:|:-------------:|:-------:|:-------------:|:----------------------------:|
-| MobileNet-v1 | 50%           | 66.5%   | 67.8% (+1.3%) | 67.9% (+1.4%)                |
-| MobileNet-v1 | 60%           | 66.2%   | 66.9% (+0.7%) | 67.0% (+0.8%)                |
-| MobileNet-v1 | 70%           | 64.4%   | 64.5% (+0.1%) | 64.8% (+0.4%)                |
-| Mobilenet-v1 | 80%           | 61.4%   | 61.4% (+0.0%) | 62.2% (+0.8%)                |
+| Model        | FLOPs | Uniform | RL-based      | RL-based + Group Fine-tuning |
+|:------------:|:-----:|:-------:|:-------------:|:----------------------------:|
+| MobileNet-v1 | 50%   | 66.5%   | 67.8% (+1.3%) | 67.9% (+1.4%)                |
+| MobileNet-v1 | 40%   | 66.2%   | 66.9% (+0.7%) | 67.0% (+0.8%)                |
+| MobileNet-v1 | 30%   | 64.4%   | 64.5% (+0.1%) | 64.8% (+0.4%)                |
+| MobileNet-v1 | 20%   | 61.4%   | 61.4% (+0.0%) | 62.2% (+0.8%)                |
 
 ### Weight Sparsification
 
@@ -74,6 +81,19 @@ The resulting model can be deployed on mobile devices for faster inference (Devi
 
 * All the reported time are in milliseconds.
 
+## Citation
+
+Please cite PocketFlow in your publications if it helps your research:
+
+``` bibtex
+@incollection{wu2018pocketflow,
+  author = {Jiaxiang Wu and Yao Zhang and Haoli Bai and Huasong Zhong and Jinlong Hou and Wei Liu and Junzhou Huang},
+  title = {PocketFlow: An Automated Framework for Compressing and Accelerating Deep Neural Networks},
+  booktitle = {Advances in Neural Information Processing Systems (NIPS), Workshop on Compact Deep Neural Networks with Industrial Applications},
+  year = {2018},
+}
+```
+
 ## Reference
 
 * [**Bergstra et al., 2013**] J. Bergstra, D. Yamins, and D. D. Cox. *Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures*. In International Conference on Machine Learning (ICML), pages 115-123, Jun 2013.
diff --git a/datasets/pascalvoc_dataset.py b/datasets/pascalvoc_dataset.py
new file mode 100644
index 0000000..1284354
--- /dev/null
+++ b/datasets/pascalvoc_dataset.py
@@ -0,0 +1,197 @@
+# Tencent is pleased to support the open source community by making PocketFlow available.
+#
+# Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Pascal VOC dataset."""
+
+import os
+import tensorflow as tf
+
+from datasets.abstract_dataset import AbstractDataset
+
+FLAGS = tf.app.flags.FLAGS
+
+tf.app.flags.DEFINE_integer('image_size', 300, 'output image size')
+tf.app.flags.DEFINE_integer('image_size_eval', 300, 'output image size for evaluation')
+tf.app.flags.DEFINE_integer('nb_bboxs_max', 100, 'maximal # of bounding boxes per image')
+tf.app.flags.DEFINE_integer('nb_classes', 21, '# of classes')
+tf.app.flags.DEFINE_integer('nb_smpls_train', 22136, '# of samples for training')
+tf.app.flags.DEFINE_integer('nb_smpls_val', 500, '# of samples for validation')
+tf.app.flags.DEFINE_integer('nb_smpls_eval', 4952, '# of samples for evaluation')
+tf.app.flags.DEFINE_integer('batch_size', 32, 'batch size per GPU for training')
+tf.app.flags.DEFINE_integer('batch_size_eval', 1, 'batch size for evaluation')
+
+# Pascal VOC specifications
+IMAGE_CHN = 3
+
+def parse_example_proto(example_serialized):
+  """Parse the unserialized feature data from the serialized data.
+
+  Args:
+  * example_serialized: serialized example data
+
+  Returns:
+  * features: unserialized feature data
+  """
+
+  # parse features from the serialized data
+  feature_map = {
+    'image/encoded': tf.FixedLenFeature([], dtype=tf.string, default_value=''),
+    'image/format': tf.FixedLenFeature([], dtype=tf.string, default_value='jpeg'),
+    'image/filename': tf.FixedLenFeature((), dtype=tf.string, default_value=''),
+    'image/height': tf.FixedLenFeature([1], dtype=tf.int64),
+    'image/width': tf.FixedLenFeature([1], dtype=tf.int64),
+    'image/channels': tf.FixedLenFeature([1], dtype=tf.int64),
+    'image/shape': tf.FixedLenFeature([3], dtype=tf.int64),
+    'image/object/bbox/xmin': tf.VarLenFeature(dtype=tf.float32),
+    'image/object/bbox/ymin': tf.VarLenFeature(dtype=tf.float32),
+    'image/object/bbox/xmax': tf.VarLenFeature(dtype=tf.float32),
+    'image/object/bbox/ymax': tf.VarLenFeature(dtype=tf.float32),
+    'image/object/bbox/label': tf.VarLenFeature(dtype=tf.int64),
+    'image/object/bbox/difficult': tf.VarLenFeature(dtype=tf.int64),
+    'image/object/bbox/truncated': tf.VarLenFeature(dtype=tf.int64),
+  }
+  features = tf.parse_single_example(example_serialized, feature_map)
+
+  return features
+
+def pack_annotations(bboxes, labels, difficults=None, truncateds=None):
+  """Pack all the annotations into one tensor.
+
+  Args:
+  * bboxes: list of bounding box coordinates (N x 4)
+  * labels: list of category labels (N)
+  * difficults: list of difficulty flags (N)
+  * truncateds: list of truncation flags (N)
+
+  Returns:
+  * objects: one tensor with all the annotations packed together (FLAGS.nb_bboxs_max x 8)
+  """
+
+  # pack <bboxes> & <labels> with a leading <flags>
+  bboxes = tf.cast(bboxes, tf.float32)
+  labels = tf.cast(tf.expand_dims(labels, 1), tf.float32)
+  flags = tf.ones(tf.shape(labels))
+  objects = tf.concat([flags, bboxes, labels], axis=1)
+
+  # pack <difficults> & <truncateds> if supplied
+  if difficults is not None and truncateds is not None:
+    difficults = tf.cast(tf.expand_dims(difficults, 1), tf.float32)
+    truncateds = tf.cast(tf.expand_dims(truncateds, 1), tf.float32)
+    objects = tf.concat([objects, difficults, truncateds], axis=1)
+
+  # pad to fixed number of bounding boxes
+  pad_size = FLAGS.nb_bboxs_max - tf.shape(objects)[0]
+  objects = tf.pad(objects, [[0, pad_size], [0, 0]])
+
+  return objects
+
+def parse_fn(example_serialized, preprocess_fn, is_train):
+  """Parse image & objects from the serialized data.
+
+  Args:
+  * example_serialized: serialized example data
+  * preprocess_fn: preprocessing function
+  * is_train: whether to construct the training subset
+
+  Returns:
+  * image: image tensor
+  * objects: one tensor with all the annotations packed together
+  """
+
+  # unserialize the example proto
+  features = parse_example_proto(example_serialized)
+
+  # obtain the image data
+  image_raw = tf.image.decode_jpeg(features['image/encoded'], channels=IMAGE_CHN)
+  filename = features['image/filename']
+  shape = features['image/shape']
+
+  # obtain bounding boxes' coordinates
+  # Note that we impose an ordering of (y, x) just to make life difficult.
+  xmins = tf.expand_dims(features['image/object/bbox/xmin'].values, 1)
+  ymins = tf.expand_dims(features['image/object/bbox/ymin'].values, 1)
+  xmaxs = tf.expand_dims(features['image/object/bbox/xmax'].values, 1)
+  ymaxs = tf.expand_dims(features['image/object/bbox/ymax'].values, 1)
+  bboxes_raw = tf.concat([ymins, xmins, ymaxs, xmaxs], axis=1)  # N x 4
+
+  # obtain other annotation data
+  labels_raw = tf.cast(features['image/object/bbox/label'].values, tf.int64)
+  difficults = tf.cast(features['image/object/bbox/difficult'].values, tf.int64)
+  truncateds = tf.cast(features['image/object/bbox/truncated'].values, tf.int64)
+
+  # filter out difficult objects
+  if is_train:
+    # if all objects are difficult, keep the first one; otherwise, keep only the non-difficult objects
+    mask = tf.cond(
+      tf.count_nonzero(difficults, dtype=tf.int32) < tf.shape(difficults)[0],
+      lambda: difficults < tf.ones_like(difficults),
+      lambda: tf.one_hot(0, tf.shape(difficults)[0], on_value=True, off_value=False, dtype=tf.bool))
+    labels_raw = tf.boolean_mask(labels_raw, mask)
+    bboxes_raw = tf.boolean_mask(bboxes_raw, mask)
+
+  # pre-process image, labels, and bboxes
+  data_format = 'channels_last'  # use the channel-last ordering by default
+  if is_train:
+    out_shape = [FLAGS.image_size, FLAGS.image_size]
+    image, labels, bboxes = preprocess_fn(
+      image_raw, labels_raw, bboxes_raw, out_shape,
+      is_training=True, data_format=data_format, output_rgb=False)
+  else:
+    out_shape = [FLAGS.image_size_eval, FLAGS.image_size_eval]
+    image = preprocess_fn(
+      image_raw, labels_raw, bboxes_raw, out_shape,
+      is_training=False, data_format=data_format, output_rgb=False)
+    labels, bboxes = labels_raw, bboxes_raw
+
+  # pack all the annotations into one tensor
+  image_info = {'image': image, 'filename': filename, 'shape': shape}
+  objects = pack_annotations(bboxes, labels)
+
+  return image_info, objects
+
+class PascalVocDataset(AbstractDataset):
+  """Pascal VOC dataset."""
+
+  def __init__(self, preprocess_fn, is_train):
+    """Constructor function.
+
+    Args:
+    * is_train: whether to construct the training subset
+    """
+
+    # initialize the base class
+    super(PascalVocDataset, self).__init__(is_train)
+
+    # choose local files or HDFS files w.r.t. FLAGS.data_disk
+    if FLAGS.data_disk == 'local':
+      assert FLAGS.data_dir_local is not None, '<FLAGS.data_dir_local> must not be None'
+      self.data_dir = FLAGS.data_dir_local
+    elif FLAGS.data_disk == 'hdfs':
+      assert FLAGS.data_hdfs_host is not None and FLAGS.data_dir_hdfs is not None, \
+        'both <FLAGS.data_hdfs_host> and <FLAGS.data_dir_hdfs> must not be None'
+      self.data_dir = FLAGS.data_hdfs_host + FLAGS.data_dir_hdfs
+    else:
+      raise ValueError('unrecognized data disk: ' + FLAGS.data_disk)
+
+    # configure file patterns & function handlers
+    if is_train:
+      self.file_pattern = os.path.join(self.data_dir, '*train*')
+      self.batch_size = FLAGS.batch_size
+    else:
+      self.file_pattern = os.path.join(self.data_dir, '*val*')
+      self.batch_size = FLAGS.batch_size_eval
+    self.dataset_fn = tf.data.TFRecordDataset
+    self.parse_fn = lambda x: parse_fn(x, preprocess_fn=preprocess_fn, is_train=is_train)
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..7330c34
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,13 @@
+# How to Build the Documentation Site
+
+1. Install mkdocs and other dependencies:
+
+``` bash
+$ pip install -r doc-requirements.txt
+```
+
+2. Build a directory named "site", which contains all files needed for the documentation website:
+
+``` bash
+$ mkdocs build --clean
+```
diff --git a/docs/doc-requirements.txt b/docs/doc-requirements.txt
new file mode 100644
index 0000000..aa4ce36
--- /dev/null
+++ b/docs/doc-requirements.txt
@@ -0,0 +1,4 @@
+mkdocs
+markdown
+python-markdown-math
+pymdown-extensions
diff --git a/docs/docs/.markdownlint.json b/docs/docs/.markdownlint.json
new file mode 100644
index 0000000..435c42e
--- /dev/null
+++ b/docs/docs/.markdownlint.json
@@ -0,0 +1,5 @@
+{
+    "MD013": false,
+    "MD014": false,
+    "MD024": {"allow_different_nesting": true}
+}
diff --git a/docs/docs/MathJax.js b/docs/docs/MathJax.js
new file mode 100644
index 0000000..35e1994
--- /dev/null
+++ b/docs/docs/MathJax.js
@@ -0,0 +1,53 @@
+(function () {
+  var newMathJax = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js';
+  var oldMathJax = 'cdn.mathjax.org/mathjax/latest/MathJax.js';
+
+  var replaceScript = function (script, src) {
+    //
+    //  Make redirected script
+    //
+    var newScript = document.createElement('script');
+    newScript.src = newMathJax + src.replace(/.*?(\?|$)/, '$1');
+    //
+    //  Move onload and onerror handlers to new script
+    //
+    newScript.onload = script.onload; 
+    newScript.onerror = script.onerror;
+    script.onload = script.onerror = null;
+    //
+    //  Move any content (old-style configuration scripts)
+    //
+    while (script.firstChild) newScript.appendChild(script.firstChild);
+    //
+    //  Copy script id
+    //
+    if (script.id != null) newScript.id = script.id;
+    //
+    //  Replace original script with new one
+    //
+    script.parentNode.replaceChild(newScript, script);
+    //
+    //  Issue a console warning
+    //
+    console.warn('WARNING: cdn.mathjax.org has been retired. Check https://www.mathjax.org/cdn-shutting-down/ for migration tips.')
+  }
+
+  if (document.currentScript) {
+    var script = document.currentScript;
+    replaceScript(script, script.src);
+  } else {
+    //
+    // Look for current script by searching for one with the right source
+    //
+    var n = oldMathJax.length;
+    var scripts = document.getElementsByTagName('script');
+    for (var i = 0; i < scripts.length; i++) {
+      var script = scripts[i];
+      var src = (script.src || '').replace(/.*?:\/\//,'');
+      if (src.substr(0, n) === oldMathJax) {
+        replaceScript(script, src);
+        break;
+      }
+    }
+  }
+})();
\ No newline at end of file
diff --git a/docs/docs/automl_based_methods.md b/docs/docs/automl_based_methods.md
new file mode 100644
index 0000000..2a8672e
--- /dev/null
+++ b/docs/docs/automl_based_methods.md
@@ -0,0 +1,3 @@
+# AutoML-based Methods
+
+Under construction ...
diff --git a/docs/docs/cp_learner.md b/docs/docs/cp_learner.md
new file mode 100644
index 0000000..27024e8
--- /dev/null
+++ b/docs/docs/cp_learner.md
@@ -0,0 +1,69 @@
+# Channel Pruning
+
+## Introduction
+
+Channel pruning is a structured model compression approach that not only reduces the model size but also directly accelerates inference. PocketFlow uses the channel pruning algorithm proposed in (He et al., 2017) to prune the channels of each convolutional layer with a certain ratio; for details, please refer to the [channel pruning paper](https://arxiv.org/abs/1707.06168). We modify some parts of the algorithm for better performance and robustness.
+
+To achieve better performance, PocketFlow can take advantage of reinforcement learning to search for better compression ratios (He et al., 2018). Users can also use distilling (Hinton et al., 2015) and group tuning to improve the accuracy after compression. Group tuning means treating a certain number of layers as a group and then pruning and fine-tuning (or re-training) each group sequentially. For example, we can treat every 3 layers as a group and prune the first 3 layers; after that, we fine-tune (or re-train) the whole model, prune the next 3 layers, and so on. Distilling and group tuning have been shown experimentally to achieve higher accuracy at a given compression ratio in most situations.
+
+## Pruning Option
+
+The channel pruning code is located in the `./learners/channel_pruning` directory. To use channel pruning, set `--learner` to `channel`. Channel pruning supports three pruning setups, selected with the `cp_prune_option` option.
+
+### Uniform Channel Pruning
+
+The first option is uniform pruning, in which every convolutional layer is pruned with the same ratio. Enable it with `--cp_prune_option=uniform` and set the ratio with `--cp_uniform_preserve_ratio` (e.g. `--cp_uniform_preserve_ratio=0.5`). Note that if the ratios of a layer and of its previous layer are both 0.5, the layer preserves only 1/4 of its original FLOPs. This is because channel pruning removes the c_out channels of a convolution together with the c_in channels of the next convolution, so a layer whose c_in and c_out channels are both halved retains only 1/4 of its original computation cost. For a plain layer-by-layer convolutional network without residual blocks, setting `cp_uniform_preserve_ratio` to `0.5` therefore reduces the whole model to about 0.25 of the original computation. For residual networks, however, some convolutions can only have either their c_in or their c_out channels pruned, so the total preserved computation ratio may be much greater than 0.25.
+
+**Example:**
+
+``` bash
+$ ./scripts/run_seven.sh nets/resnet_at_cifar10_run.py \
+    --learner channel \
+    --batch_size_eval 64 \
+    --cp_uniform_preserve_ratio 0.5 \
+    --cp_prune_option uniform \
+    --resnet_size 20
+```
+
+### List Channel Pruning
+
+Another option is to prune each layer with the ratio listed for it in a file such as `ratio.list`, whose path is set with the `--cp_prune_list_file` option. The ratio values must be separated by commas. Set `--cp_prune_option=list` to prune the model with the listed ratios.
+
+**Example:**
+Add the following list to `./ratio.list`: `1.0, 0.1875, 0.1875, 0.1875, 0.1875, 0.1875, 0.1875, 0.1875, 1.0, 0.25, 1.0, 0.25, 0.21875, 0.21875, 0.21875, 1.0, 0.5625, 1.0, 0.546875, 0.546875, 0.546875, 1`
+
+``` bash
+$ ./scripts/run_seven.sh nets/resnet_at_cifar10_run.py \
+    --learner channel \
+    --batch_size_eval 64 \
+    --cp_prune_option list \
+    --cp_prune_list_file ./ratio.list \
+    --resnet_size 20
+```
+
+### Automatic Channel Pruning
+
+The last option searches for better pruning ratios with reinforcement learning; you only need to specify the ratio of total FLOPs (computation) that the compressed model should preserve. Set `--cp_prune_option=auto` and a preserve ratio such as `--cp_preserve_ratio=0.5`. Use `cp_nb_rlouts_min` to control the number of warm-up iterations after which the RL agent starts to learn (default: `50`), and `cp_nb_rlouts` to control the total number of iterations the RL agent searches for (default: `200`). To control other parameters of the agent, please refer to the reinforcement component page.
+
+**Example:**
+
+``` bash
+$ ./scripts/run_seven.sh nets/resnet_at_cifar10_run.py \
+    --learner channel \
+    --batch_size_eval 64 \
+    --cp_preserve_ratio 0.5 \
+    --cp_prune_option auto \
+    --resnet_size 20
+```
+
+## Channel pruning parameters
+
+The channel pruning implementation uses the Lasso algorithm for channel selection and linear regression for feature map reconstruction. During these two phases, the feature maps are sampled to reduce the computation cost. Use `--cp_nb_points_per_layer` to set how many sampling points are taken on each layer (default: `10`). For datasets whose images contain many zero-valued pixels (e.g. black regions), this value should be larger. Use `cp_nb_batches` to set how many batches are used for channel selection and feature reconstruction (default: `60`). A small `cp_nb_batches` may cause over-fitting and a large one may slow down the solver, so a good value depends on the network and dataset. For practical deployment, users may want the number of channels in each layer to be a multiple of four for fast inference on mobile devices; in this case, set `--cp_quadruple` to `True`.
+
+## Distilling
+
+Distilling is an effective approach to improving the final accuracy of models compressed with PocketFlow in most classification scenarios. Users can set `--enbl_dst=True` to enable distilling.
+
+## Group Tuning
+
+As introduced above, group tuning was proposed by the PocketFlow team, who found it very useful for improving the performance of model compression. In PocketFlow, users can set `--cp_finetune=True` to enable group fine-tuning and set the number of layers per group with `--cp_list_group` (default: `1000`). There is a trade-off between small and large values: if the value is `1`, PocketFlow prunes and fine-tunes (or re-trains) layer by layer, which may give better results but is more time-consuming, whereas a large value makes group tuning less effective. Users can also set `cp_nb_iters_ft_ratio`, the ratio of the total iterations to be used for fine-tuning, and `cp_lrn_rate_ft`, the learning rate for fine-tuning.
diff --git a/docs/docs/cpr_learner.md b/docs/docs/cpr_learner.md
new file mode 100644
index 0000000..d96698d
--- /dev/null
+++ b/docs/docs/cpr_learner.md
@@ -0,0 +1,137 @@
+# Channel Pruning - Remastered
+
+## Introduction
+
+Channel pruning (He et al., 2017) aims at reducing the number of input channels of each convolutional layer while minimizing the reconstruction loss of its output feature maps, using preserved input channels only. Similar to other model compression components based on channel pruning, this can lead to direct reduction in both model size and computational complexity (in terms of FLOPs).
+
+In PocketFlow, we provide `ChannelPrunedRmtLearner` as the remastered version of the previous `ChannelPrunedLearner`, with simplified and easier-to-understand implementation. The underlying algorithm is based on (He et al., 2017), with a few modifications. However, the support for RL-based hyper-parameter optimization is not yet ready and will be provided in the near future.
+
+## Algorithm Description
+
+For a convolutional layer, we denote its input feature map as $\mathcal{X} \in \mathbb{R}^{N \times h_{i} \times w_{i} \times c_{i}}$, where $N$ is the batch size, $h_{i}$ and $w_{i}$ are the spatial height and width, and $c_{i}$ is the number of input channels. The convolutional kernel is denoted as $\mathcal{W} \in \mathbb{R}^{k_{h} \times k_{w} \times c_{i} \times c_{o}}$, where $\left( k_{h}, k_{w} \right)$ is the kernel's spatial size and $c_{o}$ is the number of output channels. The resulting output feature map is given by $\mathcal{Y} = f \left( \mathcal{X}; \mathcal{W} \right) \in \mathbb{R}^{N \times h_{o} \times w_{o} \times c_{o}}$, where $h_{o}$ and $w_{o}$ are the spatial height and width, and $f \left( \cdot \right)$ denotes the convolutional operation.
+
+The convolutional operation can be understood as standard matrix multiplication between two matrices, one from $\mathcal{X}$ and the other from $\mathcal{W}$. The input feature map $\mathcal{X}$ is re-arranged via the `im2col` operator to produce a matrix $\mathbf{X}$ of size $N h_{o} w_{o} \times k_{h} k_{w} c_{i}$. The convolutional kernel $\mathcal{W}$ is correspondingly reshaped into $\mathbf{W}$ of size $k_{h} k_{w} c_{i} \times c_{o}$. The multiplication of these two matrices produces the output feature map in the matrix form, given by $\mathbf{Y} = \mathbf{X} \mathbf{W}$, which can be further reshaped back to the 4-D tensor $\mathcal{Y}$.
+
+The matrix multiplication can be decomposed along the dimension of input channels. We divide $\mathbf{X}$ into $c_{i}$ sub-matrices $\left\{ \mathbf{X}_{i} \right\}$, each of size $N h_{o} w_{o} \times k_{h} k_{w}$, and similarly divide $\mathbf{W}$ into $c_{i}$ sub-matrices $\left\{ \mathbf{W}_{i} \right\}$, each of size $k_{h} k_{w} \times c_{o}$. The computation of output feature map $\mathbf{Y}$ can be rewritten as:
+
+$$
+\mathbf{Y} = \sum\nolimits_{i = 1}^{c_{i}} \mathbf{X}_{i} \mathbf{W}_{i}
+$$
+
+In (He et al., 2017), a $c_{i}$-dimensional binary-valued mask vector $\boldsymbol{\beta}$ is introduced to indicate whether an input channel is pruned ($\beta_{i} = 0$) or not ($\beta_{i} = 1$). More formally, we consider the minimization of output feature map's reconstruction loss under sparsity constraint:
+
+$$
+\min_{\mathbf{W}, \boldsymbol{\beta}} \left\| \mathbf{Y} - \sum\nolimits_{i = 1}^{c_{i}} \beta_{i} \mathbf{X}_{i} \mathbf{W}_{i} \right\|_{F}^{2}, ~ \text{s.t.} ~ \left\| \boldsymbol{\beta} \right\|_{0} \le c'_{i}
+$$
+
+The above problem can be tackled by first solving for $\boldsymbol{\beta}$ via a LASSO regression problem, and then updating $\mathbf{W}$ with the closed-form solution (or an iterative solution) to least-square regression. Particularly, in the first step, we rewrite the sparsity constraint as an $l_{1}$-regularization term, so the optimization over $\boldsymbol{\beta}$ is now given by:
+
+$$
+\min_{\boldsymbol{\beta}} \left\| \mathbf{Y} - \sum\nolimits_{i = 1}^{c_{i}} \beta_{i} \mathbf{X}_{i} \mathbf{W}_{i} \right\|_{F}^{2} + \lambda \left\| \boldsymbol{\beta} \right\|_{1}
+$$
+
+The coefficient of $l_{1}$-regularization, $\lambda$, is determined via binary search so that the resulting solution $\boldsymbol{\beta}^{*}$ has exactly $c'_{i}$ non-zero entries. We solve the above unconstrained problem with the Iterative Shrinkage Thresholding Algorithm (ISTA).
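+
+For reference, one ISTA iteration for this problem can be written in the standard proximal-gradient form sketched below. The shorthand $\mathbf{Z}_{i} = \mathbf{X}_{i} \mathbf{W}_{i}$, the step size $\eta$ (presumably what `cpr_ista_lrn_rate` controls), and the soft-thresholding operator $S_{t}$ are notation introduced here for illustration and are not taken from PocketFlow's implementation:
+
+$$
+g_{i} = -2 \left\langle \mathbf{Z}_{i}, \mathbf{Y} - \sum\nolimits_{j = 1}^{c_{i}} \beta_{j} \mathbf{Z}_{j} \right\rangle, \quad \beta_{i} \leftarrow S_{\eta \lambda} \left( \beta_{i} - \eta g_{i} \right), \quad S_{t} \left( v \right) = \text{sign} \left( v \right) \max \left( \left| v \right| - t, 0 \right)
+$$
+
+Here $\left\langle \mathbf{A}, \mathbf{B} \right\rangle = \text{tr} \left( \mathbf{A}^{T} \mathbf{B} \right)$ and $g_{i}$ is the gradient of the squared reconstruction loss with respect to $\beta_{i}$. The soft-thresholding step sets small-magnitude entries of $\boldsymbol{\beta}$ exactly to zero, which is why a larger $\lambda$ generally leaves fewer non-zero entries.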
+
+## Hyper-parameters
+
+Below is the full list of hyper-parameters used in `ChannelPrunedRmtLearner`:
+
+| Name | Description |
+|:-----|:------------|
+| `cpr_save_path` | model's save path |
+| `cpr_save_path_eval` | model's save path for evaluation |
+| `cpr_save_path_ws` | model's save path for warm-start |
+| `cpr_prune_ratio` | target pruning ratio |
+| `cpr_skip_frst_layer` | skip the first convolutional layer for channel pruning |
+| `cpr_skip_last_layer` | skip the last convolutional layer for channel pruning |
+| `cpr_skip_op_names` | comma-separated Conv2D operation names to be skipped |
+| `cpr_nb_smpls` | number of cached training samples for channel pruning |
+| `cpr_nb_crops_per_smpl` | number of random crops per sample |
+| `cpr_ista_lrn_rate` | ISTA's learning rate |
+| `cpr_ista_nb_iters` | number of iterations in ISTA |
+| `cpr_lstsq_lrn_rate` | least-square regression's learning rate |
+| `cpr_lstsq_nb_iters` | number of iterations in least-square regression |
+| `cpr_warm_start` | use a channel-pruned model for warm start |
+
+Here, we provide a detailed description (and some analysis) of the above hyper-parameters:
+
+* `cpr_save_path`: save path for model created in the training graph. The resulting checkpoint files can be used to resume training from a previous run and compute model's loss function's value and some other evaluation metrics.
+* `cpr_save_path_eval`: save path for model created in the evaluation graph. The resulting checkpoint files can be used to export GraphDef & TensorFlow Lite model files.
+* `cpr_save_path_ws`: save path for model used for warm-start. This learner supports loading a previously-saved channel-pruned model, so that there is no need to perform channel selection again. This is only used when `cpr_warm_start` is `True`.
+* `cpr_prune_ratio`: target pruning ratio for input channels of each convolutional layer. The larger `cpr_prune_ratio` is, the more input channels will be pruned. If `cpr_prune_ratio` equals 0, then no input channels will be pruned and model remains the same; if `cpr_prune_ratio` equals 1, then all input channels will be pruned.
+* `cpr_skip_frst_layer`: whether to skip the first convolutional layer for channel pruning. The first convolutional layer may be directly related to input images and pruning its input channel may harm the performance significantly.
+* `cpr_skip_last_layer`: whether to skip the last convolutional layer for channel pruning. The last convolutional layer may be directly related to final outputs and pruning its input channels may harm the performance significantly.
+* `cpr_skip_op_names`: comma-separated Conv2D operation names to be skipped. For instance, if `cpr_skip_op_names` is set to "aaa,bbb", then any Conv2D operation whose name contains either "aaa" or "bbb" will be skipped and no channel pruning will be applied to it.
+* `cpr_nb_smpls`: number of cached training samples for channel pruning. Increasing this may lead to smaller performance degradation after channel pruning but also require more training time.
+* `cpr_nb_crops_per_smpl`: number of random crops per sample. Increasing this may lead to smaller performance degradation after channel pruning but also require more training time.
+* `cpr_ista_lrn_rate`: ISTA's learning rate for LASSO regression. If `cpr_ista_lrn_rate` is too large, then the optimization process may become unstable; if `cpr_ista_lrn_rate` is too small, then the optimization process may require lots of iterations until convergence.
+* `cpr_ista_nb_iters`: number of iterations for LASSO regression.
+* `cpr_lstsq_lrn_rate`: Adam's learning rate for least-square regression. If `cpr_lstsq_lrn_rate` is too large, then the optimization process may become unstable; if `cpr_lstsq_lrn_rate` is too small, then the optimization process may require lots of iterations until convergence.
+* `cpr_lstsq_nb_iters`: number of iterations for least-square regression.
+* `cpr_warm_start`: whether to use a previously-saved channel-pruned model for warm-start.
+
+## Empirical Evaluation
+
+In this section, we present some of our results for applying `ChannelPrunedRmtLearner` to compress image classification and object detection models.
+
+For image classification, we use `ChannelPrunedRmtLearner` to compress the ResNet-18 model on the ILSVRC-12 dataset:
+
+| Model | Prune Ratio | FLOPs | Distillation? | Top-1 Acc. | Top-5 Acc. |
+|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
+| ResNet-18 | 0.2 | 73.32% | No | 69.43% | 88.97% |
+| ResNet-18 | 0.2 | 73.32% | Yes | 68.78% | 88.71% |
+| ResNet-18 | 0.3 | 61.31% | No | 68.44% | 88.30% |
+| ResNet-18 | 0.3 | 61.31% | Yes | 68.85% | 88.53% |
+| ResNet-18 | 0.4 | 50.70% | No | 67.17% | 87.48% |
+| ResNet-18 | 0.4 | 50.70% | Yes | 67.35% | 87.83% |
+| ResNet-18 | 0.5 | 41.27% | No | 65.73% | 86.38% |
+| ResNet-18 | 0.5 | 41.27% | Yes | 65.98% | 86.98% |
+| ResNet-18 | 0.6 | 32.07% | No | 63.38% | 84.62% |
+| ResNet-18 | 0.6 | 32.07% | Yes | 63.65% | 85.47% |
+| ResNet-18 | 0.7 | 24.28% | No | 60.26% | 82.70% |
+| ResNet-18 | 0.7 | 24.28% | Yes | 60.43% | 82.96% |
+
+For object detection, we use `ChannelPrunedRmtLearner` to compress the SSD-VGG16 model on the Pascal VOC 07-12 dataset:
+
+| Model | Prune Ratio | FLOPs | Pruned Layers | mAP |
+|:-----:|:-----:|:-----:|:-----:|:-----:|
+| SSD-VGG16 | 0.2 | 67.34% | Backbone | 77.53% |
+| SSD-VGG16 | 0.2 | 66.50% | All | 77.22% |
+| SSD-VGG16 | 0.3 | 53.58% | Backbone | 76.94% |
+| SSD-VGG16 | 0.3 | 52.32% | All | 76.90% |
+| SSD-VGG16 | 0.4 | 41.63% | Backbone | 75.81% |
+| SSD-VGG16 | 0.4 | 39.96% | All | 75.80% |
+| SSD-VGG16 | 0.5 | 31.56% | Backbone | 74.42% |
+| SSD-VGG16 | 0.5 | 29.47% | All | 73.76% |
+
+## Usage Examples
+
+In this section, we provide some usage examples to demonstrate how to use `ChannelPrunedRmtLearner` under different execution modes and hyper-parameter combinations:
+
+To compress a ResNet-20 model for CIFAR-10 classification task in the local mode, use:
+
+``` bash
+# set the target pruning ratio to 0.50
+./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
+    --learner=chn-pruned-rmt \
+    --cpr_prune_ratio=0.50
+```
+
+To compress a ResNet-18 model for ILSVRC-12 classification task in the docker mode with 4 GPUs, use:
+
+``` bash
+# do not apply channel pruning to the last convolutional layer
+./scripts/run_docker.sh nets/resnet_at_ilsvrc12_run.py -n=4 \
+    --learner=chn-pruned-rmt \
+    --cpr_skip_last_layer=True
+```
+
+To compress a MobileNet-v1 model for ILSVRC-12 classification task in the seven mode with 8 GPUs, use:
+
+``` bash
+# use a channel-pruned model for warm-start, so no channel selection is needed
+./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py -n=8 \
+    --learner=chn-pruned-rmt \
+    --cpr_warm_start=True \
+    --cpr_save_path_ws=./models_cpr_ws/model.ckpt
+```
diff --git a/docs/docs/dcp_learner.md b/docs/docs/dcp_learner.md
new file mode 100644
index 0000000..85becd5
--- /dev/null
+++ b/docs/docs/dcp_learner.md
@@ -0,0 +1,85 @@
+# Discrimination-aware Channel Pruning
+
+## Introduction
+
+Discrimination-aware channel pruning (DCP, Zhuang et al., 2018) introduces a group of additional discriminative losses into the network to be pruned, to find out which channels are really contributing to the discriminative power and should be preserved. After channel pruning, the number of input channels of each convolutional layer is reduced, so that the model becomes smaller and the inference speed can be improved.
+
+## Algorithm Description
+
+For a convolutional layer, we denote its input feature map as $\mathbf{X} \in \mathbb{R}^{N \times c_{i} \times h_{i} \times w_{i}}$, where $N$ is the batch size, $c_{i}$ is the number of input channels, and $h_{i}$ and $w_{i}$ are the spatial height and width. The convolutional kernel is denoted as $\mathbf{W} \in \mathbb{R}^{c_{o} \times c_{i} \times k \times k}$, where $c_{o}$ is the number of output channels and $k$ is the kernel size. The resulting output feature map is given by $\mathbf{Y} = f \left( \mathbf{X}; \mathbf{W} \right)$, where $f \left( \cdot \right)$ represents the convolutional operation.
+
+The idea of channel pruning is to impose a sparsity constraint on the convolutional kernel, so that some of its input channels contain only all-zero weights and can be safely removed. For instance, if the convolutional kernel satisfies:
+
+$$
+\left\| \left\| \mathbf{W}_{:, j, :, :} \right\|_{F}^{2} \right\|_{0} = c'_{i},
+$$
+
+where $c'_{i} \lt c_{i}$, then the convolutional layer can be simplified to one with only $c'_{i}$ input channels, and the computational complexity is reduced by a ratio of $\frac{c_{i} - c'_{i}}{c_{i}}$.
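+
+As a quick worked example (the channel counts below are arbitrary and chosen only for illustration), pruning a layer from $c_{i} = 64$ to $c'_{i} = 48$ input channels gives:
+
+$$
+\frac{c_{i} - c'_{i}}{c_{i}} = \frac{64 - 48}{64} = 0.25,
+$$
+
+*i.e.* this layer's computational complexity drops by 25%, since the cost of a convolutional layer is proportional to its number of input channels when all other dimensions are fixed.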
+
+In order to reduce the performance degradation caused by channel pruning, the DCP algorithm introduces a novel channel selection algorithm by incorporating additional discrimination-aware and reconstruction loss terms, as shown below.
+
+![DCP Learner](pics/dcp_learner.png)
+**Source:** Zhuang et al., *Discrimination-aware Channel Pruning for Deep Neural Networks*. NIPS '18.
+
+The network is evenly divided into $\left( P + 1 \right)$ blocks. For each of the first $P$ blocks, an extra branch is derived from the output feature map of this block's last layer. The output feature map is then passed through batch normalization & ReLU & average pooling & softmax layers to produce predictions, from which a discrimination-aware loss is constructed, denoted as $L_{p}$. For the last block, the final loss of whole network, denoted as $L$, is used as its discrimination-aware loss. Additionally, for each layer in the channel pruned network, a reconstruction loss is introduced to force it to re-produce the corresponding output feature map in the original network. We denote the $q$-th layer's reconstruction loss as $L_{q}^{( R )}$.
+
+Based on a pre-trained model, the DCP algorithm performs channel pruning with $\left( P + 1 \right)$ stages. During the $p$-th stage, the network is fine-tuned with the $p$-th discrimination-aware loss $L_{p}$ plus the final loss $L$. After the block-wise fine-tuning, we sequentially perform channel pruning for each convolutional layer within the block. For channel pruning, we compute each input channel's gradients *w.r.t.* the reconstruction loss $L_{q}^{( R )}$ plus the discrimination-aware loss $L_{p}$, and remove the input channel with the minimal Frobenius norm of gradients. After that, this layer is fine-tuned with the remaining input channels only to (partially) recover the discriminative power. We repeat this process until the target pruning ratio is reached.
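+
+One way to write the channel selection criterion described above is sketched below. Here we read "each input channel's gradients" as the gradients with respect to the kernel slice $\mathbf{W}_{:, j, :, :}$ of the $q$-th layer; this reading and the symbol $j^{*}$ are assumptions made for illustration, not notation taken from the original paper:
+
+$$
+j^{*} = \arg\min_{j} \left\| \frac{\partial \left( L_{q}^{( R )} + L_{p} \right)}{\partial \mathbf{W}_{:, j, :, :}} \right\|_{F}
+$$
+
+Under this reading, $j^{*}$ is the input channel removed in the current step, and the gradients are re-computed after the layer has been fine-tuned with the remaining channels.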
+
+After all convolutional layers have been pruned, the resulting network can be further fine-tuned for a few epochs to further reduce the performance loss.
+
+## Hyper-parameters
+
+Below is the full list of hyper-parameters used in the discrimination-aware channel pruning learner:
+
+| Name | Description |
+|:-----|:------------|
+| `dcp_save_path`      | model's save path |
+| `dcp_save_path_eval` | model's save path for evaluation |
+| `dcp_prune_ratio`    | target pruning ratio |
+| `dcp_nb_stages`      | number of channel pruning stages |
+| `dcp_lrn_rate_adam`  | Adam's learning rate for block-wise & layer-wise fine-tuning |
+| `dcp_nb_iters_block` | number of iterations for block-wise fine-tuning |
+| `dcp_nb_iters_layer` | number of iterations for layer-wise fine-tuning |
+
+Here, we provide a detailed description (and some analysis) of the above hyper-parameters:
+
+* `dcp_save_path`: save path for model created in the training graph. The resulting checkpoint files can be used to resume training from a previous run and compute model's loss function's value and some other evaluation metrics.
+* `dcp_save_path_eval`: save path for model created in the evaluation graph. The resulting checkpoint files can be used to export GraphDef & TensorFlow Lite model files.
+* `dcp_prune_ratio`: target pruning ratio for input channels of each convolutional layer. The larger `dcp_prune_ratio` is, the more input channels will be pruned. If `dcp_prune_ratio` equals 0, then no input channels will be pruned and model remains the same; if `dcp_prune_ratio` equals 1, then all input channels will be pruned.
+* `dcp_nb_stages`: number of channel pruning stages / number of discrimination-aware losses. The training process of the DCP algorithm is divided into multiple stages. For each discrimination-aware loss, a channel pruning stage is involved to select channels within corresponding layers. The final classification loss corresponds to a pseudo channel pruning stage, which is not counted in `dcp_nb_stages`. The larger `dcp_nb_stages` is, the slower the training process will be.
+* `dcp_lrn_rate_adam`: Adam's learning rate for block-wise & layer-wise fine-tuning. If `dcp_lrn_rate_adam` is too large, then the fine-tuning process may become unstable; if `dcp_lrn_rate_adam` is too small, then the fine-tuning process may take a long time to converge.
+* `dcp_nb_iters_block`: number of iterations for block-wise fine-tuning. This should be set large enough that the block-wise fine-tuning nearly converges, *i.e.* the loss function's value would not decrease much further even if more iterations were used.
+* `dcp_nb_iters_layer`: number of iterations for layer-wise fine-tuning. This should be set large enough that the layer-wise fine-tuning nearly converges, *i.e.* the loss function's value would not decrease much further even if more iterations were used.
+
+## Usage Examples
+
+In this section, we provide some usage examples to demonstrate how to use `DisChnPrunedLearner` under different execution modes and hyper-parameter combinations:
+
+To compress a ResNet-20 model for CIFAR-10 classification task in the local mode, use:
+
+``` bash
+# set the target pruning ratio to 0.75
+./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
+    --learner dis-chn-pruned \
+    --dcp_prune_ratio 0.75
+```
+
+To compress a ResNet-34 model for ILSVRC-12 classification task in the docker mode with 4 GPUs, use:
+
+``` bash
+# set the number of channel pruning stages to 4
+./scripts/run_docker.sh nets/resnet_at_ilsvrc12_run.py -n=4 \
+    --learner dis-chn-pruned \
+    --resnet_size 34 \
+    --dcp_nb_stages 4
+```
+
+To compress a MobileNet-v2 model for ILSVRC-12 classification task in the seven mode with 8 GPUs, use:
+
+``` bash
+# enable training with distillation loss
+./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py -n=8 \
+    --learner dis-chn-pruned \
+    --mobilenet_version 2 \
+    --enbl_dst
+```
diff --git a/docs/docs/distillation.md b/docs/docs/distillation.md
new file mode 100644
index 0000000..5db8487
--- /dev/null
+++ b/docs/docs/distillation.md
@@ -0,0 +1,28 @@
+# Distillation
+
+Distillation (Hinton et al., 2015) is a model compression approach in which a pre-trained large model teaches a smaller model to achieve similar prediction performance.
+It is often referred to as "teacher-student" training, where the large model is the teacher and the smaller model is the student.
+
+With distillation, knowledge can be transferred from the teacher model to the student by minimizing a loss function to recover the distribution of class probabilities predicted by the teacher model.
+In most situations, the probability of the correct class predicted by the teacher model is very high, and probabilities of other classes are close to 0, which may not be able to provide extra information beyond ground-truth labels.
+To overcome this issue, a commonly-used solution is to raise the temperature of the final softmax function until the cumbersome model produces a suitably soft set of targets. The softened probability $q_i$ of class $i$ is calculated from the logit $z_i$:
+
+$$
+q_i = \frac{\exp \left( z_i / T \right)}{\sum_j{\exp \left( z_j / T \right)}}
+$$
+
+where $T$ is the temperature.
+As $T$ grows, the probability distribution becomes smoother, providing more information about which classes the cumbersome model considers more similar to the predicted class.
+It is also beneficial to include the standard loss ($T = 1$) between the predicted class probabilities and ground-truth labels.
+The overall loss function is given by:
+
+$$
+L \left( x; W \right) = H \left( y, \sigma \left( z_s; T = 1 \right) \right) + \alpha \cdot H \left( \sigma \left( z_t; T = \tau \right), \sigma \left( z_s; T = \tau \right) \right)
+$$
+
+where $x$ is the input, $W$ denotes the parameters of the distilled small model, $y$ denotes the ground-truth labels, $z_s$ and $z_t$ are the logits of the student and teacher models, $\sigma$ is the softmax parameterized by temperature $T$, $H$ is the cross-entropy loss, and $\alpha$ is the coefficient of distillation loss.
+The coefficient $\alpha$ can be set by `--loss_w_dst` and the distillation temperature $\tau$ can be set by `--tempr_dst`.
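+
+To illustrate the effect of the temperature, consider a hypothetical three-class example with logits $z = \left( 2, 1, 0 \right)$ (values chosen only for illustration). Applying the temperature-scaled softmax defined earlier gives approximately:
+
+$$
+T = 1: ~ q \approx \left( 0.665, 0.245, 0.090 \right), \qquad T = 2: ~ q \approx \left( 0.506, 0.307, 0.186 \right)
+$$
+
+The larger temperature moves probability mass from the top class to the others, so the relative similarity between classes becomes easier for the student model to pick up.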
+
+## Combination with Other Model Compression Approaches
+
+Other model compression techniques, such as channel pruning, weight pruning, and quantization, can be augmented with distillation. To enable the distillation loss, simply append the `--enbl_dst` argument when starting the program.
diff --git a/docs/docs/faq.md b/docs/docs/faq.md
new file mode 100644
index 0000000..93d34f8
--- /dev/null
+++ b/docs/docs/faq.md
@@ -0,0 +1,5 @@
+# Frequently Asked Questions
+
+**Q: Under construction ...**
+
+A: Under construction ...
\ No newline at end of file
diff --git a/docs/docs/index.md b/docs/docs/index.md
new file mode 100644
index 0000000..9d10fdf
--- /dev/null
+++ b/docs/docs/index.md
@@ -0,0 +1,73 @@
+# PocketFlow
+
+PocketFlow is an open-source framework for compressing and accelerating deep learning models with minimal human effort. Deep learning is widely used in various areas, such as computer vision, speech recognition, and natural language translation. However, deep learning models are often computationally expensive, which limits further applications on mobile devices with limited computational resources.
+
+PocketFlow aims at providing an easy-to-use toolkit for developers to improve the inference efficiency with little or no performance degradation. Developers only need to specify the desired compression and/or acceleration ratios and then PocketFlow will automatically choose proper hyper-parameters to generate a highly efficient compressed model for deployment.
+
+## Framework
+
+The proposed framework mainly consists of two categories of algorithm components, *i.e.* learners and hyper-parameter optimizers, as depicted in the figure below. Given an uncompressed original model, the learner module generates a candidate compressed model using some randomly chosen hyper-parameter combination. The candidate model's accuracy and computation efficiency are then evaluated and used by the hyper-parameter optimizer module as the feedback signal to determine the next hyper-parameter combination to be explored by the learner module. After a few iterations, the best one of all the candidate models is output as the final compressed model.
+
+![Framework Design](pics/framework_design.png)
+
+## Learners
+
+A learner refers to some model compression algorithm augmented with several training techniques as shown in the figure above. Below is a list of model compression algorithms supported in PocketFlow:
+
+| Name | Description |
+|:-----|:------------|
+| `ChannelPrunedLearner`   | channel pruning with LASSO-based channel selection (He et al., 2017) |
+| `DisChnPrunedLearner`    | discrimination-aware channel pruning (Zhuang et al., 2018) |
+| `WeightSparseLearner`    | weight sparsification with dynamic pruning schedule (Zhu & Gupta, 2017) |
+| `UniformQuantLearner`    | weight quantization with uniform reconstruction levels (Jacob et al., 2018) |
+| `UniformQuantTFLearner`  | weight quantization with uniform reconstruction levels and TensorFlow APIs |
+| `NonUniformQuantLearner` | weight quantization with non-uniform reconstruction levels (Han et al., 2016) |
+
+All the above model compression algorithms can be trained with fast fine-tuning, which is to directly derive a compressed model from the original one by applying either pruning masks or quantization functions. The resulting model can be fine-tuned with a few iterations to recover the accuracy to some extent. Alternatively, the compressed model can be re-trained with the full training data, which leads to higher accuracy but usually takes longer to complete.
+
+To further reduce the compressed model's performance degradation, we adopt network distillation to augment its training process with an extra loss term, using the original uncompressed model's outputs as soft labels. Additionally, multi-GPU distributed training is enabled for all learners to speed up the time-consuming training process.
+
+## Hyper-parameter Optimizers
+
+For model compression algorithms, there are several hyper-parameters that may have a large impact on the final compressed model's performance. It can be quite difficult to manually determine proper values for these hyper-parameters, especially for developers that are not very familiar with algorithm details. Recently, several AutoML systems, *e.g.* [Cloud AutoML](https://cloud.google.com/automl/) from Google, have been developed to train high-quality machine learning models with minimal human effort. Particularly, the AMC algorithm (He et al., 2018) presents promising results for adopting reinforcement learning for automated model compression with channel pruning and fine-grained pruning.
+
+In PocketFlow, we introduce the hyper-parameter optimizer module to iteratively search for the optimal hyper-parameter setting. We provide several implementations of hyper-parameter optimizer, based on models including Gaussian Processes (GP, Mockus, 1975), Tree-structured Parzen Estimator (TPE, Bergstra et al., 2013), and Deep Deterministic Policy Gradient (DDPG, Lillicrap et al., 2016). The hyper-parameter setting is optimized through an iterative process. In each iteration, the hyper-parameter optimizer chooses a combination of hyper-parameter values, and the learner generates a candidate model with fast fine-tuning. The candidate model is evaluated to calculate the reward of the current hyper-parameter setting. After that, the hyper-parameter optimizer updates its model to improve its estimation on the hyper-parameter space. Finally, when the best candidate model (and corresponding hyper-parameter setting) is selected after some iterations, this model can be re-trained with full data to further reduce the performance loss.
+
+## Performance
+
+In this section, we present some of our results for applying various model compression methods for ResNet and MobileNet models on the ImageNet classification task, including channel pruning, weight sparsification, and uniform quantization.
+For complete evaluation results, please refer to [here](https://pocketflow.github.io/performance/).
+
+### Channel Pruning
+
+We adopt the DDPG algorithm as the RL agent to find the optimal layer-wise pruning ratios, and use group fine-tuning to further improve the compressed model's accuracy:
+
+| Model        | FLOPs | Uniform | RL-based      | RL-based + Group Fine-tuning |
+|:------------:|:-----:|:-------:|:-------------:|:----------------------------:|
+| MobileNet-v1 | 50%   | 66.5%   | 67.8% (+1.3%) | 67.9% (+1.4%)                |
+| MobileNet-v1 | 40%   | 66.2%   | 66.9% (+0.7%) | 67.0% (+0.8%)                |
+| MobileNet-v1 | 30%   | 64.4%   | 64.5% (+0.1%) | 64.8% (+0.4%)                |
+| MobileNet-v1 | 20%   | 61.4%   | 61.4% (+0.0%) | 62.2% (+0.8%)                |
+
+### Weight Sparsification
+
+Compared with the original algorithm (Zhu & Gupta, 2017), which uses the same sparsity for all layers, we incorporate the DDPG algorithm to iteratively search for the optimal sparsity of each layer, which leads to increased accuracy:
+
+| Model        | Sparsity | (Zhu & Gupta, 2017) | RL-based                |
+|:------------:|:--------:|:-------------------:|:-----------------------:|
+| MobileNet-v1 | 50%      | 69.5%               | 70.5% (+1.0%)           |
+| MobileNet-v1 | 75%      | 67.7%               | 68.5% (+0.8%)           |
+| MobileNet-v1 | 90%      | 61.8%               | 63.4% (+1.6%)           |
+| MobileNet-v1 | 95%      | 53.6%               | 56.8% (+3.2%)           |
+
+### Uniform Quantization
+
+We show that models with 32-bit floating-point number weights can be safely quantized into their 8-bit counterpart without accuracy loss (sometimes even better!).
+The resulting model can be deployed on mobile devices for faster inference (Device: XiaoMi 8 with a Snapdragon 845 CPU):
+
+| Model        | Acc. (32-bit) | Acc. (8-bit)    | Time (32-bit) | Time (8-bit)         |
+|:------------:|:-------------:|:---------------:|:-------------:|:--------------------:|
+| MobileNet-v1 | 70.89%        | 71.29% (+0.40%) | 124.53        | 56.12 (2.22$\times$) |
+| MobileNet-v2 | 71.84%        | 72.26% (+0.42%) | 120.59        | 49.04 (2.46$\times$) |
+
+* All reported times are in milliseconds.
diff --git a/docs/docs/installation.md b/docs/docs/installation.md
new file mode 100644
index 0000000..e2a5af3
--- /dev/null
+++ b/docs/docs/installation.md
@@ -0,0 +1,99 @@
+# Installation
+
+PocketFlow is developed and tested on Linux, using Python 3.6 and TensorFlow 1.10.0. We support the following three execution modes for PocketFlow:
+
+* Local mode: run PocketFlow on the local machine.
+* Docker mode: run PocketFlow within a docker image.
+* Seven mode: run PocketFlow on the seven cluster (only available within Tencent).
+
+## Clone PocketFlow
+
+To make a local copy of the PocketFlow repository, use:
+
+``` bash
+$ git clone https://github.com/Tencent/PocketFlow.git
+```
+
+## Create a Path Configuration File
+
+PocketFlow requires a path configuration file, named `path.conf`, to set up directory paths to data sets and pre-trained models under different execution modes, as well as HDFS / HTTP connection parameters.
+
+We have provided a template file to help you create your own path configuration file. You can find it in the PocketFlow repository, named `path.conf.template`, which contains more detailed descriptions on how to customize path configurations. For instance, if you want to use CIFAR-10 and ImageNet data sets stored on the local machine, then the path configuration file should look like this:
+
+``` bash
+# data files
+data_hdfs_host = None
+data_dir_local_cifar10 = /home/user_name/datasets/cifar-10-batches-bin  # this line has been edited!
+data_dir_hdfs_cifar10 = None
+data_dir_seven_cifar10 = None
+data_dir_docker_cifar10 = /opt/ml/data  # DO NOT EDIT
+data_dir_local_ilsvrc12 = /home/user_name/datasets/imagenet_tfrecord  # this line has been edited!
+data_dir_hdfs_ilsvrc12 = None
+data_dir_seven_ilsvrc12 = None
+data_dir_docker_ilsvrc12 = /opt/ml/data  # DO NOT EDIT
+
+# model files
+model_http_url = https://api.ai.tencent.com/pocketflow
+```
+
+In short, you need to replace "None" in the template file with the actual path (or HDFS / HTTP connection parameters) if available, or leave it unchanged otherwise.
+
+## Prepare for the Local Mode
+
+We recommend using Anaconda as the Python environment, which has many essential packages built-in. The Anaconda installer can be downloaded from [here](https://www.anaconda.com/download/#linux). To install, use the following command:
+
+``` bash
+# install Anaconda; replace the installer's file name if needed
+$ bash Anaconda3-5.2.0-Linux-x86_64.sh
+
+# activate Anaconda's Python path
+$ source ~/.bashrc
+```
+
+For Anaconda 5.3.0 or later, the default Python version is 3.7, which does not support installing TensorFlow with pip directly. Therefore, you need to manually switch to Python 3.6 once Anaconda is installed:
+
+``` bash
+# install Python 3.6
+$ conda install python=3.6
+```
+
+To install TensorFlow, you may refer to TensorFlow's official [documentation](https://www.tensorflow.org/install/pip) for detailed instructions. In particular, if GPU-based training is required, then you need to follow the [GPU support guide](https://www.tensorflow.org/install/gpu) to set up a CUDA-enabled GPU card prior to installation. After that, install TensorFlow with:
+
+``` bash
+# TensorFlow with GPU support; use <tensorflow> if GPU is not available
+$ pip install tensorflow-gpu
+
+# verify the install
+$ python -c "import tensorflow as tf; print(tf.__version__)"
+```
+
+To run PocketFlow in the local mode, *e.g.* to train a full-precision ResNet-20 model for the CIFAR-10 classification task, use the following command:
+
+``` bash
+$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py
+```
+
+## Prepare for the Docker Mode
+
+Docker offers an alternative way to run PocketFlow within an isolated container, so that your local Python environment remains untouched. We recommend that you use the [horovod](https://github.com/uber/horovod) docker image provided by Uber, which enables multi-GPU distributed training for TensorFlow with only a few lines of modification. Once docker is installed, the docker image can be obtained via:
+
+``` bash
+# obtain the docker image
+$ docker pull uber/horovod
+```
+
+To run PocketFlow in the docker mode, *e.g.* to train a full-precision ResNet-20 model for the CIFAR-10 classification task, use the following command:
+
+``` bash
+$ ./scripts/run_docker.sh nets/resnet_at_cifar10_run.py
+```
+
+## Prepare for the Seven Mode
+
+Seven is a distributed learning platform built for both CPU and GPU clusters. Users can submit tasks to the seven cluster, using built-in data sets and docker images seamlessly.
+
+To run PocketFlow in the seven mode, *e.g.* to train a full-precision ResNet-20 model for the CIFAR-10 classification task, use the following command:
+
+``` bash
+$ ./scripts/run_seven.sh nets/resnet_at_cifar10_run.py
+```
diff --git a/docs/docs/multi_gpu_training.md b/docs/docs/multi_gpu_training.md
new file mode 100644
index 0000000..6d9b88e
--- /dev/null
+++ b/docs/docs/multi_gpu_training.md
@@ -0,0 +1,111 @@
+# Multi-GPU Training
+
+Due to the high computational complexity, it often takes hours or even days to fully train deep learning models using a single GPU.
+In PocketFlow, we adopt multi-GPU training to speed up this time-consuming training process.
+Our implementation is compatible with:
+
+* [Horovod](https://github.com/uber/horovod): a distributed training framework for TensorFlow, Keras, and PyTorch.
+* TF-Plus: an optimized framework for TensorFlow-based distributed training (only available within Tencent).
+
+We provide a wrapper class, `MultiGpuWrapper`, to seamlessly switch between the above two frameworks.
+It sequentially checks whether Horovod and TF-Plus can be used, and uses the first available one as the underlying framework for multi-GPU training.
+
+The main reason for using Horovod or TF-Plus instead of TensorFlow's original distributed training routine is that these frameworks provide many easy-to-use APIs and require far fewer code changes to move from single-GPU to multi-GPU training, as we shall see later.
+
+## From Single-GPU to Multi-GPU
+
+To extend a single-GPU based training script to the multi-GPU scenario, at most 7 steps are needed:
+
+* Import the Horovod or TF-Plus module.
+
+``` Python
+from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
+```
+
+* Initialize the multi-GPU training framework, as early as possible.
+
+``` Python
+mgw.init()
+```
+
+* For each worker, create a session with a distinct GPU device.
+
+``` Python
+config = tf.ConfigProto()
+config.gpu_options.visible_device_list = str(mgw.local_rank())
+sess = tf.Session(config=config)
+```
+
+* (Optional) Let each worker use a distinct subset of training data.
+
+``` Python
+filenames = tf.data.Dataset.list_files(file_pattern, shuffle=True)
+filenames = filenames.shard(mgw.size(), mgw.rank())
+```
+
+* Wrap the optimizer for distributed gradient communication.
+
+``` Python
+optimizer = tf.train.AdamOptimizer(learning_rate=lrn_rate)
+optimizer = mgw.DistributedOptimizer(optimizer)
+train_op = optimizer.minimize(loss)
+```
+
+* Synchronize master's parameters to all the other workers.
+
+``` Python
+bcast_op = mgw.broadcast_global_variables(0)
+sess.run(tf.global_variables_initializer())
+sess.run(bcast_op)
+```
+
+* (Optional) Save checkpoint files at the master node periodically.
+
+``` Python
+if mgw.rank() == 0:
+  saver.save(sess, save_path, global_step)
+```
+
+## Usage Example
+
+Here, we provide a code snippet to demonstrate how to use multi-GPU training to speed up training.
+Please note that many implementation details are omitted for clarity.
+
+``` Python
+import tensorflow as tf
+from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
+
+# initialization
+mgw.init()
+
+# create the training graph
+with tf.Graph().as_default():
+  # create a TensorFlow session
+  config = tf.ConfigProto()
+  config.gpu_options.visible_device_list = str(mgw.local_rank())
+  sess = tf.Session(config=config)
+
+  # use tf.data.Dataset() to traverse images and labels
+  filenames = tf.data.Dataset.list_files(file_pattern, shuffle=True)
+  filenames = filenames.shard(mgw.size(), mgw.rank())
+  images, labels = get_images_n_labels(filenames)
+
+  # define the network and its loss function
+  logits = forward_pass(images)
+  loss = calc_loss(labels, logits)
+
+  # create an optimizer and setup training-related operations
+  global_step = tf.train.get_or_create_global_step()
+  optimizer = tf.train.AdamOptimizer(learning_rate=lrn_rate)
+  optimizer = mgw.DistributedOptimizer(optimizer)
+  train_op = optimizer.minimize(loss, global_step=global_step)
+  bcast_op = mgw.broadcast_global_variables(0)
+
+  # create the initializer and saver inside the training graph
+  init_op = tf.global_variables_initializer()
+  saver = tf.train.Saver()
+
+# multi-GPU training
+sess.run(init_op)
+sess.run(bcast_op)
+for idx_iter in range(nb_iters):
+  sess.run(train_op)
+  if mgw.rank() == 0 and (idx_iter + 1) % save_step == 0:
+    saver.save(sess, save_path, global_step)
+```
diff --git a/docs/docs/nuq_learner.md b/docs/docs/nuq_learner.md
new file mode 100644
index 0000000..37d2c59
--- /dev/null
+++ b/docs/docs/nuq_learner.md
@@ -0,0 +1,117 @@
+# Non-Uniform Quantization Learner
+
+Non-uniform quantization is a generalization of uniform quantization. In non-uniform quantization, the quantization points are not distributed evenly, and can be optimized via the back-propagation of the network gradients. Consequently, with the same number of bits, non-uniform quantization can approximate the original full-precision network more closely than uniform quantization. Nevertheless, a non-uniformly quantized model cannot be accelerated directly with current deep learning frameworks, since low-precision multiplication requires the intervals between quantization points to be equal. Therefore, the `NonUniformQuantLearner` can only help compress the model better, not speed it up.
+
+## Algorithm
+
+`NonUniformQuantLearner` adopts a training and evaluation procedure similar to that of `UniformQuantLearner`. In the forward pass, the quantized weights are used, while in the backward pass, the full-precision weights are updated via the straight-through estimator (STE). The major difference from uniform quantization is that the quantization points are not evenly distributed, and can be initialized and optimized in different ways. In the following, we introduce the schemes for updating and initializing the quantization points.
+
+### Optimization of quantization points
+
+Unlike uniform quantization, non-uniform quantization can optimize the locations of quantization points dynamically during the training of the network, and thereby leads to less quantization loss. The location of a quantization point can be updated by summing the gradients of the weights assigned to that point ([Han et al., 2015](https://arxiv.org/abs/1510.00149)), i.e.:
+$$
+\frac{\partial \mathcal{L}}{\partial c_k} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial w_{ij}} \frac{\partial w_{ij}}{\partial c_k} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial w_{ij}} \mathbb{1}(I_{ij} = k)
+$$
+where $c_k$ is the $k$-th quantization point and $I_{ij}$ is the index of the quantization point assigned to weight $w_{ij}$. The following figure, taken from [Han et al., 2015](https://arxiv.org/abs/1510.00149), shows the above process of updating the clusters:
+
+![Deep Compression Algor](pics/deep_compression_algor.png)
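+
+As a small worked illustration of the update rule above (the numbers are hypothetical and only serve to make the indicator function concrete), consider four weights $w_1, \dots, w_4$, flattened into a vector for brevity, with assignments $I = (1, 2, 1, 2)$ and weight gradients $\left( 0.1, -0.2, 0.3, 0.4 \right)$. Each quantization point simply accumulates the gradients of the weights assigned to it:
+$$
+\frac{\partial \mathcal{L}}{\partial c_1} = 0.1 + 0.3 = 0.4, \qquad \frac{\partial \mathcal{L}}{\partial c_2} = -0.2 + 0.4 = 0.2
+$$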
+
+### Initialization of quantization points
+
+Aside from optimizing the quantization points, another helpful strategy is to properly initialize the quantization points according to the distribution of weights. PocketFlow currently supports two kinds of initialization:
+
+- Uniform initialization: The quantization points are initialized to be evenly distributed along the range $[w_{min}, w_{max}]$ of that layer/bucket.
+- Quantile initialization: The quantization points are initialized to be the quantiles of the full-precision weights. Compared to uniform initialization, quantile initialization generally leads to better performance.
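+
+To make the two styles concrete, one possible formalization is sketched below. It is an assumption for illustration rather than a description of PocketFlow's exact implementation: $K = 2^{b}$ denotes the number of quantization points for $b$ bits, the uniform variant is assumed to include both end points of the range, and $Q(p)$ denotes the empirical $p$-quantile of the full-precision weights, evaluated at evenly spaced probability levels (one common convention among several).
+$$
+c_k^{\text{uniform}} = w_{min} + \frac{k - 1}{K - 1} \left( w_{max} - w_{min} \right), \qquad c_k^{\text{quantile}} = Q \left( \frac{2k - 1}{2K} \right), \qquad k = 1, \dots, K
+$$
+For instance, with $b = 2$ and $[w_{min}, w_{max}] = [-1, 1]$, the uniform variant gives the points $-1, -\frac{1}{3}, \frac{1}{3}, 1$, whereas the quantile variant places more points where the weights are dense.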
+
+## Hyper-parameters
+
+To configure `NonUniformQuantLearner`, users can pass the options via the TensorFlow flag interface. The available options are as follows:
+
+| Options                     | Description                                                  |
+| :-------------------------- | :----------------------------------------------------------- |
+| `nuql_opt_mode`             | the fine-tuning mode: [`weights`, `clusters`, `both`]. Default: `weights`. |
+| `nuql_init_style`           | the initialization of quantization point: [`quantile`, `uniform`].  Default: `quantile`. |
+| `nuql_weight_bits`          | the number of bits for weight. Default: `4`.                 |
+| `nuql_activation_bits`      | the number of bits for activation. Default: `32`.            |
+| `nuql_save_quant_mode_path` | the save path for quantized models. Default: `./nuql_quant_models/model.ckpt`. |
+| `nuql_use_buckets`          | the switch to use bucketing. Default: `False`.               |
+| `nuql_bucket_type`          | the type of bucketing: [`split`, `channel`]. Default: `channel`. |
+| `nuql_bucket_size`          | the bucket size for bucket type `split`. Default: `256`.     |
+| `nuql_enbl_rl_agent`        | the switch to enable RL to learn optimal bit strategy. Default: `False`. |
+| `nuql_quantize_all_layers`  | the switch to quantize first and last layers of network. Default: `False`. |
+| `nuql_quant_epoch`          | the number of epochs for fine-tuning. Default: `60`.         |
+
+Here, we provide detailed description (and some analysis) for some of the above hyper-parameters:
+
+- `nuql_opt_mode`: the mode for fine-tuning the non-uniformly quantized network, chosen from [`weights`, `clusters`, `both`]. `weights` refers to only updating the network weights, `clusters` refers to only updating the quantization points, and `both` means updating weights and quantization points simultaneously. Experimentally, we found that `weights` and `both` achieve similar performance, both of which outperform `clusters`.
+- `nuql_init_style`: the style of initialization of quantization points; [`quantile`, `uniform`] are currently supported. The differences between the two strategies have been discussed earlier.
+- `nuql_weight_bits`: the number of bits for weight quantization. Generally, for lower-bit quantization (e.g., 2 bits on CIFAR-10 and 4 bits on ILSVRC-12), `NonUniformQuantLearner` performs much better than `UniformQuantLearner`. The gap narrows when more bits are used.
+- `nuql_activation_bits`: the number of bits for activation quantization. Since non-uniformly quantized models cannot be accelerated directly, we leave it at 32 bits by default.
+- `nuql_save_quant_mode_path`: the path to save the quantized model. Quantization nodes have already been inserted into the graph.
+- `nuql_use_buckets`: the switch to turn on bucketing. With bucketing, weights are split into multiple pieces, and $\alpha$ and $\beta$ are calculated individually for each piece. Therefore, turning on bucketing can lead to more fine-grained quantization.
+- `nuql_bucket_type`: the type of bucketing. Currently two types are supported: [`split`, `channel`]. With `split`, the weights of a layer are first concatenated into a long vector, which is then cut into short pieces according to `nuql_bucket_size`. The remaining last piece, even if shorter, is still regarded as a separate piece. After each piece is quantized, the vector is folded back to the original shape to form the quantized weights. With `channel`, weights with shape `[k, k, cin, cout]` in a convolutional layer are cut into `cout` buckets, each of size `k * k * cin`. Weights with shape `[m, n]` in fully connected layers are cut into `n` buckets, each of size `m`. In practice, bucketing with type `channel` can be computed more quickly than type `split`, since there are fewer buckets to iterate through.
+- `nuql_bucket_size`: the size of buckets when using bucket type `split`. Generally, a smaller bucket size leads to more fine-grained quantization, but requires more storage, since the full-precision statistics ($\alpha$ and $\beta$) of each bucket need to be kept.
+- `nuql_quantize_all_layers`: the switch to quantize the first and last layers. The first and last layers of the network are connected directly to the input and output, and are arguably more sensitive to quantization. Keeping them un-quantized can slightly improve the model's performance. Nevertheless, if you want to accelerate inference, all layers should be quantized.
+- `nuql_quant_epoch`: the epochs for fine-tuning a quantized network.
+- `nuql_enbl_rl_agent`: the switch to turn on the RL agent as hyper parameter optimizer. Details about the RL agent and its configurations are described below.
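+
+As a worked example of the two bucketing types described under `nuql_bucket_type` (the layer shape is hypothetical), consider a convolutional layer with weights of shape `[3, 3, 64, 128]`, i.e. $3 \cdot 3 \cdot 64 \cdot 128 = 73728$ weights in total. The number of buckets is then:
+$$
+N_{channel} = c_{out} = 128, \qquad N_{split} = \left\lceil \frac{73728}{256} \right\rceil = 288
+$$
+Here, `channel` yields 128 buckets of size $3 \cdot 3 \cdot 64 = 576$ each, while `split` with the default `nuql_bucket_size` of `256` yields 288 buckets; the ceiling accounts for a possibly shorter last piece.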
+
+### Configure the RL Agent
+
+Similar to uniform quantization, once `nuql_enbl_rl_agent` is set to `True`, the RL agent will automatically search for the optimal bit allocation strategy for each layer. In order to search efficiently, the agent needs to be configured properly. Although we list all the configurable hyper-parameters for the agent here, users can keep the default values for most of them and modify only a few if necessary.
+
+| Options                       | Description                                                  |
+| :---------------------------- | :----------------------------------------------------------- |
+| `nuql_equivalent_bits`       | the number of re-allocated bits that is equivalent to non-uniform quantization without RL agent. Default: `4`. |
+| `nuql_nb_rlouts`              | the number of roll outs for training the RL agent. Default: `200`. |
+| `nuql_w_bit_min`              | the minimal number of bits for each layer. Default: `2`.     |
+| `nuql_w_bit_max`              | the maximal number of bits for each layer. Default: `8`.     |
+| `nuql_enbl_rl_global_tune`    | the switch to fine-tune all layers of the network. Default: `True`. |
+| `nuql_enbl_rl_layerwise_tune` | the switch to fine-tune the network layer by layer. Default: `False`. |
+| `nuql_tune_layerwise_steps`   | the number of steps for layer-wise fine-tuning. Default: `300`. |
+| `nuql_tune_global_steps`      | the number of steps for global fine-tuning. Default: `2000`. |
+| `nuql_tune_disp_steps`        | the display steps to show the fine-tuning progress. Default: `100`. |
+| `nuql_enbl_random_layers`     | the switch to randomly permute layers during RL agent training. Default: `True`. |
+
+Detailed descriptions can be found in [Uniform Quantization](https://pocketflow.github.io/uq_learner/), with the only difference being that the prefix is changed to `nuql_`.
+
+## Usage Examples
+
+Again, users should first get the model prepared. Users can either use the pre-built models in PocketFlow, or develop their customized nets following the model definition in PocketFlow (for example, [resnet_at_cifar10.py](https://github.com/Tencent/PocketFlow/blob/master/nets/resnet_at_cifar10.py)). Once the model is built, the Non-Uniform Quantization Learner can be easily triggered as follows.
+
+To quantize a ResNet-20 model for CIFAR-10 classification task with 4 bits in the local mode, use:
+
+```bash
+# quantize resnet-20 on CIFAR-10
+sh ./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
+--learner=non-uniform \
+--nuql_weight_bits=4 \
+--nuql_activation_bits=4
+```
+
+To quantize a ResNet-18 model for ILSVRC_12 classification task with 8 bits in the docker mode with 4 GPUs, with channel-wise bucketing enabled, use:
+
+``` bash
+# quantize the resnet-18 on ILSVRC-12
+sh ./scripts/run_docker.sh nets/resnet_at_ilsvrc12_run.py \
+-n=4 \
+--learner=non-uniform \
+--nuql_weight_bits=8 \
+--nuql_activation_bits=8 \
+--nuql_use_buckets=True \
+--nuql_bucket_type=channel
+```
+
+To quantize a MobileNet-v1 model for ILSVRC_12 classification task with 4 bits in the seven mode with 8 GPUs, and allow the RL agent to search for the optimal bit strategy, use:
+
+```bash
+# quantize mobilenet-v1 on ILSVRC-12
+sh ./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py \
+-n=8 \
+--learner=non-uniform \
+--nuql_enbl_rl_agent=True \
+--nuql_equivalent_bits=4
+```
+
+## References
+
+Han S, Mao H, and Dally W J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. [arXiv:1510.00149, 2015](https://arxiv.org/abs/1510.00149)
diff --git a/docs/docs/performance.md b/docs/docs/performance.md
new file mode 100644
index 0000000..d7f6d63
--- /dev/null
+++ b/docs/docs/performance.md
@@ -0,0 +1,50 @@
+# Performance
+
+In this document, we present evaluation results of applying various model compression methods to ResNet and MobileNet models on the ImageNet classification task, including channel pruning, weight sparsification, and uniform quantization.
+
+We adopt `ChannelPrunedLearner` to shrink the number of channels for convolutional layers to reduce the computational complexity.
+Instead of using the same pruning ratio for all layers, we utilize the DDPG algorithm as the RL agent to iteratively search for the optimal pruning ratio of each layer.
+After obtaining the optimal pruning ratios, group fine-tuning is adopted to further improve the compressed model's accuracy, as demonstrated below:
+
+| Model        | Pruning Ratio | Uniform | RL-based      | RL-based + Group Fine-tuning |
+|:------------:|:-------------:|:-------:|:-------------:|:----------------------------:|
+| MobileNet-v1 | 50%           | 66.5%   | 67.8% (+1.3%) | 67.9% (+1.4%)                |
+| MobileNet-v1 | 60%           | 66.2%   | 66.9% (+0.7%) | 67.0% (+0.8%)                |
+| MobileNet-v1 | 70%           | 64.4%   | 64.5% (+0.1%) | 64.8% (+0.4%)                |
+| MobileNet-v1 | 80%           | 61.4%   | 61.4% (+0.0%) | 62.2% (+0.8%)                |
+
+**Note:** The original uncompressed MobileNet-v1's top-1 accuracy is 70.89%.
+
+We adopt `WeightSparseLearner` to introduce the sparsity constraint so that a large portion of model weights can be removed, which leads to a smaller model and lower FLOPs for inference.
+Compared with the original algorithm proposed in (Zhu & Gupta, 2017), we also incorporate network distillation and reinforcement learning algorithms to further improve the compressed model's accuracy, as shown in the table below:
+
+| Model        | Sparsity | (Zhu & Gupta, 2017) | RL-based      |
+|:------------:|:--------:|:-------------------:|:-------------:|
+| MobileNet-v1 | 50%      | 69.5%               | 70.5% (+1.0%) |
+| MobileNet-v1 | 75%      | 67.7%               | 68.5% (+0.8%) |
+| MobileNet-v1 | 90%      | 61.8%               | 63.4% (+1.6%) |
+| MobileNet-v1 | 95%      | 53.6%               | 56.8% (+3.2%) |
+
+**Note:** The original uncompressed MobileNet-v1's top-1 accuracy is 70.89%.
+
+We adopt `UniformQuantTFLearner` to uniformly quantize model weights from 32-bit floating-point numbers to 8-bit fixed-point numbers.
+The resulting model can be converted into the TensorFlow Lite format for deployment on mobile devices.
+In the following two tables, we show that 8-bit quantized models can be as accurate as (or even better than) the original 32-bit ones, and the inference time can be significantly reduced after quantization.
+
+| Model        | Top-1 Acc. (32-bit) | Top-5 Acc. (32-bit) | Top-1 Acc. (8-bit) | Top-5 Acc. (8-bit) |
+|:------------:|:-------------------:|:-------------------:|:------------------:|:------------------:|
+| ResNet-18    | 70.28%              | 89.38%              | 70.31% (+0.03%)    | 89.40% (+0.02%)    |
+| ResNet-50    | 75.97%              | 92.88%              | 76.01% (+0.04%)    | 92.87% (-0.01%)    |
+| MobileNet-v1 | 70.89%              | 89.56%              | 71.29% (+0.40%)    | 89.79% (+0.23%)    |
+| MobileNet-v2 | 71.84%              | 90.60%              | 72.26% (+0.42%)    | 90.77% (+0.17%)    |
+
+| Model        | Hardware    | CPU            | Time (32-bit) | Time (8-bit) | Speed-up     |
+|:------------:|:-----------:|:--------------:|:-------------:|:------------:|:------------:|
+| MobileNet-v1 | XiaoMi 8 SE | Snapdragon 710 | 156.33        | 62.60        | 2.50$\times$ |
+| MobileNet-v1 | XiaoMi 8    | Snapdragon 845 | 124.53        | 56.12        | 2.22$\times$ |
+| MobileNet-v1 | Huawei P20  | Kirin 970      | 152.54        | 68.43        | 2.23$\times$ |
+| MobileNet-v2 | XiaoMi 8 SE | Snapdragon 710 | 153.18        | 57.55        | 2.66$\times$ |
+| MobileNet-v2 | XiaoMi 8    | Snapdragon 845 | 120.59        | 49.04        | 2.46$\times$ |
+| MobileNet-v2 | Huawei P20  | Kirin 970      | 226.61        | 61.38        | 3.69$\times$ |
+
+* All reported times are in milliseconds.
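+
+The speed-up column is consistent with the ratio of the two inference times, where $T_{32\text{-bit}}$ and $T_{8\text{-bit}}$ are symbols introduced here for the two time columns. For example, for MobileNet-v1 on XiaoMi 8 SE:
+$$
+\text{Speed-up} = \frac{T_{32\text{-bit}}}{T_{8\text{-bit}}} = \frac{156.33}{62.60} \approx 2.50
+$$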
diff --git a/docs/docs/pics/dcp_learner.png b/docs/docs/pics/dcp_learner.png
new file mode 100644
index 0000000..cff9ad8
Binary files /dev/null and b/docs/docs/pics/dcp_learner.png differ
diff --git a/docs/docs/pics/deep_compression_algor.png b/docs/docs/pics/deep_compression_algor.png
new file mode 100644
index 0000000..10ab273
Binary files /dev/null and b/docs/docs/pics/deep_compression_algor.png differ
diff --git a/docs/framework_design.png b/docs/docs/pics/framework_design.png
similarity index 100%
rename from docs/framework_design.png
rename to docs/docs/pics/framework_design.png
diff --git a/docs/docs/pics/rl_workflow.png b/docs/docs/pics/rl_workflow.png
new file mode 100644
index 0000000..ea50231
Binary files /dev/null and b/docs/docs/pics/rl_workflow.png differ
diff --git a/docs/docs/pics/train_n_inference.png b/docs/docs/pics/train_n_inference.png
new file mode 100644
index 0000000..2b9c356
Binary files /dev/null and b/docs/docs/pics/train_n_inference.png differ
diff --git a/docs/docs/pics/wsl_pr_schedule.png b/docs/docs/pics/wsl_pr_schedule.png
new file mode 100644
index 0000000..bd6c892
Binary files /dev/null and b/docs/docs/pics/wsl_pr_schedule.png differ
diff --git a/docs/docs/pre_trained_models.md b/docs/docs/pre_trained_models.md
new file mode 100644
index 0000000..28a349b
--- /dev/null
+++ b/docs/docs/pre_trained_models.md
@@ -0,0 +1,23 @@
+# Pre-trained Models
+
+We maintain a list of pre-trained uncompressed models, so that the training process of model compression does not need to start from scratch.
+
+For the CIFAR-10 data set, we provide the following pre-trained models:
+
+| Model name | Accuracy | URL                                                                               |
+|:----------:|:--------:|:---------------------------------------------------------------------------------:|
+| LeNet      | 81.79%   | [Link](https://api.ai.tencent.com/pocketflow/models_lenet_at_cifar_10.tar.gz)     |
+| ResNet-20  | 91.93%   | [Link](https://api.ai.tencent.com/pocketflow/models_resnet_20_at_cifar_10.tar.gz) |
+| ResNet-32  | 92.59%   | [Link](https://api.ai.tencent.com/pocketflow/models_resnet_32_at_cifar_10.tar.gz) |
+| ResNet-44  | 92.76%   | [Link](https://api.ai.tencent.com/pocketflow/models_resnet_44_at_cifar_10.tar.gz) |
+| ResNet-56  | 93.23%   | [Link](https://api.ai.tencent.com/pocketflow/models_resnet_56_at_cifar_10.tar.gz) |
+
+For the ImageNet (ILSVRC-12) data set, we provide the following pre-trained models:
+
+| Model name   | Top-1 Acc. | Top-5 Acc. | URL                                                                                   |
+|:------------:|:----------:|:----------:|:-------------------------------------------------------------------------------------:|
+| ResNet-18    | 70.28%     | 89.38%     | [Link](https://api.ai.tencent.com/pocketflow/models_resnet_18_at_ilsvrc_12.tar.gz)    |
+| ResNet-34    | 73.41%     | 91.27%     | [Link](https://api.ai.tencent.com/pocketflow/models_resnet_34_at_ilsvrc_12.tar.gz)    |
+| ResNet-50    | 75.97%     | 92.88%     | [Link](https://api.ai.tencent.com/pocketflow/models_resnet_50_at_ilsvrc_12.tar.gz)    |
+| MobileNet-v1 | 70.89%     | 89.56%     | [Link](https://api.ai.tencent.com/pocketflow/models_mobilenet_v1_at_ilsvrc_12.tar.gz) |
+| MobileNet-v2 | 71.84%     | 90.60%     | [Link](https://api.ai.tencent.com/pocketflow/models_mobilenet_v2_at_ilsvrc_12.tar.gz) |
diff --git a/docs/docs/reference.md b/docs/docs/reference.md
new file mode 100644
index 0000000..8865321
--- /dev/null
+++ b/docs/docs/reference.md
@@ -0,0 +1,13 @@
+# Reference
+
+* [**Bengio et al., 2013**] Yoshua Bengio, Nicholas Leonard, and Aaron Courville. *Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation*. CoRR, abs/1308.3432, 2013.
+* [**Bergstra et al., 2013**] J. Bergstra, D. Yamins, and D. D. Cox. *Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures*. In International Conference on Machine Learning (ICML), pages 115-123, Jun 2013.
+* [**Han et al., 2016**] Song Han, Huizi Mao, and William J. Dally. *Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding*. In International Conference on Learning Representations (ICLR), 2016.
+* [**He et al., 2017**] Yihui He, Xiangyu Zhang, and Jian Sun. *Channel Pruning for Accelerating Very Deep Neural Networks*. In IEEE International Conference on Computer Vision (ICCV), pages 1389-1397, 2017.
+* [**He et al., 2018**] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. *AMC: AutoML for Model Compression and Acceleration on Mobile Devices*. In European Conference on Computer Vision (ECCV), pages 784-800, 2018.
+* [**Hinton et al., 2015**] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. *Distilling the Knowledge in a Neural Network*. CoRR, abs/1503.02531, 2015.
+* [**Jacob et al., 2018**] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. *Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference*. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2704-2713, 2018.
+* [**Lillicrap et al., 2016**] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. *Continuous Control with Deep Reinforcement Learning*. In International Conference on Learning Representations (ICLR), 2016.
+* [**Mockus, 1975**] J. Mockus. *On Bayesian Methods for Seeking the Extremum*. In Optimization Techniques IFIP Technical Conference, pages 400-404, 1975.
+* [**Zhu & Gupta, 2017**] Michael Zhu and Suyog Gupta. *To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression*. CoRR, abs/1710.01878, 2017.
+* [**Zhuang et al., 2018**] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Jiezhang Cao, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. *Discrimination-aware Channel Pruning for Deep Neural Networks*. In Annual Conference on Neural Information Processing Systems (NIPS), 2018.
diff --git a/docs/docs/reinforcement_learning.md b/docs/docs/reinforcement_learning.md
new file mode 100644
index 0000000..9c511a0
--- /dev/null
+++ b/docs/docs/reinforcement_learning.md
@@ -0,0 +1,50 @@
+# Reinforcement Learning
+
+For most deep learning models, the parameter redundancy differs from one layer to another.
+Some layers may be more robust to model compression algorithms due to larger redundancy, while others may be more sensitive.
+Therefore, it is often sub-optimal to use a unified pruning ratio or number of quantization bits for all layers, which completely ignores the redundancy difference.
+However, it is also time-consuming or even impractical to manually set up the optimal value of such hyper-parameters for each layer, especially for deep networks with tens or hundreds of layers.
+
+To overcome this dilemma, in PocketFlow, we adopt reinforcement learning to automatically determine the optimal pruning ratio or number of quantization bits for each layer.
+Our approach is inspired by (He et al., 2018), which automatically determines each layer's optimal pruning ratio; we generalize it to hyper-parameter optimization for more model compression methods.
+
+In this documentation, we take `UniformQuantLearner` as an example to explain how the reinforcement learning method is used to iteratively optimize the number of quantization bits for each layer.
+It is worth mentioning that this feature is also available for `ChannelPrunedLearner`, `WeightSparseLearner`, and `NonUniformQuantLearner`.
+
+## Algorithm Description
+
+Here, we assume the original model to be compressed consists of $T$ layers, and denote the $t$-th layer's weight tensor as $\mathbf{W}_{t}$ and its quantization bit-width as $b_{t}$.
+In order to maximally exploit the parameter redundancy of each layer, we need to find the optimal combination of layer-wise quantization bit-width that achieves the highest accuracy after compression while satisfying:
+
+$$
+\sum_{t = 1}^{T} b_{t} \left| \mathbf{W}_{t} \right| \le b \cdot \sum_{t = 1}^{T} \left| \mathbf{W}_{t} \right|
+$$
+
+where $\left| \mathbf{W}_{t} \right|$ denotes the number of parameters in the weight tensor $\mathbf{W}_{t}$ and $b$ is the whole network's target quantization bit-width.
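+
+As a small worked example of this constraint (the layer sizes are hypothetical), consider a network with $T = 2$ layers, $\left| \mathbf{W}_{1} \right| = 1000$, $\left| \mathbf{W}_{2} \right| = 3000$, and a target bit-width of $b = 4$, so that the overall budget is $4 \cdot (1000 + 3000) = 16000$ bits:
+$$
+\begin{aligned}
+(b_{1}, b_{2}) = (8, 2): \quad & 8 \cdot 1000 + 2 \cdot 3000 = 14000 \le 16000 \\
+(b_{1}, b_{2}) = (6, 4): \quad & 6 \cdot 1000 + 4 \cdot 3000 = 18000 > 16000
+\end{aligned}
+$$
+The first allocation is feasible, while the second exceeds the budget, even though both spend more bits on the smaller layer.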
+
+Below, we present the overall workflow of adopting reinforcement learning, or more specifically, the DDPG algorithm (Lillicrap et al., 2016) to search for the optimal combination of layer-wise quantization bit-width:
+
+![RL Workflow](pics/rl_workflow.png)
+
+To start with, we initialize a DDPG agent and set the best reward $r_{best}$ to negative infinity to track the optimal combination of layer-wise quantization bit-width.
+The search process consists of multiple roll-outs.
+In each roll-out, we sequentially traverse each layer in the network to determine its quantization bit-width.
+For the $t$-th layer, we construct its state vector with the following information:
+
+* one-hot embedding of layer index
+* shape of weight tensor
+* number of parameters in the weight tensor
+* number of quantization bits used by previous layers
+* budget of quantization bits for remaining layers
+
+Afterwards, we feed this state vector into the DDPG agent to choose an action, which is then converted into the quantization bit-width under certain constraints.
+A commonly-used constraint is that with the selected quantization bit-width, the budget of quantization bits for remaining layers should be sufficient, *e.g.* ensuring the minimal quantization bit-width can be satisfied.
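+
+One way to express such a constraint, given here as an assumption for illustration rather than PocketFlow's exact rule, is to cap the bit-width of the $t$-th layer so that every remaining layer can still receive the minimal bit-width $b_{min}$, where $B_{t}$ (a symbol introduced here) denotes the bit budget left before the $t$-th layer is processed:
+$$
+b_{t} \le \frac{B_{t} - b_{min} \sum_{s = t + 1}^{T} \left| \mathbf{W}_{s} \right|}{\left| \mathbf{W}_{t} \right|}
+$$
+For example, with $B_{1} = 16000$, $\left| \mathbf{W}_{1} \right| = 1000$, $\left| \mathbf{W}_{2} \right| = 3000$, and $b_{min} = 2$, the first layer may use at most $(16000 - 2 \cdot 3000) / 1000 = 10$ bits.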
+
+After obtaining all layers' quantization bit-width, we quantize each layer's weights with the corresponding quantization bit-width, and fine-tune the quantized network for a few iterations (as supported by each learner's "Fast Fine-tuning" mode).
+We then evaluate the fine-tuned network's accuracy and use it as the reward signal $r_{n}$.
+The reward signal is compared against the best reward discovered so far, and the optimal combination of layer-wise quantization bit-width is updated if the current reward is larger.
+
+Finally, we generate a list of transitions from all the $\left( \mathbf{s}_{t}, a_{t}, r_{t}, \mathbf{s}_{t + 1} \right)$ tuples in the roll-out, and store them in the DDPG agent's replay buffer.
+The DDPG agent is then trained with one or more mini-batches of sampled transitions, so that it can choose better actions in the following roll-outs.
+
+After obtaining the optimal combination of layer-wise quantization bit-width, we can optionally use `UniformQuantLearner`'s "Re-training with Full Data" mode (also supported by other learners) for a complete quantization-aware training to further reduce the accuracy loss.
diff --git a/docs/docs/self_defined_models.md b/docs/docs/self_defined_models.md
new file mode 100644
index 0000000..fd030bd
--- /dev/null
+++ b/docs/docs/self_defined_models.md
@@ -0,0 +1,379 @@
+# Self-defined Models
+
+Self-defined models (and data sets) can be incorporated into PocketFlow by implementing a new `ModelHelper` class. The `ModelHelper` class includes the definition of data input pipeline as well as the network's forward pass and loss function. With the self-defined `ModelHelper`, the network can be either trained without any constraints using `FullPrecLearner`, or trained with certain model compression algorithms using other learners, *e.g.* `ChannelPrunedLearner` for channel pruning or `UniformQuantTFLearner` for uniform quantization. In this tutorial, we will define a 4-layer convolutional neural network (2 conv. layers + 2 dense layers) for image classification on the [Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist) data set under the PocketFlow framework. Afterwards, we shall demonstrate how to train this self-defined model with different model compression components.
+
+## The Essentials
+
+To use self-defined models and data sets in PocketFlow, we need to provide the following two items in advance to describe the overall training workflow:
+
+* **Data Input Pipeline**: this tells PocketFlow how to parse features and ground-truth labels from data files.
+* **Network Definition**: this tells PocketFlow how to compute the network's predictions and loss function's value.
+
+The `ModelHelper` class, which is a sub-class of the abstract base class `AbstractModelHelper`, is designed to provide such definitions. In PocketFlow, we have offered several `ModelHelper` classes to describe different combinations of data sets and model architectures. To use self-defined models, a new `ModelHelper` class should be implemented. Besides, we need an execution script to call this newly defined `ModelHelper` class.
+
+P.S.: You can find the full code used in this tutorial under the "./examples" directory.
+
+### Data Input Pipeline
+
+To start with, we need to tell PocketFlow how data files should be parsed. Here, we define a class named `FMnistDataset` to create iterators over the Fashion-MNIST training and test subsets. Every time the iterator is called, it will return a mini-batch of images and corresponding ground-truth labels.
+
+Below is the full implementation of `FMnistDataset` class (this should be placed under the "./datasets" directory, named as "fmnist_dataset.py"):
+
+``` Python
+import os
+import gzip
+import numpy as np
+import tensorflow as tf
+
+from datasets.abstract_dataset import AbstractDataset
+
+FLAGS = tf.app.flags.FLAGS
+
+tf.app.flags.DEFINE_integer('nb_classes', 10, '# of classes')
+tf.app.flags.DEFINE_integer('nb_smpls_train', 60000, '# of samples for training')
+tf.app.flags.DEFINE_integer('nb_smpls_val', 5000, '# of samples for validation')
+tf.app.flags.DEFINE_integer('nb_smpls_eval', 10000, '# of samples for evaluation')
+tf.app.flags.DEFINE_integer('batch_size', 128, 'batch size per GPU for training')
+tf.app.flags.DEFINE_integer('batch_size_eval', 100, 'batch size for evaluation')
+
+# Fashion-MNIST specifications
+IMAGE_HEI = 28
+IMAGE_WID = 28
+IMAGE_CHN = 1
+
+def load_mnist(image_file, label_file):
+  """Load images and labels from *.gz files.
+
+  This function is modified from utils/mnist_reader.py in the Fashion-MNIST repo.
+
+  Args:
+  * image_file: file path to images
+  * label_file: file path to labels
+
+  Returns:
+  * images: np.array of the image data
+  * labels: np.array of the label data
+  """
+
+  with gzip.open(label_file, 'rb') as i_file:
+    labels = np.frombuffer(i_file.read(), dtype=np.uint8, offset=8)
+  with gzip.open(image_file, 'rb') as i_file:
+    images = np.frombuffer(i_file.read(), dtype=np.uint8, offset=16)
+    image_size = IMAGE_HEI * IMAGE_WID * IMAGE_CHN
+    assert images.size == image_size * len(labels)
+    images = images.reshape(len(labels), image_size)
+
+  return images, labels
+
+def parse_fn(image, label, is_train):
+  """Parse an (image, label) pair and apply data augmentation if needed.
+
+  Args:
+  * image: image tensor
+  * label: label tensor
+  * is_train: whether data augmentation should be applied
+
+  Returns:
+  * image: image tensor
+  * label: one-hot label tensor
+  """
+
+  # data parsing
+  label = tf.one_hot(tf.reshape(label, []), FLAGS.nb_classes)
+  image = tf.cast(tf.reshape(image, [IMAGE_HEI, IMAGE_WID, IMAGE_CHN]), tf.float32)
+  image = tf.image.per_image_standardization(image)
+
+  # data augmentation
+  if is_train:
+    image = tf.image.resize_image_with_crop_or_pad(image, IMAGE_HEI + 8, IMAGE_WID + 8)
+    image = tf.random_crop(image, [IMAGE_HEI, IMAGE_WID, IMAGE_CHN])
+    image = tf.image.random_flip_left_right(image)
+
+  return image, label
+
+class FMnistDataset(AbstractDataset):
+  '''Fashion-MNIST dataset.'''
+
+  def __init__(self, is_train):
+    """Constructor function.
+
+    Args:
+    * is_train: whether to construct the training subset
+    """
+
+    # initialize the base class
+    super(FMnistDataset, self).__init__(is_train)
+
+    # choose local files or HDFS files w.r.t. FLAGS.data_disk
+    if FLAGS.data_disk == 'local':
+      assert FLAGS.data_dir_local is not None, '<FLAGS.data_dir_local> must not be None'
+      data_dir = FLAGS.data_dir_local
+    elif FLAGS.data_disk == 'hdfs':
+      assert FLAGS.data_hdfs_host is not None and FLAGS.data_dir_hdfs is not None, \
+        'both <FLAGS.data_hdfs_host> and <FLAGS.data_dir_hdfs> must not be None'
+      data_dir = FLAGS.data_hdfs_host + FLAGS.data_dir_hdfs
+    else:
+      raise ValueError('unrecognized data disk: ' + FLAGS.data_disk)
+
+    # setup paths to image & label files, and read in images & labels
+    if is_train:
+      self.batch_size = FLAGS.batch_size
+      image_file = os.path.join(data_dir, 'train-images-idx3-ubyte.gz')
+      label_file = os.path.join(data_dir, 'train-labels-idx1-ubyte.gz')
+    else:
+      self.batch_size = FLAGS.batch_size_eval
+      image_file = os.path.join(data_dir, 't10k-images-idx3-ubyte.gz')
+      label_file = os.path.join(data_dir, 't10k-labels-idx1-ubyte.gz')
+    self.images, self.labels = load_mnist(image_file, label_file)
+    self.parse_fn = lambda x, y: parse_fn(x, y, is_train)
+
+  def build(self, enbl_trn_val_split=False):
+    """Build iterator(s) for tf.data.Dataset() object.
+
+    Args:
+    * enbl_trn_val_split: whether to split into training & validation subsets
+
+    Returns:
+    * iterator_trn: iterator for the training subset
+    * iterator_val: iterator for the validation subset
+      OR
+    * iterator: iterator for the chosen subset (training OR testing)
+    """
+
+    # create a tf.data.Dataset() object from NumPy arrays
+    dataset = tf.data.Dataset.from_tensor_slices((self.images, self.labels))
+    dataset = dataset.map(self.parse_fn, num_parallel_calls=FLAGS.nb_threads)
+
+    # create iterators for training & validation subsets separately
+    if self.is_train and enbl_trn_val_split:
+      iterator_val = self.__make_iterator(dataset.take(FLAGS.nb_smpls_val))
+      iterator_trn = self.__make_iterator(dataset.skip(FLAGS.nb_smpls_val))
+      return iterator_trn, iterator_val
+
+    return self.__make_iterator(dataset)
+
+  def __make_iterator(self, dataset):
+    """Make an iterator from tf.data.Dataset.
+
+    Args:
+    * dataset: tf.data.Dataset object
+
+    Returns:
+    * iterator: iterator for the dataset
+    """
+
+    dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(buffer_size=FLAGS.buffer_size))
+    dataset = dataset.batch(self.batch_size)
+    dataset = dataset.prefetch(FLAGS.prefetch_size)
+    iterator = dataset.make_one_shot_iterator()
+
+    return iterator
+```
+
+When creating an object of the `FMnistDataset` class, an extra argument named `is_train` should be provided to toggle between the training and test subsets. The data files can be stored either on the local machine or on the HDFS cluster, and the directory path is specified in the path configuration file, *e.g.*:
+
+``` plain
+data_dir_local_fmnist = /home/user_name/datasets/Fashion-MNIST
+```
+
+The constructor function loads images and labels from \*.gz files, each stored in a NumPy array. The `build` function is then used to create a TensorFlow data set iterator from these two NumPy arrays. In particular, if both `enbl_trn_val_split` and `is_train` are True, then the original training subset is divided into two parts, one for model training and the other for validation.
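+
+As a quick check of the split sizes under the default flags in the listing above (60,000 training images and `nb_smpls_val = 5000`; the sizes are counted per pass over the data, before shuffling, repeating, and batching), `take` keeps the first 5,000 parsed samples for validation and `skip` leaves the remainder for training:
+
+$$
+N_{\text{val}} = 5000, \qquad N_{\text{trn}} = 60000 - 5000 = 55000
+$$
+
+Because the split is applied before `shuffle_and_repeat`, the two subsets do not overlap.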
+
+### Network Definition
+
+Now we implement a new `ModelHelper` class to utilize the above data input pipeline to define the network's training workflow. Below is the full implementation of `ModelHelper` class (this should be placed under the "./nets" directory, named as "convnet_at_fmnist.py"):
+
+``` Python
+import tensorflow as tf
+
+from nets.abstract_model_helper import AbstractModelHelper
+from datasets.fmnist_dataset import FMnistDataset
+from utils.lrn_rate_utils import setup_lrn_rate_piecewise_constant
+from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
+
+FLAGS = tf.app.flags.FLAGS
+
+tf.app.flags.DEFINE_float('nb_epochs_rat', 1.0, '# of training epochs\' ratio')
+tf.app.flags.DEFINE_float('lrn_rate_init', 1e-1, 'initial learning rate')
+tf.app.flags.DEFINE_float('batch_size_norm', 128, 'normalization factor of batch size')
+tf.app.flags.DEFINE_float('momentum', 0.9, 'momentum coefficient')
+tf.app.flags.DEFINE_float('loss_w_dcy', 3e-4, 'weight decaying loss\'s coefficient')
+
+def forward_fn(inputs, data_format):
+  """Forward pass function.
+
+  Args:
+  * inputs: inputs to the network's forward pass
+  * data_format: data format ('channels_last' OR 'channels_first')
+
+  Returns:
+  * inputs: outputs from the network's forward pass
+  """
+
+  # transpose the image tensor if needed
+  if data_format == 'channels_first':
+    inputs = tf.transpose(inputs, [0, 3, 1, 2])
+
+  # conv1
+  inputs = tf.layers.conv2d(inputs, 32, [5, 5], padding='same',
+                            data_format=data_format, activation=tf.nn.relu, name='conv1')
+  inputs = tf.layers.max_pooling2d(inputs, [2, 2], 2, data_format=data_format, name='pool1')
+
+  # conv2
+  inputs = tf.layers.conv2d(inputs, 64, [5, 5], padding='same',
+                            data_format=data_format, activation=tf.nn.relu, name='conv2')
+  inputs = tf.layers.max_pooling2d(inputs, [2, 2], 2, data_format=data_format, name='pool2')
+
+  # fc3
+  inputs = tf.layers.flatten(inputs, name='flatten')
+  inputs = tf.layers.dense(inputs, 1024, activation=tf.nn.relu, name='fc3')
+
+  # fc4
+  inputs = tf.layers.dense(inputs, FLAGS.nb_classes, name='fc4')
+  inputs = tf.nn.softmax(inputs, name='softmax')
+
+  return inputs
+
+class ModelHelper(AbstractModelHelper):
+  """Model helper for creating a ConvNet model for the Fashion-MNIST dataset."""
+
+  def __init__(self):
+    """Constructor function."""
+
+    # class-independent initialization
+    super(ModelHelper, self).__init__()
+
+    # initialize training & evaluation subsets
+    self.dataset_train = FMnistDataset(is_train=True)
+    self.dataset_eval = FMnistDataset(is_train=False)
+
+  def build_dataset_train(self, enbl_trn_val_split=False):
+    """Build the data subset for training, usually with data augmentation."""
+
+    return self.dataset_train.build(enbl_trn_val_split)
+
+  def build_dataset_eval(self):
+    """Build the data subset for evaluation, usually without data augmentation."""
+
+    return self.dataset_eval.build()
+
+  def forward_train(self, inputs, data_format='channels_last'):
+    """Forward computation at training."""
+
+    return forward_fn(inputs, data_format)
+
+  def forward_eval(self, inputs, data_format='channels_last'):
+    """Forward computation at evaluation."""
+
+    return forward_fn(inputs, data_format)
+
+  def calc_loss(self, labels, outputs, trainable_vars):
+    """Calculate loss (and some extra evaluation metrics)."""
+
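+    # NOTE: tf.losses.softmax_cross_entropy expects logits, whereas forward_fn
+    # returns softmax probabilities
+    loss = tf.losses.softmax_cross_entropy(labels, outputs)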
+    loss += FLAGS.loss_w_dcy * tf.add_n([tf.nn.l2_loss(var) for var in trainable_vars])
+    accuracy = tf.reduce_mean(
+      tf.cast(tf.equal(tf.argmax(labels, axis=1), tf.argmax(outputs, axis=1)), tf.float32))
+    metrics = {'accuracy': accuracy}
+
+    return loss, metrics
+
+  def setup_lrn_rate(self, global_step):
+    """Setup the learning rate (and number of training iterations)."""
+
+    nb_epochs = 160
+    idxs_epoch = [40, 80, 120]
+    decay_rates = [1.0, 0.1, 0.01, 0.001]
+    batch_size = FLAGS.batch_size * (1 if not FLAGS.enbl_multi_gpu else mgw.size())
+    lrn_rate = setup_lrn_rate_piecewise_constant(global_step, batch_size, idxs_epoch, decay_rates)
+    nb_iters = int(FLAGS.nb_smpls_train * nb_epochs * FLAGS.nb_epochs_rat / batch_size)
+
+    return lrn_rate, nb_iters
+
+  @property
+  def model_name(self):
+    """Model's name."""
+
+    return 'convnet'
+
+  @property
+  def dataset_name(self):
+    """Dataset's name."""
+
+    return 'fmnist'
+```
+
+In the `build_dataset_train` and `build_dataset_eval` functions, we adopt the previously introduced `FMnistDataset` class to define the data input pipeline. The network's forward-pass computation is defined in the `forward_train` and `forward_eval` functions, which correspond to the training and evaluation graphs, respectively. In general, the training graph may differ slightly from the evaluation graph, *e.g.* in operations related to batch normalization layers; the network in this example has no such layers, so both functions call the same `forward_fn`. The `calc_loss` function calculates the loss function's value and extra evaluation metrics, *e.g.* classification accuracy. Finally, the `setup_lrn_rate` function defines the learning rate schedule, as well as how many training iterations are needed.
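+
+To make the iteration count concrete, here is the arithmetic from `setup_lrn_rate` under the default flags (`nb_smpls_train = 60000`, `nb_epochs_rat = 1.0`, `batch_size = 128`) and `nb_epochs = 160`. The 8-GPU case is a hypothetical setting chosen for illustration, assuming `mgw.size()` returns 8:
+
+$$
+N_{\text{iters}}^{\text{1 GPU}} = \left\lfloor \frac{60000 \times 160 \times 1.0}{128} \right\rfloor = 75000, \qquad
+N_{\text{iters}}^{\text{8 GPUs}} = \left\lfloor \frac{60000 \times 160 \times 1.0}{128 \times 8} \right\rfloor = 9375
+$$
+
+Lowering `nb_epochs_rat` scales both counts proportionally, *e.g.* a value of 0.1 gives roughly one tenth of the iterations.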
+
+### Execution Script
+
+Besides the self-defined `ModelHelper` class, we still need an execution script to pass it to the corresponding model compression component to start the training process. Below is the full implementation (this should be placed under the "./nets" directory, named as "convnet_at_fmnist_run.py"):
+
+``` Python
+import traceback
+import tensorflow as tf
+
+from nets.convnet_at_fmnist import ModelHelper
+from learners.learner_utils import create_learner
+
+FLAGS = tf.app.flags.FLAGS
+
+tf.app.flags.DEFINE_string('log_dir', './logs', 'logging directory')
+tf.app.flags.DEFINE_boolean('enbl_multi_gpu', False, 'enable multi-GPU training')
+tf.app.flags.DEFINE_string('learner', 'full-prec', 'learner\'s name')
+tf.app.flags.DEFINE_string('exec_mode', 'train', 'execution mode: train / eval')
+tf.app.flags.DEFINE_boolean('debug', False, 'debugging information')
+
+def main(unused_argv):
+  """Main entry."""
+
+  try:
+    # setup the TF logging routine
+    if FLAGS.debug:
+      tf.logging.set_verbosity(tf.logging.DEBUG)
+    else:
+      tf.logging.set_verbosity(tf.logging.INFO)
+    sm_writer = tf.summary.FileWriter(FLAGS.log_dir)
+
+    # display FLAGS's values
+    tf.logging.info('FLAGS:')
+    for key, value in FLAGS.flag_values_dict().items():
+      tf.logging.info('{}: {}'.format(key, value))
+
+    # build the model helper & learner
+    model_helper = ModelHelper()
+    learner = create_learner(sm_writer, model_helper)
+
+    # execute the learner
+    if FLAGS.exec_mode == 'train':
+      learner.train()
+    elif FLAGS.exec_mode == 'eval':
+      learner.download_model()
+      learner.evaluate()
+    else:
+      raise ValueError('unrecognized execution mode: ' + FLAGS.exec_mode)
+
+    # exit normally
+    return 0
+  except ValueError:
+    traceback.print_exc()
+    return 1  # exit with errors
+
+if __name__ == '__main__':
+  tf.app.run()
+```
+
+## Network Training with PocketFlow
+
+To train the self-defined model without any constraint, use `FullPrecLearner`:
+
+``` bash
+$ ./scripts/run_local.sh nets/convnet_at_fmnist_run.py \
+    --learner full-prec
+```
+
+To train the self-defined model with the uniform quantization constraint, use `UniformQuantTFLearner`:
+
+``` bash
+$ ./scripts/run_local.sh nets/convnet_at_fmnist_run.py \
+    --learner uniform-tf
+```
diff --git a/docs/docs/test_cases.md b/docs/docs/test_cases.md
new file mode 100644
index 0000000..991f4e7
--- /dev/null
+++ b/docs/docs/test_cases.md
@@ -0,0 +1,163 @@
+# Test Cases
+
+This document contains various test cases to cover different combinations of learners and hyper-parameter settings. Any merge request to the master branch should be able to pass all the test cases to be approved.
+
+## Full-Precision
+
+``` bash
+# local mode
+$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py
+$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
+    --enbl_dst
+$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
+    --data_disk hdfs
+$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
+    --data_disk hdfs \
+    --enbl_dst
+
+# seven mode
+$ ./scripts/run_seven.sh nets/resnet_at_ilsvrc12_run.py
+$ ./scripts/run_seven.sh nets/resnet_at_ilsvrc12_run.py \
+    --enbl_dst
+$ ./scripts/run_seven.sh nets/resnet_at_ilsvrc12_run.py \
+    --data_disk hdfs
+$ ./scripts/run_seven.sh nets/resnet_at_ilsvrc12_run.py \
+    --data_disk hdfs \
+    --enbl_dst
+
+# docker mode
+$ ./scripts/run_docker.sh nets/lenet_at_cifar10_run.py
+$ ./scripts/run_docker.sh nets/lenet_at_cifar10_run.py \
+    --enbl_dst
+$ ./scripts/run_docker.sh nets/resnet_at_cifar10_run.py
+$ ./scripts/run_docker.sh nets/resnet_at_cifar10_run.py \
+    --enbl_dst
+```
+
+## Channel Pruning
+
+``` bash
+# uniform preserve ratios for all layers
+$ ./scripts/run_seven.sh nets/resnet_at_cifar10_run.py \
+    --learner channel \
+    --cp_prune_option uniform \
+    --cp_uniform_preserve_ratio 0.5
+
+# auto-tuned preserve ratios for each layer
+$ ./scripts/run_seven.sh nets/resnet_at_cifar10_run.py \
+    --learner channel \
+    --cp_prune_option auto \
+    --cp_preserve_ratio 0.3
+```
+
+## Discrimination-aware Channel Pruning
+
+``` bash
+# no network distillation
+$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
+    --learner dis-chn-pruned \
+    --dcp_nb_stages 3 \
+    --data_disk hdfs
+
+# network distillation
+$ ./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py \
+    --learner dis-chn-pruned \
+    --enbl_dst \
+    --dcp_nb_stages 4
+```
+
+## Weight Sparsification
+
+``` bash
+# uniform pruning ratios for all layers
+$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
+    --learner weight-sparse \
+    --ws_prune_ratio_prtl uniform \
+    --data_disk hdfs
+
+# optimal pruning ratios for each layer
+$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
+    --learner weight-sparse \
+    --ws_prune_ratio_prtl optimal \
+    --data_disk hdfs
+
+# heuristic pruning ratios for each layer
+$ ./scripts/run_seven.sh nets/resnet_at_ilsvrc12_run.py \
+    --learner weight-sparse \
+    --ws_prune_ratio_prtl heurist
+
+# optimal pruning ratios for each layer
+$ ./scripts/run_seven.sh nets/resnet_at_ilsvrc12_run.py \
+    --learner weight-sparse \
+    --ws_prune_ratio_prtl optimal
+```
+
+## Uniform Quantization
+
+``` bash
+# channel-based bucketing
+$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
+    --learner uniform \
+    --uql_use_buckets \
+    --uql_bucket_type channel \
+    --data_disk hdfs
+
+# split-based bucketing
+$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
+    --learner uniform \
+    --uql_use_buckets \
+    --uql_bucket_type split \
+    --data_disk hdfs
+
+# channel-based bucketing + RL
+$ ./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py -n=2 \
+    --learner uniform \
+    --uql_enbl_rl_agent \
+    --uql_use_buckets \
+    --uql_bucket_type channel
+
+# split-based bucketing + RL
+$ ./scripts/run_seven.sh nets/resnet_at_ilsvrc12_run.py -n=2 \
+    --learner uniform \
+    --uql_enbl_rl_agent \
+    --uql_use_buckets \
+    --uql_bucket_type split
+```
+
+## Non-uniform Quantization
+
+``` bash
+# channel-based bucketing + RL + optimize clusters
+$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
+    --learner non-uniform \
+    --nuql_enbl_rl_agent \
+    --nuql_use_buckets \
+    --nuql_bucket_type channel \
+    --nuql_opt_mode clusters \
+    --data_disk hdfs
+
+# split-based bucketing + RL + optimize weights
+$ ./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
+    --learner non-uniform \
+    --nuql_enbl_rl_agent \
+    --nuql_use_buckets \
+    --nuql_bucket_type split \
+    --nuql_opt_mode weights \
+    --data_disk hdfs
+
+# channel-based bucketing + RL + optimize weights
+$ ./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py -n=2 \
+    --learner non-uniform \
+    --nuql_enbl_rl_agent \
+    --nuql_use_buckets \
+    --nuql_bucket_type channel \
+    --nuql_opt_mode weights
+
+# split-based bucketing + RL + optimize clusters
+$ ./scripts/run_seven.sh nets/resnet_at_ilsvrc12_run.py -n=2 \
+    --learner non-uniform \
+    --nuql_enbl_rl_agent \
+    --nuql_use_buckets \
+    --nuql_bucket_type split \
+    --nuql_opt_mode clusters
+```
diff --git a/docs/docs/tutorial.md b/docs/docs/tutorial.md
new file mode 100644
index 0000000..f776d20
--- /dev/null
+++ b/docs/docs/tutorial.md
@@ -0,0 +1,235 @@
+# Tutorial
+
+In this tutorial, we demonstrate how to compress a convolutional neural network and export the compressed model into a \*.tflite file for deployment on mobile devices. The model we use here is an 18-layer residual network (denoted as "ResNet-18") trained for the ImageNet classification task. We will compress it with the discrimination-aware channel pruning algorithm (Zhuang et al., NIPS '18) to reduce the number of convolutional channels used in the network for speed-up.
+
+## Prepare the Data
+
+To start with, we need to convert the ImageNet data set (ILSVRC-12) into TensorFlow's native TFRecord file format. You may follow the data preparation guide [here](https://github.com/tensorflow/models/tree/master/research/inception#getting-started) to download the full data set and convert it into TFRecord files. After that, you should be able to find 1,024 training files and 128 validation files in the data directory, like this:
+
+``` bash
+# training files
+train-00000-of-01024
+train-00001-of-01024
+...
+train-01023-of-01024
+
+# validation files
+validation-00000-of-00128
+validation-00001-of-00128
+...
+validation-00127-of-00128
+```
+
+## Prepare the Pre-trained Model
+
+The discrimination-aware channel pruning algorithm requires a pre-trained uncompressed model provided in advance, so that a channel-pruned model can be trained with warm-start. You can download a pre-trained model from [here](https://api.ai.tencent.com/pocketflow/list.html), and then unzip files into the `models` sub-directory.
+
+Alternatively, you can train an uncompressed full-precision model from scratch using `FullPrecLearner` with the following command (choose whichever mode suits you):
+
+``` bash
+# local mode with 1 GPU
+$ ./scripts/run_local.sh nets/resnet_at_ilsvrc12_run.py
+
+# docker mode with 8 GPUs
+$ ./scripts/run_docker.sh nets/resnet_at_ilsvrc12_run.py -n=8
+
+# seven mode with 8 GPUs
+$ ./scripts/run_seven.sh nets/resnet_at_ilsvrc12_run.py -n=8
+```
+
+After the training process, you should be able to find the resulting model files in the `models` sub-directory of PocketFlow's home directory.
+
+## Train the Compressed Model
+
+Now, we can train a compressed model with the discrimination-aware channel pruning algorithm, as implemented by `DisChnPrunedLearner`. Assuming you are now in PocketFlow's home directory, the training process of model compression can be started using the following command (choose whichever mode suits you):
+
+``` bash
+# local mode with 1 GPU
+$ ./scripts/run_local.sh nets/resnet_at_ilsvrc12_run.py \
+    --learner dis-chn-pruned
+
+# docker mode with 8 GPUs
+$ ./scripts/run_docker.sh nets/resnet_at_ilsvrc12_run.py -n=8 \
+    --learner dis-chn-pruned
+
+# seven mode with 8 GPUs
+$ ./scripts/run_seven.sh nets/resnet_at_ilsvrc12_run.py -n=8 \
+    --learner dis-chn-pruned
+```
+
+Let's take the execution command for the local mode as an example. In this command, `run_local.sh` is a shell script that executes the specified Python script with user-provided arguments. Here, we ask it to run the Python script named `nets/resnet_at_ilsvrc12_run.py`, which is the execution script for ResNet models on the ImageNet data set. After that, we use `--learner dis-chn-pruned` to specify that the `DisChnPrunedLearner` should be used for model compression. You may also use other learners by specifying the corresponding learner name. Below is a full list of available learners in PocketFlow:
+
+| Learner name     | Learner class            | Note                                                                          |
+|:-----------------|:-------------------------|:------------------------------------------------------------------------------|
+| `full-prec`      | `FullPrecLearner`        | No model compression                                                          |
+| `channel`        | `ChannelPrunedLearner`   | Channel pruning with LASSO-based channel selection (He et al., 2017)          |
+| `dis-chn-pruned` | `DisChnPrunedLearner`    | Discrimination-aware channel pruning (Zhuang et al., 2018)                    |
+| `weight-sparse`  | `WeightSparseLearner`    | Weight sparsification with dynamic pruning schedule (Zhu & Gupta, 2017)       |
+| `uniform`        | `UniformQuantLearner`    | Weight quantization with uniform reconstruction levels (Jacob et al., 2018)   |
+| `uniform-tf`     | `UniformQuantTFLearner`  | Weight quantization with uniform reconstruction levels and TensorFlow APIs    |
+| `non-uniform`    | `NonUniformQuantLearner` | Weight quantization with non-uniform reconstruction levels (Han et al., 2016) |
+
+The local mode only uses 1 GPU for the training process, which takes approximately 20-30 hours to complete. This can be accelerated by multi-GPU training in the docker and seven modes, which is enabled by adding `-n=x` right after the specified Python script, where `x` is the number of GPUs to be used.
+
+Optionally, you can pass some extra arguments to customize the training process. For the discrimination-aware channel pruning algorithm, some of the key arguments are:
+
+| Name              | Definition                             | Default Value |
+|:------------------|:---------------------------------------|:--------------|
+| `enbl_dst`        | Enable training with distillation loss | False         |
+| `dcp_prune_ratio` | DCP algorithm's pruning ratio          | 0.5           |
+
+You may override the default value by appending customized arguments at the end of the execution command. For instance, the following command:
+
+``` bash
+$ ./scripts/run_local.sh nets/resnet_at_ilsvrc12_run.py \
+    --learner dis-chn-pruned \
+    --enbl_dst \
+    --dcp_prune_ratio 0.75
+```
+
+requires the `DisChnPrunedLearner` to achieve an overall pruning ratio of 0.75, with the training process carried out using the distillation loss. As a result, the number of channels in each convolutional layer of the compressed model will be one quarter of the original number.
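+
+As a worked example of the statement above, consider a convolutional layer with 64 channels (the layer width is a hypothetical value chosen for illustration). With a pruning ratio of 0.75, the number of channels kept is:
+
+$$
+C_{\text{kept}} = 64 \times (1 - 0.75) = 16
+$$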
+
+After the training process is completed, you should be able to find a sub-directory named `models_dcp_eval` created in the home directory of PocketFlow. This sub-directory contains all the files that define the compressed model, and we will export them to a TensorFlow Lite formatted model file for deployment in the next section.
+
+## Export to TensorFlow Lite
+
+TensorFlow's checkpoint files cannot be directly used for deployment on mobile devices. Instead, we first need to convert them into a single \*.tflite file that is supported by the TensorFlow Lite Interpreter. For models compressed with channel-pruning based algorithms, *e.g.* `ChannelPrunedLearner` and `DisChnPrunedLearner`, we have prepared a model conversion script, `tools/conversion/export_pb_tflite_models.py`, to generate a TF-Lite model from TensorFlow's checkpoint files.
+
+To convert checkpoint files into a \*.tflite file, use the following command:
+
+``` bash
+# convert checkpoint files into a *.tflite model
+$ python tools/conversion/export_pb_tflite_models.py \
+    --model_dir models_dcp_eval
+```
+
+In the above command, we specify the model directory containing checkpoint files generated in the previous training process. The conversion script automatically detects which channels can be safely pruned, and then produces a lightweight compressed model. The resulting TensorFlow Lite file is also placed in the `models_dcp_eval` directory, named `model_transformed.tflite`.
+
+## Deploy on Mobile Devices
+
+After exporting the compressed model to the TensorFlow Lite file format, you may follow the official [guide](https://www.tensorflow.org/lite/demo_android) for creating an Android demo App from it. Basically, this demo App uses a TensorFlow Lite model to continuously classify images captured by the camera, and all the computation is performed on the mobile device in real time.
+
+To use the `model_transformed.tflite` model file, you need to place it in the `assets` directory and create a Java class named `ImageClassifierFloatResNet` to use this model for classification. Below is the example code, which is modified from `ImageClassifierFloatInception.java` used in the official demo project:
+
+``` Java
+/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+package com.example.android.tflitecamerademo;
+
+import android.app.Activity;
+
+import java.io.IOException;
+
+/**
+ * This classifier works with the ResNet-18 model.
+ * It applies floating point inference rather than using a quantized model.
+ */
+public class ImageClassifierFloatResNet extends ImageClassifier {
+
+  /**
+   * The ResNet requires additional normalization of the used input.
+   */
+  private static final float IMAGE_MEAN_RED = 123.58f;
+  private static final float IMAGE_MEAN_GREEN = 116.779f;
+  private static final float IMAGE_MEAN_BLUE = 103.939f;
+
+  /**
+   * An array to hold inference results, to be fed into TensorFlow Lite as outputs.
+   * This isn't part of the super class, because we need a primitive array here.
+   */
+  private float[][] labelProbArray = null;
+
+  /**
+   * Initializes an {@code ImageClassifier}.
+   *
+   * @param activity
+   */
+  ImageClassifierFloatResNet(Activity activity) throws IOException {
+    super(activity);
+    labelProbArray = new float[1][getNumLabels()];
+  }
+
+  @Override
+  protected String getModelPath() {
+    return "model_transformed.tflite";
+  }
+
+  @Override
+  protected String getLabelPath() {
+    return "labels_imagenet_slim.txt";
+  }
+
+  @Override
+  protected int getImageSizeX() {
+    return 224;
+  }
+
+  @Override
+  protected int getImageSizeY() {
+    return 224;
+  }
+
+  @Override
+  protected int getNumBytesPerChannel() {
+    // a 32bit float value requires 4 bytes
+    return 4;
+  }
+
+  @Override
+  protected void addPixelValue(int pixelValue) {
+    imgData.putFloat(((pixelValue >> 16) & 0xFF) - IMAGE_MEAN_RED);
+    imgData.putFloat(((pixelValue >> 8) & 0xFF) - IMAGE_MEAN_GREEN);
+    imgData.putFloat((pixelValue & 0xFF) - IMAGE_MEAN_BLUE);
+  }
+
+  @Override
+  protected float getProbability(int labelIndex) {
+    return labelProbArray[0][labelIndex];
+  }
+
+  @Override
+  protected void setProbability(int labelIndex, Number value) {
+    labelProbArray[0][labelIndex] = value.floatValue();
+  }
+
+  @Override
+  protected float getNormalizedProbability(int labelIndex) {
+    // TODO the following value isn't in [0,1] yet, but may be greater. Why?
+    return getProbability(labelIndex);
+  }
+
+  @Override
+  protected void runInference() {
+    tflite.run(imgData, labelProbArray);
+  }
+}
+```
+
+After that, you need to change the image classifier class used in `Camera2BasicFragment.java`. Locate the function named `onActivityCreated` and change its content as below. Now you will be able to use the compressed ResNet-18 model to classify objects on your mobile phone in real time.
+
+``` Java
+/** Load the model and labels. */
+@Override
+public void onActivityCreated(Bundle savedInstanceState) {
+  super.onActivityCreated(savedInstanceState);
+  try {
+    classifier = new ImageClassifierFloatResNet(getActivity());
+  } catch (IOException e) {
+    Log.e(TAG, "Failed to initialize an image classifier.", e);
+  }
+  startBackgroundThread();
+}
+```
diff --git a/docs/docs/uq_learner.md b/docs/docs/uq_learner.md
new file mode 100644
index 0000000..927dc27
--- /dev/null
+++ b/docs/docs/uq_learner.md
@@ -0,0 +1,221 @@
+
+# Uniform Quantization
+
+## Introduction
+
+Uniform quantization is widely used for model compression and acceleration. Originally, the weights in the network are represented by 32-bit floating-point numbers. With uniform quantization, low-precision (*e.g.* 4-bit or 8-bit) fixed-point numbers are used to approximate the full-precision network. For $k$-bit quantization, the memory saving can be up to $32 / k$. For example, 8-bit quantization can reduce the network size by 4x with a negligible drop in performance.
+
+Currently, PocketFlow supports two types of uniform quantization learners:
+
+* `UniformQuantLearner`: a self-developed learner for uniform quantization. The learner is carefully optimized with various extensions and variations supported.
+
+* `UniformQuantTFLearner`: a wrapper based on TensorFlow's [quantization-aware training](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/quantize) APIs. For now, this wrapper only supports 8-bit quantization, which leads to approximately 4x memory reduction and 3x inference speed-up.
+
+A comparison of these two learners is shown below:
+
+| Features | `UniformQuantLearner` | `UniformQuantTFLearner` |
+|:--------:|:---------------------:|:-----------------------:|
+| Compression              | Yes | Yes |
+| Acceleration             |     | Yes |
+| Fine-tuning              | Yes |     |
+| Bucketing                | Yes |     |
+| Hyper-param Optimization | Yes |     |
+
+## Algorithm
+
+### Training Workflow
+
+Both uniform quantization learners generally follow the training workflow below:
+
+Given a pre-defined full-precision model, the learner inserts quantization nodes and operations into the computation graph of the model. With activation quantization enabled, quantization nodes will also be placed after activation operations (*e.g.* ReLU).
+
+In the training phase, both full-precision and quantized weights are kept. In the forward pass, quantized weights are obtained by applying the quantization function on full-precision weights. To update full-precision weights in the backward pass, since gradients w.r.t. quantized weights are zeros almost everywhere, we use the straight-through estimator (STE, Bengio et al., 2015) to pass gradients of quantized weights directly to full-precision weights for update.
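+
+In symbols, a minimal sketch of the STE (the notation $w$, $Q \left( w \right)$, $\mathcal{L}$, and $\eta$ is introduced here for illustration, and plain gradient descent is assumed for the update): let $w$ be a full-precision weight, $Q \left( w \right)$ its quantized value, $\mathcal{L}$ the training loss, and $\eta$ the learning rate. The STE treats $\partial Q \left( w \right) / \partial w$ as $1$, which gives:
+
+$$
+\frac{\partial \mathcal{L}}{\partial w} \approx \frac{\partial \mathcal{L}}{\partial Q \left( w \right)}, \quad w \leftarrow w - \eta \cdot \frac{\partial \mathcal{L}}{\partial Q \left( w \right)}.
+$$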
+
+![train_n_inference](pics/train_n_inference.png)
+
+### Quantization Function
+
+Uniform quantization distributes all the quantization points evenly across the range $\left[ w_{min}, w_{max} \right]$, where $w_{max}$ and $w_{min}$ are the maximum and minimum values of weights in each layer (or bucket). The original full-precision weights are then assigned to their closest quantization points. To achieve this, we first normalize the full-precision weights $x$ to $\left[ 0, 1 \right]$:
+
+$$
+\text{sc} \left( x \right) = \frac{ x - \beta}{\alpha},
+$$
+
+where $\alpha = w_{max} - w_{min}$ and $\beta = w_{min}$. Then, we assign $\text{sc} \left( x \right)$ to its closest quantization point (assuming $k$-bit quantization is used):
+
+$$
+\hat{x} = \frac{1}{2^{k} - 1} \text{round} \left( \left( 2^{k} - 1 \right) \cdot \text{sc} \left( x \right) \right),
+$$
+
+and finally we apply the inverse linear transformation to map the quantized weights back to the original scale:
+
+$$
+Q \left( x \right) = \alpha \cdot \hat{x} + \beta.
+$$
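+
+As a worked example under illustrative values (chosen here, not PocketFlow defaults), take $k = 2$, $w_{min} = -1$, $w_{max} = 1$, and $x = 0.3$, so that $\alpha = 2$ and $\beta = -1$. Applying the three steps above gives:
+
+$$
+\text{sc} \left( x \right) = \frac{0.3 + 1}{2} = 0.65, \quad \hat{x} = \frac{1}{3} \text{round} \left( 3 \cdot 0.65 \right) = \frac{2}{3}, \quad Q \left( x \right) = 2 \cdot \frac{2}{3} - 1 = \frac{1}{3}.
+$$
+
+The four quantization points in this case are $-1$, $-1/3$, $1/3$, and $1$, and $x = 0.3$ is indeed mapped to the closest one.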
+
+## UniformQuantLearner
+
+`UniformQuantLearner` is a self-developed learner, which allows a number of customized configurations for uniform quantization. For example, the learner supports bucketing, leading to more fine-grained quantization and better performance. The learner also allows different numbers of bits to be allocated across layers; users can turn on the RL-based hyper-parameter optimizer to search for the optimal bit allocation strategy.
+
+### Hyper-parameters
+
+To configure `UniformQuantLearner`, users can pass options via the TensorFlow flag interface. The available options are listed as follows:
+
+| Option | Description |
+|:-------|:------------|
+| `uql_weight_bits`  |  the number of bits for weights. Default: `4`.  |
+| `uql_activation_bits`  |  the number of bits for activations. Default: `32`.  |
+| `uql_save_quant_model_path` |  the save path of the quantized model. Default: `./uql_quant_models/model.ckpt`.  |
+| `uql_use_buckets`  | the switch to use bucketing. Default: `False`. |
+| `uql_bucket_type`  | the bucket type, either `split` or `channel`. Default: `channel`. |
+| `uql_bucket_size`  |  the bucket size for bucket type `split`. Default: `256`.  |
+| `uql_quantize_all_layers` |  the switch to quantize the first and last layers of the network. Default: `False`.  |
+| `uql_quant_epoch` |  the number of epochs for fine-tuning. Default: `60`.  |
+| `uql_enbl_rl_agent` | the switch to enable the RL agent to learn the optimal bit allocation strategy. Default: `False`. |
+
+Here, we provide a detailed description (and some analysis) of the above hyper-parameters:
+
+* `uql_weight_bits`: the number of bits for weight quantization. Generally, 8-bit quantization does not hurt the model performance while compressing the model size by 4x, whereas 2-bit and 4-bit quantization could lead to a drop in performance on large datasets such as ImageNet.
+* `uql_activation_bits`: the number of bits for activation quantization. When both weights and activations are quantized, 8-bit quantization does not lead to an apparent drop in performance, and can sometimes even increase the classification accuracy, which is probably due to better generalization ability. Nevertheless, quantizing both weights and activations to lower bits hurts performance more than weight-only quantization does.
+* `uql_save_quant_model_path`: the path to save the quantized model. Quantization nodes have already been inserted into the graph.
+* `uql_use_buckets`: the switch to turn on bucketing. With bucketing, weights are split into multiple pieces, and $\alpha$ and $\beta$ are calculated individually for each piece. Therefore, turning on bucketing can lead to more fine-grained quantization.
+* `uql_bucket_type`: the type of bucketing. Currently two types are supported: [`split`, `channel`]. With `split`, the weights of a layer are first concatenated into a long vector, which is then cut into pieces of size `uql_bucket_size`. The remaining last piece is padded and taken as a new piece. After each piece is quantized, the vector is folded back to the original shape to form the quantized weights. With `channel`, weights with shape `[k, k, cin, cout]` in a convolutional layer are cut into `cout` buckets, each of size `k * k * cin`. Weights with shape `[m, n]` in a fully-connected layer are cut into `n` buckets, each of size `m`. In practice, bucketing with type `channel` can be computed more efficiently than type `split`, since there are fewer buckets and less computation is needed to iterate through them.
+* `uql_bucket_size`: the size of buckets when using bucket type `split`. Generally, a smaller bucket size leads to more fine-grained quantization, but requires more storage, since the full-precision statistics ($\alpha$ and $\beta$) of each bucket need to be kept.
+* `uql_quantize_all_layers`: the switch to quantize the first and last layers. The first and last layers of the network are connected directly to the input and output, and are arguably more sensitive to quantization. Keeping them un-quantized can slightly increase the performance. Nevertheless, if you want to accelerate inference, all layers should be quantized.
+* `uql_quant_epoch`: the epochs for fine-tuning a quantized network.
+* `uql_enbl_rl_agent`: the switch to turn on the RL agent as hyper parameter optimizer. Details about the RL agent and its configurations are described below.
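+
+To make the two bucket types concrete, consider a convolutional layer with weights of shape `[3, 3, 64, 128]` (an illustrative shape, not tied to any particular model), with `uql_bucket_size` left at its default of `256`. The numbers of buckets are:
+
+$$
+n_{channel} = c_{out} = 128, \quad n_{split} = \left\lceil \frac{3 \cdot 3 \cdot 64 \cdot 128}{256} \right\rceil = \left\lceil \frac{73728}{256} \right\rceil = 288.
+$$
+
+If we further assume that $\alpha$ and $\beta$ are each stored as a 32-bit float (an assumption made only for this estimate), the extra storage is $2 \cdot 32 / 256 = 0.25$ bits per weight for `split`, compared with $2 \cdot 32 / \left( 3 \cdot 3 \cdot 64 \right) \approx 0.11$ bits per weight for `channel`.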
+
+### Configure the RL Agent
+
+Once the hyper-parameter optimizer is turned on, i.e., `uql_enbl_rl_agent==True`, the RL agent will automatically search for the optimal bit allocation strategy for each layer. In order to search efficiently, the agent needs to be configured properly. Although we list all the configurable hyper-parameters for the agent here, users can keep the default values for most of them and modify only a few if necessary.
+
+| Option | Description |
+|:-------|:------------|
+| `uql_equivalent_bits`       | the number of re-allocated bits that is equivalent to uniform allocation of bits. Default: `4`. |
+| `uql_nb_rlouts`              | the number of roll outs for training the RL agent. Default: `200`. |
+| `uql_w_bit_min`              | the minimal number of bits for each layer. Default: `2`.     |
+| `uql_w_bit_max`              | the maximal number of bits for each layer. Default: `8`.     |
+| `uql_enbl_rl_global_tune`    | the switch to fine-tune all layers of the network. Default: `True`. |
+| `uql_enbl_rl_layerwise_tune` | the switch to fine-tune the network layer by layer. Default: `False`. |
+| `uql_tune_layerwise_steps`   | the number of steps for layer-wise fine-tuning. Default: `300`. |
+| `uql_tune_global_steps`      | the number of steps for global fine-tuning. Default: `2000`. |
+| `uql_tune_disp_steps`        | the display steps to show the fine-tuning progress. Default: `100`. |
+| `uql_enbl_random_layers`     | the switch to randomly permute layers during RL agent training. Default: `True`. |
+
+Detailed descriptions and usage of the above hyper-parameters are listed below:
+
+* `uql_equivalent_bits`: the total number of bits used in the optimal strategy will not exceed $n_{param} \times$ `uql_equivalent_bits`. For example, by setting `uql_equivalent_bits=4`, the RL agent will try to find the best quantization strategy with the same compression ratio as quantizing every layer with 4 bits.
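+
+Written out as a sketch, with $n_{l}$ the number of weights in layer $l$ and $b_{l}$ the number of bits assigned to it (this notation, and reading $n_{param}$ as $\sum_{l} n_{l}$, are assumptions made here), the constraint is:
+
+$$
+\sum_{l} n_{l} \cdot b_{l} \le b_{eq} \cdot \sum_{l} n_{l},
+$$
+
+where $b_{eq}$ stands for `uql_equivalent_bits`. For instance, for a toy network with two layers of $1000$ and $3000$ weights and $b_{eq} = 4$, the budget is $16000$ bits: the allocation $\left( 8, 2 \right)$ uses $8000 + 6000 = 14000$ bits and is feasible, while $\left( 8, 3 \right)$ uses $8000 + 9000 = 17000$ bits and is not.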
+
+The following parameters can be kept at their default values in most cases. Users can also modify them when using their customized models if necessary.
+
+* `uql_nb_rlouts`: the number of roll-outs for training the RL agent. Generally, the first quarter of the `uql_nb_rlouts` roll-outs is used to collect the training buffer, and the last three quarters are used to train the agent. The larger `uql_nb_rlouts` is, the slower the hyper-parameter search.
+* `uql_w_bit_min`: the minimum number of quantization bits for a layer. This is used to constrain the search space and avoid extreme strategies that ruin the performance of the compressed model.
+* `uql_w_bit_max`: the maximum number of quantization bits for a layer. This is used to constrain the search space and prevent a single layer from using unnecessarily many bits.
+* `uql_enbl_rl_global_tune`: the switch to globally fine-tune the network in each roll-out, which is done by updating the full-precision weights of all layers via the STE. The aim of fine-tuning is to obtain an effective reward for the current strategy.
+* `uql_enbl_rl_layerwise_tune`: the switch to fine-tune the network layer by layer in each roll-out, which is done by minimizing the l2-norm of the difference between the quantized layer and the full-precision layer.
+* `uql_tune_layerwise_steps`: the number of steps for layer-wise fine-tuning. Generally, the larger the value, the more precise the reward and hence the better the strategy.
+* `uql_tune_global_steps`: the number of steps for global fine-tuning. Generally, the larger the value, the more precise the reward and hence the better the strategy.
+* `uql_tune_disp_steps`: the interval (in steps) at which the global fine-tuning progress is displayed in each roll-out.
+* `uql_enbl_random_layers`: the switch to randomly permute the layers of the network when searching for the optimal strategy. This can be helpful since the bit budget used by previous layers may affect the search space for the following layers, while randomly shuffling the layers gives every layer an equal chance of being assigned each strategy.
+
+### Usage Examples
+
+In this section, we provide some usage examples to demonstrate how to use `UniformQuantLearner` under different execution modes and hyper-parameter combinations.
+
+To quantize the network, users should first get the model prepared. Users can either use the pre-built models in PocketFlow, or develop their customized nets following the model definition in PocketFlow (for example, [resnet_at_cifar10.py](https://github.com/Tencent/PocketFlow/blob/master/nets/resnet_at_cifar10.py)). Once the model is built, quantization can be triggered directly as follows:
+
+To quantize a ResNet-20 model for CIFAR-10 classification task with 4 bits in the local mode, use:
+
+```bash
+# quantize resnet-20 on CIFAR-10
+sh ./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
+--learner=uniform \
+--uql_weight_bits=4 \
+--uql_activation_bits=4
+```
+
+To quantize a ResNet-18 model for ILSVRC-12 classification task with 8 bits in the docker mode with 4 GPUs, with channel-wise bucketing enabled, use:
+
+``` bash
+# quantize the resnet-18 on ILSVRC-12
+sh ./scripts/run_docker.sh nets/resnet_at_ilsvrc12_run.py \
+-n=4 \
+--learner=uniform \
+--uql_weight_bits=8 \
+--uql_activation_bits=8 \
+--uql_use_buckets=True \
+--uql_bucket_type=channel
+```
+
+To quantize a MobileNet-v1 model for ILSVRC-12 classification task with 4 bits in the seven mode with 8 GPUs, and allow the RL agent to search for the optimal bit allocation strategy, use:
+
+```bash
+# quantize mobilenet-v1 on ILSVRC-12
+sh ./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py \
+-n=8 \
+--learner=uniform \
+--uql_enbl_rl_agent=True \
+--uql_equivalent_bits=4
+```
+
+## UniformQuantTFLearner
+
+PocketFlow also wraps the quantization aware training in TensorFlow. The quantized model can be directly exported to `.tflite` format via [export_quant_tflite_model.py](https://github.com/haolibai/PocketFlow/blob/master/tools/conversion/export_quant_tflite_model.py) in PocketFlow, and then be easily deployed on Android devices.
+
+To configure `UniformQuantTFLearner`, the hyper-parameters are as follows:
+
+| Option | Description |
+|:-------|:------------|
+| `uqtf_save_path`       | UQ-TF: model's save path. Default: `./models_uqtf/model.ckpt`. |
+| `uqtf_save_path_eval`  | UQ-TF: model's save path for evaluation. Default: `./models_uqtf_eval/model.ckpt`. |
+| `uqtf_weight_bits`     | UQ-TF: # of bits for weight quantization. Default: `8`.      |
+| `uqtf_activation_bits` | UQ-TF: # of bits for activation quantization. Default: `8`.  |
+| `uqtf_quant_delay`     | UQ-TF: # of steps after which weights and activations are quantized. Default: `0`. |
+| `uqtf_freeze_bn_delay` | UQ-TF: # of steps after which moving mean and variance are frozen. Default: `None`. |
+| `uqtf_lrn_rate_dcy`    | UQ-TF: learning rate's decaying factor. Default: `1e-2`.    |
+
+Here, detailed descriptions (and some analysis) of some of the above hyper-parameters are listed as follows:
+
+* `uqtf_quant_delay`: The number of steps after which fine-tuning of the quantized network starts. Before the training step reaches `uqtf_quant_delay`, only the full-precision weights of the model are updated.
+* `uqtf_freeze_bn_delay`: The number of steps after which the moving mean and variance of batch normalization layers are frozen and used in place of the batch statistics during training.
+* `uqtf_lrn_rate_dcy`: The decaying factor of the learning rate for the quantized model. Generally, the quantized network needs a smaller learning rate than the full-precision model.
+
+### Usage Examples
+
+To deploy a quantized network on Android devices, there are generally 3 steps:
+
+### Quantize the pre-trained network
+
+To quantize a MobileNet-v1 model for ILSVRC-12 classification task with 8 bits in the seven mode, use:
+
+``` bash
+# quantize MobileNet-v1 on ILSVRC-12
+$ ./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py -n=8 \
+    --learner uniform-tf \
+    --nb_epochs_rat 0.2
+```
+
+where `--nb_epochs_rat 0.2` specifies that only 20% of the training epochs are used, which is usually enough.
+
+### Export to .tflite format
+
+```bash
+# load the checkpoints in ./models_uqtf_eval
+$ python tools/conversion/export_quant_tflite_model.py \
+    --model_dir ./models_uqtf_eval \
+    --enbl_post_quant
+```
+
+Note that we enable the `enbl_post_quant` option to ensure that all operations are quantized. On one hand, some operations may not be successfully quantized via TensorFlow's quantization-aware training APIs, so post-training quantization can help remedy this, possibly at the cost of slightly reduced accuracy of the quantized model. On the other hand, users can directly export a full-precision model to its quantized counterpart without going through the `UniformQuantTFLearner`. This could be helpful when users want to quickly evaluate the inference speed, or when some performance degradation of the quantized model can be tolerated.
+
+If the conversion completes without error, then `.pb` and `.tflite` files will be saved in `./models_uqtf_eval`.
+
+### Deploy on Mobile Devices
+
+Deployment of a quantized model is very similar to that of a full-precision model, as shown in the [tutorial page](https://pocketflow.github.io/tutorial/). Specifically, users need to make the following modifications:
+
+1. In [ImageClassifierQuantizedMobileNet.java](https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/ImageClassifierQuantizedMobileNet.java) L24: rename the class w.r.t. your model.
+2. In [ImageClassifierQuantizedMobileNet.java](https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/ImageClassifierQuantizedMobileNet.java) L46: replace the model input "mobilenet_quant_v1_224.tflite" with your "*.tflite" file.
+3. In [ImageClassifierQuantizedMobileNet.java](https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/ImageClassifierQuantizedMobileNet.java) L51: replace the label file "labels_mobilenet_quant_v1_224.txt" with your label file.
+
+4. In [Camera2BasicFragment.java](https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/lite/java/demo/app/src/main/java/com/example/android/tflitecamerademo/Camera2BasicFragment.java) L332: change the name of the class accordingly.
diff --git a/docs/docs/ws_learner.md b/docs/docs/ws_learner.md
new file mode 100644
index 0000000..7b88066
--- /dev/null
+++ b/docs/docs/ws_learner.md
@@ -0,0 +1,96 @@
+# Weight Sparsification
+
+## Introduction
+
+By imposing sparsity constraints on convolutional and fully-connected layers, the number of non-zero weights can be dramatically reduced, which leads to a smaller model size and fewer FLOPs for inference (actual acceleration depends on an efficient implementation of sparse operations). Directly training a network with a fixed sparsity degree may encounter optimization difficulties and take longer to converge. To overcome this, Zhu & Gupta proposed a dynamic pruning schedule that gradually removes network weights to simplify the optimization process (Zhu & Gupta, 2017).
+
+Note: in this documentation, we will use both "sparsity" and "pruning ratio" to denote the ratio of zero-valued weights over all weights.
+
+## Algorithm Description
+
+For each convolutional kernel (for convolutional layer) or weighting matrix (for fully-connected layer), we create a binary mask of the same size to impose the sparsity constraint. During the forward pass, the convolutional kernel (or weighting matrix) is multiplied with the binary mask, so that some weights will not participate in the computation and also will not be updated via gradients. The binary mask is computed based on the absolute values of weights: weights with the smallest absolute values are masked out until the desired sparsity is reached.
+
+During the training process, the sparsity is gradually increased to improve the overall optimization behavior. The dynamic pruning schedule is defined as:
+
+$$
+s_{t} = s_{f} - s_{f} \cdot \left( 1 - \frac{t - t_{b}}{t_{e} - t_{b}} \right)^{\alpha}, t \in \left[ t_{b}, t_{e} \right]
+$$
+
+where $s_{t}$ is the sparsity at iteration \#$t$, $s_{f}$ is the target sparsity, $t_{b}$ and $t_{e}$ are the iteration indices where the sparsity begins and stops increasing, and $\alpha$ is the exponent term. In the actual implementation, the binary mask is not updated at each iteration. Instead, it is updated every $\Delta t$ iterations so as to stabilize the training process. We visualize the dynamic pruning schedule in the figure below.
+
+![WSL PR Schedule](pics/wsl_pr_schedule.png)
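+
+As a worked example with illustrative values (not defaults of the learner), let $s_{f} = 0.75$, $\alpha = 3$, $t_{b} = 0$, and $t_{e} = 1000$. Halfway through the schedule, at $t = 500$:
+
+$$
+s_{500} = 0.75 - 0.75 \cdot \left( 1 - \frac{500}{1000} \right)^{3} = 0.75 - 0.75 \cdot 0.125 = 0.65625,
+$$
+
+so most of the pruning already happens in the first half of the schedule; the sparsity is $0$ at $t = t_{b}$ and reaches $s_{f} = 0.75$ at $t = t_{e}$.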
+
+Most networks consist of multiple layers, and the weight redundancy may differ from one layer to another. In order to maximally exploit the weight redundancy, we incorporate a reinforcement learning controller to automatically determine the optimal sparsity (or pruning ratio) for each layer. In each roll-out, the RL agent sequentially determines the sparsity for each layer, and then the network is pruned and re-trained for a few iterations using layer-wise regression & global fine-tuning. The reward function's value is computed based on the re-trained network's accuracy (and computation efficiency), and is then used to update the model parameters of the RL agent. For more details, please refer to the documentation named "Hyper-parameter Optimizer - Reinforcement Learning".
+
+## Hyper-parameters
+
+Below is the full list of hyper-parameters used in the weight sparsification learner:
+
+| Name | Description |
+|:-----|:------------|
+| `ws_save_path`        | model's save path |
+| `ws_prune_ratio`      | target pruning ratio |
+| `ws_prune_ratio_prtl` | pruning ratio protocol: 'uniform' / 'heurist' / 'optimal' |
+| `ws_nb_rlouts`        | number of roll-outs for the RL agent |
+| `ws_nb_rlouts_min`    | minimal number of roll-outs for the RL agent to start training |
+| `ws_reward_type`      | reward type: 'single-obj' / 'multi-obj' |
+| `ws_lrn_rate_rg`      | learning rate for layer-wise regression |
+| `ws_nb_iters_rg`      | number of iterations for layer-wise regression |
+| `ws_lrn_rate_ft`      | learning rate for global fine-tuning |
+| `ws_nb_iters_ft`      | number of iterations for global fine-tuning |
+| `ws_nb_iters_feval`   | number of iterations for fast evaluation |
+| `ws_prune_ratio_exp`  | pruning ratio's exponent term |
+| `ws_iter_ratio_beg`   | iteration ratio at which the pruning ratio begins increasing |
+| `ws_iter_ratio_end`   | iteration ratio at which the pruning ratio stops increasing |
+| `ws_mask_update_step` | step size for updating the pruning mask |
+
+Here, we provide a detailed description (and some analysis) of the above hyper-parameters:
+
+* `ws_save_path`: save path for model created in the training graph. The resulting checkpoint files can be used to resume training from a previous run and compute model's loss function's value and some other evaluation metrics.
+* `ws_prune_ratio`: target pruning ratio for convolutional & fully-connected layers. The larger `ws_prune_ratio` is, the more weights will be pruned. If `ws_prune_ratio` equals 0, then no weights will be pruned and model remains the same; if `ws_prune_ratio` equals 1, then all weights are pruned.
+* `ws_prune_ratio_prtl`: pruning ratio protocol. Possible options include: 1) uniform: all layers use the same pruning ratio; 2) heurist: the more weights in one layer, the higher pruning ratio will be; 3) optimal: each layer's pruning ratio is determined by reinforcement learning.
+* `ws_nb_rlouts`: number of roll-outs for training the reinforcement learning agent. A roll-out refers to: use the RL agent to determine the pruning ratio for each layer; fine-tune the weight sparsified network; evaluate the fine-tuned network to obtain the reward value.
+* `ws_nb_rlouts_min`: minimal number of roll-outs for the RL agent to start training. The RL agent requires a few roll-outs for random exploration before actual training starts. We recommend setting this to a quarter of `ws_nb_rlouts`.
+* `ws_reward_type`: reward function's type for the RL agent. Possible options include: 1) single-obj: the reward function only depends on the compressed model's accuracy (the sparsity constraint is imposed during roll-out); 2) multi-obj: the reward function depends on both the compressed model's accuracy and the actual sparsity.
+* `ws_lrn_rate_rg`: learning rate for layer-wise regression.
+* `ws_nb_iters_rg`: number of iterations for layer-wise regression. This should be set to some value that the layer-wise regression can almost converge and the loss function's value does not decrease much even if more iterations are used.
+* `ws_lrn_rate_ft`: learning rate for global fine-tuning.
+* `ws_nb_iters_ft`: number of iterations for global fine-tuning. This should be set to some value that the global fine-tuning can almost converge and the loss function's value does not decrease much even if more iterations are used.
+* `ws_nb_iters_feval`: number of iterations for fast evaluation. In each roll-out, the re-trained network is evaluated on a subset of evaluation data to save time.
+* `ws_prune_ratio_exp`: pruning ratio's exponent term as defined in the dynamic pruning schedule above.
+* `ws_iter_ratio_beg`: iteration ratio at which the pruning ratio begins increasing. In the dynamic pruning schedule defined above, $t_{b}$ equals the total number of training iterations multiplied by `ws_iter_ratio_beg`.
+* `ws_iter_ratio_end`: iteration ratio at which the pruning ratio stops increasing. In the dynamic pruning schedule defined above, $t_{e}$ equals the total number of training iterations multiplied by `ws_iter_ratio_end`.
+* `ws_mask_update_step`: step size for updating the pruning mask. By increasing `ws_mask_update_step`, binary masks for weight pruning are less frequently updated, which will speed-up the training but the difference between pre-update and post-update sparsity will be larger.
+
+## Usage Examples
+
+In this section, we provide some usage examples to demonstrate how to use `WeightSparseLearner` under different execution modes and hyper-parameter combinations:
+
+To compress a ResNet-20 model for CIFAR-10 classification task in the local mode, use:
+
+``` bash
+# set the target pruning ratio to 0.75
+./scripts/run_local.sh nets/resnet_at_cifar10_run.py \
+    --learner weight-sparse \
+    --ws_prune_ratio 0.75
+```
+
+To compress a ResNet-34 model for ILSVRC-12 classification task in the docker mode with 4 GPUs, use:
+
+``` bash
+# set the pruning ratio protocol to "heurist"
+./scripts/run_docker.sh nets/resnet_at_ilsvrc12_run.py -n=4 \
+    --learner weight-sparse \
+    --resnet_size 34 \
+    --ws_prune_ratio_prtl heurist
+```
+
+To compress a MobileNet-v2 model for ILSVRC-12 classification task in the seven mode with 8 GPUs, use:
+
+``` bash
+# enable training with distillation loss
+./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py -n=8 \
+    --learner weight-sparse \
+    --mobilenet_version 2 \
+    --enbl_dst
+```
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
new file mode 100644
index 0000000..46f64a1
--- /dev/null
+++ b/docs/mkdocs.yml
@@ -0,0 +1,32 @@
+site_name: PocketFlow Docs
+nav:
+- Home: index.md
+- Installation: installation.md
+- Tutorial: tutorial.md
+- Learners - Algorithms:
+  - Channel Pruning: cp_learner.md
+  - Channel Pruning - Remastered: cpr_learner.md
+  - Discrimination-aware Channel Pruning: dcp_learner.md
+  - Weight Sparsification: ws_learner.md
+  - Uniform Quantization: uq_learner.md
+  - Non-uniform Quantization: nuq_learner.md
+- Learners - Misc.:
+  - Distillation: distillation.md
+  - Multi-GPU Training: multi_gpu_training.md
+- Hyper-parameter Optimizers:
+  - Reinforcement Learning: reinforcement_learning.md
+  - AutoML-based Methods: automl_based_methods.md
+- Self-defined Models: self_defined_models.md
+- Performance: performance.md
+- Frequently Asked Questions: faq.md
+- Appendix:
+  - Pre-trained Models: pre_trained_models.md
+  - Test Cases: test_cases.md
+  - Reference: reference.md
+theme: readthedocs
+
+markdown_extensions:
+  - pymdownx.arithmatex
+extra_javascript:
+  - mathjax-config.js
+  - MathJax.js?config=TeX-AMS-MML_HTMLorMML
diff --git a/docs/qr_code.jpg b/docs/qr_code.jpg
new file mode 100644
index 0000000..644fb1c
Binary files /dev/null and b/docs/qr_code.jpg differ
diff --git a/examples/convnet_at_fmnist.py b/examples/convnet_at_fmnist.py
new file mode 100644
index 0000000..bdd004a
--- /dev/null
+++ b/examples/convnet_at_fmnist.py
@@ -0,0 +1,135 @@
+# Tencent is pleased to support the open source community by making PocketFlow available.
+#
+# Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Model helper for creating a ConvNet model for the Fashion-MNIST dataset."""
+
+import tensorflow as tf
+
+from nets.abstract_model_helper import AbstractModelHelper
+from datasets.fmnist_dataset import FMnistDataset
+from utils.lrn_rate_utils import setup_lrn_rate_piecewise_constant
+from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
+
+FLAGS = tf.app.flags.FLAGS
+
+tf.app.flags.DEFINE_float('nb_epochs_rat', 1.0, '# of training epochs\'s ratio')
+tf.app.flags.DEFINE_float('lrn_rate_init', 1e-1, 'initial learning rate')
+tf.app.flags.DEFINE_float('batch_size_norm', 128, 'normalization factor of batch size')
+tf.app.flags.DEFINE_float('momentum', 0.9, 'momentum coefficient')
+tf.app.flags.DEFINE_float('loss_w_dcy', 3e-4, 'weight decaying loss\'s coefficient')
+
+def forward_fn(inputs, data_format):
+  """Forward pass function.
+
+  Args:
+  * inputs: inputs to the network's forward pass
+  * data_format: data format ('channels_last' OR 'channels_first')
+
+  Returns:
+  * inputs: outputs from the network's forward pass
+  """
+
+  # transpose the image tensor from NHWC to NCHW if needed
+  if data_format == 'channels_first':
+    inputs = tf.transpose(inputs, [0, 3, 1, 2])
+
+  # conv1
+  inputs = tf.layers.conv2d(inputs, 32, [5, 5], padding='same',
+                            data_format=data_format, activation=tf.nn.relu, name='conv1')
+  inputs = tf.layers.max_pooling2d(inputs, [2, 2], 2, data_format=data_format, name='pool1')
+
+  # conv2
+  inputs = tf.layers.conv2d(inputs, 64, [5, 5], padding='same',
+                            data_format=data_format, activation=tf.nn.relu, name='conv2')
+  inputs = tf.layers.max_pooling2d(inputs, [2, 2], 2, data_format=data_format, name='pool2')
+
+  # fc3
+  inputs = tf.layers.flatten(inputs, name='flatten')
+  inputs = tf.layers.dense(inputs, 1024, activation=tf.nn.relu, name='fc3')
+
+  # fc4
+  inputs = tf.layers.dense(inputs, FLAGS.nb_classes, name='fc4')
+  inputs = tf.nn.softmax(inputs, name='softmax')
+
+  return inputs
+
+class ModelHelper(AbstractModelHelper):
+  """Model helper for creating a ConvNet model for the Fashion-MNIST dataset."""
+
+  def __init__(self, data_format='channels_last'):
+    """Constructor function."""
+
+    # class-independent initialization
+    super(ModelHelper, self).__init__(data_format)
+
+    # initialize training & evaluation subsets
+    self.dataset_train = FMnistDataset(is_train=True)
+    self.dataset_eval = FMnistDataset(is_train=False)
+
+  def build_dataset_train(self, enbl_trn_val_split=False):
+    """Build the data subset for training, usually with data augmentation."""
+
+    return self.dataset_train.build(enbl_trn_val_split)
+
+  def build_dataset_eval(self):
+    """Build the data subset for evaluation, usually without data augmentation."""
+
+    return self.dataset_eval.build()
+
+  def forward_train(self, inputs):
+    """Forward computation at training."""
+
+    return forward_fn(inputs, self.data_format)
+
+  def forward_eval(self, inputs):
+    """Forward computation at evaluation."""
+
+    return forward_fn(inputs, self.data_format)
+
+  def calc_loss(self, labels, outputs, trainable_vars):
+    """Calculate loss (and some extra evaluation metrics)."""
+
+    loss = tf.losses.softmax_cross_entropy(labels, outputs)
+    loss += FLAGS.loss_w_dcy * tf.add_n([tf.nn.l2_loss(var) for var in trainable_vars])
+    accuracy = tf.reduce_mean(
+      tf.cast(tf.equal(tf.argmax(labels, axis=1), tf.argmax(outputs, axis=1)), tf.float32))
+    metrics = {'accuracy': accuracy}
+
+    return loss, metrics
+
+  def setup_lrn_rate(self, global_step):
+    """Setup the learning rate (and number of training iterations)."""
+
+    nb_epochs = 160
+    idxs_epoch = [40, 80, 120]
+    decay_rates = [1.0, 0.1, 0.01, 0.001]
+    batch_size = FLAGS.batch_size * (1 if not FLAGS.enbl_multi_gpu else mgw.size())
+    lrn_rate = setup_lrn_rate_piecewise_constant(global_step, batch_size, idxs_epoch, decay_rates)
+    nb_iters = int(FLAGS.nb_smpls_train * nb_epochs * FLAGS.nb_epochs_rat / batch_size)
+
+    return lrn_rate, nb_iters
+
+  @property
+  def model_name(self):
+    """Model's name."""
+
+    return 'convnet'
+
+  @property
+  def dataset_name(self):
+    """Dataset's name."""
+
+    return 'fmnist'
diff --git a/examples/convnet_at_fmnist_run.py b/examples/convnet_at_fmnist_run.py
new file mode 100644
index 0000000..008aaf4
--- /dev/null
+++ b/examples/convnet_at_fmnist_run.py
@@ -0,0 +1,69 @@
+# Tencent is pleased to support the open source community by making PocketFlow available.
+#
+# Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Execution script for ConvNet models on the Fashion-MNIST dataset."""
+
+import traceback
+import tensorflow as tf
+
+from nets.convnet_at_fmnist import ModelHelper
+from learners.learner_utils import create_learner
+
+FLAGS = tf.app.flags.FLAGS
+
+tf.app.flags.DEFINE_string('log_dir', './logs', 'logging directory')
+tf.app.flags.DEFINE_boolean('enbl_multi_gpu', False, 'enable multi-GPU training')
+tf.app.flags.DEFINE_string('learner', 'full-prec', 'learner\'s name')
+tf.app.flags.DEFINE_string('exec_mode', 'train', 'execution mode: train / eval')
+tf.app.flags.DEFINE_boolean('debug', False, 'debugging information')
+
+def main(unused_argv):
+  """Main entry."""
+
+  try:
+    # setup the TF logging routine
+    if FLAGS.debug:
+      tf.logging.set_verbosity(tf.logging.DEBUG)
+    else:
+      tf.logging.set_verbosity(tf.logging.INFO)
+    sm_writer = tf.summary.FileWriter(FLAGS.log_dir)
+
+    # display FLAGS's values
+    tf.logging.info('FLAGS:')
+    for key, value in FLAGS.flag_values_dict().items():
+      tf.logging.info('{}: {}'.format(key, value))
+
+    # build the model helper & learner
+    model_helper = ModelHelper()
+    learner = create_learner(sm_writer, model_helper)
+
+    # execute the learner
+    if FLAGS.exec_mode == 'train':
+      learner.train()
+    elif FLAGS.exec_mode == 'eval':
+      learner.download_model()
+      learner.evaluate()
+    else:
+      raise ValueError('unrecognized execution mode: ' + FLAGS.exec_mode)
+
+    # exit normally
+    return 0
+  except ValueError:
+    traceback.print_exc()
+    return 1  # exit with errors
+
+if __name__ == '__main__':
+  tf.app.run()
diff --git a/examples/fmnist_dataset.py b/examples/fmnist_dataset.py
new file mode 100644
index 0000000..be40576
--- /dev/null
+++ b/examples/fmnist_dataset.py
@@ -0,0 +1,166 @@
+# Tencent is pleased to support the open source community by making PocketFlow available.
+#
+# Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Fashion-MNIST dataset."""
+
+import os
+import gzip
+import numpy as np
+import tensorflow as tf
+
+from datasets.abstract_dataset import AbstractDataset
+
+FLAGS = tf.app.flags.FLAGS
+
+tf.app.flags.DEFINE_integer('nb_classes', 10, '# of classes')
+tf.app.flags.DEFINE_integer('nb_smpls_train', 60000, '# of samples for training')
+tf.app.flags.DEFINE_integer('nb_smpls_val', 5000, '# of samples for validation')
+tf.app.flags.DEFINE_integer('nb_smpls_eval', 10000, '# of samples for evaluation')
+tf.app.flags.DEFINE_integer('batch_size', 128, 'batch size per GPU for training')
+tf.app.flags.DEFINE_integer('batch_size_eval', 100, 'batch size for evaluation')
+
+# Fashion-MNIST specifications
+IMAGE_HEI = 28
+IMAGE_WID = 28
+IMAGE_CHN = 1
+
+def load_mnist(image_file, label_file):
+  """Load images and labels from *.gz files.
+
+  This function is modified from utils/mnist_reader.py in the Fashion-MNIST repo.
+
+  Args:
+  * image_file: file path to images
+  * label_file: file path to labels
+
+  Returns:
+  * images: np.array of the image data
+  * labels: np.array of the label data
+  """
+
+  with gzip.open(label_file, 'rb') as i_file:
+    labels = np.frombuffer(i_file.read(), dtype=np.uint8, offset=8)
+  with gzip.open(image_file, 'rb') as i_file:
+    images = np.frombuffer(i_file.read(), dtype=np.uint8, offset=16)
+    image_size = IMAGE_HEI * IMAGE_WID * IMAGE_CHN
+    assert images.size == image_size * len(labels)
+    images = images.reshape(len(labels), image_size)
+
+  return images, labels
+
+def parse_fn(image, label, is_train):
+  """Parse an (image, label) pair and apply data augmentation if needed.
+
+  Args:
+  * image: image tensor
+  * label: label tensor
+  * is_train: whether data augmentation should be applied
+
+  Returns:
+  * image: image tensor
+  * label: one-hot label tensor
+  """
+
+  # data parsing
+  label = tf.one_hot(tf.reshape(label, []), FLAGS.nb_classes)
+  image = tf.cast(tf.reshape(image, [IMAGE_HEI, IMAGE_WID, IMAGE_CHN]), tf.float32)
+  image = tf.image.per_image_standardization(image)
+
+  # data augmentation
+  if is_train:
+    image = tf.image.resize_image_with_crop_or_pad(image, IMAGE_HEI + 8, IMAGE_WID + 8)
+    image = tf.random_crop(image, [IMAGE_HEI, IMAGE_WID, IMAGE_CHN])
+    image = tf.image.random_flip_left_right(image)
+
+  return image, label
+
+class FMnistDataset(AbstractDataset):
+  """Fashion-MNIST dataset."""
+
+  def __init__(self, is_train):
+    """Constructor function.
+
+    Args:
+    * is_train: whether to construct the training subset
+    """
+
+    # initialize the base class
+    super(FMnistDataset, self).__init__(is_train)
+
+    # choose local files or HDFS files w.r.t. FLAGS.data_disk
+    if FLAGS.data_disk == 'local':
+      assert FLAGS.data_dir_local is not None, '<FLAGS.data_dir_local> must not be None'
+      data_dir = FLAGS.data_dir_local
+    elif FLAGS.data_disk == 'hdfs':
+      assert FLAGS.data_hdfs_host is not None and FLAGS.data_dir_hdfs is not None, \
+        'both <FLAGS.data_hdfs_host> and <FLAGS.data_dir_hdfs> must not be None'
+      data_dir = FLAGS.data_hdfs_host + FLAGS.data_dir_hdfs
+    else:
+      raise ValueError('unrecognized data disk: ' + FLAGS.data_disk)
+
+    # setup paths to image & label files, and read in images & labels
+    if is_train:
+      self.batch_size = FLAGS.batch_size
+      image_file = os.path.join(data_dir, 'train-images-idx3-ubyte.gz')
+      label_file = os.path.join(data_dir, 'train-labels-idx1-ubyte.gz')
+    else:
+      self.batch_size = FLAGS.batch_size_eval
+      image_file = os.path.join(data_dir, 't10k-images-idx3-ubyte.gz')
+      label_file = os.path.join(data_dir, 't10k-labels-idx1-ubyte.gz')
+    self.images, self.labels = load_mnist(image_file, label_file)
+    self.parse_fn = lambda x, y: parse_fn(x, y, is_train)
+
+  def build(self, enbl_trn_val_split=False):
+    """Build iterator(s) for tf.data.Dataset() object.
+
+    Args:
+    * enbl_trn_val_split: whether to split into training & validation subsets
+
+    Returns:
+    * iterator_trn: iterator for the training subset
+    * iterator_val: iterator for the validation subset
+      OR
+    * iterator: iterator for the chosen subset (training OR testing)
+    """
+
+    # create a tf.data.Dataset() object from NumPy arrays
+    dataset = tf.data.Dataset.from_tensor_slices((self.images, self.labels))
+    dataset = dataset.map(self.parse_fn, num_parallel_calls=FLAGS.nb_threads)
+
+    # create iterators for training & validation subsets separately
+    if self.is_train and enbl_trn_val_split:
+      iterator_val = self.__make_iterator(dataset.take(FLAGS.nb_smpls_val))
+      iterator_trn = self.__make_iterator(dataset.skip(FLAGS.nb_smpls_val))
+      return iterator_trn, iterator_val
+
+    return self.__make_iterator(dataset)
+
+  def __make_iterator(self, dataset):
+    """Make an iterator from tf.data.Dataset.
+
+    Args:
+    * dataset: tf.data.Dataset object
+
+    Returns:
+    * iterator: iterator for the dataset
+    """
+
+    dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(buffer_size=FLAGS.buffer_size))
+    dataset = dataset.batch(self.batch_size)
+    dataset = dataset.prefetch(FLAGS.prefetch_size)
+    iterator = dataset.make_one_shot_iterator()
+
+    return iterator
diff --git a/learners/abstract_learner.py b/learners/abstract_learner.py
index cabbba7..3d31455 100644
--- a/learners/abstract_learner.py
+++ b/learners/abstract_learner.py
@@ -23,6 +23,8 @@
 import subprocess
 import tensorflow as tf
 
+from utils.misc_utils import auto_barrier as auto_barrier_impl
+from utils.misc_utils import is_primary_worker as is_primary_worker_impl
 from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
 
 FLAGS = tf.app.flags.FLAGS
@@ -77,8 +79,12 @@ def __init__(self, sm_writer, model_helper):
     self.forward_train = model_helper.forward_train
     self.forward_eval = model_helper.forward_eval
     self.calc_loss = model_helper.calc_loss
+    self.setup_lrn_rate = model_helper.setup_lrn_rate
+    self.warm_start = model_helper.warm_start
+    self.dump_n_eval = model_helper.dump_n_eval
     self.model_name = model_helper.model_name
     self.dataset_name = model_helper.dataset_name
+    self.forward_w_labels = model_helper.forward_w_labels
 
     # checkpoint path determined by model's & dataset's names
     self.ckpt_file = 'models_%s_at_%s.tar.gz' % (self.model_name, self.dataset_name)
@@ -121,10 +127,7 @@ def download_model(self):
   def auto_barrier(self):
     """Automatically insert a barrier for multi-GPU training, or pass for single-GPU training."""
 
-    if FLAGS.enbl_multi_gpu:
-      self.mpi_comm.Barrier()
-    else:
-      pass
+    auto_barrier_impl(self.mpi_comm)
 
   @classmethod
   def is_primary_worker(cls, scope='global'):
@@ -137,12 +140,7 @@ def is_primary_worker(cls, scope='global'):
     * flag: whether is the primary worker
     """
 
-    if scope == 'global':
-      return True if not FLAGS.enbl_multi_gpu else mgw.rank() == 0
-    elif scope == 'local':
-      return True if not FLAGS.enbl_multi_gpu else mgw.local_rank() == 0
-    else:
-      raise ValueError('unrecognized worker scope: ' + scope)
+    return is_primary_worker_impl(scope)
 
   @property
   def vars(self):
diff --git a/learners/channel_pruning/channel_pruner.py b/learners/channel_pruning/channel_pruner.py
index 522e7eb..b29dc10 100644
--- a/learners/channel_pruning/channel_pruner.py
+++ b/learners/channel_pruning/channel_pruner.py
@@ -45,7 +45,7 @@
                            achieve low flops with guaranted accuracy.''')
 tf.app.flags.DEFINE_integer('cp_nb_points_per_layer', 10,
                             'Sample how many point for each layer')
-tf.app.flags.DEFINE_integer('cp_nb_batches', 60,
+tf.app.flags.DEFINE_integer('cp_nb_batches', 30,
                             'Input how many bathes data into a model')
 
 
diff --git a/learners/channel_pruning/learner.py b/learners/channel_pruning/learner.py
index f44f88e..689e9ac 100644
--- a/learners/channel_pruning/learner.py
+++ b/learners/channel_pruning/learner.py
@@ -27,7 +27,6 @@
 from tensorflow.contrib import graph_editor
 
 from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
-from utils.lrn_rate_utils import setup_lrn_rate
 from learners.distillation_helper import DistillationHelper
 from learners.abstract_learner import AbstractLearner
 from learners.channel_pruning.model_wrapper import Model
@@ -49,6 +48,10 @@
   'cp_prune_list_file',
   'ratio.list',
   'the prune list file which contains the compression ratio of each convolution layers')
+tf.app.flags.DEFINE_string(
+  'cp_channel_pruned_path',
+  './models/pruned_model.ckpt',
+  'channel pruned model\'s save path')
 tf.app.flags.DEFINE_string(
   'cp_best_path',
   './models/best_model.ckpt',
@@ -116,30 +119,6 @@ def __init__(self, sm_writer, model_helper):
     self.__build(is_train=True)
     self.__build(is_train=False)
 
-    channel_pruned_path = './models/pruned_model.ckpt'
-    best_model_path = './models/best_model.ckpt'
-    if FLAGS.enbl_multi_gpu:
-      self.parent_path = ''
-      if self.mpi_comm.rank == 0:
-        self.parent_path = '/opt/ml/disk/' + \
-          ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(8))
-        pathlib.Path(self.parent_path).mkdir(parents=True, exist_ok=True)
-        channel_pruned_path = self.parent_path + '/' + channel_pruned_path
-        best_model_path = self.parent_path + '/' + best_model_path
-
-      channel_pruned_path = self.mpi_comm.bcast(channel_pruned_path, root=0)
-      best_model_path = self.mpi_comm.bcast(best_model_path, root=0)
-      self.parent_path = self.mpi_comm.bcast(self.parent_path, root=0)
-
-    tf.app.flags.DEFINE_string(
-      'cp_channel_pruned_path',
-      channel_pruned_path,
-      'channel pruned model\'s save path')
-    tf.app.flags.DEFINE_string(
-      'cp_best_model_path',
-      best_model_path,
-      'channel best model\'s save path')
-
   def train(self):
     """Train the pruned model"""
     # download pre-trained model
@@ -312,7 +291,9 @@ def __build_pruned_evaluate_model(self, path=None):
       self.saver_eval = tf.train.import_meta_graph(path + '.meta')
       self.saver_eval.restore(self.sess_eval, path)
       eval_logits = tf.get_collection('logits')[0]
+      tf.add_to_collection('logits_final', eval_logits)
       eval_images = tf.get_collection('eval_images')[0]
+      tf.add_to_collection('images_final', eval_images)
       eval_labels = tf.get_collection('eval_labels')[0]
       mem_images = tf.get_collection('mem_images')[0]
       mem_labels = tf.get_collection('mem_labels')[0]
@@ -370,8 +351,7 @@ def __build_pruned_train_model(self, path=None, finetune=False): # pylint: disab
 
       global_step = tf.get_variable('global_step', shape=[], dtype=tf.int32, trainable=False)
       self.global_step = global_step
-      lrn_rate, self.nb_iters_train = setup_lrn_rate(
-        self.global_step, self.model_name, self.dataset_name)
+      lrn_rate, self.nb_iters_train = self.setup_lrn_rate(self.global_step)
 
       if finetune and not FLAGS.cp_retrain:
         mom_optimizer = tf.train.AdamOptimizer(FLAGS.cp_lrn_rate_ft)
@@ -497,7 +477,7 @@ def __save_best_pruned_model(self):
 
   def __save_in_progress_pruned_model(self):
     """ save a in progress training model with a max evaluation result"""
-    self.max_save_path = self.saver_eval.save(self.sess_eval, FLAGS.cp_best_model_path)
+    self.max_save_path = self.saver_eval.save(self.sess_eval, FLAGS.cp_best_path)
     tf.logging.info('model saved best model to ' + self.max_save_path)
 
   def __save_model(self):
@@ -711,8 +691,6 @@ def __prune_rl(self): # pylint: disable=too-many-locals
                 strategy: {},
                 accuracy: {} and
                 pruned ratio: {}""".format(self.bestinfo[0], self.bestinfo[1], self.bestinfo[2]))
-        with self.pruner.model.g.as_default():
-          self.__save_best_pruned_model()
 
       tf.logging.info('automatic channl pruning time cost: {}s'.format(timer() - start))
 
diff --git a/learners/channel_pruning/model_wrapper.py b/learners/channel_pruning/model_wrapper.py
index 6e1b766..1bd51f9 100644
--- a/learners/channel_pruning/model_wrapper.py
+++ b/learners/channel_pruning/model_wrapper.py
@@ -267,13 +267,11 @@ def compute_layer_flops(self, op):
       if opname in self.flops:
         flops = self.flops[opname]
       else:
-        flops = tf_ops.get_stats_for_node_def(self.g,
-                                              op.node_def,
-                                              'flops').value
-        flops = flops / 2. / FLAGS.batch_size
-
+        flops = tf_ops.get_stats_for_node_def(self.g, op.node_def, 'flops').value
+        flops = flops / FLAGS.batch_size
         self.flops[opname] = flops
-      return flops
+
+    return flops
 
   def get_Add_if_is_first_after_resblock(self, op):
     """ check whether the input operation is first layer after sum
diff --git a/learners/channel_pruning_gpu/__init__.py b/learners/channel_pruning_gpu/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/learners/channel_pruning_gpu/learner.py b/learners/channel_pruning_gpu/learner.py
new file mode 100644
index 0000000..c68013f
--- /dev/null
+++ b/learners/channel_pruning_gpu/learner.py
@@ -0,0 +1,568 @@
+# Tencent is pleased to support the open source community by making PocketFlow available.
+#
+# Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Channel pruning learner with GPU-based optimization."""
+
+import os
+import re
+import math
+from timeit import default_timer as timer
+import numpy as np
+import tensorflow as tf
+
+from learners.abstract_learner import AbstractLearner
+from learners.distillation_helper import DistillationHelper
+from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
+
+FLAGS = tf.app.flags.FLAGS
+
+tf.app.flags.DEFINE_string('cpg_save_path', './models_cpg/model.ckpt', 'CPG: model\'s save path')
+tf.app.flags.DEFINE_string('cpg_save_path_eval', './models_cpg_eval/model.ckpt',
+                           'CPG: model\'s save path for evaluation')
+tf.app.flags.DEFINE_string('cpg_prune_ratio_type', 'uniform',
+                           'CPG: pruning ratio type (\'uniform\' OR \'list\')')
+tf.app.flags.DEFINE_float('cpg_prune_ratio', 0.5, 'CPG: uniform pruning ratio')
+tf.app.flags.DEFINE_boolean('cpg_skip_ht_layers', True, 'CPG: skip head & tail layers for pruning')
+tf.app.flags.DEFINE_string('cpg_prune_ratio_file', None,
+                           'CPG: file path to the list of pruning ratios')
+tf.app.flags.DEFINE_float('cpg_lrn_rate_pgd_init', 1e-10,
+                          'CPG: proximal gradient descent\'s initial learning rate')
+tf.app.flags.DEFINE_float('cpg_lrn_rate_pgd_incr', 1.4,
+                          'CPG: proximal gradient descent\'s learning rate\'s increase ratio')
+tf.app.flags.DEFINE_float('cpg_lrn_rate_pgd_decr', 0.7,
+                          'CPG: proximal gradient descent\'s learning rate\'s decrease ratio')
+tf.app.flags.DEFINE_float('cpg_lrn_rate_adam', 1e-2, 'CPG: Adam\'s initial learning rate')
+tf.app.flags.DEFINE_integer('cpg_nb_iters_layer', 1000, 'CPG: # of iterations for layer-wise FT')
+
+def get_vars_by_scope(scope):
+  """Get list of variables within certain name scope.
+
+  Args:
+  * scope: name scope
+
+  Returns:
+  * vars_dict: dictionary of lists of all, trainable, and maskable variables
+  """
+
+  vars_dict = {}
+  vars_dict['all'] = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=scope)
+  vars_dict['trainable'] = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope)
+  vars_dict['maskable'] = []
+  conv2d_pattern = re.compile(r'/Conv2D$')
+  conv2d_ops = get_ops_by_scope_n_pattern(scope, conv2d_pattern)
+  for var in vars_dict['trainable']:
+    for op in conv2d_ops:
+      for op_input in op.inputs:
+        if op_input.name == var.name.replace(':0', '/read:0'):
+          vars_dict['maskable'] += [var]
+
+  return vars_dict
+
+def get_ops_by_scope_n_pattern(scope, pattern):
+  """Get list of operations within a certain name scope that also match the pattern.
+
+  Args:
+  * scope: name scope
+  * pattern: name pattern to be matched
+
+  Returns:
+  * ops: list of operations
+  """
+
+  ops = []
+  for op in tf.get_default_graph().get_operations():
+    if op.name.startswith(scope) and re.search(pattern, op.name) is not None:
+      ops += [op]
+
+  return ops
+
+def calc_prune_ratio(vars_list):
+  """Calculate the overall pruning ratio for the given list of variables.
+
+  Args:
+  * vars_list: list of variables
+
+  Returns:
+  * prune_ratio: overall pruning ratio of the given list of variables
+  """
+
+  nb_params_nnz = tf.add_n([tf.count_nonzero(var) for var in vars_list])
+  nb_params_all = tf.add_n([tf.size(var) for var in vars_list])
+  prune_ratio = 1.0 - tf.cast(nb_params_nnz, tf.float32) / tf.cast(nb_params_all, tf.float32)
+
+  return prune_ratio
+
+class ChannelPrunedGpuLearner(AbstractLearner):  # pylint: disable=too-many-instance-attributes
+  """Channel pruning learner with GPU-based optimization."""
+
+  def __init__(self, sm_writer, model_helper):
+    """Constructor function.
+
+    Args:
+    * sm_writer: TensorFlow's summary writer
+    * model_helper: model helper with definitions of model & dataset
+    """
+
+    # class-independent initialization
+    super(ChannelPrunedGpuLearner, self).__init__(sm_writer, model_helper)
+
+    # define scopes for full & channel-pruned models
+    self.model_scope_full = 'model'
+    self.model_scope_prnd = 'pruned_model'
+
+    # download the pre-trained model
+    if self.is_primary_worker('local'):
+      self.download_model()  # pre-trained model is required
+    self.auto_barrier()
+    tf.logging.info('model files: ' + ', '.join(os.listdir('./models')))
+
+    # class-dependent initialization
+    if FLAGS.enbl_dst:
+      self.helper_dst = DistillationHelper(sm_writer, model_helper, self.mpi_comm)
+    self.__build_train()
+    self.__build_eval()
+
+  def train(self):
+    """Train a model and periodically produce checkpoint files."""
+
+    # restore the full model from pre-trained checkpoints
+    save_path = tf.train.latest_checkpoint(os.path.dirname(self.save_path_full))
+    self.saver_full.restore(self.sess_train, save_path)
+
+    # initialization
+    self.sess_train.run([self.init_op, self.init_opt_op])
+    self.sess_train.run([layer_op['init_opt'] for layer_op in self.layer_ops])
+    if FLAGS.enbl_multi_gpu:
+      self.sess_train.run(self.bcast_op)
+
+    # choose channels and evaluate the model before re-training
+    self.__choose_channels()
+    if self.is_primary_worker('global'):
+      self.__save_model(is_train=True)
+      self.evaluate()
+    self.auto_barrier()
+
+    # fine-tune the model with chosen channels only
+    time_prev = timer()
+    for idx_iter in range(self.nb_iters_train):
+      # train the model
+      if (idx_iter + 1) % FLAGS.summ_step != 0:
+        self.sess_train.run(self.train_op)
+      else:
+        __, summary, log_rslt = self.sess_train.run([self.train_op, self.summary_op, self.log_op])
+        if self.is_primary_worker('global'):
+          time_step = timer() - time_prev
+          self.__monitor_progress(summary, log_rslt, idx_iter, time_step)
+          time_prev = timer()
+
+      # save the model at certain steps
+      if self.is_primary_worker('global') and (idx_iter + 1) % FLAGS.save_step == 0:
+        self.__save_model(is_train=True)
+        self.evaluate()
+      self.auto_barrier()
+
+    # save the final model
+    if self.is_primary_worker('global'):
+      self.__save_model(is_train=True)
+      self.__restore_model(is_train=False)
+      self.__save_model(is_train=False)
+      self.evaluate()
+
+  def evaluate(self):
+    """Restore a model from the latest checkpoint files and then evaluate it."""
+
+    self.__restore_model(is_train=False)
+    nb_iters = int(np.ceil(float(FLAGS.nb_smpls_eval) / FLAGS.batch_size_eval))
+    eval_rslts = np.zeros((nb_iters, len(self.eval_op)))
+    self.dump_n_eval(outputs=None, action='init')
+    for idx_iter in range(nb_iters):
+      if (idx_iter + 1) % 100 == 0:
+        tf.logging.info('process the %d-th mini-batch for evaluation' % (idx_iter + 1))
+      eval_rslts[idx_iter], outputs = self.sess_eval.run([self.eval_op, self.outputs_eval])
+      self.dump_n_eval(outputs=outputs, action='dump')
+    self.dump_n_eval(outputs=None, action='eval')
+    for idx, name in enumerate(self.eval_op_names):
+      tf.logging.info('%s = %.4e' % (name, np.mean(eval_rslts[:, idx])))
+
+  def __build_train(self):  # pylint: disable=too-many-locals,too-many-statements
+    """Build the training graph."""
+
+    with tf.Graph().as_default():
+      # create a TF session for the current graph
+      config = tf.ConfigProto()
+      config.gpu_options.visible_device_list = str(mgw.local_rank() if FLAGS.enbl_multi_gpu else 0)  # pylint: disable=no-member
+      sess = tf.Session(config=config)
+
+      # data input pipeline
+      with tf.variable_scope(self.data_scope):
+        iterator = self.build_dataset_train()
+        images, labels = iterator.get_next()
+
+      # model definition - distilled model
+      if FLAGS.enbl_dst:
+        logits_dst = self.helper_dst.calc_logits(sess, images)
+
+      # model definition - full model
+      with tf.variable_scope(self.model_scope_full):
+        __ = self.forward_train(images)
+        self.vars_full = get_vars_by_scope(self.model_scope_full)
+        self.saver_full = tf.train.Saver(self.vars_full['all'])
+        self.save_path_full = FLAGS.save_path
+
+      # model definition - channel-pruned model
+      with tf.variable_scope(self.model_scope_prnd):
+        logits_prnd = self.forward_train(images)
+        self.vars_prnd = get_vars_by_scope(self.model_scope_prnd)
+        self.maskable_var_names = [var.name for var in self.vars_prnd['maskable']]
+        self.global_step = tf.train.get_or_create_global_step()
+        self.saver_prnd_train = tf.train.Saver(self.vars_prnd['all'] + [self.global_step])
+
+        # loss & extra evaluation metrics
+        loss, metrics = self.calc_loss(labels, logits_prnd, self.vars_prnd['trainable'])
+        if FLAGS.enbl_dst:
+          loss += self.helper_dst.calc_loss(logits_prnd, logits_dst)
+        tf.summary.scalar('loss', loss)
+        for key, value in metrics.items():
+          tf.summary.scalar(key, value)
+
+        # learning rate schedule
+        lrn_rate, self.nb_iters_train = self.setup_lrn_rate(self.global_step)
+
+        # overall pruning ratios of trainable & maskable variables
+        pr_trainable = calc_prune_ratio(self.vars_prnd['trainable'])
+        pr_maskable = calc_prune_ratio(self.vars_prnd['maskable'])
+        tf.summary.scalar('pr_trainable', pr_trainable)
+        tf.summary.scalar('pr_maskable', pr_maskable)
+
+        # create masks and corresponding operations for channel pruning
+        self.masks = []
+        self.mask_updt_ops = []
+        for idx, var in enumerate(self.vars_prnd['maskable']):
+          tf.logging.info('creating a pruning mask for {} of size {}'.format(var.name, var.shape))
+          name = '/'.join(var.name.split('/')[1:]).replace(':0', '_mask')
+          self.masks += [tf.get_variable(name, initializer=tf.ones(var.shape), trainable=False)]
+          var_norm = tf.sqrt(tf.reduce_sum(tf.square(var), axis=[0, 1, 3], keepdims=True))
+          mask_vec = tf.cast(var_norm > 0.0, tf.float32)
+          mask_new = tf.tile(mask_vec, [var.shape[0], var.shape[1], 1, var.shape[3]])
+          self.mask_updt_ops += [self.masks[-1].assign(mask_new)]
+
+        # build extra losses for regression & discrimination
+        self.reg_losses = self.__build_extra_losses()
+        self.nb_layers = len(self.reg_losses)
+        for idx, reg_loss in enumerate(self.reg_losses):
+          tf.summary.scalar('reg_loss_%d' % idx, reg_loss)
+
+        # obtain the full list of trainable variables & update operations
+        self.vars_all = tf.get_collection(
+          tf.GraphKeys.GLOBAL_VARIABLES, scope=self.model_scope_prnd)
+        self.trainable_vars_all = tf.get_collection(
+          tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.model_scope_prnd)
+        self.update_ops_all = tf.get_collection(
+          tf.GraphKeys.UPDATE_OPS, scope=self.model_scope_prnd)
+
+        # TF operations for initializing the channel-pruned model
+        init_ops = []
+        with tf.control_dependencies([tf.variables_initializer(self.vars_all)]):
+          for var_full, var_prnd in zip(self.vars_full['all'], self.vars_prnd['all']):
+            init_ops += [var_prnd.assign(var_full)]
+        init_ops += [self.global_step.initializer]
+        self.init_op = tf.group(init_ops)
+
+        # TF operations for layer-wise, block-wise, and whole-network fine-tuning
+        self.layer_ops, self.lrn_rates_pgd, self.prune_perctls = self.__build_layer_ops()
+        self.train_op, self.init_opt_op = self.__build_network_ops(loss, lrn_rate)
+
+      # TF operations for logging & summarizing
+      self.sess_train = sess
+      self.summary_op = tf.summary.merge_all()
+      self.log_op = [lrn_rate, loss, pr_trainable, pr_maskable] + list(metrics.values())
+      self.log_op_names = ['lr', 'loss', 'pr_trn', 'pr_msk'] + list(metrics.keys())
+      if FLAGS.enbl_multi_gpu:
+        self.bcast_op = mgw.broadcast_global_variables(0)
+
+  def __build_eval(self):
+    """Build the evaluation graph."""
+
+    with tf.Graph().as_default():
+      # create a TF session for the current graph
+      config = tf.ConfigProto()
+      config.gpu_options.visible_device_list = str(mgw.local_rank() if FLAGS.enbl_multi_gpu else 0)  # pylint: disable=no-member
+      self.sess_eval = tf.Session(config=config)
+
+      # data input pipeline
+      with tf.variable_scope(self.data_scope):
+        iterator = self.build_dataset_eval()
+        images, labels = iterator.get_next()
+
+      # model definition - distilled model
+      if FLAGS.enbl_dst:
+        logits_dst = self.helper_dst.calc_logits(self.sess_eval, images)
+
+      # model definition - channel-pruned model
+      with tf.variable_scope(self.model_scope_prnd):
+        logits = self.forward_eval(images)
+        vars_prnd = get_vars_by_scope(self.model_scope_prnd)
+        global_step = tf.train.get_or_create_global_step()
+        self.saver_prnd_eval = tf.train.Saver(vars_prnd['all'] + [global_step])
+
+        # loss & extra evaluation metrics
+        loss, metrics = self.calc_loss(labels, logits, vars_prnd['trainable'])
+        if FLAGS.enbl_dst:
+          loss += self.helper_dst.calc_loss(logits, logits_dst)
+
+        # overall pruning ratios of trainable & maskable variables
+        pr_trainable = calc_prune_ratio(vars_prnd['trainable'])
+        pr_maskable = calc_prune_ratio(vars_prnd['maskable'])
+
+        # TF operations for evaluation
+        self.eval_op = [loss, pr_trainable, pr_maskable] + list(metrics.values())
+        self.eval_op_names = ['loss', 'pr_trn', 'pr_msk'] + list(metrics.keys())
+        self.outputs_eval = logits
+
+      # add input & output tensors to certain collections
+      tf.add_to_collection('images_final', images)
+      tf.add_to_collection('logits_final', logits)
+
+  def __build_extra_losses(self):
+    """Build extra losses for regression.
+
+    Returns:
+    * reg_losses: list of regression losses (one per layer)
+    """
+
+    # insert additional losses to intermediate layers
+    pattern = re.compile('Conv2D$')
+    core_ops_full = get_ops_by_scope_n_pattern(self.model_scope_full, pattern)
+    core_ops_prnd = get_ops_by_scope_n_pattern(self.model_scope_prnd, pattern)
+    reg_losses = []
+    for core_op_full, core_op_prnd in zip(core_ops_full, core_ops_prnd):
+      reg_losses += [tf.nn.l2_loss(core_op_full.outputs[0] - core_op_prnd.outputs[0])]
+
+    return reg_losses
+
+  def __build_layer_ops(self):
+    """Build layer-wise fine-tuning operations.
+
+    Returns:
+    * layer_ops: list of training and initialization operations for each layer
+    * lrn_rates_pgd: list of layer-wise learning rates
+    * prune_perctls: list of layer-wise pruning percentiles
+    """
+
+    layer_ops = []
+    lrn_rates_pgd = []  # list of layer-wise learning rates
+    prune_perctls = []  # list of layer-wise pruning percentiles
+    for idx, var_prnd in enumerate(self.vars_prnd['maskable']):
+      # create placeholders
+      lrn_rate_pgd = tf.placeholder(tf.float32, shape=[], name='lrn_rate_pgd_%d' % idx)
+      prune_perctl = tf.placeholder(tf.float32, shape=[], name='prune_perctl_%d' % idx)
+
+      # select channels for the current convolutional layer
+      optimizer = tf.train.GradientDescentOptimizer(lrn_rate_pgd)
+      if FLAGS.enbl_multi_gpu:
+        optimizer = mgw.DistributedOptimizer(optimizer)
+      grads = optimizer.compute_gradients(self.reg_losses[idx], [var_prnd])
+      with tf.control_dependencies(self.update_ops_all):
+        var_prnd_new = var_prnd - lrn_rate_pgd * grads[0][0]
+        var_norm = tf.sqrt(tf.reduce_sum(tf.square(var_prnd_new), axis=[0, 1, 3], keepdims=True))
+        threshold = tf.contrib.distributions.percentile(var_norm, prune_perctl)
+        shrk_vec = tf.maximum(1.0 - threshold / var_norm, 0.0)
+        prune_op = var_prnd.assign(var_prnd_new * shrk_vec)
+
+      # fine-tune with selected channels only
+      optimizer_base = tf.train.AdamOptimizer(FLAGS.cpg_lrn_rate_adam)
+      if not FLAGS.enbl_multi_gpu:
+        optimizer = optimizer_base
+      else:
+        optimizer = mgw.DistributedOptimizer(optimizer_base)
+      grads_origin = optimizer.compute_gradients(self.reg_losses[idx], [var_prnd])
+      grads_pruned = self.__calc_grads_pruned(grads_origin)
+      with tf.control_dependencies(self.update_ops_all):
+        finetune_op = optimizer.apply_gradients(grads_pruned)
+      init_opt_op = tf.variables_initializer(optimizer_base.variables())
+
+      # append layer-wise operations & variables
+      layer_ops += [{'prune': prune_op, 'finetune': finetune_op, 'init_opt': init_opt_op}]
+      lrn_rates_pgd += [lrn_rate_pgd]
+      prune_perctls += [prune_perctl]
+
+    return layer_ops, lrn_rates_pgd, prune_perctls
+
+  def __build_network_ops(self, loss, lrn_rate):
+    """Build network training operations.
+
+    Returns:
+    * train_op: training operation of the whole network
+    * init_opt_op: initialization operation of the whole network's optimizer
+    """
+
+    optimizer_base = tf.train.MomentumOptimizer(lrn_rate, FLAGS.momentum)
+    if not FLAGS.enbl_multi_gpu:
+      optimizer = optimizer_base
+    else:
+      optimizer = mgw.DistributedOptimizer(optimizer_base)
+    grads_origin = optimizer.compute_gradients(loss, self.trainable_vars_all)
+    grads_pruned = self.__calc_grads_pruned(grads_origin)
+    with tf.control_dependencies(self.update_ops_all):
+      train_op = optimizer.apply_gradients(grads_pruned, global_step=self.global_step)
+    init_opt_op = tf.variables_initializer(optimizer_base.variables())
+
+    return train_op, init_opt_op
+
+  def __calc_grads_pruned(self, grads_origin):
+    """Calculate the mask-pruned gradients.
+
+    Args:
+    * grads_origin: list of original gradients
+
+    Returns:
+    * grads_pruned: list of mask-pruned gradients
+    """
+
+    grads_pruned = []
+    for grad in grads_origin:
+      if grad[1].name not in self.maskable_var_names:
+        grads_pruned += [grad]
+      else:
+        idx_mask = self.maskable_var_names.index(grad[1].name)
+        grads_pruned += [(grad[0] * self.masks[idx_mask], grad[1])]
+
+    return grads_pruned
+
+  def __choose_channels(self):  # pylint: disable=too-many-locals
+    """Choose channels for all convolutional layers."""
+
+    # obtain each layer's pruning ratio
+    if FLAGS.cpg_prune_ratio_type == 'uniform':
+      ratio_list = [FLAGS.cpg_prune_ratio] * self.nb_layers
+      if FLAGS.cpg_skip_ht_layers:
+        ratio_list[0] = 0.0
+        ratio_list[-1] = 0.0
+    elif FLAGS.cpg_prune_ratio_type == 'list':
+      with open(FLAGS.cpg_prune_ratio_file, 'r') as i_file:
+        i_line = i_file.readline().strip()
+        ratio_list = [float(sub_str) for sub_str in i_line.split(',')]
+    else:
+      raise ValueError('unrecognized pruning ratio type: ' + FLAGS.cpg_prune_ratio_type)
+
+    # select channels for all convolutional layers
+    nb_workers = mgw.size() if FLAGS.enbl_multi_gpu else 1
+    nb_iters_layer = int(FLAGS.cpg_nb_iters_layer / nb_workers)
+    for idx_layer in range(self.nb_layers):
+      # skip if no pruning is required
+      if ratio_list[idx_layer] == 0.0:
+        continue
+      if self.is_primary_worker('global'):
+        tf.logging.info('layer #%d: pr = %.2f (target)' % (idx_layer, ratio_list[idx_layer]))
+        tf.logging.info('mask.shape = {}'.format(self.masks[idx_layer].shape))
+
+      # select channels for the current convolutional layer
+      time_prev = timer()
+      reg_loss_prev = 0.0
+      lrn_rate_pgd = FLAGS.cpg_lrn_rate_pgd_init
+      for idx_iter in range(nb_iters_layer):
+        # take a stochastic proximal gradient descent step
+        prune_perctl = ratio_list[idx_layer] * 100.0 * (idx_iter + 1) / nb_iters_layer
+        __, reg_loss = self.sess_train.run(
+          [self.layer_ops[idx_layer]['prune'], self.reg_losses[idx_layer]],
+          feed_dict={self.lrn_rates_pgd[idx_layer]: lrn_rate_pgd,
+                     self.prune_perctls[idx_layer]: prune_perctl})
+        mask = self.sess_train.run(self.masks[idx_layer])
+        if self.is_primary_worker('global'):
+          nb_chns_nnz = np.count_nonzero(np.sum(mask, axis=(0, 1, 3)))
+          tf.logging.info('iter %d: nnz-chns = %d | loss = %.2e | lr = %.2e | percentile = %.2f'
+                          % (idx_iter + 1, nb_chns_nnz, reg_loss, lrn_rate_pgd, prune_perctl))
+
+        # adjust the learning rate
+        if reg_loss < reg_loss_prev:
+          lrn_rate_pgd *= FLAGS.cpg_lrn_rate_pgd_incr
+        else:
+          lrn_rate_pgd *= FLAGS.cpg_lrn_rate_pgd_decr
+        reg_loss_prev = reg_loss
+
+      # fine-tune with selected channels only
+      self.sess_train.run(self.mask_updt_ops[idx_layer])
+      for idx_iter in range(nb_iters_layer):
+        __, reg_loss = self.sess_train.run(
+          [self.layer_ops[idx_layer]['finetune'], self.reg_losses[idx_layer]])
+        mask = self.sess_train.run(self.masks[idx_layer])
+        if self.is_primary_worker('global'):
+          nb_chns_nnz = np.count_nonzero(np.sum(mask, axis=(0, 1, 3)))
+          tf.logging.info('iter %d: nnz-chns = %d | loss = %.2e'
+                          % (idx_iter + 1, nb_chns_nnz, reg_loss))
+
+      # re-compute the pruning ratio
+      mask_vec = np.mean(np.square(self.sess_train.run(self.masks[idx_layer])), axis=(0, 1, 3))
+      prune_ratio = 1.0 - float(np.count_nonzero(mask_vec)) / mask_vec.size
+      if self.is_primary_worker('global'):
+        tf.logging.info('layer #%d: pr = %.2f (actual) | time = %.2f'
+                        % (idx_layer, prune_ratio, timer() - time_prev))
+
+    # compute overall pruning ratios
+    if self.is_primary_worker('global'):
+      log_rslt = self.sess_train.run(self.log_op)
+      tf.logging.info(' | '.join(['%s = %.4e' % (name, value)
+                                  for name, value in zip(self.log_op_names, log_rslt)]))
+
+  def __save_model(self, is_train):
+    """Save the current model for training or evaluation.
+
+    Args:
+    * is_train: whether to save a model for training
+    """
+
+    if is_train:
+      save_path = self.saver_prnd_train.save(self.sess_train, FLAGS.cpg_save_path, self.global_step)
+    else:
+      save_path = self.saver_prnd_eval.save(self.sess_eval, FLAGS.cpg_save_path_eval)
+    tf.logging.info('model saved to ' + save_path)
+
+  def __restore_model(self, is_train):
+    """Restore a model from the latest checkpoint files.
+
+    Args:
+    * is_train: whether to restore a model for training
+    """
+
+    save_path = tf.train.latest_checkpoint(os.path.dirname(FLAGS.cpg_save_path))
+    if is_train:
+      self.saver_prnd_train.restore(self.sess_train, save_path)
+    else:
+      self.saver_prnd_eval.restore(self.sess_eval, save_path)
+    tf.logging.info('model restored from ' + save_path)
+
+  def __monitor_progress(self, summary, log_rslt, idx_iter, time_step):
+    """Monitor the training progress.
+
+    Args:
+    * summary: summary protocol buffer
+    * log_rslt: logging operations' results
+    * idx_iter: index of the training iteration
+    * time_step: time step between two summary operations
+    """
+
+    # write summaries for TensorBoard visualization
+    self.sm_writer.add_summary(summary, idx_iter)
+
+    # compute the training speed
+    speed = FLAGS.batch_size * FLAGS.summ_step / time_step
+    if FLAGS.enbl_multi_gpu:
+      speed *= mgw.size()
+
+    # display monitored statistics
+    log_str = ' | '.join(['%s = %.4e' % (name, value)
+                          for name, value in zip(self.log_op_names, log_rslt)])
+    tf.logging.info('iter #%d: %s | speed = %.2f pics / sec' % (idx_iter + 1, log_str, speed))
diff --git a/learners/channel_pruning_rmt/__init__.py b/learners/channel_pruning_rmt/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/learners/channel_pruning_rmt/learner.py b/learners/channel_pruning_rmt/learner.py
new file mode 100644
index 0000000..85ac74c
--- /dev/null
+++ b/learners/channel_pruning_rmt/learner.py
@@ -0,0 +1,892 @@
+# Tencent is pleased to support the open source community by making PocketFlow available.
+#
+# Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Channel pruning learner - remastered."""
+
+import os
+import re
+import math
+from timeit import default_timer as timer
+import numpy as np
+from scipy.linalg import norm
+import tensorflow as tf
+
+from learners.abstract_learner import AbstractLearner
+from learners.distillation_helper import DistillationHelper
+from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
+
+FLAGS = tf.app.flags.FLAGS
+
+tf.app.flags.DEFINE_string('cpr_save_path', './models_cpr/model.ckpt', 'CPR: model\'s save path')
+tf.app.flags.DEFINE_string('cpr_save_path_eval', './models_cpr_eval/model.ckpt',
+                           'CPR: model\'s save path for evaluation')
+tf.app.flags.DEFINE_string('cpr_save_path_ws', './models_cpr_ws/model.ckpt',
+                           'CPR: model\'s save path for warm start')
+tf.app.flags.DEFINE_float('cpr_prune_ratio', 0.5, 'CPR: pruning ratio')
+tf.app.flags.DEFINE_boolean('cpr_skip_frst_layer', True, 'CPR: skip the first layer for pruning')
+tf.app.flags.DEFINE_boolean('cpr_skip_last_layer', False, 'CPR: skip the last layer for pruning')
+tf.app.flags.DEFINE_string('cpr_skip_op_names', None,
+                           'CPR: comma-separated Conv2D operation names to be skipped')
+tf.app.flags.DEFINE_integer('cpr_nb_smpls', 5000,
+                            'CPR: # of cached training samples for channel pruning')
+tf.app.flags.DEFINE_integer('cpr_nb_crops_per_smpl', 10, 'CPR: # of random crops per sample')
+tf.app.flags.DEFINE_float('cpr_ista_lrn_rate', 1e-2, 'CPR: ISTA\'s learning rate')
+tf.app.flags.DEFINE_integer('cpr_ista_nb_iters', 100, 'CPR: # of iterations in ISTA')
+tf.app.flags.DEFINE_float('cpr_lstsq_lrn_rate', 1e-3, 'CPR: least-square regression\'s learning rate')
+tf.app.flags.DEFINE_integer('cpr_lstsq_nb_iters', 100, 'CPR: # of iterations in least-square regression')
+tf.app.flags.DEFINE_boolean('cpr_warm_start', False,
+                            'CPR: use a channel-pruned model for warm start '
+                            '(the channel selection process will be skipped)')
+
+def get_vars_by_scope(scope):
+  """Get lists of variables within the given name scope.
+
+  Args:
+  * scope: name scope
+
+  Returns:
+  * vars_dict: dictionary of lists of all, trainable, and convolutional kernel variables
+  """
+
+  vars_dict = {}
+  vars_dict['all'] = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=scope)
+  vars_dict['trainable'] = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope)
+  vars_dict['conv_krnl'] = []
+  conv2d_pattern = re.compile(r'/Conv2D$')
+  conv2d_ops = get_ops_by_scope_n_pattern(scope, conv2d_pattern)
+  for var in vars_dict['trainable']:
+    for op in conv2d_ops:
+      for op_input in op.inputs:
+        if op_input.name == var.name.replace(':0', '/read:0'):
+          vars_dict['conv_krnl'] += [var]
+          break
+
+  return vars_dict
+
+def get_ops_by_scope_n_pattern(scope, pattern):
+  """Get the list of operations within the given name scope whose names match the pattern.
+
+  Args:
+  * scope: name scope
+  * pattern: name pattern to be matched
+
+  Returns:
+  * ops: list of operations
+  """
+
+  ops = []
+  for op in tf.get_default_graph().get_operations():
+    if op.name.startswith(scope) and re.search(pattern, op.name) is not None:
+      ops += [op]
+
+  return ops
+
+def calc_prune_ratio(vars_list):
+  """Calculate the overall pruning ratio for the given list of variables.
+
+  Args:
+  * vars_list: list of variables
+
+  Returns:
+  * prune_ratio: overall pruning ratio of the given list of variables
+  """
+
+  nb_params_nnz = tf.add_n([tf.count_nonzero(var) for var in vars_list])
+  nb_params_all = tf.add_n([tf.size(var) for var in vars_list])
+  prune_ratio = 1.0 - tf.cast(nb_params_nnz, tf.float32) / tf.cast(nb_params_all, tf.float32)
+
+  return prune_ratio
+
+class ChannelPrunedRmtLearner(AbstractLearner):  # pylint: disable=too-many-instance-attributes
+  """Channel pruning learner - remastered."""
+
+  def __init__(self, sm_writer, model_helper):
+    """Constructor function.
+
+    Args:
+    * sm_writer: TensorFlow's summary writer
+    * model_helper: model helper with definitions of model & dataset
+    """
+
+    # class-independent initialization
+    super(ChannelPrunedRmtLearner, self).__init__(sm_writer, model_helper)
+
+    # define scopes for full & channel-pruned models
+    self.model_scope_full = 'model'
+    self.model_scope_prnd = 'pruned_model'
+
+    # download the pre-trained model
+    if self.is_primary_worker('local'):
+      self.download_model()  # pre-trained model is required
+    self.auto_barrier()
+    tf.logging.info('model files: ' + ', '.join(os.listdir('./models')))
+
+    # class-dependent initialization
+    if FLAGS.enbl_dst:
+      self.helper_dst = DistillationHelper(sm_writer, model_helper, self.mpi_comm)
+    self.__build_train()
+    self.__build_eval()
+
+    # build the channel pruning graph
+    self.__build_prune()
+
+  def train(self):
+    """Train a model and periodically produce checkpoint files."""
+
+    # choose channels or directly load a pre-pruned model as warm-start
+    if not FLAGS.cpr_warm_start:
+      time_prev = timer()
+      self.__choose_channels()
+      tf.logging.info('time (channel selection): %.2f (s)' % (timer() - time_prev))
+    save_path = tf.train.latest_checkpoint(os.path.dirname(FLAGS.cpr_save_path_ws))
+    self.saver_prnd_train.restore(self.sess_train, save_path)
+    tf.logging.info('model restored from ' + save_path)
+
+    # initialize all the remaining variables and then broadcast
+    self.sess_train.run(self.init_op)
+    if FLAGS.enbl_multi_gpu:
+      self.sess_train.run(self.bcast_op)
+
+    # evaluate the model before fine-tuning
+    if self.is_primary_worker('global'):
+      self.__save_model(is_train=True)
+      self.evaluate()
+    self.auto_barrier()
+
+    # fine-tune the model with chosen channels only
+    time_prev = timer()
+    for idx_iter in range(self.nb_iters_train):
+      # train the model
+      if (idx_iter + 1) % FLAGS.summ_step != 0:
+        self.sess_train.run(self.train_op)
+      else:
+        __, summary, log_rslt = self.sess_train.run([self.train_op, self.summary_op, self.log_op])
+        if self.is_primary_worker('global'):
+          time_step = timer() - time_prev
+          self.__monitor_progress(summary, log_rslt, idx_iter, time_step)
+          time_prev = timer()
+
+      # save the model at certain steps
+      if self.is_primary_worker('global') and (idx_iter + 1) % FLAGS.save_step == 0:
+        self.__save_model(is_train=True)
+        self.evaluate()
+      self.auto_barrier()
+
+    # save the final model
+    if self.is_primary_worker('global'):
+      self.__save_model(is_train=True)
+      self.__restore_model(is_train=False)
+      self.__save_model(is_train=False)
+      self.evaluate()
+
+  def evaluate(self):
+    """Restore a model from the latest checkpoint files and then evaluate it."""
+
+    self.__restore_model(is_train=False)
+    nb_iters = int(np.ceil(float(FLAGS.nb_smpls_eval) / FLAGS.batch_size_eval))
+    eval_rslts = np.zeros((nb_iters, len(self.eval_op)))
+    self.dump_n_eval(outputs=None, action='init')
+    for idx_iter in range(nb_iters):
+      if (idx_iter + 1) % 100 == 0:
+        tf.logging.info('process the %d-th mini-batch for evaluation' % (idx_iter + 1))
+      eval_rslts[idx_iter], outputs = self.sess_eval.run([self.eval_op, self.outputs_eval])
+      self.dump_n_eval(outputs=outputs, action='dump')
+    self.dump_n_eval(outputs=None, action='eval')
+    for idx, name in enumerate(self.eval_op_names):
+      tf.logging.info('%s = %.4e' % (name, np.mean(eval_rslts[:, idx])))
+
+  def __build_train(self):  # pylint: disable=too-many-locals,too-many-statements
+    """Build the training graph."""
+
+    with tf.Graph().as_default():
+      # create a TF session for the current graph
+      config = tf.ConfigProto()
+      config.gpu_options.allow_growth = True  # pylint: disable=no-member
+      config.gpu_options.visible_device_list = \
+        str(mgw.local_rank() if FLAGS.enbl_multi_gpu else 0)  # pylint: disable=no-member
+      sess = tf.Session(config=config)
+
+      # data input pipeline
+      with tf.variable_scope(self.data_scope):
+        iterator = self.build_dataset_train()
+        images, labels = iterator.get_next()
+
+      # model definition - distilled model
+      if FLAGS.enbl_dst:
+        logits_dst = self.helper_dst.calc_logits(sess, images)
+
+      # model definition - channel-pruned model
+      with tf.variable_scope(self.model_scope_prnd):
+        logits_prnd = self.forward_train(images)
+        self.vars_prnd = get_vars_by_scope(self.model_scope_prnd)
+        self.global_step = tf.train.get_or_create_global_step()
+        self.saver_prnd_train = tf.train.Saver(self.vars_prnd['all'] + [self.global_step])
+
+        # loss & extra evaluation metrics
+        loss, metrics = self.calc_loss(labels, logits_prnd, self.vars_prnd['trainable'])
+        if FLAGS.enbl_dst:
+          loss += self.helper_dst.calc_loss(logits_prnd, logits_dst)
+        tf.summary.scalar('loss', loss)
+        for key, value in metrics.items():
+          tf.summary.scalar(key, value)
+
+        # learning rate schedule
+        lrn_rate, self.nb_iters_train = self.setup_lrn_rate(self.global_step)
+
+        # calculate pruning ratios
+        pr_trainable = calc_prune_ratio(self.vars_prnd['trainable'])
+        pr_conv_krnl = calc_prune_ratio(self.vars_prnd['conv_krnl'])
+        tf.summary.scalar('pr_trainable', pr_trainable)
+        tf.summary.scalar('pr_conv_krnl', pr_conv_krnl)
+
+        # create masks and corresponding operations for channel pruning
+        self.masks = []
+        for idx, var in enumerate(self.vars_prnd['conv_krnl']):
+          tf.logging.info('creating a pruning mask for {} of size {}'.format(var.name, var.shape))
+          mask_name = '/'.join(var.name.split('/')[1:]).replace(':0', '_mask')
+          var_norm = tf.reduce_sum(tf.square(var), axis=[0, 1, 3], keepdims=True)
+          mask_init = tf.cast(var_norm > 0.0, tf.float32)
+          mask = tf.get_variable(mask_name, initializer=mask_init, trainable=False)
+          self.masks += [mask]
+
+        # optimizer & gradients
+        optimizer_base = tf.train.MomentumOptimizer(lrn_rate, FLAGS.momentum)
+        if not FLAGS.enbl_multi_gpu:
+          optimizer = optimizer_base
+        else:
+          optimizer = mgw.DistributedOptimizer(optimizer_base)
+        grads_origin = optimizer.compute_gradients(loss, self.vars_prnd['trainable'])
+        grads_pruned = self.__calc_grads_pruned(grads_origin)
+        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, scope=self.model_scope_prnd)
+        with tf.control_dependencies(update_ops):
+          self.train_op = optimizer.apply_gradients(grads_pruned, global_step=self.global_step)
+
+      # TF operations for logging & summarizing
+      self.sess_train = sess
+      self.summary_op = tf.summary.merge_all()
+      self.init_op = tf.group(
+        tf.variables_initializer([self.global_step] + self.masks + optimizer_base.variables()))
+      self.log_op = [lrn_rate, loss, pr_trainable, pr_conv_krnl] + list(metrics.values())
+      self.log_op_names = ['lr', 'loss', 'pr_trn', 'pr_krn'] + list(metrics.keys())
+      if FLAGS.enbl_multi_gpu:
+        self.bcast_op = mgw.broadcast_global_variables(0)
+
+  def __build_eval(self):
+    """Build the evaluation graph."""
+
+    with tf.Graph().as_default():
+      # create a TF session for the current graph
+      config = tf.ConfigProto()
+      config.gpu_options.allow_growth = True  # pylint: disable=no-member
+      config.gpu_options.visible_device_list = \
+        str(mgw.local_rank() if FLAGS.enbl_multi_gpu else 0)  # pylint: disable=no-member
+      self.sess_eval = tf.Session(config=config)
+
+      # data input pipeline
+      with tf.variable_scope(self.data_scope):
+        iterator = self.build_dataset_eval()
+        images, labels = iterator.get_next()
+
+      # model definition - distilled model
+      if FLAGS.enbl_dst:
+        logits_dst = self.helper_dst.calc_logits(self.sess_eval, images)
+
+      # model definition - channel-pruned model
+      with tf.variable_scope(self.model_scope_prnd):
+        logits = self.forward_eval(images)
+        vars_prnd = get_vars_by_scope(self.model_scope_prnd)
+        global_step = tf.train.get_or_create_global_step()
+        self.saver_prnd_eval = tf.train.Saver(vars_prnd['all'] + [global_step])
+
+        # loss & extra evaluation metrics
+        loss, metrics = self.calc_loss(labels, logits, vars_prnd['trainable'])
+        if FLAGS.enbl_dst:
+          loss += self.helper_dst.calc_loss(logits, logits_dst)
+
+        # calculate pruning ratios
+        pr_trainable = calc_prune_ratio(vars_prnd['trainable'])
+        pr_conv_krnl = calc_prune_ratio(vars_prnd['conv_krnl'])
+
+        # TF operations for evaluation
+        self.eval_op = [loss, pr_trainable, pr_conv_krnl] + list(metrics.values())
+        self.eval_op_names = ['loss', 'pr_trn', 'pr_krn'] + list(metrics.keys())
+        self.outputs_eval = logits
+
+      # add input & output tensors to certain collections
+      tf.add_to_collection('images_final', images)
+      tf.add_to_collection('logits_final', logits)
+
+  def __build_prune(self):
+    """Build the channel pruning graph."""
+
+    with tf.Graph().as_default():
+      # create a TF session for the current graph
+      config = tf.ConfigProto()
+      config.gpu_options.allow_growth = True  # pylint: disable=no-member
+      config.gpu_options.visible_device_list = \
+        str(mgw.local_rank() if FLAGS.enbl_multi_gpu else 0)  # pylint: disable=no-member
+      sess = tf.Session(config=config)
+
+      # data input pipeline
+      with tf.variable_scope(self.data_scope):
+        iterator = self.build_dataset_train()
+        images, labels = iterator.get_next()
+        if not isinstance(images, dict):
+          images_ph = tf.placeholder(tf.float32, shape=images.shape, name='images_ph')
+        else:
+          images_ph = {}
+          for key, value in images.items():
+            images_ph[key] = tf.placeholder(value.dtype, shape=value.shape, name=(key + '_ph'))
+
+      # restore a pre-trained model as full model
+      with tf.variable_scope(self.model_scope_full):
+        __ = self.forward_train(images_ph)
+        vars_full = get_vars_by_scope(self.model_scope_full)
+        saver_full = tf.train.Saver(vars_full['all'])
+        saver_full.restore(sess, tf.train.latest_checkpoint(os.path.dirname(FLAGS.save_path)))
+
+      # restore a pre-trained model as channel-pruned model
+      with tf.variable_scope(self.model_scope_prnd):
+        logits_prnd = self.forward_train(images_ph)
+        vars_prnd = get_vars_by_scope(self.model_scope_prnd)
+        global_step = tf.train.get_or_create_global_step()
+        saver_prnd = tf.train.Saver(vars_prnd['all'] + [global_step])
+
+        # loss & extra evaluation metrics
+        loss, metrics = self.calc_loss(labels, logits_prnd, vars_prnd['trainable'])
+
+        # calculate pruning ratios
+        pr_trainable = calc_prune_ratio(vars_prnd['trainable'])
+        pr_conv_krnl = calc_prune_ratio(vars_prnd['conv_krnl'])
+
+        # use full model's weights to initialize channel-pruned model
+        init_ops = [global_step.initializer]
+        for var_full, var_prnd in zip(vars_full['all'], vars_prnd['all']):
+          init_ops += [var_prnd.assign(var_full)]
+        self.init_op_prune = tf.group(init_ops)
+
+      # build a list of Conv2D operation's information
+      self.conv_info_list = self.__build_conv_info_list(vars_prnd['conv_krnl'])
+
+      # build meta LASSO/least-square optimization problems
+      self.meta_lasso = self.__build_meta_lasso()
+      self.meta_lstsq = self.__build_meta_lstsq()
+
+      # TF operations for logging & summarizing
+      self.sess_prune = sess
+      self.images_prune = images
+      self.images_prune_ph = images_ph
+      self.saver_prune = saver_prnd
+      self.pr_trn_prune = pr_trainable
+      self.pr_krn_prune = pr_conv_krnl
+
+  def __build_conv_info_list(self, conv_krnls_prnd):
+    """Build a list of Conv2D operation's information.
+
+    Args:
+    * conv_krnls_prnd: list of convolutional kernels in the channel-pruned model
+
+    Returns:
+    * conv_info_list: list of dicts (one per Conv2D op): kernels, inputs, outputs, strides, padding
+    """
+
+    # find all the Conv2D operations
+    pattern = re.compile(r'/Conv2D$')
+    conv_ops_full = get_ops_by_scope_n_pattern(self.model_scope_full, pattern)
+    conv_ops_prnd = get_ops_by_scope_n_pattern(self.model_scope_prnd, pattern)
+
+    # build a list of Conv2D operation's information
+    conv_info_list = []
+    for idx_layer, (conv_op_full, conv_op_prnd) in enumerate(zip(conv_ops_full, conv_ops_prnd)):
+      conv_krnl_prnd = conv_krnls_prnd[idx_layer]
+      conv_krnl_prnd_ph = tf.placeholder(
+        tf.float32, shape=conv_krnl_prnd.shape, name='conv_krnl_prnd_ph_%d' % idx_layer)
+      conv_info_list += [{
+        'conv_krnl_full': conv_op_full.inputs[1],
+        'conv_krnl_prnd': conv_op_prnd.inputs[1],
+        'conv_krnl_prnd_ph': conv_krnl_prnd_ph,
+        'update_op': conv_krnl_prnd.assign(conv_krnl_prnd_ph),
+        'input_full': conv_op_full.inputs[0],
+        'input_prnd': conv_op_prnd.inputs[0],
+        'output_full': conv_op_full.outputs[0],
+        'output_prnd': conv_op_prnd.outputs[0],
+        'strides': conv_op_full.get_attr('strides'),
+        'padding': conv_op_full.get_attr('padding').decode('utf-8'),
+      }]
+
+    return conv_info_list
+
+  def __build_meta_lasso(self):
+    """Build a meta LASSO optimization problem."""
+
+    # build a meta LASSO optimization problem
+    with tf.variable_scope('meta_lasso'):
+      # create placeholders to customize the LASSO problem
+      xt_x_ph = tf.placeholder(tf.float32, name='xt_x_ph')
+      xt_y_ph = tf.placeholder(tf.float32, name='xt_y_ph')
+      mask_ph = tf.placeholder(tf.float32, name='mask_ph')
+      gamma = tf.placeholder(tf.float32, shape=[], name='gamma')
+
+      # create variables
+      xt_x = tf.get_variable('xt_x', initializer=xt_x_ph, trainable=False, validate_shape=False)
+      xt_y = tf.get_variable('xt_y', initializer=xt_y_ph, trainable=False, validate_shape=False)
+      mask = tf.get_variable('mask', initializer=mask_ph, trainable=True, validate_shape=False)
+
+      # TF operations
+      def prox_mapping(x, thres):
+        return tf.where(x > thres, x - thres, tf.where(x < -thres, x + thres, tf.zeros_like(x)))
+      mask_gd = mask - FLAGS.cpr_ista_lrn_rate * (tf.matmul(xt_x, mask) - xt_y)
+      train_op = mask.assign(prox_mapping(mask_gd, gamma * FLAGS.cpr_ista_lrn_rate))
+      init_op = tf.variables_initializer([xt_x, xt_y, mask])
+
+    # pack placeholders, variables, and TF operations into dict
+    meta_lasso = {
+      'xt_x_ph': xt_x_ph,
+      'xt_y_ph': xt_y_ph,
+      'mask_ph': mask_ph,
+      'gamma': gamma,
+      'xt_x': xt_x,
+      'xt_y': xt_y,
+      'mask': mask,
+      'init_op': init_op,
+      'train_op': train_op,
+    }
+
+    return meta_lasso
+
+  def __build_meta_lstsq(self):
+    """Build a meta least-squares optimization problem."""
+
+    # build a meta least-squares optimization problem
+    beta1 = 0.9
+    beta2 = 0.999
+    epsilon = 1e-8
+    with tf.variable_scope('meta_lstsq'):
+      # create placeholders to customize the least-squares problem
+      x_mat_ph = tf.placeholder(tf.float32, name='x_mat_ph')
+      y_mat_ph = tf.placeholder(tf.float32, name='y_mat_ph')
+      w_mat_ph = tf.placeholder(tf.float32, name='w_mat_ph')
+      gacc1_ph = tf.placeholder(tf.float32, name='gacc1_ph')
+      gacc2_ph = tf.placeholder(tf.float32, name='gacc2_ph')
+
+      # create variables
+      x_mat = tf.get_variable('x_mat', initializer=x_mat_ph, validate_shape=False)
+      y_mat = tf.get_variable('y_mat', initializer=y_mat_ph, validate_shape=False)
+      w_mat = tf.get_variable('w_mat', initializer=w_mat_ph, validate_shape=False)
+      gacc1 = tf.get_variable('gacc1', initializer=gacc1_ph, validate_shape=False)
+      gacc2 = tf.get_variable('gacc2', initializer=gacc2_ph, validate_shape=False)
+      train_step = tf.get_variable('train_step', shape=[], initializer=tf.zeros_initializer)
+
+      # TF operations
+      nb_smpls = tf.cast(tf.shape(x_mat)[0], tf.float32)
+      loss_reg = tf.nn.l2_loss(tf.matmul(x_mat, w_mat) - y_mat) / nb_smpls
+      loss_dcy = FLAGS.loss_w_dcy * tf.nn.l2_loss(w_mat)
+      grad = tf.matmul(tf.transpose(x_mat), tf.matmul(x_mat, w_mat) - y_mat) / nb_smpls \
+        + FLAGS.loss_w_dcy * w_mat
+      update_ops = [
+        gacc1.assign(beta1 * gacc1 + (1.0 - beta1) * grad),
+        gacc2.assign(beta2 * gacc2 + (1.0 - beta2) * grad ** 2),
+        train_step.assign_add(tf.ones([]))
+      ]
+      with tf.control_dependencies(update_ops):
+        lrn_rate = FLAGS.cpr_lstsq_lrn_rate \
+          * tf.sqrt(1.0 - tf.pow(beta2, train_step)) / (1.0 - tf.pow(beta1, train_step))
+        train_op = w_mat.assign_add(-lrn_rate * gacc1 / (tf.sqrt(gacc2) + epsilon))
+      init_op = tf.variables_initializer([x_mat, y_mat, w_mat, gacc1, gacc2, train_step])
+
+    # pack placeholders, variables, and TF operations into dict
+    meta_lstsq = {
+      'x_mat_ph': x_mat_ph,
+      'y_mat_ph': y_mat_ph,
+      'w_mat_ph': w_mat_ph,
+      'gacc1_ph': gacc1_ph,
+      'gacc2_ph': gacc2_ph,
+      'w_mat': w_mat,
+      'loss_reg': loss_reg,
+      'loss_dcy': loss_dcy,
+      'init_op': init_op,
+      'train_op': train_op,
+    }
+
+    return meta_lstsq
+
+  def __calc_grads_pruned(self, grads_origin):
+    """Calculate the mask-pruned gradients.
+
+    Args:
+    * grads_origin: list of original gradients
+
+    Returns:
+    * grads_pruned: list of mask-pruned gradients
+    """
+
+    grads_pruned = []
+    conv_krnl_names = [var.name for var in self.vars_prnd['conv_krnl']]
+    for grad in grads_origin:
+      if grad[1].name not in conv_krnl_names:
+        grads_pruned += [grad]
+      else:
+        idx_mask = conv_krnl_names.index(grad[1].name)
+        grads_pruned += [(grad[0] * self.masks[idx_mask], grad[1])]
+
+    return grads_pruned
+
+  def __choose_channels(self):  # pylint: disable=too-many-locals
+    """Choose channels for all convolutional layers."""
+
+    # configure each layer's pruning ratio
+    nb_layers = len(self.conv_info_list)
+    prune_ratios = [FLAGS.cpr_prune_ratio] * nb_layers
+    if FLAGS.cpr_skip_frst_layer:
+      prune_ratios[0] = 0.0
+    if FLAGS.cpr_skip_last_layer:
+      prune_ratios[-1] = 0.0
+
+    # skip channel pruning at certain layers
+    skip_names = FLAGS.cpr_skip_op_names.split(',') if FLAGS.cpr_skip_op_names is not None else []
+    for idx_layer in range(nb_layers):
+      #if self.conv_info_list[idx_layer]['input_full'].shape[2] == 8:
+      #  prune_ratios[idx_layer] = 0.0
+      conv_krnl_prnd_name = self.conv_info_list[idx_layer]['conv_krnl_prnd'].name
+      for skip_name in skip_names:
+        if skip_name in conv_krnl_prnd_name:
+          prune_ratios[idx_layer] = 0.0
+          tf.logging.info('skip %s since no pruning is required' % conv_krnl_prnd_name)
+          break
+
+    # cache multiple mini-batches of images for channel selection
+    def __build_feed_dict(images_np):
+      if not isinstance(self.images_prune, dict):
+        feed_dict = {self.images_prune_ph: images_np}
+      else:
+        feed_dict = {}
+        for key in self.images_prune:
+          feed_dict[self.images_prune_ph[key]] = images_np[key]
+      return feed_dict
+
+    nb_mbtcs = int(math.ceil(FLAGS.cpr_nb_smpls / FLAGS.batch_size))
+    images_cached = []
+    for __ in range(nb_mbtcs):
+      images_cached += [self.sess_prune.run(self.images_prune)]
+
+    # select channels for all the convolutional layers
+    self.sess_prune.run(self.init_op_prune)
+    for idx_layer in range(nb_layers):
+      # display the layer information
+      prune_ratio = prune_ratios[idx_layer]
+      conv_info = self.conv_info_list[idx_layer]
+      if self.is_primary_worker('global'):
+        tf.logging.info('layer #%d: pr = %.2f (target)' % (idx_layer, prune_ratio))
+        tf.logging.info('kernel name = {}'.format(conv_info['conv_krnl_prnd'].name))
+        tf.logging.info('kernel shape = {}'.format(conv_info['conv_krnl_prnd'].shape))
+
+      # extract the current layer's information
+      conv_krnl_full = self.sess_prune.run(conv_info['conv_krnl_full'])
+      conv_krnl_prnd = self.sess_prune.run(conv_info['conv_krnl_prnd'])
+      conv_krnl_prnd_ph = conv_info['conv_krnl_prnd_ph']
+      update_op = conv_info['update_op']
+      input_full_tf = conv_info['input_full']
+      input_prnd_tf = conv_info['input_prnd']
+      output_full_tf = conv_info['output_full']
+      output_prnd_tf = conv_info['output_prnd']
+      strides = conv_info['strides']
+      padding = conv_info['padding']
+      nb_chns_input = conv_krnl_prnd.shape[2]
+
+      # sample inputs & outputs through multiple mini-batches
+      tf.logging.info('sampling inputs & outputs through multiple mini-batches')
+      time_beg = timer()
+      nb_insts = 0  # number of sampled instances (for regression) collected so far
+      nb_insts_min = FLAGS.cpr_nb_crops_per_smpl * FLAGS.cpr_nb_smpls  # minimal requirement
+      inputs_list = [[] for __ in range(nb_chns_input)]
+      outputs_list = []
+      for idx_mbtc in range(nb_mbtcs):
+        inputs_full, inputs_prnd, outputs_full, outputs_prnd = \
+          self.sess_prune.run([input_full_tf, input_prnd_tf, output_full_tf, output_prnd_tf],
+                              feed_dict=__build_feed_dict(images_cached[idx_mbtc]))
+        inputs_smpl, outputs_smpl = self.__smpl_inputs_n_outputs(
+          conv_krnl_full, conv_krnl_prnd,
+          inputs_full, inputs_prnd, outputs_full, outputs_prnd, strides, padding)
+        nb_insts += outputs_smpl.shape[0]
+        for idx_chn_input in range(nb_chns_input):
+          inputs_list[idx_chn_input] += [inputs_smpl[idx_chn_input]]
+        outputs_list += [outputs_smpl]
+        if nb_insts >= nb_insts_min:
+          break
+      idxs_inst = np.random.choice(nb_insts, size=(nb_insts_min), replace=False)
+      inputs_np_list = [np.vstack(x)[idxs_inst] for x in inputs_list]
+      outputs_np = np.vstack(outputs_list)[idxs_inst]
+      tf.logging.info('time elapsed (sampling): %.4f (s)' % (timer() - time_beg))
+
+      # choose channels via solving the sparsity-constrained regression problem
+      tf.logging.info('choosing channels via solving the sparsity-constrained regression problem')
+      time_beg = timer()
+      conv_krnl_prnd = self.__solve_sparse_regression(
+        inputs_np_list, outputs_np, conv_krnl_prnd, prune_ratio)
+      self.sess_prune.run(update_op, feed_dict={conv_krnl_prnd_ph: conv_krnl_prnd})
+      tf.logging.info('time elapsed (selection): %.4f (s)' % (timer() - time_beg))
+
+      # compute the overall pruning ratios
+      pr_trn, pr_krn = self.sess_prune.run([self.pr_trn_prune, self.pr_krn_prune])
+      tf.logging.info('pruning ratios: %e (trn) / %e (krn)' % (pr_trn, pr_krn))
+
+    # save the temporary model containing channel pruned weights
+    if self.is_primary_worker('global'):
+      save_path = self.saver_prune.save(self.sess_prune, FLAGS.cpr_save_path_ws)
+      tf.logging.info('model saved to ' + save_path)
+    self.auto_barrier()
+
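+  def __smpl_inputs_n_outputs(self, conv_krnl_full, conv_krnl_prnd,
+                              inputs_full, inputs_prnd, outputs_full, outputs_prnd,
+                              strides, padding):
+    """Sample inputs & outputs of sub-regions from full feature maps.
+
+    Args:
+    * conv_krnl_full: convolutional kernel of the full model (k_h x k_w x c_i x c_o)
+    * conv_krnl_prnd: convolutional kernel of the pruned model (k_h x k_w x c_i x c_o)
+    * inputs_full: input feature maps of the full model (N x h_i x w_i x c_i)
+    * inputs_prnd: input feature maps of the pruned model (N x h_i x w_i x c_i)
+    * outputs_full: output feature maps of the full model (N x h_o x w_o x c_o)
+    * outputs_prnd: output feature maps of the pruned model (N x h_o x w_o x c_o)
+    * strides: convolution strides (list of 4 integers, NHWC order)
+    * padding: padding scheme ('VALID' or 'SAME')
+
+    Returns:
+    * inputs_smpl: list of sampled input patches from the pruned model
+        (one per input channel, N' x k_h * k_w, where N' = N * cpr_nb_crops_per_smpl)
+    * outputs_smpl: sampled outputs of the full model (N' x c_o)
+    """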
+
+    # obtain parameters
+    bs = inputs_full.shape[0]
+    kh, kw = conv_krnl_full.shape[0], conv_krnl_full.shape[1]
+    ih, iw, ic = inputs_full.shape[1], inputs_full.shape[2], inputs_full.shape[3]
+    oh, ow, oc = outputs_full.shape[1], outputs_full.shape[2], outputs_full.shape[3]
+    sh, sw = strides[1], strides[2]
+    if padding == 'VALID':
+      pt, pb, pl, pr = 0, 0, 0, 0  # padding - top / bottom / left / right
+    else:
+      # ref link: https://www.tensorflow.org/api_guides/python/nn#Convolution
+      ph = max(kh - (sh if ih % sh == 0 else ih % sh), 0)
+      pw = max(kw - (sw if iw % sw == 0 else iw % sw), 0)
+      pt, pb = ph // 2, ph - ph // 2
+      pl, pr = pw // 2, pw - pw // 2
+
+    # sample inputs & outputs of sub-regions
+    inputs_smpl_full_list = []
+    inputs_smpl_prnd_list = []
+    outputs_smpl_full_list = []
+    outputs_smpl_prnd_list = []
+    for idx_iter in range(FLAGS.cpr_nb_crops_per_smpl):
+      idx_oh = np.random.randint(oh)
+      idx_ow = np.random.randint(ow)
+      idx_ih_low = idx_oh * sh - pt  # uncropped indices of input feature maps
+      idx_ih_hgh = idx_ih_low + kh
+      idx_iw_low = idx_ow * sw - pl
+      idx_iw_hgh = idx_iw_low + kw
+      idx_sh_low = max(-idx_ih_low, 0)  # cropped indices of sampled feature maps
+      idx_sh_hgh = kh - max(idx_ih_hgh - ih, 0)
+      idx_sw_low = max(-idx_iw_low, 0)
+      idx_sw_hgh = kw - max(idx_iw_hgh - iw, 0)
+      idx_ih_low = max(idx_ih_low, 0)  # cropped indices of input feature maps
+      idx_ih_hgh = min(idx_ih_hgh, ih)
+      idx_iw_low = max(idx_iw_low, 0)
+      idx_iw_hgh = min(idx_iw_hgh, iw)
+      inputs_smpl_full = np.zeros((bs, kh, kw, ic))
+      inputs_smpl_prnd = np.zeros((bs, kh, kw, ic))
+      inputs_smpl_full[:, idx_sh_low:idx_sh_hgh, idx_sw_low:idx_sw_hgh, :] = \
+        inputs_full[:, idx_ih_low:idx_ih_hgh, idx_iw_low:idx_iw_hgh, :]
+      inputs_smpl_prnd[:, idx_sh_low:idx_sh_hgh, idx_sw_low:idx_sw_hgh, :] = \
+        inputs_prnd[:, idx_ih_low:idx_ih_hgh, idx_iw_low:idx_iw_hgh, :]
+      inputs_smpl_full_list += [inputs_smpl_full]
+      inputs_smpl_prnd_list += [inputs_smpl_prnd]
+      outputs_smpl_full_list += [np.reshape(outputs_full[:, idx_oh, idx_ow, :], [bs, -1])]
+      outputs_smpl_prnd_list += [np.reshape(outputs_prnd[:, idx_oh, idx_ow, :], [bs, -1])]
+
+    # concatenate samples into a single np.array
+    inputs_smpl_full = np.concatenate(inputs_smpl_full_list, axis=0)
+    inputs_smpl_prnd = np.concatenate(inputs_smpl_prnd_list, axis=0)
+    outputs_smpl_full = np.vstack(outputs_smpl_full_list)
+    outputs_smpl_prnd = np.vstack(outputs_smpl_prnd_list)
+
+    # concatenate sampled inputs & outputs arrays
+    inputs_smpl = [np.reshape(x, [-1, kh * kw]) for x in np.split(inputs_smpl_prnd, ic, axis=3)]
+    outputs_smpl = outputs_smpl_full
+
+    # validate inputs & outputs
+    wei_mat_full = np.reshape(conv_krnl_full, [-1, oc])
+    wei_mat_prnd = np.reshape(conv_krnl_prnd, [-1, oc])
+    preds_smpl_full = np.matmul(np.reshape(inputs_smpl_full, [-1, kh * kw * ic]), wei_mat_full)
+    preds_smpl_prnd = np.matmul(np.reshape(inputs_smpl_prnd, [-1, kh * kw * ic]), wei_mat_prnd)
+    err_full = norm(outputs_smpl_full - preds_smpl_full) ** 2 / outputs_smpl_full.size
+    err_prnd = norm(outputs_smpl_prnd - preds_smpl_prnd) ** 2 / outputs_smpl_prnd.size
+    assert err_full < 1e-6, 'unable to recover output feature maps - full (%e)' % err_full
+    assert err_prnd < 1e-6, 'unable to recover output feature maps - prnd (%e)' % err_prnd
+
+    return inputs_smpl, outputs_smpl
+
+  def __solve_sparse_regression(self, inputs_np_list, outputs_np, conv_krnl, prune_ratio):
+    """Solve the sparsity-constrained regression problem.
+
+    Args:
+    * inputs_np_list: list of input feature maps (one per input channel, N x k^2)
+    * outputs_np: output feature maps (N x c_o)
+    * conv_krnl: initial convolutional kernel (k * k * c_i * c_o)
+    * prune_ratio: pruning ratio
+
+    Returns:
+    * conv_krnl: updated convolutional kernel (k * k * c_i * c_o)
+    """
+
+    # obtain parameters
+    bs = outputs_np.shape[0]
+    kh, kw, ic, oc = conv_krnl.shape[0], conv_krnl.shape[1], conv_krnl.shape[2], conv_krnl.shape[3]
+    nb_chns_nnz_target = int(ic * (1.0 - prune_ratio))
+    tf.logging.info('[sparse regression]')
+    tf.logging.info('\tinputs: {} / outputs: {} / conv_krnl: {} / pr: {} / nnz: {}'.format(
+      inputs_np_list[0].shape, outputs_np.shape, conv_krnl.shape, prune_ratio, nb_chns_nnz_target))
+
+    # compute the feature matrix & response vector
+    tf.logging.info('computing the feature matrix & response vector')
+    time_beg = timer()
+    bs_rdc = int(math.ceil(min(bs, bs / oc * 10.0)))
+    tf.logging.info('secondary sampling: %d -> %d' % (bs, bs_rdc))
+    idxs_inst = np.random.choice(bs, size=(bs_rdc), replace=False)
+    rspn_vec_np = np.reshape(outputs_np[idxs_inst], [-1, 1])  # N' x 1 (N' = N * c_o)
+    feat_mat_np = np.zeros((ic, bs_rdc * oc))  # c_i x N'
+    for idx in range(ic):
+      wei_mat = np.reshape(conv_krnl[:, :, idx, :], [kh * kw, oc])
+      feat_mat_np[idx] = np.matmul(inputs_np_list[idx][idxs_inst], wei_mat).ravel()
+    feat_mat_np = np.transpose(feat_mat_np)
+    tf.logging.info('time elapsed: %.4f (s)' % (timer() - time_beg))
+
+    # compute <X^T * X> & <X^T * y> in advance
+    tf.logging.info('computing <X^T * X> & <X^T * y> in advance')
+    time_beg = timer()
+    xt_x_np = np.matmul(feat_mat_np.T, feat_mat_np)
+    xt_y_np = np.matmul(feat_mat_np.T, rspn_vec_np)
+    xt_x_norm = norm(xt_x_np)  # normalize <xt_x> to unit norm, and adjust <xt_y> correspondingly
+    xt_x_np /= xt_x_norm
+    xt_y_np /= xt_x_norm
+    mask_np_init = np.random.uniform(size=(ic, 1))
+    tf.logging.info('time elapsed: %.4f (s)' % (timer() - time_beg))
+
+    # solve the LASSO problem
+    def __solve_lasso(x):
+      self.sess_prune.run(self.meta_lasso['init_op'], feed_dict={
+        self.meta_lasso['xt_x_ph']: xt_x_np,
+        self.meta_lasso['xt_y_ph']: xt_y_np,
+        self.meta_lasso['mask_ph']: mask_np_init,
+      })
+      for __ in range(FLAGS.cpr_ista_nb_iters):
+        self.sess_prune.run(self.meta_lasso['train_op'], feed_dict={self.meta_lasso['gamma']: x})
+      mask_np = self.sess_prune.run(self.meta_lasso['mask'])
+      nb_chns_nnz = np.count_nonzero(mask_np)
+      tf.logging.info('x = %e -> nb_chns_nnz = %d' % (x, nb_chns_nnz))
+      return mask_np, nb_chns_nnz
+
+    # determine <gamma>'s upper bound
+    tf.logging.info('determining <gamma>\'s upper bound')
+    time_beg = timer()
+    ubnd = 0.1
+    while True:
+      mask_np, nb_chns_nnz = __solve_lasso(ubnd)
+      if nb_chns_nnz <= nb_chns_nnz_target:
+        break
+      else:
+        ubnd *= 2.0
+    tf.logging.info('time elapsed: %.4f (s)' % (timer() - time_beg))
+
+    # determine <gamma> via binary search
+    tf.logging.info('determining <gamma> via binary search')
+    time_beg = timer()
+    lbnd = 0.0
+    while nb_chns_nnz != nb_chns_nnz_target and ubnd - lbnd > 1e-8:
+      val = (lbnd + ubnd) / 2.0
+      mask_np, nb_chns_nnz = __solve_lasso(val)
+      if nb_chns_nnz < nb_chns_nnz_target:
+        ubnd = val
+      elif nb_chns_nnz > nb_chns_nnz_target:
+        lbnd = val
+      else:
+        break
+    tf.logging.info('time elapsed: %.4f (s)' % (timer() - time_beg))
+
+    # construct a least-squares regression problem
+    tf.logging.info('constructing a least-squares regression problem')
+    time_beg = timer()
+    bnry_vec_np = (np.abs(mask_np) > 0.0).astype(np.float32)
+    rspn_mat_np = outputs_np
+    feat_tns_np = np.concatenate([np.expand_dims(x, axis=-1) for x in inputs_np_list], axis=-1)
+    feat_mat_np = np.reshape(feat_tns_np * np.reshape(bnry_vec_np, [1, 1, -1]), [bs, -1])
+    w_mat_np_init = np.reshape(conv_krnl, [-1, oc])
+    gacc1_np = np.zeros_like(w_mat_np_init)
+    gacc2_np = np.zeros_like(w_mat_np_init)
+    self.sess_prune.run(self.meta_lstsq['init_op'], feed_dict={
+      self.meta_lstsq['x_mat_ph']: feat_mat_np,
+      self.meta_lstsq['y_mat_ph']: rspn_mat_np,
+      self.meta_lstsq['w_mat_ph']: w_mat_np_init,
+      self.meta_lstsq['gacc1_ph']: gacc1_np,
+      self.meta_lstsq['gacc2_ph']: gacc2_np,
+    })
+    loss_reg, loss_dcy = self.sess_prune.run(
+      [self.meta_lstsq['loss_reg'], self.meta_lstsq['loss_dcy']])
+    tf.logging.info('losses: %e (reg) / %e (dcy)' % (loss_reg, loss_dcy))
+    for __ in range(FLAGS.cpr_lstsq_nb_iters):
+      self.sess_prune.run(self.meta_lstsq['train_op'])
+    w_mat_np, loss_reg, loss_dcy = self.sess_prune.run(
+      [self.meta_lstsq['w_mat'], self.meta_lstsq['loss_reg'], self.meta_lstsq['loss_dcy']])
+    tf.logging.info('losses: %e (reg) / %e (dcy)' % (loss_reg, loss_dcy))
+    conv_krnl = np.reshape(w_mat_np, conv_krnl.shape) * np.reshape(bnry_vec_np, [1, 1, -1, 1])
+    tf.logging.info('time elapsed: %.4f (s)' % (timer() - time_beg))
+
+    return conv_krnl
+
+  def __save_model(self, is_train):
+    """Save the current model for training or evaluation.
+
+    Args:
+    * is_train: whether to save a model for training
+    """
+
+    if is_train:
+      save_path = self.saver_prnd_train.save(self.sess_train, FLAGS.cpr_save_path, self.global_step)
+    else:
+      save_path = self.saver_prnd_eval.save(self.sess_eval, FLAGS.cpr_save_path_eval)
+    tf.logging.info('model saved to ' + save_path)
+
+  def __restore_model(self, is_train):
+    """Restore a model from the latest checkpoint files.
+
+    Args:
+    * is_train: whether to restore a model for training
+    """
+
+    save_path = tf.train.latest_checkpoint(os.path.dirname(FLAGS.cpr_save_path))
+    if is_train:
+      self.saver_prnd_train.restore(self.sess_train, save_path)
+    else:
+      self.saver_prnd_eval.restore(self.sess_eval, save_path)
+    tf.logging.info('model restored from ' + save_path)
+
+  def __monitor_progress(self, summary, log_rslt, idx_iter, time_step):
+    """Monitor the training progress.
+
+    Args:
+    * summary: summary protocol buffer
+    * log_rslt: logging operations' results
+    * idx_iter: index of the training iteration
+    * time_step: time step between two summary operations
+    """
+
+    # write summaries for TensorBoard visualization
+    self.sm_writer.add_summary(summary, idx_iter)
+
+    # compute the training speed
+    speed = FLAGS.batch_size * FLAGS.summ_step / time_step
+    if FLAGS.enbl_multi_gpu:
+      speed *= mgw.size()
+
+    # display monitored statistics
+    log_str = ' | '.join(['%s = %.4e' % (name, value)
+                          for name, value in zip(self.log_op_names, log_rslt)])
+    tf.logging.info('iter #%d: %s | speed = %.2f pics / sec' % (idx_iter + 1, log_str, speed))
diff --git a/learners/discr_channel_pruning/learner.py b/learners/discr_channel_pruning/learner.py
index 9cde94c..12a2155 100644
--- a/learners/discr_channel_pruning/learner.py
+++ b/learners/discr_channel_pruning/learner.py
@@ -25,7 +25,6 @@
 
 from learners.abstract_learner import AbstractLearner
 from learners.distillation_helper import DistillationHelper
-from utils.lrn_rate_utils import setup_lrn_rate
 from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
 
 FLAGS = tf.app.flags.FLAGS
@@ -225,8 +224,7 @@ def __build_train(self):  # pylint: disable=too-many-locals,too-many-statements
 
         # learning rate schedule
         self.global_step = tf.train.get_or_create_global_step()
-        lrn_rate, self.nb_iters_train = setup_lrn_rate(
-          self.global_step, self.model_name, self.dataset_name)
+        lrn_rate, self.nb_iters_train = self.setup_lrn_rate(self.global_step)
 
         # overall pruning ratios of trainable & maskable variables
         pr_trainable = calc_prune_ratio(self.vars_prnd['trainable'])
@@ -475,6 +473,7 @@ def __choose_discr_chns(self):  # pylint: disable=too-many-locals
         else:
           summary, __ = self.sess_train.run([self.summary_op, self.block_train_ops[idx_block]])
           if self.is_primary_worker('global'):
+            tf.logging.info('iter #%d: writing TF-summary to file' % idx_iter)
             self.sm_writer.add_summary(summary, nb_iters_block * idx_block + idx_iter)
 
       # select the most discriminative channels for each layer
@@ -482,23 +481,29 @@ def __choose_discr_chns(self):  # pylint: disable=too-many-locals
         if self.idxs_layer_to_block[idx_layer] != idx_block:
           continue
 
-        # initialize the mask as all channels are pruned
+        # initialize the gradient mask
         mask_shape = self.sess_train.run(tf.shape(self.masks[idx_layer]))
-        tf.logging.info('layer #{}: mask\'s shape is {}'.format(idx_layer, mask_shape))
+        if self.is_primary_worker('global'):
+          tf.logging.info('layer #{}: mask\'s shape is {}'.format(idx_layer, mask_shape))
         nb_chns = mask_shape[2]
+        idxs_chn_keep = []
         grad_norm_mask = np.ones(nb_chns)
-        mask_vec = np.sum(self.sess_train.run(self.masks[idx_layer]), axis=(0, 1, 3))
-        prune_ratio = 1.0 - float(np.count_nonzero(mask_vec)) / mask_vec.size
-        tf.logging.info('layer #%d: prune_ratio = %.4f' % (idx_layer, prune_ratio))
+
+        # sequentially add the most important channel to the non-pruned set
         is_first_entry = True
         while is_first_entry or prune_ratio > FLAGS.dcp_prune_ratio:
-          # choose the most important channel and then update the mask
+          # choose the most important channel
           grad_norm = self.sess_train.run(self.grad_norms[idx_layer])
-          idx_chn_input = np.argmax(grad_norm * grad_norm_mask)
-          grad_norm_mask[idx_chn_input] = 0.0
-          tf.logging.info('adding channel #%d to the non-pruned set' % idx_chn_input)
+          idx_chn = np.argmax((grad_norm + 1e-8) * grad_norm_mask)  # avoid all-zero gradients
+          assert idx_chn not in idxs_chn_keep, 'channel #%d already in the non-pruned set' % idx_chn
+          idxs_chn_keep += [idx_chn]
+          grad_norm_mask[idx_chn] = 0.0
+          if self.is_primary_worker('global'):
+            tf.logging.info('adding channel #%d to the non-pruned set' % idx_chn)
+
+          # update the mask
           mask_delta = np.zeros(mask_shape)
-          mask_delta[:, :, idx_chn_input, :] = 1.0
+          mask_delta[:, :, idx_chn, :] = 1.0
           if is_first_entry:
             is_first_entry = False
             self.sess_train.run(self.mask_init_ops[idx_layer])
@@ -513,7 +518,8 @@ def __choose_discr_chns(self):  # pylint: disable=too-many-locals
           # re-compute the pruning ratio
           mask_vec = np.sum(self.sess_train.run(self.masks[idx_layer]), axis=(0, 1, 3))
           prune_ratio = 1.0 - float(np.count_nonzero(mask_vec)) / mask_vec.size
-          tf.logging.info('layer #%d: prune_ratio = %.4f' % (idx_layer, prune_ratio))
+          if self.is_primary_worker('global'):
+            tf.logging.info('layer #%d: prune_ratio = %.4f' % (idx_layer, prune_ratio))
 
       # compute overall pruning ratios
       if self.is_primary_worker('global'):
diff --git a/learners/distillation_helper.py b/learners/distillation_helper.py
index 02ebd90..2cd8468 100644
--- a/learners/distillation_helper.py
+++ b/learners/distillation_helper.py
@@ -48,9 +48,8 @@ def __init__(self, sm_writer, model_helper, mpi_comm):
 
     # initialize a full-precision model
     self.model_scope = 'distilled_model'  # to distinguish from models created by other learners
-    enbl_dst = False  # disable the distillation loss for teacher model
     from learners.full_precision.learner import FullPrecLearner
-    self.learner = FullPrecLearner(sm_writer, model_helper, self.model_scope, enbl_dst)
+    self.learner = FullPrecLearner(sm_writer, model_helper, self.model_scope, enbl_dst=False)
 
     # initialize a model for training with the distillation loss
     if is_primary_worker('local'):
@@ -112,17 +111,18 @@ def __initialize(self):
 
     # download the pre-trained model from HDFS
     self.learner.download_model()
-
-    # rename the variable scope of pre-trained model
     if os.path.isdir(os.path.dirname(FLAGS.save_path_dst)):
       shutil.rmtree(os.path.dirname(FLAGS.save_path_dst))
     shutil.copytree(os.path.dirname(FLAGS.save_path), os.path.dirname(FLAGS.save_path_dst))
-    self.__rename_var_scope()
-    self.__evaluate_model()
 
-  def __rename_var_scope(self):
-    """Rename the name scope of all variables."""
+    # restore a pre-trained model and then evaluate
+    self.__restore()
+    self.__evaluate()
+
+  def __restore(self):
+    """Restore a pre-trained model with the variable scope renamed."""
 
+    # rename the variable scope
     ckpt_dir = os.path.dirname(FLAGS.save_path_dst)
     ckpt = tf.train.get_checkpoint_state(ckpt_dir)
     with tf.Graph().as_default():
@@ -139,14 +139,14 @@ def __rename_var_scope(self):
         sess.run(tf.global_variables_initializer())
         saver.save(sess, ckpt.model_checkpoint_path)  # pylint: disable=no-member
 
-  def __evaluate_model(self):
-    """Evaluate the model's loss & accuracy."""
-
     # restore the model from checkpoint files
     ckpt_file = tf.train.latest_checkpoint(os.path.dirname(FLAGS.save_path_dst))
     self.learner.saver_eval.restore(self.learner.sess_eval, ckpt_file)
     tf.logging.info('model restored from ' + ckpt_file)
 
+  def __evaluate(self):
+    """Evaluate the model's loss & accuracy."""
+
     # evaluate the model
     losses, accuracies = [], []
     nb_iters = int(np.ceil(float(FLAGS.nb_smpls_eval) / FLAGS.batch_size_eval))
diff --git a/learners/full_precision/learner.py b/learners/full_precision/learner.py
index 55b7762..30c3245 100644
--- a/learners/full_precision/learner.py
+++ b/learners/full_precision/learner.py
@@ -23,7 +23,6 @@
 
 from learners.abstract_learner import AbstractLearner
 from learners.distillation_helper import DistillationHelper
-from utils.lrn_rate_utils import setup_lrn_rate
 from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
 
 FLAGS = tf.app.flags.FLAGS
@@ -60,6 +59,7 @@ def train(self):
 
     # initialization
     self.sess_train.run(self.init_op)
+    self.warm_start(self.sess_train)
     if FLAGS.enbl_multi_gpu:
       self.sess_train.run(self.bcast_op)
 
@@ -92,10 +92,13 @@ def evaluate(self):
     """Restore a model from the latest checkpoint files and then evaluate it."""
 
     self.__restore_model(is_train=False)
-    nb_iters = int(np.ceil(float(FLAGS.nb_smpls_eval) / FLAGS.batch_size))
+    nb_iters = int(np.ceil(float(FLAGS.nb_smpls_eval) / FLAGS.batch_size_eval))
     eval_rslts = np.zeros((nb_iters, len(self.eval_op)))
+    self.dump_n_eval(outputs=None, action='init')
     for idx_iter in range(nb_iters):
-      eval_rslts[idx_iter] = self.sess_eval.run(self.eval_op)
+      eval_rslts[idx_iter], outputs = self.sess_eval.run([self.eval_op, self.outputs_eval])
+      self.dump_n_eval(outputs=outputs, action='dump')
+    self.dump_n_eval(outputs=None, action='eval')
     for idx, name in enumerate(self.eval_op_names):
       tf.logging.info('%s = %.4e' % (name, np.mean(eval_rslts[:, idx])))
 
@@ -116,7 +119,10 @@ def __build(self, is_train):  # pylint: disable=too-many-locals
       with tf.variable_scope(self.data_scope):
         iterator = self.build_dataset_train() if is_train else self.build_dataset_eval()
         images, labels = iterator.get_next()
-        tf.add_to_collection('images_final', images)
+        if not isinstance(images, dict):
+          tf.add_to_collection('images_final', images)
+        else:
+          tf.add_to_collection('images_final', images['image'])
 
       # model definition - distilled model
       if self.enbl_dst:
@@ -125,8 +131,15 @@ def __build(self, is_train):  # pylint: disable=too-many-locals
       # model definition - primary model
       with tf.variable_scope(self.model_scope):
         # forward pass
-        logits = self.forward_train(images) if is_train else self.forward_eval(images)
-        tf.add_to_collection('logits_final', logits)
+        if is_train and self.forward_w_labels:
+          logits = self.forward_train(images, labels)
+        else:
+          logits = self.forward_train(images) if is_train else self.forward_eval(images)
+        if not isinstance(logits, dict):
+          tf.add_to_collection('logits_final', logits)
+        else:
+          for value in logits.values():
+            tf.add_to_collection('logits_final', value)
 
         # loss & extra evalution metrics
         loss, metrics = self.calc_loss(labels, logits, self.trainable_vars)
@@ -139,8 +152,7 @@ def __build(self, is_train):  # pylint: disable=too-many-locals
         # optimizer & gradients
         if is_train:
           self.global_step = tf.train.get_or_create_global_step()
-          lrn_rate, self.nb_iters_train = setup_lrn_rate(
-            self.global_step, self.model_name, self.dataset_name)
+          lrn_rate, self.nb_iters_train = self.setup_lrn_rate(self.global_step)
           optimizer = tf.train.MomentumOptimizer(lrn_rate, FLAGS.momentum)
           if FLAGS.enbl_multi_gpu:
             optimizer = mgw.DistributedOptimizer(optimizer)
@@ -162,6 +174,7 @@ def __build(self, is_train):  # pylint: disable=too-many-locals
         self.sess_eval = sess
         self.eval_op = [loss] + list(metrics.values())
         self.eval_op_names = ['loss'] + list(metrics.keys())
+        self.outputs_eval = logits
         self.saver_eval = tf.train.Saver(self.vars)
 
   def __save_model(self, is_train):
diff --git a/learners/learner_utils.py b/learners/learner_utils.py
index 6d57627..217390c 100644
--- a/learners/learner_utils.py
+++ b/learners/learner_utils.py
@@ -21,6 +21,8 @@
 from learners.full_precision.learner import FullPrecLearner
 from learners.weight_sparsification.learner import WeightSparseLearner
 from learners.channel_pruning.learner import ChannelPrunedLearner
+from learners.channel_pruning_gpu.learner import ChannelPrunedGpuLearner
+from learners.channel_pruning_rmt.learner import ChannelPrunedRmtLearner
 from learners.discr_channel_pruning.learner import DisChnPrunedLearner
 from learners.uniform_quantization.learner import UniformQuantLearner
 from learners.uniform_quantization_tf.learner import UniformQuantTFLearner
@@ -46,6 +48,10 @@ def create_learner(sm_writer, model_helper):
     learner = WeightSparseLearner(sm_writer, model_helper)
   elif FLAGS.learner == 'channel':
     learner = ChannelPrunedLearner(sm_writer, model_helper)
+  elif FLAGS.learner == 'chn-pruned-gpu':
+    learner = ChannelPrunedGpuLearner(sm_writer, model_helper)
+  elif FLAGS.learner == 'chn-pruned-rmt':
+    learner = ChannelPrunedRmtLearner(sm_writer, model_helper)
   elif FLAGS.learner == 'dis-chn-pruned':
     learner = DisChnPrunedLearner(sm_writer, model_helper)
   elif FLAGS.learner == 'uniform':
diff --git a/learners/nonuniform_quantization/bit_optimizer.py b/learners/nonuniform_quantization/bit_optimizer.py
index 5cdc406..dd5337b 100644
--- a/learners/nonuniform_quantization/bit_optimizer.py
+++ b/learners/nonuniform_quantization/bit_optimizer.py
@@ -278,7 +278,7 @@ def __layerwise_finetune(self, feed_dict_train, layer_bits):
   def __global_finetune(self, feed_dict_train):
     time_prev = timer()
     for t_step in range(self.tune_global_steps):
-      _ = self.sess_train.run(self.ops['train'], feed_dict=feed_dict_train)
+      self.sess_train.run(self.ops['rl_fintune'], feed_dict=feed_dict_train)
       if (t_step+1) % self.tune_global_disp_steps == 0:
         log_rslt = self.sess_train.run(self.ops['log'], feed_dict=feed_dict_train)
         time_prev = self.__monitor_progress(t_step, log_rslt, time_prev)
diff --git a/learners/nonuniform_quantization/learner.py b/learners/nonuniform_quantization/learner.py
index e5cbcf7..d6e7dbb 100644
--- a/learners/nonuniform_quantization/learner.py
+++ b/learners/nonuniform_quantization/learner.py
@@ -41,7 +41,7 @@
     'WARNING: Useless for activation quantization in non-uniform mode')
 tf.app.flags.DEFINE_boolean('nuql_use_buckets', False, 'Use bucketing or not')
 tf.app.flags.DEFINE_integer('nuql_bucket_size', 256, 'Number of bucket size')
-tf.app.flags.DEFINE_integer('nuql_quant_epochs', 60, 'Number of steps for quantization')
+tf.app.flags.DEFINE_integer('nuql_quant_epochs', 60, 'Number of finetune steps for quantization')
 tf.app.flags.DEFINE_string('nuql_save_quant_model_path', \
     './nuql_quant_models/model.ckpt', 'dir to save quantization model')
 tf.app.flags.DEFINE_boolean('nuql_quantize_all_layers', False, \
@@ -255,10 +255,17 @@ def __build_train(self):
           if v not in clusters]
 
       # determine the var_list optimize
-      if FLAGS.nuql_opt_mode == 'both':
-        optimizable_vars = self.trainable_vars
-      elif FLAGS.nuql_opt_mode == 'clusters':
-        optimizable_vars = clusters
+      if FLAGS.nuql_opt_mode in ['cluster', 'both']:
+        if FLAGS.nuql_opt_mode == 'both':
+          optimizable_vars = self.trainable_vars
+        else:
+          optimizable_vars = clusters
+        if FLAGS.nuql_enbl_rl_agent:
+          optimizer_fintune = tf.train.GradientDescentOptimizer(lrn_rate)
+          if FLAGS.enbl_multi_gpu:
+            optimizer_fintune = mgw.DistributedOptimizer(optimizer_fintune)
+          grads_fintune = optimizer_fintune.compute_gradients(loss, var_list=optimizable_vars)
+
       elif FLAGS.nuql_opt_mode == 'weights':
         optimizable_vars = rest_trainable_vars
       else:
@@ -272,7 +279,10 @@ def __build_train(self):
       # define the ops
       with tf.control_dependencies(self.update_ops):
         self.ops['train'] = optimizer.apply_gradients(grads, global_step=self.ft_step)
-
+        if FLAGS.nuql_opt_mode in ['both', 'cluster'] and FLAGS.nuql_enbl_rl_agent:
+          self.ops['rl_fintune'] = optimizer_fintune.apply_gradients(grads_fintune, global_step=self.ft_step)
+        else:
+          self.ops['rl_fintune'] = self.ops['train']
       self.ops['summary'] = tf.summary.merge_all()
       if FLAGS.enbl_dst:
         self.ops['log'] = [lrn_rate, dst_loss, model_loss, loss, acc_top1, acc_top5]
diff --git a/learners/nonuniform_quantization/utils.py b/learners/nonuniform_quantization/utils.py
index ea7963e..284e866 100644
--- a/learners/nonuniform_quantization/utils.py
+++ b/learners/nonuniform_quantization/utils.py
@@ -300,8 +300,10 @@ def __build_norm_quant_point(self, init_c, x_normalized, k):
     g = self.sess.graph
     w_new = tf.tile(tf.expand_dims(x_normalized, w_dims), shape_)
     min_index = tf.argmin(tf.abs(w_new - c), axis=-1)
-    with g.gradient_override_map({'Mul': 'Add', 'Abs': 'Identity', 'Sign': 'Identity'}):
-      qx = tf.gather(c, min_index) * tf.abs(tf.sign(x_normalized))
+
+    # override gradients to implement the straight-through estimator (STE)
+    with g.gradient_override_map({'Mul': 'Add', 'Sign': 'Identity'}):
+      qx = tf.gather(c, min_index) * tf.sign(x_normalized + 1e-6)
     return qx
 
   def __build_bucket_norm_quant_point(self, init_c, x_normalized, k, bucket_num):
@@ -325,8 +327,9 @@ def __build_bucket_norm_quant_point(self, init_c, x_normalized, k, bucket_num):
     g = self.sess.graph
     x_rep = tf.tile(tf.expand_dims(x_normalized, -1), shape_)
     x_rep = tf.transpose(x_rep, [0, 2, 1])  # [bucket_size, nb_cluster, bucket_num]
-    min_index = tf.argmin(tf.abs(x_rep - c), axis=1)  # [bucket_size, bucket_num]
+
     # Non uniform: assign each w to the closest cluster
+    min_index = tf.argmin(tf.abs(x_rep - c), axis=1)  # [bucket_size, bucket_num]
 
     # NOTE: slow but save memory
     tmp_qx = tf.map_fn(lambda idx: tf.gather(c[:, idx], min_index[:, idx]), \
@@ -337,8 +340,10 @@ def __build_bucket_norm_quant_point(self, init_c, x_normalized, k, bucket_num):
     #tmp_qx = tf.gather_nd(tmp_qx, list(zip(range(bucket_num), range(bucket_num))))
 
     qx = tf.transpose(tmp_qx) # [bucket_size, bucket_num]
-    with g.gradient_override_map({'Mul': 'Add', 'Abs': 'Identity', 'Sign': 'Identity'}):
-      qx = qx * tf.abs(tf.sign(x_normalized))
+
+    # override gradients to implement the straight-through estimator (STE)
+    with g.gradient_override_map({'Mul': 'Add', 'Sign': 'Identity'}):
+      qx = qx * tf.sign(x_normalized + 1e-6)
     return qx
 
   def __quantile_init(self, x_normalized, nb_clusters):
@@ -405,8 +410,9 @@ def __scale(self, w, mode):
 
     w_max = tf.stop_gradient(tf.reduce_max(w, axis=axis))
     w_min = tf.stop_gradient(tf.reduce_min(w, axis=axis))
+    eps = tf.constant(value=1e-10, dtype=tf.float32)
 
-    alpha = w_max - w_min
+    alpha = w_max - w_min + eps
     beta = w_min
     w = (w - beta) / alpha
     return w, alpha, beta
diff --git a/learners/uniform_quantization/utils.py b/learners/uniform_quantization/utils.py
index c51fbc2..fc41bda 100644
--- a/learners/uniform_quantization/utils.py
+++ b/learners/uniform_quantization/utils.py
@@ -223,8 +223,9 @@ def __scale(self, w, mode):
 
     w_max = tf.stop_gradient(tf.reduce_max(w, axis=axis))
     w_min = tf.stop_gradient(tf.reduce_min(w, axis=axis))
-
-    alpha = w_max - w_min
+    eps = tf.constant(value=1e-10, dtype=tf.float32)
+
+    alpha = w_max - w_min + eps
     beta = w_min
     w = (w - beta) / alpha
     return w, alpha, beta
diff --git a/learners/uniform_quantization_tf/learner.py b/learners/uniform_quantization_tf/learner.py
index e848ba5..f0484ed 100644
--- a/learners/uniform_quantization_tf/learner.py
+++ b/learners/uniform_quantization_tf/learner.py
@@ -23,7 +23,8 @@
 
 from learners.abstract_learner import AbstractLearner
 from learners.distillation_helper import DistillationHelper
-from utils.lrn_rate_utils import setup_lrn_rate
+from learners.uniform_quantization_tf.utils import find_unquant_act_nodes
+from learners.uniform_quantization_tf.utils import insert_quant_op
 from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
 
 FLAGS = tf.app.flags.FLAGS
@@ -41,6 +42,8 @@
                             'UT-TF: # of steps after which moving mean and variance are frozen \
                             and used instead of batch statistics during training.')
 tf.app.flags.DEFINE_float('uqtf_lrn_rate_dcy', 1e-2, 'UQ-TF: learning rate\'s decaying factor')
+tf.app.flags.DEFINE_boolean('uqtf_enbl_manual_quant', False,
+                            'UQ-TF: enable manually inserting quantization operations')
 
 def get_vars_by_scope(scope):
   """Get list of variables within certain name scope.
@@ -82,6 +85,13 @@ def __init__(self, sm_writer, model_helper):
     self.auto_barrier()
     tf.logging.info('model files: ' + ', '.join(os.listdir('./models')))
 
+    # detect unquantized activation nodes
+    self.unquant_node_names = []
+    if FLAGS.uqtf_enbl_manual_quant:
+      self.unquant_node_names = find_unquant_act_nodes(
+        model_helper, self.data_scope, self.model_scope_quan, self.mpi_comm)
+    tf.logging.info('unquantized activation nodes: {}'.format(self.unquant_node_names))
+
     # class-dependent initialization
     if FLAGS.enbl_dst:
       self.helper_dst = DistillationHelper(sm_writer, model_helper, self.mpi_comm)
@@ -117,6 +127,7 @@ def train(self):
       if self.is_primary_worker('global') and (idx_iter + 1) % FLAGS.save_step == 0:
         self.__save_model(is_train=True)
         self.evaluate()
+      self.auto_barrier()
 
     # save the final model
     if self.is_primary_worker('global'):
@@ -131,37 +142,49 @@ def evaluate(self):
     self.__restore_model(is_train=False)
     nb_iters = int(np.ceil(float(FLAGS.nb_smpls_eval) / FLAGS.batch_size_eval))
     eval_rslts = np.zeros((nb_iters, len(self.eval_op)))
+    self.dump_n_eval(outputs=None, action='init')
     for idx_iter in range(nb_iters):
-      eval_rslts[idx_iter] = self.sess_eval.run(self.eval_op)
+      if (idx_iter + 1) % 100 == 0:
+        tf.logging.info('process the %d-th mini-batch for evaluation' % (idx_iter + 1))
+      eval_rslts[idx_iter], outputs = self.sess_eval.run([self.eval_op, self.outputs_eval])
+      self.dump_n_eval(outputs=outputs, action='dump')
+    self.dump_n_eval(outputs=None, action='eval')
     for idx, name in enumerate(self.eval_op_names):
       tf.logging.info('%s = %.4e' % (name, np.mean(eval_rslts[:, idx])))
 
   def __build_train(self):  # pylint: disable=too-many-locals,too-many-statements
     """Build the training graph."""
 
-    with tf.Graph().as_default():
+    with tf.Graph().as_default() as graph:
       # create a TF session for the current graph
       config = tf.ConfigProto()
       config.gpu_options.visible_device_list = str(mgw.local_rank() if FLAGS.enbl_multi_gpu else 0)  # pylint: disable=no-member
+      config.gpu_options.allow_growth = True  # pylint: disable=no-member
       sess = tf.Session(config=config)
 
       # data input pipeline
       with tf.variable_scope(self.data_scope):
         iterator = self.build_dataset_train()
         images, labels = iterator.get_next()
-        images.set_shape((FLAGS.batch_size, images.shape[1], images.shape[2], images.shape[3]))
 
       # model definition - uniform quantized model - part 1
       with tf.variable_scope(self.model_scope_quan):
         logits_quan = self.forward_train(images)
+        if not isinstance(logits_quan, dict):
+          outputs = tf.nn.softmax(logits_quan)
+        else:
+          outputs = tf.nn.softmax(logits_quan['cls_pred'])
         tf.contrib.quantize.experimental_create_training_graph(
           weight_bits=FLAGS.uqtf_weight_bits,
           activation_bits=FLAGS.uqtf_activation_bits,
           quant_delay=FLAGS.uqtf_quant_delay,
           freeze_bn_delay=FLAGS.uqtf_freeze_bn_delay,
           scope=self.model_scope_quan)
+        for node_name in self.unquant_node_names:
+          insert_quant_op(graph, node_name, is_train=True)
         self.vars_quan = get_vars_by_scope(self.model_scope_quan)
-        self.saver_quan_train = tf.train.Saver(self.vars_quan['all'])
+        self.global_step = tf.train.get_or_create_global_step()
+        self.saver_quan_train = tf.train.Saver(self.vars_quan['all'] + [self.global_step])
 
       # model definition - distilled model
       if FLAGS.enbl_dst:
@@ -188,9 +211,7 @@ def __build_train(self):  # pylint: disable=too-many-locals,too-many-statements
           tf.summary.scalar(key, value)
 
         # learning rate schedule
-        self.global_step = tf.train.get_or_create_global_step()
-        lrn_rate, self.nb_iters_train = setup_lrn_rate(
-          self.global_step, self.model_name, self.dataset_name)
+        lrn_rate, self.nb_iters_train = self.setup_lrn_rate(self.global_step)
         lrn_rate *= FLAGS.uqtf_lrn_rate_dcy
 
         # decrease the learning rate by a constant factor
@@ -214,10 +235,12 @@ def __build_train(self):  # pylint: disable=too-many-locals,too-many-statements
         with tf.control_dependencies([tf.variables_initializer(self.vars_all)]):
           for var_full, var_quan in zip(self.vars_full['all'], self.vars_quan['all']):
             init_ops += [var_quan.assign(var_full)]
+        init_ops += [self.global_step.initializer]
         self.init_op = tf.group(init_ops)
 
         # TF operations for fine-tuning
-        optimizer_base = tf.train.MomentumOptimizer(lrn_rate, FLAGS.momentum)
+        #optimizer_base = tf.train.MomentumOptimizer(lrn_rate, FLAGS.momentum)
+        optimizer_base = tf.train.AdamOptimizer(lrn_rate)
         if not FLAGS.enbl_multi_gpu:
           optimizer = optimizer_base
         else:
@@ -238,10 +261,11 @@ def __build_train(self):  # pylint: disable=too-many-locals,too-many-statements
   def __build_eval(self):
     """Build the evaluation graph."""
 
-    with tf.Graph().as_default():
+    with tf.Graph().as_default() as graph:
       # create a TF session for the current graph
       config = tf.ConfigProto()
       config.gpu_options.visible_device_list = str(mgw.local_rank() if FLAGS.enbl_multi_gpu else 0)  # pylint: disable=no-member
+      config.gpu_options.allow_growth = True  # pylint: disable=no-member
       self.sess_eval = tf.Session(config=config)
 
       # data input pipeline
@@ -252,11 +276,19 @@ def __build_eval(self):
       # model definition - uniform quantized model - part 1
       with tf.variable_scope(self.model_scope_quan):
         logits = self.forward_eval(images)
+        if not isinstance(logits, dict):
+          outputs = tf.nn.softmax(logits)
+        else:
+          outputs = tf.nn.softmax(logits['cls_pred'])
         tf.contrib.quantize.experimental_create_eval_graph(
           weight_bits=FLAGS.uqtf_weight_bits,
           activation_bits=FLAGS.uqtf_activation_bits,
           scope=self.model_scope_quan)
+        for node_name in self.unquant_node_names:
+          insert_quant_op(graph, node_name, is_train=False)
         vars_quan = get_vars_by_scope(self.model_scope_quan)
+        global_step = tf.train.get_or_create_global_step()
+        self.saver_quan_eval = tf.train.Saver(vars_quan['all'] + [global_step])
 
       # model definition - distilled model
       if FLAGS.enbl_dst:
@@ -272,11 +304,17 @@ def __build_eval(self):
         # TF operations for evaluation
         self.eval_op = [loss] + list(metrics.values())
         self.eval_op_names = ['loss'] + list(metrics.keys())
-        self.saver_quan_eval = tf.train.Saver(vars_quan['all'])
+        self.outputs_eval = logits
 
       # add input & output tensors to certain collections
-      tf.add_to_collection('images_final', images)
-      tf.add_to_collection('logits_final', logits)
+      if not isinstance(images, dict):
+        tf.add_to_collection('images_final', images)
+      else:
+        tf.add_to_collection('images_final', images['image'])
+      if not isinstance(logits, dict):
+        tf.add_to_collection('logits_final', logits)
+      else:
+        tf.add_to_collection('logits_final', logits['cls_pred'])
 
   def __save_model(self, is_train):
     """Save the current model for training or evaluation.
diff --git a/learners/uniform_quantization_tf/utils.py b/learners/uniform_quantization_tf/utils.py
new file mode 100644
index 0000000..e520ced
--- /dev/null
+++ b/learners/uniform_quantization_tf/utils.py
@@ -0,0 +1,295 @@
+# Tencent is pleased to support the open source community by making PocketFlow available.
+#
+# Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Utility functions."""
+
+import os
+import subprocess
+import tensorflow as tf
+from tensorflow.contrib.quantize.python import common
+from tensorflow.contrib.quantize.python import input_to_ops
+from tensorflow.contrib.quantize.python import quant_ops
+from tensorflow.contrib.lite.python import lite_constants
+
+from utils.misc_utils import auto_barrier
+from utils.misc_utils import is_primary_worker
+from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
+
+FLAGS = tf.app.flags.FLAGS
+
+tf.app.flags.DEFINE_string('uqtf_save_path_probe', './models_uqtf_probe/model.ckpt',
+                           'UQ-TF: probe model\'s save path')
+tf.app.flags.DEFINE_string('uqtf_save_path_probe_eval', './models_uqtf_probe_eval/model.ckpt',
+                           'UQ-TF: probe model\'s save path for evaluation')
+
+def create_session():
+  """Create a TensorFlow session.
+
+  Return:
+  * sess: TensorFlow session
+  """
+
+  # create a TensorFlow session
+  config = tf.ConfigProto()
+  config.gpu_options.visible_device_list = str(mgw.local_rank() if FLAGS.enbl_multi_gpu else 0)  # pylint: disable=no-member
+  config.gpu_options.allow_growth = True  # pylint: disable=no-member
+  sess = tf.Session(config=config)
+
+  return sess
+
+def insert_quant_op(graph, node_name, is_train):
+  """Insert quantization operations to the specified activation node.
+
+  Args:
+  * graph: TensorFlow graph
+  * node_name: activation node's name
+  * is_train: insert training-related operations or not
+  """
+
+  # locate the node & activation operation
+  for op in graph.get_operations():
+    if node_name in [node.name for node in op.outputs]:
+      tf.logging.info('op: {} / inputs: {} / outputs: {}'.format(
+        op.name, [node.name for node in op.inputs], [node.name for node in op.outputs]))
+      node = op.outputs[0]
+      activation_op = op
+      break
+
+  # re-route the graph to insert quantization operations
+  input_to_ops_map = input_to_ops.InputToOps(graph)
+  consumer_ops = input_to_ops_map.ConsumerOperations(activation_op)
+  node_quant = quant_ops.MovingAvgQuantize(
+    node, is_training=is_train, num_bits=FLAGS.uqtf_activation_bits)
+  nb_update_inputs = common.RerouteTensor(node_quant, node, consumer_ops)
+  tf.logging.info('nb_update_inputs = %d' % nb_update_inputs)
+
+def export_tflite_model(input_coll, output_coll, images_shape, images_name):
+  """Export a *.tflite model from checkpoint files.
+
+  Args:
+  * input_coll: input collection's name
+  * output_coll: output collection's name
+
+  Returns:
+  * unquant_node_name: unquantized activation node name (None if not found)
+  """
+
+  # remove previously generated *.pb & *.tflite models
+  model_dir = os.path.dirname(FLAGS.uqtf_save_path_probe_eval)
+  idx_worker = mgw.local_rank() if FLAGS.enbl_multi_gpu else 0
+  pb_path = os.path.join(model_dir, 'model_%d.pb' % idx_worker)
+  tflite_path = os.path.join(model_dir, 'model_%d.tflite' % idx_worker)
+  if os.path.exists(pb_path):
+    os.remove(pb_path)
+  if os.path.exists(tflite_path):
+    os.remove(tflite_path)
+
+  # convert checkpoint files to a *.pb model
+  images_name_ph = 'images'
+  with tf.Graph().as_default() as graph:
+    # create a TensorFlow session
+    sess = create_session()
+
+    # restore the graph with inputs replaced
+    ckpt_path = tf.train.latest_checkpoint(model_dir)
+    meta_path = ckpt_path + '.meta'
+    images = tf.placeholder(tf.float32, shape=images_shape, name=images_name_ph)
+    saver = tf.train.import_meta_graph(meta_path, input_map={images_name: images})
+    saver.restore(sess, ckpt_path)
+
+    # obtain input & output nodes
+    net_inputs = tf.get_collection(input_coll)
+    net_logits = tf.get_collection(output_coll)[0]
+    net_outputs = [tf.nn.softmax(net_logits)]
+    for node in net_inputs:
+      tf.logging.info('inputs: {} / {}'.format(node.name, node.shape))
+    for node in net_outputs:
+      tf.logging.info('outputs: {} / {}'.format(node.name, node.shape))
+
+    # write the original graph to a *.pb file
+    graph_def = tf.graph_util.convert_variables_to_constants(
+      sess, graph.as_graph_def(), [node.name.replace(':0', '') for node in net_outputs])
+    tf.train.write_graph(graph_def, model_dir, os.path.basename(pb_path), as_text=False)
+    assert os.path.exists(pb_path), 'failed to generate a *.pb model'
+
+  # convert the *.pb model to a *.tflite model and detect the unquantized activation node (if any)
+  tf.logging.info(pb_path + ' -> ' + tflite_path)
+  converter = tf.contrib.lite.TFLiteConverter.from_frozen_graph(
+    pb_path, [images_name_ph], [node.name.replace(':0', '') for node in net_outputs])
+  converter.inference_type = lite_constants.QUANTIZED_UINT8
+  converter.quantized_input_stats = {images_name_ph: (0., 1.)}
+  unquant_node_name = None
+  try:
+    tflite_model = converter.convert()
+    with open(tflite_path, 'wb') as o_file:
+      o_file.write(tflite_model)
+  except Exception as err:
+    err_msg = str(err)
+    flag_str = 'tensorflow/contrib/lite/toco/tooling_util.cc:1634]'
+    for sub_line in err_msg.split('\\n'):
+      if flag_str in sub_line:
+        sub_strs = sub_line.replace(',', ' ').split()
+        unquant_node_name = sub_strs[sub_strs.index(flag_str) + 2] + ':0'
+        break
+    assert unquant_node_name is not None, 'unable to locate the unquantized node'
+
+  return unquant_node_name
+
+def build_graph(model_helper, unquant_node_names, config, is_train):
+  """Build a graph for training or evaluation.
+
+  Args:
+  * model_helper: model helper with definitions of model & dataset
+  * unquant_node_names: list of unquantized activation node names
+  * config: graph configuration
+  * is_train: insert training-related operations or not
+
+  Returns:
+  * model: dictionary of model-related objects & operations
+  """
+
+  # setup function handles
+  if is_train:
+    build_dataset_fn = model_helper.build_dataset_train
+    forward_fn = model_helper.forward_train
+    create_quant_graph_fn = tf.contrib.quantize.experimental_create_training_graph
+  else:
+    build_dataset_fn = model_helper.build_dataset_eval
+    forward_fn = model_helper.forward_eval
+    create_quant_graph_fn = tf.contrib.quantize.experimental_create_eval_graph
+
+  # build a graph for training or evaluation
+  model = {}
+  with tf.Graph().as_default() as graph:
+    # data input pipeline
+    with tf.variable_scope(config['data_scope']):
+      iterator = build_dataset_fn()
+      inputs, __ = iterator.get_next()
+
+    # model definition - uniform quantized model
+    with tf.variable_scope(config['model_scope']):
+      # obtain outputs from model's forward-pass
+      outputs = forward_fn(inputs)
+      if not isinstance(outputs, dict):
+        outputs_sfmax = tf.nn.softmax(outputs)  # <outputs> is logits
+      else:
+        outputs_sfmax = tf.nn.softmax(outputs['cls_pred'])  # <outputs['cls_pred']> is logits
+
+      # quantize the graph using TensorFlow APIs
+      create_quant_graph_fn(
+        weight_bits=FLAGS.uqtf_weight_bits,
+        activation_bits=FLAGS.uqtf_activation_bits,
+        scope=config['model_scope'])
+
+      # manually insert quantization operations
+      for node_name in unquant_node_names:
+        insert_quant_op(graph, node_name, is_train=is_train)
+
+      # randomly increase each trainable variable's value
+      incr_ops = []
+      for var in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES):
+        incr_ops += [var.assign_add(tf.random.uniform(var.shape))]
+      incr_op = tf.group(incr_ops)
+
+    # add input & output tensors to collections
+    if not isinstance(inputs, dict):
+      tf.add_to_collection(config['input_coll'], inputs)
+    else:
+      tf.add_to_collection(config['input_coll'], inputs['image'])
+    if not isinstance(outputs, dict):
+      tf.add_to_collection(config['output_coll'], outputs)
+    else:
+      tf.add_to_collection(config['output_coll'], outputs['cls_pred'])
+
+    # save the model
+    vars_list = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=config['model_scope'])
+    model['sess'] = create_session()
+    model['saver'] = tf.train.Saver(vars_list)
+    model['init_op'] = tf.variables_initializer(vars_list)
+    model['incr_op'] = incr_op
+
+  return model
+
+def find_unquant_act_nodes(model_helper, data_scope, model_scope, mpi_comm):
+  """Find unquantized activation nodes in the model.
+
+  TensorFlow's quantization-aware training APIs insert quantization operations into the graph,
+    so that model weights can be fine-tuned with quantization error taken into consideration.
+    However, these APIs only insert quantization operations into nodes matching certain topology
+    rules, and some nodes may be left unquantized. When such a model is converted to *.tflite,
+    these unquantized nodes will introduce extra performance loss.
+  Here, we provide a utility function to detect these unquantized nodes before training, so that
+    quantization operations can be inserted. The resulting model can be smoothly exported to a
+    *.tflite model.
+
+  Args:
+  * model_helper: model helper with definitions of model & dataset
+  * data_scope: data scope name
+  * model_scope: model scope name
+  * mpi_comm: MPI communication object
+
+  Returns:
+  * unquant_node_names: list of unquantized activation node names
+  """
+
+  # setup configurations
+  config = {
+    'data_scope': data_scope,
+    'model_scope': model_scope,
+    'input_coll': 'inputs',
+    'output_coll': 'outputs',
+  }
+
+  # obtain the image tensor's name & shape
+  with tf.Graph().as_default():
+    with tf.variable_scope(data_scope):
+      iterator = model_helper.build_dataset_eval()
+      inputs, labels = iterator.get_next()
+      if not isinstance(inputs, dict):
+        images_shape, images_name = inputs.shape, inputs.name
+      else:
+        images_shape, images_name = inputs['image'].shape, inputs['image'].name
+
+  # iteratively check for unquantized nodes
+  unquant_node_names = []
+  while True:
+    # build training & evaluation graphs
+    model_train = build_graph(model_helper, unquant_node_names, config, is_train=True)
+    model_eval = build_graph(model_helper, unquant_node_names, config, is_train=False)
+
+    # initialize a model in the training graph, and then save
+    model_train['sess'].run(model_train['init_op'])
+    model_train['sess'].run(model_train['incr_op'])
+    save_path = model_train['saver'].save(model_train['sess'], FLAGS.uqtf_save_path_probe)
+    tf.logging.info('model saved to ' + save_path)
+
+    # restore a model in the evaluation graph from *.ckpt files, and then save again
+    save_path = tf.train.latest_checkpoint(os.path.dirname(FLAGS.uqtf_save_path_probe))
+    model_eval['saver'].restore(model_eval['sess'], save_path)
+    tf.logging.info('model restored from ' + save_path)
+    save_path = model_eval['saver'].save(model_eval['sess'], FLAGS.uqtf_save_path_probe_eval)
+    tf.logging.info('model saved to ' + save_path)
+
+    # try to export *.tflite models and check for unquantized nodes (if any)
+    unquant_node_name = export_tflite_model(
+      config['input_coll'], config['output_coll'], images_shape, images_name)
+    if unquant_node_name:
+      unquant_node_names += [unquant_node_name]
+      tf.logging.info('node <%s> is not quantized' % unquant_node_name)
+    else:
+      break
+
+  return unquant_node_names
diff --git a/learners/weight_sparsification/learner.py b/learners/weight_sparsification/learner.py
index 482e01e..6944180 100644
--- a/learners/weight_sparsification/learner.py
+++ b/learners/weight_sparsification/learner.py
@@ -25,7 +25,6 @@
 from learners.distillation_helper import DistillationHelper
 from learners.weight_sparsification.pr_optimizer import PROptimizer
 from learners.weight_sparsification.utils import get_maskable_vars
-from utils.lrn_rate_utils import setup_lrn_rate
 from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
 
 FLAGS = tf.app.flags.FLAGS
@@ -82,20 +81,22 @@ def __init__(self, sm_writer, model_helper):
     # define the scope for masks
     self.mask_scope = 'mask'
 
-    # compute the optimal pruning ratios
-    pr_optimizer = PROptimizer(model_helper, self.mpi_comm)
-    if FLAGS.ws_prune_ratio_prtl == 'optimal':
-      if self.is_primary_worker('local'):
-        self.download_model()  # pre-trained model is required
-      self.auto_barrier()
-      tf.logging.info('model files: ' + ', '.join(os.listdir('./models')))
-    self.var_names_n_prune_ratios = pr_optimizer.run()
+    # compute the optimal pruning ratios (only when the execution mode is 'train')
+    if FLAGS.exec_mode == 'train':
+      pr_optimizer = PROptimizer(model_helper, self.mpi_comm)
+      if FLAGS.ws_prune_ratio_prtl == 'optimal':
+        if self.is_primary_worker('local'):
+          self.download_model()  # pre-trained model is required
+        self.auto_barrier()
+        tf.logging.info('model files: ' + ', '.join(os.listdir('./models')))
+      self.var_names_n_prune_ratios = pr_optimizer.run()
 
     # class-dependent initialization
     if FLAGS.enbl_dst:
       self.helper_dst = DistillationHelper(sm_writer, model_helper, self.mpi_comm)
-    self.__build_train()
-    self.__build_eval()
+    if FLAGS.exec_mode == 'train':
+      self.__build_train()  # only when the execution mode is 'train'
+    self.__build_eval()  # needed regardless of the execution mode
 
   def train(self):
     """Train a model and periodically produce checkpoint files."""
@@ -143,7 +144,7 @@ def evaluate(self):
     """Restore a model from the latest checkpoint files and then evaluate it."""
 
     self.__restore_model(is_train=False)
-    nb_iters = int(np.ceil(float(FLAGS.nb_smpls_eval) / FLAGS.batch_size))
+    nb_iters = int(np.ceil(float(FLAGS.nb_smpls_eval) / FLAGS.batch_size_eval))
     eval_rslts = np.zeros((nb_iters, len(self.eval_op)))
     for idx_iter in range(nb_iters):
       eval_rslts[idx_iter] = self.sess_eval.run(self.eval_op)
@@ -185,8 +186,7 @@ def __build_train(self):  # pylint: disable=too-many-locals
 
         # learning rate schedule
         self.global_step = tf.train.get_or_create_global_step()
-        lrn_rate, self.nb_iters_train = setup_lrn_rate(
-          self.global_step, self.model_name, self.dataset_name)
+        lrn_rate, self.nb_iters_train = self.setup_lrn_rate(self.global_step)
 
         # overall pruning ratios of trainable & maskable variables
         pr_trainable = calc_prune_ratio(self.trainable_vars)
@@ -280,10 +280,10 @@ def __build_masks(self):
         var_bkup = tf.get_variable(name, initializer=var.initialized_value(), trainable=False)
 
         # create update operations
-        mask_thres = tf.contrib.distributions.percentile(tf.abs(var), prune_ratio * 100)
         var_bkup_update_op = var_bkup.assign(tf.where(mask > 0.5, var, var_bkup))
         with tf.control_dependencies([var_bkup_update_op]):
-          mask_update_op = mask.assign(tf.cast(tf.abs(var) > mask_thres, tf.float32))
+          mask_thres = tf.contrib.distributions.percentile(tf.abs(var_bkup), prune_ratio * 100)
+          mask_update_op = mask.assign(tf.cast(tf.abs(var_bkup) > mask_thres, tf.float32))
         with tf.control_dependencies([mask_update_op]):
           prune_op = var.assign(var_bkup * mask)
 
diff --git a/main.sh b/main.sh
index b29d9fc..cd43e42 100755
--- a/main.sh
+++ b/main.sh
@@ -10,6 +10,8 @@ mkdir -p ~/.pip/ \
 cat ~/.pip/pip.conf
 
 # install Python packages with Internet access
+pip install tensorflow-gpu==1.12.0
+pip install horovod
 pip install docopt
 pip install hdfs
 pip install scipy
@@ -19,12 +21,12 @@ pip install mpi4py
 
 # add the current directory to PYTHONPATH
 export PYTHONPATH=${PYTHONPATH}:`pwd`
+export LD_LIBRARY_PATH=/opt/ml/disk/local/cuda/lib64:$LD_LIBRARY_PATH
 
 # start TensorBoard
 LOG_DIR=/opt/ml/log
 mkdir -p ${LOG_DIR}
 nohup tensorboard \
-    --path_prefix=/seven-forward-port/${SEVEN_HTTP_FORWARD_PORT}/ \
     --port=${SEVEN_HTTP_FORWARD_PORT} \
     --host=127.0.0.1 \
     --logdir=${LOG_DIR} \
diff --git a/nets/abstract_model_helper.py b/nets/abstract_model_helper.py
index 8e3a0ed..c89cadf 100644
--- a/nets/abstract_model_helper.py
+++ b/nets/abstract_model_helper.py
@@ -30,12 +30,17 @@ class AbstractModelHelper(ABC):
   All functions marked with "@abstractmethod" must be explicitly implemented in the sub-class.
   """
 
-  def __init__(self):
+  def __init__(self, data_format, forward_w_labels=False):
     """Constructor function.
 
     Note: DO NOT create any TF operations here!!!
+
+    Args:
+    * data_format: data format ('channels_last' OR 'channels_first')
     """
-    pass
+
+    self.data_format = data_format
+    self.forward_w_labels = forward_w_labels
 
   @abstractmethod
   def build_dataset_train(self, enbl_trn_val_split):
@@ -62,12 +67,12 @@ def build_dataset_eval(self):
     pass
 
   @abstractmethod
-  def forward_train(self, inputs, data_format):
+  def forward_train(self, inputs, labels=None):
     """Forward computation at training.
 
     Args:
     * inputs: inputs to the network's forward pass
-    * data_format: data format ('channels_last' OR 'channels_first')
+    * labels: ground-truth labels
 
     Returns:
     * outputs: outputs from the network's forward pass
@@ -75,12 +80,11 @@ def forward_train(self, inputs, data_format):
     pass
 
   @abstractmethod
-  def forward_eval(self, inputs, data_format):
+  def forward_eval(self, inputs):
     """Forward computation at evaluation.
 
     Args:
     * inputs: inputs to the network's forward pass
-    * data_format: data format ('channels_last' OR 'channels_first')
 
     Returns:
     * outputs: outputs from the network's forward pass
@@ -102,6 +106,36 @@ def calc_loss(self, labels, outputs, trainable_vars):
     """
     pass
 
+  @abstractmethod
+  def setup_lrn_rate(self, global_step):
+    """Setup the learning rate (and number of training iterations).
+
+    Args:
+    * global_step: training iteration counter
+
+    Returns:
+    * lrn_rate: learning rate
+    * nb_iters: number of training iterations
+    """
+    pass
+
+  def warm_start(self, sess):
+    """Initialize the model for warm-start.
+
+    Args:
+    * sess: TensorFlow session
+    """
+    pass
+
+  def dump_n_eval(self, outputs, action):
+    """Dump the model's outputs to files and evaluate.
+
+    Args:
+    * outputs: outputs from the network's forward pass
+    * action: 'init' | 'dump' | 'eval'
+    """
+    pass
+
   @property
   @abstractmethod
   def model_name(self):
diff --git a/nets/faster_rcnn_at_pascalvoc.py b/nets/faster_rcnn_at_pascalvoc.py
new file mode 100644
index 0000000..f1ea6e7
--- /dev/null
+++ b/nets/faster_rcnn_at_pascalvoc.py
@@ -0,0 +1,676 @@
+import os
+import shutil
+import numpy as np
+import tensorflow as tf
+
+
+
+from nets.abstract_model_helper import AbstractModelHelper
+from datasets.pascalvoc_dataset import PascalVocDataset
+from utils.misc_utils import is_primary_worker
+
+import tensorflow.contrib.slim as slim
+
+from utils.external.faster_rcnn_tensorflow.preprocessing.faster_rcnn_preprocessing import preprocess_image
+
+from utils.external.faster_rcnn_tensorflow.net import resnet_faster_rcnn as resnet
+from utils.external.faster_rcnn_tensorflow.net import mobilenet_v2_faster_rcnn as mobilenet_v2
+
+from utils.external.faster_rcnn_tensorflow.utility import anchor_utils, encode_and_decode, boxes_utils
+from utils.external.faster_rcnn_tensorflow.configs import cfgs
+from utils.external.faster_rcnn_tensorflow.utility import loss_utils as losses
+from utils.external.faster_rcnn_tensorflow.utility import show_box_in_tensor
+
+from utils.external.faster_rcnn_tensorflow.utility.proposal_opr import postprocess_rpn_proposals
+from utils.external.faster_rcnn_tensorflow.utility.anchor_target_layer_without_boxweight import anchor_target_layer
+from utils.external.faster_rcnn_tensorflow.utility.proposal_target_layer import proposal_target_layer
+
+from utils.external.ssd_tensorflow.voc_eval import do_python_eval
+
+# model related configuration
+tf.app.flags.DEFINE_integer('nb_iters_train', 200000, 'The number of training iterations.')
+tf.app.flags.DEFINE_float('momentum', 0.9, 'momentum coefficient')
+# evaluation related configuration
+tf.app.flags.DEFINE_string('outputs_dump_dir', './f_rcnn_outputs/', 'outputs\'s dumping directory')
+# checkpoint related configuration
+tf.app.flags.DEFINE_string('backbone_ckpt_dir', './backbone_models/',
+                           'The backbone model\'s (e.g. ResNet-50) checkpoint directory')
+FLAGS = tf.app.flags.FLAGS
+
+def build_base_network(inputs, is_train):
+  if cfgs.NET_NAME.startswith('resnet_v1'):
+    return resnet.resnet_base(inputs, scope_name=cfgs.NET_NAME, is_training=is_train)
+  elif cfgs.NET_NAME.startswith('MobilenetV2'):
+    return mobilenet_v2.mobilenetv2_base(inputs, is_training=is_train)
+  else:
+    raise ValueError('only resnet_v1 and MobilenetV2 backbones are supported')
+
+def build_fastrcnn(is_train, feature_to_cropped, rois, img_shape):
+  with tf.variable_scope('Fast-RCNN'):
+    # 5. ROI Pooling
+    with tf.variable_scope('rois_pooling'):
+      pooled_features = roi_pooling(feature_maps=feature_to_cropped, rois=rois, img_shape=img_shape)
+
+    # 6. run inference on RoIs in Fast-RCNN to obtain fc_flatten features
+    if cfgs.NET_NAME.startswith('resnet'):
+      fc_flatten = resnet.restnet_head(input=pooled_features,
+                                        is_training=is_train,
+                                        scope_name=cfgs.NET_NAME)
+    elif cfgs.NET_NAME.startswith('Mobile'):
+      fc_flatten = mobilenet_v2.mobilenetv2_head(inputs=pooled_features,
+                                                   is_training=is_train)
+    else:
+      raise NotImplementedError('only support resnet and mobilenet')
+
+      # 7. cls and reg in Fast-RCNN
+      # tf.variance_scaling_initializer()
+      # tf.VarianceScaling()
+    with slim.arg_scope([slim.fully_connected], weights_regularizer=slim.l2_regularizer(cfgs.WEIGHT_DECAY)):
+      cls_score = slim.fully_connected(fc_flatten,
+                                       num_outputs=FLAGS.nb_classes,
+                                       weights_initializer=slim.variance_scaling_initializer(factor=1.0,
+                                                                                             mode='FAN_AVG',
+                                                                                             uniform=True),
+                                       activation_fn=None, trainable=is_train,
+                                       scope='cls_fc')
+
+      bbox_pred = slim.fully_connected(fc_flatten,
+                                       num_outputs=(FLAGS.nb_classes) * 4,
+                                       weights_initializer=slim.variance_scaling_initializer(factor=1.0,
+                                                                                             mode='FAN_AVG',
+                                                                                             uniform=True),
+                                       activation_fn=None, trainable=is_train,
+                                       scope='reg_fc')
+      # for convenience, it also produces (cls_num + 1) bboxes per RoI
+
+      cls_score = tf.reshape(cls_score, [-1, FLAGS.nb_classes])
+      bbox_pred = tf.reshape(bbox_pred, [-1, 4 * (FLAGS.nb_classes)])
+
+  return bbox_pred, cls_score
+
+def postprocess_fastrcnn(is_train, rois, bbox_ppred, scores, img_shape):
+  """
+  :param rois:[-1, 4]
+  :param bbox_ppred: [-1, (cfgs.CLASS_NUM+1) * 4]
+  :param scores: [-1, FLAGS.nb_classes]
+  :return:
+  """
+
+  with tf.name_scope('postprocess_fastrcnn'):
+    rois = tf.stop_gradient(rois)
+    scores = tf.stop_gradient(scores)
+    bbox_ppred = tf.reshape(bbox_ppred, [-1, FLAGS.nb_classes, 4])
+    bbox_ppred = tf.stop_gradient(bbox_ppred)
+
+    bbox_pred_list = tf.unstack(bbox_ppred, axis=1)
+    score_list = tf.unstack(scores, axis=1)
+
+    allclasses_boxes = []
+    allclasses_scores = []
+    categories = []
+    for i in range(1, cfgs.CLASS_NUM+1):
+      # 1. decode boxes in each class
+      tmp_encoded_box = bbox_pred_list[i]
+      tmp_score = score_list[i]
+      tmp_decoded_boxes = encode_and_decode.decode_boxes(encoded_boxes=tmp_encoded_box,
+                                                         reference_boxes=rois,
+                                                         scale_factors=cfgs.ROI_SCALE_FACTORS)
+      # tmp_decoded_boxes = encode_and_decode.decode_boxes(boxes=rois,
+      #                                                    deltas=tmp_encoded_box,
+      #                                                    scale_factor=cfgs.ROI_SCALE_FACTORS)
+
+      # 2. clip to img boundaries
+      tmp_decoded_boxes = boxes_utils.clip_boxes_to_img_boundaries(decode_boxes=tmp_decoded_boxes,
+                                                                   img_shape=img_shape)
+
+      # 3. NMS
+      keep = tf.image.non_max_suppression(
+          boxes=tmp_decoded_boxes,
+          scores=tmp_score,
+          max_output_size=cfgs.FAST_RCNN_NMS_MAX_BOXES_PER_CLASS,
+          iou_threshold=cfgs.FAST_RCNN_NMS_IOU_THRESHOLD)
+
+      perclass_boxes = tf.gather(tmp_decoded_boxes, keep)
+      perclass_scores = tf.gather(tmp_score, keep)
+
+      allclasses_boxes.append(perclass_boxes)
+      allclasses_scores.append(perclass_scores)
+      categories.append(tf.ones_like(perclass_scores) * i)
+
+    final_boxes = tf.concat(allclasses_boxes, axis=0)
+    final_scores = tf.concat(allclasses_scores, axis=0)
+    final_category = tf.concat(categories, axis=0)
+
+    if is_train:
+      """
+      During training, detections are shown in TensorBoard, so low-scoring ones are filtered out here.
+      """
+      kept_indices = tf.reshape(tf.where(tf.greater_equal(final_scores, cfgs.SHOW_SCORE_THRSHOLD)), [-1])
+
+      final_boxes = tf.gather(final_boxes, kept_indices)
+      final_scores = tf.gather(final_scores, kept_indices)
+      final_category = tf.gather(final_category, kept_indices)
+
+  return final_boxes, final_scores, final_category
+
+def roi_pooling(feature_maps, rois, img_shape):
+  '''
+  Use RoI warping (crop-and-resize followed by max-pooling) as RoI pooling.
+  :param feature_maps: feature map to crop
+  :param rois: shape is [-1, 4]. [x1, y1, x2, y2]
+  :return:
+  '''
+  with tf.variable_scope('ROI_Warping'):
+    img_h, img_w = tf.cast(img_shape[1], tf.float32), tf.cast(img_shape[2], tf.float32)
+    N = tf.shape(rois)[0]
+    x1, y1, x2, y2 = tf.unstack(rois, axis=1)
+
+    normalized_x1 = x1 / img_w
+    normalized_x2 = x2 / img_w
+    normalized_y1 = y1 / img_h
+    normalized_y2 = y2 / img_h
+
+    normalized_rois = tf.transpose(
+        tf.stack([normalized_y1, normalized_x1, normalized_y2, normalized_x2]), name='get_normalized_rois')
+
+    normalized_rois = tf.stop_gradient(normalized_rois)
+
+    cropped_roi_features = tf.image.crop_and_resize(feature_maps, normalized_rois,
+                                                    box_ind=tf.zeros(shape=[N, ],
+                                                                     dtype=tf.int32),
+                                                    crop_size=[cfgs.ROI_SIZE, cfgs.ROI_SIZE],
+                                                    name='CROP_AND_RESIZE'
+                                                    )
+    roi_features = slim.max_pool2d(cropped_roi_features,
+                                  [cfgs.ROI_POOL_KERNEL_SIZE, cfgs.ROI_POOL_KERNEL_SIZE],
+                                  stride=cfgs.ROI_POOL_KERNEL_SIZE)
+
+  return roi_features
+
+def add_roi_batch_img_smry(img, rois, labels):
+  positive_roi_indices = tf.reshape(tf.where(tf.greater_equal(labels, 1)), [-1])
+  negative_roi_indices = tf.reshape(tf.where(tf.equal(labels, 0)), [-1])
+
+  pos_roi = tf.gather(rois, positive_roi_indices)
+  neg_roi = tf.gather(rois, negative_roi_indices)
+
+  pos_in_img = show_box_in_tensor.only_draw_boxes(img_batch=img,
+                                                  boxes=pos_roi)
+  neg_in_img = show_box_in_tensor.only_draw_boxes(img_batch=img,
+                                                  boxes=neg_roi)
+  tf.summary.image('pos_rois', pos_in_img)
+  tf.summary.image('neg_rois', neg_in_img)
+
+def add_anchor_img_smry(img, anchors, labels):
+  positive_anchor_indices = tf.reshape(tf.where(tf.greater_equal(labels, 1)), [-1])
+  negative_anchor_indices = tf.reshape(tf.where(tf.equal(labels, 0)), [-1])
+
+  positive_anchor = tf.gather(anchors, positive_anchor_indices)
+  negative_anchor = tf.gather(anchors, negative_anchor_indices)
+
+  pos_in_img = show_box_in_tensor.only_draw_boxes(img_batch=img,
+                                                  boxes=positive_anchor)
+  neg_in_img = show_box_in_tensor.only_draw_boxes(img_batch=img,
+                                                  boxes=negative_anchor)
+  tf.summary.image('positive_anchor', pos_in_img)
+  tf.summary.image('negative_anchors', neg_in_img)
+
+def forward_fn(inputs_dict, is_train):
+  """Forward pass function.
+
+    Args:
+    * inputs_dict: dictionary with two entries:
+      - 'inputs': dictionary of input tensors ('image', 'filename' & 'shape')
+      - 'objects': ground-truth annotations (only used when is_train is True)
+    * is_train: whether to use the forward pass with training operations inserted
+
+    Returns:
+    * outputs: a dictionary of output tensors
+    """
+  inputs = inputs_dict['inputs']
+  objects = inputs_dict['objects']
+
+  images = inputs['image']
+  filenames = inputs['filename']
+  shapes = inputs['shape']
+
+  if is_train:
+    flags, gtboxes_batch = tf.split(objects, [1, 5], axis=-1)
+    flags = tf.squeeze(tf.cast(flags, dtype=tf.int32), axis=-1)
+    index = tf.where(flags > 0)
+    gtboxes_batch = tf.gather_nd(gtboxes_batch, index)
+
+  with slim.arg_scope(
+      [slim.conv2d, slim.conv2d_in_plane, slim.conv2d_transpose, slim.separable_conv2d, slim.fully_connected],
+      weights_regularizer=tf.contrib.layers.l2_regularizer(cfgs.WEIGHT_DECAY),
+      biases_regularizer=tf.no_regularizer,
+      biases_initializer=tf.constant_initializer(0.0)):
+    img_shape = tf.shape(images)
+    # 1. build base network
+    feature_to_cropped = build_base_network(images, is_train)
+    # 2. build rpn
+    with tf.variable_scope('build_rpn',
+                           regularizer=slim.l2_regularizer(cfgs.WEIGHT_DECAY)):
+      rpn_conv3x3 = slim.conv2d(
+        feature_to_cropped, 512, [3, 3],
+        trainable=is_train, weights_initializer=cfgs.INITIALIZER,
+        activation_fn=tf.nn.relu,
+        scope='rpn_conv/3x3')
+      num_anchors_per_location = len(cfgs.ANCHOR_SCALES) * len(cfgs.ANCHOR_RATIOS)
+      rpn_cls_score = slim.conv2d(rpn_conv3x3, num_anchors_per_location * 2, [1, 1], stride=1,
+                                  trainable=is_train, weights_initializer=cfgs.INITIALIZER,
+                                  activation_fn=None,
+                                  scope='rpn_cls_score')
+      rpn_box_pred = slim.conv2d(rpn_conv3x3, num_anchors_per_location * 4, [1, 1], stride=1,
+                                 trainable=is_train, weights_initializer=cfgs.BBOX_INITIALIZER,
+                                 activation_fn=None,
+                                 scope='rpn_bbox_pred')
+      rpn_box_pred = tf.reshape(rpn_box_pred, [-1, 4])
+      rpn_cls_score = tf.reshape(rpn_cls_score, [-1, 2])
+      rpn_cls_prob = slim.softmax(rpn_cls_score, scope='rpn_cls_prob')
+
+    # 3. generate_anchors
+    featuremap_height, featuremap_width = tf.shape(feature_to_cropped)[1], tf.shape(feature_to_cropped)[2]
+    featuremap_height = tf.cast(featuremap_height, tf.float32)
+    featuremap_width = tf.cast(featuremap_width, tf.float32)
+
+    anchors = anchor_utils.make_anchors(base_anchor_size=cfgs.BASE_ANCHOR_SIZE_LIST[0],
+                                        anchor_scales=cfgs.ANCHOR_SCALES, anchor_ratios=cfgs.ANCHOR_RATIOS,
+                                        featuremap_height=featuremap_height,
+                                        featuremap_width=featuremap_width,
+                                        stride=cfgs.ANCHOR_STRIDE,
+                                        name="make_anchors_forRPN")
+
+    # 4. postprocess rpn proposals. such as: decode, clip, NMS
+    with tf.variable_scope('postprocess_RPN'):
+      # rpn_cls_prob = tf.reshape(rpn_cls_score, [-1, 2])
+      # rpn_cls_prob = slim.softmax(rpn_cls_prob, scope='rpn_cls_prob')
+      # rpn_box_pred = tf.reshape(rpn_box_pred, [-1, 4])
+      rois, roi_scores = postprocess_rpn_proposals(rpn_bbox_pred=rpn_box_pred,
+                                                   rpn_cls_prob=rpn_cls_prob,
+                                                   img_shape=img_shape,
+                                                   anchors=anchors,
+                                                   is_training=is_train)
+      # rois shape [-1, 4]
+      # +++++++++++++++++++++++++++++++++++++add img smry+++++++++++++++++++++++++++++++++++++++++++++++++++++++
+      if is_train:
+        rois_in_img = show_box_in_tensor.draw_boxes_with_scores(img_batch=images,
+                                                                boxes=rois,
+                                                                scores=roi_scores)
+        tf.summary.image('all_rpn_rois', rois_in_img)
+
+        score_gre_05 = tf.reshape(tf.where(tf.greater_equal(roi_scores, 0.5)), [-1])
+        score_gre_05_rois = tf.gather(rois, score_gre_05)
+        score_gre_05_score = tf.gather(roi_scores, score_gre_05)
+        score_gre_05_in_img = show_box_in_tensor.draw_boxes_with_scores(img_batch=images,
+                                                                        boxes=score_gre_05_rois,
+                                                                        scores=score_gre_05_score)
+        tf.summary.image('score_greater_05_rois', score_gre_05_in_img)
+      # ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+    if is_train:
+      with tf.variable_scope('sample_anchors_minibatch'):
+        rpn_labels, rpn_bbox_targets = \
+          tf.py_func(
+            anchor_target_layer,
+            [gtboxes_batch, img_shape, anchors],
+            [tf.float32, tf.float32])
+        rpn_bbox_targets = tf.reshape(rpn_bbox_targets, [-1, 4])
+        rpn_labels = tf.to_int32(rpn_labels, name="to_int32")
+        rpn_labels = tf.reshape(rpn_labels, [-1])
+        add_anchor_img_smry(images, anchors, rpn_labels)
+
+      # --------------------------------------add smry----------------------------------------------------------------
+      rpn_cls_category = tf.argmax(rpn_cls_prob, axis=1)
+      kept_rpppn = tf.reshape(tf.where(tf.not_equal(rpn_labels, -1)), [-1])
+      rpn_cls_category = tf.gather(rpn_cls_category, kept_rpppn)
+      acc = tf.reduce_mean(tf.to_float(tf.equal(rpn_cls_category, tf.to_int64(tf.gather(rpn_labels, kept_rpppn)))))
+      with tf.control_dependencies([rpn_labels]):
+        with tf.variable_scope('sample_RCNN_minibatch'):
+          rois, labels, bbox_targets = \
+            tf.py_func(proposal_target_layer,
+                       [rois, gtboxes_batch],
+                       [tf.float32, tf.float32, tf.float32])
+          rois = tf.reshape(rois, [-1, 4])
+          labels = tf.to_int32(labels)
+          labels = tf.reshape(labels, [-1])
+          bbox_targets = tf.reshape(bbox_targets, [-1, 4 * (FLAGS.nb_classes)])
+          add_roi_batch_img_smry(images, rois, labels)
+
+    # -------------------------------------------------------------------------------------------------------------#
+    #                                            Fast-RCNN                                                         #
+    # -------------------------------------------------------------------------------------------------------------#
+    # 5. build Fast-RCNN
+    # rois = tf.Print(rois, [tf.shape(rois)], 'rois shape', summarize=10)
+    bbox_pred, cls_score = build_fastrcnn(is_train=is_train, feature_to_cropped=feature_to_cropped, rois=rois,
+                                          img_shape=img_shape)
+    # bbox_pred shape: [-1, 4*(cls_num+1)].
+    # cls_score shape: [-1, cls_num+1]
+    cls_prob = slim.softmax(cls_score, 'cls_prob')
+
+    # ----------------------------------------------add smry-------------------------------------------------------
+    if is_train:
+      cls_category = tf.argmax(cls_prob, axis=1)
+      fast_acc = tf.reduce_mean(tf.to_float(tf.equal(cls_category, tf.to_int64(labels))))
+
+    #  6. postprocess_fastrcnn
+    final_bboxes, final_scores, final_categories = postprocess_fastrcnn(is_train=is_train, rois=rois, bbox_ppred=bbox_pred,
+                                                                    scores=cls_prob, img_shape=img_shape)
+    if is_train and cfgs.ADD_BOX_IN_TENSORBOARD:
+      gtboxes_in_img = show_box_in_tensor.draw_boxes_with_categories(img_batch=images,
+                                                                     boxes=gtboxes_batch[:, :-1],
+                                                                     labels=gtboxes_batch[:, -1])
+      detections_in_img = show_box_in_tensor.draw_boxes_with_categories_and_scores(img_batch=images,
+                                                                                   boxes=final_bboxes,
+                                                                                   labels=final_categories,
+                                                                                   scores=final_scores)
+      tf.summary.image('Compare/final_detection', detections_in_img)
+      tf.summary.image('Compare/gtboxes', gtboxes_in_img)
+  if is_train:
+    predictions = None
+    forward_dict = { "rpn_box_pred": rpn_box_pred,
+                     "rpn_bbox_targets": rpn_bbox_targets,
+                     "rpn_cls_score": rpn_cls_score,
+                     "rpn_labels": rpn_labels,
+                     "bbox_pred": bbox_pred,
+                     "bbox_targets": bbox_targets,
+                     "cls_score": cls_score,
+                     "labels": labels }
+    metrics = {'ACC/rpn_accuracy': acc, 'ACC/fast_acc': fast_acc}
+  else:
+    forward_dict = {}
+    predictions = {'filename': filenames,
+                   'shape': shapes,
+                   'resized_shape': img_shape,
+                   'detected_boxes': final_bboxes,
+                   'detected_scores': final_scores,
+                   'detected_categories': final_categories
+                   }
+
+    metrics = {}
+  outputs = {'forward_dict': forward_dict, 'predictions': predictions, 'metrics': metrics}
+  return outputs
+
+def calc_loss_fn(objects, outputs, trainable_vars):
+  """Calculate the loss function's value.
+
+    Args:
+    * objects: one tensor with all the annotations packed together
+      (unused here; the sampled targets are already part of <outputs>)
+    * outputs: a dictionary of output tensors
+      (the 'forward_dict' entry produced by forward_fn() in training mode)
+    * trainable_vars: list of trainable variables (unused here)
+
+    Returns:
+    * total_loss: sum of the RPN and Fast-RCNN losses
+      (the individual loss terms are also logged as scalar summaries)
+    """
+  # extract output tensors
+  rpn_box_pred = outputs['rpn_box_pred']
+  rpn_bbox_targets = outputs['rpn_bbox_targets']
+  rpn_cls_score = outputs['rpn_cls_score']
+  rpn_labels = outputs['rpn_labels']
+  bbox_pred = outputs['bbox_pred']
+  bbox_targets = outputs['bbox_targets']
+  cls_score = outputs['cls_score']
+  labels = outputs['labels']
+  with tf.variable_scope('build_loss') as sc:
+    with tf.variable_scope('rpn_loss'):
+      rpn_bbox_loss = losses.smooth_l1_loss_rpn(bbox_pred=rpn_box_pred,
+                                                bbox_targets=rpn_bbox_targets,
+                                                label=rpn_labels,
+                                                sigma=cfgs.RPN_SIGMA)
+      rpn_select = tf.reshape(tf.where(tf.not_equal(rpn_labels, -1)), [-1])
+      rpn_cls_score = tf.reshape(tf.gather(rpn_cls_score, rpn_select), [-1, 2])
+      rpn_labels = tf.reshape(tf.gather(rpn_labels, rpn_select), [-1])
+      rpn_cls_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=rpn_cls_score,
+                                                                                   labels=rpn_labels))
+      rpn_cls_loss = rpn_cls_loss * cfgs.RPN_CLASSIFICATION_LOSS_WEIGHT
+      rpn_loc_loss = rpn_bbox_loss * cfgs.RPN_LOCATION_LOSS_WEIGHT
+
+    with tf.variable_scope('FastRCNN_loss'):
+      if not cfgs.FAST_RCNN_MINIBATCH_SIZE == -1:
+        bbox_loss = losses.smooth_l1_loss_rcnn(bbox_pred=bbox_pred,
+                                               bbox_targets=bbox_targets,
+                                               label=labels,
+                                               num_classes=FLAGS.nb_classes,
+                                               sigma=cfgs.FASTRCNN_SIGMA)
+        cls_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
+          logits=cls_score,
+          labels=labels))  # because RoIs were already sampled before
+      else:
+        """
+        applying OHEM here
+        """
+        print(20 * "@@")
+        print("@@" + 10 * " " + "TRAIN WITH OHEM ...")
+        print(20 * "@@")
+        cls_loss, bbox_loss = losses.sum_ohem_loss(cls_score=cls_score,
+                                                   label=labels,
+                                                   bbox_targets=bbox_targets,
+                                                   bbox_pred=bbox_pred,
+                                                   num_ohem_samples=256,
+                                                   num_classes=FLAGS.nb_classes)
+      fastrcnn_cls_loss = cls_loss * cfgs.FAST_RCNN_CLASSIFICATION_LOSS_WEIGHT
+      fastrcnn_loc_loss = bbox_loss * cfgs.FAST_RCNN_LOCATION_LOSS_WEIGHT
+  rpn_total_loss = rpn_loc_loss + rpn_cls_loss
+  fastrcnn_total_loss = fastrcnn_cls_loss + fastrcnn_loc_loss
+  total_loss = rpn_total_loss + fastrcnn_total_loss
+
+  # ---------------------------------------------------------------------------------------------------add summary
+  tf.summary.scalar('RPN_LOSS/cls_loss', rpn_cls_loss)
+  tf.summary.scalar('RPN_LOSS/location_loss', rpn_loc_loss)
+  tf.summary.scalar('RPN_LOSS/rpn_total_loss', rpn_total_loss)
+  tf.summary.scalar('FAST_LOSS/fastrcnn_cls_loss', fastrcnn_cls_loss)
+  tf.summary.scalar('FAST_LOSS/fastrcnn_location_loss', fastrcnn_loc_loss)
+  tf.summary.scalar('FAST_LOSS/fastrcnn_total_loss', fastrcnn_total_loss)
+  return total_loss
+
+class ModelHelper(AbstractModelHelper):
+  """Model helper for creating a Faster R-CNN model for the Pascal VOC dataset."""
+
+  def __init__(self, data_format='channels_last'):
+    """Constructor function."""
+
+    # class-independent initialization
+    super(ModelHelper, self).__init__(data_format, forward_w_labels=True)
+
+    # initialize training & evaluation subsets
+    self.dataset_train = PascalVocDataset(preprocess_fn=preprocess_image, is_train=True)
+    self.dataset_eval = PascalVocDataset(preprocess_fn=preprocess_image, is_train=False)
+
+    # setup hyper-parameters
+    self.batch_size = None  # track the most recently-used one
+    self.model_scope = "model"
+
+  def build_dataset_train(self, enbl_trn_val_split=False):
+    """Build the data subset for training, usually with data augmentation."""
+    return self.dataset_train.build()
+
+  def build_dataset_eval(self):
+    """Build the data subset for evaluation, usually without data augmentation."""
+    return self.dataset_eval.build()
+
+  def forward_train(self, inputs, objects, data_format='channels_last'):
+    """Forward computation at training."""
+    inputs_dict = {'inputs': inputs, 'objects': objects}
+    outputs = forward_fn(inputs_dict, True)
+    self.vars = slim.get_model_variables()
+    return outputs
+
+  def forward_eval(self, inputs, data_format='channels_last'):
+    """Forward computation at evaluation."""
+    inputs_dict = {'inputs': inputs, 'objects': None}
+    outputs = forward_fn(inputs_dict, False)
+    return outputs
+
+  def calc_loss(self, objects, outputs, trainable_vars):
+    """Calculate loss (and some extra evaluation metrics)."""
+    forward_dict = outputs['forward_dict']
+    metrics = outputs['metrics']
+    loss = tf.constant(0, dtype=tf.float32)
+    if forward_dict != {}:
+      """only build loss at training"""
+      loss = calc_loss_fn(objects, forward_dict, trainable_vars)
+    return loss, metrics
+
+  def setup_lrn_rate(self, global_step):
+    """Setup the learning rate (and number of training iterations)."""
+
+    lrn_rate = tf.train.piecewise_constant(global_step,
+                                     boundaries=[np.int64(cfgs.DECAY_STEP[0]), np.int64(cfgs.DECAY_STEP[1])],
+                                     values=[cfgs.LR, cfgs.LR / 10., cfgs.LR / 100.])
+    nb_iters = FLAGS.nb_iters_train
+
+    tf.summary.scalar('lrn_rate', lrn_rate)
+
+    return lrn_rate, nb_iters
+
+  def warm_start(self, sess):
+    """Initialize the model for warm-start.
+
+    Description:
+    * We use a pre-trained ImageNet classification model to initialize the backbone of the Faster
+      R-CNN model. If the model's own checkpoint files already exist, they are restored instead.
+    """
+    # restore from the latest checkpoint if one already exists
+    checkpoint_path = tf.train.latest_checkpoint(os.path.dirname(FLAGS.save_path))
+    model_variables = self.vars
+    if checkpoint_path is not None:
+      if cfgs.RESTORE_FROM_RPN:
+        print('___restore from rpn___')
+
+        restore_variables = [var for var in model_variables if not var.name.startswith(self.model_scope + 'FastRCNN_Head')] + \
+                            [slim.get_or_create_global_step()]
+        for var in restore_variables:
+          print(var.name)
+        saver = tf.train.Saver(restore_variables)
+        saver.build()
+        saver.restore(sess, checkpoint_path)
+      else:
+        print("___restore from trained model___")
+        for var in model_variables:
+          print(var.name)
+        saver = tf.train.Saver(model_variables)
+        saver.build()
+        saver.restore(sess, checkpoint_path)
+      print("model restore from :", checkpoint_path)
+    else:
+      if cfgs.NET_NAME.startswith("resnet"):
+        weights_name = cfgs.NET_NAME
+      elif cfgs.NET_NAME.startswith("MobilenetV2"):
+        weights_name = "mobilenet/mobilenet_v2_1.0_224"
+      else:
+        raise Exception('net name must be one of [resnet_v1_101, resnet_v1_50, MobilenetV2]')
+      checkpoint_path = os.path.join(FLAGS.backbone_ckpt_dir, weights_name + '.ckpt')
+      print("model restore from pretrained model, path is :", checkpoint_path)
+      # for var in model_variables:
+      #     print(var.name)
+      # print(20*"__++__++__")
+
+      def name_in_ckpt_rpn(var):
+        '''
+        model/resnet_v1_50/block4 -->resnet_v1_50/block4
+        model/MobilenetV2/** -- > MobilenetV2 **
+        :param var:
+        :return:
+        '''
+        return '/'.join(var.op.name.split('/')[1:])
+
+      def name_in_ckpt_fastrcnn_head(var):
+        '''
+        model/Fast-RCNN/resnet_v1_50/block4 -->resnet_v1_50/block4
+        model/Fast-RCNN/MobilenetV2/** -- > MobilenetV2 **
+        :param var:
+        :return:
+        '''
+        return '/'.join(var.op.name.split('/')[2:])
+      nameInCkpt_Var_dict = {}
+      for var in model_variables:
+        if var.name.startswith(self.model_scope + '/Fast-RCNN/' + cfgs.NET_NAME):  # +'/block4'
+          var_name_in_ckpt = name_in_ckpt_fastrcnn_head(var)
+          nameInCkpt_Var_dict[var_name_in_ckpt] = var
+        else:
+          if var.name.startswith(self.model_scope + '/' + cfgs.NET_NAME):
+            var_name_in_ckpt = name_in_ckpt_rpn(var)
+            nameInCkpt_Var_dict[var_name_in_ckpt] = var
+          else:
+            continue
+      restore_variables = nameInCkpt_Var_dict
+      if not restore_variables:
+        tf.logging.warning('no variables to restore.')
+        return
+      for key, item in restore_variables.items():
+        print("var_in_graph: ", item.name)
+        print("var_in_ckpt: ", key)
+        print(20 * "___")
+      # restore variables from checkpoint files
+      saver = tf.train.Saver(restore_variables, reshape=False)
+      saver.build()
+      saver.restore(sess, checkpoint_path)
+      print(20 * "****")
+      print("restored from ImageNet-pretrained weights")
+    print('model restored')
+
+
+  def dump_n_eval(self, outputs, action):
+    """Dump the model's outputs to files and evaluate."""
+    if not is_primary_worker('global'):
+      return
+    if action == 'init':
+      if os.path.exists(FLAGS.outputs_dump_dir):
+        shutil.rmtree(FLAGS.outputs_dump_dir)
+      os.mkdir(FLAGS.outputs_dump_dir)
+
+    elif action == 'dump':
+      filename = outputs['predictions']['filename'][0].decode('utf8')[:-4]
+      raw_shape = outputs['predictions']['shape'][0]
+      resized_shape = outputs['predictions']['resized_shape']
+
+      detected_boxes = outputs['predictions']['detected_boxes']
+      detected_scores = outputs['predictions']['detected_scores']
+      detected_categories = outputs['predictions']['detected_categories']
+
+
+      raw_h, raw_w = raw_shape[0], raw_shape[1]
+      resized_h, resized_w = resized_shape[1], resized_shape[2]
+
+      xmin, ymin, xmax, ymax = detected_boxes[:, 0], detected_boxes[:, 1], \
+                               detected_boxes[:, 2], detected_boxes[:, 3]
+
+      xmin = xmin * raw_w / resized_w
+      xmax = xmax * raw_w / resized_w
+      ymin = ymin * raw_h / resized_h
+      ymax = ymax * raw_h / resized_h
+
+      boxes = np.transpose(np.stack([xmin, ymin, xmax, ymax]))
+      dets = np.hstack((detected_categories.reshape(-1, 1),
+                        detected_scores.reshape(-1, 1),
+                        boxes))
+
+      for cls_id in range(1, FLAGS.nb_classes):
+        with open(os.path.join(FLAGS.outputs_dump_dir, 'results_%d.txt' % cls_id), 'a') as o_file:
+          this_cls_detections = dets[dets[:, 0] == cls_id]
+          if this_cls_detections.shape[0] == 0:
+            continue  # this class has no detections in this image
+          for a_det in this_cls_detections:
+            o_file.write('{:s} {:.3f} {:.1f} {:.1f} {:.1f} {:.1f}\n'.
+                    format(filename, a_det[1],
+                           a_det[2], a_det[3],
+                           a_det[4], a_det[5]))  # that is [img_name, score, xmin, ymin, xmax, ymax]
+
+    elif action == 'eval':
+      do_python_eval(os.path.join(self.dataset_eval.data_dir, 'test'), FLAGS.outputs_dump_dir)
+    else:
+      raise ValueError('unrecognized action in dump_n_eval(): ' + action)
+
+  @property
+  def model_name(self):
+    """Model's name."""
+    return cfgs.NET_NAME
+
+  @property
+  def dataset_name(self):
+    """Dataset's name."""
+    return 'pascalvoc'
+
+
diff --git a/nets/faster_rcnn_at_pascalvoc_run.py b/nets/faster_rcnn_at_pascalvoc_run.py
new file mode 100644
index 0000000..97863df
--- /dev/null
+++ b/nets/faster_rcnn_at_pascalvoc_run.py
@@ -0,0 +1,69 @@
+# Tencent is pleased to support the open source community by making PocketFlow available.
+#
+# Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Execution script for Faster R-CNN models on the Pascal VOC dataset."""
+
+import traceback
+import tensorflow as tf
+
+from nets.faster_rcnn_at_pascalvoc import ModelHelper
+from learners.learner_utils import create_learner
+
+FLAGS = tf.app.flags.FLAGS
+
+tf.app.flags.DEFINE_string('log_dir', './logs', 'logging directory')
+tf.app.flags.DEFINE_boolean('enbl_multi_gpu', False, 'enable multi-GPU training')
+tf.app.flags.DEFINE_string('learner', 'full-prec', 'learner\'s name')
+tf.app.flags.DEFINE_string('exec_mode', 'train', 'execution mode: train / eval')
+tf.app.flags.DEFINE_boolean('debug', False, 'debugging information')
+
+def main(unused_argv):
+  """Main entry."""
+
+  try:
+    # setup the TF logging routine
+    if FLAGS.debug:
+      tf.logging.set_verbosity(tf.logging.DEBUG)
+    else:
+      tf.logging.set_verbosity(tf.logging.INFO)
+    sm_writer = tf.summary.FileWriter(FLAGS.log_dir)
+
+    # display FLAGS's values
+    tf.logging.info('FLAGS:')
+    for key, value in FLAGS.flag_values_dict().items():
+      tf.logging.info('{}: {}'.format(key, value))
+
+    # build the model helper & learner
+    model_helper = ModelHelper()
+    learner = create_learner(sm_writer, model_helper)
+
+    # execute the learner
+    if FLAGS.exec_mode == 'train':
+      learner.train()
+    elif FLAGS.exec_mode == 'eval':
+      learner.download_model()
+      learner.evaluate()
+    else:
+      raise ValueError('unrecognized execution mode: ' + FLAGS.exec_mode)
+
+    # exit normally
+    return 0
+  except ValueError:
+    traceback.print_exc()
+    return 1  # exit with errors
+
+if __name__ == '__main__':
+  tf.app.run()
diff --git a/nets/lenet_at_cifar10.py b/nets/lenet_at_cifar10.py
index 631d223..b077230 100644
--- a/nets/lenet_at_cifar10.py
+++ b/nets/lenet_at_cifar10.py
@@ -20,9 +20,12 @@
 
 from nets.abstract_model_helper import AbstractModelHelper
 from datasets.cifar10_dataset import Cifar10Dataset
+from utils.lrn_rate_utils import setup_lrn_rate_piecewise_constant
+from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
 
 FLAGS = tf.app.flags.FLAGS
 
+tf.app.flags.DEFINE_float('nb_epochs_rat', 1.0, 'scaling ratio for the # of training epochs')
 tf.app.flags.DEFINE_float('lrn_rate_init', 1e-2, 'initial learning rate')
 tf.app.flags.DEFINE_float('batch_size_norm', 128, 'normalization factor of batch size')
 tf.app.flags.DEFINE_float('momentum', 0.9, 'momentum coefficient')
@@ -39,6 +42,10 @@ def forward_fn(inputs, data_format):
   * inputs: outputs from the network's forward pass
   """
 
+  # convert inputs from channels_last (NHWC) to channels_first (NCHW)
+  if data_format == 'channels_first':
+    inputs = tf.transpose(inputs, [0, 3, 1, 2])
+
   # conv1
   inputs = tf.layers.conv2d(inputs, 32, [5, 5], data_format=data_format, name='conv1')
   inputs = tf.nn.relu(inputs, name='relu1')
@@ -63,11 +70,11 @@ def forward_fn(inputs, data_format):
 class ModelHelper(AbstractModelHelper):
   """Model helper for creating a LeNet-like model for the CIFAR-10 dataset."""
 
-  def __init__(self):
+  def __init__(self, data_format='channels_last'):
     """Constructor function."""
 
     # class-independent initialization
-    super(ModelHelper, self).__init__()
+    super(ModelHelper, self).__init__(data_format)
 
     # initialize training & evaluation subsets
     self.dataset_train = Cifar10Dataset(is_train=True)
@@ -83,15 +90,15 @@ def build_dataset_eval(self):
 
     return self.dataset_eval.build()
 
-  def forward_train(self, inputs, data_format='channels_last'):
+  def forward_train(self, inputs):
     """Forward computation at training."""
 
-    return forward_fn(inputs, data_format)
+    return forward_fn(inputs, self.data_format)
 
-  def forward_eval(self, inputs, data_format='channels_last'):
+  def forward_eval(self, inputs):
     """Forward computation at evaluation."""
 
-    return forward_fn(inputs, data_format)
+    return forward_fn(inputs, self.data_format)
 
   def calc_loss(self, labels, outputs, trainable_vars):
     """Calculate loss (and some extra evaluation metrics)."""
@@ -104,6 +111,18 @@ def calc_loss(self, labels, outputs, trainable_vars):
 
     return loss, metrics
 
+  def setup_lrn_rate(self, global_step):
+    """Setup the learning rate (and number of training iterations)."""
+
+    nb_epochs = 250
+    idxs_epoch = [100, 150, 200]
+    decay_rates = [1.0, 0.1, 0.01, 0.001]
+    batch_size = FLAGS.batch_size * (1 if not FLAGS.enbl_multi_gpu else mgw.size())
+    lrn_rate = setup_lrn_rate_piecewise_constant(global_step, batch_size, idxs_epoch, decay_rates)
+    nb_iters = int(FLAGS.nb_smpls_train * nb_epochs * FLAGS.nb_epochs_rat / batch_size)
+
+    return lrn_rate, nb_iters
+
   @property
   def model_name(self):
     """Model's name."""
diff --git a/nets/mobilenet_at_ilsvrc12.py b/nets/mobilenet_at_ilsvrc12.py
index 4d47015..4dfaf51 100644
--- a/nets/mobilenet_at_ilsvrc12.py
+++ b/nets/mobilenet_at_ilsvrc12.py
@@ -23,11 +23,15 @@
 from datasets.ilsvrc12_dataset import Ilsvrc12Dataset
 from utils.external import mobilenet_v1 as MobileNetV1
 from utils.external import mobilenet_v2 as MobileNetV2
+from utils.lrn_rate_utils import setup_lrn_rate_piecewise_constant
+from utils.lrn_rate_utils import setup_lrn_rate_exponential_decay
+from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
 
 FLAGS = tf.app.flags.FLAGS
 
 tf.app.flags.DEFINE_integer('mobilenet_version', 1, 'MobileNet\'s version (1 or 2)')
 tf.app.flags.DEFINE_float('mobilenet_depth_mult', 1.0, 'MobileNet\'s depth multiplier')
+tf.app.flags.DEFINE_float('nb_epochs_rat', 1.0, 'scaling ratio for the # of training epochs')
 tf.app.flags.DEFINE_float('lrn_rate_init', 0.045, 'initial learning rate')
 tf.app.flags.DEFINE_float('batch_size_norm', 96, 'normalization factor of batch size')
 tf.app.flags.DEFINE_float('momentum', 0.9, 'momentum coefficient')
@@ -65,11 +69,12 @@ def forward_fn(inputs, is_train):
 class ModelHelper(AbstractModelHelper):
   """Model helper for creating a MobileNet model for the ILSVRC-12 dataset."""
 
-  def __init__(self):
+  def __init__(self, data_format='channels_last'):
     """Constructor function."""
 
     # class-independent initialization
-    super(ModelHelper, self).__init__()
+    assert data_format == 'channels_last', 'MobileNet only supports \'channels_last\' data format'
+    super(ModelHelper, self).__init__(data_format)
 
     # initialize training & evaluation subsets
     self.dataset_train = Ilsvrc12Dataset(is_train=True)
@@ -85,18 +90,14 @@ def build_dataset_eval(self):
 
     return self.dataset_eval.build()
 
-  def forward_train(self, inputs, data_format='channels_last'):
+  def forward_train(self, inputs):
     """Forward computation at training."""
 
-    assert data_format == 'channels_last', 'MobileNet only supports \'channels_last\' data format'
-
     return forward_fn(inputs, is_train=True)
 
-  def forward_eval(self, inputs, data_format='channels_last'):
+  def forward_eval(self, inputs):
     """Forward computation at evaluation."""
 
-    assert data_format == 'channels_last', 'MobileNet only supports \'channels_last\' data format'
-
     return forward_fn(inputs, is_train=False)
 
   def calc_loss(self, labels, outputs, trainable_vars):
@@ -113,6 +114,27 @@ def calc_loss(self, labels, outputs, trainable_vars):
 
     return loss, metrics
 
+  def setup_lrn_rate(self, global_step):
+    """Setup the learning rate (and number of training iterations)."""
+
+    batch_size = FLAGS.batch_size * (1 if not FLAGS.enbl_multi_gpu else mgw.size())
+    if FLAGS.mobilenet_version == 1:
+      nb_epochs = 100
+      idxs_epoch = [30, 60, 80, 90]
+      decay_rates = [1.0, 0.1, 0.01, 0.001, 0.0001]
+      lrn_rate = setup_lrn_rate_piecewise_constant(global_step, batch_size, idxs_epoch, decay_rates)
+      nb_iters = int(FLAGS.nb_smpls_train * nb_epochs * FLAGS.nb_epochs_rat / batch_size)
+    elif FLAGS.mobilenet_version == 2:
+      nb_epochs = 412
+      epoch_step = 2.5
+      decay_rate = 0.98 ** epoch_step  # which is better, 0.98 OR (0.98 ** epoch_step)?
+      lrn_rate = setup_lrn_rate_exponential_decay(global_step, batch_size, epoch_step, decay_rate)
+      nb_iters = int(FLAGS.nb_smpls_train * nb_epochs * FLAGS.nb_epochs_rat / batch_size)
+    else:
+      raise ValueError('invalid MobileNet version: {}'.format(FLAGS.mobilenet_version))
+
+    return lrn_rate, nb_iters
+
   @property
   def model_name(self):
     """Model's name."""
diff --git a/nets/resnet_at_cifar10.py b/nets/resnet_at_cifar10.py
index 1765212..ff689e8 100644
--- a/nets/resnet_at_cifar10.py
+++ b/nets/resnet_at_cifar10.py
@@ -21,10 +21,13 @@
 from nets.abstract_model_helper import AbstractModelHelper
 from datasets.cifar10_dataset import Cifar10Dataset
 from utils.external import resnet_model as ResNet
+from utils.lrn_rate_utils import setup_lrn_rate_piecewise_constant
+from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
 
 FLAGS = tf.app.flags.FLAGS
 
 tf.app.flags.DEFINE_integer('resnet_size', 20, '# of layers in the ResNet model')
+tf.app.flags.DEFINE_float('nb_epochs_rat', 1.0, 'scaling ratio for the # of training epochs')
 tf.app.flags.DEFINE_float('lrn_rate_init', 1e-1, 'initial learning rate')
 tf.app.flags.DEFINE_float('batch_size_norm', 128, 'normalization factor of batch size')
 tf.app.flags.DEFINE_float('momentum', 0.9, 'momentum coefficient')
@@ -65,11 +68,11 @@ def forward_fn(inputs, is_train, data_format):
 class ModelHelper(AbstractModelHelper):
   """Model helper for creating a ResNet model for the CIFAR-10 dataset."""
 
-  def __init__(self):
+  def __init__(self, data_format='channels_last'):
     """Constructor function."""
 
     # class-independent initialization
-    super(ModelHelper, self).__init__()
+    super(ModelHelper, self).__init__(data_format)
 
     # initialize training & evaluation subsets
     self.dataset_train = Cifar10Dataset(is_train=True)
@@ -85,15 +88,15 @@ def build_dataset_eval(self):
 
     return self.dataset_eval.build()
 
-  def forward_train(self, inputs, data_format='channels_last'):
+  def forward_train(self, inputs):
     """Forward computation at training."""
 
-    return forward_fn(inputs, is_train=True, data_format=data_format)
+    return forward_fn(inputs, is_train=True, data_format=self.data_format)
 
-  def forward_eval(self, inputs, data_format='channels_last'):
+  def forward_eval(self, inputs):
     """Forward computation at evaluation."""
 
-    return forward_fn(inputs, is_train=False, data_format=data_format)
+    return forward_fn(inputs, is_train=False, data_format=self.data_format)
 
   def calc_loss(self, labels, outputs, trainable_vars):
     """Calculate loss (and some extra evaluation metrics)."""
@@ -108,6 +111,18 @@ def calc_loss(self, labels, outputs, trainable_vars):
 
     return loss, metrics
 
+  def setup_lrn_rate(self, global_step):
+    """Setup the learning rate (and number of training iterations)."""
+
+    nb_epochs = 250
+    idxs_epoch = [100, 150, 200]
+    decay_rates = [1.0, 0.1, 0.01, 0.001]
+    batch_size = FLAGS.batch_size * (1 if not FLAGS.enbl_multi_gpu else mgw.size())
+    lrn_rate = setup_lrn_rate_piecewise_constant(global_step, batch_size, idxs_epoch, decay_rates)
+    nb_iters = int(FLAGS.nb_smpls_train * nb_epochs * FLAGS.nb_epochs_rat / batch_size)
+
+    return lrn_rate, nb_iters
+
   @property
   def model_name(self):
     """Model's name."""
diff --git a/nets/resnet_at_ilsvrc12.py b/nets/resnet_at_ilsvrc12.py
index 02e4619..ea9efa6 100644
--- a/nets/resnet_at_ilsvrc12.py
+++ b/nets/resnet_at_ilsvrc12.py
@@ -21,10 +21,13 @@
 from nets.abstract_model_helper import AbstractModelHelper
 from datasets.ilsvrc12_dataset import Ilsvrc12Dataset
 from utils.external import resnet_model as ResNet
+from utils.lrn_rate_utils import setup_lrn_rate_piecewise_constant
+from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
 
 FLAGS = tf.app.flags.FLAGS
 
 tf.app.flags.DEFINE_integer('resnet_size', 18, '# of layers in the ResNet model')
+tf.app.flags.DEFINE_float('nb_epochs_rat', 1.0, 'scaling ratio for the # of training epochs')
 tf.app.flags.DEFINE_float('lrn_rate_init', 1e-1, 'initial learning rate')
 tf.app.flags.DEFINE_float('batch_size_norm', 256, 'normalization factor of batch size')
 tf.app.flags.DEFINE_float('momentum', 0.9, 'momentum coefficient')
@@ -93,11 +96,11 @@ def forward_fn(inputs, is_train, data_format):
 class ModelHelper(AbstractModelHelper):
   """Model helper for creating a ResNet model for the ILSVRC-12 dataset."""
 
-  def __init__(self):
+  def __init__(self, data_format='channels_last'):
     """Constructor function."""
 
     # class-independent initialization
-    super(ModelHelper, self).__init__()
+    super(ModelHelper, self).__init__(data_format)
 
     # initialize training & evaluation subsets
     self.dataset_train = Ilsvrc12Dataset(is_train=True)
@@ -113,15 +116,15 @@ def build_dataset_eval(self):
 
     return self.dataset_eval.build()
 
-  def forward_train(self, inputs, data_format='channels_last'):
+  def forward_train(self, inputs):
     """Forward computation at training."""
 
-    return forward_fn(inputs, is_train=True, data_format=data_format)
+    return forward_fn(inputs, is_train=True, data_format=self.data_format)
 
-  def forward_eval(self, inputs, data_format='channels_last'):
+  def forward_eval(self, inputs):
     """Forward computation at evaluation."""
 
-    return forward_fn(inputs, is_train=False, data_format=data_format)
+    return forward_fn(inputs, is_train=False, data_format=self.data_format)
 
   def calc_loss(self, labels, outputs, trainable_vars):
     """Calculate loss (and some extra evaluation metrics)."""
@@ -137,6 +140,18 @@ def calc_loss(self, labels, outputs, trainable_vars):
 
     return loss, metrics
 
+  def setup_lrn_rate(self, global_step):
+    """Setup the learning rate (and number of training iterations)."""
+
+    nb_epochs = 100
+    idxs_epoch = [30, 60, 80, 90]
+    decay_rates = [1.0, 0.1, 0.01, 0.001, 0.0001]
+    batch_size = FLAGS.batch_size * (1 if not FLAGS.enbl_multi_gpu else mgw.size())
+    lrn_rate = setup_lrn_rate_piecewise_constant(global_step, batch_size, idxs_epoch, decay_rates)
+    nb_iters = int(FLAGS.nb_smpls_train * nb_epochs * FLAGS.nb_epochs_rat / batch_size)
+
+    return lrn_rate, nb_iters
+
   @property
   def model_name(self):
     """Model's name."""
diff --git a/nets/vgg_at_pascalvoc.py b/nets/vgg_at_pascalvoc.py
new file mode 100644
index 0000000..59708cc
--- /dev/null
+++ b/nets/vgg_at_pascalvoc.py
@@ -0,0 +1,595 @@
+# Tencent is pleased to support the open source community by making PocketFlow available.
+#
+# Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Model helper for creating a VGG model for the Pascal VOC dataset."""
+
+import os
+import shutil
+import numpy as np
+import tensorflow as tf
+from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
+
+from nets.abstract_model_helper import AbstractModelHelper
+from datasets.pascalvoc_dataset import PascalVocDataset
+from utils.misc_utils import is_primary_worker
+
+from utils.external.ssd_tensorflow.preprocessing.ssd_preprocessing import preprocess_image
+from utils.external.ssd_tensorflow.net import ssd_net
+from utils.external.ssd_tensorflow.utility import anchor_manipulator
+from utils.external.ssd_tensorflow.utility import scaffolds
+from utils.external.ssd_tensorflow.voc_eval import do_python_eval
+
+FLAGS = tf.app.flags.FLAGS
+
+# model related configuration
+tf.app.flags.DEFINE_integer('nb_iters_train', 120000, 'The number of training iterations.')
+tf.app.flags.DEFINE_float('negative_ratio', 3.0, 'Negative ratio in the loss function.')
+tf.app.flags.DEFINE_float('match_threshold', 0.5, 'Matching threshold in the loss function.')
+tf.app.flags.DEFINE_float('neg_threshold', 0.5,
+                          'Matching threshold for the negative examples in the loss function.')
+tf.app.flags.DEFINE_float('select_threshold', 0.01,
+                          'Class-specific confidence score threshold for selecting a box.')
+tf.app.flags.DEFINE_float('min_size', 0.03, 'The min size of bboxes to keep.')
+tf.app.flags.DEFINE_float('nms_threshold', 0.45, 'Matching threshold in NMS algorithm.')
+tf.app.flags.DEFINE_integer('nms_topk', 200, 'Total number of objects to keep after NMS.')
+tf.app.flags.DEFINE_integer('keep_topk', 400,
+                            'Total number of objects to keep for each image before NMS.')
+
+# optimizer related configuration
+tf.app.flags.DEFINE_float('lrn_rate_init', 1e-3, 'The initial learning rate.')
+tf.app.flags.DEFINE_float('lrn_rate_min', 1e-6, 'The minimal learning rate')
+tf.app.flags.DEFINE_string('lrn_rate_dcy_bnds', '500, 80000, 100000',
+                           'Learning rate decay boundaries.')
+tf.app.flags.DEFINE_string('lrn_rate_dcy_rates', '0.1, 1, 0.1, 0.01',
+                           'Learning rate decay rates for each segment between boundaries')
+tf.app.flags.DEFINE_float('momentum', 0.9, 'momentum coefficient')
+tf.app.flags.DEFINE_integer('nb_iters_cls_wmup', 10000,
+                            'The number of iterations for warming-up the classification loss')
+tf.app.flags.DEFINE_float('loss_w_dcy', 5e-4, 'weight decaying loss\'s coefficient')
+
+# checkpoint related configuration
+tf.app.flags.DEFINE_string('backbone_ckpt_dir', './backbone_models/',
+                           'The backbone model\'s (e.g. VGG-16) checkpoint directory')
+tf.app.flags.DEFINE_string('backbone_model_scope', 'vgg_16',
+                           'Model scope in the checkpoint. None if the same as the trained model.')
+tf.app.flags.DEFINE_string('model_scope', 'ssd300',
+                           'Model scope name used to replace the name_scope in checkpoint.')
+tf.app.flags.DEFINE_string('warm_start_excl_scopes',
+                           'ssd300/multibox_head, ssd300/additional_layers, ssd300/conv4_3_scale',
+                           'List of scopes to be excluded when restoring from a backbone model')
+tf.app.flags.DEFINE_boolean('ignore_missing_vars', True,
+                            'Whether to ignore missing variables when restoring a checkpoint.')
+
+# evaluation related configuration
+tf.app.flags.DEFINE_string('outputs_dump_dir', './ssd_outputs/', 'outputs\'s dumping directory')
+
+def parse_comma_list(args):
+  """Convert a comma-separated list to a list of floating-point numbers."""
+
+  return [float(s.strip()) for s in args.split(',')]
+
+def setup_anchor_info():
+  """Setup the anchor bounding boxes' information."""
+
+  # get all anchor bounding boxes
+  out_shape = [FLAGS.image_size] * 2
+  anchor_creator = anchor_manipulator.AnchorCreator(
+    out_shape,
+    layers_shapes = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
+    anchor_scales = [(0.1,), (0.2,), (0.375,), (0.55,), (0.725,), (0.9,)],
+    extra_anchor_scales = [(0.1414,), (0.2739,), (0.4541,), (0.6315,), (0.8078,), (0.9836,)],
+    anchor_ratios = [(1., 2., .5), (1., 2., 3., .5, 0.3333), (1., 2., 3., .5, 0.3333),
+                     (1., 2., 3., .5, 0.3333), (1., 2., .5), (1., 2., .5)],
+    layer_steps = [8, 16, 32, 64, 100, 300])
+  all_anchors, all_num_anchors_depth, all_num_anchors_spatial = anchor_creator.get_all_anchors()
+
+  # construct the anchor bounding boxes' encoder & decoder
+  num_anchors_per_layer = []
+  for ind in range(len(all_anchors)):
+    num_anchors_per_layer.append(all_num_anchors_depth[ind] * all_num_anchors_spatial[ind])
+  anchor_encoder = anchor_manipulator.AnchorEncoder(
+    allowed_borders=[1.0] * 6, positive_threshold=FLAGS.match_threshold,
+    ignore_threshold=FLAGS.neg_threshold, prior_scaling=[0.1, 0.1, 0.2, 0.2])
+
+  # pack all the information into one dictionary
+  anchor_info = {
+    'init_fn': lambda: anchor_encoder.init_all_anchors(
+      all_anchors, all_num_anchors_depth, all_num_anchors_spatial),
+    'encode_fn': lambda glabels_, gbboxes_: anchor_encoder.encode_all_anchors(
+      glabels_, gbboxes_, all_anchors, all_num_anchors_depth, all_num_anchors_spatial),
+    'decode_fn': lambda pred: anchor_encoder.decode_all_anchors(pred, num_anchors_per_layer),
+    'num_anchors_per_layer': num_anchors_per_layer,
+    'all_num_anchors_depth': all_num_anchors_depth,
+  }
+
+  return anchor_info
+
+def modified_smooth_l1(
+    bbox_pred, bbox_targets, bbox_inside_weights=1., bbox_outside_weights=1., sigma=1.):
+  """Modified smooth L1-loss.
+
+  Description:
+  * ResultLoss = outside_weights * SmoothL1(inside_weights * (bbox_pred - bbox_targets))
+  * SmoothL1(x) = 0.5 * (sigma * x)^2,    if |x| < 1 / sigma^2
+                  |x| - 0.5 / sigma^2,    otherwise
+  """
+
+  with tf.name_scope('smooth_l1', values=[bbox_pred, bbox_targets]):
+    sigma2 = sigma * sigma
+    inside_mul = tf.multiply(bbox_inside_weights, tf.subtract(bbox_pred, bbox_targets))
+    smooth_l1_sign = tf.cast(tf.less(tf.abs(inside_mul), 1.0 / sigma2), tf.float32)
+    smooth_l1_option1 = tf.multiply(tf.multiply(inside_mul, inside_mul), 0.5 * sigma2)
+    smooth_l1_option2 = tf.subtract(tf.abs(inside_mul), 0.5 / sigma2)
+    smooth_l1_result = tf.add(tf.multiply(smooth_l1_option1, smooth_l1_sign),
+                              tf.multiply(smooth_l1_option2, tf.abs(smooth_l1_sign - 1.0)))
+    outside_mul = tf.multiply(bbox_outside_weights, smooth_l1_result)
+
+  return outside_mul
+
+def select_bboxes(scores_pred, bboxes_pred, num_classes, select_threshold):
+  selected_bboxes = {}
+  selected_scores = {}
+  with tf.name_scope('select_bboxes', values=[scores_pred, bboxes_pred]):
+    for class_ind in range(1, num_classes):
+      class_scores = scores_pred[:, class_ind]
+      select_mask = class_scores > select_threshold
+      select_mask = tf.cast(select_mask, tf.float32)
+      selected_bboxes[class_ind] = tf.multiply(bboxes_pred, tf.expand_dims(select_mask, axis=-1))
+      selected_scores[class_ind] = tf.multiply(class_scores, select_mask)
+
+  return selected_bboxes, selected_scores
+
+def clip_bboxes(ymin, xmin, ymax, xmax, name):
+  with tf.name_scope(name, 'clip_bboxes', [ymin, xmin, ymax, xmax]):
+    ymin = tf.maximum(ymin, 0.)
+    xmin = tf.maximum(xmin, 0.)
+    ymax = tf.minimum(ymax, 1.)
+    xmax = tf.minimum(xmax, 1.)
+    ymin = tf.minimum(ymin, ymax)
+    xmin = tf.minimum(xmin, xmax)
+
+  return ymin, xmin, ymax, xmax
+
+def filter_bboxes(scores_pred, ymin, xmin, ymax, xmax, min_size, name):
+  with tf.name_scope(name, 'filter_bboxes', [scores_pred, ymin, xmin, ymax, xmax]):
+    width = xmax - xmin
+    height = ymax - ymin
+    filter_mask = tf.logical_and(width > min_size, height > min_size)
+    filter_mask = tf.cast(filter_mask, tf.float32)
+
+  return tf.multiply(ymin, filter_mask), tf.multiply(xmin, filter_mask), \
+    tf.multiply(ymax, filter_mask), tf.multiply(xmax, filter_mask), \
+    tf.multiply(scores_pred, filter_mask)
+
+def sort_bboxes(scores_pred, ymin, xmin, ymax, xmax, keep_topk, name):
+  with tf.name_scope(name, 'sort_bboxes', [scores_pred, ymin, xmin, ymax, xmax]):
+    cur_bboxes = tf.shape(scores_pred)[0]
+    scores, idxes = tf.nn.top_k(scores_pred, k=tf.minimum(keep_topk, cur_bboxes), sorted=True)
+    ymin, xmin, ymax, xmax = \
+      tf.gather(ymin, idxes), tf.gather(xmin, idxes), tf.gather(ymax, idxes), tf.gather(xmax, idxes)
+    paddings_scores = \
+      tf.expand_dims(tf.stack([0, tf.maximum(keep_topk-cur_bboxes, 0)], axis=0), axis=0)
+
+  return tf.pad(ymin, paddings_scores, "CONSTANT"), tf.pad(xmin, paddings_scores, "CONSTANT"),\
+    tf.pad(ymax, paddings_scores, "CONSTANT"), tf.pad(xmax, paddings_scores, "CONSTANT"),\
+    tf.pad(scores, paddings_scores, "CONSTANT")
+
+def nms_bboxes(scores_pred, bboxes_pred, nms_topk, nms_threshold, name):
+  with tf.name_scope(name, 'nms_bboxes', [scores_pred, bboxes_pred]):
+    idxes = tf.image.non_max_suppression(bboxes_pred, scores_pred, nms_topk, nms_threshold)
+
+  return tf.gather(scores_pred, idxes), tf.gather(bboxes_pred, idxes)
+
+def parse_by_class(cls_pred, bboxes_pred, num_classes,
+                   select_threshold, min_size, keep_topk, nms_topk, nms_threshold):
+  with tf.name_scope('select_bboxes', values=[cls_pred, bboxes_pred]):
+    scores_pred = tf.nn.softmax(cls_pred)
+    selected_bboxes, selected_scores = \
+      select_bboxes(scores_pred, bboxes_pred, num_classes, select_threshold)
+    for class_ind in range(1, num_classes):
+      ymin, xmin, ymax, xmax = tf.unstack(selected_bboxes[class_ind], 4, axis=-1)
+      ymin, xmin, ymax, xmax = \
+        clip_bboxes(ymin, xmin, ymax, xmax, 'clip_bboxes_{}'.format(class_ind))
+      ymin, xmin, ymax, xmax, selected_scores[class_ind] = filter_bboxes(
+        selected_scores[class_ind], ymin, xmin, ymax, xmax,
+        min_size, 'filter_bboxes_{}'.format(class_ind))
+      ymin, xmin, ymax, xmax, selected_scores[class_ind] = sort_bboxes(
+        selected_scores[class_ind], ymin, xmin, ymax, xmax,
+        keep_topk, 'sort_bboxes_{}'.format(class_ind))
+      selected_bboxes[class_ind] = tf.stack([ymin, xmin, ymax, xmax], axis=-1)
+      selected_scores[class_ind], selected_bboxes[class_ind] = nms_bboxes(
+        selected_scores[class_ind], selected_bboxes[class_ind],
+        nms_topk, nms_threshold, 'nms_bboxes_{}'.format(class_ind))
+
+  return selected_bboxes, selected_scores
+
+def forward_fn(inputs, is_train, data_format, anchor_info):
+  """Forward pass function.
+
+  Args:
+  * inputs: input tensor to the network's forward pass
+  * is_train: whether to use the forward pass with training operations inserted
+  * data_format: data format ('channels_last' OR 'channels_first')
+  * anchor_info: anchor bounding boxes' information
+
+  Returns:
+  * outputs: a dictionary of output tensors
+  """
+
+  tf.logging.info('building forward with is_train = {}'.format(is_train))
+
+  # extract input tensors & anchor bounding boxes' information
+  images = inputs['image']
+  filenames = inputs['filename']
+  shapes = inputs['shape']
+  decode_fn = anchor_info['decode_fn']
+  all_num_anchors_depth = anchor_info['all_num_anchors_depth']
+
+  # initialize anchor bounding boxes
+  anchor_info['init_fn']()
+
+  # compute output tensors
+  with tf.variable_scope(FLAGS.model_scope, values=[images], reuse=tf.AUTO_REUSE):
+    # obtain the current model scope
+    model_scope = tf.get_default_graph().get_name_scope()
+
+    # obtain predictions for localization & classification
+    backbone = ssd_net.VGG16Backbone(data_format)
+    feature_layers = backbone.forward(images, training=is_train)
+    loc_pred, cls_pred = ssd_net.multibox_head(
+      feature_layers, FLAGS.nb_classes, all_num_anchors_depth, data_format=data_format)
+    if data_format == 'channels_first':
+      cls_pred = [tf.transpose(pred, [0, 2, 3, 1]) for pred in cls_pred]
+      loc_pred = [tf.transpose(pred, [0, 2, 3, 1]) for pred in loc_pred]
+
+    # flatten predictions
+    def reshape_fn(preds, nb_dims):
+      preds = [tf.reshape(pred, [tf.shape(images)[0], -1, nb_dims]) for pred in preds]
+      preds = tf.concat(preds, axis=1)
+      preds = tf.reshape(preds, [-1, nb_dims])
+      return preds
+    cls_pred = reshape_fn(cls_pred, FLAGS.nb_classes)
+    loc_pred = reshape_fn(loc_pred, 4)
+
+    # obtain per-class predictions on bounding boxes and scores
+    if is_train:
+      predictions = None  # tf.no_op()
+    else:
+      bboxes_pred = decode_fn(loc_pred)  # evaluation batch size is 1
+      bboxes_pred = tf.concat(bboxes_pred, axis=0)
+      selected_bboxes, selected_scores = parse_by_class(
+        cls_pred, bboxes_pred, FLAGS.nb_classes, FLAGS.select_threshold,
+        FLAGS.min_size, FLAGS.keep_topk, FLAGS.nms_topk, FLAGS.nms_threshold)
+      predictions = {'filename': filenames, 'shape': shapes}
+      for idx_cls in range(1, FLAGS.nb_classes):
+        predictions['scores_%d' % idx_cls] = tf.expand_dims(selected_scores[idx_cls], axis=0)
+        predictions['bboxes_%d' % idx_cls] = tf.expand_dims(selected_bboxes[idx_cls], axis=0)
+
+  # pack all the output tensors together
+  outputs = {'cls_pred': cls_pred, 'loc_pred': loc_pred, 'predictions': predictions}
+
+  return outputs, model_scope
+
+def calc_loss_fn(objects, outputs, trainable_vars, anchor_info, batch_size):
+  """Calculate the loss function's value.
+
+  Args:
+  * objects: one tensor with all the annotations packed together
+  * outputs: a dictionary of output tensors
+  * trainable_vars: list of trainable variables
+  * anchor_info: anchor bounding boxes' information
+  * batch_size: batch size
+
+  Returns:
+  * loss: loss function's value
+  * metrics: dictionary of extra evaluation metrics
+  """
+
+  # extract output tensors
+  #batch_size = FLAGS.batch_size
+  cls_pred = outputs['cls_pred']
+  loc_pred = outputs['loc_pred']
+
+  # extract anchor bounding boxes' information
+  encode_fn = anchor_info['encode_fn']
+  decode_fn = anchor_info['decode_fn']
+  num_anchors_per_layer = anchor_info['num_anchors_per_layer']
+  all_num_anchors_depth = anchor_info['all_num_anchors_depth']
+
+  # extract target values & predicted localization results
+  def encode_objects_n_decode_loc_pred(objects_n_loc_pred):
+    objects = objects_n_loc_pred[0]
+    loc_pred = objects_n_loc_pred[1]
+    flags, bboxes, labels = tf.split(objects, [1, 4, 1], axis=-1)
+    flags = tf.squeeze(tf.cast(flags, dtype=tf.int64), axis=-1)
+    labels = tf.squeeze(tf.cast(labels, dtype=tf.int64), axis=-1)
+    index = tf.where(flags > 0)
+    loc, cls, scr = encode_fn(tf.gather_nd(labels, index), tf.gather_nd(bboxes, index))
+    bbox = decode_fn(loc_pred)
+    return loc, cls, scr, bbox
+
+  # post-forward operations
+  with tf.control_dependencies([cls_pred, loc_pred]):
+    with tf.name_scope('post_forward'):
+      # obtain target values & localization predictions
+      loc_targets, cls_targets, match_scores, bboxes_pred = tf.map_fn(
+        encode_objects_n_decode_loc_pred,
+        (tf.reshape(objects, [batch_size, -1, 6]), tf.reshape(loc_pred, [batch_size, -1, 4])),
+        dtype=(tf.float32, tf.int64, tf.float32, [tf.float32] * len(num_anchors_per_layer)),
+        back_prop=False)
+      flatten_loc_targets = tf.reshape(loc_targets, [-1, 4])
+      flatten_cls_targets = tf.reshape(cls_targets, [-1])
+      flatten_match_scores = tf.reshape(match_scores, [-1])
+      bboxes_pred = [tf.reshape(preds, [-1, 4]) for preds in bboxes_pred]
+      bboxes_pred = tf.concat(bboxes_pred, axis=0)
+
+      # each positive example has one label
+      positive_mask = flatten_cls_targets > 0
+      n_positives = tf.count_nonzero(positive_mask)
+      batch_n_positives = tf.count_nonzero(cls_targets, -1)
+      batch_negtive_mask = tf.equal(cls_targets, 0)
+      batch_n_negtives = tf.count_nonzero(batch_negtive_mask, -1)
+      batch_n_neg_select = tf.cast(
+        FLAGS.negative_ratio * tf.cast(batch_n_positives, tf.float32), tf.int32)
+      batch_n_neg_select = tf.minimum(batch_n_neg_select, tf.cast(batch_n_negtives, tf.int32))
+
+      # hard negative mining for classification
+      predictions_for_bg = tf.nn.softmax(
+        tf.reshape(cls_pred, [batch_size, -1, FLAGS.nb_classes]))[:, :, 0]
+      prob_for_negtives = tf.where(batch_negtive_mask,
+                                   0. - predictions_for_bg,
+                                   0. - tf.ones_like(predictions_for_bg))
+      topk_prob_for_bg, _ = tf.nn.top_k(prob_for_negtives, k=tf.shape(prob_for_negtives)[1])
+      score_at_k = tf.gather_nd(topk_prob_for_bg,
+                                tf.stack([tf.range(batch_size), batch_n_neg_select - 1], axis=-1))
+      selected_neg_mask = prob_for_negtives >= tf.expand_dims(score_at_k, axis=-1)
+
+      # include both selected negative and all positive examples
+      final_mask = tf.stop_gradient(tf.logical_or(
+        tf.reshape(tf.logical_and(batch_negtive_mask, selected_neg_mask), [-1]), positive_mask))
+      total_examples = tf.count_nonzero(final_mask)
+
+      cls_pred = tf.boolean_mask(cls_pred, final_mask)
+      loc_pred = tf.boolean_mask(loc_pred, tf.stop_gradient(positive_mask))
+      flatten_cls_targets = tf.boolean_mask(
+        tf.clip_by_value(flatten_cls_targets, 0, FLAGS.nb_classes), final_mask)
+      flatten_loc_targets = tf.stop_gradient(tf.boolean_mask(flatten_loc_targets, positive_mask))
+
+      # final predictions & classification accuracy
+      predictions = {
+        'classes': tf.argmax(cls_pred, axis=-1),
+        'probabilities': tf.reduce_max(tf.nn.softmax(cls_pred, name='softmax_tensor'), axis=-1),
+        'loc_predict': bboxes_pred,
+      }
+      accuracy = tf.reduce_mean(
+        tf.cast(tf.equal(flatten_cls_targets, predictions['classes']), tf.float32))
+      metrics = {'accuracy': accuracy}
+
+  # cross-entropy loss
+  ce_loss = (FLAGS.negative_ratio + 1.) * \
+    tf.losses.sparse_softmax_cross_entropy(flatten_cls_targets, cls_pred)
+  tf.identity(ce_loss, name='ce_loss')
+  tf.summary.scalar('ce_loss', ce_loss)
+
+  # localization loss
+  loc_loss = tf.reduce_mean(
+    tf.reduce_sum(modified_smooth_l1(loc_pred, flatten_loc_targets, sigma=1.), axis=-1))
+  tf.identity(loc_loss, name='loc_loss')
+  tf.summary.scalar('loc_loss', loc_loss)
+
+  # L2-regularization loss
+  l2_loss_list = []
+  for var in trainable_vars:
+    if '_bn' not in var.name:
+      if 'conv4_3_scale' not in var.name:
+        l2_loss_list.append(tf.nn.l2_loss(var))
+      else:
+        l2_loss_list.append(tf.nn.l2_loss(var) * 0.1)
+  l2_loss = tf.add_n(l2_loss_list)
+  tf.identity(l2_loss, name='l2_loss')
+  tf.summary.scalar('l2_loss', l2_loss)
+
+  # overall loss
+  global_step = tf.train.get_or_create_global_step()
+  loss_w_cls = tf.minimum(
+    tf.cast(global_step, tf.float32) / tf.constant(FLAGS.nb_iters_cls_wmup, dtype=tf.float32), 1.0)
+  loss = loss_w_cls * ce_loss + loc_loss + FLAGS.loss_w_dcy * l2_loss
+
+  return loss, metrics
+
+class ModelHelper(AbstractModelHelper):
+  """Model helper for creating a VGG model for the VOC dataset."""
+
+  def __init__(self, data_format='channels_last'):
+    """Constructor function."""
+
+    # class-independent initialization
+    super(ModelHelper, self).__init__(data_format)
+
+    # initialize training & evaluation subsets
+    self.dataset_train = PascalVocDataset(preprocess_fn=preprocess_image, is_train=True)
+    self.dataset_eval = PascalVocDataset(preprocess_fn=preprocess_image, is_train=False)
+
+    # setup hyper-parameters & anchor information
+    self.anchor_info = None  # track the most recently-used one
+    self.batch_size = None  # track the most recently-used one
+    self.model_scope = None
+
+  def build_dataset_train(self, enbl_trn_val_split=False):
+    """Build the data subset for training, usually with data augmentation."""
+
+    return self.dataset_train.build()
+
+  def build_dataset_eval(self):
+    """Build the data subset for evaluation, usually without data augmentation."""
+
+    return self.dataset_eval.build()
+
+  def forward_train(self, inputs):
+    """Forward computation at training."""
+
+    anchor_info = setup_anchor_info()
+    outputs, self.model_scope = forward_fn(inputs, True, self.data_format, anchor_info)
+    self.anchor_info = anchor_info
+    self.batch_size = tf.shape(inputs['image'])[0]
+    self.trainable_vars = tf.get_collection(
+      tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.model_scope)
+
+    return outputs
+
+  def forward_eval(self, inputs):
+    """Forward computation at evaluation."""
+
+    anchor_info = setup_anchor_info()
+    outputs, __ = forward_fn(inputs, False, self.data_format, anchor_info)
+    self.anchor_info = anchor_info
+    self.batch_size = tf.shape(inputs['image'])[0]
+
+    return outputs
+
+  def calc_loss(self, objects, outputs, trainable_vars):
+    """Calculate loss (and some extra evaluation metrics)."""
+
+    return calc_loss_fn(objects, outputs, trainable_vars, self.anchor_info, self.batch_size)
+
+  def setup_lrn_rate(self, global_step):
+    """Setup the learning rate (and number of training iterations)."""
+
+    bnds = [int(x) for x in parse_comma_list(FLAGS.lrn_rate_dcy_bnds)]
+    vals = [FLAGS.lrn_rate_init * x for x in parse_comma_list(FLAGS.lrn_rate_dcy_rates)]
+    lrn_rate = tf.train.piecewise_constant(global_step, bnds, vals)
+    lrn_rate = tf.maximum(lrn_rate, tf.constant(FLAGS.lrn_rate_min, dtype=lrn_rate.dtype))
+    nb_iters = FLAGS.nb_iters_train
+
+    return lrn_rate, nb_iters
+
+  def warm_start(self, sess):
+    """Initialize the model for warm-start.
+
+    Description:
+    * We use a pre-trained ImageNet classification model to initialize the backbone part of the SSD
+      model for feature extraction. If the SSD model's checkpoint files already exist, then the
+      learner should restore model weights by itself.
+    """
+
+    # obtain a list of scopes to be excluded from initialization
+    excl_scopes = []
+    if FLAGS.warm_start_excl_scopes:
+      excl_scopes = [scope.strip() for scope in FLAGS.warm_start_excl_scopes.split(',')]
+    tf.logging.info('excluded scopes: {}'.format(excl_scopes))
+
+    # obtain a list of variables to be initialized
+    vars_list = []
+    for var in self.trainable_vars:
+      excluded = False
+      for scope in excl_scopes:
+        if scope in var.name:
+          excluded = True
+          break
+      if not excluded:
+        vars_list.append(var)
+
+    # rename the variables' scope
+    if FLAGS.backbone_model_scope is not None:
+      backbone_model_scope = FLAGS.backbone_model_scope.strip()
+      if backbone_model_scope == '':
+        vars_list = {var.op.name.replace(self.model_scope + '/', ''): var for var in vars_list}
+      else:
+        vars_list = {var.op.name.replace(
+          self.model_scope, backbone_model_scope): var for var in vars_list}
+
+    # re-map the variables' names
+    name_remap = {'/kernel': '/weights', '/bias': '/biases'}
+    vars_list_remap = {}
+    for var_name, var in vars_list.items():
+      for name_old, name_new in name_remap.items():
+        if name_old in var_name:
+          var_name = var_name.replace(name_old, name_new)
+          break
+      vars_list_remap[var_name] = var
+    vars_list = vars_list_remap
+
+    # display all the variables to be initialized
+    for var_name, var in vars_list.items():
+      tf.logging.info('using %s to initialize %s' % (var_name, var.op.name))
+    if not vars_list:
+      raise ValueError('variables to be restored cannot be empty')
+
+    # obtain the checkpoint files' path
+    ckpt_path = tf.train.latest_checkpoint(FLAGS.backbone_ckpt_dir)
+    tf.logging.info('restoring model weights from {}'.format(ckpt_path))
+
+    # remove missing variables from the list
+    if FLAGS.ignore_missing_vars:
+      reader = tf.train.NewCheckpointReader(ckpt_path)
+      vars_list_avail = {}
+      for var in vars_list:
+        if reader.has_tensor(var):
+          vars_list_avail[var] = vars_list[var]
+        else:
+          tf.logging.warning('variable %s not found in checkpoint files %s.' % (var, ckpt_path))
+      vars_list = vars_list_avail
+    if not vars_list:
+      tf.logging.warning('no variables to restore.')
+      return
+
+    # restore variables from checkpoint files
+    saver = tf.train.Saver(vars_list, reshape=False)
+    saver.build()
+    saver.restore(sess, ckpt_path)
+
+  def dump_n_eval(self, outputs, action):
+    """Dump the model's outputs to files and evaluate."""
+
+    if not is_primary_worker('global'):
+      return
+
+    if action == 'init':
+      if os.path.exists(FLAGS.outputs_dump_dir):
+        shutil.rmtree(FLAGS.outputs_dump_dir)
+      os.mkdir(FLAGS.outputs_dump_dir)
+    elif action == 'dump':
+      filename = outputs['predictions']['filename'][0].decode('utf8')[:-4]
+      shape = outputs['predictions']['shape'][0]
+      for idx_cls in range(1, FLAGS.nb_classes):
+        with open(os.path.join(FLAGS.outputs_dump_dir, 'results_%d.txt' % idx_cls), 'a') as o_file:
+          scores = outputs['predictions']['scores_%d' % idx_cls][0]
+          bboxes = outputs['predictions']['bboxes_%d' % idx_cls][0]
+          bboxes[:, 0] = (bboxes[:, 0] * shape[0]).astype(np.int32, copy=False) + 1
+          bboxes[:, 1] = (bboxes[:, 1] * shape[1]).astype(np.int32, copy=False) + 1
+          bboxes[:, 2] = (bboxes[:, 2] * shape[0]).astype(np.int32, copy=False) + 1
+          bboxes[:, 3] = (bboxes[:, 3] * shape[1]).astype(np.int32, copy=False) + 1
+          for idx_bbox in range(bboxes.shape[0]):
+            bbox = bboxes[idx_bbox][:]
+            if bbox[2] > bbox[0] and bbox[3] > bbox[1]:
+              o_file.write('%s %.3f %.1f %.1f %.1f %.1f\n'
+                           % (filename, scores[idx_bbox], bbox[1], bbox[0], bbox[3], bbox[2]))
+    elif action == 'eval':
+      do_python_eval(os.path.join(self.dataset_eval.data_dir, 'test'), FLAGS.outputs_dump_dir)
+    else:
+      raise ValueError('unrecognized action in dump_n_eval(): ' + action)
+
+  @property
+  def model_name(self):
+    """Model's name."""
+
+    return 'ssd_vgg_300'
+
+  @property
+  def dataset_name(self):
+    """Dataset's name."""
+
+    return 'pascalvoc'
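Review note on the hard negative mining step in calc_loss_fn(): the sketch below restates the selection rule for a single image in plain Python, as an illustration only. The helper name `select_hard_negatives` and the default `negative_ratio=3.0` are hypothetical (the patch reads the ratio from `FLAGS.negative_ratio`, whose default is not shown in this part of the diff). Unlike the TensorFlow code, which thresholds on the k-th score and may therefore keep extra anchors on ties, this version keeps exactly k negatives.

```python
def select_hard_negatives(labels, bg_probs, negative_ratio=3.0):
    """Return a per-anchor keep mask: all positives plus the hardest negatives.

    labels: per-anchor class targets (> 0 positive, 0 background, < 0 ignored).
    bg_probs: per-anchor predicted probability of the background class.
    negative_ratio: hypothetical default; negatives kept per positive.
    """
    n_pos = sum(1 for lbl in labels if lbl > 0)
    neg_idxs = [i for i, lbl in enumerate(labels) if lbl == 0]
    n_neg_select = min(int(negative_ratio * n_pos), len(neg_idxs))
    # the hardest negatives are those the model is least sure are background
    hardest = set(sorted(neg_idxs, key=lambda i: bg_probs[i])[:n_neg_select])
    return [lbl > 0 or i in hardest for i, lbl in enumerate(labels)]


mask = select_hard_negatives(
    labels=[1, 0, 0, 0, 0, -1],
    bg_probs=[0.1, 0.9, 0.2, 0.5, 0.3, 0.4],
    negative_ratio=2.0)
print(mask)
```

With one positive and a ratio of 2, only the two negatives with the lowest background probability are kept; the ignored anchor (label -1) is never selected.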
diff --git a/nets/vgg_at_pascalvoc_run.py b/nets/vgg_at_pascalvoc_run.py
new file mode 100644
index 0000000..2bcf9c2
--- /dev/null
+++ b/nets/vgg_at_pascalvoc_run.py
@@ -0,0 +1,69 @@
+# Tencent is pleased to support the open source community by making PocketFlow available.
+#
+# Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Execution script for VGG models on the Pascal VOC dataset."""
+
+import traceback
+import tensorflow as tf
+
+from nets.vgg_at_pascalvoc import ModelHelper
+from learners.learner_utils import create_learner
+
+FLAGS = tf.app.flags.FLAGS
+
+tf.app.flags.DEFINE_string('log_dir', './logs', 'logging directory')
+tf.app.flags.DEFINE_boolean('enbl_multi_gpu', False, 'enable multi-GPU training')
+tf.app.flags.DEFINE_string('learner', 'full-prec', 'learner\'s name')
+tf.app.flags.DEFINE_string('exec_mode', 'train', 'execution mode: train / eval')
+tf.app.flags.DEFINE_boolean('debug', False, 'debugging information')
+
+def main(unused_argv):
+  """Main entry."""
+
+  try:
+    # setup the TF logging routine
+    if FLAGS.debug:
+      tf.logging.set_verbosity(tf.logging.DEBUG)
+    else:
+      tf.logging.set_verbosity(tf.logging.INFO)
+    sm_writer = tf.summary.FileWriter(FLAGS.log_dir)
+
+    # display FLAGS's values
+    tf.logging.info('FLAGS:')
+    for key, value in FLAGS.flag_values_dict().items():
+      tf.logging.info('{}: {}'.format(key, value))
+
+    # build the model helper & learner
+    model_helper = ModelHelper()
+    learner = create_learner(sm_writer, model_helper)
+
+    # execute the learner
+    if FLAGS.exec_mode == 'train':
+      learner.train()
+    elif FLAGS.exec_mode == 'eval':
+      learner.download_model()
+      learner.evaluate()
+    else:
+      raise ValueError('unrecognized execution mode: ' + FLAGS.exec_mode)
+
+    # exit normally
+    return 0
+  except ValueError:
+    traceback.print_exc()
+    return 1  # exit with errors
+
+if __name__ == '__main__':
+  tf.app.run()
diff --git a/requirement.txt b/requirement.txt
new file mode 100644
index 0000000..2fa693c
--- /dev/null
+++ b/requirement.txt
@@ -0,0 +1,8 @@
+docopt>=0.6.2
+hdfs>=2.1.0
+numpy>=1.14.0
+scipy>=1.0.0
+scikit-learn>=0.19.1
+pandas>=0.22.0
+mpi4py>=3.0.0
+tensorflow>=1.10.0
diff --git a/scripts/create_minimal.sh b/scripts/create_minimal.sh
index 82305e1..36ccfcc 100755
--- a/scripts/create_minimal.sh
+++ b/scripts/create_minimal.sh
@@ -17,9 +17,9 @@ cd ${dir_temp}
 
 # remove redundant files
 #git clean -xdf  # all files ignored by git
-rm -r ./models
-rm -r ./logs 
-rm -rf .git
+rm -rf ./models*
+rm -rf ./logs
+rm -rf .git .gitignore
 cp ${dir_curr}/path.conf .
 
 # return to the original directory
diff --git a/scripts/run_docker.sh b/scripts/run_docker.sh
index bb398e8..68d3363 100755
--- a/scripts/run_docker.sh
+++ b/scripts/run_docker.sh
@@ -14,6 +14,7 @@ nb_gpus=1
 # parse arguments passed from the command line
 py_script="$1"
 shift
+extra_args=""
 for i in "$@"
 do
   case "$i" in
@@ -23,11 +24,13 @@ do
     ;;
     *)
     # unknown option
+    extra_args="${extra_args} ${i}"
+    shift
     ;;
   esac
 done
-extra_args=`python utils/get_path_args.py docker ${py_script} path.conf`
-extra_args="$@ ${extra_args}"
+extra_args_path=`python utils/get_path_args.py docker ${py_script} path.conf`
+extra_args="${extra_args} ${extra_args_path}"
 echo ${extra_args} > extra_args
 echo "Python script: ${py_script}"
 echo "Data directory: ${dir_data}"
diff --git a/scripts/run_local.sh b/scripts/run_local.sh
index 3354575..ecf020b 100755
--- a/scripts/run_local.sh
+++ b/scripts/run_local.sh
@@ -6,6 +6,7 @@ nb_gpus=1
 # parse arguments passed from the command line
 py_script="$1"
 shift
+extra_args=""
 for i in "$@"
 do
   case "$i" in
@@ -15,11 +16,13 @@ do
     ;;
     *)
     # unknown option
+    extra_args="${extra_args} ${i}"
+    shift
     ;;
   esac
 done
-extra_args=`python utils/get_path_args.py local ${py_script} path.conf`
-extra_args="$@ ${extra_args}"
+extra_args_path=`python utils/get_path_args.py local ${py_script} path.conf`
+extra_args="${extra_args} ${extra_args_path}"
 echo "Python script: ${py_script}"
 echo "# of GPUs: ${nb_gpus}"
 echo "extra arguments: ${extra_args}"
diff --git a/scripts/run_seven.sh b/scripts/run_seven.sh
index 4e54d16..e394807 100755
--- a/scripts/run_seven.sh
+++ b/scripts/run_seven.sh
@@ -15,6 +15,7 @@ job_name="pocket-flow"
 # parse arguments passed from the command line
 py_script="$1"
 shift
+extra_args=""
 for i in "$@"
 do
   case "$i" in
@@ -28,11 +29,13 @@ do
     ;;
     *)
     # unknown option
+    extra_args="${extra_args} ${i}"
+    shift
     ;;
   esac
 done
-extra_args=`python utils/get_path_args.py seven ${py_script} path.conf`
-extra_args="$@ ${extra_args}"
+extra_args_path=`python utils/get_path_args.py seven ${py_script} path.conf`
+extra_args="${extra_args} ${extra_args_path}"
 echo ${extra_args} > extra_args
 echo "Python script: ${py_script}"
 echo "Job name: ${job_name}"
diff --git a/seven.yaml b/seven.yaml
index cf06306..f8eca5f 100644
--- a/seven.yaml
+++ b/seven.yaml
@@ -3,7 +3,7 @@ kind: standalone
 jobname: pocket-flow
 container:
   image:
-    docker.oa.com/g_tfplus/tfplus:tensorflow1.8-python3.6-cuda9.0-cudnn7.0.4.31-ubuntu16.04-tfplus-v2
+    docker.oa.com/g_tfplus/tfplus:tensorflow1.8-python3.6-cuda9.0-cudnn7.0.4.31-ubuntu16.04-tfplus-v3
     #docker.oa.com/g_tfplus/horovod:python3.5
   resources:
     nvidia.com/gpu: 1
diff --git a/tools/benchmark/calc_inference_time.py b/tools/benchmark/calc_inference_time.py
new file mode 100644
index 0000000..d44dcb6
--- /dev/null
+++ b/tools/benchmark/calc_inference_time.py
@@ -0,0 +1,114 @@
+# Tencent is pleased to support the open source community by making PocketFlow available.
+#
+# Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Measure the time consumption of *.pb and *.tflite models."""
+
+import traceback
+from timeit import default_timer as timer
+import numpy as np
+import tensorflow as tf
+
+FLAGS = tf.app.flags.FLAGS
+
+tf.app.flags.DEFINE_string('model_file', None, 'model file path')
+tf.app.flags.DEFINE_string('input_name', 'net_input', 'input tensor\'s name in the *.pb model')
+tf.app.flags.DEFINE_string('output_name', 'net_output', 'output tensor\'s name in the *.pb model')
+tf.app.flags.DEFINE_string('input_dtype', 'float32',
+                           'input tensor\'s data type in the *.tflite model')
+tf.app.flags.DEFINE_integer('batch_size', 1, 'batch size for run-time benchmark')
+tf.app.flags.DEFINE_integer('nb_repts_warmup', 100, '# of repeated runs for warm-up')
+tf.app.flags.DEFINE_integer('nb_repts', 100, '# of repeated runs for elapsed time measurement')
+
+def test_pb_model():
+  """Test the *.pb model."""
+
+  with tf.Graph().as_default() as graph:
+    sess = tf.Session()
+
+    # restore the model
+    graph_def = tf.GraphDef()
+    with tf.gfile.GFile(FLAGS.model_file, 'rb') as i_file:
+      graph_def.ParseFromString(i_file.read())
+    tf.import_graph_def(graph_def)
+
+    # obtain input & output nodes and then test the model
+    net_input = graph.get_tensor_by_name('import/' + FLAGS.input_name + ':0')
+    net_output = graph.get_tensor_by_name('import/' + FLAGS.output_name + ':0')
+    net_input_data = np.zeros(tuple([FLAGS.batch_size] + list(net_input.shape[1:])))
+    for idx in range(FLAGS.nb_repts_warmup + FLAGS.nb_repts):
+      if idx == FLAGS.nb_repts_warmup:
+        time_beg = timer()
+      sess.run(net_output, feed_dict={net_input: net_input_data})
+    time_elapsed = (timer() - time_beg) / FLAGS.nb_repts / FLAGS.batch_size
+    tf.logging.info('time consumption of *.pb model: %.2f ms' % (time_elapsed * 1000))
+
+def test_tflite_model():
+  """Test the *.tflite model."""
+
+  # restore the model and allocate tensors
+  interpreter = tf.contrib.lite.Interpreter(model_path=FLAGS.model_file)
+  interpreter.allocate_tensors()
+
+  # get input & output tensors
+  input_details = interpreter.get_input_details()
+  output_details = interpreter.get_output_details()
+  assert len(input_details) == 1, '<input_details> should contain only one element'
+  if FLAGS.input_dtype == 'uint8':
+    net_input_data = np.zeros(input_details[0]['shape'], dtype=np.uint8)
+  elif FLAGS.input_dtype == 'float32':
+    net_input_data = np.zeros(input_details[0]['shape'], dtype=np.float32)
+  else:
+    raise ValueError('unrecognized input data type: ' + FLAGS.input_dtype)
+
+  # test the model with given inputs
+  for idx in range(FLAGS.nb_repts_warmup + FLAGS.nb_repts):
+    if idx == FLAGS.nb_repts_warmup:
+      time_beg = timer()
+    interpreter.set_tensor(input_details[0]['index'], net_input_data)
+    interpreter.invoke()
+    interpreter.get_tensor(output_details[0]['index'])
+  time_elapsed = (timer() - time_beg) / FLAGS.nb_repts
+  tf.logging.info('time consumption of *.tflite model: %.2f ms' % (time_elapsed * 1000))
+
+def main(unused_argv):
+  """Main entry.
+
+  Args:
+  * unused_argv: unused arguments (after FLAGS is parsed)
+  """
+
+  try:
+    # setup the TF logging routine
+    tf.logging.set_verbosity(tf.logging.INFO)
+
+    # call benchmark routines for *.pb / *.tflite models
+    if FLAGS.model_file is None:
+      raise ValueError('<FLAGS.model_file> must not be None')
+    elif FLAGS.model_file.endswith('.pb'):
+      test_pb_model()
+    elif FLAGS.model_file.endswith('.tflite'):
+      test_tflite_model()
+    else:
+      raise ValueError('unrecognized model file path: ' + FLAGS.model_file)
+
+    # exit normally
+    return 0
+  except ValueError:
+    traceback.print_exc()
+    return 1  # exit with errors
+
+if __name__ == '__main__':
+  tf.app.run()
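Review note on the timing loops in test_pb_model() and test_tflite_model(): both run warm-up iterations first and then average the elapsed time over the timed runs. A minimal plain-Python sketch of that pattern follows, assuming a hypothetical zero-argument callable `run_once` that stands in for `sess.run(...)` or `interpreter.invoke()`. The guard against a non-positive run count is an addition that is not present in the patch.

```python
from timeit import default_timer as timer


def measure_avg_time(run_once, nb_repts_warmup=100, nb_repts=100):
    """Return the average wall-clock time (in seconds) of one call to run_once."""
    if nb_repts <= 0:
        raise ValueError('nb_repts must be positive')
    # warm-up runs are executed but excluded from the measurement
    for _ in range(nb_repts_warmup):
        run_once()
    time_beg = timer()
    for _ in range(nb_repts):
        run_once()
    return (timer() - time_beg) / nb_repts


calls = []
avg_time = measure_avg_time(lambda: calls.append(1), nb_repts_warmup=3, nb_repts=5)
print('average time: %.6f ms' % (avg_time * 1000))
```

Splitting warm-up and timed runs into two loops avoids relying on `time_beg` being assigned inside a conditional, which is what the patch does when it checks `idx == FLAGS.nb_repts_warmup`.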
diff --git a/tools/conversion/convert_data_format.py b/tools/conversion/convert_data_format.py
index a28319e..999b3ff 100644
--- a/tools/conversion/convert_data_format.py
+++ b/tools/conversion/convert_data_format.py
@@ -20,8 +20,11 @@
 import traceback
 import tensorflow as tf
 
-# you may need to replace <ModelHelper> before conversion
-from nets.resnet_at_cifar10 import ModelHelper
+# NOTE: un-comment the corresponding <ModelHelper> before conversion
+#from nets.lenet_at_cifar10 import ModelHelper
+#from nets.resnet_at_cifar10 import ModelHelper
+from nets.resnet_at_ilsvrc12 import ModelHelper
+#from nets.mobilenet_at_ilsvrc12 import ModelHelper
 
 FLAGS = tf.app.flags.FLAGS
 
@@ -29,7 +32,8 @@
 tf.app.flags.DEFINE_boolean('enbl_multi_gpu', False, 'enable multi-GPU training')
 tf.app.flags.DEFINE_string('model_dir_in', './models', 'input model directory')
 tf.app.flags.DEFINE_string('model_dir_out', './models_out', 'output model directory')
-tf.app.flags.DEFINE_string('data_format_src', 'channels_last', 'data format in the source model')
+tf.app.flags.DEFINE_string('model_scope', 'model', 'model\'s variable scope name')
+tf.app.flags.DEFINE_string('data_format', 'channels_last', 'data format in the output model')
 
 def main(unused_argv):
   """Main entry.
@@ -46,9 +50,9 @@ def main(unused_argv):
     #sess = tf.Session()
 
     # create the model helper
-    model_helper = ModelHelper()
+    model_helper = ModelHelper(FLAGS.data_format)
     data_scope = 'data'
-    model_scope = 'pruned_model'
+    model_scope = FLAGS.model_scope
 
     # bulid a graph with the target data format and rewrite checkpoint files
     with tf.Graph().as_default():
@@ -59,11 +63,7 @@ def main(unused_argv):
 
       # model definition
       with tf.variable_scope(model_scope):
-        if FLAGS.data_format_src == 'channels_last':
-          data_format_dst = 'channels_first'
-        else:
-          data_format_dst = 'channels_last'
-        logits = model_helper.forward_eval(images, data_format=data_format_dst)
+        logits = model_helper.forward_eval(images)
 
       # add input & output tensors to certain collections
       tf.add_to_collection('images_final', images)
diff --git a/tools/conversion/export_chn_pruned_tflite_model.py b/tools/conversion/export_chn_pruned_tflite_model.py
new file mode 100644
index 0000000..7f5ffdd
--- /dev/null
+++ b/tools/conversion/export_chn_pruned_tflite_model.py
@@ -0,0 +1,379 @@
+# Tencent is pleased to support the open source community by making PocketFlow available.
+#
+# Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Export a channel-pruned *.tflite model from checkpoint files."""
+
+import os
+import re
+import traceback
+from timeit import default_timer as timer
+import numpy as np
+import tensorflow as tf
+from tensorflow.contrib import graph_editor
+
+FLAGS = tf.app.flags.FLAGS
+
+tf.app.flags.DEFINE_string('log_dir', './logs', 'logging directory')
+tf.app.flags.DEFINE_string('model_dir', './models', 'model directory')
+tf.app.flags.DEFINE_string('input_coll', 'images_final', 'input tensor\'s collection')
+tf.app.flags.DEFINE_string('output_coll', 'logits_final', 'output tensor\'s collection')
+tf.app.flags.DEFINE_boolean('enbl_fake_prune', False, 'enable fake pruning (for speed test only)')
+tf.app.flags.DEFINE_float('fake_prune_ratio', 0.5, 'fake pruning ratio')
+tf.app.flags.DEFINE_integer('nb_repts_warmup', 100, '# of repeated runs for warm-up')
+tf.app.flags.DEFINE_integer('nb_repts', 100, '# of repeated runs for elapsed time measurement')
+
+def get_file_path_meta():
+  """Get the file path to the *.meta data.
+
+  Returns:
+  * file_path: file path to the *.meta data
+  """
+
+  pattern = re.compile(r'model\.ckpt\.meta$')
+  for file_name in os.listdir(FLAGS.model_dir):
+    if re.search(pattern, file_name) is not None:
+      file_path = os.path.join(FLAGS.model_dir, file_name)
+      break
+
+  return file_path
+
+def get_input_name_n_shape(file_path):
+  """Get the input tensor's name & shape from *.meta file.
+
+  Args:
+  * file_path: file path to the *.meta data
+
+  Returns:
+  * input_name: input tensor's name
+  * input_shape: input tensor's shape
+  """
+
+  with tf.Graph().as_default():
+    tf.train.import_meta_graph(file_path)
+    net_input = tf.get_collection(FLAGS.input_coll)[0]
+    input_name = net_input.name
+    input_shape = net_input.shape
+
+  return input_name, input_shape
+
+def get_data_format(sess):
+  """Get the data format of convolutional layers.
+
+  Args:
+  * sess: TensorFlow session
+
+  Returns:
+  * data_format: data format of convolutional layers
+  """
+
+  data_format = None
+  pattern = re.compile('Conv2D$')
+  for op in tf.get_default_graph().get_operations():
+    if re.search(pattern, op.name) is not None:
+      data_format = op.get_attr('data_format').decode('utf-8')
+      tf.logging.info('data format: ' + data_format)
+      break
+
+  return data_format
+
+def convert_pb_model_to_tflite(file_path_pb, file_path_tflite, net_input_name, net_output_name):
+  """Convert *.pb model to a *.tflite model.
+
+  Args:
+  * file_path_pb: file path to the *.pb model
+  * file_path_tflite: file path to the *.tflite model
+  * net_input_name: network's input node's name
+  * net_output_name: network's output node's name
+  """
+
+  tf.logging.info(file_path_pb + ' -> ' + file_path_tflite)
+  with tf.Graph().as_default():
+    converter = tf.contrib.lite.TocoConverter.from_frozen_graph(
+      file_path_pb, [net_input_name], [net_output_name])
+    tflite_model = converter.convert()
+    with tf.gfile.GFile(file_path_tflite, 'wb') as o_file:
+      o_file.write(tflite_model)
+
+def test_pb_model(file_path, net_input_name, net_output_name, net_input_data):
+  """Test the *.pb model.
+
+  Args:
+  * file_path: file path to the *.pb model
+  * net_input_name: network's input node's name
+  * net_output_name: network's output node's name
+  * net_input_data: network's input node's data
+  """
+
+  with tf.Graph().as_default() as graph:
+    sess = tf.Session()
+
+    # restore the model
+    graph_def = tf.GraphDef()
+    with tf.gfile.GFile(file_path, 'rb') as i_file:
+      graph_def.ParseFromString(i_file.read())
+    tf.import_graph_def(graph_def)
+
+    # obtain input & output nodes and then test the model
+    net_input = graph.get_tensor_by_name('import/' + net_input_name + ':0')
+    net_output = graph.get_tensor_by_name('import/' + net_output_name + ':0')
+    tf.logging.info('input: {} / output: {}'.format(net_input.name, net_output.name))
+    for idx in range(FLAGS.nb_repts_warmup + FLAGS.nb_repts):
+      if idx == FLAGS.nb_repts_warmup:
+        time_beg = timer()
+      net_output_data = sess.run(net_output, feed_dict={net_input: net_input_data})
+    time_elapsed = (timer() - time_beg) / FLAGS.nb_repts
+    tf.logging.info('outputs from the *.pb model: {}'.format(net_output_data))
+    tf.logging.info('time consumption of *.pb model: %.2f ms' % (time_elapsed * 1000))
+
+def test_tflite_model(file_path, net_input_data):
+  """Test the *.tflite model.
+
+  Args:
+  * file_path: file path to the *.tflite model
+  * net_input_data: network's input node's data
+  """
+
+  # restore the model and allocate tensors
+  interpreter = tf.contrib.lite.Interpreter(model_path=file_path)
+  interpreter.allocate_tensors()
+
+  # get input & output tensors
+  input_details = interpreter.get_input_details()
+  output_details = interpreter.get_output_details()
+  tf.logging.info('input details: {}'.format(input_details))
+  tf.logging.info('output details: {}'.format(output_details))
+
+  # test the model with given inputs
+  for idx in range(FLAGS.nb_repts_warmup + FLAGS.nb_repts):
+    if idx == FLAGS.nb_repts_warmup:
+      time_beg = timer()
+    interpreter.set_tensor(input_details[0]['index'], net_input_data)
+    interpreter.invoke()
+    net_output_data = interpreter.get_tensor(output_details[0]['index'])
+  time_elapsed = (timer() - time_beg) / FLAGS.nb_repts
+  tf.logging.info('outputs from the *.tflite model: {}'.format(net_output_data))
+  tf.logging.info('time consumption of *.tflite model: %.2f ms' % (time_elapsed * 1000))
+
+def is_initialized(sess, var):
+  """Check whether a variable is initialized.
+
+  Args:
+  * sess: TensorFlow session
+  * var: variable to be checked
+  """
+
+  try:
+    sess.run(var)
+    return True
+  except tf.errors.FailedPreconditionError:
+    return False
+
+def apply_fake_pruning(kernel):
+  """Apply fake pruning to the convolutional kernel.
+
+  Args:
+  * kernel: original convolutional kernel
+
+  Returns:
+  * kernel: randomly pruned convolutional kernel
+  """
+
+  tf.logging.info('kernel shape: {}'.format(kernel.shape))
+  nb_chns = kernel.shape[2]
+  idxs_all = np.arange(nb_chns)
+  np.random.shuffle(idxs_all)
+  idxs_pruned = idxs_all[:int(nb_chns * FLAGS.fake_prune_ratio)]
+  kernel[:, :, idxs_pruned, :] = 0.0
+
+  return kernel
+
+def replace_dropout_layers():
+  """Replace dropout layers with identity mappings.
+
+  Returns:
+  * op_outputs_old: output nodes to be swapped in the old graph
+  * op_outputs_new: output nodes to be swapped in the new graph
+  """
+
+  pattern_div = re.compile('/dropout/div')
+  pattern_mul = re.compile('/dropout/mul')
+  op_outputs_old, op_outputs_new = [], []
+  for op in tf.get_default_graph().get_operations():
+    if re.search(pattern_div, op.name) is not None:
+      x = tf.identity(op.inputs[0])
+      op_outputs_new += [x]
+    if re.search(pattern_mul, op.name) is not None:
+      op_outputs_old += [op.outputs[0]]
+
+  return op_outputs_old, op_outputs_new
+
+def insert_alt_routines(sess, graph_trans_mthd):
+  """Insert alternative routines for convolutional layers.
+
+  Args:
+  * sess: TensorFlow session
+  * graph_trans_mthd: graph transformation method
+
+  Returns:
+  * op_outputs_old: output nodes to be swapped in the old graph
+  * op_outputs_new: output nodes to be swapped in the new graph
+  """
+
+  pattern = re.compile('Conv2D$')
+  op_outputs_old, op_outputs_new = [], []
+  for op in tf.get_default_graph().get_operations():
+    if re.search(pattern, op.name) is not None:
+      # skip uninitialized variables, which are not needed in the final *.pb file
+      if not is_initialized(sess, op.inputs[1]):
+        continue
+
+      # detect which channels have been pruned
+      tf.logging.info('transforming OP: ' + op.name)
+      kernel = sess.run(op.inputs[1])
+      if FLAGS.enbl_fake_prune:
+        kernel = apply_fake_pruning(kernel)
+      kernel_chn_in = kernel.shape[2]
+      strides = op.get_attr('strides')
+      padding = op.get_attr('padding').decode('utf-8')
+      data_format = op.get_attr('data_format').decode('utf-8')
+      dilations = op.get_attr('dilations')
+      nnzs = np.nonzero(np.sum(np.abs(kernel), axis=(0, 1, 3)))[0]
+      tf.logging.info('reducing %d channels to %d' % (kernel_chn_in, nnzs.size))
+      kernel_gthr = np.zeros((1, 1, kernel_chn_in, nnzs.size))
+      kernel_gthr[0, 0, nnzs, np.arange(nnzs.size)] = 1.0
+      kernel_shrk = kernel[:, :, nnzs, :]
+
+      # replace channel-pruned convolutional layers with cheaper operations
+      if graph_trans_mthd == 'gather':
+        x = tf.gather(op.inputs[0], nnzs, axis=1)
+        x = tf.nn.conv2d(
+          x, kernel_shrk, strides, padding, data_format=data_format, dilations=dilations)
+      elif graph_trans_mthd == '1x1_conv':
+        x = tf.nn.conv2d(op.inputs[0], kernel_gthr, [1, 1, 1, 1], 'SAME', data_format=data_format)
+        x = tf.nn.conv2d(
+          x, kernel_shrk, strides, padding, data_format=data_format, dilations=dilations)
+      else:
+        raise ValueError('unrecognized graph transformation method: ' + graph_trans_mthd)
+
+      # obtain old and new routines' outputs
+      op_outputs_old += [op.outputs[0]]
+      op_outputs_new += [x]
+
+  return op_outputs_old, op_outputs_new
+
+def export_pb_tflite_model(net, file_path_meta, file_path_pb, file_path_tflite, edit_graph):
+  """Export *.pb & *.tflite models from checkpoint files.
+
+  Args:
+  * net: network configurations
+  * file_path_meta: file path to the *.meta data
+  * file_path_pb: file path to the *.pb model
+  * file_path_tflite: file path to the *.tflite model
+  * edit_graph: whether the graph should be edited
+  """
+
+  # convert checkpoint files to a *.pb model
+  with tf.Graph().as_default() as graph:
+    sess = tf.Session()
+
+    # restore the graph with inputs replaced
+    net_input = tf.placeholder(tf.float32, shape=net['input_shape'], name=net['input_name'])
+    saver = tf.train.import_meta_graph(
+      file_path_meta, input_map={net['input_name_ckpt']: net_input})
+    saver.restore(sess, file_path_meta.replace('.meta', ''))
+
+    # obtain the data format and determine which graph transformation method to use
+    data_format = get_data_format(sess)
+    graph_trans_mthd = 'gather' if data_format == 'NCHW' else '1x1_conv'
+
+    # obtain the output node
+    net_logits = tf.get_collection(FLAGS.output_coll)[0]
+    net_output = tf.nn.softmax(net_logits, name=net['output_name'])
+    tf.logging.info('input: {} / output: {}'.format(net_input.name, net_output.name))
+    tf.logging.info('input\'s shape: {}'.format(net_input.shape))
+    tf.logging.info('output\'s shape: {}'.format(net_output.shape))
+
+    # replace dropout layers with identity mappings (TF-Lite does not support dropout layers)
+    op_outputs_old, op_outputs_new = replace_dropout_layers()
+    sess.close()
+    graph_editor.swap_outputs(op_outputs_old, op_outputs_new)
+    sess = tf.Session()  # open a new session
+    saver.restore(sess, file_path_meta.replace('.meta', ''))
+
+    # edit the graph by inserting alternative routines for each convolutional layer
+    if edit_graph:
+      op_outputs_old, op_outputs_new = insert_alt_routines(sess, graph_trans_mthd)
+      sess.close()
+      graph_editor.swap_outputs(op_outputs_old, op_outputs_new)
+      sess = tf.Session()  # open a new session
+      saver.restore(sess, file_path_meta.replace('.meta', ''))
+
+    # write the original graph to *.pb file
+    graph_def = graph.as_graph_def()
+    graph_def = tf.graph_util.convert_variables_to_constants(sess, graph_def, [net['output_name']])
+    file_name_pb = os.path.basename(file_path_pb)
+    tf.train.write_graph(graph_def, FLAGS.model_dir, file_name_pb, as_text=False)
+    tf.logging.info(file_path_pb + ' generated')
+    test_pb_model(file_path_pb, net['input_name'], net['output_name'], net['input_data'])
+
+  # convert the *.pb model to a *.tflite model (only NHWC is supported)
+  if data_format == 'NHWC':
+    convert_pb_model_to_tflite(file_path_pb, file_path_tflite, net['input_name'], net['output_name'])
+    tf.logging.info(file_path_tflite + ' generated')
+    test_tflite_model(file_path_tflite, net['input_data'])
+  else:
+    tf.logging.warning('*.tflite model not generated since NCHW is not supported by TF-Lite')
+
+def main(unused_argv):
+  """Main entry.
+
+  Args:
+  * unused_argv: unused arguments (after FLAGS is parsed)
+  """
+
+  try:
+    # setup the TF logging routine
+    tf.logging.set_verbosity(tf.logging.INFO)
+
+    # network configurations
+    file_path_meta = get_file_path_meta()
+    input_name, input_shape = get_input_name_n_shape(file_path_meta)
+    net = {
+      'input_name_ckpt': input_name,  # used to import the model from checkpoint files
+      'input_name': 'net_input',  # used to export the model to *.pb & *.tflite files
+      'input_shape': input_shape,
+      'output_name': 'net_output'
+    }
+    net['input_data'] = np.zeros(tuple([1] + list(net['input_shape'][1:])), dtype=np.float32)
+
+    # generate *.pb & *.tflite files for the original model
+    file_path_pb = os.path.join(FLAGS.model_dir, 'model_original.pb')
+    file_path_tflite = os.path.join(FLAGS.model_dir, 'model_original.tflite')
+    export_pb_tflite_model(net, file_path_meta, file_path_pb, file_path_tflite, edit_graph=False)
+
+    # generate *.pb & *.tflite files for the transformed model
+    file_path_pb = os.path.join(FLAGS.model_dir, 'model_transformed.pb')
+    file_path_tflite = os.path.join(FLAGS.model_dir, 'model_transformed.tflite')
+    export_pb_tflite_model(net, file_path_meta, file_path_pb, file_path_tflite, edit_graph=True)
+
+    # exit normally
+    return 0
+  except ValueError:
+    traceback.print_exc()
+    return 1  # exit with errors
+
+if __name__ == '__main__':
+  tf.app.run()
diff --git a/tools/conversion/export_pb_tflite_models.py b/tools/conversion/export_pb_tflite_models.py
index 474cb4c..cb5dfb9 100644
--- a/tools/conversion/export_pb_tflite_models.py
+++ b/tools/conversion/export_pb_tflite_models.py
@@ -14,7 +14,14 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Export *.pb & *.tflite models from checkpoint files."""
+"""Export *.pb & *.tflite models from checkpoint files.
+
+Description:
+* To export compressed *.pb & *.tflite models trained with channel pruning based algorithms,
+    set <enbl_chn_prune> to True.
+* To export compressed *.pb & *.tflite models trained with the <UniformQuantTFLearner> learner,
+    set <enbl_uni_quant> to True.
+"""
 
 import os
 import re
@@ -22,22 +29,36 @@
 import numpy as np
 import tensorflow as tf
 from tensorflow.contrib import graph_editor
+from tensorflow.contrib.lite.python import lite_constants
 
 FLAGS = tf.app.flags.FLAGS
 
+# common configurations
 tf.app.flags.DEFINE_string('log_dir', './logs', 'logging directory')
 tf.app.flags.DEFINE_string('model_dir', './models', 'model directory')
 tf.app.flags.DEFINE_string('input_coll', 'images_final', 'input tensor\'s collection')
 tf.app.flags.DEFINE_string('output_coll', 'logits_final', 'output tensor\'s collection')
 
-def get_file_path_meta():
-  """Get the file path to the *.meta data.
+# channel-pruning-related configurations
+tf.app.flags.DEFINE_boolean('enbl_chn_prune', False,
+                            'enable exporting models with pruned channels removed')
+tf.app.flags.DEFINE_boolean('enbl_fake_prune', False, 'enable fake pruning (for speed test only)')
+tf.app.flags.DEFINE_float('fake_prune_ratio', 0.5, 'fake pruning ratio')
+
+# uniform-quantization-related configurations
+tf.app.flags.DEFINE_boolean('enbl_uni_quant', False,
+                            'enable exporting models with uniform quantization operations applied')
+tf.app.flags.DEFINE_boolean('enbl_fake_quant', False,
+                            'enable post-training quantization (may have extra performance loss)')
+
+def get_meta_path():
+  """Get the path to the *.meta file.
 
   Returns:
-  * file_path: file path to the *.meta data
+  * file_path: path to the *.meta file
   """
 
-  pattern = re.compile('model.ckpt.meta$')
+  pattern = re.compile(r'model\.ckpt\.meta$')  # file name must be: *model.ckpt.meta
   for file_name in os.listdir(FLAGS.model_dir):
     if re.search(pattern, file_name) is not None:
       file_path = os.path.join(FLAGS.model_dir, file_name)
@@ -45,11 +66,11 @@ def get_file_path_meta():
 
   return file_path
 
-def get_input_name_n_shape(file_path):
+def get_input_name_n_shape(meta_path):
   """Get the input tensor's name & shape from *.meta file.
 
   Args:
-  * file_path: file path to the *.meta data
+  * meta_path: path to the *.meta file
 
   Returns:
   * input_name: input tensor's name
@@ -57,80 +78,30 @@ def get_input_name_n_shape(file_path):
   """
 
   with tf.Graph().as_default():
-    tf.train.import_meta_graph(file_path)
+    tf.train.import_meta_graph(meta_path)
     net_input = tf.get_collection(FLAGS.input_coll)[0]
     input_name = net_input.name
     input_shape = net_input.shape
 
   return input_name, input_shape
 
-def convert_pb_model_to_tflite(file_path_pb, file_path_tflite, net_input_name, net_output_name):
-  """Convert *.pb model to a *.tflite model.
-
-  Args:
-  * file_path_pb: file path to the *.pb model
-  * file_path_tflite: file path to the *.tflite model
-  * net_input_name: network's input node's name
-  * net_output_name: network's output node's name
-  """
-
-  tf.logging.info(file_path_pb + ' -> ' + file_path_tflite)
-  with tf.Graph().as_default():
-    converter = tf.contrib.lite.TocoConverter.from_frozen_graph(
-      file_path_pb, [net_input_name], [net_output_name])
-    tflite_model = converter.convert()
-    with tf.gfile.GFile(file_path_tflite, 'wb') as o_file:
-      o_file.write(tflite_model)
-
-def test_pb_model(file_path, net_input_name, net_output_name, net_input_data):
-  """Test the *.pb model.
-
-  Args:
-  * file_path: file path to the *.pb model
-  * net_input_name: network's input node's name
-  * net_output_name: network's output node's name
-  * net_input_data: network's input node's data
-  """
-
-  with tf.Graph().as_default() as graph:
-    sess = tf.Session()
+def get_data_format():
+  """Get the data format of convolutional layers.
 
-    # restore the model
-    graph_def = tf.GraphDef()
-    with tf.gfile.GFile(file_path, 'rb') as i_file:
-      graph_def.ParseFromString(i_file.read())
-    tf.import_graph_def(graph_def)
-
-    # obtain input & output nodes and then test the model
-    net_input = graph.get_tensor_by_name('import/' + net_input_name + ':0')
-    net_output = graph.get_tensor_by_name('import/' + net_output_name + ':0')
-    tf.logging.info('input: {} / output: {}'.format(net_input.name, net_output.name))
-    net_output_data = sess.run(net_output, feed_dict={net_input: net_input_data})
-    tf.logging.info('outputs from the *.pb model: {}'.format(net_output_data))
-
-def test_tflite_model(file_path, net_input_data):
-  """Test the *.tflite model.
-
-  Args:
-  * file_path: file path to the *.tflite model
-  * net_input_data: network's input node's data
+  Returns:
+  * data_format: data format of convolutional layers
   """
 
-  # restore the model and allocate tensors
-  interpreter = tf.contrib.lite.Interpreter(model_path=file_path)
-  interpreter.allocate_tensors()
-
-  # get input & output tensors
-  input_details = interpreter.get_input_details()
-  output_details = interpreter.get_output_details()
-  tf.logging.info('input details: {}'.format(input_details))
-  tf.logging.info('output details: {}'.format(output_details))
+  data_format = None
+  pattern = re.compile(r'Conv2D$')
+  for op in tf.get_default_graph().get_operations():
+    if re.search(pattern, op.name) is not None:
+      data_format = op.get_attr('data_format').decode('utf-8')
+      tf.logging.info('data format: ' + data_format)
+      break
+  assert data_format is not None, 'unable to determine <data_format>; convolutional layer not found'
 
-  # test the model with given inputs
-  interpreter.set_tensor(input_details[0]['index'], net_input_data)
-  interpreter.invoke()
-  net_output_data = interpreter.get_tensor(output_details[0]['index'])
-  tf.logging.info('outputs from the *.tflite model: {}'.format(net_output_data))
+  return data_format
 
 def is_initialized(sess, var):
   """Check whether a variable is initialized.
@@ -146,6 +117,25 @@ def is_initialized(sess, var):
   except tf.errors.FailedPreconditionError:
     return False
 
+def apply_fake_pruning(kernel):
+  """Apply fake pruning to the convolutional kernel.
+
+  Args:
+  * kernel: original convolutional kernel
+
+  Returns:
+  * kernel: randomly pruned convolutional kernel
+  """
+
+  tf.logging.info('kernel shape: {}'.format(kernel.shape))
+  nb_chns = kernel.shape[2]
+  idxs_all = np.arange(nb_chns)
+  np.random.shuffle(idxs_all)
+  idxs_pruned = idxs_all[:int(nb_chns * FLAGS.fake_prune_ratio)]
+  kernel[:, :, idxs_pruned, :] = 0.0
+
+  return kernel
+
 def replace_dropout_layers():
   """Replace dropout layers with identity mappings.
 
@@ -166,11 +156,12 @@ def replace_dropout_layers():
 
   return op_outputs_old, op_outputs_new
 
-def insert_alt_routines(sess):
+def insert_alt_routines(sess, graph_trans_mthd):
   """Insert alternative rountines for convolutional layers.
 
   Args:
   * sess: TensorFlow session
+  * graph_trans_mthd: graph transformation method
 
   Returns:
   * op_outputs_old: output nodes to be swapped in the old graph
@@ -185,9 +176,11 @@ def insert_alt_routines(sess):
       if not is_initialized(sess, op.inputs[1]):
         continue
 
-      # insert alternative routines using tf.nn.conv2d
+      # detect which channels have been pruned
       tf.logging.info('transforming OP: ' + op.name)
       kernel = sess.run(op.inputs[1])
+      if FLAGS.enbl_fake_prune:
+        kernel = apply_fake_pruning(kernel)
       kernel_chn_in = kernel.shape[2]
       strides = op.get_attr('strides')
       padding = op.get_attr('padding').decode('utf-8')
@@ -198,35 +191,133 @@ def insert_alt_routines(sess):
       kernel_gthr = np.zeros((1, 1, kernel_chn_in, nnzs.size))
       kernel_gthr[0, 0, nnzs, np.arange(nnzs.size)] = 1.0
       kernel_shrk = kernel[:, :, nnzs, :]
-      x = tf.nn.conv2d(op.inputs[0], kernel_gthr, [1, 1, 1, 1], 'SAME', data_format=data_format)
-      x = tf.nn.conv2d(
-        x, kernel_shrk, strides, padding, data_format=data_format, dilations=dilations)
 
+      # replace channel-pruned convolutional layers with cheaper operations
+      if graph_trans_mthd == 'gather':
+        x = tf.gather(op.inputs[0], nnzs, axis=1)
+        x = tf.nn.conv2d(
+          x, kernel_shrk, strides, padding, data_format=data_format, dilations=dilations)
+      elif graph_trans_mthd == '1x1_conv':
+        x = tf.nn.conv2d(op.inputs[0], kernel_gthr, [1, 1, 1, 1], 'SAME', data_format=data_format)
+        x = tf.nn.conv2d(
+          x, kernel_shrk, strides, padding, data_format=data_format, dilations=dilations)
+      else:
+        raise ValueError('unrecognized graph transformation method: ' + graph_trans_mthd)
+
+      # obtain old and new routines' outputs
       op_outputs_old += [op.outputs[0]]
       op_outputs_new += [x]
 
   return op_outputs_old, op_outputs_new
 
-def export_pb_tflite_model(net, file_path_meta, file_path_pb, file_path_tflite, edit_graph):
+def convert_pb_model_to_tflite(net, pb_path, tflite_path):
+  """Convert the *.pb model to a *.tflite model.
+
+  Args:
+  * net: network configurations
+  * pb_path: path to the *.pb file
+  * tflite_path: path to the *.tflite file
+  """
+
+  # setup a TFLite converter
+  tf.logging.info(pb_path + ' -> ' + tflite_path)
+  converter = tf.contrib.lite.TFLiteConverter.from_frozen_graph(
+    pb_path, [net['input_name']], [net['output_name']])
+  if FLAGS.enbl_uni_quant:
+    converter.inference_type = lite_constants.QUANTIZED_UINT8
+    converter.quantized_input_stats = {net['input_name']: (0., 1.)}
+  if FLAGS.enbl_fake_quant:
+    converter.post_training_quantize = True
+    converter.default_ranges_stats = (0, 6)
+
+  # convert the *.pb model to a *.tflite model
+  try:
+    tflite_model = converter.convert()
+    with open(tflite_path, 'wb') as o_file:
+      o_file.write(tflite_model)
+    tf.logging.info(tflite_path + ' generated')
+  except Exception as err:
+    tf.logging.info('unable to generate a *.tflite model')
+    raise err
+
+def test_pb_model(file_path, net_input_name, net_output_name, net_input_data):
+  """Test the *.pb model.
+
+  Args:
+  * file_path: file path to the *.pb model
+  * net_input_name: network's input node's name
+  * net_output_name: network's output node's name
+  * net_input_data: network's input node's data
+  """
+
+  with tf.Graph().as_default() as graph:
+    sess = tf.Session()
+
+    # restore the model
+    graph_def = tf.GraphDef()
+    with tf.gfile.GFile(file_path, 'rb') as i_file:
+      graph_def.ParseFromString(i_file.read())
+    tf.import_graph_def(graph_def)
+
+    # obtain input & output nodes and then test the model
+    net_input = graph.get_tensor_by_name('import/' + net_input_name + ':0')
+    net_output = graph.get_tensor_by_name('import/' + net_output_name + ':0')
+    tf.logging.info('input: {} / output: {}'.format(net_input.name, net_output.name))
+    net_output_data = sess.run(net_output, feed_dict={net_input: net_input_data})
+    tf.logging.info('outputs from the *.pb model: {}'.format(net_output_data))
+
+def test_tflite_model(file_path, net_input_data):
+  """Test the *.tflite model.
+
+  Args:
+  * file_path: file path to the *.tflite model
+  * net_input_data: network's input node's data
+  """
+
+  # restore the model and allocate tensors
+  interpreter = tf.contrib.lite.Interpreter(model_path=file_path)
+  interpreter.allocate_tensors()
+
+  # get input & output tensors
+  input_details = interpreter.get_input_details()
+  output_details = interpreter.get_output_details()
+  tf.logging.info('input details: {}'.format(input_details))
+  tf.logging.info('output details: {}'.format(output_details))
+
+  # test the model with given inputs
+  if not FLAGS.enbl_uni_quant:
+    interpreter.set_tensor(input_details[0]['index'], net_input_data)
+  else:
+    interpreter.set_tensor(input_details[0]['index'], net_input_data.astype(np.uint8))
+  interpreter.invoke()
+  net_output_data = interpreter.get_tensor(output_details[0]['index'])
+  tf.logging.info('outputs from the *.tflite model: {}'.format(net_output_data))
+
+def export_pb_tflite_model(net, meta_path, pb_path, tflite_path):
   """Export *.pb & *.tflite models from checkpoint files.
 
   Args:
   * net: network configurations
-  * file_path_meta: file path to the *.meta data
-  * file_path_pb: file path to the *.pb model
-  * file_path_tflite: file path to the *.tflite model
-  * edit_graph: whether the graph should be edited
+  * meta_path: path to the *.meta file
+  * pb_path: path to the *.pb file
+  * tflite_path: path to the *.tflite file
   """
 
   # convert checkpoint files to a *.pb model
   with tf.Graph().as_default() as graph:
-    sess = tf.Session()
+    config = tf.ConfigProto()
+    config.gpu_options.allow_growth = True  # pylint: disable=no-member
+    sess = tf.Session(config=config)
 
     # restore the graph with inputs replaced
     net_input = tf.placeholder(tf.float32, shape=net['input_shape'], name=net['input_name'])
     saver = tf.train.import_meta_graph(
-      file_path_meta, input_map={net['input_name_ckpt']: net_input})
-    saver.restore(sess, file_path_meta.replace('.meta', ''))
+      meta_path, input_map={net['input_name_ckpt']: net_input})
+    saver.restore(sess, meta_path.replace('.meta', ''))
+
+    # obtain the data format and determine which graph transformation method to use
+    data_format = get_data_format()
+    graph_trans_mthd = 'gather' if data_format == 'NCHW' else '1x1_conv'
 
     # obtain the output node
     net_logits = tf.get_collection(FLAGS.output_coll)[0]
@@ -239,31 +330,30 @@ def export_pb_tflite_model(net, file_path_meta, file_path_pb, file_path_tflite,
     op_outputs_old, op_outputs_new = replace_dropout_layers()
     sess.close()
     graph_editor.swap_outputs(op_outputs_old, op_outputs_new)
-    sess = tf.Session()  # open a new session
-    saver.restore(sess, file_path_meta.replace('.meta', ''))
+    sess = tf.Session(config=config)  # open a new session
+    saver.restore(sess, meta_path.replace('.meta', ''))
 
     # edit the graph by inserting alternative routines for each convolutional layer
-    if edit_graph:
-      op_outputs_old, op_outputs_new = insert_alt_routines(sess)
+    if FLAGS.enbl_chn_prune:
+      op_outputs_old, op_outputs_new = insert_alt_routines(sess, graph_trans_mthd)
       sess.close()
       graph_editor.swap_outputs(op_outputs_old, op_outputs_new)
-      sess = tf.Session()  # open a new session
-      saver.restore(sess, file_path_meta.replace('.meta', ''))
+      sess = tf.Session(config=config)  # open a new session
+      saver.restore(sess, meta_path.replace('.meta', ''))
 
     # write the original grpah to *.pb file
     graph_def = graph.as_graph_def()
     graph_def = tf.graph_util.convert_variables_to_constants(sess, graph_def, [net['output_name']])
-    file_name_pb = os.path.basename(file_path_pb)
+    file_name_pb = os.path.basename(pb_path)
     tf.train.write_graph(graph_def, FLAGS.model_dir, file_name_pb, as_text=False)
-    tf.logging.info(file_path_pb + ' generated')
+    tf.logging.info(pb_path + ' generated')
 
   # convert the *.pb model to a *.tflite model
-  convert_pb_model_to_tflite(file_path_pb, file_path_tflite, net['input_name'], net['output_name'])
-  tf.logging.info(file_path_tflite + ' generated')
+  convert_pb_model_to_tflite(net, pb_path, tflite_path)
 
   # test *.pb & *.tflite models
-  test_pb_model(file_path_pb, net['input_name'], net['output_name'], net['input_data'])
-  test_tflite_model(file_path_tflite, net['input_data'])
+  test_pb_model(pb_path, net['input_name'], net['output_name'], net['input_data'])
+  test_tflite_model(tflite_path, net['input_data'])
 
 def main(unused_argv):
   """Main entry.
@@ -277,25 +367,20 @@ def main(unused_argv):
     tf.logging.set_verbosity(tf.logging.INFO)
 
     # network configurations
-    file_path_meta = get_file_path_meta()
-    input_name, input_shape = get_input_name_n_shape(file_path_meta)
+    meta_path = get_meta_path()
+    input_name, input_shape = get_input_name_n_shape(meta_path)
     net = {
       'input_name_ckpt': input_name,  # used to import the model from checkpoint files
       'input_name': 'net_input',  # used to export the model to *.pb & *.tflite files
       'input_shape': input_shape,
       'output_name': 'net_output'
     }
-    net['input_data'] = np.zeros(tuple([1] + list(net['input_shape'][1:])), dtype=np.float32)
-
-    # generate *.pb & *.tflite files for the original model
-    file_path_pb = os.path.join(FLAGS.model_dir, 'model_original.pb')
-    file_path_tflite = os.path.join(FLAGS.model_dir, 'model_original.tflite')
-    export_pb_tflite_model(net, file_path_meta, file_path_pb, file_path_tflite, edit_graph=False)
+    net['input_data'] = np.random.random(size=tuple([1] + list(net['input_shape'])[1:])).astype(np.float32)
 
-    # generate *.pb & *.tflite files for the transformed model
-    file_path_pb = os.path.join(FLAGS.model_dir, 'model_transformed.pb')
-    file_path_tflite = os.path.join(FLAGS.model_dir, 'model_transformed.tflite')
-    export_pb_tflite_model(net, file_path_meta, file_path_pb, file_path_tflite, edit_graph=True)
+    # generate *.pb & *.tflite files
+    pb_path = os.path.join(FLAGS.model_dir, 'model.pb')
+    tflite_path = os.path.join(FLAGS.model_dir, 'model.tflite')
+    export_pb_tflite_model(net, meta_path, pb_path, tflite_path)
 
     # exit normally
     return 0
diff --git a/tools/conversion/export_quant_tflite_model.py b/tools/conversion/export_quant_tflite_model.py
index 9fb5f6f..b317e23 100644
--- a/tools/conversion/export_quant_tflite_model.py
+++ b/tools/conversion/export_quant_tflite_model.py
@@ -32,6 +32,15 @@
 tf.app.flags.DEFINE_string('output_coll', 'logits_final', 'output tensor\'s collection')
 tf.app.flags.DEFINE_boolean('enbl_post_quant', False, 'enable post-training quantization')
 
+# For quantization scaling - see https://www.tensorflow.org/lite/convert/cmdline_reference
+tf.app.flags.DEFINE_integer('mean_values', 128, 'mean float for inputs (de)quantization')
+tf.app.flags.DEFINE_float('std_dev_values', 127., 'scale float for inputs (de)quantization')
+tf.app.flags.DEFINE_integer('default_ranges_min', 0,
+                          'Default value for the min range values used for all arrays without a specified range.')
+tf.app.flags.DEFINE_integer('default_ranges_max', 6,
+                          'Default value for the max range values used for all arrays without a specified range.')
+
+
 def get_file_path_meta():
   """Get the file path to the *.meta data.
 
@@ -87,14 +96,15 @@ def convert_pb_model_to_tflite(net, file_path_pb, file_path_tflite, enbl_quant):
   else:
     arg_list += [
       '--inference_type QUANTIZED_UINT8',
-      '--mean_values 128',
-      '--std_dev_values 127']
+      '--mean_values %d'%FLAGS.mean_values,
+      '--std_dev_values %f'%FLAGS.std_dev_values]
     if FLAGS.enbl_post_quant:
       arg_list += [
-        '--default_ranges_min 0',
-        '--default_ranges_max 6']
+        '--default_ranges_min %d'%FLAGS.default_ranges_min,
+        '--default_ranges_max %d'%FLAGS.default_ranges_max]
   cmd_str = ' '.join(['tflite_convert'] + arg_list)
-  subprocess.call(cmd_str, shell=True)
+  tf.logging.info('Executing: %s'%cmd_str)
+  subprocess.call(cmd_str.split(), shell=False)
   tf.logging.info(file_path_tflite + ' generated')
 
 def test_pb_model(file_path, net_input_name, net_output_name, net_input_data):
@@ -145,6 +155,8 @@ def test_tflite_model(file_path, net_input_data):
   interpreter.set_tensor(input_details[0]['index'], net_input_data)
   interpreter.invoke()
   net_output_data = interpreter.get_tensor(output_details[0]['index'])
+  if output_details[0]['quantization'][0] != 0:
+    net_output_data = (net_output_data - output_details[0]['quantization'][1])*output_details[0]['quantization'][0]
   tf.logging.info('outputs from the *.tflite model: {}'.format(net_output_data))
 
 def is_initialized(sess, var):
@@ -217,7 +229,7 @@ def export_pb_tflite_model(net, file_path_meta, file_path_pb, file_paths_tflite)
     sess = tf.Session(config=config)  # open a new session
     saver.restore(sess, file_path_meta.replace('.meta', ''))
 
-    # write the original grpah to *.pb file
+    # write the original graph to *.pb file
     graph_def = graph.as_graph_def()
     graph_def = tf.graph_util.convert_variables_to_constants(sess, graph_def, [net['output_name']])
     file_name_pb = os.path.basename(file_path_pb)
@@ -231,7 +243,8 @@ def export_pb_tflite_model(net, file_path_meta, file_path_pb, file_paths_tflite)
   # test *.pb & *.tflite models
   test_pb_model(file_path_pb, net['input_name'], net['output_name'], net['input_data'])
   test_tflite_model(file_paths_tflite['float'], net['input_data'])
-  test_tflite_model(file_paths_tflite['quant'], net['input_data'].astype(np.uint8))
+  net['input_data'] = ((net['input_data'] * FLAGS.std_dev_values) + FLAGS.mean_values).astype(np.uint8)
+  test_tflite_model(file_paths_tflite['quant'], net['input_data'])
 
 def main(unused_argv):
   """Main entry.
@@ -253,7 +266,7 @@ def main(unused_argv):
       'input_shape': input_shape,
       'output_name': 'net_output'
     }
-    net['input_data'] = np.zeros(tuple([1] + list(net['input_shape'])[1:]), dtype=np.float32)
+    net['input_data'] = np.random.random(tuple([1] + list(net['input_shape'])[1:])).astype(np.float32)
 
     # generate *.pb & *.tflite files
     file_path_pb = os.path.join(FLAGS.model_dir, 'model_original.pb')
diff --git a/tools/graph_tools/add_to_collection.py b/tools/graph_tools/add_to_collection.py
new file mode 100644
index 0000000..90bc2d8
--- /dev/null
+++ b/tools/graph_tools/add_to_collection.py
@@ -0,0 +1,100 @@
+# Tencent is pleased to support the open source community by making PocketFlow available.
+#
+# Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
+#
+# Licensed under the BSD 3-Clause License (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Add a list of tensors to specified collections (useful when exporting *.pb & *.tflite models)."""
+
+import os
+import re
+import traceback
+import tensorflow as tf
+
+FLAGS = tf.app.flags.FLAGS
+
+tf.app.flags.DEFINE_string('model_dir_in', './models_in', 'input model directory')
+tf.app.flags.DEFINE_string('model_dir_out', './models_out', 'output model directory')
+tf.app.flags.DEFINE_string('tensor_names', None, 'list of tensor names (comma-separated)')
+tf.app.flags.DEFINE_string('coll_names', None, 'list of collection names (comma-separated)')
+
+'''
+Example: SSD (VGG-16) @ Pascal VOC
+
+Input:
+* data/IteratorGetNext:1 / (?, 300, 300, 3) / images
+Output:
+* quant_model/ssd300/multibox_head/cls_5/Conv2D:0 / (?, 1, 1, 84) / cls_preds
+* quant_model/ssd300/multibox_head/loc_5/Conv2D:0 / (?, 1, 1, 16) / loc_preds
+'''
+
+def main(unused_argv):
+  """Main entry.
+
+  Args:
+  * unused_argv: unused arguments (after FLAGS is parsed)
+  """
+
+  try:
+    # setup the TF logging routine
+    tf.logging.set_verbosity(tf.logging.INFO)
+
+    # add a list of tensors to specified collections
+    with tf.Graph().as_default() as graph:
+      # create a TensorFlow session
+      config = tf.ConfigProto()
+      config.gpu_options.allow_growth = True
+      sess = tf.Session(config=config)
+
+      # restore a model from *.ckpt files
+      ckpt_path = tf.train.latest_checkpoint(FLAGS.model_dir_in)
+      meta_path = ckpt_path + '.meta'
+      saver = tf.train.import_meta_graph(meta_path)
+      saver.restore(sess, ckpt_path)
+
+      # parse tensor & collection names
+      tensor_names = [sub_str.strip() for sub_str in FLAGS.tensor_names.split(',')]
+      coll_names = [sub_str.strip() for sub_str in FLAGS.coll_names.split(',')]
+      assert len(tensor_names) == len(coll_names), \
+        '# of tensors and collections do not match: %d (tensor) vs. %d (collection)' \
+        % (len(tensor_names), len(coll_names))
+
+      # obtain the full list of tensors in the graph
+      tensors = set()
+      for op in graph.get_operations():
+        tensors |= set(op.inputs) | set(op.outputs)
+      tensors = list(tensors)
+      tensors.sort(key=lambda x: x.name)
+
+      # find tensors and add them to corresponding collections
+      for tensor in tensors:
+        if tensor.name in tensor_names:
+          tf.logging.info('tensor: {} / {}'.format(tensor.name, tensor.shape))
+          coll_name = coll_names[tensor_names.index(tensor.name)]
+          tf.add_to_collection(coll_name, tensor)
+          tf.logging.info('added tensor <{}> to collection <{}>'.format(tensor.name, coll_name))
+
+      # save the modified model
+      vars_list = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)
+      saver_new = tf.train.Saver(vars_list)
+      save_path = saver_new.save(sess, os.path.join(FLAGS.model_dir_out, 'model.ckpt'))
+      tf.logging.info('model saved to ' + save_path)
+
+    # exit normally
+    return 0
+  except ValueError:
+    traceback.print_exc()
+    return 1  # exit with errors
+
+if __name__ == '__main__':
+  tf.app.run()
diff --git a/utils/external/faster_rcnn_tensorflow/configs/cfgs.py b/utils/external/faster_rcnn_tensorflow/configs/cfgs.py
new file mode 100644
index 0000000..e3b7a8d
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/configs/cfgs.py
@@ -0,0 +1,93 @@
+# -*- coding: utf-8 -*-
+from __future__ import division, print_function, absolute_import
+import os
+import tensorflow as tf
+
+# ------------------------------------------------
+VERSION = 'MobileNetV2'
+NET_NAME = 'MobilenetV2'  # 'resnet_v1_50', 'resnet_v1_101', or 'MobilenetV2'
+ADD_BOX_IN_TENSORBOARD = True
+
+if NET_NAME.startswith("resnet"):
+    weights_name = NET_NAME
+elif NET_NAME.startswith("MobilenetV2"):
+    weights_name = "mobilenet/mobilenet_v2_1.0_224"
+else:
+    raise Exception('net name must be in [resnet_v1_101, resnet_v1_50, MobilenetV2]')
+
+# ------------------------------------------ Train config
+RESTORE_FROM_RPN = False
+IS_FILTER_OUTSIDE_BOXES = True
+FIXED_BLOCKS = 1  # allow 0~3
+
+RPN_LOCATION_LOSS_WEIGHT = 1.
+RPN_CLASSIFICATION_LOSS_WEIGHT = 1.0
+
+FAST_RCNN_LOCATION_LOSS_WEIGHT = 1.0
+FAST_RCNN_CLASSIFICATION_LOSS_WEIGHT = 1.0
+RPN_SIGMA = 3.0
+FASTRCNN_SIGMA = 1.0
+
+MUTILPY_BIAS_GRADIENT = None   # 2.0  # if None, will not multiply
+GRADIENT_CLIPPING_BY_NORM = None   # 10.0  if None, will not clip
+
+EPSILON = 1e-5
+# LR = 0.001  # ResNet
+# DECAY_STEP = [50000, 70000]  # ResNet
+LR = 0.0003  # MobileNet
+DECAY_STEP = [50000, 100000]  # MobileNet
+MAX_ITERATION = 200000
+
+# -------------------------------------------- Data_preprocess_config
+DATASET_NAME = 'pascal'  # 'ship', 'spacenet', 'pascal', 'coco'
+PIXEL_MEAN = [123.68, 116.779, 103.939]  # R, G, B. In tf, channel is RGB. In openCV, channel is BGR
+IMG_SHORT_SIDE_LEN = 600
+IMG_MAX_LENGTH = 1000
+CLASS_NUM = 20
+
+# --------------------------------------------- Network_config
+BATCH_SIZE = 1
+INITIALIZER = tf.random_normal_initializer(mean=0.0, stddev=0.01)
+BBOX_INITIALIZER = tf.random_normal_initializer(mean=0.0, stddev=0.001)
+WEIGHT_DECAY = 0.00004 if NET_NAME.startswith('Mobilenet') else 0.0001
+
+# ---------------------------------------------Anchor config
+BASE_ANCHOR_SIZE_LIST = [256]  # can be modified
+ANCHOR_STRIDE = [16]  # can not be modified in most situations
+ANCHOR_SCALES = [0.5, 1., 2.0]  # [4, 8, 16, 32]
+ANCHOR_RATIOS = [0.5, 1., 2.0]
+ROI_SCALE_FACTORS = [10., 10., 5.0, 5.0]
+ANCHOR_SCALE_FACTORS = None
+
+
+# --------------------------------------------RPN config
+KERNEL_SIZE = 3
+RPN_IOU_POSITIVE_THRESHOLD = 0.7
+RPN_IOU_NEGATIVE_THRESHOLD = 0.3
+TRAIN_RPN_CLOOBER_POSITIVES = False
+
+RPN_MINIBATCH_SIZE = 256
+RPN_POSITIVE_RATE = 0.5
+RPN_NMS_IOU_THRESHOLD = 0.7
+RPN_TOP_K_NMS_TRAIN = 12000
+RPN_MAXIMUM_PROPOSAL_TARIN = 2000
+
+RPN_TOP_K_NMS_TEST = 6000  # 5000
+RPN_MAXIMUM_PROPOSAL_TEST = 300  # 300
+
+
+# -------------------------------------------Fast-RCNN config
+ROI_SIZE = 14
+ROI_POOL_KERNEL_SIZE = 2
+USE_DROPOUT = False
+KEEP_PROB = 1.0
+SHOW_SCORE_THRSHOLD = 0.5  # only show in tensorboard
+
+FAST_RCNN_NMS_IOU_THRESHOLD = 0.3  # 0.6
+FAST_RCNN_NMS_MAX_BOXES_PER_CLASS = 100
+FAST_RCNN_IOU_POSITIVE_THRESHOLD = 0.5
+FAST_RCNN_IOU_NEGATIVE_THRESHOLD = 0.0   # with 0.0, any IOU below the positive threshold (0.5) is negative
+FAST_RCNN_MINIBATCH_SIZE = 256  # if -1, train with OHEM
+FAST_RCNN_POSITIVE_RATE = 0.25
+
+ADD_GTBOXES_TO_TRAIN = False
diff --git a/utils/external/faster_rcnn_tensorflow/net/mobilenet_v2_faster_rcnn.py b/utils/external/faster_rcnn_tensorflow/net/mobilenet_v2_faster_rcnn.py
new file mode 100644
index 0000000..4a35957
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/net/mobilenet_v2_faster_rcnn.py
@@ -0,0 +1,127 @@
+# -*- coding: utf-8 -*-
+
+from __future__ import absolute_import, print_function, division
+import tensorflow.contrib.slim as slim
+import tensorflow as tf
+
+from utils.external import mobilenet_v2
+from utils.external.mobilenet import training_scope
+from utils.external.mobilenet_v2 import op
+from utils.external.mobilenet_v2 import ops
+expand_input = ops.expand_input_by_factor
+
+V2_BASE_DEF = dict(
+    defaults={
+        # Note: these parameters of batch norm affect the architecture
+        # that's why they are here and not in training_scope.
+        (slim.batch_norm,): {'center': True, 'scale': True},
+        (slim.conv2d, slim.fully_connected, slim.separable_conv2d): {
+            'normalizer_fn': slim.batch_norm, 'activation_fn': tf.nn.relu6
+        },
+        (ops.expanded_conv,): {
+            'expansion_size': expand_input(6),
+            'split_expansion': 1,
+            'normalizer_fn': slim.batch_norm,
+            'residual': True
+        },
+        (slim.conv2d, slim.separable_conv2d): {'padding': 'SAME'}
+    },
+    spec=[
+        op(slim.conv2d, stride=2, num_outputs=32, kernel_size=[3, 3]),
+        op(ops.expanded_conv,
+           expansion_size=expand_input(1, divisible_by=1),
+           num_outputs=16, scope='expanded_conv'),
+        op(ops.expanded_conv, stride=2, num_outputs=24, scope='expanded_conv_1'),
+        op(ops.expanded_conv, stride=1, num_outputs=24, scope='expanded_conv_2'),
+        op(ops.expanded_conv, stride=2, num_outputs=32, scope='expanded_conv_3'),
+        op(ops.expanded_conv, stride=1, num_outputs=32, scope='expanded_conv_4'),
+        op(ops.expanded_conv, stride=1, num_outputs=32, scope='expanded_conv_5'),
+        op(ops.expanded_conv, stride=2, num_outputs=64, scope='expanded_conv_6'),
+        op(ops.expanded_conv, stride=1, num_outputs=64, scope='expanded_conv_7'),
+        op(ops.expanded_conv, stride=1, num_outputs=64, scope='expanded_conv_8'),
+        op(ops.expanded_conv, stride=1, num_outputs=64, scope='expanded_conv_9'),
+        op(ops.expanded_conv, stride=1, num_outputs=96, scope='expanded_conv_10'),
+        op(ops.expanded_conv, stride=1, num_outputs=96, scope='expanded_conv_11'),
+        op(ops.expanded_conv, stride=1, num_outputs=96, scope='expanded_conv_12')
+    ],
+)
+
+
+V2_HEAD_DEF = dict(
+    defaults={
+        # Note: these parameters of batch norm affect the architecture
+        # that's why they are here and not in training_scope.
+        (slim.batch_norm,): {'center': True, 'scale': True},
+        (slim.conv2d, slim.fully_connected, slim.separable_conv2d): {
+            'normalizer_fn': slim.batch_norm, 'activation_fn': tf.nn.relu6
+        },
+        (ops.expanded_conv,): {
+            'expansion_size': expand_input(6),
+            'split_expansion': 1,
+            'normalizer_fn': slim.batch_norm,
+            'residual': True
+        },
+        (slim.conv2d, slim.separable_conv2d): {'padding': 'SAME'}
+    },
+    spec=[
+        op(ops.expanded_conv, stride=2, num_outputs=160, scope='expanded_conv_13'),
+        op(ops.expanded_conv, stride=1, num_outputs=160, scope='expanded_conv_14'),
+        op(ops.expanded_conv, stride=1, num_outputs=160, scope='expanded_conv_15'),
+        op(ops.expanded_conv, stride=1, num_outputs=320, scope='expanded_conv_16'),
+        op(slim.conv2d, stride=1, kernel_size=[1, 1], num_outputs=1280, scope='Conv_1')
+    ],
+)
+
+def mobilenetv2_scope(is_training=True,
+                      trainable=True,
+                      weight_decay=0.00004,
+                      stddev=0.09,
+                      dropout_keep_prob=0.8,
+                      bn_decay=0.997):
+  """Defines Mobilenet training scope.
+  By default, batch norm is frozen (is_training and trainable are both False).
+
+  This rewrites the default training scope.
+  """
+  batch_norm_params = {
+      'is_training': False,
+      'trainable': False,
+      'decay': bn_decay,
+  }
+  with slim.arg_scope(training_scope(is_training=is_training, weight_decay=weight_decay)):
+      with slim.arg_scope([slim.conv2d, slim.fully_connected, slim.separable_conv2d],
+                          trainable=trainable):
+          with slim.arg_scope([slim.batch_norm], **batch_norm_params) as sc:
+              return sc
+
+
+
+def mobilenetv2_base(img_batch, is_training=True):
+
+    with slim.arg_scope(mobilenetv2_scope(is_training=is_training, trainable=True)):
+
+        feature_to_crop, endpoints = mobilenet_v2.mobilenet_base(input_tensor=img_batch,
+                                                      num_classes=None,
+                                                      is_training=False,
+                                                      depth_multiplier=1.0,
+                                                      scope='MobilenetV2',
+                                                      conv_defs=V2_BASE_DEF,
+                                                      finegrain_classification_mode=False)
+
+        # feature_to_crop = tf.Print(feature_to_crop, [tf.shape(feature_to_crop)], summarize=10, message='rpn_shape')
+        return feature_to_crop
+
+
+def mobilenetv2_head(inputs, is_training=True):
+    with slim.arg_scope(mobilenetv2_scope(is_training=is_training, trainable=True)):
+        net, _ = mobilenet_v2.mobilenet(input_tensor=inputs,
+                                        num_classes=None,
+                                        is_training=False,
+                                        depth_multiplier=1.0,
+                                        scope='MobilenetV2',
+                                        conv_defs=V2_HEAD_DEF,
+                                        finegrain_classification_mode=False)
+
+        net = tf.squeeze(net, [1, 2])
+
+        return net
\ No newline at end of file
diff --git a/utils/external/faster_rcnn_tensorflow/net/resnet_faster_rcnn.py b/utils/external/faster_rcnn_tensorflow/net/resnet_faster_rcnn.py
new file mode 100644
index 0000000..9d03cfe
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/net/resnet_faster_rcnn.py
@@ -0,0 +1,170 @@
+# -*- coding: utf-8 -*-
+
+from __future__ import absolute_import, print_function, division
+
+
+import tensorflow as tf
+import tensorflow.contrib.slim as slim
+from utils.external.faster_rcnn_tensorflow.configs import cfgs
+from tensorflow.contrib.slim.nets import resnet_v1
+from tensorflow.contrib.slim.nets import resnet_utils
+from tensorflow.contrib.slim.python.slim.nets.resnet_v1 import resnet_v1_block
+
+# import tfplot as tfp
+
+def resnet_arg_scope(
+        is_training=True, weight_decay=cfgs.WEIGHT_DECAY, batch_norm_decay=0.997,
+        batch_norm_epsilon=1e-5, batch_norm_scale=True):
+    '''
+
+    By default, we do not train BN in resnet, since batch_size is too small.
+    So is_training is False and trainable is False in the batch_norm params.
+
+    '''
+    batch_norm_params = {
+        'is_training': False, 'decay': batch_norm_decay,
+        'epsilon': batch_norm_epsilon, 'scale': batch_norm_scale,
+        'trainable': False,
+        'updates_collections': tf.GraphKeys.UPDATE_OPS
+    }
+
+    with slim.arg_scope(
+            [slim.conv2d],
+            weights_regularizer=slim.l2_regularizer(weight_decay),
+            weights_initializer=slim.variance_scaling_initializer(),
+            trainable=is_training,
+            activation_fn=tf.nn.relu,
+            normalizer_fn=slim.batch_norm,
+            normalizer_params=batch_norm_params):
+        with slim.arg_scope([slim.batch_norm], **batch_norm_params) as arg_sc:
+            return arg_sc
+
+
+# def add_heatmap(feature_maps, name):
+#     '''
+#
+#     :param feature_maps:[B, H, W, C]
+#     :return:
+#     '''
+#
+#     def figure_attention(activation):
+#         fig, ax = tfp.subplots()
+#         im = ax.imshow(activation, cmap='jet')
+#         fig.colorbar(im)
+#         return fig
+#
+#     heatmap = tf.reduce_sum(feature_maps, axis=-1)
+#     heatmap = tf.squeeze(heatmap, axis=0)
+#     tfp.summary.plot(name, figure_attention, [heatmap])
+
+
+def resnet_base(img_batch, scope_name, is_training=True):
+    '''
+    This code is derived from light-head rcnn.
+    https://github.com/zengarden/light_head_rcnn
+
+    It is convenient to freeze blocks, so we adopt this mode.
+    '''
+    if scope_name == 'resnet_v1_50':
+        middle_num_units = 6
+    elif scope_name == 'resnet_v1_101':
+        middle_num_units = 23
+    else:
+        raise NotImplementedError('We only support resnet_v1_50 or resnet_v1_101. Check your network name.')
+
+    blocks = [resnet_v1_block('block1', base_depth=64, num_units=3, stride=2),
+              resnet_v1_block('block2', base_depth=128, num_units=4, stride=2),
+              # use stride 1 for the last conv4 layer.
+
+              resnet_v1_block('block3', base_depth=256, num_units=middle_num_units, stride=1)]
+              # when using FPN, the stride list is [1, 2, 2]
+
+    with slim.arg_scope(resnet_arg_scope(is_training=False)):
+        with tf.variable_scope(scope_name, scope_name):
+            # Do the first few layers manually, because 'SAME' padding can behave inconsistently
+            # for images of different sizes: sometimes 0, sometimes 1
+            net = resnet_utils.conv2d_same(
+                img_batch, 64, 7, stride=2, scope='conv1')
+            net = tf.pad(net, [[0, 0], [1, 1], [1, 1], [0, 0]])
+            net = slim.max_pool2d(
+                net, [3, 3], stride=2, padding='VALID', scope='pool1')
+
+    not_freezed = [False] * cfgs.FIXED_BLOCKS + (4-cfgs.FIXED_BLOCKS)*[True]
+    # FIXED_BLOCKS can be 0~3 (see cfgs.FIXED_BLOCKS)
+
+    with slim.arg_scope(resnet_arg_scope(is_training=(is_training and not_freezed[0]))):
+        C2, _ = resnet_v1.resnet_v1(net,
+                                    blocks[0:1],
+                                    global_pool=False,
+                                    include_root_block=False,
+                                    scope=scope_name)
+
+    # C2 = tf.Print(C2, [tf.shape(C2)], summarize=10, message='C2_shape')
+    # add_heatmap(C2, 'Layer/C2')
+
+    with slim.arg_scope(resnet_arg_scope(is_training=(is_training and not_freezed[1]))):
+        C3, _ = resnet_v1.resnet_v1(C2,
+                                    blocks[1:2],
+                                    global_pool=False,
+                                    include_root_block=False,
+                                    scope=scope_name)
+    # add_heatmap(C3, name='Layer/C3')
+    # C3 = tf.Print(C3, [tf.shape(C3)], summarize=10, message='C3_shape')
+
+    with slim.arg_scope(resnet_arg_scope(is_training=(is_training and not_freezed[2]))):
+        C4, _ = resnet_v1.resnet_v1(C3,
+                                    blocks[2:3],
+                                    global_pool=False,
+                                    include_root_block=False,
+                                    scope=scope_name)
+    # add_heatmap(C4, name='Layer/C4')
+    # C4 = tf.Print(C4, [tf.shape(C4)], summarize=10, message='C4_shape')
+    return C4
+
+
+def restnet_head(input, is_training, scope_name):
+    block4 = [resnet_v1_block('block4', base_depth=512, num_units=3, stride=1)]
+
+    with slim.arg_scope(resnet_arg_scope(is_training=is_training)):
+        C5, _ = resnet_v1.resnet_v1(input,
+                                    block4,
+                                    global_pool=False,
+                                    include_root_block=False,
+                                    scope=scope_name)
+        # C5 = tf.Print(C5, [tf.shape(C5)], summarize=10, message='C5_shape')
+        C5_flatten = tf.reduce_mean(C5, axis=[1, 2], keepdims=False, name='global_average_pooling')
+        # C5_flatten = tf.Print(C5_flatten, [tf.shape(C5_flatten)], summarize=10, message='C5_flatten_shape')
+
+    # global average pooling C5 to obtain fc layers
+    return C5_flatten
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/utils/external/faster_rcnn_tensorflow/preprocessing/faster_rcnn_preprocessing.py b/utils/external/faster_rcnn_tensorflow/preprocessing/faster_rcnn_preprocessing.py
new file mode 100644
index 0000000..0752ba2
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/preprocessing/faster_rcnn_preprocessing.py
@@ -0,0 +1,124 @@
+# -*- coding: utf-8 -*-
+
+from __future__ import absolute_import
+from __future__ import print_function
+from __future__ import division
+
+from utils.external.faster_rcnn_tensorflow.configs import cfgs
+
+import tensorflow as tf
+
+import numpy as np
+
+
+def max_length_limitation(length, length_limitation):
+    return tf.cond(tf.less(length, length_limitation),
+                   true_fn=lambda: length,
+                   false_fn=lambda: length_limitation)
+
+def short_side_resize(img_tensor, labels, bboxes, target_shortside_len, length_limitation=1200):
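+    '''
+    Resize the image so that its shorter side equals target_shortside_len.
+    :param img_tensor: [h, w, c]; bboxes: [-1, 4], normalized [ymin, xmin, ymax, xmax]
+    :param target_shortside_len: target length of the shorter side
+    :param length_limitation: set max length to avoid OUT OF MEMORY
+    :return: resized image, labels, and int32 boxes [xmin, ymin, xmax, ymax] in pixels of the resized image
+    '''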
+    img_h, img_w = tf.shape(img_tensor)[0], tf.shape(img_tensor)[1]
+
+    new_h, new_w = tf.cond(tf.less(img_h, img_w),
+                           true_fn=lambda: (target_shortside_len,
+                                            max_length_limitation(target_shortside_len * img_w // img_h, length_limitation)),
+                           false_fn=lambda: (max_length_limitation(target_shortside_len * img_h // img_w, length_limitation),
+                                             target_shortside_len))
+
+    img_tensor = tf.expand_dims(img_tensor, axis=0)
+    img_tensor = tf.image.resize_bilinear(img_tensor, [new_h, new_w])
+    ymin, xmin, ymax, xmax = tf.unstack(bboxes, axis=1)
+
+    img_h = tf.cast(img_h, tf.float32)
+    img_w = tf.cast(img_w, tf.float32)
+    new_h = tf.cast(new_h, tf.float32)
+    new_w = tf.cast(new_w, tf.float32)
+
+    ymin = ymin * img_h
+    ymax = ymax * img_h
+    xmin = xmin * img_w
+    xmax = xmax * img_w
+
+    new_xmin, new_ymin = xmin * new_w // img_w, ymin * new_h // img_h
+    new_xmax, new_ymax = xmax * new_w // img_w, ymax * new_h // img_h
+
+    img_tensor = tf.squeeze(img_tensor, axis=0)  # ensure image tensor rank is 3
+
+    return img_tensor, labels, tf.cast(tf.transpose(tf.stack([new_xmin, new_ymin, new_xmax, new_ymax], axis=0)), dtype=tf.int32)
+
+
+def short_side_resize_for_inference_data(img_tensor, target_shortside_len, length_limitation=1200):
+    img_h, img_w = tf.shape(img_tensor)[0], tf.shape(img_tensor)[1]
+
+    new_h, new_w = tf.cond(tf.less(img_h, img_w),
+                           true_fn=lambda: (target_shortside_len,
+                                            max_length_limitation(target_shortside_len * img_w // img_h, length_limitation)),
+                           false_fn=lambda: (max_length_limitation(target_shortside_len * img_h // img_w, length_limitation),
+                                             target_shortside_len))
+
+    img_tensor = tf.expand_dims(img_tensor, axis=0)
+    img_tensor = tf.image.resize_bilinear(img_tensor, [new_h, new_w])
+
+    img_tensor = tf.squeeze(img_tensor, axis=0)  # ensure image tensor rank is 3
+    return img_tensor
+
+def flip_left_to_right(img_tensor, labels, bboxes):
+
+  h, w = tf.shape(img_tensor)[0], tf.shape(img_tensor)[1]
+  img_tensor = tf.image.flip_left_right(img_tensor)
+  xmin, ymin, xmax, ymax = tf.unstack(bboxes, axis=1)
+  new_xmax = w - xmin
+  new_xmin = w - xmax
+
+  return img_tensor, labels, tf.transpose(tf.stack([new_xmin, ymin, new_xmax, ymax], axis=0))
+
+def random_flip_left_right(img_tensor, labels, bboxes):
+    img_tensor, labels, bboxes = tf.cond(tf.less(tf.random_uniform(shape=[], minval=0, maxval=1), 0.5),
+                                         lambda: flip_left_to_right(img_tensor, labels, bboxes),
+                                         lambda: (img_tensor, labels, bboxes))
+
+    return img_tensor, labels, bboxes
+
+
+def preprocess_for_train(image, labels, bboxes, out_shape, data_format='channels_first', scope='ssd_preprocessing_train', output_rgb=True):
+  img = tf.cast(image, tf.float32)
+  img, labels, bboxes = short_side_resize(img_tensor=img, labels=labels, bboxes=bboxes,
+                                          target_shortside_len=cfgs.IMG_SHORT_SIDE_LEN, length_limitation=cfgs.IMG_MAX_LENGTH)
+  img, labels, bboxes = random_flip_left_right(img_tensor=img, labels=labels, bboxes=bboxes)
+
+  img = img - tf.constant([[cfgs.PIXEL_MEAN]])  # subtract the pixel mean last
+  return img, labels, bboxes
+
+def preprocess_for_eval(image, out_shape, data_format='channels_first', scope='ssd_preprocessing_eval', output_rgb=True):
+  img = tf.cast(image, tf.float32)
+  img = short_side_resize_for_inference_data(img_tensor=img,
+                                             target_shortside_len=cfgs.IMG_SHORT_SIDE_LEN,
+                                             length_limitation=cfgs.IMG_MAX_LENGTH)
+  img = img - tf.constant(cfgs.PIXEL_MEAN)
+  return img
+
+def preprocess_image(image, labels, bboxes, out_shape, is_training=False, data_format='channels_first', output_rgb=True):
+  """Preprocesses the given image.
+
+  Args:
+    image: A `Tensor` representing an image of arbitrary size.
+    labels: A `Tensor` containing all labels for all bboxes of this image.
+    bboxes: A `Tensor` containing all bboxes of this image, in range [0., 1.] with shape [num_bboxes, 4].
+    out_shape: The height and width of the image after preprocessing.
+    is_training: Whether we are in training phase.
+    data_format: The data_format of the desired output image.
+
+  Returns:
+    A preprocessed image (plus labels and bboxes when is_training is True).
+  """
+  if is_training:
+    return preprocess_for_train(image, labels, bboxes, out_shape, data_format=data_format, output_rgb=output_rgb)
+  else:
+    return preprocess_for_eval(image, out_shape, data_format=data_format, output_rgb=output_rgb)
\ No newline at end of file
diff --git a/utils/external/faster_rcnn_tensorflow/utility/__init__.py b/utils/external/faster_rcnn_tensorflow/utility/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/utils/external/faster_rcnn_tensorflow/utility/anchor_target_layer_without_boxweight.py b/utils/external/faster_rcnn_tensorflow/utility/anchor_target_layer_without_boxweight.py
new file mode 100644
index 0000000..1468402
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/utility/anchor_target_layer_without_boxweight.py
@@ -0,0 +1,154 @@
+# --------------------------------------------------------
+# Faster R-CNN
+# Copyright (c) 2015 Microsoft
+# Licensed under The MIT License [see LICENSE for details]
+# Written by Ross Girshick and Xinlei Chen
+# --------------------------------------------------------
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+from utils.external.faster_rcnn_tensorflow.configs import cfgs
+import numpy as np
+import numpy.random as npr
+
+from utils.external.faster_rcnn_tensorflow.utility import encode_and_decode
+
+def bbox_overlaps(boxes, query_boxes):
+  """
+  Parameters
+  ----------
+  boxes: (N, 4) ndarray of float
+  query_boxes: (K, 4) ndarray of float
+  Returns
+  -------
+  overlaps: (N, K) ndarray of overlap between boxes and query_boxes
+  """
+  N = boxes.shape[0]
+  K = query_boxes.shape[0]
+  overlaps = np.zeros((N, K))
+  for k in range(K):
+    box_area = (
+        (query_boxes[k, 2] - query_boxes[k, 0] + 1) *
+        (query_boxes[k, 3] - query_boxes[k, 1] + 1)
+    )
+    for n in range(N):
+      iw = (
+          min(boxes[n, 2], query_boxes[k, 2]) -
+          max(boxes[n, 0], query_boxes[k, 0]) + 1
+      )
+      if iw > 0:
+        ih = (
+            min(boxes[n, 3], query_boxes[k, 3]) -
+            max(boxes[n, 1], query_boxes[k, 1]) + 1
+        )
+        if ih > 0:
+          ua = float(
+            (boxes[n, 2] - boxes[n, 0] + 1) *
+            (boxes[n, 3] - boxes[n, 1] + 1) +
+            box_area - iw * ih
+          )
+          overlaps[n, k] = iw * ih / ua
+  return overlaps
+
+def anchor_target_layer(
+        gt_boxes, img_shape, all_anchors, is_restrict_bg=False):
+    """Same as the anchor target layer in original Fast/er RCNN """
+
+    total_anchors = all_anchors.shape[0]
+    img_h, img_w = img_shape[1], img_shape[2]
+    gt_boxes = gt_boxes[:, :-1]  # remove class label
+
+    # allow boxes to sit over the edge by a small amount
+    _allowed_border = 0
+
+    # only keep anchors inside the image
+    inds_inside = np.where(
+        (all_anchors[:, 0] >= -_allowed_border) &
+        (all_anchors[:, 1] >= -_allowed_border) &
+        (all_anchors[:, 2] < img_w + _allowed_border) &  # width
+        (all_anchors[:, 3] < img_h + _allowed_border)  # height
+    )[0]
+
+    anchors = all_anchors[inds_inside, :]
+
+    # label: 1 is positive, 0 is negative, -1 is don't care
+    labels = np.empty((len(inds_inside),), dtype=np.float32)
+    labels.fill(-1)
+
+    # overlaps between the anchors and the gt boxes
+    overlaps = bbox_overlaps(anchors,gt_boxes)
+    argmax_overlaps = overlaps.argmax(axis=1)
+    max_overlaps = overlaps[np.arange(len(inds_inside)), argmax_overlaps]
+    gt_argmax_overlaps = overlaps.argmax(axis=0)
+    gt_max_overlaps = overlaps[
+        gt_argmax_overlaps, np.arange(overlaps.shape[1])]
+    gt_argmax_overlaps = np.where(overlaps == gt_max_overlaps)[0]
+
+    if not cfgs.TRAIN_RPN_CLOOBER_POSITIVES:
+        labels[max_overlaps < cfgs.RPN_IOU_NEGATIVE_THRESHOLD] = 0
+
+    labels[gt_argmax_overlaps] = 1
+    labels[max_overlaps >= cfgs.RPN_IOU_POSITIVE_THRESHOLD] = 1
+
+    if cfgs.TRAIN_RPN_CLOOBER_POSITIVES:
+        labels[max_overlaps < cfgs.RPN_IOU_NEGATIVE_THRESHOLD] = 0
+
+    num_fg = int(cfgs.RPN_MINIBATCH_SIZE * cfgs.RPN_POSITIVE_RATE)
+    fg_inds = np.where(labels == 1)[0]
+    if len(fg_inds) > num_fg:
+        disable_inds = npr.choice(
+            fg_inds, size=(len(fg_inds) - num_fg), replace=False)
+        labels[disable_inds] = -1
+
+    num_bg = cfgs.RPN_MINIBATCH_SIZE - np.sum(labels == 1)
+    if is_restrict_bg:
+        num_bg = int(max(num_bg, num_fg * 1.5))  # cast to int: npr.choice needs an integer size
+    bg_inds = np.where(labels == 0)[0]
+    if len(bg_inds) > num_bg:
+        disable_inds = npr.choice(
+            bg_inds, size=(len(bg_inds) - num_bg), replace=False)
+        labels[disable_inds] = -1
+
+    bbox_targets = _compute_targets(anchors, gt_boxes[argmax_overlaps, :])
+
+    # map up to original set of anchors
+    labels = _unmap(labels, total_anchors, inds_inside, fill=-1)
+    bbox_targets = _unmap(bbox_targets, total_anchors, inds_inside, fill=0)
+
+    # labels = labels.reshape((1, height, width, A))
+    rpn_labels = labels.reshape((-1, 1))
+
+    # bbox_targets
+    bbox_targets = bbox_targets.reshape((-1, 4))
+    rpn_bbox_targets = bbox_targets
+
+    return rpn_labels, rpn_bbox_targets
+
+
+def _unmap(data, count, inds, fill=0):
+    """ Unmap a subset of item (data) back to the original set of items (of
+    size count) """
+    if len(data.shape) == 1:
+        ret = np.empty((count,), dtype=np.float32)
+        ret.fill(fill)
+        ret[inds] = data
+    else:
+        ret = np.empty((count,) + data.shape[1:], dtype=np.float32)
+        ret.fill(fill)
+        ret[inds, :] = data
+    return ret
+
+
+def _compute_targets(ex_rois, gt_rois):
+    """Compute bounding-box regression targets for an image."""
+    # targets = bbox_transform(ex_rois, gt_rois[:, :4]).astype(
+    #     np.float32, copy=False)
+    targets = encode_and_decode.encode_boxes(unencode_boxes=gt_rois,
+                                             reference_boxes=ex_rois,
+                                             scale_factors=cfgs.ANCHOR_SCALE_FACTORS)
+    # targets = encode_and_decode.encode_boxes(ex_rois=ex_rois,
+    #                                          gt_rois=gt_rois,
+    #                                          scale_factor=None)
+    return targets
diff --git a/utils/external/faster_rcnn_tensorflow/utility/anchor_utils.py b/utils/external/faster_rcnn_tensorflow/utility/anchor_utils.py
new file mode 100644
index 0000000..5d3dc06
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/utility/anchor_utils.py
@@ -0,0 +1,66 @@
+# -*- coding: utf-8 -*-
+from __future__ import absolute_import, print_function, division
+
+import tensorflow as tf
+
+def make_anchors(base_anchor_size, anchor_scales, anchor_ratios,
+                 featuremap_height, featuremap_width,
+                 stride, name='make_anchors'):
+    '''
+    :param base_anchor_size:256
+    :param anchor_scales:
+    :param anchor_ratios:
+    :param featuremap_height:
+    :param featuremap_width:
+    :param stride:
+    :return:
+    '''
+    with tf.variable_scope(name):
+        base_anchor = tf.constant([0, 0, base_anchor_size, base_anchor_size], tf.float32)  # [x_center, y_center, w, h]
+
+        ws, hs = enum_ratios(enum_scales(base_anchor, anchor_scales),
+                             anchor_ratios)  # per locations ws and hs
+
+        x_centers = tf.range(featuremap_width, dtype=tf.float32) * stride
+        y_centers = tf.range(featuremap_height, dtype=tf.float32) * stride
+
+        x_centers, y_centers = tf.meshgrid(x_centers, y_centers)
+
+        ws, x_centers = tf.meshgrid(ws, x_centers)
+        hs, y_centers = tf.meshgrid(hs, y_centers)
+
+        anchor_centers = tf.stack([x_centers, y_centers], 2)
+        anchor_centers = tf.reshape(anchor_centers, [-1, 2])
+
+        box_sizes = tf.stack([ws, hs], axis=2)
+        box_sizes = tf.reshape(box_sizes, [-1, 2])
+        # anchors = tf.concat([anchor_centers, box_sizes], axis=1)
+        anchors = tf.concat([anchor_centers - 0.5*box_sizes,
+                             anchor_centers + 0.5*box_sizes], axis=1)
+        return anchors
+
+
+def enum_scales(base_anchor, anchor_scales):
+
+    anchor_scales = base_anchor * tf.constant(anchor_scales, dtype=tf.float32, shape=(len(anchor_scales), 1))
+
+    return anchor_scales
+
+
+def enum_ratios(anchors, anchor_ratios):
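+    '''
+    ratio = h / w
+    :param anchors: scaled base anchors, [-1, 4] as [x_center, y_center, w, h]
+    :param anchor_ratios: list of aspect ratios
+    :return: (hs, ws), each of shape [-1, 1]; note that heights come first
+    '''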
+    ws = anchors[:, 2]  # for base anchor: w == h
+    hs = anchors[:, 3]
+    sqrt_ratios = tf.sqrt(tf.constant(anchor_ratios, dtype=tf.float32))
+
+    ws = tf.reshape(ws / sqrt_ratios[:, tf.newaxis], [-1, 1])
+    hs = tf.reshape(hs * sqrt_ratios[:, tf.newaxis], [-1, 1])
+
+    return hs, ws
+
+
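The anchor construction in `anchor_utils.py` can be hard to follow through the nested `tf.meshgrid` calls. The following is a minimal NumPy sketch of the same idea, not the code in this patch: `make_anchors_np` is a hypothetical name, it assumes `ratio = h / w` as stated in the `enum_ratios` docstring, and the ordering of anchors within a cell may differ from the TensorFlow version.

```python
import numpy as np


def make_anchors_np(base_anchor_size, anchor_scales, anchor_ratios,
                    featuremap_height, featuremap_width, stride):
    """Hypothetical NumPy counterpart of make_anchors; assumes ratio = h / w."""
    scales = np.asarray(anchor_scales, dtype=np.float32)
    sqrt_ratios = np.sqrt(np.asarray(anchor_ratios, dtype=np.float32))

    # One (w, h) pair per (scale, ratio) combination. The area stays
    # (base_anchor_size * scale) ** 2 while h / w equals the ratio.
    ws = (base_anchor_size * scales[:, None] / sqrt_ratios[None, :]).reshape(-1)
    hs = (base_anchor_size * scales[:, None] * sqrt_ratios[None, :]).reshape(-1)

    # Anchor centers: one per feature-map cell, in input-image pixels.
    x_centers = np.arange(featuremap_width, dtype=np.float32) * stride
    y_centers = np.arange(featuremap_height, dtype=np.float32) * stride
    x_centers, y_centers = np.meshgrid(x_centers, y_centers)
    x_centers = x_centers.reshape(-1, 1)
    y_centers = y_centers.reshape(-1, 1)

    # Broadcast centers [cells, 1] against sizes [shapes] -> [cells, shapes].
    anchors = np.stack([x_centers - 0.5 * ws, y_centers - 0.5 * hs,
                        x_centers + 0.5 * ws, y_centers + 0.5 * hs], axis=-1)
    return anchors.reshape(-1, 4)  # [xmin, ymin, xmax, ymax]
```

Calling `make_anchors_np(16, [1.0], [0.5, 1.0, 2.0], 2, 3, 16)` yields 2 x 3 cells times 3 shapes, i.e. 18 anchors, all centered on multiples of the stride.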
diff --git a/utils/external/faster_rcnn_tensorflow/utility/boxes_utils.py b/utils/external/faster_rcnn_tensorflow/utility/boxes_utils.py
new file mode 100644
index 0000000..e9b1885
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/utility/boxes_utils.py
@@ -0,0 +1,110 @@
+# -*- coding: utf-8 -*-
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import tensorflow as tf
+
+def ious_calu(boxes_1, boxes_2):
+    '''
+    Compute the pairwise IoU between two sets of boxes.
+    :param boxes_1: [N, 4] [xmin, ymin, xmax, ymax]
+    :param boxes_2: [M, 4] [xmin, ymin, xmax, ymax]
+    :return: ious, shape [N, M]
+    '''
+    boxes_1 = tf.cast(boxes_1, tf.float32)
+    boxes_2 = tf.cast(boxes_2, tf.float32)
+    xmin_1, ymin_1, xmax_1, ymax_1 = tf.split(boxes_1, 4, axis=1)  # xmin_1 shape is [N, 1]..
+    xmin_2, ymin_2, xmax_2, ymax_2 = tf.unstack(boxes_2, axis=1)  # xmin_2 shape is [M, ]..
+
+    max_xmin = tf.maximum(xmin_1, xmin_2)
+    min_xmax = tf.minimum(xmax_1, xmax_2)
+
+    max_ymin = tf.maximum(ymin_1, ymin_2)
+    min_ymax = tf.minimum(ymax_1, ymax_2)
+
+    overlap_h = tf.maximum(0., min_ymax - max_ymin)  # avoid h < 0
+    overlap_w = tf.maximum(0., min_xmax - max_xmin)
+
+    overlaps = overlap_h * overlap_w
+
+    area_1 = (xmax_1 - xmin_1) * (ymax_1 - ymin_1)  # [N, 1]
+    area_2 = (xmax_2 - xmin_2) * (ymax_2 - ymin_2)  # [M, ]
+
+    ious = overlaps / (area_1 + area_2 - overlaps)
+
+    return ious
+
+
+def clip_boxes_to_img_boundaries(decode_boxes, img_shape):
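+    '''
+    :param decode_boxes: [N, 4] boxes as [xmin, ymin, xmax, ymax]
+    :param img_shape: image tensor shape; img_shape[1] is height, img_shape[2] is width
+    :return: decoded boxes, clipped to the image boundaries
+    '''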
+
+    with tf.name_scope('clip_boxes_to_img_boundaries'):
+
+        # xmin, ymin, xmax, ymax = tf.unstack(decode_boxes, axis=1)
+        xmin = decode_boxes[:, 0]
+        ymin = decode_boxes[:, 1]
+        xmax = decode_boxes[:, 2]
+        ymax = decode_boxes[:, 3]
+        img_h, img_w = img_shape[1], img_shape[2]
+
+        img_h, img_w = tf.cast(img_h, tf.float32), tf.cast(img_w, tf.float32)
+
+        xmin = tf.maximum(tf.minimum(xmin, img_w-1.), 0.)
+        ymin = tf.maximum(tf.minimum(ymin, img_h-1.), 0.)
+
+        xmax = tf.maximum(tf.minimum(xmax, img_w-1.), 0.)
+        ymax = tf.maximum(tf.minimum(ymax, img_h-1.), 0.)
+
+        return tf.transpose(tf.stack([xmin, ymin, xmax, ymax]))
+
+
+def filter_outside_boxes(boxes, img_h, img_w):
+    '''
+    :param boxes: boxes with format [xmin, ymin, xmax, ymax]
+    :param img_h: height of image
+    :param img_w: width of image
+    :return: indices of the boxes that lie inside the image boundary
+    '''
+
+    with tf.name_scope('filter_outside_boxes'):
+        xmin, ymin, xmax, ymax = tf.unstack(boxes, axis=1)
+
+        xmin_index = tf.greater_equal(xmin, 0)
+        ymin_index = tf.greater_equal(ymin, 0)
+        xmax_index = tf.less_equal(xmax, tf.cast(img_w, tf.float32))
+        ymax_index = tf.less_equal(ymax, tf.cast(img_h, tf.float32))
+
+        indices = tf.transpose(tf.stack([xmin_index, ymin_index, xmax_index, ymax_index]))
+        indices = tf.cast(indices, dtype=tf.int32)
+        indices = tf.reduce_sum(indices, axis=1)
+        indices = tf.where(tf.equal(indices, 4))
+        # indices = tf.equal(indices, 4)
+        return tf.reshape(indices, [-1])
+
+
+def padd_boxes_with_zeros(boxes, scores, max_num_of_boxes):
+
+    '''
+    When there are fewer boxes than max_num_of_boxes, pad with zero boxes [0, 0, 0, 0].
+    :param boxes: [-1, 4]
+    :param scores: [-1]
+    :param max_num_of_boxes: number of boxes after padding
+    :return: padded boxes and padded scores
+    '''
+
+    pad_num = tf.cast(max_num_of_boxes, tf.int32) - tf.shape(boxes)[0]
+
+    zero_boxes = tf.zeros(shape=[pad_num, 4], dtype=boxes.dtype)
+    zero_scores = tf.zeros(shape=[pad_num], dtype=scores.dtype)
+
+    final_boxes = tf.concat([boxes, zero_boxes], axis=0)
+
+    final_scores = tf.concat([scores, zero_scores], axis=0)
+
+    return final_boxes, final_scores
\ No newline at end of file
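`ious_calu` relies on broadcasting an `[N, 1]` column against an `[M]` row to get an `[N, M]` result. Below is a small NumPy sketch of that same broadcasting pattern; `iou_matrix` is a hypothetical helper, not part of this patch, and it assumes every box has positive area so the union is never zero.

```python
import numpy as np


def iou_matrix(boxes_1, boxes_2):
    """Pairwise IoU of [N, 4] and [M, 4] boxes in [xmin, ymin, xmax, ymax]."""
    b1 = np.asarray(boxes_1, dtype=np.float32)[:, None, :]  # [N, 1, 4]
    b2 = np.asarray(boxes_2, dtype=np.float32)[None, :, :]  # [1, M, 4]

    # Intersection extents, clamped at zero for non-overlapping pairs.
    inter_w = np.maximum(0.0, np.minimum(b1[..., 2], b2[..., 2])
                         - np.maximum(b1[..., 0], b2[..., 0]))
    inter_h = np.maximum(0.0, np.minimum(b1[..., 3], b2[..., 3])
                         - np.maximum(b1[..., 1], b2[..., 1]))
    inter = inter_w * inter_h                                # [N, M]

    area_1 = (b1[..., 2] - b1[..., 0]) * (b1[..., 3] - b1[..., 1])  # [N, 1]
    area_2 = (b2[..., 2] - b2[..., 0]) * (b2[..., 3] - b2[..., 1])  # [1, M]
    return inter / (area_1 + area_2 - inter)
```

For one 2 x 2 box at the origin compared against a copy shifted by (1, 1), the intersection is 1 and the union is 4 + 4 - 1 = 7, so the IoU is 1/7.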
diff --git a/utils/external/faster_rcnn_tensorflow/utility/coco_dict.py b/utils/external/faster_rcnn_tensorflow/utility/coco_dict.py
new file mode 100644
index 0000000..d4c190f
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/utility/coco_dict.py
@@ -0,0 +1,54 @@
+# -*- coding: utf-8 -*-
+
+from __future__ import absolute_import, print_function, division
+
+class_names = [
+    'back_ground', 'person', 'bicycle', 'car', 'motorcycle',
+    'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light',
+    'fire hydrant', 'stop sign', 'parking meter', 'bench',
+    'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant',
+    'bear', 'zebra', 'giraffe', 'backpack', 'umbrella',
+    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard',
+    'sports ball', 'kite', 'baseball bat', 'baseball glove',
+    'skateboard', 'surfboard', 'tennis racket', 'bottle',
+    'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
+    'banana', 'apple', 'sandwich', 'orange', 'broccoli',
+    'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair',
+    'couch', 'potted plant', 'bed', 'dining table', 'toilet',
+    'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
+    'microwave', 'oven', 'toaster', 'sink', 'refrigerator',
+    'book', 'clock', 'vase', 'scissors', 'teddy bear',
+    'hair drier', 'toothbrush']
+
+
+classes_originID = {
+    'person': 1, 'bicycle': 2, 'car': 3, 'motorcycle': 4,
+    'airplane': 5, 'bus': 6, 'train': 7, 'truck': 8, 'boat': 9,
+    'traffic light': 10, 'fire hydrant': 11, 'stop sign': 13,
+    'parking meter': 14, 'bench': 15, 'bird': 16, 'cat': 17,
+    'dog': 18, 'horse': 19, 'sheep': 20, 'cow': 21, 'elephant': 22,
+    'bear': 23, 'zebra': 24, 'giraffe': 25, 'backpack': 27,
+    'umbrella': 28, 'handbag': 31, 'tie': 32, 'suitcase': 33,
+    'frisbee': 34, 'skis': 35, 'snowboard': 36, 'sports ball': 37,
+    'kite': 38, 'baseball bat': 39, 'baseball glove': 40,
+    'skateboard': 41, 'surfboard': 42, 'tennis racket': 43,
+    'bottle': 44, 'wine glass': 46, 'cup': 47, 'fork': 48,
+    'knife': 49, 'spoon': 50, 'bowl': 51, 'banana': 52, 'apple': 53,
+    'sandwich': 54, 'orange': 55, 'broccoli': 56, 'carrot': 57,
+    'hot dog': 58, 'pizza': 59, 'donut': 60, 'cake': 61,
+    'chair': 62, 'couch': 63, 'potted plant': 64, 'bed': 65,
+    'dining table': 67, 'toilet': 70, 'tv': 72, 'laptop': 73,
+    'mouse': 74, 'remote': 75, 'keyboard': 76, 'cell phone': 77,
+    'microwave': 78, 'oven': 79, 'toaster': 80, 'sink': 81,
+    'refrigerator': 82, 'book': 84, 'clock': 85, 'vase': 86,
+    'scissors': 87, 'teddy bear': 88, 'hair drier': 89,
+    'toothbrush': 90}
+
+originID_classes = {item: key for key, item in classes_originID.items()}
+NAME_LABEL_MAP = dict(zip(class_names, range(len(class_names))))
+LABEL_NAME_MAP = dict(zip(range(len(class_names)), class_names))
+
+# print (originID_classes)
+
+
+
diff --git a/utils/external/faster_rcnn_tensorflow/utility/draw_box_in_img.py b/utils/external/faster_rcnn_tensorflow/utility/draw_box_in_img.py
new file mode 100644
index 0000000..ad64acb
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/utility/draw_box_in_img.py
@@ -0,0 +1,166 @@
+# -*- coding: utf-8 -*-
+
+from __future__ import absolute_import, print_function, division
+
+import numpy as np
+
+from PIL import Image, ImageDraw, ImageFont
+import cv2
+
+from utils.external.faster_rcnn_tensorflow.configs import cfgs
+
+import utils.external.faster_rcnn_tensorflow.utility.coco_dict as coco_dict
+import utils.external.faster_rcnn_tensorflow.utility.label_dict as label_dict
+if cfgs.DATASET_NAME == 'coco':
+    LABEL_NAME_MAP = coco_dict.LABEL_NAME_MAP
+else:
+    LABEL_NAME_MAP = label_dict.LABEL_NAME_MAP
+
+NOT_DRAW_BOXES = 0
+ONLY_DRAW_BOXES = -1
+ONLY_DRAW_BOXES_WITH_SCORES = -2
+
+STANDARD_COLORS = [
+    'AliceBlue', 'Chartreuse', 'Aqua', 'Aquamarine', 'Azure', 'Beige', 'Bisque',
+    'BlanchedAlmond', 'BlueViolet', 'BurlyWood', 'CadetBlue', 'AntiqueWhite',
+    'Chocolate', 'Coral', 'CornflowerBlue', 'Cornsilk', 'Crimson', 'Cyan',
+    'DarkCyan', 'DarkGoldenRod', 'DarkGrey', 'DarkKhaki', 'DarkOrange',
+    'DarkOrchid', 'DarkSalmon', 'DarkSeaGreen', 'DarkTurquoise', 'DarkViolet',
+    'DeepPink', 'DeepSkyBlue', 'DodgerBlue', 'FireBrick', 'FloralWhite',
+    'ForestGreen', 'Fuchsia', 'Gainsboro', 'GhostWhite', 'Gold', 'GoldenRod',
+    'Salmon', 'Tan', 'HoneyDew', 'HotPink', 'IndianRed', 'Ivory', 'Khaki',
+    'Lavender', 'LavenderBlush', 'LawnGreen', 'LemonChiffon', 'LightBlue',
+    'LightCoral', 'LightCyan', 'LightGoldenRodYellow', 'LightGray', 'LightGrey',
+    'LightGreen', 'LightPink', 'LightSalmon', 'LightSeaGreen', 'LightSkyBlue',
+    'LightSlateGray', 'LightSlateGrey', 'LightSteelBlue', 'LightYellow', 'Lime',
+    'LimeGreen', 'Linen', 'Magenta', 'MediumAquaMarine', 'MediumOrchid',
+    'MediumPurple', 'MediumSeaGreen', 'MediumSlateBlue', 'MediumSpringGreen',
+    'MediumTurquoise', 'MediumVioletRed', 'MintCream', 'MistyRose', 'Moccasin',
+    'NavajoWhite', 'OldLace', 'Olive', 'OliveDrab', 'Orange', 'OrangeRed',
+    'Orchid', 'PaleGoldenRod', 'PaleGreen', 'PaleTurquoise', 'PaleVioletRed',
+    'PapayaWhip', 'PeachPuff', 'Peru', 'Pink', 'Plum', 'PowderBlue', 'Purple',
+    'Red', 'RosyBrown', 'RoyalBlue', 'SaddleBrown', 'Green', 'SandyBrown',
+    'SeaGreen', 'SeaShell', 'Sienna', 'Silver', 'SkyBlue', 'SlateBlue',
+    'SlateGray', 'SlateGrey', 'Snow', 'SpringGreen', 'SteelBlue', 'GreenYellow',
+    'Teal', 'Thistle', 'Tomato', 'Turquoise', 'Violet', 'Wheat', 'White',
+    'WhiteSmoke', 'Yellow', 'YellowGreen', 'LightBlue', 'LightGreen'
+]
+FONT = ImageFont.load_default()
+
+
+def draw_a_rectangel_in_img(draw_obj, box, color, width):
+    '''
+    Draw the rectangle as four lines, since the rectangle-drawing function could not set the outline width.
+    :param draw_obj: an ImageDraw.Draw object
+    :param box: [x1, y1, x2, y2]
+    :return: None; draws onto draw_obj in place
+    '''
+    x1, y1, x2, y2 = box[0], box[1], box[2], box[3]
+    top_left, top_right = (x1, y1), (x2, y1)
+    bottom_left, bottom_right = (x1, y2), (x2, y2)
+
+    draw_obj.line(xy=[top_left, top_right],
+                  fill=color,
+                  width=width)
+    draw_obj.line(xy=[top_left, bottom_left],
+                  fill=color,
+                  width=width)
+    draw_obj.line(xy=[bottom_left, bottom_right],
+                  fill=color,
+                  width=width)
+    draw_obj.line(xy=[top_right, bottom_right],
+                  fill=color,
+                  width=width)
+
+
+def only_draw_scores(draw_obj, box, score, color):
+
+    x, y = box[0], box[1]
+    draw_obj.rectangle(xy=[x, y-10, x+60, y],
+                       fill=color)
+    draw_obj.text(xy=(x, y-10),
+                  text="obj:" + str(round(score, 2)),
+                  fill='black',
+                  font=FONT)
+
+
+def draw_label_with_scores(draw_obj, box, label, score, color):
+    x, y = box[0], box[1]
+    draw_obj.rectangle(xy=[x, y-10, x + 60, y],
+                       fill=color)
+
+    txt = LABEL_NAME_MAP[label] + ':' + str(round(score, 2))
+    draw_obj.text(xy=(x, y-10),
+                  text=txt,
+                  fill='black',
+                  font=FONT)
+
+
+def draw_boxes_with_label_and_scores(img_array, boxes, labels, scores):
+
+    img_array = img_array + np.array(cfgs.PIXEL_MEAN)
+    img_array = img_array.astype(np.float32)
+    boxes = boxes.astype(np.int64)
+    labels = labels.astype(np.int32)
+    img_array = np.array(img_array * 255 / np.max(img_array), dtype=np.uint8)
+
+    img_obj = Image.fromarray(img_array)
+    raw_img_obj = img_obj.copy()
+
+    draw_obj = ImageDraw.Draw(img_obj)
+    num_of_objs = 0
+    for box, a_label, a_score in zip(boxes, labels, scores):
+
+        if a_label != NOT_DRAW_BOXES:
+            num_of_objs += 1
+            draw_a_rectangel_in_img(draw_obj, box, color=STANDARD_COLORS[a_label], width=3)
+            if a_label == ONLY_DRAW_BOXES:  # -1
+                continue
+            elif a_label == ONLY_DRAW_BOXES_WITH_SCORES:  # -2
+                only_draw_scores(draw_obj, box, a_score, color='White')
+                continue
+            else:
+                draw_label_with_scores(draw_obj, box, a_label, a_score, color='White')
+
+    out_img_obj = Image.blend(raw_img_obj, img_obj, alpha=0.6)
+
+    return np.array(out_img_obj)
+
+
+if __name__ == '__main__':
+    img_array = cv2.imread("/home/yjr/PycharmProjects/FPN_TF/tools/inference_image/2.jpg")
+    img_array = np.array(img_array, np.float32) - np.array(cfgs.PIXEL_MEAN)
+    boxes = np.array(
+        [[200, 200, 500, 500],
+         [300, 300, 400, 400],
+         [200, 200, 400, 400]]
+    )
+
+    # test only draw boxes
+    labels = np.ones(shape=[len(boxes), ], dtype=np.float32) * ONLY_DRAW_BOXES
+    scores = np.zeros_like(labels)
+    imm = draw_boxes_with_label_and_scores(img_array, boxes, labels, scores)
+    # imm = np.array(imm)
+
+    cv2.imshow("te", imm)
+
+    # test only draw scores
+    labels = np.ones(shape=[len(boxes), ], dtype=np.float32) * ONLY_DRAW_BOXES_WITH_SCORES
+    scores = np.random.rand(len(boxes)) * 10
+    imm2 = draw_boxes_with_label_and_scores(img_array, boxes, labels, scores)
+
+    cv2.imshow("te2", imm2)
+    # test draw label and scores
+
+    labels = np.arange(1, 4)
+    imm3 = draw_boxes_with_label_and_scores(img_array, boxes, labels, scores)
+    cv2.imshow("te3", imm3)
+
+    cv2.waitKey(0)
+
+
+
+
+
+
+
diff --git a/utils/external/faster_rcnn_tensorflow/utility/encode_and_decode.py b/utils/external/faster_rcnn_tensorflow/utility/encode_and_decode.py
new file mode 100644
index 0000000..6a93618
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/utility/encode_and_decode.py
@@ -0,0 +1,93 @@
+# -*- coding: utf-8 -*-
+
+from __future__ import absolute_import
+from __future__ import print_function
+from __future__ import division
+
+import tensorflow as tf
+import numpy as np
+
+
+def decode_boxes(encoded_boxes, reference_boxes, scale_factors=None):
+    '''
+
+    :param encoded_boxes: [N, 4]
+    :param reference_boxes: [N, 4]
+    :param scale_factors: used to scale the encoded offsets.
+
+    In the first stage, reference_boxes are anchors.
+    In the second stage, reference_boxes are the decoded proposals produced by the first stage.
+    :return: decoded boxes, [N, 4]
+    '''
+
+    t_xcenter, t_ycenter, t_w, t_h = tf.unstack(encoded_boxes, axis=1)
+    if scale_factors:
+        t_xcenter /= scale_factors[0]
+        t_ycenter /= scale_factors[1]
+        t_w /= scale_factors[2]
+        t_h /= scale_factors[3]
+
+    reference_xmin, reference_ymin, reference_xmax, reference_ymax = tf.unstack(reference_boxes, axis=1)
+    # reference boxes are anchors in the first stage
+
+    # reference_xcenter = (reference_xmin + reference_xmax) / 2.
+    # reference_ycenter = (reference_ymin + reference_ymax) / 2.
+    reference_w = reference_xmax - reference_xmin
+    reference_h = reference_ymax - reference_ymin
+    reference_xcenter = reference_xmin + reference_w/2.0
+    reference_ycenter = reference_ymin + reference_h/2.0
+
+    predict_xcenter = t_xcenter * reference_w + reference_xcenter
+    predict_ycenter = t_ycenter * reference_h + reference_ycenter
+    predict_w = tf.exp(t_w) * reference_w
+    predict_h = tf.exp(t_h) * reference_h
+
+    predict_xmin = predict_xcenter - predict_w / 2.
+    predict_xmax = predict_xcenter + predict_w / 2.
+    predict_ymin = predict_ycenter - predict_h / 2.
+    predict_ymax = predict_ycenter + predict_h / 2.
+
+    return tf.transpose(tf.stack([predict_xmin, predict_ymin,
+                                  predict_xmax, predict_ymax]))
+
+
+def encode_boxes(unencode_boxes, reference_boxes, scale_factors=None):
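+    '''
+    :param unencode_boxes: [-1, 4], boxes as [xmin, ymin, xmax, ymax]
+    :param reference_boxes: [-1, 4], same format
+    :param scale_factors: optional per-coordinate multipliers for the encoded offsets
+    :return: encoded boxes, [-1, 4]
+    '''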
+
+    xmin, ymin, xmax, ymax = unencode_boxes[:, 0], unencode_boxes[:, 1], unencode_boxes[:, 2], unencode_boxes[:, 3]
+
+    reference_xmin, reference_ymin, reference_xmax, reference_ymax = reference_boxes[:, 0], reference_boxes[:, 1], \
+                                                                     reference_boxes[:, 2], reference_boxes[:, 3]
+
+    # x_center = (xmin + xmax) / 2.
+    # y_center = (ymin + ymax) / 2.
+    w = xmax - xmin + 1e-8
+    h = ymax - ymin + 1e-8
+    x_center = xmin + w/2.0
+    y_center = ymin + h/2.0
+
+    # reference_xcenter = (reference_xmin + reference_xmax) / 2.
+    # reference_ycenter = (reference_ymin + reference_ymax) / 2.
+    reference_w = reference_xmax - reference_xmin + 1e-8
+    reference_h = reference_ymax - reference_ymin + 1e-8
+    reference_xcenter = reference_xmin + reference_w/2.0
+    reference_ycenter = reference_ymin + reference_h/2.0
+    # w + 1e-8 to avoid NaN in division and log below
+
+    t_xcenter = (x_center - reference_xcenter) / reference_w
+    t_ycenter = (y_center - reference_ycenter) / reference_h
+    t_w = np.log(w/reference_w)
+    t_h = np.log(h/reference_h)
+
+    if scale_factors:
+        t_xcenter *= scale_factors[0]
+        t_ycenter *= scale_factors[1]
+        t_w *= scale_factors[2]
+        t_h *= scale_factors[3]
+
+    return np.transpose(np.stack([t_xcenter, t_ycenter, t_w, t_h], axis=0))
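`encode_boxes` and `decode_boxes` are meant to be inverses of each other: encoding multiplies the offsets by the scale factors and decoding divides by them. The NumPy sketch below illustrates the round trip under stated assumptions: the helper names `to_center`, `encode` and `decode` are hypothetical, the scale factors `[10, 10, 5, 5]` are example values only (the real ones come from `cfgs`, which is not shown here), and the `1e-8` epsilon is omitted because all boxes in the example have positive width and height.

```python
import numpy as np


def to_center(boxes):
    """[xmin, ymin, xmax, ymax] -> (x_center, y_center, w, h)."""
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    return boxes[:, 0] + w / 2.0, boxes[:, 1] + h / 2.0, w, h


def encode(gt_boxes, reference_boxes, scale_factors):
    x, y, w, h = to_center(gt_boxes)
    rx, ry, rw, rh = to_center(reference_boxes)
    t = np.stack([(x - rx) / rw, (y - ry) / rh,
                  np.log(w / rw), np.log(h / rh)], axis=1)
    return t * np.asarray(scale_factors, dtype=np.float64)


def decode(encoded, reference_boxes, scale_factors):
    t = encoded / np.asarray(scale_factors, dtype=np.float64)
    rx, ry, rw, rh = to_center(reference_boxes)
    x = t[:, 0] * rw + rx
    y = t[:, 1] * rh + ry
    w = np.exp(t[:, 2]) * rw
    h = np.exp(t[:, 3]) * rh
    return np.stack([x - w / 2.0, y - h / 2.0,
                     x + w / 2.0, y + h / 2.0], axis=1)


# Example values only; the real factors live in cfgs.
scale = [10.0, 10.0, 5.0, 5.0]
reference = np.array([[0.0, 0.0, 10.0, 10.0]])
ground_truth = np.array([[2.0, 3.0, 8.0, 9.0]])
targets = encode(ground_truth, reference, scale)
restored = decode(targets, reference, scale)
```

Here the ground-truth center is (5, 6) against a reference center of (5, 5) with width 10, so the first two targets are 0 and 0.1 before scaling, and `restored` matches `ground_truth` up to floating-point error.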
diff --git a/utils/external/faster_rcnn_tensorflow/utility/label_dict.py b/utils/external/faster_rcnn_tensorflow/utility/label_dict.py
new file mode 100644
index 0000000..3e2f38d
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/utility/label_dict.py
@@ -0,0 +1,74 @@
+# -*- coding: utf-8 -*-
+from __future__ import division, print_function, absolute_import
+
+from utils.external.faster_rcnn_tensorflow.configs import cfgs
+
+if cfgs.DATASET_NAME == 'ship':
+    NAME_LABEL_MAP = {
+        'back_ground': 0,
+        'ship': 1
+    }
+elif cfgs.DATASET_NAME == 'FDDB':
+    NAME_LABEL_MAP = {
+        'back_ground': 0,
+        'face': 1
+    }
+elif cfgs.DATASET_NAME == 'icdar':
+    NAME_LABEL_MAP = {
+        'back_ground': 0,
+        'text': 1
+    }
+elif cfgs.DATASET_NAME.startswith('DOTA'):
+    NAME_LABEL_MAP = {
+        'back_ground': 0,
+        'roundabout': 1,
+        'tennis-court': 2,
+        'swimming-pool': 3,
+        'storage-tank': 4,
+        'soccer-ball-field': 5,
+        'small-vehicle': 6,
+        'ship': 7,
+        'plane': 8,
+        'large-vehicle': 9,
+        'helicopter': 10,
+        'harbor': 11,
+        'ground-track-field': 12,
+        'bridge': 13,
+        'basketball-court': 14,
+        'baseball-diamond': 15
+    }
+elif cfgs.DATASET_NAME == 'pascal':
+    NAME_LABEL_MAP = {
+        'back_ground': 0,
+        'aeroplane': 1,
+        'bicycle': 2,
+        'bird': 3,
+        'boat': 4,
+        'bottle': 5,
+        'bus': 6,
+        'car': 7,
+        'cat': 8,
+        'chair': 9,
+        'cow': 10,
+        'diningtable': 11,
+        'dog': 12,
+        'horse': 13,
+        'motorbike': 14,
+        'person': 15,
+        'pottedplant': 16,
+        'sheep': 17,
+        'sofa': 18,
+        'train': 19,
+        'tvmonitor': 20
+    }
+else:
+    raise ValueError('please set label dict!')
+
+
+def get_label_name_map():
+    reverse_dict = {}
+    for name, label in NAME_LABEL_MAP.items():
+        reverse_dict[label] = name
+    return reverse_dict
+
+LABEL_NAME_MAP = get_label_name_map()
\ No newline at end of file
diff --git a/utils/external/faster_rcnn_tensorflow/utility/loss_utils.py b/utils/external/faster_rcnn_tensorflow/utility/loss_utils.py
new file mode 100644
index 0000000..693c150
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/utility/loss_utils.py
@@ -0,0 +1,140 @@
+# -*- coding: utf-8 -*-
+"""
+@author: jemmy li
+@contact: zengarden2009@gmail.com
+"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import tensorflow as tf
+
+
+def _smooth_l1_loss_base(bbox_pred, bbox_targets, sigma=1.0):
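+    '''
+
+    :param bbox_pred: [-1, 4] in RPN. [-1, cls_num+1, 4] in Fast-rcnn
+    :param bbox_targets: shape is same as bbox_pred
+    :param sigma: the loss switches from quadratic to linear at |diff| = 1 / sigma**2
+    :return: element-wise loss, same shape as bbox_pred
+    '''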
+    sigma_2 = sigma**2
+
+    box_diff = bbox_pred - bbox_targets
+
+    abs_box_diff = tf.abs(box_diff)
+
+    smoothL1_sign = tf.stop_gradient(
+        tf.to_float(tf.less(abs_box_diff, 1. / sigma_2)))
+    loss_box = tf.pow(box_diff, 2) * (sigma_2 / 2.0) * smoothL1_sign \
+               + (abs_box_diff - (0.5 / sigma_2)) * (1.0 - smoothL1_sign)
+    return loss_box
+
+def smooth_l1_loss_rpn(bbox_pred, bbox_targets, label, sigma=1.0):
+    '''
+
+    :param bbox_pred: [-1, 4]
+    :param bbox_targets: [-1, 4]
+    :param label: [-1]
+    :param sigma:
+    :return:
+    '''
+    value = _smooth_l1_loss_base(bbox_pred, bbox_targets, sigma=sigma)
+    value = tf.reduce_sum(value, axis=1)  # to sum in axis 1
+    rpn_select = tf.where(tf.greater(label, 0))
+
+    # rpn_select = tf.stop_gradient(rpn_select) # to avoid
+    selected_value = tf.gather(value, rpn_select)
+    non_ignored_mask = tf.stop_gradient(
+        1.0 - tf.to_float(tf.equal(label, -1)))  # 1.0 for non-ignored labels, 0.0 for ignored (-1)
+
+    bbox_loss = tf.reduce_sum(selected_value) / tf.maximum(1.0, tf.reduce_sum(non_ignored_mask))
+
+    return bbox_loss
+
+
+
+def smooth_l1_loss_rcnn(bbox_pred, bbox_targets, label, num_classes, sigma=1.0):
+    '''
+
+    :param bbox_pred: [-1, (cfgs.CLS_NUM +1) * 4]
+    :param bbox_targets:[-1, (cfgs.CLS_NUM +1) * 4]
+    :param label:[-1]
+    :param num_classes:
+    :param sigma:
+    :return:
+    '''
+
+    outside_mask = tf.stop_gradient(tf.to_float(tf.greater(label, 0)))
+
+    bbox_pred = tf.reshape(bbox_pred, [-1, num_classes, 4])
+    bbox_targets = tf.reshape(bbox_targets, [-1, num_classes, 4])
+
+    value = _smooth_l1_loss_base(bbox_pred,
+                                 bbox_targets,
+                                 sigma=sigma)
+    value = tf.reduce_sum(value, 2)
+    value = tf.reshape(value, [-1, num_classes])
+
+    inside_mask = tf.one_hot(tf.reshape(label, [-1, 1]),
+                             depth=num_classes, axis=1)
+
+    inside_mask = tf.stop_gradient(
+        tf.to_float(tf.reshape(inside_mask, [-1, num_classes])))
+
+    normalizer = tf.to_float(tf.shape(bbox_pred)[0])
+    bbox_loss = tf.reduce_sum(
+        tf.reduce_sum(value * inside_mask, 1)*outside_mask) / normalizer
+
+    return bbox_loss
+
+
+def sum_ohem_loss(cls_score, label, bbox_pred, bbox_targets,
+                  num_classes, num_ohem_samples=256, sigma=1.0):
+    '''
+
+    :param cls_score: [-1, cls_num+1]
+    :param label: [-1]
+    :param bbox_pred: [-1, 4*(cls_num+1)]
+    :param bbox_targets: [-1, 4*(cls_num+1)]
+    :param num_ohem_samples: 256 by default
+    :param num_classes: cls_num+1
+    :param sigma:
+    :return:
+    '''
+
+    cls_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=cls_score, labels=label)  # [-1, ]
+    # cls_loss = tf.Print(cls_loss, [tf.shape(cls_loss)], summarize=10, message='CLS losss shape ****')
+
+    outside_mask = tf.stop_gradient(tf.to_float(tf.greater(label, 0)))
+    bbox_pred = tf.reshape(bbox_pred, [-1, num_classes, 4])
+    bbox_targets = tf.reshape(bbox_targets, [-1, num_classes, 4])
+
+    value = _smooth_l1_loss_base(bbox_pred,
+                                 bbox_targets,
+                                 sigma=sigma)
+    value = tf.reduce_sum(value, 2)
+    value = tf.reshape(value, [-1, num_classes])
+
+    inside_mask = tf.one_hot(tf.reshape(label, [-1, 1]),
+                             depth=num_classes, axis=1)
+
+    inside_mask = tf.stop_gradient(
+        tf.to_float(tf.reshape(inside_mask, [-1, num_classes])))
+    loc_loss = tf.reduce_sum(value * inside_mask, 1)*outside_mask
+    # loc_loss = tf.Print(loc_loss, [tf.shape(loc_loss)], summarize=10, message='loc_loss shape***')
+
+    sum_loss = cls_loss + loc_loss
+
+    num_ohem_samples = tf.stop_gradient(tf.minimum(num_ohem_samples, tf.shape(sum_loss)[0]))
+    _, top_k_indices = tf.nn.top_k(sum_loss, k=num_ohem_samples)
+
+    cls_loss_ohem = tf.gather(cls_loss, top_k_indices)
+    cls_loss_ohem = tf.reduce_mean(cls_loss_ohem)
+
+    loc_loss_ohem = tf.gather(loc_loss, top_k_indices)
+    normalizer = tf.to_float(num_ohem_samples)
+    loc_loss_ohem = tf.reduce_sum(loc_loss_ohem) / normalizer
+
+    return cls_loss_ohem, loc_loss_ohem
+
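The piecewise form inside `_smooth_l1_loss_base` is easier to check with concrete numbers. The following NumPy sketch (the name `smooth_l1` is hypothetical, not part of this patch) evaluates the same formula: quadratic, `0.5 * sigma**2 * d**2`, when `|d| < 1 / sigma**2`, and linear, `|d| - 0.5 / sigma**2`, otherwise.

```python
import numpy as np


def smooth_l1(pred, target, sigma=1.0):
    """Element-wise smooth L1 loss with the same switch point as above."""
    sigma_2 = sigma ** 2
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    abs_diff = np.abs(diff)
    quadratic = abs_diff < 1.0 / sigma_2
    return np.where(quadratic,
                    0.5 * sigma_2 * diff ** 2,   # small errors: quadratic
                    abs_diff - 0.5 / sigma_2)    # large errors: linear
```

With `sigma=1.0`, a difference of 0.5 falls in the quadratic branch (0.5 * 0.25 = 0.125) and a difference of 2.0 falls in the linear branch (2.0 - 0.5 = 1.5). A larger `sigma` narrows the quadratic region.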
diff --git a/utils/external/faster_rcnn_tensorflow/utility/proposal_opr.py b/utils/external/faster_rcnn_tensorflow/utility/proposal_opr.py
new file mode 100644
index 0000000..b05dac1
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/utility/proposal_opr.py
@@ -0,0 +1,66 @@
+# encoding: utf-8
+"""
+@author: zeming li
+@contact: zengarden2009@gmail.com
+"""
+
+from utils.external.faster_rcnn_tensorflow.configs import cfgs
+from utils.external.faster_rcnn_tensorflow.utility import encode_and_decode
+from utils.external.faster_rcnn_tensorflow.utility import boxes_utils
+import tensorflow as tf
+import numpy as np
+
+
+def postprocess_rpn_proposals(rpn_bbox_pred, rpn_cls_prob, img_shape, anchors, is_training):
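+    '''
+
+    :param rpn_bbox_pred: [-1, 4]
+    :param rpn_cls_prob: [-1, 2]
+    :param img_shape: image tensor shape, used to clip boxes to the image
+    :param anchors: [-1, 4]
+    :param is_training: selects the train or test top-K and proposal limits
+    :return: final_boxes [-1, 4] and final_probs [-1] kept after NMS
+    '''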
+
+    if is_training:
+        pre_nms_topN = cfgs.RPN_TOP_K_NMS_TRAIN
+        post_nms_topN = cfgs.RPN_MAXIMUM_PROPOSAL_TARIN
+        nms_thresh = cfgs.RPN_NMS_IOU_THRESHOLD
+    else:
+        pre_nms_topN = cfgs.RPN_TOP_K_NMS_TEST
+        post_nms_topN = cfgs.RPN_MAXIMUM_PROPOSAL_TEST
+        nms_thresh = cfgs.RPN_NMS_IOU_THRESHOLD
+
+    cls_prob = rpn_cls_prob[:, 1]
+
+    # 1. decode boxes
+    decode_boxes = encode_and_decode.decode_boxes(encoded_boxes=rpn_bbox_pred,
+                                                  reference_boxes=anchors,
+                                                  scale_factors=cfgs.ANCHOR_SCALE_FACTORS)
+
+    # decode_boxes = encode_and_decode.decode_boxes(boxes=anchors,
+    #                                               deltas=rpn_bbox_pred,
+    #                                               scale_factor=None)
+
+    # 2. clip to img boundaries
+    decode_boxes = boxes_utils.clip_boxes_to_img_boundaries(decode_boxes=decode_boxes,
+                                                            img_shape=img_shape)
+
+    # 3. keep the top N boxes by score before NMS
+    if pre_nms_topN > 0:
+        pre_nms_topN = tf.minimum(pre_nms_topN, tf.shape(decode_boxes)[0], name='avoid_unenough_boxes')
+        cls_prob, top_k_indices = tf.nn.top_k(cls_prob, k=pre_nms_topN)
+        decode_boxes = tf.gather(decode_boxes, top_k_indices)
+
+    # 4. NMS
+    keep = tf.image.non_max_suppression(
+        boxes=decode_boxes,
+        scores=cls_prob,
+        max_output_size=post_nms_topN,
+        iou_threshold=nms_thresh)
+
+    final_boxes = tf.gather(decode_boxes, keep)
+    final_probs = tf.gather(cls_prob, keep)
+
+    return final_boxes, final_probs
+
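Steps 3 and 4 of `postprocess_rpn_proposals` (top-K by score, then NMS) can be sketched in plain NumPy as follows. This is an illustrative greedy NMS, not the implementation behind `tf.image.non_max_suppression`; the name `top_k_then_nms` and its parameters are hypothetical, and boxes are assumed to have positive area.

```python
import numpy as np


def top_k_then_nms(boxes, scores, pre_nms_top_n, post_nms_top_n, iou_threshold):
    """Keep the best-scoring boxes, then greedily drop heavy overlaps."""
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)

    # Step 3: sort by score (descending) and keep at most pre_nms_top_n.
    order = np.argsort(-scores)[:pre_nms_top_n]
    boxes, scores = boxes[order], scores[order]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

    # Step 4: greedy NMS over the score-sorted candidates.
    keep = []
    candidates = np.arange(len(boxes))
    while candidates.size > 0 and len(keep) < post_nms_top_n:
        best, rest = candidates[0], candidates[1:]
        keep.append(best)
        inter_w = np.maximum(0.0, np.minimum(boxes[best, 2], boxes[rest, 2])
                             - np.maximum(boxes[best, 0], boxes[rest, 0]))
        inter_h = np.maximum(0.0, np.minimum(boxes[best, 3], boxes[rest, 3])
                             - np.maximum(boxes[best, 1], boxes[rest, 1]))
        inter = inter_w * inter_h
        iou = inter / (areas[best] + areas[rest] - inter)
        # Boxes overlapping the current best by more than the threshold are dropped.
        candidates = rest[iou <= iou_threshold]

    keep = np.asarray(keep, dtype=np.int64)
    return boxes[keep], scores[keep]
```

For example, two 10 x 10 boxes offset by (1, 1) have an IoU of 81 / 119, roughly 0.68, so with a threshold of 0.5 only the higher-scoring one survives, while a distant third box is kept.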
diff --git a/utils/external/faster_rcnn_tensorflow/utility/proposal_target_layer.py b/utils/external/faster_rcnn_tensorflow/utility/proposal_target_layer.py
new file mode 100644
index 0000000..ace1c6b
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/utility/proposal_target_layer.py
@@ -0,0 +1,177 @@
+# --------------------------------------------------------
+# Faster R-CNN
+# Copyright (c) 2015 Microsoft
+# Licensed under The MIT License [see LICENSE for details]
+# Written by Ross Girshick
+# --------------------------------------------------------
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+from utils.external.faster_rcnn_tensorflow.configs import cfgs
+import numpy as np
+import numpy.random as npr
+
+from utils.external.faster_rcnn_tensorflow.utility import encode_and_decode
+
+def bbox_overlaps(boxes, query_boxes):
+  """
+  Parameters
+  ----------
+  boxes: (N, 4) ndarray of float
+  query_boxes: (K, 4) ndarray of float
+  Returns
+  -------
+  overlaps: (N, K) ndarray of overlap between boxes and query_boxes
+  """
+  N = boxes.shape[0]
+  K = query_boxes.shape[0]
+  overlaps = np.zeros((N, K))
+  for k in range(K):
+    box_area = (
+        (query_boxes[k, 2] - query_boxes[k, 0] + 1) *
+        (query_boxes[k, 3] - query_boxes[k, 1] + 1)
+    )
+    for n in range(N):
+      iw = (
+          min(boxes[n, 2], query_boxes[k, 2]) -
+          max(boxes[n, 0], query_boxes[k, 0]) + 1
+      )
+      if iw > 0:
+        ih = (
+            min(boxes[n, 3], query_boxes[k, 3]) -
+            max(boxes[n, 1], query_boxes[k, 1]) + 1
+        )
+        if ih > 0:
+          ua = float(
+            (boxes[n, 2] - boxes[n, 0] + 1) *
+            (boxes[n, 3] - boxes[n, 1] + 1) +
+            box_area - iw * ih
+          )
+          overlaps[n, k] = iw * ih / ua
+  return overlaps
+
+def proposal_target_layer(rpn_rois, gt_boxes):
+    """
+    Assign object detection proposals to ground-truth targets. Produces proposal
+    classification labels and bounding-box regression targets.
+    """
+    # Proposal ROIs (x1, y1, x2, y2) coming from RPN
+    # gt_boxes (x1, y1, x2, y2, label)
+    if cfgs.ADD_GTBOXES_TO_TRAIN:
+        all_rois = np.vstack((rpn_rois, gt_boxes[:, :-1]))
+    else:
+        all_rois = rpn_rois
+    # np.inf
+    rois_per_image = np.inf if cfgs.FAST_RCNN_MINIBATCH_SIZE == -1 else cfgs.FAST_RCNN_MINIBATCH_SIZE
+
+    fg_rois_per_image = np.round(cfgs.FAST_RCNN_POSITIVE_RATE * rois_per_image)
+
+    # Sample rois with classification labels and bounding box regression
+    labels, rois, bbox_targets = _sample_rois(all_rois, gt_boxes, fg_rois_per_image,
+                                              rois_per_image, cfgs.CLASS_NUM+1)
+    # print(labels.shape, rois.shape, bbox_targets.shape)
+    rois = rois.reshape(-1, 4)
+    labels = labels.reshape(-1)
+    bbox_targets = bbox_targets.reshape(-1, (cfgs.CLASS_NUM+1) * 4)
+
+    return rois, labels, bbox_targets
+
+
+def _get_bbox_regression_labels(bbox_target_data, num_classes):
+    """Bounding-box regression targets (bbox_target_data) are stored in a
+    compact form N x (class, tx, ty, tw, th)
+
+    This function expands those targets into the 4-of-4*K representation used
+    by the network (i.e. only one class has non-zero targets).
+
+    Returns:
+        bbox_target (ndarray): N x 4K blob of regression targets
+    """
+
+    clss = bbox_target_data[:, 0]
+    bbox_targets = np.zeros((clss.size, 4 * num_classes), dtype=np.float32)
+    inds = np.where(clss > 0)[0]
+    for ind in inds:
+        cls = clss[ind]
+        start = int(4 * cls)
+        end = start + 4
+        bbox_targets[ind, start:end] = bbox_target_data[ind, 1:]
+
+    return bbox_targets
+
+
+def _compute_targets(ex_rois, gt_rois, labels):
+    """Compute bounding-box regression targets for an image.
+    Each returned row is [label, tx, ty, tw, th].
+    """
+
+    assert ex_rois.shape[0] == gt_rois.shape[0]
+    assert ex_rois.shape[1] == 4
+    assert gt_rois.shape[1] == 4
+
+    targets = encode_and_decode.encode_boxes(unencode_boxes=gt_rois,
+                                             reference_boxes=ex_rois,
+                                             scale_factors=cfgs.ROI_SCALE_FACTORS)
+    # targets = encode_and_decode.encode_boxes(ex_rois=ex_rois,
+    #                                          gt_rois=gt_rois,
+    #                                          scale_factor=cfgs.ROI_SCALE_FACTORS)
+
+    return np.hstack(
+        (labels[:, np.newaxis], targets)).astype(np.float32, copy=False)
+
+
+def _sample_rois(all_rois, gt_boxes, fg_rois_per_image,
+                 rois_per_image, num_classes):
+    """Generate a random sample of RoIs comprising foreground and background
+    examples.
+
+    all_rois has shape [-1, 4].
+    gt_boxes has shape [-1, 5], i.e. [x1, y1, x2, y2, label].
+    """
+    # overlaps: (rois x gt_boxes)
+    overlaps = bbox_overlaps(all_rois, gt_boxes)
+    gt_assignment = overlaps.argmax(axis=1)
+    max_overlaps = overlaps.max(axis=1)
+    labels = gt_boxes[gt_assignment, -1]
+
+    # Select foreground RoIs as those with >= FAST_RCNN_IOU_POSITIVE_THRESHOLD overlap
+    fg_inds = np.where(max_overlaps >= cfgs.FAST_RCNN_IOU_POSITIVE_THRESHOLD)[0]
+
+    # Select background RoIs as those with overlap in
+    # [FAST_RCNN_IOU_NEGATIVE_THRESHOLD, FAST_RCNN_IOU_POSITIVE_THRESHOLD)
+    bg_inds = np.where((max_overlaps < cfgs.FAST_RCNN_IOU_POSITIVE_THRESHOLD) &
+                       (max_overlaps >= cfgs.FAST_RCNN_IOU_NEGATIVE_THRESHOLD))[0]
+    # print("first filter, fg_size: {} || bg_size: {}".format(fg_inds.shape, bg_inds.shape))
+    # Guard against the case when an image has fewer than fg_rois_per_image
+    # foreground RoIs
+    fg_rois_per_this_image = min(fg_rois_per_image, fg_inds.size)
+
+    # Sample foreground regions without replacement
+    if fg_inds.size > 0:
+        fg_inds = npr.choice(fg_inds, size=int(fg_rois_per_this_image), replace=False)
+    # Compute number of background RoIs to take from this image (guarding
+    # against there being fewer than desired)
+    bg_rois_per_this_image = rois_per_image - fg_rois_per_this_image
+    bg_rois_per_this_image = min(bg_rois_per_this_image, bg_inds.size)
+    # Sample background regions without replacement
+    if bg_inds.size > 0:
+        bg_inds = npr.choice(bg_inds, size=int(bg_rois_per_this_image), replace=False)
+
+    # print("second filter, fg_size: {} || bg_size: {}".format(fg_inds.shape, bg_inds.shape))
+    # The indices that we're selecting (both fg and bg)
+    keep_inds = np.append(fg_inds, bg_inds)
+
+
+    # Select sampled values from various arrays:
+    labels = labels[keep_inds]
+
+    # Clamp labels for the background RoIs to 0
+    labels[int(fg_rois_per_this_image):] = 0
+    rois = all_rois[keep_inds]
+
+    bbox_target_data = _compute_targets(
+        rois, gt_boxes[gt_assignment[keep_inds], :-1], labels)
+    bbox_targets = \
+        _get_bbox_regression_labels(bbox_target_data, num_classes)
+
+    return labels, rois, bbox_targets
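Reviewer note on `_get_bbox_regression_labels` above (a sketch, not part of the patch): the expansion from the compact `N x (class, tx, ty, tw, th)` form into the `N x 4K` form can be illustrated without NumPy. The helper name `expand_bbox_targets` and the sample values are hypothetical and chosen only for illustration; the patch itself operates on NumPy arrays.

```python
def expand_bbox_targets(bbox_target_data, num_classes):
    """Expand rows of (class, tx, ty, tw, th) into rows of length 4 * num_classes.

    Only the four slots that belong to a row's class are filled. Background
    rows (class 0) stay all-zero, mirroring the `clss > 0` filter in the patch.
    """
    expanded = []
    for row in bbox_target_data:
        cls = int(row[0])
        out = [0.0] * (4 * num_classes)
        if cls > 0:
            start = 4 * cls
            # Replace exactly four slots, so the row length is unchanged.
            out[start:start + 4] = [float(v) for v in row[1:5]]
        expanded.append(out)
    return expanded


# Hypothetical input: one background RoI and one RoI of class 2, with K = 3.
compact = [
    [0, 0.1, 0.2, 0.3, 0.4],    # background: its targets are dropped
    [2, 0.5, -0.5, 0.25, 0.0],  # class 2: targets land in slots 8..11
]
targets = expand_bbox_targets(compact, num_classes=3)
print(targets[0])
print(targets[1])
```

Each output row has `4 * num_classes` entries, which is why `proposal_target_layer` reshapes `bbox_targets` to `(-1, (cfgs.CLASS_NUM+1) * 4)`.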
diff --git a/utils/external/faster_rcnn_tensorflow/utility/remote_sensing_dict.py b/utils/external/faster_rcnn_tensorflow/utility/remote_sensing_dict.py
new file mode 100644
index 0000000..8e46095
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/utility/remote_sensing_dict.py
@@ -0,0 +1,15 @@
+# -*- coding: utf-8 -*-
+
+NAME_LABEL_MAP = {
+    'back_ground': 0,
+    'building': 1
+}
+
+
+def get_label_name_map():
+    reverse_dict = {}
+    for name, label in NAME_LABEL_MAP.items():
+        reverse_dict[label] = name
+    return reverse_dict
+
+LABEL_NAME_MAP = get_label_name_map()
\ No newline at end of file
diff --git a/utils/external/faster_rcnn_tensorflow/utility/show_box_in_tensor.py b/utils/external/faster_rcnn_tensorflow/utility/show_box_in_tensor.py
new file mode 100644
index 0000000..8002ee3
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/utility/show_box_in_tensor.py
@@ -0,0 +1,65 @@
+# -*- coding: utf-8 -*-
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import tensorflow as tf
+
+from utils.external.faster_rcnn_tensorflow.utility import draw_box_in_img
+
+def only_draw_boxes(img_batch, boxes):
+
+    boxes = tf.stop_gradient(boxes)
+    img_tensor = tf.squeeze(img_batch, 0)
+    img_tensor = tf.cast(img_tensor, tf.float32)
+    labels = tf.ones(shape=(tf.shape(boxes)[0], ), dtype=tf.int32) * draw_box_in_img.ONLY_DRAW_BOXES
+    scores = tf.zeros_like(labels, dtype=tf.float32)
+    img_tensor_with_boxes = tf.py_func(draw_box_in_img.draw_boxes_with_label_and_scores,
+                                       inp=[img_tensor, boxes, labels, scores],
+                                       Tout=tf.uint8)
+    img_tensor_with_boxes = tf.reshape(img_tensor_with_boxes, tf.shape(img_batch))  # [batch_size, h, w, c]
+
+    return img_tensor_with_boxes
+
+def draw_boxes_with_scores(img_batch, boxes, scores):
+
+    boxes = tf.stop_gradient(boxes)
+    scores = tf.stop_gradient(scores)
+
+    img_tensor = tf.squeeze(img_batch, 0)
+    img_tensor = tf.cast(img_tensor, tf.float32)
+    labels = tf.ones(shape=(tf.shape(boxes)[0],), dtype=tf.int32) * draw_box_in_img.ONLY_DRAW_BOXES_WITH_SCORES
+    img_tensor_with_boxes = tf.py_func(draw_box_in_img.draw_boxes_with_label_and_scores,
+                                       inp=[img_tensor, boxes, labels, scores],
+                                       Tout=[tf.uint8])
+    img_tensor_with_boxes = tf.reshape(img_tensor_with_boxes, tf.shape(img_batch))
+    return img_tensor_with_boxes
+
+def draw_boxes_with_categories(img_batch, boxes, labels):
+    boxes = tf.stop_gradient(boxes)
+
+    img_tensor = tf.squeeze(img_batch, 0)
+    img_tensor = tf.cast(img_tensor, tf.float32)
+    scores = tf.ones(shape=(tf.shape(boxes)[0],), dtype=tf.float32)
+    img_tensor_with_boxes = tf.py_func(draw_box_in_img.draw_boxes_with_label_and_scores,
+                                       inp=[img_tensor, boxes, labels, scores],
+                                       Tout=[tf.uint8])
+    img_tensor_with_boxes = tf.reshape(img_tensor_with_boxes, tf.shape(img_batch))
+    return img_tensor_with_boxes
+
+def draw_boxes_with_categories_and_scores(img_batch, boxes, labels, scores):
+    boxes = tf.stop_gradient(boxes)
+    scores = tf.stop_gradient(scores)
+
+    img_tensor = tf.squeeze(img_batch, 0)
+    img_tensor = tf.cast(img_tensor, tf.float32)
+    img_tensor_with_boxes = tf.py_func(draw_box_in_img.draw_boxes_with_label_and_scores,
+                                       inp=[img_tensor, boxes, labels, scores],
+                                       Tout=[tf.uint8])
+    img_tensor_with_boxes = tf.reshape(img_tensor_with_boxes, tf.shape(img_batch))
+    return img_tensor_with_boxes
+
+if __name__ == "__main__":
+    print(1)
+
diff --git a/utils/external/faster_rcnn_tensorflow/utility/tf_ops.py b/utils/external/faster_rcnn_tensorflow/utility/tf_ops.py
new file mode 100644
index 0000000..9c8dfef
--- /dev/null
+++ b/utils/external/faster_rcnn_tensorflow/utility/tf_ops.py
@@ -0,0 +1,39 @@
+# -*- coding:utf-8 -*-
+
+from __future__ import absolute_import, print_function, division
+
+import tensorflow as tf
+
+'''
+all of these ops are derived from the TensorFlow Object Detection API
+'''
+def indices_to_dense_vector(indices,
+                            size,
+                            indices_value=1.,
+                            default_value=0,
+                            dtype=tf.float32):
+  """Creates a dense vector with `indices` set to `indices_value` and the rest set to `default_value`.
+
+  This function exists because it is unclear if it is safe to use
+    tf.sparse_to_dense(indices, [size], 1, validate_indices=False)
+  with indices which are not ordered.
+  This function accepts a dynamic size (e.g. tf.shape(tensor)[0])
+
+  Args:
+    indices: 1d Tensor with integer indices which are to be set to
+        indices_value.
+    size: scalar with size (integer) of output Tensor.
+    indices_value: values of elements specified by indices in the output vector
+    default_value: values of other elements in the output vector.
+    dtype: data type.
+
+  Returns:
+    dense 1D Tensor of shape [size] with indices set to indices_value and the
+        rest set to default_value.
+  """
+  size = tf.to_int32(size)
+  zeros = tf.ones([size], dtype=dtype) * default_value
+  values = tf.ones_like(indices, dtype=dtype) * indices_value
+
+  return tf.dynamic_stitch([tf.range(size), tf.to_int32(indices)],
+                           [zeros, values])
\ No newline at end of file
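Reviewer note on `indices_to_dense_vector` above (a sketch, not part of the patch): the function stitches a vector filled with `default_value` together with the values for `indices`; because `tf.dynamic_stitch` lets later entries override earlier ones, the positions named in `indices` end up holding `indices_value`. The plain-Python helper below, with the hypothetical name `indices_to_dense_list`, shows the intended result for unordered indices. It illustrates the semantics only and is not a replacement for the TensorFlow op.

```python
def indices_to_dense_list(indices, size, indices_value=1.0, default_value=0.0):
    """Return a list of length `size` holding `indices_value` at `indices`.

    Every other position holds `default_value`. The order of `indices`
    does not matter, which is the property the docstring above cares about.
    """
    dense = [default_value] * size
    for i in indices:
        dense[i] = indices_value
    return dense


# Unordered indices give the same result as ordered ones.
unordered = indices_to_dense_list([3, 0], size=5)
ordered = indices_to_dense_list([0, 3], size=5)
print(unordered)
print(unordered == ordered)
```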
diff --git a/utils/external/resnet_model.py b/utils/external/resnet_model.py
index e17aab5..673b915 100644
--- a/utils/external/resnet_model.py
+++ b/utils/external/resnet_model.py
@@ -33,6 +33,14 @@
 
 import tensorflow as tf
 
+FLAGS = tf.app.flags.FLAGS
+
+tf.app.flags.DEFINE_boolean('enbl_fused_batchnorm', True,
+                            'Whether to enable fused batch normalization. Enabling it brings a '
+                            'significant performance boost, but the model may fail to export to '
+                            '*.tflite when trained with TensorFlow\'s quantization-aware training '
+                            'APIs (export fails at least with TensorFlow==1.12.0).')
+
 _BATCH_NORM_DECAY = 0.997
 _BATCH_NORM_EPSILON = 1e-5
 DEFAULT_VERSION = 2
@@ -51,7 +59,7 @@ def batch_norm(inputs, training, data_format):
   return tf.layers.batch_normalization(
       inputs=inputs, axis=1 if data_format == 'channels_first' else 3,
       momentum=_BATCH_NORM_DECAY, epsilon=_BATCH_NORM_EPSILON, center=True,
-      scale=True, training=training, fused=True)
+      scale=True, training=training, fused=FLAGS.enbl_fused_batchnorm)
 
 
 def fixed_padding(inputs, kernel_size, data_format):
diff --git a/utils/external/ssd_tensorflow/LICENSE b/utils/external/ssd_tensorflow/LICENSE
new file mode 100644
index 0000000..261eeb9
--- /dev/null
+++ b/utils/external/ssd_tensorflow/LICENSE
@@ -0,0 +1,201 @@
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
+   END OF TERMS AND CONDITIONS
+
+   APPENDIX: How to apply the Apache License to your work.
+
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright [yyyy] [name of copyright owner]
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
diff --git a/utils/external/ssd_tensorflow/README.md b/utils/external/ssd_tensorflow/README.md
new file mode 100644
index 0000000..f2b3a20
--- /dev/null
+++ b/utils/external/ssd_tensorflow/README.md
@@ -0,0 +1,138 @@
+# State-of-the-art Single Shot MultiBox Detector in TensorFlow
+
+This repository contains a reimplementation of [SSD: Single Shot MultiBox Detector](https://arxiv.org/abs/1512.02325) in TensorFlow. If your goal is to reproduce the results in the original paper, please use the official [code](https://github.com/weiliu89/caffe/tree/ssd).
+
+There are already several TensorFlow-based SSD reimplementations on GitHub; the main distinguishing features of this repo include:
+
+- state-of-the-art performance (77.8% mAP) when training from the pre-trained VGG-16 model (SSD300-VGG16).
+- the model is trained using the TensorFlow high-level API [tf.estimator](https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator). Although TensorFlow provides many APIs, the Estimator API is highly recommended for building scalable, high-performance models.
+- all code is written with pure TensorFlow ops (no NumPy operations) to ensure performance and portability.
+- uses the SSD augmentation pipeline described in the original paper.
+- PyTorch-like model definition using the high-level [tf.layers](https://www.tensorflow.org/api_docs/python/tf/layers) API for better readability ^-^.
+- high degree of modularity to ease further development.
+- uses replicate\_model\_fn, which makes it easy to train on one or more GPUs.
+
+***New update (77.9% mAP): using absolute bbox coordinates instead of normalized coordinates; check it out [here](https://github.com/HiKapok/SSD.TensorFlow/tree/AbsoluteCoord).***
+
+## ##
+## Usage
+- Download [Pascal VOC Dataset](https://pjreddie.com/projects/pascal-voc-dataset-mirror/) and reorganize the directory as follows:
+	```
+	VOCROOT/
+		   |->VOC2007/
+		   |    |->Annotations/
+		   |    |->ImageSets/
+		   |    |->...
+		   |->VOC2012/
+		   |    |->Annotations/
+		   |    |->ImageSets/
+		   |    |->...
+		   |->VOC2007TEST/
+		   |    |->Annotations/
+		   |    |->...
+	```
+	VOCROOT is the path to your Pascal VOC dataset.
+- Run the following script to generate TFRecords.
+	```sh
+	python dataset/convert_tfrecords.py --dataset_directory=VOCROOT --output_directory=./dataset/tfrecords
+	```
+- Download the **pre-trained VGG-16 model (reduced-fc)** from [here](https://drive.google.com/drive/folders/184srhbt8_uvLKeWW_Yo8Mc5wTyc0lJT7) and put the files into a sub-directory named 'model' (SaverDef.V2 is supported by default; a V1 version is also available for the sake of compatibility).
+- Run the following script to start training:
+
+	```sh
+	python train_ssd.py
+	```
+- Run the following scripts to evaluate the model and compute mAP:
+
+	```sh
+	python eval_ssd.py
+	python voc_eval.py
+	```
+	Note: you first need to modify some directory paths in voc_eval.py.
+- Run the following script for visualization:
+	```sh
+	python simple_ssd_demo.py
+	```
+
+All the code was tested under TensorFlow 1.6, Python 3.5, and Ubuntu 16.04 with CUDA 8.0. If you want to run training yourself, one decent GPU is highly recommended. The whole training process on the VOC07+12 dataset took ~120k steps in total, and each step (32 samples per batch) took ~1s on my little workstation with a single GTX1080-Ti GPU. If you need to train without enough GPU memory, you can try halving the batch size (e.g. 16), lowering the learning rate, and running more steps, watching TensorBoard until convergence. BTW, the code has also been tested under TensorFlow 1.4 with CUDA 8.0, but some modifications are needed to enable replicated model training; take the following steps if you need them:
+
+- copy all the code from [this file](https://github.com/tensorflow/tensorflow/blob/v1.6.0/tensorflow/contrib/estimator/python/estimator/replicate_model_fn.py) to a local file named 'tf\_replicate\_model\_fn.py'
+- add one more line [here](https://github.com/HiKapok/SSD.TensorFlow/blob/899e08dad48669ca0c444284977e3d7ffa1da5fe/train_ssd.py#L25) to import the module 'tf\_replicate\_model\_fn'
+- change 'tf.contrib.estimator' [here](https://github.com/HiKapok/SSD.TensorFlow/blob/899e08dad48669ca0c444284977e3d7ffa1da5fe/train_ssd.py#L383) and [here](https://github.com/HiKapok/SSD.TensorFlow/blob/899e08dad48669ca0c444284977e3d7ffa1da5fe/train_ssd.py#L422) to 'tf\_replicate\_model\_fn'
+- now the training process should run perfectly
+- before you run 'eval_ssd.py', you should also remove [this line](https://github.com/HiKapok/SSD.TensorFlow/blob/e8296848b9f6eb585da5945d6b3ae099029ef4bf/eval_ssd.py#L369) because of an interface incompatibility
+
+
+***This repo was created recently; any contribution is welcome.***
+
+## Results (VOC07 Metric)
+
+This implementation (SSD300-VGG16) yields **mAP 77.8%** on the PASCAL VOC 2007 test set (the original paper reports 77.2% mAP); the per-class details are as follows:
+
+| sofa   | bird  | pottedplant | bus | diningtable | cow | bottle | horse | aeroplane | motorbike |
+|:-------|:-----:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|
+|  78.9  |  76.2 |  53.5   |   85.2  |   75.5    |  85.0 |  48.6  | 86.7  |   82.2    |   83.4   |
+| **sheep**  | **train** | **boat**    | **bicycle** | **chair**    | **cat**   | **tvmonitor** | **person** | **car**  | **dog** |
+|  82.4  |  87.6 |  72.7   |   83.0  |   61.3    | 88.2 |  74.5  | 79.6  |   85.3   |   86.4   |
+
+You can download the trained model (VOC07+12 Train) from [GoogleDrive](https://drive.google.com/open?id=1yeYcfcOURcZ4DaElEn9C2xY1NymGzG5W) for further research.
+
+For Chinese friends, you can also download both the trained model and pre-trained vgg16 weights from [BaiduYun Drive](https://pan.baidu.com/s/1kRhZd4p-N46JFpVkMgU3fg), access code: **tg64**.
+
+Here are the training logs and some detection results:
+
+![](logs/loss.JPG "loss")
+![](logs/celoss.JPG "celoss")
+![](logs/locloss.JPG "locloss")
+![](demo/demo1.jpg "demo1")
+![](demo/demo2.jpg "demo2")
+![](demo/demo3.jpg "demo3")
+
+## *Too Busy* TODO
+
+- Adapt for the COCO dataset
+- Update version SSD-512
+- Transfer to other backbone networks
+
+## Known Issues
+
+- Got 'TypeError: Expected binary or unicode string, got None' while training
+  - Why: There may be some inconsistency between different TensorFlow versions.
+  - How: If you get this error, try changing the default value of checkpoint_path to './model/vgg16.ckpt' in [train_ssd.py](https://github.com/HiKapok/SSD.TensorFlow/blob/86e3fa600d8d07122e9366ae664dea8c3c87c622/train_ssd.py#L107). For more information, see [issue6](https://github.com/HiKapok/SSD.TensorFlow/issues/6) and [issue9](https://github.com/HiKapok/SSD.TensorFlow/issues/9).
+- NaN loss during training
+  - Why: This is caused by the default learning rate, which is a little too high for some TensorFlow versions.
+  - How: I don't know the details of why the behavior differs between versions. There are two workarounds:
+  	- Add warm-up: change the code [here](https://github.com/HiKapok/SSD.TensorFlow/blob/d9cf250df81c8af29985c03d76636b2b8b19f089/train_ssd.py#L99) to the following snippet:
+
+	```python
+	tf.app.flags.DEFINE_string(
+	    'decay_boundaries', '2000, 80000, 100000',
+	    'Learning rate decay boundaries by global_step (comma-separated list).')
+	tf.app.flags.DEFINE_string(
+	    'lr_decay_factors', '0.1, 1, 0.1, 0.01',
+	    'The values of learning_rate decay factor for each segment between boundaries (comma-separated list).')
+	```
+	- Lower the learning rate and run more steps until convergence.
+- Why does this re-implementation perform better than the reported performance?
+  - I don't know
+
+## Citation
+
+Use this bibtex to cite this repository:
+```
+@misc{kapok_ssd_2018,
+  title={Single Shot MultiBox Detector in TensorFlow},
+  author={Changan Wang},
+  year={2018},
+  publisher={Github},
+  journal={GitHub repository},
+  howpublished={\url{https://github.com/HiKapok/SSD.TensorFlow}},
+}
+```
+
+## Discussion
+
+You are welcome to join the QQ Group (758790869) for more discussion.
+
+## ##
+Apache License, Version 2.0
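Reviewer note on the warm-up workaround in the README's "Known Issues" section (a sketch, not part of the patch): the two flag values describe a piecewise-constant schedule with one decay factor per segment, so three boundaries need four factors. The first factor, 0.1, acts as the warm-up for the first 2000 steps. The helper names and the base learning rate of 1e-3 below are assumptions made for illustration; the README does not state the base rate. Steps equal to a boundary are kept in the earlier segment here, which is also an assumption about the intended behavior.

```python
import bisect


def parse_float_list(text):
    """Parse a comma-separated flag value such as '0.1, 1, 0.1' into floats."""
    return [float(item) for item in text.split(',')]


def learning_rate_at(step, base_lr, decay_boundaries, lr_decay_factors):
    """Piecewise-constant schedule: one factor per segment between boundaries."""
    boundaries = [int(b) for b in parse_float_list(decay_boundaries)]
    factors = parse_float_list(lr_decay_factors)
    assert len(factors) == len(boundaries) + 1, 'need one factor per segment'
    # bisect_left counts the boundaries strictly below `step`, so a step that
    # equals a boundary still belongs to the earlier segment.
    segment = bisect.bisect_left(boundaries, step)
    return base_lr * factors[segment]


BASE_LR = 1e-3  # hypothetical base learning rate, not taken from the README
for step in (0, 2000, 2001, 90000, 120000):
    rate = learning_rate_at(step, BASE_LR,
                            '2000, 80000, 100000', '0.1, 1, 0.1, 0.01')
    print(step, rate)
```

With these values the rate is reduced tenfold during warm-up, returns to the base rate until step 80000, and then decays in two further stages.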
diff --git a/utils/external/ssd_tensorflow/dataset/convert_tfrecords.py b/utils/external/ssd_tensorflow/dataset/convert_tfrecords.py
new file mode 100644
index 0000000..4ce3ad3
--- /dev/null
+++ b/utils/external/ssd_tensorflow/dataset/convert_tfrecords.py
@@ -0,0 +1,394 @@
+# Copyright 2018 Changan Wang
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#     http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from datetime import datetime
+import os
+import random
+import sys
+import threading
+import xml.etree.ElementTree as xml_tree
+
+import numpy as np
+import six
+import tensorflow as tf
+
+import dataset_common
+
+'''How to organize your dataset folder:
+  VOCROOT/
+       |->VOC2007/
+       |    |->Annotations/
+       |    |->ImageSets/
+       |    |->...
+       |->VOC2012/
+       |    |->Annotations/
+       |    |->ImageSets/
+       |    |->...
+       |->VOC2007TEST/
+       |    |->Annotations/
+       |    |->...
+'''
+tf.app.flags.DEFINE_string('dataset_directory', '/media/rs/7A0EE8880EE83EAF/Detections/PASCAL/VOC',
+                           'Root directory of all the data.')
+tf.app.flags.DEFINE_string('train_splits', 'VOC2007, VOC2012',
+                           'Comma-separated list of the training data sub-directories')
+tf.app.flags.DEFINE_string('validation_splits', 'VOC2007TEST',
+                           'Comma-separated list of the validation data sub-directory')
+tf.app.flags.DEFINE_string('output_directory', '/media/rs/7A0EE8880EE83EAF/Detections/SSD/dataset/tfrecords',
+                           'Output data directory')
+tf.app.flags.DEFINE_integer('train_shards', 16,
+                            'Number of shards in training TFRecord files.')
+tf.app.flags.DEFINE_integer('validation_shards', 16,
+                            'Number of shards in validation TFRecord files.')
+tf.app.flags.DEFINE_integer('num_threads', 8,
+                            'Number of threads to preprocess the images.')
+RANDOM_SEED = 180428
+
+FLAGS = tf.app.flags.FLAGS
+
+def _int64_feature(value):
+  """Wrapper for inserting int64 features into Example proto."""
+  if not isinstance(value, list):
+    value = [value]
+  return tf.train.Feature(int64_list=tf.train.Int64List(value=value))
+
+
+def _float_feature(value):
+  """Wrapper for inserting float features into Example proto."""
+  if not isinstance(value, list):
+    value = [value]
+  return tf.train.Feature(float_list=tf.train.FloatList(value=value))
+
+def _bytes_list_feature(value):
+    """Wrapper for inserting a list of bytes features into Example proto.
+    """
+    if not isinstance(value, list):
+        value = [value]
+    return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))
+
+def _bytes_feature(value):
+  """Wrapper for inserting bytes features into Example proto."""
+  if isinstance(value, six.text_type):
+    value = value.encode('utf-8')
+  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
+
+def _convert_to_example(filename, image_name, image_buffer, bboxes, labels, labels_text,
+                        difficult, truncated, height, width):
+  """Build an Example proto for an example.
+
+  Args:
+    filename: string, path to an image file, e.g., '/path/to/example.JPG'
+    image_name: string, file name of the image, stored as 'image/filename'
+    image_buffer: string, JPEG encoding of RGB image
+    bboxes: List of bounding boxes, each as (ymin, xmin, ymax, xmax)
+    labels: List of integer class labels, one per bounding box
+    labels_text: List of class names, one per bounding box
+    difficult: List of ints flagging whether each bounding box is difficult
+    truncated: List of ints flagging whether each bounding box is truncated
+    height: integer, image height in pixels
+    width: integer, image width in pixels
+  Returns:
+    Example proto
+  """
+  ymin = []
+  xmin = []
+  ymax = []
+  xmax = []
+  for b in bboxes:
+    assert len(b) == 4
+    for coords, point in zip([ymin, xmin, ymax, xmax], b):
+      coords.append(point)
+  channels = 3
+  image_format = 'JPEG'
+
+  example = tf.train.Example(features=tf.train.Features(feature={
+            'image/height': _int64_feature(height),
+            'image/width': _int64_feature(width),
+            'image/channels': _int64_feature(channels),
+            'image/shape': _int64_feature([height, width, channels]),
+            'image/object/bbox/xmin': _float_feature(xmin),
+            'image/object/bbox/xmax': _float_feature(xmax),
+            'image/object/bbox/ymin': _float_feature(ymin),
+            'image/object/bbox/ymax': _float_feature(ymax),
+            'image/object/bbox/label': _int64_feature(labels),
+            'image/object/bbox/label_text': _bytes_list_feature(labels_text),
+            'image/object/bbox/difficult': _int64_feature(difficult),
+            'image/object/bbox/truncated': _int64_feature(truncated),
+            'image/format': _bytes_feature(image_format),
+            'image/filename': _bytes_feature(image_name.encode('utf8')),
+            'image/encoded': _bytes_feature(image_buffer)}))
+  return example
+
+
+class ImageCoder(object):
+  """Helper class that provides TensorFlow image coding utilities."""
+
+  def __init__(self):
+    # Create a single Session to run all image coding calls.
+    self._sess = tf.Session()
+
+    # Initializes function that converts PNG to JPEG data.
+    self._png_data = tf.placeholder(dtype=tf.string)
+    image = tf.image.decode_png(self._png_data, channels=3)
+    self._png_to_jpeg = tf.image.encode_jpeg(image, format='rgb', quality=100)
+
+    # Initializes function that converts CMYK JPEG data to RGB JPEG data.
+    self._cmyk_data = tf.placeholder(dtype=tf.string)
+    image = tf.image.decode_jpeg(self._cmyk_data, channels=0)
+    self._cmyk_to_rgb = tf.image.encode_jpeg(image, format='rgb', quality=100)
+
+    # Initializes function that decodes RGB JPEG data.
+    self._decode_jpeg_data = tf.placeholder(dtype=tf.string)
+    self._decode_jpeg = tf.image.decode_jpeg(self._decode_jpeg_data, channels=3)
+
+  def png_to_jpeg(self, image_data):
+    return self._sess.run(self._png_to_jpeg,
+                          feed_dict={self._png_data: image_data})
+
+  def cmyk_to_rgb(self, image_data):
+    return self._sess.run(self._cmyk_to_rgb,
+                          feed_dict={self._cmyk_data: image_data})
+
+  def decode_jpeg(self, image_data):
+    image = self._sess.run(self._decode_jpeg,
+                           feed_dict={self._decode_jpeg_data: image_data})
+    assert len(image.shape) == 3
+    assert image.shape[2] == 3
+    return image
+
+
+def _process_image(filename, coder):
+  """Process a single image file.
+
+  Args:
+    filename: string, path to an image file e.g., '/path/to/example.JPG'.
+    coder: instance of ImageCoder to provide TensorFlow image coding utils.
+  Returns:
+    image_buffer: string, JPEG encoding of RGB image.
+    height: integer, image height in pixels.
+    width: integer, image width in pixels.
+  """
+  # Read the image file.
+  with tf.gfile.FastGFile(filename, 'rb') as f:
+    image_data = f.read()
+
+  # Decode the RGB JPEG.
+  image = coder.decode_jpeg(image_data)
+
+  # Check that image converted to RGB
+  assert len(image.shape) == 3
+  height = image.shape[0]
+  width = image.shape[1]
+  assert image.shape[2] == 3
+
+  return image_data, height, width
+
+def _find_image_bounding_boxes(directory, cur_record):
+  """Find the bounding boxes for a given image file.
+
+  Args:
+    directory: string; root path of all the datasets.
+    cur_record: pair of strings; the first is the sub-directory of the record, the second is the image filename.
+  Returns:
+    bboxes: List of bounding boxes, each as normalized (ymin, xmin, ymax, xmax).
+    labels: List of integer class labels, one per bounding box.
+    labels_text: List of class names, one per bounding box.
+    difficult: List of ints flagging whether each bounding box is difficult.
+    truncated: List of ints flagging whether each bounding box is truncated.
+  """
+  anna_file = os.path.join(directory, cur_record[0], 'Annotations', os.path.splitext(cur_record[1])[0] + '.xml')
+
+  tree = xml_tree.parse(anna_file)
+  root = tree.getroot()
+
+  # Image shape.
+  size = root.find('size')
+  shape = [int(size.find('height').text),
+           int(size.find('width').text),
+           int(size.find('depth').text)]
+  # Find annotations.
+  bboxes = []
+  labels = []
+  labels_text = []
+  difficult = []
+  truncated = []
+  for obj in root.findall('object'):
+      label = obj.find('name').text
+      labels.append(int(dataset_common.VOC_LABELS[label][0]))
+      labels_text.append(label.encode('ascii'))
+
+      isdifficult = obj.find('difficult')
+      if isdifficult is not None:
+          difficult.append(int(isdifficult.text))
+      else:
+          difficult.append(0)
+
+      istruncated = obj.find('truncated')
+      if istruncated is not None:
+          truncated.append(int(istruncated.text))
+      else:
+          truncated.append(0)
+
+      bbox = obj.find('bndbox')
+      bboxes.append((float(bbox.find('ymin').text) / shape[0],
+                     float(bbox.find('xmin').text) / shape[1],
+                     float(bbox.find('ymax').text) / shape[0],
+                     float(bbox.find('xmax').text) / shape[1]
+                     ))
+  return bboxes, labels, labels_text, difficult, truncated
+
+def _process_image_files_batch(coder, thread_index, ranges, name, directory, all_records, num_shards):
+  """Processes and saves list of images as TFRecord in 1 thread.
+
+  Args:
+    coder: instance of ImageCoder to provide TensorFlow image coding utils.
+    thread_index: integer, index of the batch to run, within [0, len(ranges)).
+    ranges: list of pairs of integers specifying the range of each batch to
+      analyze in parallel.
+    name: string, unique identifier specifying the data set
+    directory: string; root path of all the datasets
+    all_records: list of string tuples; the first of each tuple is the sub-directory of the record, the second is the image filename.
+    num_shards: integer number of shards for this data set.
+  """
+  # Each thread produces N shards where N = int(num_shards / num_threads).
+  # For instance, if num_shards = 128, and the num_threads = 2, then the first
+  # thread would produce shards [0, 64).
+  num_threads = len(ranges)
+  assert not num_shards % num_threads
+  num_shards_per_batch = int(num_shards / num_threads)
+
+  shard_ranges = np.linspace(ranges[thread_index][0],
+                             ranges[thread_index][1],
+                             num_shards_per_batch + 1).astype(int)
+  num_files_in_thread = ranges[thread_index][1] - ranges[thread_index][0]
+
+  counter = 0
+  for s in range(num_shards_per_batch):
+    # Generate a sharded version of the file name, e.g. 'train-00002-of-00010'
+    shard = thread_index * num_shards_per_batch + s
+    output_filename = '%s-%.5d-of-%.5d' % (name, shard, num_shards)
+    output_file = os.path.join(FLAGS.output_directory, output_filename)
+    writer = tf.python_io.TFRecordWriter(output_file)
+
+    shard_counter = 0
+    files_in_shard = np.arange(shard_ranges[s], shard_ranges[s + 1], dtype=int)
+    for i in files_in_shard:
+      cur_record = all_records[i]
+      filename = os.path.join(directory, cur_record[0], 'JPEGImages', cur_record[1])
+
+      bboxes, labels, labels_text, difficult, truncated = _find_image_bounding_boxes(directory, cur_record)
+      image_buffer, height, width = _process_image(filename, coder)
+
+      example = _convert_to_example(filename, cur_record[1], image_buffer, bboxes, labels, labels_text,
+                                    difficult, truncated, height, width)
+      writer.write(example.SerializeToString())
+      shard_counter += 1
+      counter += 1
+
+      if not counter % 1000:
+        print('%s [thread %d]: Processed %d of %d images in thread batch.' %
+              (datetime.now(), thread_index, counter, num_files_in_thread))
+        sys.stdout.flush()
+
+    writer.close()
+    print('%s [thread %d]: Wrote %d images to %s' %
+          (datetime.now(), thread_index, shard_counter, output_file))
+    sys.stdout.flush()
+    shard_counter = 0
+  print('%s [thread %d]: Wrote %d images to %d shards.' %
+        (datetime.now(), thread_index, counter, num_shards_per_batch))
+  sys.stdout.flush()
+
+def _process_image_files(name, directory, all_records, num_shards):
+  """Process and save list of images as TFRecord of Example protos.
+
+  Args:
+    name: string, unique identifier specifying the data set
+    directory: string; root path of all the datasets
+    all_records: list of string tuples; the first of each tuple is the sub-directory of the record, the second is the image filename.
+    num_shards: integer number of shards for this data set.
+  """
+  # Break all images into batches, one [ranges[i][0], ranges[i][1]) range per thread.
+  spacing = np.linspace(0, len(all_records), FLAGS.num_threads + 1).astype(int)
+  ranges = []
+  threads = []
+  for i in range(len(spacing) - 1):
+    ranges.append([spacing[i], spacing[i + 1]])
+
+  # Launch a thread for each batch.
+  print('Launching %d threads for spacings: %s' % (FLAGS.num_threads, ranges))
+  sys.stdout.flush()
+
+  # Create a mechanism for monitoring when all threads are finished.
+  coord = tf.train.Coordinator()
+
+  # Create a generic TensorFlow-based utility for converting all image codings.
+  coder = ImageCoder()
+
+  threads = []
+  for thread_index in range(len(ranges)):
+    args = (coder, thread_index, ranges, name, directory, all_records, num_shards)
+    t = threading.Thread(target=_process_image_files_batch, args=args)
+    t.start()
+    threads.append(t)
+
+  # Wait for all the threads to terminate.
+  coord.join(threads)
+  print('%s: Finished writing all %d images in data set.' %
+        (datetime.now(), len(all_records)))
+  sys.stdout.flush()
+
+def _process_dataset(name, directory, all_splits, num_shards):
+  """Process a complete data set and save it as a TFRecord.
+
+  Args:
+    name: string, unique identifier specifying the data set.
+    directory: string, root path to the data set.
+    all_splits: list of strings, sub-path to the data set.
+    num_shards: integer number of shards for this data set.
+  """
+  all_records = []
+  for split in all_splits:
+    jpeg_file_path = os.path.join(directory, split, 'JPEGImages')
+    images = tf.gfile.ListDirectory(jpeg_file_path)
+    jpegs = [im_name for im_name in images if im_name.strip()[-3:]=='jpg']
+    all_records.extend(list(zip([split] * len(jpegs), jpegs)))
+
+  shuffled_index = list(range(len(all_records)))
+  random.seed(RANDOM_SEED)
+  random.shuffle(shuffled_index)
+  all_records = [all_records[i] for i in shuffled_index]
+  _process_image_files(name, directory, all_records, num_shards)
+
+def parse_comma_list(args):
+    return [s.strip() for s in args.split(',')]
+
+def main(unused_argv):
+  assert not FLAGS.train_shards % FLAGS.num_threads, (
+      'Please make the FLAGS.num_threads commensurate with FLAGS.train_shards')
+  assert not FLAGS.validation_shards % FLAGS.num_threads, (
+      'Please make the FLAGS.num_threads commensurate with '
+      'FLAGS.validation_shards')
+  print('Saving results to %s' % FLAGS.output_directory)
+
+  # Run it!
+  _process_dataset('val', FLAGS.dataset_directory, parse_comma_list(FLAGS.validation_splits), FLAGS.validation_shards)
+  _process_dataset('train', FLAGS.dataset_directory, parse_comma_list(FLAGS.train_splits), FLAGS.train_shards)
+
+if __name__ == '__main__':
+  tf.app.run()
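Review note on `convert_tfrecords.py`: the two-level `np.linspace` split (records to threads, then each thread's range to shards) can be sanity-checked without TensorFlow or NumPy. The sketch below is a pure-Python approximation, assuming integer floor division gives the same boundaries as the truncated `linspace` values; in rare floating-point cases the real script may place a boundary one record earlier. `split_ranges` and `shard_plan` are hypothetical helper names, not part of this patch.

```python
def split_ranges(num_items, num_parts):
    """Split [0, num_items) into num_parts contiguous half-open ranges."""
    # Integer arithmetic stands in for np.linspace(...).astype(int).
    bounds = [i * num_items // num_parts for i in range(num_parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def shard_plan(num_items, num_threads, num_shards, name):
    """Map each output shard file name to the (start, end) record range it holds."""
    # Same precondition the script asserts: shards divide evenly among threads.
    assert num_shards % num_threads == 0
    shards_per_thread = num_shards // num_threads
    plan = {}
    for thread_index, (start, end) in enumerate(split_ranges(num_items, num_threads)):
        # Each thread splits its own record range again, once per shard it writes.
        for s, (lo, hi) in enumerate(split_ranges(end - start, shards_per_thread)):
            shard = thread_index * shards_per_thread + s
            shard_name = '%s-%.5d-of-%.5d' % (name, shard, num_shards)
            plan[shard_name] = (start + lo, start + hi)
    return plan
```

With 10 records, 2 threads and 4 shards, thread 0 writes shards 0 and 1 and thread 1 writes shards 2 and 3, so every record lands in exactly one shard.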
diff --git a/utils/external/ssd_tensorflow/dataset/dataset_common.py b/utils/external/ssd_tensorflow/dataset/dataset_common.py
new file mode 100644
index 0000000..046dcca
--- /dev/null
+++ b/utils/external/ssd_tensorflow/dataset/dataset_common.py
@@ -0,0 +1,238 @@
+# Copyright 2018 Changan Wang
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#     http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import tensorflow as tf
+
+slim = tf.contrib.slim
+
+VOC_LABELS = {
+    'none': (0, 'Background'),
+    'aeroplane': (1, 'Vehicle'),
+    'bicycle': (2, 'Vehicle'),
+    'bird': (3, 'Animal'),
+    'boat': (4, 'Vehicle'),
+    'bottle': (5, 'Indoor'),
+    'bus': (6, 'Vehicle'),
+    'car': (7, 'Vehicle'),
+    'cat': (8, 'Animal'),
+    'chair': (9, 'Indoor'),
+    'cow': (10, 'Animal'),
+    'diningtable': (11, 'Indoor'),
+    'dog': (12, 'Animal'),
+    'horse': (13, 'Animal'),
+    'motorbike': (14, 'Vehicle'),
+    'person': (15, 'Person'),
+    'pottedplant': (16, 'Indoor'),
+    'sheep': (17, 'Animal'),
+    'sofa': (18, 'Indoor'),
+    'train': (19, 'Vehicle'),
+    'tvmonitor': (20, 'Indoor'),
+}
+
+COCO_LABELS = {
+    "bench":  (14, 'outdoor') ,
+    "skateboard":  (37, 'sports') ,
+    "toothbrush":  (80, 'indoor') ,
+    "person":  (1, 'person') ,
+    "donut":  (55, 'food') ,
+    "none":  (0, 'background') ,
+    "refrigerator":  (73, 'appliance') ,
+    "horse":  (18, 'animal') ,
+    "elephant":  (21, 'animal') ,
+    "book":  (74, 'indoor') ,
+    "car":  (3, 'vehicle') ,
+    "keyboard":  (67, 'electronic') ,
+    "cow":  (20, 'animal') ,
+    "microwave":  (69, 'appliance') ,
+    "traffic light":  (10, 'outdoor') ,
+    "tie":  (28, 'accessory') ,
+    "dining table":  (61, 'furniture') ,
+    "toaster":  (71, 'appliance') ,
+    "baseball glove":  (36, 'sports') ,
+    "giraffe":  (24, 'animal') ,
+    "cake":  (56, 'food') ,
+    "handbag":  (27, 'accessory') ,
+    "scissors":  (77, 'indoor') ,
+    "bowl":  (46, 'kitchen') ,
+    "couch":  (58, 'furniture') ,
+    "chair":  (57, 'furniture') ,
+    "boat":  (9, 'vehicle') ,
+    "hair drier":  (79, 'indoor') ,
+    "airplane":  (5, 'vehicle') ,
+    "pizza":  (54, 'food') ,
+    "backpack":  (25, 'accessory') ,
+    "kite":  (34, 'sports') ,
+    "sheep":  (19, 'animal') ,
+    "umbrella":  (26, 'accessory') ,
+    "stop sign":  (12, 'outdoor') ,
+    "truck":  (8, 'vehicle') ,
+    "skis":  (31, 'sports') ,
+    "sandwich":  (49, 'food') ,
+    "broccoli":  (51, 'food') ,
+    "wine glass":  (41, 'kitchen') ,
+    "surfboard":  (38, 'sports') ,
+    "sports ball":  (33, 'sports') ,
+    "cell phone":  (68, 'electronic') ,
+    "dog":  (17, 'animal') ,
+    "bed":  (60, 'furniture') ,
+    "toilet":  (62, 'furniture') ,
+    "fire hydrant":  (11, 'outdoor') ,
+    "oven":  (70, 'appliance') ,
+    "zebra":  (23, 'animal') ,
+    "tv":  (63, 'electronic') ,
+    "potted plant":  (59, 'furniture') ,
+    "parking meter":  (13, 'outdoor') ,
+    "spoon":  (45, 'kitchen') ,
+    "bus":  (6, 'vehicle') ,
+    "laptop":  (64, 'electronic') ,
+    "cup":  (42, 'kitchen') ,
+    "bird":  (15, 'animal') ,
+    "sink":  (72, 'appliance') ,
+    "remote":  (66, 'electronic') ,
+    "bicycle":  (2, 'vehicle') ,
+    "tennis racket":  (39, 'sports') ,
+    "baseball bat":  (35, 'sports') ,
+    "cat":  (16, 'animal') ,
+    "fork":  (43, 'kitchen') ,
+    "suitcase":  (29, 'accessory') ,
+    "snowboard":  (32, 'sports') ,
+    "clock":  (75, 'indoor') ,
+    "apple":  (48, 'food') ,
+    "mouse":  (65, 'electronic') ,
+    "bottle":  (40, 'kitchen') ,
+    "frisbee":  (30, 'sports') ,
+    "carrot":  (52, 'food') ,
+    "bear":  (22, 'animal') ,
+    "hot dog":  (53, 'food') ,
+    "teddy bear":  (78, 'indoor') ,
+    "knife":  (44, 'kitchen') ,
+    "train":  (7, 'vehicle') ,
+    "vase":  (76, 'indoor') ,
+    "banana":  (47, 'food') ,
+    "motorcycle":  (4, 'vehicle') ,
+    "orange":  (50, 'food')
+  }
+
+# Use dataset_inspect.py to obtain these counts.
+data_splits_num = {
+    'train': 22136,
+    'val': 4952,
+}
+
+def slim_get_batch(num_classes, batch_size, split_name, file_pattern, num_readers, num_preprocessing_threads, image_preprocessing_fn, anchor_encoder, num_epochs=None, is_training=True):
+    """Gets a dataset tuple with instructions for reading Pascal VOC dataset.
+
+    Args:
+      num_classes: total number of classes in the dataset.
+      batch_size: the size of each batch.
+      split_name: 'train' or 'val'.
+      file_pattern: The file pattern to use when matching the dataset sources (full path).
+      num_readers: the max number of readers used for reading tfrecords.
+      num_preprocessing_threads: the max number of threads used to run the preprocessing function.
+      image_preprocessing_fn: the function used for dataset augmentation.
+      anchor_encoder: the function used to encode all anchors.
+      num_epochs: total number of epochs to iterate over this dataset.
+      is_training: whether we are in the training phase.
+
+    Returns:
+      A batch of [image, filename, shape, loc_targets, cls_targets, match_scores].
+    """
+    if split_name not in data_splits_num:
+        raise ValueError('split name %s was not recognized.' % split_name)
+
+    # Features in Pascal VOC TFRecords.
+    keys_to_features = {
+        'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
+        'image/format': tf.FixedLenFeature((), tf.string, default_value='jpeg'),
+        'image/filename': tf.FixedLenFeature((), tf.string, default_value=''),
+        'image/height': tf.FixedLenFeature([1], tf.int64),
+        'image/width': tf.FixedLenFeature([1], tf.int64),
+        'image/channels': tf.FixedLenFeature([1], tf.int64),
+        'image/shape': tf.FixedLenFeature([3], tf.int64),
+        'image/object/bbox/xmin': tf.VarLenFeature(dtype=tf.float32),
+        'image/object/bbox/ymin': tf.VarLenFeature(dtype=tf.float32),
+        'image/object/bbox/xmax': tf.VarLenFeature(dtype=tf.float32),
+        'image/object/bbox/ymax': tf.VarLenFeature(dtype=tf.float32),
+        'image/object/bbox/label': tf.VarLenFeature(dtype=tf.int64),
+        'image/object/bbox/difficult': tf.VarLenFeature(dtype=tf.int64),
+        'image/object/bbox/truncated': tf.VarLenFeature(dtype=tf.int64),
+    }
+    items_to_handlers = {
+        'image': slim.tfexample_decoder.Image('image/encoded', 'image/format'),
+        'filename': slim.tfexample_decoder.Tensor('image/filename'),
+        'shape': slim.tfexample_decoder.Tensor('image/shape'),
+        'object/bbox': slim.tfexample_decoder.BoundingBox(
+                ['ymin', 'xmin', 'ymax', 'xmax'], 'image/object/bbox/'),
+        'object/label': slim.tfexample_decoder.Tensor('image/object/bbox/label'),
+        'object/difficult': slim.tfexample_decoder.Tensor('image/object/bbox/difficult'),
+        'object/truncated': slim.tfexample_decoder.Tensor('image/object/bbox/truncated'),
+    }
+    decoder = slim.tfexample_decoder.TFExampleDecoder(keys_to_features, items_to_handlers)
+
+    labels_to_names = {}
+    for name, pair in VOC_LABELS.items():
+        labels_to_names[pair[0]] = name
+
+    dataset = slim.dataset.Dataset(
+                data_sources=file_pattern,
+                reader=tf.TFRecordReader,
+                decoder=decoder,
+                num_samples=data_splits_num[split_name],
+                items_to_descriptions=None,
+                num_classes=num_classes,
+                labels_to_names=labels_to_names)
+
+    with tf.name_scope('dataset_data_provider'):
+        provider = slim.dataset_data_provider.DatasetDataProvider(
+            dataset,
+            num_readers=num_readers,
+            common_queue_capacity=32 * batch_size,
+            common_queue_min=8 * batch_size,
+            shuffle=is_training,
+            num_epochs=num_epochs)
+
+    [org_image, filename, shape, glabels_raw, gbboxes_raw, isdifficult] = provider.get(['image', 'filename', 'shape',
+                                                                     'object/label',
+                                                                     'object/bbox',
+                                                                     'object/difficult'])
+
+    if is_training:
+        # If every object is difficult, keep only the first one; otherwise drop the difficult ones.
+        isdifficult_mask = tf.cond(tf.count_nonzero(isdifficult, dtype=tf.int32) < tf.shape(isdifficult)[0],
+                                lambda : isdifficult < tf.ones_like(isdifficult),
+                                lambda : tf.one_hot(0, tf.shape(isdifficult)[0], on_value=True, off_value=False, dtype=tf.bool))
+
+        glabels_raw = tf.boolean_mask(glabels_raw, isdifficult_mask)
+        gbboxes_raw = tf.boolean_mask(gbboxes_raw, isdifficult_mask)
+
+    # Pre-processing image, labels and bboxes.
+
+    if is_training:
+        image, glabels, gbboxes = image_preprocessing_fn(org_image, glabels_raw, gbboxes_raw)
+    else:
+        image = image_preprocessing_fn(org_image, glabels_raw, gbboxes_raw)
+        glabels, gbboxes = glabels_raw, gbboxes_raw
+
+    gt_targets, gt_labels, gt_scores = anchor_encoder(glabels, gbboxes)
+
+    return tf.train.batch([image, filename, shape, gt_targets, gt_labels, gt_scores],
+                    dynamic_pad=False,
+                    batch_size=batch_size,
+                    allow_smaller_final_batch=(not is_training),
+                    num_threads=num_preprocessing_threads,
+                    capacity=64 * batch_size)
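Review note on `dataset_common.py`: the `tf.cond` that builds `isdifficult_mask` is easier to reason about as plain list logic. The sketch below mirrors the two branches in pure Python; `keep_mask` and `apply_mask` are hypothetical names used only for illustration, and the stated reason for keeping the first object is an assumption, since the patch does not say.

```python
def keep_mask(isdifficult):
    """Return one boolean per object: True if the object is kept for training."""
    num_difficult = sum(1 for flag in isdifficult if flag != 0)
    if num_difficult < len(isdifficult):
        # At least one easy object exists: keep exactly the easy ones.
        return [flag < 1 for flag in isdifficult]
    # Every object is difficult: keep only the first one, presumably so that
    # the image still contributes a ground-truth box.
    return [index == 0 for index in range(len(isdifficult))]


def apply_mask(values, mask):
    """Pure-Python stand-in for tf.boolean_mask on a 1-D list."""
    return [value for value, keep in zip(values, mask) if keep]
```

For example, flags `[0, 1, 0]` keep the first and third objects, while `[1, 1, 1]` keeps only the first.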
diff --git a/utils/external/ssd_tensorflow/dataset/dataset_inspect.py b/utils/external/ssd_tensorflow/dataset/dataset_inspect.py
new file mode 100644
index 0000000..a94e6a6
--- /dev/null
+++ b/utils/external/ssd_tensorflow/dataset/dataset_inspect.py
@@ -0,0 +1,35 @@
+# Copyright 2018 Changan Wang
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#     http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+
+import tensorflow as tf
+
+def count_split_examples(split_path, file_prefix='.tfrecord'):
+    # Count the total number of examples across all shards matching the pattern.
+    num_samples = 0
+    tfrecords_to_count = tf.gfile.Glob(os.path.join(split_path, file_prefix))
+    opts = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.ZLIB)
+    for tfrecord_file in tfrecords_to_count:
+        for record in tf.python_io.tf_record_iterator(tfrecord_file):#, options = opts):
+            num_samples += 1
+    return num_samples
+
+if __name__ == '__main__':
+    print('train:', count_split_examples('/media/rs/7A0EE8880EE83EAF/Detections/SSD/dataset/tfrecords', 'train-?????-of-?????'))
+    print('val:', count_split_examples('/media/rs/7A0EE8880EE83EAF/Detections/SSD/dataset/tfrecords', 'val-?????-of-?????'))
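Review note on `dataset_inspect.py`: the counts in `data_splits_num` can also be cross-checked without TensorFlow. The sketch below assumes uncompressed shards and relies on the TFRecord framing (a little-endian uint64 payload length, a 4-byte CRC of the length, the payload, a 4-byte CRC of the payload). It does not verify the CRCs and does not detect a truncated final payload. `count_tfrecords` and `write_unchecked_record` are hypothetical helpers, not part of this patch.

```python
import os
import struct


def count_tfrecords(path):
    """Count the records in one uncompressed TFRecord file, skipping CRC checks."""
    count = 0
    with open(path, 'rb') as f:
        while True:
            header = f.read(12)  # uint64 length + uint32 CRC of the length
            if not header:
                break
            if len(header) < 12:
                raise ValueError('truncated record header in %s' % path)
            (length,) = struct.unpack('<Q', header[:8])
            f.seek(length + 4, os.SEEK_CUR)  # skip payload + uint32 CRC of the payload
            count += 1
    return count


def write_unchecked_record(f, payload):
    """Write one record with zeroed CRC fields; only useful for exercising the counter."""
    f.write(struct.pack('<Q', len(payload)))
    f.write(b'\x00' * 4)
    f.write(payload)
    f.write(b'\x00' * 4)
```

Summing `count_tfrecords` over `glob.glob(os.path.join(split_path, 'train-?????-of-?????'))` should give the same total as `count_split_examples` for uncompressed shards.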
diff --git a/utils/external/ssd_tensorflow/demo/demo1.jpg b/utils/external/ssd_tensorflow/demo/demo1.jpg
new file mode 100644
index 0000000..e0ca8c5
Binary files /dev/null and b/utils/external/ssd_tensorflow/demo/demo1.jpg differ
diff --git a/utils/external/ssd_tensorflow/demo/demo2.jpg b/utils/external/ssd_tensorflow/demo/demo2.jpg
new file mode 100644
index 0000000..568105f
Binary files /dev/null and b/utils/external/ssd_tensorflow/demo/demo2.jpg differ
diff --git a/utils/external/ssd_tensorflow/demo/demo3.jpg b/utils/external/ssd_tensorflow/demo/demo3.jpg
new file mode 100644
index 0000000..d486a47
Binary files /dev/null and b/utils/external/ssd_tensorflow/demo/demo3.jpg differ
diff --git a/utils/external/ssd_tensorflow/eval_ssd.py b/utils/external/ssd_tensorflow/eval_ssd.py
new file mode 100644
index 0000000..1a064b8
--- /dev/null
+++ b/utils/external/ssd_tensorflow/eval_ssd.py
@@ -0,0 +1,457 @@
+# Copyright 2018 Changan Wang
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#     http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import sys
+
+import tensorflow as tf
+
+import numpy as np
+
+from net import ssd_net
+
+from dataset import dataset_common
+from preprocessing import ssd_preprocessing
+from utility import anchor_manipulator
+from utility import scaffolds
+
+# hardware related configuration
+tf.app.flags.DEFINE_integer(
+    'num_readers', 8,
+    'The number of parallel readers that read data from the dataset.')
+tf.app.flags.DEFINE_integer(
+    'num_preprocessing_threads', 24,
+    'The number of threads used to create the batches.')
+tf.app.flags.DEFINE_integer(
+    'num_cpu_threads', 0,
+    'The number of cpu cores used to train.')
+tf.app.flags.DEFINE_float(
+    'gpu_memory_fraction', 1., 'GPU memory fraction to use.')
+# scaffold related configuration
+tf.app.flags.DEFINE_string(
+    'data_dir', './dataset/tfrecords',
+    'The directory where the dataset input data is stored.')
+tf.app.flags.DEFINE_integer(
+    'num_classes', 21, 'Number of classes to use in the dataset.')
+tf.app.flags.DEFINE_string(
+    'model_dir', './logs/',
+    'The directory where the model will be stored.')
+tf.app.flags.DEFINE_integer(
+    'log_every_n_steps', 10,
+    'The frequency with which logs are printed.')
+tf.app.flags.DEFINE_integer(
+    'save_summary_steps', 500,
+    'The frequency with which summaries are saved, in steps.')
+# model related configuration
+tf.app.flags.DEFINE_integer(
+    'train_image_size', 300,
+    'The size of the input image for the model to use.')
+tf.app.flags.DEFINE_integer(
+    'train_epochs', 1,
+    'The number of epochs to use for training.')
+tf.app.flags.DEFINE_integer(
+    'batch_size', 1,
+    'Batch size for training and evaluation.')
+tf.app.flags.DEFINE_string(
+    'data_format', 'channels_last', # 'channels_first' or 'channels_last'
+    'A flag to override the data format used in the model. channels_first '
+    'provides a performance boost on GPU but is not always compatible '
+    'with CPU. If left unspecified, the data format will be chosen '
+    'automatically based on whether TensorFlow was built for CPU or GPU.')
+tf.app.flags.DEFINE_float(
+    'negative_ratio', 3., 'Negative ratio in the loss function.')
+tf.app.flags.DEFINE_float(
+    'match_threshold', 0.5, 'Matching threshold in the loss function.')
+tf.app.flags.DEFINE_float(
+    'neg_threshold', 0.5, 'Matching threshold for the negative examples in the loss function.')
+tf.app.flags.DEFINE_float(
+    'select_threshold', 0.01, 'Class-specific confidence score threshold for selecting a box.')
+tf.app.flags.DEFINE_float(
+    'min_size', 0.03, 'The min size of bboxes to keep.')
+tf.app.flags.DEFINE_float(
+    'nms_threshold', 0.45, 'Matching threshold in NMS algorithm.')
+tf.app.flags.DEFINE_integer(
+    'nms_topk', 200, 'Number of objects to keep after NMS.')
+tf.app.flags.DEFINE_integer(
+    'keep_topk', 400, 'Number of objects to keep for each image before NMS.')
+# optimizer related configuration
+tf.app.flags.DEFINE_float(
+    'weight_decay', 5e-4, 'The weight decay on the model weights.')
+# checkpoint related configuration
+tf.app.flags.DEFINE_string(
+    'checkpoint_path', './model',
+    'The path to a checkpoint from which to fine-tune.')
+tf.app.flags.DEFINE_string(
+    'model_scope', 'ssd300',
+    'Model scope name used to replace the name_scope in checkpoint.')
+
+FLAGS = tf.app.flags.FLAGS
+#CUDA_VISIBLE_DEVICES
+
+def get_checkpoint():
+    if tf.train.latest_checkpoint(FLAGS.model_dir):
+        tf.logging.info('Ignoring --checkpoint_path because a checkpoint already exists in %s' % FLAGS.model_dir)
+        return None
+
+    if tf.gfile.IsDirectory(FLAGS.checkpoint_path):
+        checkpoint_path = tf.train.latest_checkpoint(FLAGS.checkpoint_path)
+    else:
+        checkpoint_path = FLAGS.checkpoint_path
+
+    return checkpoint_path
+
+# Couldn't find a better way to pass params from input_fn to model_fn.
+# Some tensors used by model_fn must be created in input_fn to ensure they are in the same graph,
+# but when we put these tensors into the labels dict, replicate_model_fn splits them across the GPUs.
+# The problem is that they shouldn't be split.
+global_anchor_info = dict()
+
+def input_pipeline(dataset_pattern='train-*', is_training=True, batch_size=FLAGS.batch_size):
+    def input_fn():
+        out_shape = [FLAGS.train_image_size] * 2
+        anchor_creator = anchor_manipulator.AnchorCreator(out_shape,
+                                                    layers_shapes = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
+                                                    anchor_scales = [(0.1,), (0.2,), (0.375,), (0.55,), (0.725,), (0.9,)],
+                                                    extra_anchor_scales = [(0.1414,), (0.2739,), (0.4541,), (0.6315,), (0.8078,), (0.9836,)],
+                                                    anchor_ratios = [(1., 2., .5), (1., 2., 3., .5, 0.3333), (1., 2., 3., .5, 0.3333), (1., 2., 3., .5, 0.3333), (1., 2., .5), (1., 2., .5)],
+                                                    #anchor_ratios = [(2., .5), (2., 3., .5, 0.3333), (2., 3., .5, 0.3333), (2., 3., .5, 0.3333), (2., .5), (2., .5)],
+                                                    layer_steps = [8, 16, 32, 64, 100, 300])
+        all_anchors, all_num_anchors_depth, all_num_anchors_spatial = anchor_creator.get_all_anchors()
+
+        num_anchors_per_layer = []
+        for ind in range(len(all_anchors)):
+            num_anchors_per_layer.append(all_num_anchors_depth[ind] * all_num_anchors_spatial[ind])
+
+        anchor_encoder_decoder = anchor_manipulator.AnchorEncoder(allowed_borders = [1.0] * 6,
+                                                            positive_threshold = FLAGS.match_threshold,
+                                                            ignore_threshold = FLAGS.neg_threshold,
+                                                            prior_scaling=[0.1, 0.1, 0.2, 0.2])
+
+        image_preprocessing_fn = lambda image_, labels_, bboxes_ : ssd_preprocessing.preprocess_image(image_, labels_, bboxes_, out_shape, is_training=is_training, data_format=FLAGS.data_format, output_rgb=False)
+        anchor_encoder_fn = lambda glabels_, gbboxes_: anchor_encoder_decoder.encode_all_anchors(glabels_, gbboxes_, all_anchors, all_num_anchors_depth, all_num_anchors_spatial)
+
+        image, filename, shape, loc_targets, cls_targets, match_scores = dataset_common.slim_get_batch(FLAGS.num_classes,
+                                                                                batch_size,
+                                                                                ('train' if is_training else 'val'),
+                                                                                os.path.join(FLAGS.data_dir, dataset_pattern),
+                                                                                FLAGS.num_readers,
+                                                                                FLAGS.num_preprocessing_threads,
+                                                                                image_preprocessing_fn,
+                                                                                anchor_encoder_fn,
+                                                                                num_epochs=FLAGS.train_epochs,
+                                                                                is_training=is_training)
+        global global_anchor_info
+        global_anchor_info = {'decode_fn': lambda pred : anchor_encoder_decoder.decode_all_anchors(pred, num_anchors_per_layer),
+                            'num_anchors_per_layer': num_anchors_per_layer,
+                            'all_num_anchors_depth': all_num_anchors_depth }
+
+        return {'image': image, 'filename': filename, 'shape': shape, 'loc_targets': loc_targets, 'cls_targets': cls_targets, 'match_scores': match_scores}, None
+    return input_fn
+
+def modified_smooth_l1(bbox_pred, bbox_targets, bbox_inside_weights=1., bbox_outside_weights=1., sigma=1.):
+    """
+        ResultLoss = outside_weights * SmoothL1(inside_weights * (bbox_pred - bbox_targets))
+        SmoothL1(x) = 0.5 * (sigma * x)^2,    if |x| < 1 / sigma^2
+                      |x| - 0.5 / sigma^2,    otherwise
+    """
+    with tf.name_scope('smooth_l1', values=[bbox_pred, bbox_targets]):
+        sigma2 = sigma * sigma
+
+        inside_mul = tf.multiply(bbox_inside_weights, tf.subtract(bbox_pred, bbox_targets))
+
+        smooth_l1_sign = tf.cast(tf.less(tf.abs(inside_mul), 1.0 / sigma2), tf.float32)
+        smooth_l1_option1 = tf.multiply(tf.multiply(inside_mul, inside_mul), 0.5 * sigma2)
+        smooth_l1_option2 = tf.subtract(tf.abs(inside_mul), 0.5 / sigma2)
+        smooth_l1_result = tf.add(tf.multiply(smooth_l1_option1, smooth_l1_sign),
+                                  tf.multiply(smooth_l1_option2, tf.abs(tf.subtract(smooth_l1_sign, 1.0))))
+
+        outside_mul = tf.multiply(bbox_outside_weights, smooth_l1_result)
+
+        return outside_mul
+
+def select_bboxes(scores_pred, bboxes_pred, num_classes, select_threshold):
+    selected_bboxes = {}
+    selected_scores = {}
+    with tf.name_scope('select_bboxes', values=[scores_pred, bboxes_pred]):
+        for class_ind in range(1, num_classes):
+            class_scores = scores_pred[:, class_ind]
+            select_mask = class_scores > select_threshold
+
+            select_mask = tf.cast(select_mask, tf.float32)
+            selected_bboxes[class_ind] = tf.multiply(bboxes_pred, tf.expand_dims(select_mask, axis=-1))
+            selected_scores[class_ind] = tf.multiply(class_scores, select_mask)
+
+    return selected_bboxes, selected_scores
+
+def clip_bboxes(ymin, xmin, ymax, xmax, name):
+    with tf.name_scope(name, 'clip_bboxes', [ymin, xmin, ymax, xmax]):
+        ymin = tf.maximum(ymin, 0.)
+        xmin = tf.maximum(xmin, 0.)
+        ymax = tf.minimum(ymax, 1.)
+        xmax = tf.minimum(xmax, 1.)
+
+        ymin = tf.minimum(ymin, ymax)
+        xmin = tf.minimum(xmin, xmax)
+
+        return ymin, xmin, ymax, xmax
+
+def filter_bboxes(scores_pred, ymin, xmin, ymax, xmax, min_size, name):
+    with tf.name_scope(name, 'filter_bboxes', [scores_pred, ymin, xmin, ymax, xmax]):
+        width = xmax - xmin
+        height = ymax - ymin
+
+        filter_mask = tf.logical_and(width > min_size, height > min_size)
+
+        filter_mask = tf.cast(filter_mask, tf.float32)
+        return tf.multiply(ymin, filter_mask), tf.multiply(xmin, filter_mask), \
+                tf.multiply(ymax, filter_mask), tf.multiply(xmax, filter_mask), tf.multiply(scores_pred, filter_mask)
+
+def sort_bboxes(scores_pred, ymin, xmin, ymax, xmax, keep_topk, name):
+    with tf.name_scope(name, 'sort_bboxes', [scores_pred, ymin, xmin, ymax, xmax]):
+        cur_bboxes = tf.shape(scores_pred)[0]
+        scores, idxes = tf.nn.top_k(scores_pred, k=tf.minimum(keep_topk, cur_bboxes), sorted=True)
+
+        ymin, xmin, ymax, xmax = tf.gather(ymin, idxes), tf.gather(xmin, idxes), tf.gather(ymax, idxes), tf.gather(xmax, idxes)
+
+        paddings_scores = tf.expand_dims(tf.stack([0, tf.maximum(keep_topk-cur_bboxes, 0)], axis=0), axis=0)
+
+        return tf.pad(ymin, paddings_scores, "CONSTANT"), tf.pad(xmin, paddings_scores, "CONSTANT"),\
+                tf.pad(ymax, paddings_scores, "CONSTANT"), tf.pad(xmax, paddings_scores, "CONSTANT"),\
+                tf.pad(scores, paddings_scores, "CONSTANT")
+
+def nms_bboxes(scores_pred, bboxes_pred, nms_topk, nms_threshold, name):
+    with tf.name_scope(name, 'nms_bboxes', [scores_pred, bboxes_pred]):
+        idxes = tf.image.non_max_suppression(bboxes_pred, scores_pred, nms_topk, nms_threshold)
+        return tf.gather(scores_pred, idxes), tf.gather(bboxes_pred, idxes)
+
+def parse_by_class(cls_pred, bboxes_pred, num_classes, select_threshold, min_size, keep_topk, nms_topk, nms_threshold):
+    with tf.name_scope('select_bboxes', values=[cls_pred, bboxes_pred]):
+        scores_pred = tf.nn.softmax(cls_pred)
+        selected_bboxes, selected_scores = select_bboxes(scores_pred, bboxes_pred, num_classes, select_threshold)
+        for class_ind in range(1, num_classes):
+            ymin, xmin, ymax, xmax = tf.unstack(selected_bboxes[class_ind], 4, axis=-1)
+            #ymin, xmin, ymax, xmax = tf.split(selected_bboxes[class_ind], 4, axis=-1)
+            #ymin, xmin, ymax, xmax = tf.squeeze(ymin), tf.squeeze(xmin), tf.squeeze(ymax), tf.squeeze(xmax)
+            ymin, xmin, ymax, xmax = clip_bboxes(ymin, xmin, ymax, xmax, 'clip_bboxes_{}'.format(class_ind))
+            ymin, xmin, ymax, xmax, selected_scores[class_ind] = filter_bboxes(selected_scores[class_ind],
+                                                ymin, xmin, ymax, xmax, min_size, 'filter_bboxes_{}'.format(class_ind))
+            ymin, xmin, ymax, xmax, selected_scores[class_ind] = sort_bboxes(selected_scores[class_ind],
+                                                ymin, xmin, ymax, xmax, keep_topk, 'sort_bboxes_{}'.format(class_ind))
+            selected_bboxes[class_ind] = tf.stack([ymin, xmin, ymax, xmax], axis=-1)
+            selected_scores[class_ind], selected_bboxes[class_ind] = nms_bboxes(selected_scores[class_ind], selected_bboxes[class_ind], nms_topk, nms_threshold, 'nms_bboxes_{}'.format(class_ind))
+
+        return selected_bboxes, selected_scores
+
+def ssd_model_fn(features, labels, mode, params):
+    """model_fn for SSD to be used with our Estimator."""
+    filename = features['filename']
+    shape = features['shape']
+    loc_targets = features['loc_targets']
+    cls_targets = features['cls_targets']
+    match_scores = features['match_scores']
+    features = features['image']
+
+    global global_anchor_info
+    decode_fn = global_anchor_info['decode_fn']
+    num_anchors_per_layer = global_anchor_info['num_anchors_per_layer']
+    all_num_anchors_depth = global_anchor_info['all_num_anchors_depth']
+
+    with tf.variable_scope(params['model_scope'], default_name=None, values=[features], reuse=tf.AUTO_REUSE):
+        backbone = ssd_net.VGG16Backbone(params['data_format'])
+        feature_layers = backbone.forward(features, training=(mode == tf.estimator.ModeKeys.TRAIN))
+        #print(feature_layers)
+        location_pred, cls_pred = ssd_net.multibox_head(feature_layers, params['num_classes'], all_num_anchors_depth, data_format=params['data_format'])
+        if params['data_format'] == 'channels_first':
+            cls_pred = [tf.transpose(pred, [0, 2, 3, 1]) for pred in cls_pred]
+            location_pred = [tf.transpose(pred, [0, 2, 3, 1]) for pred in location_pred]
+
+        cls_pred = [tf.reshape(pred, [tf.shape(features)[0], -1, params['num_classes']]) for pred in cls_pred]
+        location_pred = [tf.reshape(pred, [tf.shape(features)[0], -1, 4]) for pred in location_pred]
+
+        cls_pred = tf.concat(cls_pred, axis=1)
+        location_pred = tf.concat(location_pred, axis=1)
+
+        cls_pred = tf.reshape(cls_pred, [-1, params['num_classes']])
+        location_pred = tf.reshape(location_pred, [-1, 4])
+
+    with tf.device('/cpu:0'):
+        bboxes_pred = decode_fn(location_pred)
+        bboxes_pred = tf.concat(bboxes_pred, axis=0)
+        selected_bboxes, selected_scores = parse_by_class(cls_pred, bboxes_pred,
+                                                        params['num_classes'], params['select_threshold'], params['min_size'],
+                                                        params['keep_topk'], params['nms_topk'], params['nms_threshold'])
+
+    predictions = {'filename': filename, 'shape': shape }
+    for class_ind in range(1, params['num_classes']):
+        predictions['scores_{}'.format(class_ind)] = tf.expand_dims(selected_scores[class_ind], axis=0)
+        predictions['bboxes_{}'.format(class_ind)] = tf.expand_dims(selected_bboxes[class_ind], axis=0)
+
+    flaten_cls_targets = tf.reshape(cls_targets, [-1])
+    flaten_match_scores = tf.reshape(match_scores, [-1])
+    flaten_loc_targets = tf.reshape(loc_targets, [-1, 4])
+
+    # each positive example has exactly one label
+    positive_mask = flaten_cls_targets > 0
+    n_positives = tf.count_nonzero(positive_mask)
+
+    batch_n_positives = tf.count_nonzero(cls_targets, -1)
+
+    batch_negtive_mask = tf.equal(cls_targets, 0)#tf.logical_and(tf.equal(cls_targets, 0), match_scores > 0.)
+    batch_n_negtives = tf.count_nonzero(batch_negtive_mask, -1)
+
+    batch_n_neg_select = tf.cast(params['negative_ratio'] * tf.cast(batch_n_positives, tf.float32), tf.int32)
+    batch_n_neg_select = tf.minimum(batch_n_neg_select, tf.cast(batch_n_negtives, tf.int32))
+
+    # hard negative mining for classification
+    predictions_for_bg = tf.nn.softmax(tf.reshape(cls_pred, [tf.shape(features)[0], -1, params['num_classes']]))[:, :, 0]
+    prob_for_negtives = tf.where(batch_negtive_mask,
+                           0. - predictions_for_bg,
+                           # ignore all the positives
+                           0. - tf.ones_like(predictions_for_bg))
+    topk_prob_for_bg, _ = tf.nn.top_k(prob_for_negtives, k=tf.shape(prob_for_negtives)[1])
+    score_at_k = tf.gather_nd(topk_prob_for_bg, tf.stack([tf.range(tf.shape(features)[0]), batch_n_neg_select - 1], axis=-1))
+
+    selected_neg_mask = prob_for_negtives >= tf.expand_dims(score_at_k, axis=-1)
+
+    # include both the selected negative examples and all positive examples
+    final_mask = tf.stop_gradient(tf.logical_or(tf.reshape(tf.logical_and(batch_negtive_mask, selected_neg_mask), [-1]), positive_mask))
+    total_examples = tf.count_nonzero(final_mask)
+
+    cls_pred = tf.boolean_mask(cls_pred, final_mask)
+    location_pred = tf.boolean_mask(location_pred, tf.stop_gradient(positive_mask))
+    flaten_cls_targets = tf.boolean_mask(tf.clip_by_value(flaten_cls_targets, 0, params['num_classes']), final_mask)
+    flaten_loc_targets = tf.stop_gradient(tf.boolean_mask(flaten_loc_targets, positive_mask))
+
+    # Calculate the classification loss (softmax cross entropy); no L2 regularization is added in this script.
+    #cross_entropy = (params['negative_ratio'] + 1.) * tf.cond(n_positives > 0, lambda: tf.losses.sparse_softmax_cross_entropy(labels=glabels, logits=cls_pred), lambda: 0.)
+    cross_entropy = tf.losses.sparse_softmax_cross_entropy(labels=flaten_cls_targets, logits=cls_pred) * (params['negative_ratio'] + 1.)
+    # Create a tensor named cross_entropy for logging purposes.
+    tf.identity(cross_entropy, name='cross_entropy_loss')
+    tf.summary.scalar('cross_entropy_loss', cross_entropy)
+
+    #loc_loss = tf.cond(n_positives > 0, lambda: modified_smooth_l1(location_pred, tf.stop_gradient(flaten_loc_targets), sigma=1.), lambda: tf.zeros_like(location_pred))
+    loc_loss = modified_smooth_l1(location_pred, flaten_loc_targets, sigma=1.)
+    loc_loss = tf.reduce_mean(tf.reduce_sum(loc_loss, axis=-1), name='location_loss')
+    tf.summary.scalar('location_loss', loc_loss)
+    tf.losses.add_loss(loc_loss)
+
+    # Combine the classification and localization losses. Note that no weight decay
+    # term is added here, even though params['weight_decay'] is passed in.
+    total_loss = tf.add(cross_entropy, loc_loss, name='total_loss')
+
+    cls_accuracy = tf.metrics.accuracy(flaten_cls_targets, tf.argmax(cls_pred, axis=-1))
+
+    # Create a tensor named cls_accuracy for logging purposes.
+    tf.identity(cls_accuracy[1], name='cls_accuracy')
+    tf.summary.scalar('cls_accuracy', cls_accuracy[1])
+
+    summary_hook = tf.train.SummarySaverHook(save_steps=params['save_summary_steps'],
+                                        output_dir=params['summary_dir'],
+                                        summary_op=tf.summary.merge_all())
+    if mode == tf.estimator.ModeKeys.PREDICT:
+        return tf.estimator.EstimatorSpec(
+              mode=mode,
+              predictions=predictions,
+              prediction_hooks=[summary_hook],
+              loss=None, train_op=None)
+    else:
+        raise ValueError('This script only supports "PREDICT" mode!')
+
+def parse_comma_list(args):
+    return [float(s.strip()) for s in args.split(',')]
+
+def main(_):
+    # Using the Winograd non-fused algorithms provides a small performance boost.
+    os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = '1'
+
+    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=FLAGS.gpu_memory_fraction)
+    config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False, intra_op_parallelism_threads=FLAGS.num_cpu_threads, inter_op_parallelism_threads=FLAGS.num_cpu_threads, gpu_options=gpu_options)
+
+    # Set up a RunConfig that disables periodic checkpoint saving (this script only runs prediction).
+    run_config = tf.estimator.RunConfig().replace(
+                                        save_checkpoints_secs=None).replace(
+                                        save_checkpoints_steps=None).replace(
+                                        save_summary_steps=FLAGS.save_summary_steps).replace(
+                                        keep_checkpoint_max=5).replace(
+                                        log_step_count_steps=FLAGS.log_every_n_steps).replace(
+                                        session_config=config)
+
+    summary_dir = os.path.join(FLAGS.model_dir, 'predict')
+
+    ssd_detector = tf.estimator.Estimator(
+        model_fn=ssd_model_fn, model_dir=FLAGS.model_dir, config=run_config,
+        params={
+            'select_threshold': FLAGS.select_threshold,
+            'min_size': FLAGS.min_size,
+            'nms_threshold': FLAGS.nms_threshold,
+            'nms_topk': FLAGS.nms_topk,
+            'keep_topk': FLAGS.keep_topk,
+            'data_format': FLAGS.data_format,
+            'batch_size': FLAGS.batch_size,
+            'model_scope': FLAGS.model_scope,
+            'save_summary_steps': FLAGS.save_summary_steps,
+            'summary_dir': summary_dir,
+            'num_classes': FLAGS.num_classes,
+            'negative_ratio': FLAGS.negative_ratio,
+            'match_threshold': FLAGS.match_threshold,
+            'neg_threshold': FLAGS.neg_threshold,
+            'weight_decay': FLAGS.weight_decay,
+        })
+    tensors_to_log = {
+        'ce': 'cross_entropy_loss',
+        'loc': 'location_loss',
+        'loss': 'total_loss',
+        'acc': 'cls_accuracy',
+    }
+    logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=FLAGS.log_every_n_steps,
+                                            formatter=lambda dicts: (', '.join(['%s=%.6f' % (k, v) for k, v in dicts.items()])))
+
+    print('Starting a predict cycle.')
+    pred_results = ssd_detector.predict(input_fn=input_pipeline(dataset_pattern='val-*', is_training=False, batch_size=FLAGS.batch_size),
+                                    hooks=[logging_hook], checkpoint_path=get_checkpoint())#, yield_single_examples=False)
+
+    det_results = list(pred_results)
+    #print(list(det_results))
+
+    #[{'bboxes_1': array([[0.        , 0.        , 0.28459054, 0.5679505 ], [0.3158835 , 0.34792888, 0.7312541 , 1.        ]], dtype=float32), 'scores_17': array([0.01333667, 0.01152573], dtype=float32), 'filename': b'000703.jpg', 'shape': array([334, 500,   3])}]
+    for class_ind in range(1, FLAGS.num_classes):
+        with open(os.path.join(summary_dir, 'results_{}.txt'.format(class_ind)), 'wt') as f:
+            for image_ind, pred in enumerate(det_results):
+                filename = pred['filename']
+                shape = pred['shape']
+                scores = pred['scores_{}'.format(class_ind)]
+                bboxes = pred['bboxes_{}'.format(class_ind)]
+                bboxes[:, 0] = (bboxes[:, 0] * shape[0]).astype(np.int32, copy=False) + 1
+                bboxes[:, 1] = (bboxes[:, 1] * shape[1]).astype(np.int32, copy=False) + 1
+                bboxes[:, 2] = (bboxes[:, 2] * shape[0]).astype(np.int32, copy=False) + 1
+                bboxes[:, 3] = (bboxes[:, 3] * shape[1]).astype(np.int32, copy=False) + 1
+
+                valid_mask = np.logical_and((bboxes[:, 2] - bboxes[:, 0] > 0), (bboxes[:, 3] - bboxes[:, 1] > 0))
+
+                for det_ind in range(valid_mask.shape[0]):
+                    if not valid_mask[det_ind]:
+                        continue
+                    f.write('{:s} {:.3f} {:.1f} {:.1f} {:.1f} {:.1f}\n'.
+                                format(filename.decode('utf8')[:-4], scores[det_ind],
+                                       bboxes[det_ind, 1], bboxes[det_ind, 0],
+                                       bboxes[det_ind, 3], bboxes[det_ind, 2]))
+
+
+if __name__ == '__main__':
+  tf.logging.set_verbosity(tf.logging.INFO)
+  tf.app.run()
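Note on the hard negative mining step in ssd_model_fn above: the sketch below restates the selection rule in plain Python for a single image. It is an illustration under stated assumptions, not part of the patch. The function name select_hard_negatives is hypothetical, labels are assumed to be 0 for background and greater than 0 for objects, and the early return when no negatives are selected is an added guard that the TensorFlow code does not have.

```python
def select_hard_negatives(cls_targets, bg_probs, negative_ratio=3.0):
    """Return a boolean mask marking the negatives kept for the loss.

    cls_targets: per-anchor labels (assumed: 0 = background, >0 = object).
    bg_probs:    per-anchor predicted background probability.
    """
    is_negative = [t == 0 for t in cls_targets]
    n_pos = sum(1 for t in cls_targets if t > 0)
    n_neg = sum(is_negative)

    # Keep at most negative_ratio negatives per positive.
    n_select = min(int(negative_ratio * n_pos), n_neg)
    if n_select == 0:
        # Added guard (assumption): the original indexes n_select - 1 directly.
        return [False] * len(cls_targets)

    # Negate the background probability so that "hard" negatives (low
    # background confidence) rank first; positives get the lowest score.
    scores = [-p if neg else -1.0 for p, neg in zip(bg_probs, is_negative)]
    score_at_k = sorted(scores, reverse=True)[n_select - 1]

    # An anchor is kept if it is a negative and ranks within the top n_select.
    return [neg and s >= score_at_k for s, neg in zip(scores, is_negative)]


if __name__ == '__main__':
    targets = [1, 0, 0, 0, 0, 0]
    bg = [0.1, 0.9, 0.2, 0.8, 0.3, 0.95]
    print(select_hard_negatives(targets, bg))
```

With one positive and negative_ratio=3, the three negatives with the lowest background probability are kept. Ties at the cutoff score are all kept, as in the `>=` comparison used by the TensorFlow version.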
diff --git a/utils/external/ssd_tensorflow/net/ssd_net.py b/utils/external/ssd_tensorflow/net/ssd_net.py
new file mode 100644
index 0000000..e3ea283
--- /dev/null
+++ b/utils/external/ssd_tensorflow/net/ssd_net.py
@@ -0,0 +1,255 @@
+# Copyright 2018 Changan Wang
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#     http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import tensorflow as tf
+
+_BATCH_NORM_DECAY = 0.9
+_BATCH_NORM_EPSILON = 1e-5
+_USE_FUSED_BN = True
+
+# vgg_16/conv2/conv2_1/biases
+# vgg_16/conv4/conv4_3/biases
+# vgg_16/conv1/conv1_1/biases
+# vgg_16/fc6/weights
+# vgg_16/conv3/conv3_2/biases
+# vgg_16/conv5/conv5_3/biases
+# vgg_16/conv3/conv3_1/weights
+# vgg_16/conv4/conv4_2/weights
+# vgg_16/conv1/conv1_1/weights
+# vgg_16/conv5/conv5_3/weights
+# vgg_16/conv4/conv4_1/weights
+# vgg_16/conv3/conv3_3/weights
+# vgg_16/conv5/conv5_2/biases
+# vgg_16/conv3/conv3_2/weights
+# vgg_16/conv4/conv4_2/biases
+# vgg_16/conv5/conv5_2/weights
+# vgg_16/conv3/conv3_1/biases
+# vgg_16/conv2/conv2_2/weights
+# vgg_16/fc7/weights
+# vgg_16/conv5/conv5_1/biases
+# vgg_16/conv1/conv1_2/biases
+# vgg_16/conv2/conv2_2/biases
+# vgg_16/conv4/conv4_1/biases
+# vgg_16/fc7/biases
+# vgg_16/fc6/biases
+# vgg_16/conv4/conv4_3/weights
+# vgg_16/conv2/conv2_1/weights
+# vgg_16/conv5/conv5_1/weights
+# vgg_16/conv3/conv3_3/biases
+# vgg_16/conv1/conv1_2/weights
+
+class ReLuLayer(tf.layers.Layer):
+    def __init__(self, name, **kwargs):
+        super(ReLuLayer, self).__init__(name=name, trainable=False, **kwargs)
+        self._name = name
+    def build(self, input_shape):
+        self._relu = lambda x : tf.nn.relu(x, name=self._name)
+        self.built = True
+
+    def call(self, inputs):
+        return self._relu(inputs)
+
+    def compute_output_shape(self, input_shape):
+        return tf.TensorShape(input_shape)
+
+def forward_module(m, inputs, training=False):
+    if isinstance(m, tf.layers.BatchNormalization) or isinstance(m, tf.layers.Dropout):
+        return m.apply(inputs, training=training)
+    return m.apply(inputs)
+
+class VGG16Backbone(object):
+    def __init__(self, data_format='channels_first'):
+        super(VGG16Backbone, self).__init__()
+        self._data_format = data_format
+        self._bn_axis = -1 if data_format == 'channels_last' else 1
+        #initializer = tf.glorot_uniform_initializer  glorot_normal_initializer
+        self._conv_initializer = tf.glorot_uniform_initializer
+        self._conv_bn_initializer = tf.glorot_uniform_initializer#lambda : tf.truncated_normal_initializer(mean=0.0, stddev=0.005)
+        # VGG layers
+        self._conv1_block = self.conv_block(2, 64, 3, (1, 1), 'conv1')
+        self._pool1 = tf.layers.MaxPooling2D(2, 2, padding='same', data_format=self._data_format, name='pool1')
+        self._conv2_block = self.conv_block(2, 128, 3, (1, 1), 'conv2')
+        self._pool2 = tf.layers.MaxPooling2D(2, 2, padding='same', data_format=self._data_format, name='pool2')
+        self._conv3_block = self.conv_block(3, 256, 3, (1, 1), 'conv3')
+        self._pool3 = tf.layers.MaxPooling2D(2, 2, padding='same', data_format=self._data_format, name='pool3')
+        self._conv4_block = self.conv_block(3, 512, 3, (1, 1), 'conv4')
+        self._pool4 = tf.layers.MaxPooling2D(2, 2, padding='same', data_format=self._data_format, name='pool4')
+        self._conv5_block = self.conv_block(3, 512, 3, (1, 1), 'conv5')
+        self._pool5 = tf.layers.MaxPooling2D(3, 1, padding='same', data_format=self._data_format, name='pool5')
+        self._conv6 = tf.layers.Conv2D(filters=1024, kernel_size=3, strides=1, padding='same', dilation_rate=6,
+                            data_format=self._data_format, activation=tf.nn.relu, use_bias=True,
+                            kernel_initializer=self._conv_initializer(),
+                            bias_initializer=tf.zeros_initializer(),
+                            name='fc6', _scope='fc6', _reuse=None)
+        self._conv7 = tf.layers.Conv2D(filters=1024, kernel_size=1, strides=1, padding='same',
+                            data_format=self._data_format, activation=tf.nn.relu, use_bias=True,
+                            kernel_initializer=self._conv_initializer(),
+                            bias_initializer=tf.zeros_initializer(),
+                            name='fc7', _scope='fc7', _reuse=None)
+        # SSD layers
+        with tf.variable_scope('additional_layers') as scope:
+            self._conv8_block = self.ssd_conv_block(256, 2, 'conv8')
+            self._conv9_block = self.ssd_conv_block(128, 2, 'conv9')
+            self._conv10_block = self.ssd_conv_block(128, 1, 'conv10', padding='valid')
+            self._conv11_block = self.ssd_conv_block(128, 1, 'conv11', padding='valid')
+
+    def l2_normalize(self, x, name):
+        with tf.name_scope(name, "l2_normalize", [x]) as name:
+            axis = -1 if self._data_format == 'channels_last' else 1
+            square_sum = tf.reduce_sum(tf.square(x), axis, keep_dims=True)
+            x_inv_norm = tf.rsqrt(tf.maximum(square_sum, 1e-10))
+            return tf.multiply(x, x_inv_norm, name=name)
+
+    def forward(self, inputs, training=False):
+        # inputs should be in BGR channel order
+        feature_layers = []
+        # forward vgg layers
+        for conv in self._conv1_block:
+            inputs = forward_module(conv, inputs, training=training)
+        inputs = self._pool1.apply(inputs)
+        for conv in self._conv2_block:
+            inputs = forward_module(conv, inputs, training=training)
+        inputs = self._pool2.apply(inputs)
+        for conv in self._conv3_block:
+            inputs = forward_module(conv, inputs, training=training)
+        inputs = self._pool3.apply(inputs)
+        for conv in self._conv4_block:
+            inputs = forward_module(conv, inputs, training=training)
+        # conv4_3
+        with tf.variable_scope('conv4_3_scale') as scope:
+            weight_scale = tf.Variable([20.] * 512, trainable=training, name='weights')
+            if self._data_format == 'channels_last':
+                weight_scale = tf.reshape(weight_scale, [1, 1, 1, -1], name='reshape')
+            else:
+                weight_scale = tf.reshape(weight_scale, [1, -1, 1, 1], name='reshape')
+
+            feature_layers.append(tf.multiply(weight_scale, self.l2_normalize(inputs, name='norm'), name='rescale')
+                                )
+        inputs = self._pool4.apply(inputs)
+        for conv in self._conv5_block:
+            inputs = forward_module(conv, inputs, training=training)
+        inputs = self._pool5.apply(inputs)
+        # forward fc layers
+        inputs = self._conv6.apply(inputs)
+        inputs = self._conv7.apply(inputs)
+        # fc7
+        feature_layers.append(inputs)
+        # forward ssd layers
+        for layer in self._conv8_block:
+            inputs = forward_module(layer, inputs, training=training)
+        # conv8
+        feature_layers.append(inputs)
+        for layer in self._conv9_block:
+            inputs = forward_module(layer, inputs, training=training)
+        # conv9
+        feature_layers.append(inputs)
+        for layer in self._conv10_block:
+            inputs = forward_module(layer, inputs, training=training)
+        # conv10
+        feature_layers.append(inputs)
+        for layer in self._conv11_block:
+            inputs = forward_module(layer, inputs, training=training)
+        # conv11
+        feature_layers.append(inputs)
+
+        return feature_layers
+
+    def conv_block(self, num_blocks, filters, kernel_size, strides, name, reuse=None):
+        with tf.variable_scope(name):
+            conv_blocks = []
+            for ind in range(1, num_blocks + 1):
+                conv_blocks.append(
+                        tf.layers.Conv2D(filters=filters, kernel_size=kernel_size, strides=strides, padding='same',
+                            data_format=self._data_format, activation=tf.nn.relu, use_bias=True,
+                            kernel_initializer=self._conv_initializer(),
+                            bias_initializer=tf.zeros_initializer(),
+                            name='{}_{}'.format(name, ind), _scope='{}_{}'.format(name, ind), _reuse=None)
+                    )
+            return conv_blocks
+
+    def ssd_conv_block(self, filters, strides, name, padding='same', reuse=None):
+        with tf.variable_scope(name):
+            conv_blocks = []
+            conv_blocks.append(
+                    tf.layers.Conv2D(filters=filters, kernel_size=1, strides=1, padding=padding,
+                        data_format=self._data_format, activation=tf.nn.relu, use_bias=True,
+                        kernel_initializer=self._conv_initializer(),
+                        bias_initializer=tf.zeros_initializer(),
+                        name='{}_1'.format(name), _scope='{}_1'.format(name), _reuse=None)
+                )
+            conv_blocks.append(
+                    tf.layers.Conv2D(filters=filters * 2, kernel_size=3, strides=strides, padding=padding,
+                        data_format=self._data_format, activation=tf.nn.relu, use_bias=True,
+                        kernel_initializer=self._conv_initializer(),
+                        bias_initializer=tf.zeros_initializer(),
+                        name='{}_2'.format(name), _scope='{}_2'.format(name), _reuse=None)
+                )
+            return conv_blocks
+
+    def ssd_conv_bn_block(self, filters, strides, name, reuse=None):
+        with tf.variable_scope(name):
+            conv_bn_blocks = []
+            conv_bn_blocks.append(
+                    tf.layers.Conv2D(filters=filters, kernel_size=1, strides=1, padding='same',
+                        data_format=self._data_format, activation=None, use_bias=False,
+                        kernel_initializer=self._conv_bn_initializer(),
+                        bias_initializer=None,
+                        name='{}_1'.format(name), _scope='{}_1'.format(name), _reuse=None)
+                )
+            conv_bn_blocks.append(
+                    tf.layers.BatchNormalization(axis=self._bn_axis, momentum=_BATCH_NORM_DECAY, epsilon=_BATCH_NORM_EPSILON, fused=_USE_FUSED_BN,
+                        name='{}_bn1'.format(name), _scope='{}_bn1'.format(name), _reuse=None)
+                )
+            conv_bn_blocks.append(
+                    ReLuLayer('{}_relu1'.format(name), _scope='{}_relu1'.format(name), _reuse=None)
+                )
+            conv_bn_blocks.append(
+                    tf.layers.Conv2D(filters=filters * 2, kernel_size=3, strides=strides, padding='same',
+                        data_format=self._data_format, activation=None, use_bias=False,
+                        kernel_initializer=self._conv_bn_initializer(),
+                        bias_initializer=None,
+                        name='{}_2'.format(name), _scope='{}_2'.format(name), _reuse=None)
+                )
+            conv_bn_blocks.append(
+                    tf.layers.BatchNormalization(axis=self._bn_axis, momentum=BN_MOMENTUM, epsilon=BN_EPSILON, fused=USE_FUSED_BN,
+                        name='{}_bn2'.format(name), _scope='{}_bn2'.format(name), _reuse=None)
+                )
+            conv_bn_blocks.append(
+                    ReLuLayer('{}_relu2'.format(name), _scope='{}_relu2'.format(name), _reuse=None)
+                )
+            return conv_bn_blocks
+
+def multibox_head(feature_layers, num_classes, num_anchors_depth_per_layer, data_format='channels_first'):
+    with tf.variable_scope('multibox_head'):
+        cls_preds = []
+        loc_preds = []
+        for ind, feat in enumerate(feature_layers):
+            loc_preds.append(tf.layers.conv2d(feat, num_anchors_depth_per_layer[ind] * 4, (3, 3), use_bias=True,
+                        name='loc_{}'.format(ind), strides=(1, 1),
+                        padding='same', data_format=data_format, activation=None,
+                        kernel_initializer=tf.glorot_uniform_initializer(),
+                        bias_initializer=tf.zeros_initializer()))
+            cls_preds.append(tf.layers.conv2d(feat, num_anchors_depth_per_layer[ind] * num_classes, (3, 3), use_bias=True,
+                        name='cls_{}'.format(ind), strides=(1, 1),
+                        padding='same', data_format=data_format, activation=None,
+                        kernel_initializer=tf.glorot_uniform_initializer(),
+                        bias_initializer=tf.zeros_initializer()))
+
+        return loc_preds, cls_preds
+
+
diff --git a/utils/external/ssd_tensorflow/preprocessing/preprocessing_unittest.py b/utils/external/ssd_tensorflow/preprocessing/preprocessing_unittest.py
new file mode 100644
index 0000000..92e4167
--- /dev/null
+++ b/utils/external/ssd_tensorflow/preprocessing/preprocessing_unittest.py
@@ -0,0 +1,131 @@
+# Copyright 2018 Changan Wang
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#     http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+
+import tensorflow as tf
+from scipy.misc import imread, imsave, imshow, imresize
+import numpy as np
+import sys; sys.path.insert(0, ".")
+from utility import draw_toolbox
+import ssd_preprocessing
+
+slim = tf.contrib.slim
+
+def save_image_with_bbox(image, labels_, scores_, bboxes_):
+    if not hasattr(save_image_with_bbox, "counter"):
+        save_image_with_bbox.counter = 0  # it doesn't exist yet, so initialize it
+    save_image_with_bbox.counter += 1
+
+    img_to_draw = np.copy(image)
+
+    img_to_draw = draw_toolbox.bboxes_draw_on_img(img_to_draw, labels_, scores_, bboxes_, thickness=2)
+    imsave(os.path.join('./debug', '{}.jpg'.format(save_image_with_bbox.counter)), img_to_draw)
+    return save_image_with_bbox.counter
+
+def slim_get_split(file_pattern='{}_????'):
+    # Features in Pascal VOC TFRecords.
+    keys_to_features = {
+        'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
+        'image/format': tf.FixedLenFeature((), tf.string, default_value='jpeg'),
+        'image/filename': tf.FixedLenFeature((), tf.string, default_value=''),
+        'image/height': tf.FixedLenFeature([1], tf.int64),
+        'image/width': tf.FixedLenFeature([1], tf.int64),
+        'image/channels': tf.FixedLenFeature([1], tf.int64),
+        'image/shape': tf.FixedLenFeature([3], tf.int64),
+        'image/object/bbox/xmin': tf.VarLenFeature(dtype=tf.float32),
+        'image/object/bbox/ymin': tf.VarLenFeature(dtype=tf.float32),
+        'image/object/bbox/xmax': tf.VarLenFeature(dtype=tf.float32),
+        'image/object/bbox/ymax': tf.VarLenFeature(dtype=tf.float32),
+        'image/object/bbox/label': tf.VarLenFeature(dtype=tf.int64),
+        'image/object/bbox/difficult': tf.VarLenFeature(dtype=tf.int64),
+        'image/object/bbox/truncated': tf.VarLenFeature(dtype=tf.int64),
+    }
+    items_to_handlers = {
+        'image': slim.tfexample_decoder.Image('image/encoded', 'image/format'),
+        'filename': slim.tfexample_decoder.Tensor('image/filename'),
+        'shape': slim.tfexample_decoder.Tensor('image/shape'),
+        'object/bbox': slim.tfexample_decoder.BoundingBox(
+                ['ymin', 'xmin', 'ymax', 'xmax'], 'image/object/bbox/'),
+        'object/label': slim.tfexample_decoder.Tensor('image/object/bbox/label'),
+        'object/difficult': slim.tfexample_decoder.Tensor('image/object/bbox/difficult'),
+        'object/truncated': slim.tfexample_decoder.Tensor('image/object/bbox/truncated'),
+    }
+    decoder = slim.tfexample_decoder.TFExampleDecoder(keys_to_features, items_to_handlers)
+
+    dataset = slim.dataset.Dataset(
+                data_sources=file_pattern,
+                reader=tf.TFRecordReader,
+                decoder=decoder,
+                num_samples=100,
+                items_to_descriptions=None,
+                num_classes=21,
+                labels_to_names=None)
+
+    with tf.name_scope('dataset_data_provider'):
+        provider = slim.dataset_data_provider.DatasetDataProvider(
+                    dataset,
+                    num_readers=2,
+                    common_queue_capacity=32,
+                    common_queue_min=8,
+                    shuffle=True,
+                    num_epochs=1)
+
+    [org_image, filename, shape, glabels_raw, gbboxes_raw, isdifficult] = provider.get(['image', 'filename', 'shape',
+                                                                         'object/label',
+                                                                         'object/bbox',
+                                                                         'object/difficult'])
+    image, glabels, gbboxes = ssd_preprocessing.preprocess_image(org_image, glabels_raw, gbboxes_raw, [300, 300], is_training=True, data_format='channels_first', output_rgb=True)
+
+    image = tf.transpose(image, perm=(1, 2, 0))
+    save_image_op = tf.py_func(save_image_with_bbox,
+                            [ssd_preprocessing.unwhiten_image(image),
+                            tf.clip_by_value(glabels, 0, tf.int64.max),
+                            tf.ones_like(glabels),
+                            gbboxes],
+                            tf.int64, stateful=True)
+    return save_image_op
+
+if __name__ == '__main__':
+    save_image_op = slim_get_split('/media/rs/7A0EE8880EE83EAF/Detections/SSD/dataset/tfrecords/*')
+    # Create the graph, etc.
+    init_op = tf.group([tf.global_variables_initializer(), tf.local_variables_initializer(), tf.tables_initializer()])
+
+    # Create a session for running operations in the Graph.
+    sess = tf.Session()
+    # Initialize the variables (like the epoch counter).
+    sess.run(init_op)
+
+    # Start input enqueue threads.
+    coord = tf.train.Coordinator()
+    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
+
+    try:
+        while not coord.should_stop():
+            # Run the preprocessing op; it returns the index of the debug image just saved.
+            print(sess.run(save_image_op))
+
+    except tf.errors.OutOfRangeError:
+        print('Done -- epoch limit reached')
+    finally:
+        # When done, ask the threads to stop.
+        coord.request_stop()
+
+    # Wait for threads to finish.
+    coord.join(threads)
+    sess.close()
diff --git a/utils/external/ssd_tensorflow/preprocessing/ssd_preprocessing.py b/utils/external/ssd_tensorflow/preprocessing/ssd_preprocessing.py
new file mode 100644
index 0000000..3ab8dcc
--- /dev/null
+++ b/utils/external/ssd_tensorflow/preprocessing/ssd_preprocessing.py
@@ -0,0 +1,521 @@
+# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Provides utilities to preprocess images.
+
+The preprocessing steps for VGG were introduced in the following technical
+report:
+
+  Very Deep Convolutional Networks For Large-Scale Image Recognition
+  Karen Simonyan and Andrew Zisserman
+  arXiv technical report, 2015
+  PDF: http://arxiv.org/pdf/1409.1556.pdf
+  ILSVRC 2014 Slides: http://www.robots.ox.ac.uk/~karen/pdf/ILSVRC_2014.pdf
+  CC-BY-4.0
+
+More information can be obtained from the VGG website:
+www.robots.ox.ac.uk/~vgg/research/very_deep/
+"""
+
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import tensorflow as tf
+from tensorflow.python.ops import control_flow_ops
+
+slim = tf.contrib.slim
+
+_R_MEAN = 123.68
+_G_MEAN = 116.78
+_B_MEAN = 103.94
+
+def _ImageDimensions(image, rank=3):
+  """Returns the dimensions of an image tensor.
+
+  Args:
+    image: A Tensor of rank `rank`. For a 3-D image, the shape is `[height, width, channels]`.
+    rank: The expected rank of the image.
+
+  Returns:
+    A list corresponding to the dimensions of the
+    input image.  Dimensions that are statically known are python integers,
+    otherwise they are integer scalar tensors.
+  """
+  if image.get_shape().is_fully_defined():
+    return image.get_shape().as_list()
+  else:
+    static_shape = image.get_shape().with_rank(rank).as_list()
+    dynamic_shape = tf.unstack(tf.shape(image), rank)
+    return [s if s is not None else d
+            for s, d in zip(static_shape, dynamic_shape)]
+
+def apply_with_random_selector(x, func, num_cases):
+  """Computes func(x, sel), with sel sampled from [0...num_cases-1].
+
+  Args:
+    x: input Tensor.
+    func: Python function to apply.
+    num_cases: Python int32, number of cases to sample sel from.
+
+  Returns:
+    The result of func(x, sel), where func receives the value of the
+    selector as a python integer, but sel is sampled dynamically.
+  """
+  sel = tf.random_uniform([], maxval=num_cases, dtype=tf.int32)
+  # Pass the real x only to one of the func calls.
+  return control_flow_ops.merge([
+      func(control_flow_ops.switch(x, tf.equal(sel, case))[1], case)
+      for case in range(num_cases)])[0]
+
+
+def distort_color(image, color_ordering=0, fast_mode=True, scope=None):
+  """Distort the color of a Tensor image.
+
+  Each color distortion is non-commutative and thus ordering of the color ops
+  matters. Ideally we would randomly permute the ordering of the color ops.
+  Rather than adding that level of complication, we select a distinct ordering
+  of color ops for each preprocessing thread.
+
+  Args:
+    image: 3-D Tensor containing single image in [0, 1].
+    color_ordering: Python int, a type of distortion (valid values: 0-3).
+    fast_mode: Avoids slower ops (random_hue and random_contrast).
+    scope: Optional scope for name_scope.
+  Returns:
+    3-D Tensor color-distorted image on range [0, 1]
+  Raises:
+    ValueError: if fast_mode is False and color_ordering is not in [0, 3]
+  """
+  with tf.name_scope(scope, 'distort_color', [image]):
+    if fast_mode:
+      if color_ordering == 0:
+        image = tf.image.random_brightness(image, max_delta=32. / 255.)
+        image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
+      else:
+        image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
+        image = tf.image.random_brightness(image, max_delta=32. / 255.)
+    else:
+      if color_ordering == 0:
+        image = tf.image.random_brightness(image, max_delta=32. / 255.)
+        image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
+        image = tf.image.random_hue(image, max_delta=0.2)
+        image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
+      elif color_ordering == 1:
+        image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
+        image = tf.image.random_brightness(image, max_delta=32. / 255.)
+        image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
+        image = tf.image.random_hue(image, max_delta=0.2)
+      elif color_ordering == 2:
+        image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
+        image = tf.image.random_hue(image, max_delta=0.2)
+        image = tf.image.random_brightness(image, max_delta=32. / 255.)
+        image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
+      elif color_ordering == 3:
+        image = tf.image.random_hue(image, max_delta=0.2)
+        image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
+        image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
+        image = tf.image.random_brightness(image, max_delta=32. / 255.)
+      else:
+        raise ValueError('color_ordering must be in [0, 3]')
+
+    # The random_* ops do not necessarily clamp.
+    return tf.clip_by_value(image, 0.0, 1.0)
+
+def ssd_random_sample_patch(image, labels, bboxes, ratio_list=[0.1, 0.3, 0.5, 0.7, 0.9, 1.], name=None):
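+  '''Randomly samples a patch from the image (SSD-style augmentation).
+  1. Select one min_iou from ratio_list.
+  2. Sample _width and _height as a random fraction in [0.3, 0.999) of width and height.
+  3. Resample (up to 10 attempts) until the aspect ratio is between 0.5 and 2.
+  4. Select the top-left point from [0, width - _width) x [0, height - _height).
+  5. Keep the ground-truth bboxes whose centers lie inside the sampled patch; if none remain, try again.
+  6. Check that the patch has an IoU of at least min_iou with every kept bbox; otherwise try again.
+  '''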
+  def sample_width_height(width, height):
+    with tf.name_scope('sample_width_height'):
+      index = 0
+      max_attempt = 10
+      sampled_width, sampled_height = width, height
+
+      def condition(index, sampled_width, sampled_height, width, height):
+        return tf.logical_or(tf.logical_and(tf.logical_or(tf.greater(sampled_width, sampled_height * 2),
+                                                        tf.greater(sampled_height, sampled_width * 2)),
+                                            tf.less(index, max_attempt)),
+                            tf.less(index, 1))
+
+      def body(index, sampled_width, sampled_height, width, height):
+        sampled_width = tf.random_uniform([1], minval=0.3, maxval=0.999, dtype=tf.float32)[0] * width
+        sampled_height = tf.random_uniform([1], minval=0.3, maxval=0.999, dtype=tf.float32)[0] * height
+
+        return index+1, sampled_width, sampled_height, width, height
+
+      [index, sampled_width, sampled_height, _, _] = tf.while_loop(condition, body,
+                                         [index, sampled_width, sampled_height, width, height], parallel_iterations=4, back_prop=False, swap_memory=True)
+
+      return tf.cast(sampled_width, tf.int32), tf.cast(sampled_height, tf.int32)
+
+  def jaccard_with_anchors(roi, bboxes):
+    with tf.name_scope('jaccard_with_anchors'):
+      int_ymin = tf.maximum(roi[0], bboxes[:, 0])
+      int_xmin = tf.maximum(roi[1], bboxes[:, 1])
+      int_ymax = tf.minimum(roi[2], bboxes[:, 2])
+      int_xmax = tf.minimum(roi[3], bboxes[:, 3])
+      h = tf.maximum(int_ymax - int_ymin, 0.)
+      w = tf.maximum(int_xmax - int_xmin, 0.)
+      inter_vol = h * w
+      union_vol = (roi[3] - roi[1]) * (roi[2] - roi[0]) + ((bboxes[:, 2] - bboxes[:, 0]) * (bboxes[:, 3] - bboxes[:, 1]) - inter_vol)
+      jaccard = tf.div(inter_vol, union_vol)
+      return jaccard
+
+  def areas(bboxes):
+    with tf.name_scope('bboxes_areas'):
+      vol = (bboxes[:, 3] - bboxes[:, 1]) * (bboxes[:, 2] - bboxes[:, 0])
+      return vol
+
+  def check_roi_center(width, height, labels, bboxes):
+    with tf.name_scope('check_roi_center'):
+      index = 0
+      max_attempt = 20
+      roi = [0., 0., 0., 0.]
+      float_width = tf.cast(width, tf.float32)
+      float_height = tf.cast(height, tf.float32)
+      mask = tf.cast(tf.zeros_like(labels, dtype=tf.uint8), tf.bool)
+      center_x, center_y = (bboxes[:, 1] + bboxes[:, 3]) / 2, (bboxes[:, 0] + bboxes[:, 2]) / 2
+
+      def condition(index, roi, mask):
+        return tf.logical_or(tf.logical_and(tf.reduce_sum(tf.cast(mask, tf.int32)) < 1,
+                                          tf.less(index, max_attempt)),
+                            tf.less(index, 1))
+
+      def body(index, roi, mask):
+        sampled_width, sampled_height = sample_width_height(float_width, float_height)
+
+        x = tf.random_uniform([], minval=0, maxval=width - sampled_width, dtype=tf.int32)
+        y = tf.random_uniform([], minval=0, maxval=height - sampled_height, dtype=tf.int32)
+
+        roi = [tf.cast(y, tf.float32) / float_height,
+              tf.cast(x, tf.float32) / float_width,
+              tf.cast(y + sampled_height, tf.float32) / float_height,
+              tf.cast(x + sampled_width, tf.float32) / float_width]
+
+        mask_min = tf.logical_and(tf.greater(center_y, roi[0]), tf.greater(center_x, roi[1]))
+        mask_max = tf.logical_and(tf.less(center_y, roi[2]), tf.less(center_x, roi[3]))
+        mask = tf.logical_and(mask_min, mask_max)
+
+        return index + 1, roi, mask
+
+      [index, roi, mask] = tf.while_loop(condition, body, [index, roi, mask], parallel_iterations=10, back_prop=False, swap_memory=True)
+
+      mask_labels = tf.boolean_mask(labels, mask)
+      mask_bboxes = tf.boolean_mask(bboxes, mask)
+
+      return roi, mask_labels, mask_bboxes
+  def check_roi_overlap(width, height, labels, bboxes, min_iou):
+    with tf.name_scope('check_roi_overlap'):
+      index = 0
+      max_attempt = 50
+      roi = [0., 0., 1., 1.]
+      mask_labels = labels
+      mask_bboxes = bboxes
+
+      def condition(index, roi, mask_labels, mask_bboxes):
+        return tf.logical_or(tf.logical_or(tf.logical_and(tf.reduce_sum(tf.cast(jaccard_with_anchors(roi, mask_bboxes) < min_iou, tf.int32)) > 0,
+                                                        tf.less(index, max_attempt)),
+                                          tf.less(index, 1)),
+                            tf.less(tf.shape(mask_labels)[0], 1))
+
+      def body(index, roi, mask_labels, mask_bboxes):
+        roi, mask_labels, mask_bboxes = check_roi_center(width, height, labels, bboxes)
+        return index+1, roi, mask_labels, mask_bboxes
+
+      [index, roi, mask_labels, mask_bboxes] = tf.while_loop(condition, body, [index, roi, mask_labels, mask_bboxes], parallel_iterations=16, back_prop=False, swap_memory=True)
+
+      return tf.cond(tf.greater(tf.shape(mask_labels)[0], 0),
+                  lambda : (tf.cast([roi[0] * tf.cast(height, tf.float32),
+                            roi[1] * tf.cast(width, tf.float32),
+                            (roi[2] - roi[0]) * tf.cast(height, tf.float32),
+                            (roi[3] - roi[1]) * tf.cast(width, tf.float32)], tf.int32), mask_labels, mask_bboxes),
+                  lambda : (tf.cast([0, 0, height, width], tf.int32), labels, bboxes))
+
+
+  def sample_patch(image, labels, bboxes, min_iou):
+    with tf.name_scope('sample_patch'):
+      height, width, depth = _ImageDimensions(image, rank=3)
+
+      roi_slice_range, mask_labels, mask_bboxes = check_roi_overlap(width, height, labels, bboxes, min_iou)
+
+      scale = tf.cast(tf.stack([height, width, height, width]), mask_bboxes.dtype)
+      mask_bboxes = mask_bboxes * scale
+
+      # Add offset.
+      offset = tf.cast(tf.stack([roi_slice_range[0], roi_slice_range[1], roi_slice_range[0], roi_slice_range[1]]), mask_bboxes.dtype)
+      mask_bboxes = mask_bboxes - offset
+
+      clipped_ymin = tf.maximum(0., mask_bboxes[:, 0])
+      clipped_xmin = tf.maximum(0., mask_bboxes[:, 1])
+      clipped_ymax = tf.minimum(tf.cast(roi_slice_range[2], tf.float32), mask_bboxes[:, 2])
+      clipped_xmax = tf.minimum(tf.cast(roi_slice_range[3], tf.float32), mask_bboxes[:, 3])
+
+      mask_bboxes = tf.stack([clipped_ymin, clipped_xmin, clipped_ymax, clipped_xmax], axis=-1)
+      # Rescale to target dimension.
+      scale = tf.cast(tf.stack([roi_slice_range[2], roi_slice_range[3],
+                                roi_slice_range[2], roi_slice_range[3]]), mask_bboxes.dtype)
+
+      return tf.cond(tf.logical_or(tf.less(roi_slice_range[2], 1), tf.less(roi_slice_range[3], 1)),
+                  lambda: (image, labels, bboxes),
+                  lambda: (tf.slice(image, [roi_slice_range[0], roi_slice_range[1], 0], [roi_slice_range[2], roi_slice_range[3], -1]),
+                                  mask_labels, mask_bboxes / scale))
+
+  with tf.name_scope('ssd_random_sample_patch'):
+    image = tf.convert_to_tensor(image, name='image')
+
+    min_iou_list = tf.convert_to_tensor(ratio_list)
+    samples_min_iou = tf.multinomial(tf.log([[1. / len(ratio_list)] * len(ratio_list)]), 1)
+
+    sampled_min_iou = min_iou_list[tf.cast(samples_min_iou[0][0], tf.int32)]
+
+    return tf.cond(tf.less(sampled_min_iou, 1.), lambda: sample_patch(image, labels, bboxes, sampled_min_iou), lambda: (image, labels, bboxes))
+
+def ssd_random_expand(image, bboxes, ratio=2., name=None):
+  with tf.name_scope('ssd_random_expand'):
+    image = tf.convert_to_tensor(image, name='image')
+    if image.get_shape().ndims != 3:
+      raise ValueError('\'image\' must have 3 dimensions.')
+
+    height, width, depth = _ImageDimensions(image, rank=3)
+
+    float_height, float_width = tf.to_float(height), tf.to_float(width)
+
+    canvas_width, canvas_height = tf.to_int32(float_width * ratio), tf.to_int32(float_height * ratio)
+
+    mean_color_of_image = [_R_MEAN/255., _G_MEAN/255., _B_MEAN/255.]#tf.reduce_mean(tf.reshape(image, [-1, 3]), 0)
+
+    x = tf.random_uniform([], minval=0, maxval=canvas_width - width, dtype=tf.int32)
+    y = tf.random_uniform([], minval=0, maxval=canvas_height - height, dtype=tf.int32)
+
+    paddings = tf.convert_to_tensor([[y, canvas_height - height - y], [x, canvas_width - width - x]])
+
+    big_canvas = tf.stack([tf.pad(image[:, :, 0], paddings, "CONSTANT", constant_values = mean_color_of_image[0]),
+                          tf.pad(image[:, :, 1], paddings, "CONSTANT", constant_values = mean_color_of_image[1]),
+                          tf.pad(image[:, :, 2], paddings, "CONSTANT", constant_values = mean_color_of_image[2])], axis=-1)
+
+    scale = tf.cast(tf.stack([height, width, height, width]), bboxes.dtype)
+    absolute_bboxes = bboxes * scale + tf.cast(tf.stack([y, x, y, x]), bboxes.dtype)
+
+    return big_canvas, absolute_bboxes / tf.cast(tf.stack([canvas_height, canvas_width, canvas_height, canvas_width]), bboxes.dtype)
+
+# def ssd_random_sample_patch_wrapper(image, labels, bboxes):
+#   with tf.name_scope('ssd_random_sample_patch_wrapper'):
+#     orgi_image, orgi_labels, orgi_bboxes = image, labels, bboxes
+#     def check_bboxes(bboxes):
+#       areas = (bboxes[:, 3] - bboxes[:, 1]) * (bboxes[:, 2] - bboxes[:, 0])
+#       return tf.logical_and(tf.logical_and(areas < 0.9, areas > 0.001),
+#                             tf.logical_and((bboxes[:, 3] - bboxes[:, 1]) > 0.025, (bboxes[:, 2] - bboxes[:, 0]) > 0.025))
+
+#     index = 0
+#     max_attempt = 3
+#     def condition(index, image, labels, bboxes):
+#       return tf.logical_or(tf.logical_and(tf.reduce_sum(tf.cast(check_bboxes(bboxes), tf.int64)) < 1, tf.less(index, max_attempt)), tf.less(index, 1))
+
+#     def body(index, image, labels, bboxes):
+#       image, bboxes = tf.cond(tf.random_uniform([], minval=0., maxval=1., dtype=tf.float32) < 0.5,
+#                       lambda: (image, bboxes),
+#                       lambda: ssd_random_expand(image, bboxes, tf.random_uniform([1], minval=1.1, maxval=4., dtype=tf.float32)[0]))
+#       # Distort image and bounding boxes.
+#       random_sample_image, labels, bboxes = ssd_random_sample_patch(image, labels, bboxes, ratio_list=[-0.1, 0.1, 0.3, 0.5, 0.7, 0.9, 1.])
+#       random_sample_image.set_shape([None, None, 3])
+#       return index+1, random_sample_image, labels, bboxes
+
+#     [index, image, labels, bboxes] = tf.while_loop(condition, body, [index, orgi_image, orgi_labels, orgi_bboxes], parallel_iterations=4, back_prop=False, swap_memory=True)
+
+#     valid_mask = check_bboxes(bboxes)
+#     labels, bboxes = tf.boolean_mask(labels, valid_mask), tf.boolean_mask(bboxes, valid_mask)
+#     return tf.cond(tf.less(index, max_attempt),
+#                 lambda : (image, labels, bboxes),
+#                 lambda : (orgi_image, orgi_labels, orgi_bboxes))
+
+def ssd_random_sample_patch_wrapper(image, labels, bboxes):
+  with tf.name_scope('ssd_random_sample_patch_wrapper'):
+    orgi_image, orgi_labels, orgi_bboxes = image, labels, bboxes
+    def check_bboxes(bboxes):
+      areas = (bboxes[:, 3] - bboxes[:, 1]) * (bboxes[:, 2] - bboxes[:, 0])
+      return tf.logical_and(tf.logical_and(areas < 0.9, areas > 0.001),
+                            tf.logical_and((bboxes[:, 3] - bboxes[:, 1]) > 0.025, (bboxes[:, 2] - bboxes[:, 0]) > 0.025))
+
+    index = 0
+    max_attempt = 3
+    def condition(index, image, labels, bboxes, orgi_image, orgi_labels, orgi_bboxes):
+      return tf.logical_or(tf.logical_and(tf.reduce_sum(tf.cast(check_bboxes(bboxes), tf.int64)) < 1, tf.less(index, max_attempt)), tf.less(index, 1))
+
+    def body(index, image, labels, bboxes, orgi_image, orgi_labels, orgi_bboxes):
+      image, bboxes = tf.cond(tf.random_uniform([], minval=0., maxval=1., dtype=tf.float32) < 0.5,
+                      lambda: (orgi_image, orgi_bboxes),
+                      lambda: ssd_random_expand(orgi_image, orgi_bboxes, tf.random_uniform([1], minval=1.1, maxval=4., dtype=tf.float32)[0]))
+      # Distort image and bounding boxes.
+      random_sample_image, labels, bboxes = ssd_random_sample_patch(image, orgi_labels, bboxes, ratio_list=[-0.1, 0.1, 0.3, 0.5, 0.7, 0.9, 1.])
+      random_sample_image.set_shape([None, None, 3])
+      return index+1, random_sample_image, labels, bboxes, orgi_image, orgi_labels, orgi_bboxes
+
+    [index, image, labels, bboxes, orgi_image, orgi_labels, orgi_bboxes] = tf.while_loop(condition, body, [index,  image, labels, bboxes, orgi_image, orgi_labels, orgi_bboxes], parallel_iterations=4, back_prop=False, swap_memory=True)
+
+    valid_mask = check_bboxes(bboxes)
+    labels, bboxes = tf.boolean_mask(labels, valid_mask), tf.boolean_mask(bboxes, valid_mask)
+    return tf.cond(tf.less(index, max_attempt),
+                lambda : (image, labels, bboxes),
+                lambda : (orgi_image, orgi_labels, orgi_bboxes))
+
+def _mean_image_subtraction(image, means):
+  """Subtracts the given means from each image channel.
+
+  For example:
+    means = [123.68, 116.779, 103.939]
+    image = _mean_image_subtraction(image, means)
+
+  Note that the rank of `image` must be known.
+
+  Args:
+    image: a tensor of size [height, width, C].
+    means: a C-vector of values to subtract from each channel.
+
+  Returns:
+    the centered image.
+
+  Raises:
+    ValueError: If the rank of `image` is unknown, if `image` has a rank other
+      than three or if the number of channels in `image` doesn't match the
+      number of values in `means`.
+  """
+  if image.get_shape().ndims != 3:
+    raise ValueError('Input must be of size [height, width, C>0]')
+  num_channels = image.get_shape().as_list()[-1]
+  if len(means) != num_channels:
+    raise ValueError('len(means) must match the number of channels')
+
+  channels = tf.split(axis=2, num_or_size_splits=num_channels, value=image)
+  for i in range(num_channels):
+    channels[i] -= means[i]
+  return tf.concat(axis=2, values=channels)
+
+def unwhiten_image(image):
+  means=[_R_MEAN, _G_MEAN, _B_MEAN]
+  num_channels = image.get_shape().as_list()[-1]
+  channels = tf.split(axis=2, num_or_size_splits=num_channels, value=image)
+  for i in range(num_channels):
+    channels[i] += means[i]
+  return tf.concat(axis=2, values=channels)
+
+def random_flip_left_right(image, bboxes):
+  with tf.name_scope('random_flip_left_right'):
+    uniform_random = tf.random_uniform([], 0, 1.0)
+    mirror_cond = tf.less(uniform_random, .5)
+    # Flip image.
+    result = tf.cond(mirror_cond, lambda: tf.image.flip_left_right(image), lambda: image)
+    # Flip bboxes.
+    mirror_bboxes = tf.stack([bboxes[:, 0], 1 - bboxes[:, 3],
+                              bboxes[:, 2], 1 - bboxes[:, 1]], axis=-1)
+    bboxes = tf.cond(mirror_cond, lambda: mirror_bboxes, lambda: bboxes)
+    return result, bboxes
+
+def preprocess_for_train(image, labels, bboxes, out_shape, data_format='channels_first', scope='ssd_preprocessing_train', output_rgb=True):
+  """Preprocesses the given image for training.
+
+  Args:
+    image: A `Tensor` representing an image of arbitrary size.
+    labels: A `Tensor` containing all labels for all bboxes of this image.
+    bboxes: A `Tensor` containing all bboxes of this image, in range [0., 1.] with shape [num_bboxes, 4].
+    out_shape: The height and width of the image after preprocessing.
+    data_format: The data_format of the desired output image.
+  Returns:
+    A tuple of the preprocessed image, the filtered labels and the adjusted bboxes.
+  """
+  with tf.name_scope(scope, 'ssd_preprocessing_train', [image, labels, bboxes]):
+    if image.get_shape().ndims != 3:
+      raise ValueError('Input must be of size [height, width, C>0]')
+    # Convert to float scaled [0, 1].
+    orig_dtype = image.dtype
+    if orig_dtype != tf.float32:
+      image = tf.image.convert_image_dtype(image, dtype=tf.float32)
+
+    # Randomly distort the colors. In fast mode, selector 0 and selectors 1-3 give the two distinct orderings.
+    distort_image = apply_with_random_selector(image,
+                                          lambda x, ordering: distort_color(x, ordering, True),
+                                          num_cases=4)
+
+    random_sample_image, labels, bboxes = ssd_random_sample_patch_wrapper(distort_image, labels, bboxes)
+    # image, bboxes = tf.cond(tf.random_uniform([1], minval=0., maxval=1., dtype=tf.float32)[0] < 0.25,
+    #                     lambda: (image, bboxes),
+    #                     lambda: ssd_random_expand(image, bboxes, tf.random_uniform([1], minval=2, maxval=4, dtype=tf.int32)[0]))
+
+    # # Distort image and bounding boxes.
+    # random_sample_image, labels, bboxes = ssd_random_sample_patch(image, labels, bboxes, ratio_list=[0.1, 0.3, 0.5, 0.7, 0.9, 1.])
+
+    # Randomly flip the image horizontally.
+    random_sample_flip_image, bboxes = random_flip_left_right(random_sample_image, bboxes)
+    # Rescale to VGG input scale.
+    random_sample_flip_resized_image = tf.image.resize_images(random_sample_flip_image, out_shape, method=tf.image.ResizeMethod.BILINEAR, align_corners=False)
+    random_sample_flip_resized_image.set_shape([None, None, 3])
+
+    final_image = tf.to_float(tf.image.convert_image_dtype(random_sample_flip_resized_image, orig_dtype, saturate=True))
+    final_image = _mean_image_subtraction(final_image, [_R_MEAN, _G_MEAN, _B_MEAN])
+
+    final_image.set_shape(out_shape + [3])
+    if not output_rgb:
+      image_channels = tf.unstack(final_image, axis=-1, name='split_rgb')
+      final_image = tf.stack([image_channels[2], image_channels[1], image_channels[0]], axis=-1, name='merge_bgr')
+    if data_format == 'channels_first':
+      final_image = tf.transpose(final_image, perm=(2, 0, 1))
+    return final_image, labels, bboxes
+
+def preprocess_for_eval(image, out_shape, data_format='channels_first', scope='ssd_preprocessing_eval', output_rgb=True):
+  """Preprocesses the given image for evaluation.
+
+  Args:
+    image: A `Tensor` representing an image of arbitrary size.
+    out_shape: The height and width of the image after preprocessing.
+    data_format: The data_format of the desired output image.
+  Returns:
+    A preprocessed image.
+  """
+  with tf.name_scope(scope, 'ssd_preprocessing_eval', [image]):
+    image = tf.to_float(image)
+    image = tf.image.resize_images(image, out_shape, method=tf.image.ResizeMethod.BILINEAR, align_corners=False)
+    image.set_shape(out_shape + [3])
+
+    image = _mean_image_subtraction(image, [_R_MEAN, _G_MEAN, _B_MEAN])
+    if not output_rgb:
+      image_channels = tf.unstack(image, axis=-1, name='split_rgb')
+      image = tf.stack([image_channels[2], image_channels[1], image_channels[0]], axis=-1, name='merge_bgr')
+    # Image data format.
+    if data_format == 'channels_first':
+      image = tf.transpose(image, perm=(2, 0, 1))
+    return image
+
+def preprocess_image(image, labels, bboxes, out_shape, is_training=False, data_format='channels_first', output_rgb=True):
+  """Preprocesses the given image.
+
+  Args:
+    image: A `Tensor` representing an image of arbitrary size.
+    labels: A `Tensor` containing all labels for all bboxes of this image.
+    bboxes: A `Tensor` containing all bboxes of this image, in range [0., 1.] with shape [num_bboxes, 4].
+    out_shape: The height and width of the image after preprocessing.
+    is_training: Whether we are in the training phase.
+    data_format: The data_format of the desired output image.
+
+  Returns:
+    A preprocessed image for evaluation, or an (image, labels, bboxes) tuple for training.
+  """
+  if is_training:
+    return preprocess_for_train(image, labels, bboxes, out_shape, data_format=data_format, output_rgb=output_rgb)
+  else:
+    return preprocess_for_eval(image, out_shape, data_format=data_format, output_rgb=output_rgb)
diff --git a/utils/external/ssd_tensorflow/simple_ssd_demo.py b/utils/external/ssd_tensorflow/simple_ssd_demo.py
new file mode 100644
index 0000000..67540bc
--- /dev/null
+++ b/utils/external/ssd_tensorflow/simple_ssd_demo.py
@@ -0,0 +1,220 @@
+# Copyright 2018 Changan Wang
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#     http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import sys
+
+import tensorflow as tf
+from scipy.misc import imread, imsave, imshow, imresize
+import numpy as np
+
+from net import ssd_net
+
+from dataset import dataset_common
+from preprocessing import ssd_preprocessing
+from utility import anchor_manipulator
+from utility import draw_toolbox
+
+# scaffold related configuration
+tf.app.flags.DEFINE_integer(
+    'num_classes', 21, 'Number of classes to use in the dataset.')
+# model related configuration
+tf.app.flags.DEFINE_integer(
+    'train_image_size', 300,
+    'The size of the input image for the model to use.')
+tf.app.flags.DEFINE_string(
+    'data_format', 'channels_last', # 'channels_first' or 'channels_last'
+    'A flag to override the data format used in the model. channels_first '
+    'provides a performance boost on GPU but is not always compatible '
+    'with CPU. If left unspecified, the data format will be chosen '
+    'automatically based on whether TensorFlow was built for CPU or GPU.')
+tf.app.flags.DEFINE_float(
+    'select_threshold', 0.2, 'Class-specific confidence score threshold for selecting a box.')
+tf.app.flags.DEFINE_float(
+    'min_size', 0.03, 'The min size of bboxes to keep.')
+tf.app.flags.DEFINE_float(
+    'nms_threshold', 0.45, 'Matching threshold in NMS algorithm.')
+tf.app.flags.DEFINE_integer(
+    'nms_topk', 20, 'Total number of objects to keep after NMS.')
+tf.app.flags.DEFINE_integer(
+    'keep_topk', 200, 'Total number of objects to keep for each image before NMS.')
+# checkpoint related configuration
+tf.app.flags.DEFINE_string(
+    'checkpoint_path', './logs',
+    'The path to a checkpoint file or directory from which to restore the model.')
+tf.app.flags.DEFINE_string(
+    'model_scope', 'ssd300',
+    'Model scope name used to replace the name_scope in checkpoint.')
+
+FLAGS = tf.app.flags.FLAGS
+#CUDA_VISIBLE_DEVICES
+
+def get_checkpoint():
+    if tf.gfile.IsDirectory(FLAGS.checkpoint_path):
+        checkpoint_path = tf.train.latest_checkpoint(FLAGS.checkpoint_path)
+    else:
+        checkpoint_path = FLAGS.checkpoint_path
+
+    return checkpoint_path
+
+def select_bboxes(scores_pred, bboxes_pred, num_classes, select_threshold):
+    selected_bboxes = {}
+    selected_scores = {}
+    with tf.name_scope('select_bboxes', values=[scores_pred, bboxes_pred]):
+        for class_ind in range(1, num_classes):
+            class_scores = scores_pred[:, class_ind]
+
+            select_mask = class_scores > select_threshold
+            select_mask = tf.cast(select_mask, tf.float32)
+            selected_bboxes[class_ind] = tf.multiply(bboxes_pred, tf.expand_dims(select_mask, axis=-1))
+            selected_scores[class_ind] = tf.multiply(class_scores, select_mask)
+
+    return selected_bboxes, selected_scores
+
+def clip_bboxes(ymin, xmin, ymax, xmax, name):
+    with tf.name_scope(name, 'clip_bboxes', [ymin, xmin, ymax, xmax]):
+        ymin = tf.maximum(ymin, 0.)
+        xmin = tf.maximum(xmin, 0.)
+        ymax = tf.minimum(ymax, 1.)
+        xmax = tf.minimum(xmax, 1.)
+
+        ymin = tf.minimum(ymin, ymax)
+        xmin = tf.minimum(xmin, xmax)
+
+        return ymin, xmin, ymax, xmax
+
+def filter_bboxes(scores_pred, ymin, xmin, ymax, xmax, min_size, name):
+    with tf.name_scope(name, 'filter_bboxes', [scores_pred, ymin, xmin, ymax, xmax]):
+        width = xmax - xmin
+        height = ymax - ymin
+
+        filter_mask = tf.logical_and(width > min_size, height > min_size)
+
+        filter_mask = tf.cast(filter_mask, tf.float32)
+        return tf.multiply(ymin, filter_mask), tf.multiply(xmin, filter_mask), \
+                tf.multiply(ymax, filter_mask), tf.multiply(xmax, filter_mask), tf.multiply(scores_pred, filter_mask)
+
+def sort_bboxes(scores_pred, ymin, xmin, ymax, xmax, keep_topk, name):
+    with tf.name_scope(name, 'sort_bboxes', [scores_pred, ymin, xmin, ymax, xmax]):
+        cur_bboxes = tf.shape(scores_pred)[0]
+        scores, idxes = tf.nn.top_k(scores_pred, k=tf.minimum(keep_topk, cur_bboxes), sorted=True)
+
+        ymin, xmin, ymax, xmax = tf.gather(ymin, idxes), tf.gather(xmin, idxes), tf.gather(ymax, idxes), tf.gather(xmax, idxes)
+
+        paddings_scores = tf.expand_dims(tf.stack([0, tf.maximum(keep_topk-cur_bboxes, 0)], axis=0), axis=0)
+
+        return tf.pad(ymin, paddings_scores, "CONSTANT"), tf.pad(xmin, paddings_scores, "CONSTANT"),\
+                tf.pad(ymax, paddings_scores, "CONSTANT"), tf.pad(xmax, paddings_scores, "CONSTANT"),\
+                tf.pad(scores, paddings_scores, "CONSTANT")
+
+def nms_bboxes(scores_pred, bboxes_pred, nms_topk, nms_threshold, name):
+    with tf.name_scope(name, 'nms_bboxes', [scores_pred, bboxes_pred]):
+        idxes = tf.image.non_max_suppression(bboxes_pred, scores_pred, nms_topk, nms_threshold)
+        return tf.gather(scores_pred, idxes), tf.gather(bboxes_pred, idxes)
+
+def parse_by_class(cls_pred, bboxes_pred, num_classes, select_threshold, min_size, keep_topk, nms_topk, nms_threshold):
+    with tf.name_scope('select_bboxes', values=[cls_pred, bboxes_pred]):
+        scores_pred = tf.nn.softmax(cls_pred)
+        selected_bboxes, selected_scores = select_bboxes(scores_pred, bboxes_pred, num_classes, select_threshold)
+        for class_ind in range(1, num_classes):
+            ymin, xmin, ymax, xmax = tf.unstack(selected_bboxes[class_ind], 4, axis=-1)
+            #ymin, xmin, ymax, xmax = tf.squeeze(ymin), tf.squeeze(xmin), tf.squeeze(ymax), tf.squeeze(xmax)
+            ymin, xmin, ymax, xmax = clip_bboxes(ymin, xmin, ymax, xmax, 'clip_bboxes_{}'.format(class_ind))
+            ymin, xmin, ymax, xmax, selected_scores[class_ind] = filter_bboxes(selected_scores[class_ind],
+                                                ymin, xmin, ymax, xmax, min_size, 'filter_bboxes_{}'.format(class_ind))
+            ymin, xmin, ymax, xmax, selected_scores[class_ind] = sort_bboxes(selected_scores[class_ind],
+                                                ymin, xmin, ymax, xmax, keep_topk, 'sort_bboxes_{}'.format(class_ind))
+            selected_bboxes[class_ind] = tf.stack([ymin, xmin, ymax, xmax], axis=-1)
+            selected_scores[class_ind], selected_bboxes[class_ind] = nms_bboxes(selected_scores[class_ind], selected_bboxes[class_ind], nms_topk, nms_threshold, 'nms_bboxes_{}'.format(class_ind))
+
+        return selected_bboxes, selected_scores
+
+def main(_):
+    with tf.Graph().as_default():
+        out_shape = [FLAGS.train_image_size] * 2
+
+        image_input = tf.placeholder(tf.uint8, shape=(None, None, 3))
+        shape_input = tf.placeholder(tf.int32, shape=(2,))
+
+        features = ssd_preprocessing.preprocess_for_eval(image_input, out_shape, data_format=FLAGS.data_format, output_rgb=False)
+        features = tf.expand_dims(features, axis=0)
+
+        anchor_creator = anchor_manipulator.AnchorCreator(out_shape,
+                                                    layers_shapes = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
+                                                    anchor_scales = [(0.1,), (0.2,), (0.375,), (0.55,), (0.725,), (0.9,)],
+                                                    extra_anchor_scales = [(0.1414,), (0.2739,), (0.4541,), (0.6315,), (0.8078,), (0.9836,)],
+                                                    anchor_ratios = [(1., 2., .5), (1., 2., 3., .5, 0.3333), (1., 2., 3., .5, 0.3333), (1., 2., 3., .5, 0.3333), (1., 2., .5), (1., 2., .5)],
+                                                    #anchor_ratios = [(2., .5), (2., 3., .5, 0.3333), (2., 3., .5, 0.3333), (2., 3., .5, 0.3333), (2., .5), (2., .5)],
+                                                    layer_steps = [8, 16, 32, 64, 100, 300])
+        all_anchors, all_num_anchors_depth, all_num_anchors_spatial = anchor_creator.get_all_anchors()
+
+        anchor_encoder_decoder = anchor_manipulator.AnchorEncoder(allowed_borders = [1.0] * 6,
+                                                            positive_threshold = None,
+                                                            ignore_threshold = None,
+                                                            prior_scaling=[0.1, 0.1, 0.2, 0.2])
+
+        decode_fn = lambda pred : anchor_encoder_decoder.ext_decode_all_anchors(pred, all_anchors, all_num_anchors_depth, all_num_anchors_spatial)
+
+        with tf.variable_scope(FLAGS.model_scope, default_name=None, values=[features], reuse=tf.AUTO_REUSE):
+            backbone = ssd_net.VGG16Backbone(FLAGS.data_format)
+            feature_layers = backbone.forward(features, training=False)
+            location_pred, cls_pred = ssd_net.multibox_head(feature_layers, FLAGS.num_classes, all_num_anchors_depth, data_format=FLAGS.data_format)
+            if FLAGS.data_format == 'channels_first':
+                cls_pred = [tf.transpose(pred, [0, 2, 3, 1]) for pred in cls_pred]
+                location_pred = [tf.transpose(pred, [0, 2, 3, 1]) for pred in location_pred]
+
+            cls_pred = [tf.reshape(pred, [-1, FLAGS.num_classes]) for pred in cls_pred]
+            location_pred = [tf.reshape(pred, [-1, 4]) for pred in location_pred]
+
+            cls_pred = tf.concat(cls_pred, axis=0)
+            location_pred = tf.concat(location_pred, axis=0)
+
+        with tf.device('/cpu:0'):
+            bboxes_pred = decode_fn(location_pred)
+            bboxes_pred = tf.concat(bboxes_pred, axis=0)
+            selected_bboxes, selected_scores = parse_by_class(cls_pred, bboxes_pred,
+                                                            FLAGS.num_classes, FLAGS.select_threshold, FLAGS.min_size,
+                                                            FLAGS.keep_topk, FLAGS.nms_topk, FLAGS.nms_threshold)
+
+            labels_list = []
+            scores_list = []
+            bboxes_list = []
+            for k, v in selected_scores.items():
+                labels_list.append(tf.ones_like(v, tf.int32) * k)
+                scores_list.append(v)
+                bboxes_list.append(selected_bboxes[k])
+            all_labels = tf.concat(labels_list, axis=0)
+            all_scores = tf.concat(scores_list, axis=0)
+            all_bboxes = tf.concat(bboxes_list, axis=0)
+
+        saver = tf.train.Saver()
+        with tf.Session() as sess:
+            init = tf.global_variables_initializer()
+            sess.run(init)
+
+            saver.restore(sess, get_checkpoint())
+
+            np_image = imread('./demo/test.jpg')
+            labels_, scores_, bboxes_ = sess.run([all_labels, all_scores, all_bboxes], feed_dict = {image_input : np_image, shape_input : np_image.shape[:-1]})
+
+            img_to_draw = draw_toolbox.bboxes_draw_on_img(np_image, labels_, scores_, bboxes_, thickness=2)
+            imsave('./demo/test_out.jpg', img_to_draw)
+
+if __name__ == '__main__':
+  tf.logging.set_verbosity(tf.logging.INFO)
+  tf.app.run()
diff --git a/utils/external/ssd_tensorflow/train_ssd.py b/utils/external/ssd_tensorflow/train_ssd.py
new file mode 100644
index 0000000..a6c09a8
--- /dev/null
+++ b/utils/external/ssd_tensorflow/train_ssd.py
@@ -0,0 +1,498 @@
+# Copyright 2018 Changan Wang
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#     http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import sys
+
+import tensorflow as tf
+
+from net import ssd_net
+
+from dataset import dataset_common
+from preprocessing import ssd_preprocessing
+from utility import anchor_manipulator
+from utility import scaffolds
+
+# hardware related configuration
+tf.app.flags.DEFINE_integer(
+    'num_readers', 8,
+    'The number of parallel readers that read data from the dataset.')
+tf.app.flags.DEFINE_integer(
+    'num_preprocessing_threads', 24,
+    'The number of threads used to create the batches.')
+tf.app.flags.DEFINE_integer(
+    'num_cpu_threads', 0,
+    'The number of cpu cores used to train.')
+tf.app.flags.DEFINE_float(
+    'gpu_memory_fraction', 1., 'GPU memory fraction to use.')
+# scaffold related configuration
+tf.app.flags.DEFINE_string(
+    'data_dir', './dataset/tfrecords',
+    'The directory where the dataset input data is stored.')
+tf.app.flags.DEFINE_integer(
+    'num_classes', 21, 'Number of classes to use in the dataset.')
+tf.app.flags.DEFINE_string(
+    'model_dir', './logs/',
+    'The directory where the model will be stored.')
+tf.app.flags.DEFINE_integer(
+    'log_every_n_steps', 10,
+    'The frequency with which logs are printed.')
+tf.app.flags.DEFINE_integer(
+    'save_summary_steps', 500,
+    'The frequency with which summaries are saved, in steps.')
+tf.app.flags.DEFINE_integer(
+    'save_checkpoints_secs', 7200,
+    'The frequency with which the model is saved, in seconds.')
+# model related configuration
+tf.app.flags.DEFINE_integer(
+    'train_image_size', 300,
+    'The size of the input image for the model to use.')
+tf.app.flags.DEFINE_integer(
+    'train_epochs', None,
+    'The number of epochs to use for training.')
+tf.app.flags.DEFINE_integer(
+    'max_number_of_steps', 120000,
+    'The max number of steps to use for training.')
+tf.app.flags.DEFINE_integer(
+    'batch_size', 32,
+    'Batch size for training and evaluation.')
+tf.app.flags.DEFINE_string(
+    'data_format', 'channels_first', # 'channels_first' or 'channels_last'
+    'A flag to override the data format used in the model. channels_first '
+    'provides a performance boost on GPU but is not always compatible '
+    'with CPU. If left unspecified, the data format will be chosen '
+    'automatically based on whether TensorFlow was built for CPU or GPU.')
+tf.app.flags.DEFINE_float(
+    'negative_ratio', 3., 'Negative ratio in the loss function.')
+tf.app.flags.DEFINE_float(
+    'match_threshold', 0.5, 'Matching threshold in the loss function.')
+tf.app.flags.DEFINE_float(
+    'neg_threshold', 0.5, 'Matching threshold for the negative examples in the loss function.')
+# optimizer related configuration
+tf.app.flags.DEFINE_integer(
+    'tf_random_seed', 20180503, 'Random seed for TensorFlow initializers.')
+tf.app.flags.DEFINE_float(
+    'weight_decay', 5e-4, 'The weight decay on the model weights.')
+tf.app.flags.DEFINE_float(
+    'momentum', 0.9,
+    'The momentum for the MomentumOptimizer and RMSPropOptimizer.')
+tf.app.flags.DEFINE_float('learning_rate', 1e-3, 'Initial learning rate.')
+tf.app.flags.DEFINE_float(
+    'end_learning_rate', 0.000001,
+    'The minimal end learning rate used by a polynomial decay learning rate.')
+# for learning rate piecewise_constant decay
+tf.app.flags.DEFINE_string(
+    'decay_boundaries', '500, 80000, 100000',
+    'Learning rate decay boundaries by global_step (comma-separated list).')
+tf.app.flags.DEFINE_string(
+    'lr_decay_factors', '0.1, 1, 0.1, 0.01',
+    'The values of learning_rate decay factor for each segment between boundaries (comma-separated list).')
+# checkpoint related configuration
+tf.app.flags.DEFINE_string(
+    'checkpoint_path', './model',
+    'The path to a checkpoint from which to fine-tune.')
+tf.app.flags.DEFINE_string(
+    'checkpoint_model_scope', 'vgg_16',
+    'Model scope in the checkpoint. None if the same as the trained model.')
+tf.app.flags.DEFINE_string(
+    'model_scope', 'ssd300',
+    'Model scope name used to replace the name_scope in checkpoint.')
+tf.app.flags.DEFINE_string(
+    'checkpoint_exclude_scopes', 'ssd300/multibox_head, ssd300/additional_layers, ssd300/conv4_3_scale',
+    'Comma-separated list of scopes of variables to exclude when restoring from a checkpoint.')
+tf.app.flags.DEFINE_boolean(
+    'ignore_missing_vars', True,
+    'Whether to ignore missing variables when restoring a checkpoint.')
+tf.app.flags.DEFINE_boolean(
+    'multi_gpu', True,
+    'Whether to use the available GPUs for training.')
+
+FLAGS = tf.app.flags.FLAGS
+#CUDA_VISIBLE_DEVICES
+def validate_batch_size_for_multi_gpu(batch_size):
+    """For multi-gpu, batch-size must be a multiple of the number of
+    available GPUs.
+
+    Note that this should eventually be handled by replicate_model_fn
+    directly. Multi-GPU support is currently experimental, however,
+    so doing the work here until that feature is in place.
+    """
+    if FLAGS.multi_gpu:
+        from tensorflow.python.client import device_lib
+
+        local_device_protos = device_lib.list_local_devices()
+        num_gpus = sum([1 for d in local_device_protos if d.device_type == 'GPU'])
+        if not num_gpus:
+            raise ValueError('Multi-GPU mode was specified, but no GPUs '
+                            'were found. To use CPU, run --multi_gpu=False.')
+
+        remainder = batch_size % num_gpus
+        if remainder:
+            err = ('When running with multiple GPUs, batch size '
+                    'must be a multiple of the number of available GPUs. '
+                    'Found {} GPUs with a batch size of {}; try --batch_size={} instead.'
+                    ).format(num_gpus, batch_size, batch_size - remainder)
+            raise ValueError(err)
+        return num_gpus
+    return 0
+
+def get_init_fn():
+    return scaffolds.get_init_fn_for_scaffold(FLAGS.model_dir, FLAGS.checkpoint_path,
+                                            FLAGS.model_scope, FLAGS.checkpoint_model_scope,
+                                            FLAGS.checkpoint_exclude_scopes, FLAGS.ignore_missing_vars,
+                                            name_remap={'/kernel': '/weights', '/bias': '/biases'})
+
+# We couldn't find a better way to pass params from input_fn to model_fn.
+# Some tensors used by model_fn must be created in input_fn to ensure they are in the same graph,
+# but if we put these tensors into the labels dict, replicate_model_fn splits them across the GPUs,
+# and they must not be split, so they are kept in this module-level dict instead.
+global_anchor_info = dict()
+
+def input_pipeline(dataset_pattern='train-*', is_training=True, batch_size=FLAGS.batch_size):
+    def input_fn():
+        out_shape = [FLAGS.train_image_size] * 2
+        anchor_creator = anchor_manipulator.AnchorCreator(out_shape,
+                                                    layers_shapes = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
+                                                    anchor_scales = [(0.1,), (0.2,), (0.375,), (0.55,), (0.725,), (0.9,)],
+                                                    extra_anchor_scales = [(0.1414,), (0.2739,), (0.4541,), (0.6315,), (0.8078,), (0.9836,)],
+                                                    anchor_ratios = [(1., 2., .5), (1., 2., 3., .5, 0.3333), (1., 2., 3., .5, 0.3333), (1., 2., 3., .5, 0.3333), (1., 2., .5), (1., 2., .5)],
+                                                    layer_steps = [8, 16, 32, 64, 100, 300])
+        all_anchors, all_num_anchors_depth, all_num_anchors_spatial = anchor_creator.get_all_anchors()
+
+        num_anchors_per_layer = []
+        for ind in range(len(all_anchors)):
+            num_anchors_per_layer.append(all_num_anchors_depth[ind] * all_num_anchors_spatial[ind])
+
+        anchor_encoder_decoder = anchor_manipulator.AnchorEncoder(allowed_borders = [1.0] * 6,
+                                                            positive_threshold = FLAGS.match_threshold,
+                                                            ignore_threshold = FLAGS.neg_threshold,
+                                                            prior_scaling=[0.1, 0.1, 0.2, 0.2])
+
+        image_preprocessing_fn = lambda image_, labels_, bboxes_ : ssd_preprocessing.preprocess_image(image_, labels_, bboxes_, out_shape, is_training=is_training, data_format=FLAGS.data_format, output_rgb=False)
+        anchor_encoder_fn = lambda glabels_, gbboxes_: anchor_encoder_decoder.encode_all_anchors(glabels_, gbboxes_, all_anchors, all_num_anchors_depth, all_num_anchors_spatial)
+
+        image, _, shape, loc_targets, cls_targets, match_scores = dataset_common.slim_get_batch(FLAGS.num_classes,
+                                                                                batch_size,
+                                                                                ('train' if is_training else 'val'),
+                                                                                os.path.join(FLAGS.data_dir, dataset_pattern),
+                                                                                FLAGS.num_readers,
+                                                                                FLAGS.num_preprocessing_threads,
+                                                                                image_preprocessing_fn,
+                                                                                anchor_encoder_fn,
+                                                                                num_epochs=FLAGS.train_epochs,
+                                                                                is_training=is_training)
+        global global_anchor_info
+        global_anchor_info = {'decode_fn': lambda pred : anchor_encoder_decoder.decode_all_anchors(pred, num_anchors_per_layer),
+                            'num_anchors_per_layer': num_anchors_per_layer,
+                            'all_num_anchors_depth': all_num_anchors_depth }
+
+        return image, {'shape': shape, 'loc_targets': loc_targets, 'cls_targets': cls_targets, 'match_scores': match_scores}
+    return input_fn
+
+def modified_smooth_l1(bbox_pred, bbox_targets, bbox_inside_weights=1., bbox_outside_weights=1., sigma=1.):
+    """
+        ResultLoss = outside_weights * SmoothL1(inside_weights * (bbox_pred - bbox_targets))
+        SmoothL1(x) = 0.5 * (sigma * x)^2,    if |x| < 1 / sigma^2
+                      |x| - 0.5 / sigma^2,    otherwise
+    """
+    with tf.name_scope('smooth_l1', values=[bbox_pred, bbox_targets]):
+        sigma2 = sigma * sigma
+
+        inside_mul = tf.multiply(bbox_inside_weights, tf.subtract(bbox_pred, bbox_targets))
+
+        smooth_l1_sign = tf.cast(tf.less(tf.abs(inside_mul), 1.0 / sigma2), tf.float32)
+        smooth_l1_option1 = tf.multiply(tf.multiply(inside_mul, inside_mul), 0.5 * sigma2)
+        smooth_l1_option2 = tf.subtract(tf.abs(inside_mul), 0.5 / sigma2)
+        smooth_l1_result = tf.add(tf.multiply(smooth_l1_option1, smooth_l1_sign),
+                                  tf.multiply(smooth_l1_option2, tf.abs(tf.subtract(smooth_l1_sign, 1.0))))
+
+        outside_mul = tf.multiply(bbox_outside_weights, smooth_l1_result)
+
+        return outside_mul
+
+
+# from scipy.misc import imread, imsave, imshow, imresize
+# import numpy as np
+# from utility import draw_toolbox
+
+# def save_image_with_bbox(image, labels_, scores_, bboxes_):
+#     if not hasattr(save_image_with_bbox, "counter"):
+#         save_image_with_bbox.counter = 0  # it doesn't exist yet, so initialize it
+#     save_image_with_bbox.counter += 1
+
+#     img_to_draw = np.copy(image)
+
+#     img_to_draw = draw_toolbox.bboxes_draw_on_img(img_to_draw, labels_, scores_, bboxes_, thickness=2)
+#     imsave(os.path.join('./debug/{}.jpg').format(save_image_with_bbox.counter), img_to_draw)
+#     return save_image_with_bbox.counter
+
+def ssd_model_fn(features, labels, mode, params):
+    """model_fn for SSD to be used with our Estimator."""
+    shape = labels['shape']
+    loc_targets = labels['loc_targets']
+    cls_targets = labels['cls_targets']
+    match_scores = labels['match_scores']
+
+    global global_anchor_info
+    decode_fn = global_anchor_info['decode_fn']
+    num_anchors_per_layer = global_anchor_info['num_anchors_per_layer']
+    all_num_anchors_depth = global_anchor_info['all_num_anchors_depth']
+
+    # bboxes_pred = decode_fn(loc_targets[0])
+    # bboxes_pred = [tf.reshape(preds, [-1, 4]) for preds in bboxes_pred]
+    # bboxes_pred = tf.concat(bboxes_pred, axis=0)
+    # save_image_op = tf.py_func(save_image_with_bbox,
+    #                         [ssd_preprocessing.unwhiten_image(features[0]),
+    #                         tf.clip_by_value(cls_targets[0], 0, tf.int64.max),
+    #                         match_scores[0],
+    #                         bboxes_pred],
+    #                         tf.int64, stateful=True)
+    # with tf.control_dependencies([save_image_op]):
+
+    #print(all_num_anchors_depth)
+    with tf.variable_scope(params['model_scope'], default_name=None, values=[features], reuse=tf.AUTO_REUSE):
+        backbone = ssd_net.VGG16Backbone(params['data_format'])
+        feature_layers = backbone.forward(features, training=(mode == tf.estimator.ModeKeys.TRAIN))
+        #print(feature_layers)
+        location_pred, cls_pred = ssd_net.multibox_head(feature_layers, params['num_classes'], all_num_anchors_depth, data_format=params['data_format'])
+
+        if params['data_format'] == 'channels_first':
+            cls_pred = [tf.transpose(pred, [0, 2, 3, 1]) for pred in cls_pred]
+            location_pred = [tf.transpose(pred, [0, 2, 3, 1]) for pred in location_pred]
+
+        cls_pred = [tf.reshape(pred, [tf.shape(features)[0], -1, params['num_classes']]) for pred in cls_pred]
+        location_pred = [tf.reshape(pred, [tf.shape(features)[0], -1, 4]) for pred in location_pred]
+
+        cls_pred = tf.concat(cls_pred, axis=1)
+        location_pred = tf.concat(location_pred, axis=1)
+
+        cls_pred = tf.reshape(cls_pred, [-1, params['num_classes']])
+        location_pred = tf.reshape(location_pred, [-1, 4])
+
+    with tf.device('/cpu:0'):
+        with tf.control_dependencies([cls_pred, location_pred]):
+            with tf.name_scope('post_forward'):
+                #bboxes_pred = decode_fn(location_pred)
+                bboxes_pred = tf.map_fn(lambda _preds : decode_fn(_preds),
+                                        tf.reshape(location_pred, [tf.shape(features)[0], -1, 4]),
+                                        dtype=[tf.float32] * len(num_anchors_per_layer), back_prop=False)
+                #cls_targets = tf.Print(cls_targets, [tf.shape(bboxes_pred[0]),tf.shape(bboxes_pred[1]),tf.shape(bboxes_pred[2]),tf.shape(bboxes_pred[3])])
+                bboxes_pred = [tf.reshape(preds, [-1, 4]) for preds in bboxes_pred]
+                bboxes_pred = tf.concat(bboxes_pred, axis=0)
+
+                flaten_cls_targets = tf.reshape(cls_targets, [-1])
+                flaten_match_scores = tf.reshape(match_scores, [-1])
+                flaten_loc_targets = tf.reshape(loc_targets, [-1, 4])
+
+                # each positive example has one label
+                positive_mask = flaten_cls_targets > 0
+                n_positives = tf.count_nonzero(positive_mask)
+
+                batch_n_positives = tf.count_nonzero(cls_targets, -1)
+
+                batch_negtive_mask = tf.equal(cls_targets, 0)#tf.logical_and(tf.equal(cls_targets, 0), match_scores > 0.)
+                batch_n_negtives = tf.count_nonzero(batch_negtive_mask, -1)
+
+                batch_n_neg_select = tf.cast(params['negative_ratio'] * tf.cast(batch_n_positives, tf.float32), tf.int32)
+                batch_n_neg_select = tf.minimum(batch_n_neg_select, tf.cast(batch_n_negtives, tf.int32))
+
+                # hard negative mining for classification
+                predictions_for_bg = tf.nn.softmax(tf.reshape(cls_pred, [tf.shape(features)[0], -1, params['num_classes']]))[:, :, 0]
+                prob_for_negtives = tf.where(batch_negtive_mask,
+                                       0. - predictions_for_bg,
+                                       # ignore all the positives
+                                       0. - tf.ones_like(predictions_for_bg))
+                topk_prob_for_bg, _ = tf.nn.top_k(prob_for_negtives, k=tf.shape(prob_for_negtives)[1])
+                score_at_k = tf.gather_nd(topk_prob_for_bg, tf.stack([tf.range(tf.shape(features)[0]), batch_n_neg_select - 1], axis=-1))
+
+                selected_neg_mask = prob_for_negtives >= tf.expand_dims(score_at_k, axis=-1)
+
+                # include both the selected negative examples and all positive examples
+                final_mask = tf.stop_gradient(tf.logical_or(tf.reshape(tf.logical_and(batch_negtive_mask, selected_neg_mask), [-1]), positive_mask))
+                total_examples = tf.count_nonzero(final_mask)
+
+                cls_pred = tf.boolean_mask(cls_pred, final_mask)
+                location_pred = tf.boolean_mask(location_pred, tf.stop_gradient(positive_mask))
+                flaten_cls_targets = tf.boolean_mask(tf.clip_by_value(flaten_cls_targets, 0, params['num_classes']), final_mask)
+                flaten_loc_targets = tf.stop_gradient(tf.boolean_mask(flaten_loc_targets, positive_mask))
+
+                predictions = {
+                            'classes': tf.argmax(cls_pred, axis=-1),
+                            'probabilities': tf.reduce_max(tf.nn.softmax(cls_pred, name='softmax_tensor'), axis=-1),
+                            'loc_predict': bboxes_pred }
+
+                cls_accuracy = tf.metrics.accuracy(flaten_cls_targets, predictions['classes'])
+                metrics = {'cls_accuracy': cls_accuracy}
+
+                # Create a tensor named cls_accuracy for logging purposes.
+                tf.identity(cls_accuracy[1], name='cls_accuracy')
+                tf.summary.scalar('cls_accuracy', cls_accuracy[1])
+
+    if mode == tf.estimator.ModeKeys.PREDICT:
+        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
+
+    # Calculate loss, which includes softmax cross entropy, smooth L1 localization loss, and L2 regularization.
+    #cross_entropy = tf.cond(n_positives > 0, lambda: tf.losses.sparse_softmax_cross_entropy(labels=flaten_cls_targets, logits=cls_pred), lambda: 0.)# * (params['negative_ratio'] + 1.)
+    #flaten_cls_targets=tf.Print(flaten_cls_targets, [flaten_loc_targets],summarize=50000)
+    cross_entropy = tf.losses.sparse_softmax_cross_entropy(labels=flaten_cls_targets, logits=cls_pred) * (params['negative_ratio'] + 1.)
+    # Create a tensor named cross_entropy_loss for logging purposes.
+    tf.identity(cross_entropy, name='cross_entropy_loss')
+    tf.summary.scalar('cross_entropy_loss', cross_entropy)
+
+    #loc_loss = tf.cond(n_positives > 0, lambda: modified_smooth_l1(location_pred, tf.stop_gradient(flaten_loc_targets), sigma=1.), lambda: tf.zeros_like(location_pred))
+    loc_loss = modified_smooth_l1(location_pred, flaten_loc_targets, sigma=1.)
+    #loc_loss = modified_smooth_l1(location_pred, tf.stop_gradient(gtargets))
+    loc_loss = tf.reduce_mean(tf.reduce_sum(loc_loss, axis=-1), name='location_loss')
+    tf.summary.scalar('location_loss', loc_loss)
+    tf.losses.add_loss(loc_loss)
+
+    l2_loss_vars = []
+    for trainable_var in tf.trainable_variables():
+        if '_bn' not in trainable_var.name:
+            if 'conv4_3_scale' not in trainable_var.name:
+                l2_loss_vars.append(tf.nn.l2_loss(trainable_var))
+            else:
+                l2_loss_vars.append(tf.nn.l2_loss(trainable_var) * 0.1)
+    # Add weight decay to the loss. Batch norm variables are excluded because doing so
+    # leads to a small improvement in accuracy; 'conv4_3_scale' is decayed at 0.1x strength.
+    total_loss = tf.add(cross_entropy + loc_loss, tf.multiply(params['weight_decay'], tf.add_n(l2_loss_vars), name='l2_loss'), name='total_loss')
+
+    if mode == tf.estimator.ModeKeys.TRAIN:
+        global_step = tf.train.get_or_create_global_step()
+
+        lr_values = [params['learning_rate'] * decay for decay in params['lr_decay_factors']]
+        learning_rate = tf.train.piecewise_constant(tf.cast(global_step, tf.int32),
+                                                    [int(_) for _ in params['decay_boundaries']],
+                                                    lr_values)
+        truncated_learning_rate = tf.maximum(learning_rate, tf.constant(params['end_learning_rate'], dtype=learning_rate.dtype), name='learning_rate')
+        # Create a tensor named learning_rate for logging purposes.
+        tf.summary.scalar('learning_rate', truncated_learning_rate)
+
+        optimizer = tf.train.MomentumOptimizer(learning_rate=truncated_learning_rate,
+                                                momentum=params['momentum'])
+        optimizer = tf.contrib.estimator.TowerOptimizer(optimizer)
+
+        # Batch norm requires update_ops to be added as a train_op dependency.
+        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
+        with tf.control_dependencies(update_ops):
+            train_op = optimizer.minimize(total_loss, global_step)
+    else:
+        train_op = None
+
+    return tf.estimator.EstimatorSpec(
+                              mode=mode,
+                              predictions=predictions,
+                              loss=total_loss,
+                              train_op=train_op,
+                              eval_metric_ops=metrics,
+                              scaffold=tf.train.Scaffold(init_fn=get_init_fn()))
+
+def parse_comma_list(args):
+    return [float(s.strip()) for s in args.split(',')]
+
+def main(_):
+    # Using the Winograd non-fused algorithms provides a small performance boost.
+    os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = '1'
+
+    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=FLAGS.gpu_memory_fraction)
+    config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False, intra_op_parallelism_threads=FLAGS.num_cpu_threads, inter_op_parallelism_threads=FLAGS.num_cpu_threads, gpu_options=gpu_options)
+
+    num_gpus = validate_batch_size_for_multi_gpu(FLAGS.batch_size)
+
+    # Set up a RunConfig that saves a checkpoint every FLAGS.save_checkpoints_secs seconds and keeps the 5 most recent.
+    run_config = tf.estimator.RunConfig().replace(
+                                        save_checkpoints_secs=FLAGS.save_checkpoints_secs).replace(
+                                        save_checkpoints_steps=None).replace(
+                                        save_summary_steps=FLAGS.save_summary_steps).replace(
+                                        keep_checkpoint_max=5).replace(
+                                        tf_random_seed=FLAGS.tf_random_seed).replace(
+                                        log_step_count_steps=FLAGS.log_every_n_steps).replace(
+                                        session_config=config)
+
+    replicate_ssd_model_fn = tf.contrib.estimator.replicate_model_fn(ssd_model_fn, loss_reduction=tf.losses.Reduction.MEAN)
+    ssd_detector = tf.estimator.Estimator(
+        model_fn=replicate_ssd_model_fn, model_dir=FLAGS.model_dir, config=run_config,
+        params={
+            'num_gpus': num_gpus,
+            'data_format': FLAGS.data_format,
+            'batch_size': FLAGS.batch_size,
+            'model_scope': FLAGS.model_scope,
+            'num_classes': FLAGS.num_classes,
+            'negative_ratio': FLAGS.negative_ratio,
+            'match_threshold': FLAGS.match_threshold,
+            'neg_threshold': FLAGS.neg_threshold,
+            'weight_decay': FLAGS.weight_decay,
+            'momentum': FLAGS.momentum,
+            'learning_rate': FLAGS.learning_rate,
+            'end_learning_rate': FLAGS.end_learning_rate,
+            'decay_boundaries': parse_comma_list(FLAGS.decay_boundaries),
+            'lr_decay_factors': parse_comma_list(FLAGS.lr_decay_factors),
+        })
+    tensors_to_log = {
+        'lr': 'learning_rate',
+        'ce': 'cross_entropy_loss',
+        'loc': 'location_loss',
+        'loss': 'total_loss',
+        'l2': 'l2_loss',
+        'acc': 'post_forward/cls_accuracy',
+    }
+    logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=FLAGS.log_every_n_steps,
+                                            formatter=lambda dicts: (', '.join(['%s=%.6f' % (k, v) for k, v in dicts.items()])))
+
+    #hook = tf.train.ProfilerHook(save_steps=50, output_dir='.', show_memory=True)
+    print('Starting a training cycle.')
+    ssd_detector.train(input_fn=input_pipeline(dataset_pattern='train-*', is_training=True, batch_size=FLAGS.batch_size),
+                    hooks=[logging_hook], max_steps=FLAGS.max_number_of_steps)
+
+if __name__ == '__main__':
+  tf.logging.set_verbosity(tf.logging.INFO)
+  tf.app.run()
+
+
+    # cls_targets = tf.reshape(cls_targets, [-1])
+    # match_scores = tf.reshape(match_scores, [-1])
+    # loc_targets = tf.reshape(loc_targets, [-1, 4])
+
+    # # each positive examples has one label
+    # positive_mask = cls_targets > 0
+    # n_positives = tf.count_nonzero(positive_mask)
+
+    # negtive_mask = tf.logical_and(tf.equal(cls_targets, 0), match_scores > 0.)
+    # n_negtives = tf.count_nonzero(negtive_mask)
+
+    # n_neg_to_select = tf.cast(params['negative_ratio'] * tf.cast(n_positives, tf.float32), tf.int32)
+    # n_neg_to_select = tf.minimum(n_neg_to_select, tf.cast(n_negtives, tf.int32))
+
+    # # hard negative mining for classification
+    # predictions_for_bg = tf.nn.softmax(cls_pred)[:, 0]
+
+    # prob_for_negtives = tf.where(negtive_mask,
+    #                        0. - predictions_for_bg,
+    #                        # ignore all the positives
+    #                        0. - tf.ones_like(predictions_for_bg))
+    # topk_prob_for_bg, _ = tf.nn.top_k(prob_for_negtives, k=n_neg_to_select)
+    # selected_neg_mask = prob_for_negtives > topk_prob_for_bg[-1]
+
+    # # include both selected negative and all positive examples
+    # final_mask = tf.stop_gradient(tf.logical_or(tf.logical_and(negtive_mask, selected_neg_mask), positive_mask))
+    # total_examples = tf.count_nonzero(final_mask)
+
+    # glabels = tf.boolean_mask(tf.clip_by_value(cls_targets, 0, FLAGS.num_classes), final_mask)
+    # cls_pred = tf.boolean_mask(cls_pred, final_mask)
+    # location_pred = tf.boolean_mask(location_pred, tf.stop_gradient(positive_mask))
+    # loc_targets = tf.boolean_mask(loc_targets, tf.stop_gradient(positive_mask))
diff --git a/utils/external/ssd_tensorflow/utility/anchor_manipulator.py b/utils/external/ssd_tensorflow/utility/anchor_manipulator.py
new file mode 100644
index 0000000..2e51fb0
--- /dev/null
+++ b/utils/external/ssd_tensorflow/utility/anchor_manipulator.py
@@ -0,0 +1,331 @@
+# Copyright 2018 Changan Wang
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#     http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+import math
+
+import tensorflow as tf
+import numpy as np
+
+from tensorflow.contrib.image.python.ops import image_ops
+
+def areas(gt_bboxes):
+    with tf.name_scope('bboxes_areas', [gt_bboxes]):
+        ymin, xmin, ymax, xmax = tf.split(gt_bboxes, 4, axis=1)
+        return (xmax - xmin) * (ymax - ymin)
+
+def intersection(gt_bboxes, default_bboxes):
+    with tf.name_scope('bboxes_intersection', [gt_bboxes, default_bboxes]):
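+        # num_gt x 1 (these unprefixed names hold the ground-truth boxes)
+        ymin, xmin, ymax, xmax = tf.split(gt_bboxes, 4, axis=1)
+        # 1 x num_anchors (despite the gt_ prefix, these hold the default boxes)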
+        gt_ymin, gt_xmin, gt_ymax, gt_xmax = [tf.transpose(b, perm=[1, 0]) for b in tf.split(default_bboxes, 4, axis=1)]
+        # broadcast here to generate the full matrix
+        int_ymin = tf.maximum(ymin, gt_ymin)
+        int_xmin = tf.maximum(xmin, gt_xmin)
+        int_ymax = tf.minimum(ymax, gt_ymax)
+        int_xmax = tf.minimum(xmax, gt_xmax)
+        h = tf.maximum(int_ymax - int_ymin, 0.)
+        w = tf.maximum(int_xmax - int_xmin, 0.)
+
+        return h * w
+def iou_matrix(gt_bboxes, default_bboxes):
+    with tf.name_scope('iou_matrix', [gt_bboxes, default_bboxes]):
+        inter_vol = intersection(gt_bboxes, default_bboxes)
+        # broadcast
+        union_vol = areas(gt_bboxes) + tf.transpose(areas(default_bboxes), perm=[1, 0]) - inter_vol
+
+        return tf.where(tf.equal(union_vol, 0.0),
+                        tf.zeros_like(inter_vol), tf.truediv(inter_vol, union_vol))
+
+def do_dual_max_match(overlap_matrix, low_thres, high_thres, ignore_between=True, gt_max_first=True):
+    '''
+    overlap_matrix: num_gt * num_anchors
+    '''
+    with tf.name_scope('dual_max_match', [overlap_matrix]):
+        # first match from anchors' side
+        anchors_to_gt = tf.argmax(overlap_matrix, axis=0)
+        # the matching degree
+        match_values = tf.reduce_max(overlap_matrix, axis=0)
+
+        #positive_mask = tf.greater(match_values, high_thres)
+        less_mask = tf.less(match_values, low_thres)
+        between_mask = tf.logical_and(tf.less(match_values, high_thres), tf.greater_equal(match_values, low_thres))
+        negative_mask = less_mask if ignore_between else between_mask
+        ignore_mask = between_mask if ignore_between else less_mask
+        # fill all negative positions with -1 and all ignored positions with -2
+        match_indices = tf.where(negative_mask, -1 * tf.ones_like(anchors_to_gt), anchors_to_gt)
+        match_indices = tf.where(ignore_mask, -2 * tf.ones_like(match_indices), match_indices)
+
+        # negative values have no effect in tf.one_hot; they yield all zeros along that axis,
+        # so all positive match positions in anchors_to_gt_mask are 1 and all others are 0
+        anchors_to_gt_mask = tf.one_hot(tf.clip_by_value(match_indices, -1, tf.cast(tf.shape(overlap_matrix)[0], tf.int64)),
+                                        tf.shape(overlap_matrix)[0], on_value=1, off_value=0, axis=0, dtype=tf.int32)
+        # match from ground truth's side
+        gt_to_anchors = tf.argmax(overlap_matrix, axis=1)
+
+        if gt_max_first:
+            # the max match from ground truth's side has higher priority
+            left_gt_to_anchors_mask = tf.one_hot(gt_to_anchors, tf.shape(overlap_matrix)[1], on_value=1, off_value=0, axis=1, dtype=tf.int32)
+        else:
+            # the max match from anchors' side has higher priority
+            # use match result from ground truth's side only when the matching degree from anchors' side is lower than the positive threshold
+            left_gt_to_anchors_mask = tf.cast(tf.logical_and(tf.reduce_max(anchors_to_gt_mask, axis=1, keep_dims=True) < 1,
+                                                            tf.one_hot(gt_to_anchors, tf.shape(overlap_matrix)[1],
+                                                                        on_value=True, off_value=False, axis=1, dtype=tf.bool)
+                                                            ), tf.int64)
+        # can not use left_gt_to_anchors_mask here, because several ground truths may match the same anchor; we should pick the highest-scoring one even when merging matches from the ground truth side
+        left_gt_to_anchors_scores = overlap_matrix * tf.to_float(left_gt_to_anchors_mask)
+        # merge matching results from ground truth's side with the original matching results from anchors' side
+        # then select all the overlap score of those matching pairs
+        selected_scores = tf.gather_nd(overlap_matrix,  tf.stack([tf.where(tf.reduce_max(left_gt_to_anchors_mask, axis=0) > 0,
+                                                                            tf.argmax(left_gt_to_anchors_scores, axis=0),
+                                                                            anchors_to_gt),
+                                                                    tf.range(tf.cast(tf.shape(overlap_matrix)[1], tf.int64))], axis=1))
+        # return the matching results for both foreground anchors and background anchors, also with overlap scores
+        return tf.where(tf.reduce_max(left_gt_to_anchors_mask, axis=0) > 0,
+                        tf.argmax(left_gt_to_anchors_scores, axis=0),
+                        match_indices), selected_scores
+
+# def save_anchors(bboxes, labels, anchors_point):
+#     if not hasattr(save_image_with_bbox, "counter"):
+#         save_image_with_bbox.counter = 0  # it doesn't exist yet, so initialize it
+#     save_image_with_bbox.counter += 1
+
+#     np.save('./debug/bboxes_{}.npy'.format(save_image_with_bbox.counter), np.copy(bboxes))
+#     np.save('./debug/labels_{}.npy'.format(save_image_with_bbox.counter), np.copy(labels))
+#     np.save('./debug/anchors_{}.npy'.format(save_image_with_bbox.counter), np.copy(anchors_point))
+#     return save_image_with_bbox.counter
+
+class AnchorEncoder(object):
+    def __init__(self, allowed_borders, positive_threshold, ignore_threshold, prior_scaling, clip=False):
+        super(AnchorEncoder, self).__init__()
+        self._all_anchors = None
+        self._allowed_borders = allowed_borders
+        self._positive_threshold = positive_threshold
+        self._ignore_threshold = ignore_threshold
+        self._prior_scaling = prior_scaling
+        self._clip = clip
+
+    def center2point(self, center_y, center_x, height, width):
+        return center_y - height / 2., center_x - width / 2., center_y + height / 2., center_x + width / 2.
+
+    def point2center(self, ymin, xmin, ymax, xmax):
+        height, width = (ymax - ymin), (xmax - xmin)
+        return ymin + height / 2., xmin + width / 2., height, width
+
+    def init_all_anchors(self, all_anchors, all_num_anchors_depth, all_num_anchors_spatial):
+        assert len(all_num_anchors_depth) == len(all_num_anchors_spatial) \
+          and len(all_num_anchors_depth) == len(all_anchors), 'inconsistent number of layers for anchors.'
+
+        with tf.name_scope('init_all_anchors'):
+            list_anchors_ymin = []
+            list_anchors_xmin = []
+            list_anchors_ymax = []
+            list_anchors_xmax = []
+            for ind, anchor in enumerate(all_anchors):
+                anchors_ymin_, anchors_xmin_, anchors_ymax_, anchors_xmax_ = \
+                  self.center2point(anchor[0], anchor[1], anchor[2], anchor[3])
+                list_anchors_ymin.append(tf.reshape(anchors_ymin_, [-1]))
+                list_anchors_xmin.append(tf.reshape(anchors_xmin_, [-1]))
+                list_anchors_ymax.append(tf.reshape(anchors_ymax_, [-1]))
+                list_anchors_xmax.append(tf.reshape(anchors_xmax_, [-1]))
+
+            anchors_ymin = tf.concat(list_anchors_ymin, 0, name='concat_ymin')
+            anchors_xmin = tf.concat(list_anchors_xmin, 0, name='concat_xmin')
+            anchors_ymax = tf.concat(list_anchors_ymax, 0, name='concat_ymax')
+            anchors_xmax = tf.concat(list_anchors_xmax, 0, name='concat_xmax')
+            if self._clip:
+                anchors_ymin = tf.clip_by_value(anchors_ymin, 0., 1.)
+                anchors_xmin = tf.clip_by_value(anchors_xmin, 0., 1.)
+                anchors_ymax = tf.clip_by_value(anchors_ymax, 0., 1.)
+                anchors_xmax = tf.clip_by_value(anchors_xmax, 0., 1.)
+
+            anchor_cy, anchor_cx, anchor_h, anchor_w = \
+              self.point2center(anchors_ymin, anchors_xmin, anchors_ymax, anchors_xmax)
+            self._all_anchors = (anchor_cy, anchor_cx, anchor_h, anchor_w)
+
+    def encode_all_anchors(self, labels, bboxes, all_anchors, all_num_anchors_depth, all_num_anchors_spatial, debug=False):
+        assert self._all_anchors is not None, 'no anchors to encode.'
+
+        with tf.name_scope('encode_all_anchors'):
+            anchor_cy = self._all_anchors[0]
+            anchor_cx = self._all_anchors[1]
+            anchor_h = self._all_anchors[2]
+            anchor_w = self._all_anchors[3]
+            anchors_ymin, anchors_xmin, anchors_ymax, anchors_xmax = \
+                self.center2point(anchor_cy, anchor_cx, anchor_h, anchor_w)
+            anchors_point = tf.stack(
+                [anchors_ymin, anchors_xmin, anchors_ymax, anchors_xmax], axis=-1)
+
+            tiled_allowed_borders = []
+            for ind, anchor in enumerate(all_anchors):
+                tiled_allowed_borders.extend([self._allowed_borders[ind]]
+                    * all_num_anchors_depth[ind] * all_num_anchors_spatial[ind])
+            anchor_allowed_borders = tf.stack(
+                tiled_allowed_borders, 0, name='concat_allowed_borders')
+
+            inside_mask = tf.logical_and(
+                tf.logical_and(anchors_ymin > -anchor_allowed_borders * 1.,
+                               anchors_xmin > -anchor_allowed_borders * 1.),
+                tf.logical_and(anchors_ymax < (1. + anchor_allowed_borders * 1.),
+                               anchors_xmax < (1. + anchor_allowed_borders * 1.)))
+
+            overlap_matrix = iou_matrix(bboxes, anchors_point) \
+                * tf.cast(tf.expand_dims(inside_mask, 0), tf.float32)
+            matched_gt, gt_scores = do_dual_max_match(
+                overlap_matrix, self._ignore_threshold, self._positive_threshold)
+            matched_gt_mask = matched_gt > -1
+            matched_indices = tf.clip_by_value(matched_gt, 0, tf.int64.max)
+
+            gt_labels = tf.gather(labels, matched_indices)
+            gt_labels = gt_labels * tf.cast(matched_gt_mask, tf.int64)
+            gt_labels = gt_labels + (-1 * tf.cast(matched_gt < -1, tf.int64))
+            gt_ymin, gt_xmin, gt_ymax, gt_xmax = \
+                tf.unstack(tf.gather(bboxes, matched_indices), 4, axis=-1)
+            gt_cy, gt_cx, gt_h, gt_w = self.point2center(gt_ymin, gt_xmin, gt_ymax, gt_xmax)
+            gt_cy = (gt_cy - anchor_cy) / anchor_h / self._prior_scaling[0]
+            gt_cx = (gt_cx - anchor_cx) / anchor_w / self._prior_scaling[1]
+            gt_h = tf.log(gt_h / anchor_h) / self._prior_scaling[2]
+            gt_w = tf.log(gt_w / anchor_w) / self._prior_scaling[3]
+            if debug:
+                gt_targets = tf.stack(
+                    [anchors_ymin, anchors_xmin, anchors_ymax, anchors_xmax], axis=-1)
+            else:
+                gt_targets = tf.stack([gt_cy, gt_cx, gt_h, gt_w], axis=-1)
+            gt_targets = tf.expand_dims(tf.cast(matched_gt_mask, tf.float32), -1) * gt_targets
+
+        return gt_targets, gt_labels, gt_scores
+
+    def decode_all_anchors(self, pred_location, num_anchors_per_layer):
+        assert self._all_anchors is not None, 'no anchors to decode.'
+
+        with tf.name_scope('decode_all_anchors', [pred_location]):
+            anchor_cy, anchor_cx, anchor_h, anchor_w = self._all_anchors
+            pred_h = tf.exp(pred_location[:, -2] * self._prior_scaling[2]) * anchor_h
+            pred_w = tf.exp(pred_location[:, -1] * self._prior_scaling[3]) * anchor_w
+            pred_cy = pred_location[:, 0] * self._prior_scaling[0] * anchor_h + anchor_cy
+            pred_cx = pred_location[:, 1] * self._prior_scaling[1] * anchor_w + anchor_cx
+
+        return tf.split(tf.stack(self.center2point(
+            pred_cy, pred_cx, pred_h, pred_w), axis=-1), num_anchors_per_layer, axis=0)
+
+    def ext_decode_all_anchors(self, pred_location, all_anchors, all_num_anchors_depth, all_num_anchors_spatial):
+        assert (len(all_num_anchors_depth)==len(all_num_anchors_spatial)) and (len(all_num_anchors_depth)==len(all_anchors)), 'inconsistent number of layers for anchors.'
+        with tf.name_scope('ext_decode_all_anchors', [pred_location]):
+            num_anchors_per_layer = []
+            for ind in range(len(all_anchors)):
+                num_anchors_per_layer.append(all_num_anchors_depth[ind] * all_num_anchors_spatial[ind])
+
+            num_layers = len(all_num_anchors_depth)
+            list_anchors_ymin = []
+            list_anchors_xmin = []
+            list_anchors_ymax = []
+            list_anchors_xmax = []
+            tiled_allowed_borders = []
+            for ind, anchor in enumerate(all_anchors):
+                anchors_ymin_, anchors_xmin_, anchors_ymax_, anchors_xmax_ = self.center2point(anchor[0], anchor[1], anchor[2], anchor[3])
+
+                list_anchors_ymin.append(tf.reshape(anchors_ymin_, [-1]))
+                list_anchors_xmin.append(tf.reshape(anchors_xmin_, [-1]))
+                list_anchors_ymax.append(tf.reshape(anchors_ymax_, [-1]))
+                list_anchors_xmax.append(tf.reshape(anchors_xmax_, [-1]))
+
+            anchors_ymin = tf.concat(list_anchors_ymin, 0, name='concat_ymin')
+            anchors_xmin = tf.concat(list_anchors_xmin, 0, name='concat_xmin')
+            anchors_ymax = tf.concat(list_anchors_ymax, 0, name='concat_ymax')
+            anchors_xmax = tf.concat(list_anchors_xmax, 0, name='concat_xmax')
+
+            anchor_cy, anchor_cx, anchor_h, anchor_w = self.point2center(anchors_ymin, anchors_xmin, anchors_ymax, anchors_xmax)
+
+            pred_h = tf.exp(pred_location[:,-2] * self._prior_scaling[2]) * anchor_h
+            pred_w = tf.exp(pred_location[:, -1] * self._prior_scaling[3]) * anchor_w
+            pred_cy = pred_location[:, 0] * self._prior_scaling[0] * anchor_h + anchor_cy
+            pred_cx = pred_location[:, 1] * self._prior_scaling[1] * anchor_w + anchor_cx
+
+            return tf.split(tf.stack(self.center2point(pred_cy, pred_cx, pred_h, pred_w), axis=-1), num_anchors_per_layer, axis=0)
+
+class AnchorCreator(object):
+    def __init__(self, img_shape, layers_shapes, anchor_scales, extra_anchor_scales, anchor_ratios, layer_steps):
+        super(AnchorCreator, self).__init__()
+        # img_shape -> (height, width)
+        self._img_shape = img_shape
+        self._layers_shapes = layers_shapes
+        self._anchor_scales = anchor_scales
+        self._extra_anchor_scales = extra_anchor_scales
+        self._anchor_ratios = anchor_ratios
+        self._layer_steps = layer_steps
+        self._anchor_offset = [0.5] * len(self._layers_shapes)
+
+    def get_layer_anchors(self, layer_shape, anchor_scale, extra_anchor_scale, anchor_ratio, layer_step, offset = 0.5):
+        ''' assume layer_shape[0] = 6, layer_shape[1] = 5
+        x_on_layer = [[0, 1, 2, 3, 4],
+                       [0, 1, 2, 3, 4],
+                       [0, 1, 2, 3, 4],
+                       [0, 1, 2, 3, 4],
+                       [0, 1, 2, 3, 4],
+                       [0, 1, 2, 3, 4]]
+        y_on_layer = [[0, 0, 0, 0, 0],
+                       [1, 1, 1, 1, 1],
+                       [2, 2, 2, 2, 2],
+                       [3, 3, 3, 3, 3],
+                       [4, 4, 4, 4, 4],
+                       [5, 5, 5, 5, 5]]
+        '''
+        with tf.name_scope('get_layer_anchors'):
+            x_on_layer, y_on_layer = tf.meshgrid(tf.range(layer_shape[1]), tf.range(layer_shape[0]))
+
+            y_on_image = (tf.cast(y_on_layer, tf.float32) + offset) * layer_step / self._img_shape[0]
+            x_on_image = (tf.cast(x_on_layer, tf.float32) + offset) * layer_step / self._img_shape[1]
+
+            num_anchors_along_depth = len(anchor_scale) * len(anchor_ratio) + len(extra_anchor_scale)
+            num_anchors_along_spatial = layer_shape[1] * layer_shape[0]
+
+            list_h_on_image = []
+            list_w_on_image = []
+
+            global_index = 0
+            # for square anchors
+            for _, scale in enumerate(extra_anchor_scale):
+                list_h_on_image.append(scale)
+                list_w_on_image.append(scale)
+                global_index += 1
+            # for other aspect ratio anchors
+            for scale_index, scale in enumerate(anchor_scale):
+                for ratio_index, ratio in enumerate(anchor_ratio):
+                    list_h_on_image.append(scale / math.sqrt(ratio))
+                    list_w_on_image.append(scale * math.sqrt(ratio))
+                    global_index += 1
+            # shape info:
+            # y_on_image, x_on_image: layer_shape[0] x layer_shape[1] (x 1 after expand_dims)
+            # h_on_image, w_on_image: num_anchors_along_depth
+            return tf.expand_dims(y_on_image, axis=-1), tf.expand_dims(x_on_image, axis=-1), \
+                    tf.constant(list_h_on_image, dtype=tf.float32), \
+                    tf.constant(list_w_on_image, dtype=tf.float32), num_anchors_along_depth, num_anchors_along_spatial
+
+    def get_all_anchors(self):
+        all_anchors = []
+        all_num_anchors_depth = []
+        all_num_anchors_spatial = []
+        for layer_index, layer_shape in enumerate(self._layers_shapes):
+            anchors_this_layer = self.get_layer_anchors(layer_shape,
+                                                        self._anchor_scales[layer_index],
+                                                        self._extra_anchor_scales[layer_index],
+                                                        self._anchor_ratios[layer_index],
+                                                        self._layer_steps[layer_index],
+                                                        self._anchor_offset[layer_index])
+            all_anchors.append(anchors_this_layer[:-2])
+            all_num_anchors_depth.append(anchors_this_layer[-2])
+            all_num_anchors_spatial.append(anchors_this_layer[-1])
+        return all_anchors, all_num_anchors_depth, all_num_anchors_spatial
+
diff --git a/utils/external/ssd_tensorflow/utility/anchor_manipulator_unittest.py b/utils/external/ssd_tensorflow/utility/anchor_manipulator_unittest.py
new file mode 100644
index 0000000..bbacc64
--- /dev/null
+++ b/utils/external/ssd_tensorflow/utility/anchor_manipulator_unittest.py
@@ -0,0 +1,156 @@
+# Copyright 2018 Changan Wang
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#     http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+
+import tensorflow as tf
+from scipy.misc import imread, imsave, imshow, imresize
+import numpy as np
+import sys; sys.path.insert(0, ".")
+from utility import draw_toolbox
+from utility import anchor_manipulator
+from preprocessing import ssd_preprocessing
+
+slim = tf.contrib.slim
+
+def save_image_with_bbox(image, labels_, scores_, bboxes_):
+    if not hasattr(save_image_with_bbox, "counter"):
+        save_image_with_bbox.counter = 0  # it doesn't exist yet, so initialize it
+    save_image_with_bbox.counter += 1
+
+    img_to_draw = np.copy(image)
+
+    img_to_draw = draw_toolbox.bboxes_draw_on_img(img_to_draw, labels_, scores_, bboxes_, thickness=2)
+    imsave(os.path.join('./debug', '{}.jpg'.format(save_image_with_bbox.counter)), img_to_draw)
+    return save_image_with_bbox.counter
+
+def slim_get_split(file_pattern='{}_????'):
+    # Features in Pascal VOC TFRecords.
+    keys_to_features = {
+        'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
+        'image/format': tf.FixedLenFeature((), tf.string, default_value='jpeg'),
+        'image/height': tf.FixedLenFeature([1], tf.int64),
+        'image/width': tf.FixedLenFeature([1], tf.int64),
+        'image/channels': tf.FixedLenFeature([1], tf.int64),
+        'image/shape': tf.FixedLenFeature([3], tf.int64),
+        'image/object/bbox/xmin': tf.VarLenFeature(dtype=tf.float32),
+        'image/object/bbox/ymin': tf.VarLenFeature(dtype=tf.float32),
+        'image/object/bbox/xmax': tf.VarLenFeature(dtype=tf.float32),
+        'image/object/bbox/ymax': tf.VarLenFeature(dtype=tf.float32),
+        'image/object/bbox/label': tf.VarLenFeature(dtype=tf.int64),
+        'image/object/bbox/difficult': tf.VarLenFeature(dtype=tf.int64),
+        'image/object/bbox/truncated': tf.VarLenFeature(dtype=tf.int64),
+    }
+    items_to_handlers = {
+        'image': slim.tfexample_decoder.Image('image/encoded', 'image/format'),
+        'shape': slim.tfexample_decoder.Tensor('image/shape'),
+        'object/bbox': slim.tfexample_decoder.BoundingBox(
+                ['ymin', 'xmin', 'ymax', 'xmax'], 'image/object/bbox/'),
+        'object/label': slim.tfexample_decoder.Tensor('image/object/bbox/label'),
+        'object/difficult': slim.tfexample_decoder.Tensor('image/object/bbox/difficult'),
+        'object/truncated': slim.tfexample_decoder.Tensor('image/object/bbox/truncated'),
+    }
+    decoder = slim.tfexample_decoder.TFExampleDecoder(keys_to_features, items_to_handlers)
+
+    dataset = slim.dataset.Dataset(
+                data_sources=file_pattern,
+                reader=tf.TFRecordReader,
+                decoder=decoder,
+                num_samples=100,
+                items_to_descriptions=None,
+                num_classes=21,
+                labels_to_names=None)
+
+    with tf.name_scope('dataset_data_provider'):
+        provider = slim.dataset_data_provider.DatasetDataProvider(
+                    dataset,
+                    num_readers=2,
+                    common_queue_capacity=32,
+                    common_queue_min=8,
+                    shuffle=True,
+                    num_epochs=1)
+
+    [org_image, shape, glabels_raw, gbboxes_raw, isdifficult] = provider.get(['image', 'shape',
+                                                                         'object/label',
+                                                                         'object/bbox',
+                                                                         'object/difficult'])
+    image, glabels, gbboxes = ssd_preprocessing.preprocess_image(org_image, glabels_raw, gbboxes_raw, [300, 300], is_training=True, data_format='channels_last', output_rgb=True)
+
+    anchor_creator = anchor_manipulator.AnchorCreator([300] * 2,
+                                                    layers_shapes = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
+                                                    anchor_scales = [(0.1,), (0.2,), (0.375,), (0.55,), (0.725,), (0.9,)],
+                                                    extra_anchor_scales = [(0.1414,), (0.2739,), (0.4541,), (0.6315,), (0.8078,), (0.9836,)],
+                                                    anchor_ratios = [(2., .5), (2., 3., .5, 0.3333), (2., 3., .5, 0.3333), (2., 3., .5, 0.3333), (2., .5), (2., .5)],
+                                                    layer_steps = [8, 16, 32, 64, 100, 300])
+
+    all_anchors, all_num_anchors_depth, all_num_anchors_spatial = anchor_creator.get_all_anchors()
+
+    num_anchors_per_layer = []
+    for ind in range(len(all_anchors)):
+        num_anchors_per_layer.append(all_num_anchors_depth[ind] * all_num_anchors_spatial[ind])
+
+    anchor_encoder_decoder = anchor_manipulator.AnchorEncoder(allowed_borders=[1.0] * 6,
+                                                        positive_threshold = 0.5,
+                                                        ignore_threshold = 0.5,
+                                                        prior_scaling=[0.1, 0.1, 0.2, 0.2])
+
+    gt_targets, gt_labels, gt_scores = anchor_encoder_decoder.encode_all_anchors(glabels, gbboxes, all_anchors, all_num_anchors_depth, all_num_anchors_spatial, True)
+
+    anchors = anchor_encoder_decoder._all_anchors
+    # split by layers
+    gt_targets, gt_labels, gt_scores, anchors = tf.split(gt_targets, num_anchors_per_layer, axis=0),\
+                                                tf.split(gt_labels, num_anchors_per_layer, axis=0),\
+                                                tf.split(gt_scores, num_anchors_per_layer, axis=0),\
+                                                [tf.split(anchor, num_anchors_per_layer, axis=0) for anchor in anchors]
+
+    save_image_op = tf.py_func(save_image_with_bbox,
+                            [ssd_preprocessing.unwhiten_image(image),
+                            tf.clip_by_value(tf.concat(gt_labels, axis=0), 0, tf.int64.max),
+                            tf.concat(gt_scores, axis=0),
+                            tf.concat(gt_targets, axis=0)],
+                            tf.int64, stateful=True)
+    return save_image_op
+
+if __name__ == '__main__':
+    save_image_op = slim_get_split('/media/rs/7A0EE8880EE83EAF/Detections/SSD/dataset/tfrecords/train*')
+    # Create the graph, etc.
+    init_op = tf.group([tf.global_variables_initializer(), tf.local_variables_initializer(), tf.tables_initializer()])
+
+    # Create a session for running operations in the Graph.
+    sess = tf.Session()
+    # Initialize the variables (like the epoch counter).
+    sess.run(init_op)
+
+    # Start input enqueue threads.
+    coord = tf.train.Coordinator()
+    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
+
+    try:
+        while not coord.should_stop():
+            # Each run draws the boxes on one image and saves it under ./debug
+            print(sess.run(save_image_op))
+
+    except tf.errors.OutOfRangeError:
+        print('Done training -- epoch limit reached')
+    finally:
+        # When done, ask the threads to stop.
+        coord.request_stop()
+
+    # Wait for threads to finish.
+    coord.join(threads)
+    sess.close()
diff --git a/utils/external/ssd_tensorflow/utility/checkpint_inspect.py b/utils/external/ssd_tensorflow/utility/checkpint_inspect.py
new file mode 100644
index 0000000..2979e88
--- /dev/null
+++ b/utils/external/ssd_tensorflow/utility/checkpint_inspect.py
@@ -0,0 +1,55 @@
+# Copyright 2018 Changan Wang
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#     http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+from tensorflow.python import pywrap_tensorflow
+
+def print_tensors_in_checkpoint_file(file_name, tensor_name, all_tensors):
+    try:
+        reader = pywrap_tensorflow.NewCheckpointReader(file_name)
+        if all_tensors:
+            var_to_shape_map = reader.get_variable_to_shape_map()
+            for key in var_to_shape_map:
+                print("tensor_name: ", key)
+                print(reader.get_tensor(key))
+        elif not tensor_name:
+            print(reader.debug_string().decode("utf-8"))
+        else:
+            print("tensor_name: ", tensor_name)
+            print(reader.get_tensor(tensor_name))
+    except Exception as e:  # pylint: disable=broad-except
+        print(str(e))
+        if "corrupted compressed block contents" in str(e):
+            print("It's likely that your checkpoint file has been compressed "
+                  "with SNAPPY.")
+
+def print_all_tensors_name(file_name):
+    try:
+        reader = pywrap_tensorflow.NewCheckpointReader(file_name)
+        var_to_shape_map = reader.get_variable_to_shape_map()
+        for key in var_to_shape_map:
+            print(key)
+    except Exception as e:  # pylint: disable=broad-except
+        print(str(e))
+        if "corrupted compressed block contents" in str(e):
+            print("It's likely that your checkpoint file has been compressed "
+                  "with SNAPPY.")
+
+if __name__ == "__main__":
+    print_all_tensors_name('./model/vgg16_reducedfc.ckpt')
diff --git a/utils/external/ssd_tensorflow/utility/draw_toolbox.py b/utils/external/ssd_tensorflow/utility/draw_toolbox.py
new file mode 100644
index 0000000..a72ae50
--- /dev/null
+++ b/utils/external/ssd_tensorflow/utility/draw_toolbox.py
@@ -0,0 +1,73 @@
+# Copyright 2018 Changan Wang
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#     http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+import cv2
+import matplotlib.cm as mpcm
+
+from dataset import dataset_common
+
+def gain_translate_table():
+    label2name_table = {}
+    for class_name, labels_pair in dataset_common.VOC_LABELS.items():
+        label2name_table[labels_pair[0]] = class_name
+    return label2name_table
+
+label2name_table = gain_translate_table()
+
+def colors_subselect(colors, num_classes=21):
+    dt = len(colors) // num_classes
+    sub_colors = []
+    for i in range(num_classes):
+        color = colors[i*dt]
+        if isinstance(color[0], float):
+            sub_colors.append([int(c * 255) for c in color])
+        else:
+            sub_colors.append([c for c in color])
+    return sub_colors
+
+colors = colors_subselect(mpcm.plasma.colors, num_classes=21)
+colors_tableau = [(255, 255, 255), (31, 119, 180), (174, 199, 232), (255, 127, 14), (255, 187, 120),
+                 (44, 160, 44), (152, 223, 138), (214, 39, 40), (255, 152, 150),
+                 (148, 103, 189), (197, 176, 213), (140, 86, 75), (196, 156, 148),
+                 (227, 119, 194), (247, 182, 210), (127, 127, 127), (199, 199, 199),
+                 (188, 189, 34), (219, 219, 141), (23, 190, 207), (158, 218, 229)]
+
+def bboxes_draw_on_img(img, classes, scores, bboxes, thickness=2):
+    shape = img.shape
+    scale = 0.4
+    text_thickness = 1
+    line_type = 8
+    for i in range(bboxes.shape[0]):
+        if classes[i] < 1: continue
+        bbox = bboxes[i]
+        color = colors_tableau[classes[i]]
+        # Draw bounding boxes
+        p1 = (int(bbox[0] * shape[0]), int(bbox[1] * shape[1]))
+        p2 = (int(bbox[2] * shape[0]), int(bbox[3] * shape[1]))
+        if (p2[0] - p1[0] < 1) or (p2[1] - p1[1] < 1):
+            continue
+
+        cv2.rectangle(img, p1[::-1], p2[::-1], color, thickness)
+        # Draw text
+        s = '%s/%.1f%%' % (label2name_table[classes[i]], scores[i]*100)
+        # text_size is (width, height)
+        text_size, baseline = cv2.getTextSize(s, cv2.FONT_HERSHEY_SIMPLEX, scale, text_thickness)
+        p1 = (p1[0] - text_size[1], p1[1])
+
+        cv2.rectangle(img, (p1[1] - thickness//2, p1[0] - thickness - baseline), (p1[1] + text_size[0], p1[0] + text_size[1]), color, -1)
+
+        cv2.putText(img, s, (p1[1], p1[0] + baseline), cv2.FONT_HERSHEY_SIMPLEX, scale, (255,255,255), text_thickness, line_type)
+
+    return img
+
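Note on the coordinate convention in bboxes_draw_on_img: boxes appear to be normalized (ymin, xmin, ymax, xmax) values, scaled by the image height and width and then reversed into the (x, y) order that cv2.rectangle expects. The sketch below isolates that conversion. The helper name to_pixel_corners is hypothetical and is not part of this patch.

```python
def to_pixel_corners(bbox, img_shape):
    """Convert a normalized (ymin, xmin, ymax, xmax) box to pixel corner points.

    Hypothetical helper for illustration only. Returns two (x, y) tuples,
    which is the point order OpenCV drawing functions expect.
    """
    height, width = img_shape[0], img_shape[1]
    ymin, xmin, ymax, xmax = bbox
    # Rows (y) scale with the image height, columns (x) with the width.
    top_left = (int(xmin * width), int(ymin * height))
    bottom_right = (int(xmax * width), int(ymax * height))
    return top_left, bottom_right


if __name__ == '__main__':
    # A 200x400 RGB image; the box covers the right half, middle rows.
    print(to_pixel_corners((0.25, 0.5, 0.75, 1.0), (200, 400, 3)))
```

Keeping the (y, x) to (x, y) swap in one place makes it harder to mix up the two orders when drawing the label background and text.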
diff --git a/utils/external/ssd_tensorflow/utility/scaffolds.py b/utils/external/ssd_tensorflow/utility/scaffolds.py
new file mode 100644
index 0000000..820dabb
--- /dev/null
+++ b/utils/external/ssd_tensorflow/utility/scaffolds.py
@@ -0,0 +1,86 @@
+# Copyright 2018 Changan Wang
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#     http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import os
+import sys
+
+import tensorflow as tf
+
+def get_init_fn_for_scaffold(model_dir, checkpoint_path, model_scope, checkpoint_model_scope, checkpoint_exclude_scopes, ignore_missing_vars, name_remap=None):
+    if tf.train.latest_checkpoint(model_dir):
+        tf.logging.info('Ignoring --checkpoint_path because a checkpoint already exists in %s.' % model_dir)
+        return None
+    exclusion_scopes = []
+    if checkpoint_exclude_scopes:
+        exclusion_scopes = [scope.strip() for scope in checkpoint_exclude_scopes.split(',')]
+
+    variables_to_restore = []
+    for var in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES):
+        excluded = False
+        for exclusion in exclusion_scopes:
+            if exclusion in var.op.name:#.startswith(exclusion):
+                excluded = True
+                break
+        if not excluded:
+            variables_to_restore.append(var)
+    if checkpoint_model_scope is not None:
+        if checkpoint_model_scope.strip() == '':
+            variables_to_restore = {var.op.name.replace(model_scope + '/', ''): var for var in variables_to_restore}
+        else:
+            variables_to_restore = {var.op.name.replace(model_scope, checkpoint_model_scope.strip()): var for var in variables_to_restore}
+        if name_remap is not None:
+            renamed_variables_to_restore = dict()
+            for var_name, var in variables_to_restore.items():
+                found = False
+                for k, v in name_remap.items():
+                    if k in var_name:
+                        renamed_variables_to_restore[var_name.replace(k, v)] = var
+                        found = True
+                        break
+                if not found:
+                    renamed_variables_to_restore[var_name] = var
+            variables_to_restore = renamed_variables_to_restore
+
+    checkpoint_path = tf.train.latest_checkpoint(checkpoint_path) if tf.gfile.IsDirectory(checkpoint_path) else checkpoint_path
+
+    tf.logging.info('Fine-tuning from %s. Ignoring missing vars: %s.' % (checkpoint_path, ignore_missing_vars))
+
+    if not variables_to_restore:
+        raise ValueError('variables_to_restore cannot be empty')
+    if ignore_missing_vars:
+        reader = tf.train.NewCheckpointReader(checkpoint_path)
+        if isinstance(variables_to_restore, dict):
+            var_dict = variables_to_restore
+        else:
+            var_dict = {var.op.name: var for var in variables_to_restore}
+        available_vars = {}
+        for var in var_dict:
+            if reader.has_tensor(var):
+                available_vars[var] = var_dict[var]
+            else:
+                tf.logging.warning('Variable %s missing in checkpoint %s.', var, checkpoint_path)
+        variables_to_restore = available_vars
+    if variables_to_restore:
+        saver = tf.train.Saver(variables_to_restore, reshape=False)
+        saver.build()
+        def callback(scaffold, session):
+            saver.restore(session, checkpoint_path)
+        return callback
+    else:
+        tf.logging.warning('No Variables to restore.')
+        return None
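Note on the renaming step in get_init_fn_for_scaffold: graph variable names are first moved from model_scope to checkpoint_model_scope, then the first matching name_remap key is substituted. The sketch below reproduces that logic on plain strings so it can be checked without TensorFlow. The function name remap_checkpoint_names and the example scopes and variable names are assumptions made for illustration.

```python
def remap_checkpoint_names(var_names, model_scope, checkpoint_model_scope, name_remap=None):
    """Map checkpoint tensor names to graph variable names (illustrative sketch).

    Works on strings only; in the patch the dict values are tf.Variable objects.
    """
    scope = checkpoint_model_scope.strip()
    if scope == '':
        # The checkpoint has no top-level scope: drop "<model_scope>/".
        mapping = {name.replace(model_scope + '/', ''): name for name in var_names}
    else:
        mapping = {name.replace(model_scope, scope): name for name in var_names}

    if name_remap:
        renamed = {}
        for ckpt_name, graph_name in mapping.items():
            for old, new in name_remap.items():
                if old in ckpt_name:
                    # Only the first matching key is applied, as in the patch.
                    ckpt_name = ckpt_name.replace(old, new)
                    break
            renamed[ckpt_name] = graph_name
        mapping = renamed
    return mapping


if __name__ == '__main__':
    names = ['ssd300/conv1/kernel', 'ssd300/conv1/bias']  # hypothetical names
    remap = {'/kernel': '/weights', '/bias': '/biases'}
    print(remap_checkpoint_names(names, 'ssd300', 'vgg_16', remap))
```

Because matching uses substring containment rather than a prefix test, a short key can match more variables than intended; keys that include the leading slash, as above, narrow the match.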
diff --git a/utils/external/ssd_tensorflow/voc_eval.py b/utils/external/ssd_tensorflow/voc_eval.py
new file mode 100644
index 0000000..27fcbfc
--- /dev/null
+++ b/utils/external/ssd_tensorflow/voc_eval.py
@@ -0,0 +1,258 @@
+# Copyright 2018 Changan Wang
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#     http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# =============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import sys
+import os
+import numpy as np
+import pickle
+
+if sys.version_info[0] == 2:
+    import xml.etree.cElementTree as ET
+else:
+    import xml.etree.ElementTree as ET
+
+from utils.external.ssd_tensorflow.dataset import dataset_common
+
+def parse_rec(filename):
+    """ Parse a PASCAL VOC xml file """
+    tree = ET.parse(filename)
+    objects = []
+    for obj in tree.findall('object'):
+        obj_struct = {}
+        obj_struct['name'] = obj.find('name').text
+        obj_struct['pose'] = obj.find('pose').text
+        obj_struct['truncated'] = int(obj.find('truncated').text)
+        obj_struct['difficult'] = int(obj.find('difficult').text)
+        bbox = obj.find('bndbox')
+        obj_struct['bbox'] = [int(bbox.find('xmin').text) - 1,
+                              int(bbox.find('ymin').text) - 1,
+                              int(bbox.find('xmax').text) - 1,
+                              int(bbox.find('ymax').text) - 1]
+        objects.append(obj_struct)
+
+    return objects
+
+def do_python_eval(dataset_path, pred_path, use_07=True):
+    output_path = os.path.join(pred_path, 'eval_output')
+    cache_path = os.path.join(pred_path, 'eval_cache')
+    anno_files = os.path.join(dataset_path, 'Annotations/{}.xml')
+    all_images_file = os.path.join(dataset_path, 'ImageSets/Main/test.txt')
+
+    aps = []
+    # The PASCAL VOC metric changed in 2010
+    use_07_metric = use_07
+    print('VOC07 metric? ' + ('Yes' if use_07_metric else 'No'))
+    if not os.path.isdir(output_path):
+        os.mkdir(output_path)
+    for cls_name, cls_pair in dataset_common.VOC_LABELS.items():
+        if 'none' in cls_name:
+            continue
+        cls_id = cls_pair[0]
+        filename = os.path.join(pred_path, 'results_%d.txt' % cls_id)
+        rec, prec, ap = voc_eval(filename, anno_files,
+                 all_images_file, cls_name, cache_path,
+                ovthresh=0.5, use_07_metric=use_07_metric)
+        aps += [ap]
+        print('AP for {} = {:.4f}'.format(cls_name, ap))
+        with open(os.path.join(output_path, cls_name + '_pr.pkl'), 'wb') as f:
+            pickle.dump({'rec': rec, 'prec': prec, 'ap': ap}, f)
+    print('Mean AP = {:.4f}'.format(np.mean(aps)))
+    print('~~~~~~~~')
+    print('Results:')
+    for ap in aps:
+        print('{:.3f}'.format(ap))
+    print('{:.3f}'.format(np.mean(aps)))
+    print('~~~~~~~~')
+    print('')
+    print('--------------------------------------------------------------')
+    print('Results computed with the **unofficial** Python eval code.')
+    print('Results should be very close to the official MATLAB eval code.')
+    print('--------------------------------------------------------------')
+
+
+def voc_ap(rec, prec, use_07_metric=True):
+    """ ap = voc_ap(rec, prec, [use_07_metric])
+    Compute VOC AP given precision and recall.
+    If use_07_metric is true, uses the
+    VOC 07 11-point method (default: True).
+    """
+    if use_07_metric:
+        # 11 point metric
+        ap = 0.
+        for t in np.arange(0., 1.1, 0.1):
+            if np.sum(rec >= t) == 0:
+                p = 0
+            else:
+                p = np.max(prec[rec >= t])
+            ap = ap + p / 11.
+    else:
+        # correct AP calculation
+        # first add sentinel values at both ends
+        mrec = np.concatenate(([0.], rec, [1.]))
+        mpre = np.concatenate(([0.], prec, [0.]))
+
+        # compute the precision envelope
+        for i in range(mpre.size - 1, 0, -1):
+            mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])
+
+        # to calculate area under PR curve, look for points
+        # where X axis (recall) changes value
+        i = np.where(mrec[1:] != mrec[:-1])[0]
+
+        # and sum (\Delta recall) * prec
+        ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
+    return ap
+
+
+def voc_eval(detpath,
+             annopath,
+             imagesetfile,
+             classname,
+             cachedir,
+             ovthresh=0.5,
+             use_07_metric=True):
+    """rec, prec, ap = voc_eval(detpath,
+                               annopath,
+                               imagesetfile,
+                               classname,
+                               cachedir,
+                               [ovthresh], [use_07_metric])
+        Top level function that does the PASCAL VOC evaluation.
+        detpath: Path to detections
+           detpath.format(classname) should produce the detection results file.
+        annopath: Path to annotations
+           annopath.format(imagename) should be the xml annotations file.
+        imagesetfile: Text file containing the list of images, one image per line.
+        classname: Category name
+        cachedir: Directory for caching the annotations
+        [ovthresh]: Overlap threshold (default = 0.5)
+        [use_07_metric]: Whether to use VOC07's 11 point AP computation
+           (default True)
+    """
+    # assumes detections are in detpath.format(classname)
+    # assumes annotations are in annopath.format(imagename)
+    # assumes imagesetfile is a text file with each line an image name
+    # cachedir caches the annotations in a pickle file
+    # first load gt
+    if not os.path.isdir(cachedir):
+        os.mkdir(cachedir)
+    cachefile = os.path.join(cachedir, 'annots.pkl')
+    # read list of images
+    with open(imagesetfile, 'r') as f:
+        lines = f.readlines()
+    imagenames = [x.strip() for x in lines]
+    if not os.path.isfile(cachefile):
+        # load annots
+        recs = {}
+        for i, imagename in enumerate(imagenames):
+            recs[imagename] = parse_rec(annopath.format(imagename))
+            if i % 100 == 0:
+                print('Reading annotation for {:d}/{:d}'.format(
+                   i + 1, len(imagenames)))
+        # save
+        print('Saving cached annotations to {:s}'.format(cachefile))
+        with open(cachefile, 'wb') as f:
+            pickle.dump(recs, f)
+    else:
+        # load
+        with open(cachefile, 'rb') as f:
+            recs = pickle.load(f)
+
+    # extract gt objects for this class
+    class_recs = {}
+    npos = 0
+
+    for imagename in imagenames:
+        R = [obj for obj in recs[imagename] if obj['name'] == classname]
+        bbox = np.array([x['bbox'] for x in R])
+        difficult = np.array([x['difficult'] for x in R]).astype(bool)
+        det = [False] * len(R)
+        npos = npos + sum(~difficult)
+        class_recs[imagename] = {'bbox': bbox,
+                                 'difficult': difficult,
+                                 'det': det}
+    # read dets
+    with open(detpath, 'r') as f:
+        lines = f.readlines()
+
+    if any(lines):
+
+        splitlines = [x.strip().split(' ') for x in lines]
+        image_ids = [x[0] for x in splitlines]
+        confidence = np.array([float(x[1]) for x in splitlines])
+        BB = np.array([[float(z) for z in x[2:]] for x in splitlines])
+
+        # sort by confidence
+        sorted_ind = np.argsort(-confidence)
+        sorted_scores = np.sort(-confidence)
+        BB = BB[sorted_ind, :]
+        image_ids = [image_ids[x] for x in sorted_ind]
+
+        # go down dets and mark TPs and FPs
+        nd = len(image_ids)
+        tp = np.zeros(nd)
+        fp = np.zeros(nd)
+        for d in range(nd):
+            R = class_recs[image_ids[d]]
+            bb = BB[d, :].astype(float)
+            ovmax = -np.inf
+            BBGT = R['bbox'].astype(float)
+            if BBGT.size > 0:
+                # compute overlaps
+                # intersection
+                ixmin = np.maximum(BBGT[:, 0], bb[0])
+                iymin = np.maximum(BBGT[:, 1], bb[1])
+                ixmax = np.minimum(BBGT[:, 2], bb[2])
+                iymax = np.minimum(BBGT[:, 3], bb[3])
+                iw = np.maximum(ixmax - ixmin, 0.)
+                ih = np.maximum(iymax - iymin, 0.)
+                inters = iw * ih
+                uni = ((bb[2] - bb[0]) * (bb[3] - bb[1]) +
+                       (BBGT[:, 2] - BBGT[:, 0]) *
+                       (BBGT[:, 3] - BBGT[:, 1]) - inters)
+                overlaps = inters / uni
+                ovmax = np.max(overlaps)
+                jmax = np.argmax(overlaps)
+
+            if ovmax > ovthresh:
+                if not R['difficult'][jmax]:
+                    if not R['det'][jmax]:
+                        tp[d] = 1.
+                        R['det'][jmax] = 1
+                    else:
+                        fp[d] = 1.
+            else:
+                fp[d] = 1.
+
+        # compute precision recall
+        fp = np.cumsum(fp)
+        tp = np.cumsum(tp)
+        rec = tp / float(npos)
+        # avoid divide by zero in case the first detection matches a difficult
+        # ground truth
+        prec = tp / np.maximum(tp + fp, np.finfo(np.float64).eps)
+        ap = voc_ap(rec, prec, use_07_metric)
+    else:
+        rec = -1.
+        prec = -1.
+        ap = -1.
+
+    return rec, prec, ap
+
+if __name__ == '__main__':
+    do_python_eval(sys.argv[1], sys.argv[2])  # usage: voc_eval.py <dataset_path> <pred_path>
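Note on voc_ap: the two branches compute different quantities. The VOC07 branch averages the best precision at 11 recall thresholds, while the other branch integrates the monotone precision envelope over recall. The NumPy-free sketch below, under the hypothetical name average_precision, is meant only to make the difference easy to check by hand. It steps thresholds as k / 10 rather than with np.arange, so results may differ from voc_ap when a recall value sits exactly on a threshold.

```python
def average_precision(recall, precision, use_07_metric=True):
    """AP from parallel recall/precision lists (illustrative, pure Python).

    The lists are assumed to be ordered by descending detection confidence,
    as produced by the cumulative TP/FP sums in voc_eval.
    """
    if use_07_metric:
        # VOC2007: mean over t = 0.0, 0.1, ..., 1.0 of max precision at recall >= t.
        ap = 0.0
        for k in range(11):
            t = k / 10.0
            candidates = [p for r, p in zip(recall, precision) if r >= t]
            ap += (max(candidates) if candidates else 0.0) / 11.0
        return ap

    # VOC2010 and later: area under the monotonically decreasing envelope.
    mrec = [0.0] + list(recall) + [1.0]
    mpre = [0.0] + list(precision) + [0.0]
    for i in range(len(mpre) - 1, 0, -1):
        mpre[i - 1] = max(mpre[i - 1], mpre[i])
    # Sum (delta recall) * precision wherever recall changes.
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1]
               for i in range(len(mrec) - 1) if mrec[i + 1] != mrec[i])


if __name__ == '__main__':
    rec = [0.5, 0.5, 1.0]
    prec = [1.0, 0.5, 2.0 / 3.0]
    print(average_precision(rec, prec, use_07_metric=True))
    print(average_precision(rec, prec, use_07_metric=False))
```

For this three-detection example the 11-point value works out to 28/33 and the area-based value to 5/6, which shows that the two metrics are not interchangeable when comparing reported results.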
diff --git a/utils/get_idle_gpus.py b/utils/get_idle_gpus.py
index c9e5dcf..150d916 100644
--- a/utils/get_idle_gpus.py
+++ b/utils/get_idle_gpus.py
@@ -14,7 +14,10 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ==============================================================================
-"""Get a list of idle GPUs."""
+"""Get a list of idle GPUs.
+
+This script sorts GPUs in ascending order of memory usage and returns the top-k ones.
+"""
 
 import os
 import sys
@@ -24,37 +27,28 @@
 assert len(sys.argv) == 2
 nb_idle_gpus = int(sys.argv[1])
 
-# dump the output of "nvidia-smi" command to file
-dump_file = './nvidia-smi-dump'
-with open(dump_file, 'w') as o_file:
-  subprocess.call(['nvidia-smi'], stdout=o_file)
+# assume: an idle GPU has less than 50% of its total memory in use
+mem_usage_ulimit = .5
 
-# parse the output of "nvidia-smi" command
-with open(dump_file, 'r') as i_file:
-  # obtain list of all & busy GPUs
-  parse_procs = False
-  all_gpus, busy_gpus = [], []
-  for i_line in i_file:
-    if 'Processes' in i_line:
-      parse_procs = True
-    sub_strs = i_line.split()
-    if len(sub_strs) < 2:
-      continue
-    if not parse_procs:
-      if sub_strs[1].isdigit():
-        all_gpus.append(sub_strs[1])
-    else:
-      if sub_strs[1].isdigit():
-        busy_gpus.append(sub_strs[1])
+# query each GPU's index, used memory, and total memory (in MiB);
+# with "csv,noheader,nounits", each output line has the format:
+# gpu id, memory used, total memory
+cmd = 'nvidia-smi --query-gpu=index,memory.used,memory.total ' \
+  '--format=csv,noheader,nounits'
+gpu_smi_output = subprocess.check_output(cmd, shell=True)
+gpu_smi_output = gpu_smi_output.decode('utf-8')
 
-  # obtain list of idle GPUs
-  idle_gpus = list(set(all_gpus) - set(busy_gpus))
-  idle_gpus.sort()
-  if len(idle_gpus) < nb_idle_gpus:
-    raise ValueError('not enough idle GPUs; idle GPUs are: {}'.format(idle_gpus))
-  idle_gpus = idle_gpus[:nb_idle_gpus]
-  idle_gpus_str = ','.join([str(idle_gpu) for idle_gpu in idle_gpus])
-  print(idle_gpus_str)
+idle_gpus = []
+for gpu in gpu_smi_output.split(sep='\n')[:-1]:
+  (gpu_id, mem_used, mem_total) = [int(value) for value in gpu.split(sep=',')]
+  mem_usage = float(mem_used) / mem_total
+  if mem_usage < mem_usage_ulimit:
+    idle_gpus += [(gpu_id, mem_usage)]
+idle_gpus.sort(key=lambda x: x[1])
+idle_gpus = [x[0] for x in idle_gpus]  # only keep GPU ids
 
-# remove the dump file
-os.remove(dump_file)
+if len(idle_gpus) < nb_idle_gpus:
+  raise ValueError('not enough idle GPUs; idle GPUs are: {}'.format(idle_gpus))
+idle_gpus = idle_gpus[:nb_idle_gpus]
+idle_gpus_str = ','.join([str(idle_gpu) for idle_gpu in idle_gpus])
+print(idle_gpus_str)
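Note on the new selection logic: the parsing and filtering can be exercised without a GPU by feeding captured CSV text to a small function. The sketch below follows the same steps as the script (parse, filter by usage ratio, sort ascending, take the first k). The function name select_idle_gpus and the sample memory figures are made up for illustration.

```python
def select_idle_gpus(smi_csv, nb_idle_gpus, mem_usage_ulimit=0.5):
    """Pick the least-used GPUs from 'index, memory.used, memory.total' CSV text.

    Illustrative sketch; the input format matches what
    'nvidia-smi --query-gpu=index,memory.used,memory.total
    --format=csv,noheader,nounits' prints.
    """
    idle_gpus = []
    for line in smi_csv.strip().splitlines():
        # int() tolerates the space that follows each comma in the CSV output.
        gpu_id, mem_used, mem_total = [int(value) for value in line.split(',')]
        mem_usage = float(mem_used) / mem_total
        if mem_usage < mem_usage_ulimit:
            idle_gpus.append((gpu_id, mem_usage))
    idle_gpus.sort(key=lambda x: x[1])  # least-used first
    if len(idle_gpus) < nb_idle_gpus:
        raise ValueError('not enough idle GPUs; idle GPUs are: {}'.format(
            [gpu_id for gpu_id, _ in idle_gpus]))
    return [gpu_id for gpu_id, _ in idle_gpus[:nb_idle_gpus]]


if __name__ == '__main__':
    sample = '0, 9000, 11178\n1, 10, 11178\n2, 3000, 11178\n'  # made-up readings
    print(','.join(str(g) for g in select_idle_gpus(sample, 2)))
```

Separating the parsing from the subprocess call also makes the ValueError path testable, since a request for more GPUs than are idle can be simulated with a short string.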
diff --git a/utils/lrn_rate_utils.py b/utils/lrn_rate_utils.py
index 94081da..6fa1e28 100644
--- a/utils/lrn_rate_utils.py
+++ b/utils/lrn_rate_utils.py
@@ -18,26 +18,8 @@
 
 import tensorflow as tf
 
-from utils.multi_gpu_wrapper import MultiGpuWrapper as mgw
-
 FLAGS = tf.app.flags.FLAGS
 
-# set <nb_epochs_rat> to values smaller than 1.0 to use fewer epochs and speed up training
-tf.app.flags.DEFINE_float('nb_epochs_rat', 1.0, '# of training epochs\'s ratio')
-
-def calc_nb_batches(nb_epochs, batch_size):
-  """Calculate the number of mini-batches.
-
-  Args:
-  * nb_epochs: number of epoches
-  * batch_size: number of samples in each mini-batch
-
-  Returns:
-  * nb_batches: number of mini-batches
-  """
-
-  return int(FLAGS.nb_smpls_train * nb_epochs * FLAGS.nb_epochs_rat / batch_size)
-
 def setup_lrn_rate_piecewise_constant(global_step, batch_size, idxs_epoch, decay_rates):
   """Setup the learning rate with piecewise constant strategy.
 
@@ -51,7 +33,7 @@ def setup_lrn_rate_piecewise_constant(global_step, batch_size, idxs_epoch, decay
   * lrn_rate: learning rate
   """
 
-  # adjust the interval endpoints
+  # adjust interval endpoints w.r.t. FLAGS.nb_epochs_rat
   idxs_epoch = [idx_epoch * FLAGS.nb_epochs_rat for idx_epoch in idxs_epoch]
 
   # setup learning rate with the piecewise constant strategy
@@ -59,8 +41,9 @@ def setup_lrn_rate_piecewise_constant(global_step, batch_size, idxs_epoch, decay
   nb_batches_per_epoch = float(FLAGS.nb_smpls_train) / batch_size
   bnds = [int(nb_batches_per_epoch * idx_epoch) for idx_epoch in idxs_epoch]
   vals = [lrn_rate_init * decay_rate for decay_rate in decay_rates]
+  lrn_rate = tf.train.piecewise_constant(global_step, bnds, vals)
 
-  return tf.train.piecewise_constant(global_step, bnds, vals)
+  return lrn_rate
 
 def setup_lrn_rate_exponential_decay(global_step, batch_size, epoch_step, decay_rate):
   """Setup the learning rate with exponential decaying strategy.
@@ -75,154 +58,13 @@ def setup_lrn_rate_exponential_decay(global_step, batch_size, epoch_step, decay_
   * lrn_rate: learning rate
   """
 
-  # adjust the step size & decaying rate
+  # adjust the step size & decaying rate w.r.t. FLAGS.nb_epochs_rat
   epoch_step *= FLAGS.nb_epochs_rat
 
   # setup learning rate with the exponential decay strategy
   lrn_rate_init = FLAGS.lrn_rate_init * batch_size / FLAGS.batch_size_norm
   batch_step = int(FLAGS.nb_smpls_train * epoch_step / batch_size)
+  lrn_rate = tf.train.exponential_decay(
+    lrn_rate_init, tf.cast(global_step, tf.int32), batch_step, decay_rate, staircase=True)
 
-  return tf.train.exponential_decay(
-    lrn_rate_init, global_step, batch_step, decay_rate, staircase=True)
-
-def setup_lrn_rate_lenet_cifar10(global_step, batch_size):
-  """Setup the learning rate for LeNet-like models on the CIFAR-10 dataset.
-
-  Args:
-  * global_step: training iteration counter
-  * batch_size: number of samples in each mini-batch
-
-  Returns:
-  * lrn_rate: learning rate
-  * nb_batches: number of mini-batches
-  """
-
-  nb_epochs = 250
-  idxs_epoch = [100, 150, 200]
-  decay_rates = [1.0, 0.1, 0.01, 0.001]
-  lrn_rate = setup_lrn_rate_piecewise_constant(global_step, batch_size, idxs_epoch, decay_rates)
-  nb_batches = calc_nb_batches(nb_epochs, batch_size)
-
-  return lrn_rate, nb_batches
-
-def setup_lrn_rate_resnet_cifar10(global_step, batch_size):
-  """Setup the learning rate for ResNet models on the CIFAR-10 dataset.
-
-  Args:
-  * global_step: training iteration counter
-  * batch_size: number of samples in each mini-batch
-
-  Returns:
-  * lrn_rate: learning rate
-  * nb_batches: number of mini-batches
-  """
-
-  nb_epochs = 250
-  idxs_epoch = [100, 150, 200]
-  decay_rates = [1.0, 0.1, 0.01, 0.001]
-  lrn_rate = setup_lrn_rate_piecewise_constant(global_step, batch_size, idxs_epoch, decay_rates)
-  nb_batches = calc_nb_batches(nb_epochs, batch_size)
-
-  return lrn_rate, nb_batches
-
-def setup_lrn_rate_resnet_ilsvrc12(global_step, batch_size):
-  """Setup the learning rate for ResNet models on the ILSVRC-12 dataset.
-
-  Args:
-  * global_step: training iteration counter
-  * batch_size: number of samples in each mini-batch
-
-  Returns:
-  * lrn_rate: learning rate
-  * nb_batches: number of mini-batches
-  """
-
-  nb_epochs = 100
-  idxs_epoch = [30, 60, 80, 90]
-  decay_rates = [1.0, 0.1, 0.01, 0.001, 0.0001]
-  lrn_rate = setup_lrn_rate_piecewise_constant(global_step, batch_size, idxs_epoch, decay_rates)
-  nb_batches = calc_nb_batches(nb_epochs, batch_size)
-
-  return lrn_rate, nb_batches
-
-def setup_lrn_rate_mobilenet_v1_ilsvrc12(global_step, batch_size):
-  """Setup the learning rate for MobileNet-v1 models on the ILSVRC-12 dataset.
-
-  Args:
-  * global_step: training iteration counter
-  * batch_size: number of samples in each mini-batch
-
-  Returns:
-  * lrn_rate: learning rate
-  * nb_batches: number of mini-batches
-  """
-
-  nb_epochs = 100
-  idxs_epoch = [30, 60, 80, 90]
-  decay_rates = [1.0, 0.1, 0.01, 0.001, 0.0001]
-  lrn_rate = setup_lrn_rate_piecewise_constant(global_step, batch_size, idxs_epoch, decay_rates)
-  nb_batches = calc_nb_batches(nb_epochs, batch_size)
-
-  return lrn_rate, nb_batches
-
-def setup_lrn_rate_mobilenet_v2_ilsvrc12(global_step, batch_size):
-  """Setup the learning rate for MobileNet-v2 models on the ILSVRC-12 dataset.
-
-  Args:
-  * global_step: training iteration counter
-  * batch_size: number of samples in each mini-batch
-
-  Returns:
-  * lrn_rate: learning rate
-  * nb_batches: number of mini-batches
-  """
-
-  nb_epochs = 412
-  epoch_step = 2.5
-  decay_rate = 0.98 ** epoch_step
-  lrn_rate = setup_lrn_rate_exponential_decay(global_step, batch_size, epoch_step, decay_rate)
-  nb_batches = calc_nb_batches(nb_epochs, batch_size)
-
-  return lrn_rate, nb_batches
-
-def setup_lrn_rate(global_step, model_name, dataset_name):
-  """Setup the learning rate for the given dataset.
-
-  Args:
-  * global_step: training iteration counter
-  * model_name: model's name; must be one of ['lenet', 'resnet_*', 'mobilenet_v1', 'mobilenet_v2']
-  * dataset_name: dataset's name; must be one of ['cifar_10', 'ilsvrc_12']
-
-  Returns:
-  * lrn_rate: learning rate
-  * nb_batches: number of training mini-batches
-  """
-
-  # obtain the overall batch size across all GPUs
-  if not FLAGS.enbl_multi_gpu:
-    batch_size = FLAGS.batch_size
-  else:
-    batch_size = FLAGS.batch_size * mgw.size()
-
-  # choose a learning rate protocol according to the model & dataset combination
-  global_step = tf.cast(global_step, tf.int32)
-  if dataset_name == 'cifar_10':
-    if model_name == 'lenet':
-      lrn_rate, nb_batches = setup_lrn_rate_lenet_cifar10(global_step, batch_size)
-    elif model_name.startswith('resnet'):
-      lrn_rate, nb_batches = setup_lrn_rate_resnet_cifar10(global_step, batch_size)
-    else:
-      raise NotImplementedError('model: {} / dataset: {}'.format(model_name, dataset_name))
-  elif dataset_name == 'ilsvrc_12':
-    if model_name.startswith('resnet'):
-      lrn_rate, nb_batches = setup_lrn_rate_resnet_ilsvrc12(global_step, batch_size)
-    elif model_name.startswith('mobilenet_v1'):
-      lrn_rate, nb_batches = setup_lrn_rate_mobilenet_v1_ilsvrc12(global_step, batch_size)
-    elif model_name.startswith('mobilenet_v2'):
-      lrn_rate, nb_batches = setup_lrn_rate_mobilenet_v2_ilsvrc12(global_step, batch_size)
-    else:
-      raise NotImplementedError('model: {} / dataset: {}'.format(model_name, dataset_name))
-  else:
-    raise NotImplementedError('dataset: ' + dataset_name)
-
-  return lrn_rate, nb_batches
+  return lrn_rate
diff --git a/utils/misc_utils.py b/utils/misc_utils.py
index aa811ef..7388c8e 100644
--- a/utils/misc_utils.py
+++ b/utils/misc_utils.py
@@ -22,6 +22,18 @@
 
 FLAGS = tf.app.flags.FLAGS
 
+def auto_barrier(mpi_comm=None):
+  """Insert an MPI barrier for multi-GPU training; do nothing for single-GPU training.
+
+  Args:
+  * mpi_comm: MPI communicator object (required when multi-GPU training is enabled)
+  """
+
+  if FLAGS.enbl_multi_gpu:
+    if mpi_comm is None:
+      raise ValueError('mpi_comm must be provided when multi-GPU training is enabled')
+    mpi_comm.Barrier()
+
 def is_primary_worker(scope='global'):
   """Check whether is the primary worker of all nodes (global) or the current node (local).