Data Science Toolkit (DST) is a Python library that simplifies data science projects end to end, from data ingestion and preprocessing to modeling, geospatial analysis, computer vision, text vectorization, and reinforcement learning.
It bundles practical, production-friendly utilities and higher-level abstractions so you can move faster while keeping control over the details.
- Data handling: DataFrame for loading CSV/JSON/Excel/Parquet, cleaning, transforming, and streaming large datasets.
- Modeling: Model for traditional ML and deep learning training, cross-validation, metrics, and GPU helpers.
- Text & NLP: Vectorizer for bag-of-words/TF-IDF, tokenization, cosine similarity, and projections.
- Charts: Chart utilities for quick exploratory visuals with Matplotlib/Seaborn/Plotly.
- GIS: GIS for geospatial data layers, joins, CRS transforms, area/perimeter, and exports.
- Computer Vision: ImageFactory for resizing, cropping, contour detection, blending, and basic filters.
- Reinforcement Learning: Environment and R3 tools to explore policies and custom environments.
- Crop Simulation: CSM modules for crop water requirement, ET simulations, and monitoring pipelines.
- Utilities: Lib with climate, math, text processing, IO helpers, and more.
DST is published as data-science-toolkit.
pip install data-science-toolkit

If you’re installing from source (for development):
git clone https://github.com/elhachimi-ch/dst.git
cd dst
pip install -e .

Notes:
- Requires Python 3.5+.
- Some features (e.g., deep learning, GIS, CV) pull heavier dependencies (TensorFlow, CatBoost, OpenCV, Geo stack). Install times may vary.
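Because the heavier dependencies are only needed by some features, it can help to check which ones are already importable before running an example. The sketch below uses only the standard library. The import names are assumptions: `tensorflow`, `catboost`, and `cv2` are the usual module names for the packages listed above, while `geopandas` is a guess at what the "Geo stack" includes. The function name `check_optional_dependencies` is hypothetical and not part of DST.

```python
import importlib.util

# Assumed import names for the optional stacks mentioned in the notes.
# "geopandas" stands in for the geo stack and is a guess.
OPTIONAL_MODULES = {
    "deep learning": "tensorflow",
    "gradient boosting": "catboost",
    "computer vision": "cv2",
    "GIS": "geopandas",
}


def check_optional_dependencies(modules=OPTIONAL_MODULES):
    """Return {feature: bool} telling whether each module can be found."""
    status = {}
    for feature, module_name in modules.items():
        # find_spec locates a top-level module without importing it
        status[feature] = importlib.util.find_spec(module_name) is not None
    return status


if __name__ == "__main__":
    for feature, available in check_optional_dependencies().items():
        print("{:<18} {}".format(feature, "available" if available else "missing"))
```

Anything reported as missing can then be installed before using the corresponding feature.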
from data_science_toolkit.dataframe import DataFrame
from data_science_toolkit.model import Model
# Load a toy dataset
data = DataFrame()
data.load_dataset('iris')
y = data.get_column('target')
data.drop_column('target')
# Fit a decision tree
model = Model(data_x=data.get_dataframe(), data_y=y, model_type='dt', training_percent=0.8)
model.train()
model.report() # classification metrics
model.cross_validation(5)

from data_science_toolkit.dataframe import DataFrame
# Stream a Parquet dataset efficiently
df = DataFrame(data_path="path/to/parquet/dir", data_type="parquet", n_workers="auto")
summary = df.describe() # computes per-column stats without loading entire data into RAM
print(summary)

from data_science_toolkit.vectorizer import Vectorizer
documents = [
    "data science is fun",
    "toolkits help data workflows",
    "science advances with good tools",
]
vec = Vectorizer(documents_as_list=documents, vectorizer_type='tfidf', ngram_tuple=(1,2))
matrix = vec.get_matrix()
features = vec.get_features_names()
print(len(features), features[:10])

from data_science_toolkit.gis import GIS
gis = GIS()
gis.add_data_layer("parcels", "data/parcels.geojson", data_type="sf")
gis.add_area_column("parcels", unit="ha")
gis.to_crs("parcels", epsg="3857")
gis.export("parcels", "out/parcels_3857", file_format="geojson")

from data_science_toolkit.imagefactory import ImageFactory
img = ImageFactory("data/sample.jpg")
img.to_gray_scale()
img.gaussian_blur((5,5))
img.save("out/processed.jpg")

Full API docs and tutorials live at: https://data-science-toolkit.readthedocs.io
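As background for the Parquet streaming example earlier in this section: per-column statistics can be computed in a single pass with constant memory, which is the general idea behind summarizing data that does not fit in RAM. The sketch below illustrates that idea with Welford's algorithm in plain Python. It is not DST's implementation, and the names `RunningStats` and `describe_stream` are hypothetical.

```python
import math


class RunningStats:
    """Single-pass count/mean/std/min/max for one numeric column (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def update(self, value):
        """Fold one value into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        self.min = min(self.min, value)
        self.max = max(self.max, value)

    @property
    def std(self):
        # Sample standard deviation; undefined for fewer than two values
        if self.count < 2:
            return float("nan")
        return math.sqrt(self._m2 / (self.count - 1))


def describe_stream(chunks):
    """Consume an iterable of chunks (lists of numbers) without storing them."""
    stats = RunningStats()
    for chunk in chunks:
        for value in chunk:
            stats.update(value)
    return stats


# Chunks could come from a generator that reads one file or row group at a time
stats = describe_stream(iter([[1.0, 2.0], [3.0, 4.0], [5.0]]))
print(stats.count, stats.mean, stats.std, stats.min, stats.max)
```

Only the running totals are kept between chunks, so memory use does not grow with the size of the dataset.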
Contributions and suggestions are welcome via GitHub pull requests.
Typical workflow:
- Fork the repo and create a feature branch.
- Install dev dependencies:
  pip install -e .
- Add tests or notebook snippets where relevant.
- Open a PR with a clear description and examples.
We’re actively enhancing the repo with new algorithms and utilities. Feedback on priorities is appreciated.
MIT License. See the LICENSE file for details.
If you use DST in academic work, please cite the repository and (optionally) reference the Code Ocean capsule for reproducibility: https://codeocean.com/capsule/1309232/tree
Additionally, please cite the following paper:
El Hachimi, Chouaib; Belaqziz, Salwa; Khabba, Saïd; Chehbouni, Abdelghani. 2022. "Data Science Toolkit: An All-in-One Python Library to Help Researchers and Practitioners in Implementing Data Science-Related Algorithms with Less Effort." Software Impacts 12:100240. https://doi.org/10.1016/J.SIMPA.2022.100240
BibTeX (optional):
@article{ElHachimi2022,
author = {Chouaib El Hachimi and Salwa Belaqziz and Saïd Khabba and Abdelghani Chehbouni},
doi = {10.1016/J.SIMPA.2022.100240},
issn = {2665-9638},
journal = {Software Impacts},
month = {5},
pages = {100240},
publisher = {Elsevier},
title = {Data Science Toolkit: An all-in-one python library to help researchers and practitioners in implementing data science-related algorithms with less effort},
volume = {12},
url = {https://linkinghub.elsevier.com/retrieve/pii/S2665963822000124},
year = {2022}
}