diff --git a/.gitignore b/.gitignore index 1313f538..1a4e73cc 100644 --- a/.gitignore +++ b/.gitignore @@ -3,3 +3,5 @@ node_modules test/fixtures/* test/samples/* *.xlsx + +testsss diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index d5a2e6dc..42b1f616 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -4,283 +4,209 @@ **Table of contents:** -* **TL:DR** -* Where to start? -* Working with the code - * Version control, Git, and GitHub - * Getting started with Git - * Forking - * Creating a development environment -* Documentation Guidelines -* Writing tests - * Using mocha - * Running the test suite -* Contributing your changes to danfojs - * Committing your code - * Pushing your changes - * Review your code and finally, make the pull request -* Danfojs internal (Brief) - -## TL:DR +* [TL;DR](#tldr) +* [Where to start?](#where-to-start) +* [Project Structure](#project-structure) +* [Development Setup](#development-setup) + * [Prerequisites](#prerequisites) + * [Installation](#installation) + * [Building](#building) + * [Testing](#testing) +* [Working with Git](#working-with-git) +* [Documentation Guidelines](#documentation-guidelines) +* [Making Changes](#making-changes) +* [Creating Pull Requests](#creating-pull-requests) + +## TL;DR All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome. -For contributors familiar with open-source, below is a quick guide to setting up danfojs locally. +Quick setup for experienced contributors: -``` +```bash git clone https://github.com/javascriptdata/danfojs.git cd danfojs -git checkout -b +yarn install +yarn build +yarn test ``` -There are three main folders in the `src` folder, **danfojs-base**, **danfojs-browser,** and **danfojs-node**. - -The **danfojs-base** folder holds all shared classes, modules, and functions used by both danfojs-browser and danfojs-node. 
So features or bug fixes that work the same way in both versions will generally be done in the **danfojs-base** folder. - ## Where to start? -For first-time contributors, you can find pending issues on the GitHub “issues” page. There are a number of issues listed and "good first issue" where you could start out. Once you’ve found an interesting issue, and have an improvement in mind, next thing is to set up your development environment. - -## Working with the code +For first-time contributors: -If you have an issue you want to fix, an enhancement to add, or documentation to improve, you need to learn how to work with GitHub and the Danfojs code base. +1. Look for issues labeled ["good first issue"](https://github.com/javascriptdata/danfojs/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) +2. Read through our [Documentation](https://danfo.jsdata.org/getting-started) -### **Version control, Git, and GitHub** -Danfojs code is hosted on GitHub. To contribute you will need to sign up for a free GitHub account. We use Git for version control to allow many people to work together on this project. +## Project Structure -Some great resources for learning Git: +The project is organized into three main packages: -* Official [GitHub pages](http://help.github.com). +- **danfojs-base**: Core functionality shared between browser and Node.js versions +- **danfojs-browser**: Browser-specific implementation +- **danfojs-node**: Node.js-specific implementation -### **Getting started with Git** +Most new features should be added to **danfojs-base** unless they are environment-specific. -Find [Instructions](http://help.github.com/set-up-git-redirect) for installing git, setting up your SSH key, and configuring git. These steps need to be completed before you can work seamlessly between your local repository and GitHub. +## Development Setup -## **Forking the Danfojs repo** +### Prerequisites -You will need your own fork to work on the code. 
Go to the danfojs [project page](https://github.com/opensource9ja/danfojs) and hit the Fork button. +- Node.js (v16.x or later) +- Yarn package manager +- Git -Next, you will clone your fork to your local machine: +### Installation -``` -git clone https://github.com/javascriptdata/danfojs.git -cd danfojs -``` +1. Fork the repository on GitHub +2. Clone your fork locally: + ```bash + git clone https://github.com/YOUR_USERNAME/danfojs.git + cd danfojs + ``` -This creates the directory danfojs and connects your repository to the upstream (main project) repository. +3. Install dependencies: + ```bash + yarn install + ``` -Some Javascript features are supported both in the browser and node environment, and it is recommended to add features in the **danfojs-base** folder. +### Building -For features that work differently or only in a specific environment, you can add them in the corresponding danfojs-node or danfojs-browser folder. - - - -## **Creating a development environment** - -To test out code changes, you’ll need to build danfojs, which requires a Nodejs environment. - -```python -git clone https://github.com/javascriptdata/danfojs.git -cd danfojs -yarn install ## automatically installs all required packages -yarn test ##Runs test in both node and browser folder +Build all packages: +```bash +yarn build ``` -> Now you can start adding features or fixing bugs! - -## Documentation Guidelines - -Documentation helps clarify what a function or a method is doing. It also gives insight to users of the function or methods on what parameters to pass in and know what the function will return. - -Sample documentation: - -```javascript - /** - * Add two series of the same length - * @param {series1} series1 [Series] - * @param {series2} series2 [Series] - * @returns Series - */ -function add_series(series1, series2){ - - ................... 
- - return new Series() -} +Build specific package: +```bash +cd src/danfojs-browser +yarn build ``` -And for functions that contain more than two arguments, a keyword argument can be used. Parsing of keyword argument is also applicable to most of the methods in a class - -```javascript -/** - * Join two or more dataframe together along an axis - * @param {kwargs} kwargs --> { - * df_list: [Array of DataFrame], - * axis : int {0 or 1}, - * by_column : String {name of a column}, - * } - * @returns DataFrame - */ -function join_df(kwargs){ - ........ - - return DataFrame -} +Watch mode for development: +```bash +yarn dev ``` -## **Writing tests** +### Testing -We strongly encourage contributors to write tests for their code. Like many packages, Danfojs uses mocha. - -All tests should go into the tests subdirectory and placed in the corresponding module. The tests folder contains some current examples of tests, and we suggest looking to these for inspiration. - -Below is the general Framework to write a test for each module. - -{% tabs %} -{% tab title="JavaScript" %} -```javascript -import { assert } from "chai" -import { DataFrame } from '../../src/core/frame' - -describe("Name of the class|module", function(){ - - it("name of the methods| expected result",function(){ - - //write your test code here - //use assert.{proprty} to test your code - }) - -}); -``` -{% endtab %} -{% endtabs %} - -For a class with lots of methods. - -```python -import { assert } from "chai" -import { DataFrame } from '../../src/core/frame' - -describe("Name of the class|module", function(){ - - describe("method name 1", function(){ - - it("expected result",function(){ - - //write your test code here - //use assert.{proprty} to test your code - }) - }) - - describe("method name 2", function(){ - - it("expected result",function(){ - - //write your test code here - //use assert.{proprty} to test your code - }) - }) - ....... 
-}); +Run all tests: +```bash +yarn test ``` -**Example**: Let write a test, to test if the values in a dataframe are off a certain length. Assuming the method to obtain length is values\_len() - -```javascript -import { assert } from "chai" -import { DataFrame } from '../../src/core/frame' - -describe("DataFrame", function(){ - - describe("value_len", function(){ - - it("check dataframe length",function(){ - - let data = [[1,2],[4,5]] - let columns = ["A","B"] - let df = new DataFrame(data,{columns: columns}) - - let expected_result = 2 - - assert.deepEqual(sf.value_len(), expected_result)) - - - }) - }) - -}); +Run specific test file: +```bash +yarn test tests/core/frame.test.js ``` -### **Running the test case** - -To run the test for the module you created, - -**1)** Open the package.json - -**2)** change the name of the test script to the file name you want to test. - -```python -"scripts": { - "test": "....... danfojs/tests/sub_directory_name/filename", +Run tests matching a pattern: +```bash +yarn test -g "DataFrame.add" ``` -**3)** run the test, in the danfojs directory terminal - -```python -yarn test +Run tests in watch mode: +```bash +yarn test --watch ``` -Learn more about mocha [here](https://mochajs.org) +## Working with Git -## Contributing your changes to danfojs +1. Create a new branch: + ```bash + git checkout -b feature/your-feature-name + ``` -### **Committing your code** +2. Make your changes and commit: + ```bash + git add . + git commit -m "feat: add new feature" + ``` -Once you’ve made changes, you can see them by typing: + We follow [Conventional Commits](https://www.conventionalcommits.org/) for commit messages: + - `feat:` for new features + - `fix:` for bug fixes + - `docs:` for documentation changes + - `test:` for adding tests + - `refactor:` for code refactoring -``` -git status -``` +3. 
Push to your fork: + ```bash + git push origin feature/your-feature-name + ``` -Next, you can track your changes using +## Documentation Guidelines -``` -git add . -``` +Good documentation includes: -Next, you commit changes using: +1. JSDoc comments for all public methods +2. Clear parameter descriptions +3. Return value documentation +4. Usage examples +Example: +```javascript +/** + * Add two series of the same length + * @param {Series} series1 - First series to add + * @param {Series} series2 - Second series to add + * @returns {Series} New series containing the sum + * + * @example + * const s1 = new Series([1, 2, 3]) + * const s2 = new Series([4, 5, 6]) + * const result = add_series(s1, s2) + * // result: Series([5, 7, 9]) + */ +function add_series(series1, series2) { + // Implementation +} ``` -git commit -m "Enter any commit message here" -``` - -### **Pushing your changes** -When you want your changes to appear publicly on your GitHub page, you can push to your forked repo with: +For methods with multiple options, use an options object: +```javascript +/** + * Join two or more dataframes + * @param {Object} options - Join options + * @param {DataFrame[]} options.df_list - Array of DataFrames to join + * @param {number} options.axis - Join axis (0: index, 1: columns) + * @param {string} options.by_column - Column to join on + * @returns {DataFrame} Joined DataFrame + */ +function join_df(options) { + // Implementation +} ``` -git push -``` - -### Review your code and finally, make a pull request -If everything looks good, you are ready to make a pull request. A pull request is how code from a local repository becomes available to the GitHub community and can be reviewed and eventually merged into the master version. To submit a pull request: +## Making Changes -1. Navigate to your repository on GitHub -2. Click on the Pull Request button -3. Write a description of your changes in the Preview Discussion tab -4. Click Send Pull Request. +1. 
Write tests for new functionality +2. Ensure all tests pass +3. Update documentation if needed +4. Add an entry to CHANGELOG.md +5. Run linter: `yarn lint` -This request then goes to the repository maintainers, and they will review the code and everything looks good, merge it with the master. +## Creating Pull Requests -**Hooray! You're now a contributor to danfojs. Now go bask in the euphoria!** +1. Push your changes to your fork +2. Go to the [danfojs repository](https://github.com/javascriptdata/danfojs) +3. Click "Pull Request" +4. Fill out the PR template: + - Clear description of changes + - Link to related issue + - Screenshots/examples if relevant + - Checklist of completed items -## **Danfojs Internals** +Your PR will be reviewed by maintainers. Address any feedback and update your PR accordingly. -In other to contribute to the code base of danfojs, there are some functions and properties provided to make implementation easy. +--- -The folder **danfojs-base** contains the bulk of Danfojs modules, and these are simply extended or exported by the **danfojs-browser** and **danfojs-node** folders. The base class for Frames and Series is the NdFrame class which is found in the `danfojs-base/core/generic` file. +## Need Help? +- Check our [Documentation](https://danfo.jsdata.org) +- Ask in GitHub Issues +Thank you for contributing to danfojs! 🎉 diff --git a/README.md b/README.md index 280c8724..b8e2785b 100644 --- a/README.md +++ b/README.md @@ -201,10 +201,6 @@ Output in Node Console: ## Documentation The official documentation can be found [here](https://danfo.jsdata.org) -## Danfo.js Official Book - -We published a book titled "Building Data Driven Applications with Danfo.js". Read more about it [here](https://danfo.jsdata.org/building-data-driven-applications-with-danfo.js-book) - ## Discussion and Development Development discussions take place [here](https://github.com/opensource9ja/danfojs/discussions). 
@@ -212,7 +208,3 @@ Development discussions take place [here](https://github.com/opensource9ja/danfo All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome. A detailed overview on how to contribute can be found in the [contributing guide](https://danfo.jsdata.org/contributing-guide). #### Licence [MIT](https://github.com/opensource9ja/danfojs/blob/master/LICENCE) - -#### Created by [Rising Odegua](https://github.com/risenW) and [Stephen Oni](https://github.com/steveoni) - -Danfo.js - Open Source JavaScript library for manipulating data. | Product Hunt Embed diff --git a/package.json b/package.json index 05339c86..65ca731b 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "danfo", - "version": "1.1.2", + "version": "1.2.0", "private": true, "workspaces": [ "danfojs-node/**", diff --git a/performance-test.js b/performance-test.js new file mode 100644 index 00000000..6bdea7a6 --- /dev/null +++ b/performance-test.js @@ -0,0 +1,100 @@ +const { DataFrame } = require('./src/danfojs-node/dist/danfojs-node/src'); + +function generateTestData(rows, numGroups = 100) { + console.log(`Generating ${rows} rows of test data with ~${numGroups} groups...`); + + const data = []; + const columns = ['group_col', 'value_a', 'value_b', 'value_c']; + + for (let i = 0; i < rows; i++) { + data.push([ + `group_${i % numGroups}`, // Create groups + Math.random() * 1000, // value_a + Math.random() * 500, // value_b + Math.random() * 100 // value_c + ]); + } + + return new DataFrame(data, { columns }); +} + +function performanceTest(df, testName) { + console.log(`\n=== ${testName} ===`); + console.log(`DataFrame shape: ${df.shape[0]} rows, ${df.shape[1]} columns`); + + // Test 1: Basic groupby construction + console.log('\nTest 1: Group construction...'); + let start = performance.now(); + const grouped = df.groupby(['group_col']); + let end = performance.now(); + console.log(`Group construction: ${(end - 
start).toFixed(2)}ms`);
+  const constructionTime = end - start;
+  console.log(`Number of groups: ${grouped.ngroups}`);
+
+  // Test 2: Single column aggregation
+  console.log('\nTest 2: Single column sum...');
+  start = performance.now();
+  const sumResult = grouped.col(['value_a']).sum();
+  end = performance.now();
+  const singleSumTime = end - start;
+  console.log(`Single column sum: ${singleSumTime.toFixed(2)}ms`);
+  console.log(`Result shape: ${sumResult.shape[0]} rows`);
+
+  // Test 3: Multiple column aggregation
+  console.log('\nTest 3: Multiple column aggregations...');
+  start = performance.now();
+  const multiResult = grouped.agg({
+    value_a: 'mean',
+    value_b: 'sum',
+    value_c: 'count'
+  });
+  end = performance.now();
+  const multiAggTime = end - start;
+  console.log(`Multiple aggregations: ${multiAggTime.toFixed(2)}ms`);
+  console.log(`Result shape: ${multiResult.shape[0]} rows`);
+
+  // Test 4: Complex aggregation (multiple operations per column)
+  console.log('\nTest 4: Complex aggregation...');
+  start = performance.now();
+  const complexResult = grouped.agg({
+    value_a: ['mean', 'max', 'min'],
+    value_b: ['sum', 'count'],
+    value_c: 'std'
+  });
+  end = performance.now();
+  const complexAggTime = end - start;
+  console.log(`Complex aggregation: ${complexAggTime.toFixed(2)}ms`);
+  console.log(`Result shape: ${complexResult.shape[0]} rows`);
+
+  // Report each phase's timing separately
+  return {
+    construction: constructionTime,
+    singleSum: singleSumTime,
+    multiAgg: multiAggTime,
+    complexAgg: complexAggTime
+  };
+}
+
+async function main() {
+  console.log('DanfoJS GroupBy Performance Test');
+  console.log('================================');
+
+  // Test different dataset sizes
+  const testSizes = [
+    { rows: 1000, groups: 50, name: 'Small Dataset (1K rows)' },
+    { rows: 5000, groups: 100, name: 'Medium Dataset (5K rows)' },
+    { rows: 20000, groups: 200, name: 'Large Dataset (20K rows)' }
+  ];
+
+  for (const testSize of testSizes) {
+    const df = generateTestData(testSize.rows, testSize.groups);
+    performanceTest(df, testSize.name);
+
+    // Force garbage collection between tests if available (run node with --expose-gc)
+    if (global.gc) {
+      global.gc();
+    }
+  }
+
+  
console.log('\n=== Performance Test Complete ===');
+  console.log('Check the times above - we should see significant improvement!');
+  console.log('Target: 20K rows should complete in < 2 seconds total');
+}
+
+// Run the test
+main().catch(console.error);
\ No newline at end of file
diff --git a/src/danfojs-base/aggregators/groupby.ts b/src/danfojs-base/aggregators/groupby.ts
index 63671914..dacdebd7 100644
--- a/src/danfojs-base/aggregators/groupby.ts
+++ b/src/danfojs-base/aggregators/groupby.ts
@@ -12,14 +12,12 @@
  * limitations under the License.
  * ==========================================================================
  */
-import DataFrame from "../core/frame"
-import { ArrayType1D, ArrayType2D } from "../shared/types"
-import { variance, std, median, mode } from 'mathjs';
-import concat from "../transformers/concat"
+import DataFrame from "../core/frame";
+import { ArrayType1D, ArrayType2D } from "../shared/types";
+import { variance, std, median, mode } from "mathjs";
+import concat from "../transformers/concat";
 import Series from "../core/series";
-
-
 /**
  * The class performs all groupby operation on a dataframe
  * involving all aggregate funciton
@@ -30,28 +28,103 @@ import Series from "../core/series";
  * @param {colDtype} Array columns dtype
 */
 export default class Groupby {
-    colDict: { [key: string ]: {} } = {}
-    keyCol: ArrayType1D
-    data?: ArrayType2D | null
-    columnName: ArrayType1D
-    colDtype: ArrayType1D
-    colIndex: ArrayType1D
-    groupDict?: any
-    groupColNames?: Array<string>
-    keyToValue: {
-        [key: string] : ArrayType1D
-    } = {}
-
-    constructor(keyCol: ArrayType1D, data: ArrayType2D | null, columnName: ArrayType1D, colDtype:ArrayType1D, colIndex: ArrayType1D) {
+  private _colDict: Map<string, { [key: string]: any[] }> = new Map();
+  keyCol: ArrayType1D;
+  data?: ArrayType2D | null;
+  columnName: ArrayType1D;
+  colDtype: ArrayType1D;
+  colIndex: ArrayType1D;
+  groupDict?: any;
+  groupColNames?: Array<string>;
+  keyToValue: Map<string, ArrayType1D> = new Map();
+  // Cache for optimized key generation
+  private 
keyGeneratorCache: Map<string, (values: ArrayType1D) => string> =
+    new Map();
+  constructor(
+    keyCol: ArrayType1D,
+    data: ArrayType2D | null,
+    columnName: ArrayType1D,
+    colDtype: ArrayType1D,
+    colIndex: ArrayType1D
+  ) {
     this.keyCol = keyCol;
     this.data = data;
     this.columnName = columnName;
     //this.dataTensors = {}; //store the tensor version of the groupby data
     this.colDtype = colDtype;
-    this.colIndex = colIndex
+    this.colIndex = colIndex;
+  }
+
+  /**
+   * Generate optimized key generation function based on column types
+   */
+  private getKeyGenerator(): (values: ArrayType1D) => string {
+    const cacheKey = this.colIndex.join("|");
+
+    if (this.keyGeneratorCache.has(cacheKey)) {
+      return this.keyGeneratorCache.get(cacheKey)!;
+    }
+
+    // Analyze column types to determine best key generation strategy
+    let allNumeric = true;
+    let allInteger = true;
+
+    for (let i = 0; i < this.colIndex.length; i++) {
+      const colIdx = this.colIndex[i] as number;
+      const dtype = this.colDtype[colIdx];
+      if (dtype === "string") {
+        allNumeric = false;
+        allInteger = false;
+        break;
+      }
+      // Check if it's integer-like
+      if (dtype === "float32" || dtype === "float64") {
+        allInteger = false;
+      }
+    }
+    let keyGenerator: (values: ArrayType1D) => string;
+
+    if (allInteger && this.colIndex.length === 1) {
+      // Single integer column - fastest path
+      keyGenerator = (values: ArrayType1D) => String(values[0]);
+    } else if (allNumeric && this.colIndex.length === 1) {
+      // Single numeric column
+      keyGenerator = (values: ArrayType1D) => String(values[0]);
+    } else if (allInteger) {
+      // Multiple integer columns - use custom concatenation
+      keyGenerator = (values: ArrayType1D) => {
+        let result = String(values[0]);
+        for (let i = 1; i < values.length; i++) {
+          result += "-" + String(values[i]);
+        }
+        return result;
+      };
+    } else if (allNumeric) {
+      // Multiple numeric columns
+      keyGenerator = (values: ArrayType1D) => {
+        let result = String(values[0]);
+        for (let i = 1; i < values.length; i++) {
+          result += "-" + 
String(values[i]); + } + return result; + }; + } else { + // Mixed types - fall back to join (but with pre-converted strings) + keyGenerator = (values: ArrayType1D) => { + const stringValues = new Array(values.length); + for (let i = 0; i < values.length; i++) { + stringValues[i] = String(values[i]); + } + return stringValues.join("-"); + }; + } + + this.keyGeneratorCache.set(cacheKey, keyGenerator); + return keyGenerator; } + /** * Generate group object data needed for group operations * let data = [ [ 1, 2, 3 ], [ 4, 5, 6 ], [ 20, 30, 40 ], [ 39, 89, 78 ] ]; @@ -84,58 +157,68 @@ export default class Groupby { * This could actually be generated by using split('-') on the object keys * e.g '1-2'.split('-') will give us the value for A and B. * But we might have weird case scenerio where A and B value has '-` - * e.g + * e.g * { * '1--2-': { C: [ 3 ]}, * '4--5-': {C: [ 6 ]} * } * using `.split('-') might not work well - * Hence we create a key-value `keyToValue` object to store index and their + * Hence we create a key-value `keyToValue` object to store index and their * associated value * NOTE: In the previous implementation we made use of Graph representation * for the group by data and Depth First search (DFS). 
But we decided to use key-value * object in javascript as an hashmap to reduce search time compared to using Grpah and DFS */ - group(): Groupby{ - const self = this - let keyToValue:{ - [key: string] : ArrayType1D - } = {} - const group = this.data?.reduce((prev: any, current)=>{ - let indexes= [] - for(let i in self.colIndex) { - let index = self.colIndex[i] as number - indexes.push(current[index]) - } - let index = indexes.join('-') - - if(!keyToValue[index]) { - keyToValue[index] = indexes - } - - if(prev[index]) { - let data = prev[index] - for (let i in self.columnName) { - let colName = self.columnName[i] as string - data[colName].push(current[i]) + group(): Groupby { + const self = this; + + // Guard clause: if data is null or undefined, return early + if (!this.data) { + return this; + } + + // Pre-compute column indices for faster access + const colIndices = this.colIndex as number[]; + const columnNames = this.columnName as string[]; + const keyGenerator = this.getKeyGenerator(); + + this.data.forEach((current) => { + // Extract group key values more efficiently + const keyValues: ArrayType1D = []; + for (let i = 0; i < colIndices.length; i++) { + keyValues.push(current[colIndices[i]]); + } + + // Use optimized key generation + const keyString = keyGenerator(keyValues); + + // Cache key-to-value mapping only once + if (!this.keyToValue.has(keyString)) { + this.keyToValue.set(keyString, keyValues); + } + + // Get or create group data + let groupData = this._colDict.get(keyString); + if (groupData) { + // Add to existing group - direct array access + for (let i = 0; i < columnNames.length; i++) { + groupData[columnNames[i]].push(current[i]); } } else { - prev[index] = {} - for (let i in self.columnName) { - let colName = self.columnName[i] as string - prev[index][colName] = [current[i]] + // Create new group + groupData = {}; + for (let i = 0; i < columnNames.length; i++) { + groupData[columnNames[i]] = [current[i]]; } + this._colDict.set(keyString, 
groupData);
+      }
-      return prev
+    });
-    }, {})
-    this.colDict = group
-    this.keyToValue = keyToValue
-    return this
+
+    return this;
   }
 
   /**
-   * Generate new internal groupby data
+   * Generate new internal groupby data
    * group = df.groupby(['A', 'B']).col('C')
    * This filter the colDict property as generated by `.group()`
    * it filter each group to contain only column `C` in their internal object
@@ -148,55 +231,58 @@ export default class Groupby {
    * {
    *   '1-2': { C: [ 3 ]},
    *   '4-5': {C: [ 6 ]}
-   * }
+   * }
    * @param colNames column names
    * @return Groupby
    */
   col(colNames: ArrayType1D | undefined): Groupby {
-    if (typeof colNames === "undefined") {
-      colNames = this.columnName.filter((_, index)=>{
-        return !this.colIndex.includes(index)
-      })
+    if (typeof colNames === "undefined") {
+      colNames = this.columnName.filter((_, index) => {
+        return !this.colIndex.includes(index);
+      });
     }
-    let self = this
-    colNames.forEach((val) => {
-      if (!self.columnName.includes(val))
-        throw new Error(`Column ${val} does not exist in groups`)
-    })
-    let colDict: { [key: string ]: {} } = {...this.colDict}
-    for(let [key, values] of Object.entries(colDict)) {
-      let c: { [key: string ]: [] } = {}
-      let keyVal: any = {...values}
-      for(let colKey in colNames) {
-        let colName = colNames[colKey] as string
-        c[colName] = keyVal[colName]
-      }
-      colDict[key] = c
+
+    // Validate column names
+    const colNamesArray = colNames as string[];
+    for (const colName of colNamesArray) {
+      if (!this.columnName.includes(colName))
+        throw new Error(`Column ${colName} does not exist in groups`);
+    }
+
+    // Create new Map with filtered columns (avoid deep copying)
+    const newColDict = new Map();
+
+    for (const [key, values] of Array.from(this._colDict.entries())) {
+      const filteredData: { [key: string]: ArrayType1D } = {};
+      for (const colName of colNamesArray) {
+        filteredData[colName] = values[colName];
+      }
+      newColDict.set(key, filteredData);
     }
+
     const gp = new Groupby(
       this.keyCol,
       null,
       this.columnName,
       this.colDtype,
       this.colIndex
-    )
-    gp.colDict = colDict
-    gp.groupColNames = colNames as Array<string>
-    gp.keyToValue = this.keyToValue
+    );
+    gp._colDict = newColDict;
+    gp.groupColNames = colNamesArray;
+    gp.keyToValue = this.keyToValue;
 
-    return gp
+    return gp;
   }
 
   /**
    * Perform all groupby arithmetic operations
-   * In the previous implementation all groups data are
-   * stord as DataFrame, which involve lot of memory usage
+   * In the previous implementation all groups data were
+   * stored as DataFrames, which involved a lot of memory usage.
    * Hence each groups are just pure javascrit object
-   * and all arithmetic operation is done directly on javascript
+   * and all arithmetic operations are done directly on JavaScript
    * arrays.
-   * e.g
-   * using this internal data
+   * e.g
+   * using this internal data
    * {
    *   '1-2': {A: [ 1,3 ], B: [ 2,5 ], C: [ 3, 5 ]},
    *   '4-5': {A: [ 4,1 ], B: [ 5,0 ], C: [ 6, 12 ]}
@@ -211,7 +297,7 @@ export default class Groupby {
    *   B: 'sum',
    *   C: 'min'
    * })
-   * result:
+   * result:
    * {
    *   '1-2': {A_mean: [ 2 ], B_sum: [ 7 ], C_min: [ 3 ]},
    *   '4-5': {A_mean: [ 2.5 ], B_sum: [ 5 ], C_min: [ 6 ]}
@@ -226,294 +312,559 @@ export default class Groupby {
    *   '1-2': {A_mean: [ 2 ], B_sum: [ 7 ], C_min: [ 3 ], C_max: [5]},
    *   '4-5': {A_mean: [ 2.5 ], B_sum: [ 5 ], C_min: [ 6 ], C_max: [12]}
    * }
-   * @param operation
+   * @param operation
    */
-  private arithemetic(operation: {[key: string] : Array<string> | string} | string): { [key: string ]: {} } {
+  private arithemetic(
+    operation: { [key: string]: Array<string> | string } | string
+  ): Map<string, { [key: string]: Array<number> }> {
+    const opsName = [
+      "mean",
+      "sum",
+      "count",
+      "mode",
+      "std",
+      "var",
+      "cumsum",
+      "cumprod",
+      "cummax",
+      "cummin",
+      "median",
+      "min",
+      "max",
+    ];
 
-    const opsName = [ "mean", "sum", "count", "mode", "std", "var", "cumsum", "cumprod",
-      "cummax", "cummin", "median" , "min", "max"];
-    if (typeof operation === "string" ) {
+    // Validate operations
+    if (typeof operation === "string") {
       if (!opsName.includes(operation)) {
-        throw new Error(`group operation: ${operation} is not valid`)
+        throw 
new Error(`group operation: ${operation} is not valid`);
       }
     } else {
-      Object.keys(operation).forEach((key)=>{
-        let ops = operation[key]
-        if(Array.isArray(ops)) {
-          for(let op of ops) {
+      Object.keys(operation).forEach((key) => {
+        let ops = operation[key];
+        if (Array.isArray(ops)) {
+          for (let op of ops) {
             if (!opsName.includes(op)) {
-              throw new Error(`group operation: ${op} for column ${key} is not valid`)
+              throw new Error(
+                `group operation: ${op} for column ${key} is not valid`
+              );
             }
           }
         } else {
           if (!opsName.includes(ops)) {
-            throw new Error(`group operation: ${ops} for column ${key} is not valid`)
+            throw new Error(
+              `group operation: ${ops} for column ${key} is not valid`
+            );
          }
        }
-
-      })
+      });
    }
-    let colDict: { [key: string ]: {} } = {...this.colDict}
-    for(const [key, values] of Object.entries(colDict)) {
-      let colVal: { [key: string ]: Array<number> } = {}
-      let keyVal: any = {...values}
-      let groupColNames: Array<string> = this.groupColNames as Array<string>
-      for(let colKey=0; colKey < groupColNames.length; colKey++) {
-        let colName = groupColNames[colKey]
-        let colIndex = this.columnName.indexOf(colName)
-        let colDtype = this.colDtype[colIndex]
-        let operationVal = (typeof operation === "string") ? operation : operation[colName]
-        if (colDtype === "string" && operationVal !== "count") throw new Error(`Can't perform math operation on column ${colName}`)
-        if (typeof operation === "string") {
-          let colName2 = `${colName}_${operation}`
-          colVal[colName2] = this.groupMathLog(keyVal[colName], operation)
+    const resultMap = new Map<string, { [key: string]: Array<number> }>();
+    const groupColNames: Array<string> = this.groupColNames as Array<string>;
+
+    for (const [key, values] of Array.from(this._colDict.entries())) {
+      const colVal: { [key: string]: Array<number> } = {};
+
+      for (let colKey = 0; colKey < groupColNames.length; colKey++) {
+        const colName = groupColNames[colKey];
+        const colIndex = this.columnName.indexOf(colName);
+        const colDtype = this.colDtype[colIndex];
+        const operationVal =
+          typeof operation === "string" ? 
operation : operation[colName];
+
+        if (colDtype === "string" && operationVal !== "count") {
+          throw new Error(`Can't perform math operation on column ${colName}`);
+        }
-        else {
-          if(Array.isArray(operation[colName])) {
-            for(let ops of operation[colName]) {
-              let colName2 = `${colName}_${ops}`
-              colVal[colName2] = this.groupMathLog(keyVal[colName],ops)
+
+        if (typeof operation === "string") {
+          const colName2 = `${colName}_${operation}`;
+          colVal[colName2] = this.singleMathOperation(
+            values[colName] as Array<number>,
+            operation
+          );
+        } else {
+          if (Array.isArray(operation[colName])) {
+            // Use multi-pass aggregation for multiple operations on same column
+            const operations = operation[colName] as string[];
+            const results = this.multiPassAggregation(
+              operations,
+              values[colName] as Array<number>
+            );
+
+            for (const ops of operations) {
+              const colName2 = `${colName}_${ops}`;
+              colVal[colName2] = results[ops];
            }
          } else {
-            let ops: string = operation[colName] as string
-            let colName2 = `${colName}_${ops}`
-            colVal[colName2] = this.groupMathLog(keyVal[colName], ops)
+            const ops: string = operation[colName] as string;
+            const colName2 = `${colName}_${ops}`;
+            colVal[colName2] = this.singleMathOperation(
+              values[colName] as Array<number>,
+              ops
+            );
          }
-        }
      }
-      colDict[key] = colVal
+      resultMap.set(key, colVal);
    }
-    return colDict
+    return resultMap;
  }
 
  /**
-   * Peform all arithmetic logic
-   * @param colVal
-   * @param ops
+   * Convert array to typed array for better performance on numeric operations
   */
-  private groupMathLog(colVal: Array<number>, ops: string): Array<number>{
-    let data = []
-    switch(ops) {
-      case "max":
-        let max = colVal.reduce((prev, curr)=> {
-          if (prev > curr) {
-            return prev
-          }
-          return curr
-        })
-        data.push(max)
-        break;
-      case "min":
-        let min = colVal.reduce((prev, curr)=> {
-          if (prev < curr) {
-            return prev
-          }
-          return curr
-        })
-        data.push(min)
-        break;
+  private optimizeNumericArray(
+    colVal: Array<number>
+  ): Float64Array | Array<number> {
+    // Use typed arrays for pure numeric data to 
improve performance + try { + // Check if all values are numeric + let allNumeric = true; + for (let i = 0; i < colVal.length && allNumeric; i++) { + if (typeof colVal[i] !== "number" || !isFinite(colVal[i])) { + allNumeric = false; + } + } + + if (allNumeric && colVal.length > 10) { + // Only use for larger arrays + return new Float64Array(colVal); + } + } catch (e) { + // Fall back to regular array if typed array creation fails + } + + return colVal; + } + + /** + * Optimized math operations for typed arrays + */ + private fastMathOperations = { + sum: (arr: Float64Array | Array): number => { + let sum = 0; + for (let i = 0; i < arr.length; i++) { + sum += arr[i]; + } + return sum; + }, + + min: (arr: Float64Array | Array): number => { + let min = arr[0]; + for (let i = 1; i < arr.length; i++) { + if (arr[i] < min) min = arr[i]; + } + return min; + }, + + max: (arr: Float64Array | Array): number => { + let max = arr[0]; + for (let i = 1; i < arr.length; i++) { + if (arr[i] > max) max = arr[i]; + } + return max; + }, + + mean: (arr: Float64Array | Array): number => { + return this.fastMathOperations.sum(arr) / arr.length; + }, + }; + + /** + * Single-pass multi-aggregation for maximum performance + * Computes multiple operations in one pass through the data + */ + private multiPassAggregation( + operations: string[], + colVal: Array + ): { [key: string]: Array } { + const results: { [key: string]: Array } = {}; + const needsSum = operations.includes("sum") || operations.includes("mean"); + const needsMinMax = + operations.includes("min") || operations.includes("max"); + const needsCumulative = operations.some((op) => op.startsWith("cum")); + + // Optimize array for numeric operations + const optimizedArray = this.optimizeNumericArray(colVal); + const length = optimizedArray.length; + + // Use optimized operations for basic aggregations + let sum: number | undefined; + let min: number | undefined; + let max: number | undefined; + + if (needsSum) { + sum = 
this.fastMathOperations.sum(optimizedArray); + } + if (needsMinMax) { + min = this.fastMathOperations.min(optimizedArray); + max = this.fastMathOperations.max(optimizedArray); + } + + // Assign results for basic operations + for (const op of operations) { + switch (op) { + case "sum": + results[op] = [sum!]; + break; + case "count": + results[op] = [length]; + break; + case "mean": + results[op] = [sum! / length]; + break; + case "min": + results[op] = [min!]; + break; + case "max": + results[op] = [max!]; + break; + case "std": + results[op] = [std(colVal)]; + break; + case "var": + results[op] = [variance(colVal)]; + break; + case "median": + results[op] = [median(colVal)]; + break; + case "mode": + results[op] = [mode(colVal)]; + break; + } + } + + // Handle cumulative operations separately (they need arrays) + for (const op of operations) { + if (op.startsWith("cum")) { + results[op] = this.singleMathOperation(colVal, op); + } + } + + return results; + } + + /** + * Single operation computation (fallback for individual operations) + */ + private singleMathOperation( + colVal: Array, + op: string + ): Array { + // Use optimized operations for basic math when possible + const optimizedArray = this.optimizeNumericArray(colVal); + + switch (op) { case "sum": - let sum = colVal.reduce((prev, curr)=> { - return prev + curr - }) - data.push(sum) - break; - case "count": - data.push(colVal.length) - break; + return [this.fastMathOperations.sum(optimizedArray)]; case "mean": - let sumMean = colVal.reduce((prev, curr)=> { - return prev + curr - }) - data.push(sumMean / colVal.length) - break; - case "std": - data.push(std(colVal)) - break; - case "var": - data.push(variance(colVal)) - break; - case "median": - data.push(median(colVal)) - break; - case "mode": - data.push(mode(colVal)) - break; - case "cumsum": - colVal.reduce((prev, curr) => { - let sum = prev + curr - data.push(sum) - return sum - }, 0) - break; - case "cummin": - data = [colVal[0]] - 
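The shared-pass idea behind `multiPassAggregation` can be shown in isolation: scan the column once, keeping only the accumulators the requested operations need, then derive each result. This is a simplified sketch, not the danfojs implementation:

```typescript
// One scan fills every accumulator the requested operations need;
// each operation is then derived without re-reading the column.
function aggregateOnce(
  operations: string[],
  values: number[]
): { [op: string]: number[] } {
  const needsSum = operations.includes("sum") || operations.includes("mean");
  const needsMinMax = operations.includes("min") || operations.includes("max");

  let sum = 0;
  let min = values[0];
  let max = values[0];
  for (const v of values) {
    if (needsSum) sum += v;
    if (needsMinMax) {
      if (v < min) min = v;
      if (v > max) max = v;
    }
  }

  const results: { [op: string]: number[] } = {};
  for (const op of operations) {
    if (op === "sum") results[op] = [sum];
    else if (op === "mean") results[op] = [sum / values.length];
    else if (op === "min") results[op] = [min];
    else if (op === "max") results[op] = [max];
    else if (op === "count") results[op] = [values.length];
  }
  return results;
}
```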
colVal.slice(1,).reduce((prev, curr)=>{ - if (prev < curr) { - data.push(prev) - return prev - } - data.push(curr) - return curr - }, data[0]) - break; - case "cummax": - data = [colVal[0]] - colVal.slice(1,).reduce((prev, curr)=> { - if (prev > curr) { - data.push(prev) - return prev - } - data.push(curr) - return curr - }, data[0]) - break; - case "cumprod": - colVal.reduce((prev, curr) => { - let sum = prev * curr - data.push(sum) - return sum - }, 1) - break; + return [this.fastMathOperations.mean(optimizedArray)]; + case "min": + return [this.fastMathOperations.min(optimizedArray)]; + case "max": + return [this.fastMathOperations.max(optimizedArray)]; + case "count": + return [optimizedArray.length]; + default: + // Fall back to original implementation for complex operations + const operation = + Groupby.mathOperations[op as keyof typeof Groupby.mathOperations]; + return operation ? operation(colVal) : []; } - return data + } + + // Function lookup table for arithmetic operations (better performance than switch) + private static readonly mathOperations = { + max: (colVal: Array): Array => { + let max = colVal[0]; + for (let i = 1; i < colVal.length; i++) { + if (colVal[i] > max) max = colVal[i]; + } + return [max]; + }, + min: (colVal: Array): Array => { + let min = colVal[0]; + for (let i = 1; i < colVal.length; i++) { + if (colVal[i] < min) min = colVal[i]; + } + return [min]; + }, + sum: (colVal: Array): Array => { + let sum = 0; + for (let i = 0; i < colVal.length; i++) { + sum += colVal[i]; + } + return [sum]; + }, + count: (colVal: Array): Array => [colVal.length], + mean: (colVal: Array): Array => { + let sum = 0; + for (let i = 0; i < colVal.length; i++) { + sum += colVal[i]; + } + return [sum / colVal.length]; + }, + std: (colVal: Array): Array => [std(colVal)], + var: (colVal: Array): Array => [variance(colVal)], + median: (colVal: Array): Array => [median(colVal)], + mode: (colVal: Array): Array => [mode(colVal)], + cumsum: (colVal: Array): Array => 
{
+      const data: Array<number> = [];
+      let sum = 0;
+      for (let i = 0; i < colVal.length; i++) {
+        sum += colVal[i];
+        data.push(sum);
+      }
+      return data;
+    },
+    cummin: (colVal: Array<number>): Array<number> => {
+      const data: Array<number> = [colVal[0]];
+      let min = colVal[0];
+      for (let i = 1; i < colVal.length; i++) {
+        if (colVal[i] < min) min = colVal[i];
+        data.push(min);
+      }
+      return data;
+    },
+    cummax: (colVal: Array<number>): Array<number> => {
+      const data: Array<number> = [colVal[0]];
+      let max = colVal[0];
+      for (let i = 1; i < colVal.length; i++) {
+        if (colVal[i] > max) max = colVal[i];
+        data.push(max);
+      }
+      return data;
+    },
+    cumprod: (colVal: Array<number>): Array<number> => {
+      const data: Array<number> = [];
+      let prod = 1;
+      for (let i = 0; i < colVal.length; i++) {
+        prod *= colVal[i];
+        data.push(prod);
+      }
+      return data;
+    },
+  };
+
+  /**
+   * Perform all arithmetic logic (legacy method - use singleMathOperation instead)
+   * @param colVal
+   * @param ops
+   */
+  private groupMathLog(colVal: Array<any>, ops: string): Array<number> {
+    return this.singleMathOperation(colVal, ops);
   }

   /**
    * Takes in internal groupby internal data and convert
    * them to a single data frame.
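The function-lookup-table dispatch used for `mathOperations` above, reduced to two cumulative operations as a sketch:

```typescript
// Indexing a readonly object by operation name replaces a long switch.
const mathOps = {
  cumsum: (xs: number[]): number[] => {
    const out: number[] = [];
    let sum = 0;
    for (const x of xs) {
      sum += x;
      out.push(sum);
    }
    return out;
  },
  cummax: (xs: number[]): number[] => {
    const out: number[] = [];
    // The diff seeds with xs[0]; -Infinity behaves the same for non-empty input.
    let max = -Infinity;
    for (const x of xs) {
      if (x > max) max = x;
      out.push(max);
    }
    return out;
  },
} as const;

function dispatch(op: keyof typeof mathOps, xs: number[]): number[] {
  return mathOps[op](xs);
}
```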
- * @param colDict + * @param colDict */ - private toDataFrame(colDict: { [key: string ]: {} }): DataFrame { - let data: { [key: string ]: ArrayType1D } = {} - - for(let key of this.colKeyDict(colDict)) { - let value = colDict[key] - let keyDict: { [key: string ]: ArrayType1D } = {} - let oneValue = Object.values(value)[0] as ArrayType1D - let valueLen = oneValue.length - for(let key1 in this.keyCol) { - let keyName = this.keyCol[key1] as string - let keyValue = this.keyToValue[key][key1] - keyDict[keyName] = Array(valueLen).fill(keyValue) - } - let combine: { [key: string ]: ArrayType1D } = {...keyDict, ...value} - if(Object.keys(data).length < 1) { - data = combine + private toDataFrame( + colDict: Map + ): DataFrame { + const data: { [key: string]: ArrayType1D } = {}; + const keys = this.colKeyDict(colDict); + + // Handle empty case - return empty DataFrame with proper column structure + if (keys.length === 0) { + const columns: string[] = []; + // Add key column names + for (let keyIdx = 0; keyIdx < this.keyCol.length; keyIdx++) { + const keyName = this.keyCol[keyIdx] as string; + columns.push(keyName); + data[keyName] = []; + } + // Add group column names if they exist + if (this.groupColNames) { + for (const colName of this.groupColNames) { + columns.push(colName); + data[colName] = []; + } + } + return new DataFrame([], { columns }); + } + + // Initialize data structure more efficiently + let isFirstGroup = true; + + for (const key of keys) { + const value = colDict.get(key)!; + const valueEntries = Object.entries(value); + const oneValue = valueEntries[0][1] as ArrayType1D; + const valueLen = oneValue.length; + + if (isFirstGroup) { + // Initialize arrays for the first group + // Add key columns with pre-allocated arrays (faster than Array.fill) + for (let keyIdx = 0; keyIdx < this.keyCol.length; keyIdx++) { + const keyName = this.keyCol[keyIdx] as string; + const keyValue = this.keyToValue.get(key)![keyIdx]; + const keyArray = new Array(valueLen); + for 
(let i = 0; i < valueLen; i++) { + keyArray[i] = keyValue; + } + data[keyName] = keyArray; + } + + // Add value columns + for (const [colName, colValues] of valueEntries) { + data[colName] = [...colValues]; + } + isFirstGroup = false; } else { - for(let dataKey of Object.keys(data)) { - let dataValue = combine[dataKey] as ArrayType1D - data[dataKey] = [...data[dataKey], ...dataValue] + // Append to existing arrays using batch operations + // Add key columns with optimized batch assignment + for (let keyIdx = 0; keyIdx < this.keyCol.length; keyIdx++) { + const keyName = this.keyCol[keyIdx] as string; + const keyValue = this.keyToValue.get(key)![keyIdx]; + const existingArray = data[keyName] as any[]; + const startIndex = existingArray.length; + + // Extend array length once, then assign directly + existingArray.length += valueLen; + for (let i = 0; i < valueLen; i++) { + existingArray[startIndex + i] = keyValue; + } + } + + // Add value columns with optimized batch copying + for (const [colName, colValues] of valueEntries) { + const existingArray = data[colName] as any[]; + const startIndex = existingArray.length; + + // Extend array length once, then copy directly + existingArray.length += colValues.length; + for (let i = 0; i < colValues.length; i++) { + existingArray[startIndex + i] = colValues[i]; + } } } } - return new DataFrame(data) + + return new DataFrame(data); } private operations(ops: string): DataFrame { + // Handle empty case early + if (this._colDict.size === 0) { + const columns: string[] = []; + // Add key column names + for (let keyIdx = 0; keyIdx < this.keyCol.length; keyIdx++) { + const keyName = this.keyCol[keyIdx] as string; + columns.push(keyName); + } + // Add result column names + const targetColumns = + this.groupColNames || + this.columnName.filter((_, index) => !this.colIndex.includes(index)); + for (const colName of targetColumns) { + columns.push(`${colName}_${ops}`); + } + return new DataFrame([], { columns }); + } + if 
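The batch-append pattern in `toDataFrame` above (extend `length` once, then assign by index, instead of spreading into a fresh array per group) generalizes to a small helper. `appendInPlace` is illustrative, not part of danfojs:

```typescript
// Grow the target array once, then fill by index; `[...a, ...b]` would
// reallocate and copy the whole accumulated array on every group.
function appendInPlace<T>(target: T[], source: ArrayLike<T>): T[] {
  const start = target.length;
  target.length += source.length; // single length extension
  for (let i = 0; i < source.length; i++) {
    target[start + i] = source[i];
  }
  return target;
}
```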
(!this.groupColNames) { - let colGroup = this.col(undefined) - let colDict = colGroup.arithemetic(ops) - let df = colGroup.toDataFrame(colDict) - return df + let colGroup = this.col(undefined); + let colDict = colGroup.arithemetic(ops); + let df = colGroup.toDataFrame(colDict); + return df; } - let colDict = this.arithemetic(ops) - let df = this.toDataFrame(colDict) - return df + let colDict = this.arithemetic(ops); + let df = this.toDataFrame(colDict); + return df; } /** * Obtain the count for each group * @returns DataFrame - * + * */ count(): DataFrame { - return this.operations("count") + return this.operations("count"); } /** * Obtain the sum of columns for each group * @returns DataFrame - * + * */ - sum(): DataFrame{ - return this.operations("sum") + sum(): DataFrame { + return this.operations("sum"); } /** * Obtain the standard deviation of columns for each group * @returns DataFrame */ - std(): DataFrame{ - return this.operations("std") + std(): DataFrame { + return this.operations("std"); } /** * Obtain the variance of columns for each group * @returns DataFrame */ - var(): DataFrame{ - return this.operations("var") + var(): DataFrame { + return this.operations("var"); } /** * Obtain the mean of columns for each group * @returns DataFrame */ - mean(): DataFrame{ - return this.operations("mean") + mean(): DataFrame { + return this.operations("mean"); } /** * Obtain the cumsum of columns for each group * @returns DataFrame - * + * */ - cumSum(): DataFrame{ - return this.operations("cumsum") + cumSum(): DataFrame { + return this.operations("cumsum"); } /** * Obtain the cummax of columns for each group * @returns DataFrame */ - cumMax(): DataFrame{ - return this.operations("cummax") + cumMax(): DataFrame { + return this.operations("cummax"); } /** * Obtain the cumprod of columns for each group * @returns DataFrame */ - cumProd(): DataFrame{ - return this.operations("cumprod") + cumProd(): DataFrame { + return this.operations("cumprod"); } /** * Obtain the 
cummin of columns for each group * @returns DataFrame */ - cumMin(): DataFrame{ - return this.operations("cummin") + cumMin(): DataFrame { + return this.operations("cummin"); } /** * Obtain the max value of columns for each group * @returns DataFrame - * + * */ - max(): DataFrame{ - return this.operations("max") + max(): DataFrame { + return this.operations("max"); } /** * Obtain the min of columns for each group * @returns DataFrame */ - min(): DataFrame{ - return this.operations("min") + min(): DataFrame { + return this.operations("min"); } /** @@ -522,18 +873,42 @@ export default class Groupby { * @returns DataFrame */ getGroup(keys: Array): DataFrame { - let dictKey = keys.join("-") - let colDict: { [key: string ]: {} } = {} - colDict[dictKey] = {...this.colDict[dictKey]} - return this.toDataFrame(colDict) + const dictKey = keys.join("-"); + const colDict = new Map(); + const groupData = this._colDict.get(dictKey); + if (groupData) { + colDict.set(dictKey, groupData); + } + return this.toDataFrame(colDict); } /** * Perform aggregation on all groups - * @param ops + * @param ops * @returns DataFrame */ - agg(ops: { [key: string ]: Array | string }): DataFrame { + agg(ops: { [key: string]: Array | string }): DataFrame { + // Handle empty case early + if (this._colDict.size === 0) { + const columns: string[] = []; + // Add key column names + for (let keyIdx = 0; keyIdx < this.keyCol.length; keyIdx++) { + const keyName = this.keyCol[keyIdx] as string; + columns.push(keyName); + } + // Add result column names for each operation + for (const [colName, operations] of Object.entries(ops)) { + if (Array.isArray(operations)) { + for (const op of operations) { + columns.push(`${colName}_${op}`); + } + } else { + columns.push(`${colName}_${operations}`); + } + } + return new DataFrame([], { columns }); + } + let columns = Object.keys(ops); let col_gp = this.col(columns); let data = col_gp.arithemetic(ops); @@ -544,79 +919,106 @@ export default class Groupby { /** * Apply 
custom aggregator function * to each group - * @param callable + * @param callable * @returns DataFrame * @example * let grp = df.groupby(['A']) * grp.apply((x) => x.count()) */ - apply(callable: (x: DataFrame)=> DataFrame | Series ): DataFrame { - let colDict: { [key: string ]: DataFrame | Series } = {} - for(const key of this.colKeyDict(this.colDict)) { - let valDataframe = new DataFrame(this.colDict[key]) - colDict[key] = callable(valDataframe) + apply(callable: (x: DataFrame) => DataFrame | Series): DataFrame { + const colDict: { [key: string]: DataFrame | Series } = {}; + const keys = this.colKeyDict(this._colDict); + + for (const key of keys) { + const groupData = this._colDict.get(key)!; + const valDataframe = new DataFrame(groupData); + colDict[key] = callable(valDataframe); } - return this.concatGroups(colDict) + return this.concatGroups(colDict); } - private concatGroups(colDict: {[key: string]: DataFrame | Series}): DataFrame { - let data: Array = [] - for(const [key, values] of Object.entries(colDict)) { + private concatGroups(colDict: { + [key: string]: DataFrame | Series; + }): DataFrame { + let data: Array = []; + for (const [key, values] of Object.entries(colDict)) { let copyDf: DataFrame; if (values instanceof DataFrame) { - copyDf = values.copy() - } - else { - let columns = values.index as string[] - columns = columns.length > 1 ? columns : ['applyOps'] - copyDf = new DataFrame([values.values], {columns: columns }) - } - let len = copyDf.shape[0] - let key1: any; - for(key1 in this.keyCol){ - - let keyName = this.keyCol[key1] as string - let keyValue = this.keyToValue[key][key1] - let dfValue = Array(len).fill(keyValue) - let atIndex: number = parseInt(key1) - if (this.groupColNames) { - copyDf.addColumn(keyName, dfValue, {inplace: true, atIndex: atIndex }) + copyDf = values.copy(); + } else { + let columns = values.index as string[]; + columns = columns.length > 1 ? 
columns : ["applyOps"]; + copyDf = new DataFrame([values.values], { columns: columns }); + } + let len = copyDf.shape[0]; + const keyValues = this.keyToValue.get(key)!; + for (let keyIdx = 0; keyIdx < this.keyCol.length; keyIdx++) { + const keyName = this.keyCol[keyIdx] as string; + const keyValue = keyValues[keyIdx]; + // Use pre-allocated array instead of Array.fill() + const dfValue = new Array(len); + for (let i = 0; i < len; i++) { + dfValue[i] = keyValue; } - else { - copyDf.addColumn(`${keyName}_Group`, dfValue, {inplace: true, atIndex: atIndex }) + + if (this.groupColNames) { + copyDf.addColumn(keyName, dfValue, { + inplace: true, + atIndex: keyIdx, + }); + } else { + copyDf.addColumn(`${keyName}_Group`, dfValue, { + inplace: true, + atIndex: keyIdx, + }); } - } - data.push(copyDf) + data.push(copyDf); } - return concat({dfList: data, axis:0}) as DataFrame + return concat({ dfList: data, axis: 0 }) as DataFrame; } - + /** * obtain the total number of groups * @returns number */ - get ngroups(): number{ - let keys = Object.keys(this.colDict) - return keys.length + get ngroups(): number { + return this._colDict.size; } /** * obtaind the internal group data - * @returns {[keys: string]: {}} + * @returns { [key: string]: { [key: string]: ArrayType1D } } (backward compatibility) + */ + get groups(): { [key: string]: { [key: string]: ArrayType1D } } { + // Ensure grouping has been done + if (this._colDict.size === 0) { + this.group(); + } + // Convert Map to object for backward compatibility + const result: { [key: string]: { [key: string]: ArrayType1D } } = {}; + Array.from(this._colDict.entries()).forEach(([key, value]) => { + result[key] = value; + }); + return result; + } + + /** + * Backward compatibility for colDict property access + * @returns { [key: string]: { [key: string]: ArrayType1D } } */ - get groups(): {[keys: string]: {}}{ - return this.colDict + get colDict(): { [key: string]: { [key: string]: ArrayType1D } } { + return this.groups; } /** * 
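The backward-compatibility pattern above — Map-backed storage exposed through an object-shaped getter — in miniature. `GroupStore` is a hypothetical stand-in for `Groupby`:

```typescript
class GroupStore {
  // New internal representation: a Map keyed by the joined group key.
  private _colDict = new Map<string, { [col: string]: number[] }>();

  set(key: string, value: { [col: string]: number[] }): void {
    this._colDict.set(key, value);
  }

  get ngroups(): number {
    return this._colDict.size; // O(1), no Object.keys() allocation
  }

  // Old callers read `.colDict` as a plain object; rebuild it on demand.
  get colDict(): { [key: string]: { [col: string]: number[] } } {
    const result: { [key: string]: { [col: string]: number[] } } = {};
    for (const [key, value] of Array.from(this._colDict.entries())) {
      result[key] = value;
    }
    return result;
  }
}
```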
Obtain the first row of each group * @returns DataFrame */ - first(): DataFrame{ - return this.apply((x)=>{ - return x.head(1) - }) + first(): DataFrame { + return this.apply((x) => { + return x.head(1); + }); } /** @@ -624,9 +1026,9 @@ export default class Groupby { * @returns DataFrame */ last(): DataFrame { - return this.apply((x)=>{ - return x.tail(1) - }) + return this.apply((x) => { + return x.tail(1); + }); } /** @@ -634,28 +1036,35 @@ export default class Groupby { * @returns DataFrame */ size(): DataFrame { - return this.apply((x)=>{ - return new Series([x.shape[0]]) - }) + return this.apply((x) => { + return new Series([x.shape[0]]); + }); } - private colKeyDict(colDict: { [key: string ]: {} }): string[]{ - let keyDict :{ [key: string ]: string[] } = {} + private colKeyDict( + colDict: Map + ): string[] { + const keyDict: { [key: string]: string[] } = {}; + const firstKeyOrder: string[] = []; - for(let key of Object.keys(colDict)) { - let firstKey = key.split("-")[0] + // Collect keys and group by first key, preserving insertion order + for (const key of Array.from(colDict.keys())) { + const firstKey = key.split("-")[0]; if (firstKey in keyDict) { - keyDict[firstKey].push(key) - } - else { - keyDict[firstKey] = [key] + keyDict[firstKey].push(key); + } else { + keyDict[firstKey] = [key]; + firstKeyOrder.push(firstKey); } } - let keys = [] - for(let key of Object.keys(keyDict)) { - keys.push(...keyDict[key]) + + // Preserve first key appearance order (don't sort alphabetically) + const sortedFirstKeys = firstKeyOrder; + const keys: string[] = []; + for (const firstKey of sortedFirstKeys) { + // Preserve insertion order within each group + keys.push(...keyDict[firstKey]); } - return keys + return keys; } - -} \ No newline at end of file +} diff --git a/src/danfojs-base/index.ts b/src/danfojs-base/index.ts index 6bdaa757..aca26bb0 100644 --- a/src/danfojs-base/index.ts +++ b/src/danfojs-base/index.ts @@ -29,7 +29,7 @@ import merge from "./transformers/merge" 
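The insertion-order grouping performed by `colKeyDict` can be sketched as follows:

```typescript
// Composite keys like "a-1" are grouped by their first "-"-separated
// segment while preserving first-appearance order, as colKeyDict does.
function orderKeys(keys: string[]): string[] {
  const byFirst: { [first: string]: string[] } = {};
  const firstOrder: string[] = [];
  for (const key of keys) {
    const first = key.split("-")[0];
    if (!(first in byFirst)) {
      byFirst[first] = [];
      firstOrder.push(first);
    }
    byFirst[first].push(key);
  }
  const out: string[] = [];
  for (const first of firstOrder) {
    out.push(...byFirst[first]);
  }
  return out;
}
```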
import dateRange from "./core/daterange" import tensorflow from "./shared/tensorflowlib" -const __version = "1.1.2"; +const __version = "1.2.0"; export { NDframe, diff --git a/src/danfojs-base/io/browser/io.csv.ts b/src/danfojs-base/io/browser/io.csv.ts index daac8ba9..3def7ad6 100644 --- a/src/danfojs-base/io/browser/io.csv.ts +++ b/src/danfojs-base/io/browser/io.csv.ts @@ -23,6 +23,7 @@ import Papa from 'papaparse' * hence all PapaParse options are supported. * @param options Configuration object. Supports all Papaparse parse config options. * @returns DataFrame containing the parsed CSV file. + * @throws {Error} If file cannot be read or parsed * @example * ``` * import { readCSV } from "danfojs-node" @@ -47,17 +48,42 @@ import Papa from 'papaparse' */ const $readCSV = async (file: any, options?: CsvInputOptionsBrowser): Promise => { const frameConfig = options?.frameConfig || {} + const hasStringType = frameConfig.dtypes?.includes("string") + + return new Promise((resolve, reject) => { + let hasError = false; - return new Promise(resolve => { Papa.parse(file, { header: true, - dynamicTyping: true, + dynamicTyping: !hasStringType, skipEmptyLines: 'greedy', + delimiter: ",", ...options, + error: (error) => { + hasError = true; + reject(new Error(`Failed to parse CSV: ${error.message}`)); + }, download: true, - complete: results => { - const df = new DataFrame(results.data, frameConfig); - resolve(df); + complete: (results) => { + if (hasError) return; // Skip if error already occurred + + if (!results.data || results.data.length === 0) { + reject(new Error('No data found in CSV file')); + return; + } + + if (results.errors && results.errors.length > 0) { + reject(new Error(`CSV parsing errors: ${results.errors.map(e => e.message).join(', ')}`)); + return; + } + + try { + const df = new DataFrame(results.data, frameConfig); + resolve(df); + } catch (error) { + const errorMessage = error instanceof Error ? 
error.message : 'Unknown error occurred'; + reject(new Error(`Failed to create DataFrame from CSV: ${errorMessage}`)); + } } }); }); @@ -81,18 +107,36 @@ const $readCSV = async (file: any, options?: CsvInputOptionsBrowser): Promise void, options: CsvInputOptionsBrowser,): Promise => { const frameConfig = options?.frameConfig || {} - return new Promise(resolve => { + return new Promise((resolve, reject) => { let count = 0 + let hasError = false; + const hasStringType = frameConfig.dtypes?.includes("string") Papa.parse(file, { - ...options, - dynamicTyping: true, header: true, download: true, + dynamicTyping: !hasStringType, + delimiter: ",", + ...options, step: results => { - const df = new DataFrame([results.data], { ...frameConfig, index: [count++] }); - callback(df); + if (hasError) return; + try { + const df = new DataFrame([results.data], { ...frameConfig, index: [count++] }); + callback(df); + } catch (error) { + hasError = true; + const errorMessage = error instanceof Error ? error.message : 'Unknown error occurred'; + reject(new Error(`Failed to process CSV chunk: ${errorMessage}`)); + } }, - complete: () => resolve(null) + complete: () => { + if (!hasError) { + resolve(null); + } + }, + error: (error) => { + hasError = true; + reject(new Error(`Failed to parse CSV: ${error.message}`)); + } }); }); }; diff --git a/src/danfojs-base/io/node/io.csv.ts b/src/danfojs-base/io/node/io.csv.ts index c99bfd59..6cf11e86 100644 --- a/src/danfojs-base/io/node/io.csv.ts +++ b/src/danfojs-base/io/node/io.csv.ts @@ -25,6 +25,7 @@ import fs from 'fs' * hence all PapaParse options are supported. * @param options Configuration object. Supports all Papaparse parse config options. * @returns DataFrame containing the parsed CSV file. 
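The `hasError` guard added throughout the CSV readers generalizes to a settle-once pattern: a callback-based parser may fire `error` and then still fire `complete`, so a flag ensures the outcome is recorded exactly once. `parseWithGuard` below uses a stand-in callback parser, not the real Papa.parse API:

```typescript
type Outcome =
  | { ok: true; rows: unknown[] }
  | { ok: false; error: string };

function parseWithGuard(
  parse: (hooks: {
    complete: (rows: unknown[]) => void;
    error: (message: string) => void;
  }) => void
): Outcome {
  // Settle exactly once: an `error` callback may be followed by `complete`.
  let outcome: Outcome = { ok: false, error: "parser never completed" };
  let settled = false;
  parse({
    error: (message) => {
      if (settled) return;
      settled = true;
      outcome = { ok: false, error: `Failed to parse CSV: ${message}` };
    },
    complete: (rows) => {
      if (settled) return;
      settled = true;
      outcome =
        rows && rows.length > 0
          ? { ok: true, rows }
          : { ok: false, error: "No data found in CSV file" };
    },
  });
  return outcome;
}
```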
+ * @throws {Error} If file cannot be read or parsed * @example * ``` * import { readCSV } from "danfojs-node" @@ -49,13 +50,16 @@ import fs from 'fs' */ const $readCSV = async (filePath: string, options?: CsvInputOptionsNode): Promise => { const frameConfig = options?.frameConfig || {} + const hasStringType = frameConfig.dtypes?.includes("string") if (filePath.startsWith("http") || filePath.startsWith("https")) { return new Promise((resolve, reject) => { + let hasError = false; const optionsWithDefaults = { header: true, - dynamicTyping: true, + dynamicTyping: !hasStringType, skipEmptyLines: 'greedy', + delimiter: ",", ...options, } @@ -63,6 +67,7 @@ const $readCSV = async (filePath: string, options?: CsvInputOptionsNode): Promis // reject any non-2xx status codes dataStream.on('response', (response: any) => { if (response.statusCode < 200 || response.statusCode >= 300) { + hasError = true; reject(new Error(`HTTP ${response.statusCode}: ${response.statusMessage}`)); } }); @@ -72,11 +77,31 @@ const $readCSV = async (filePath: string, options?: CsvInputOptionsNode): Promis const data: any = []; parseStream.on("data", (chunk: any) => { - data.push(chunk); + if (!hasError) { + data.push(chunk); + } + }); + + parseStream.on("error", (error: any) => { + hasError = true; + reject(new Error(`Failed to parse CSV: ${error.message}`)); }); parseStream.on("finish", () => { - resolve(new DataFrame(data, frameConfig)); + if (hasError) return; + + if (!data || data.length === 0) { + reject(new Error('No data found in CSV file')); + return; + } + + try { + const df = new DataFrame(data, frameConfig); + resolve(df); + } catch (error) { + const errorMessage = error instanceof Error ? 
error.message : 'Unknown error occurred'; + reject(new Error(`Failed to create DataFrame: ${errorMessage}`)); + } }); }); @@ -84,18 +109,42 @@ const $readCSV = async (filePath: string, options?: CsvInputOptionsNode): Promis return new Promise((resolve, reject) => { fs.access(filePath, fs.constants.F_OK, (err) => { if (err) { - reject("ENOENT: no such file or directory"); + reject(new Error("ENOENT: no such file or directory")); + return; } const fileStream = fs.createReadStream(filePath) + let hasError = false; Papa.parse(fileStream, { header: true, - dynamicTyping: true, + dynamicTyping: !hasStringType, + delimiter: ",", ...options, + error: (error) => { + hasError = true; + reject(new Error(`Failed to parse CSV: ${error.message}`)); + }, complete: results => { - const df = new DataFrame(results.data, frameConfig); - resolve(df); + if (hasError) return; + + if (!results.data || results.data.length === 0) { + reject(new Error('No data found in CSV file')); + return; + } + + if (results.errors && results.errors.length > 0) { + reject(new Error(`CSV parsing errors: ${results.errors.map(e => e.message).join(', ')}`)); + return; + } + + try { + const df = new DataFrame(results.data, frameConfig); + resolve(df); + } catch (error) { + const errorMessage = error instanceof Error ? error.message : 'Unknown error occurred'; + reject(new Error(`Failed to create DataFrame: ${errorMessage}`)); + } } }); }) @@ -109,6 +158,7 @@ const $readCSV = async (filePath: string, options?: CsvInputOptionsNode): Promis * hence all PapaParse options are supported. * @param callback Callback function to be called once the specifed rows are parsed into DataFrame. * @param options Configuration object. Supports all Papaparse parse config options. 
+ * @throws {Error} If file cannot be read or parsed * @example * ``` * import { streamCSV } from "danfojs-node" @@ -128,12 +178,14 @@ const $streamCSV = async (filePath: string, callback: (df: DataFrame) => void, o ...options, } return new Promise((resolve, reject) => { - let count = 0 + let count = 0; + let hasError = false; const dataStream = request.get(filePath); // reject any non-2xx status codes dataStream.on('response', (response: any) => { if (response.statusCode < 200 || response.statusCode >= 300) { + hasError = true; reject(new Error(`HTTP ${response.statusCode}: ${response.statusMessage}`)); } }); @@ -142,35 +194,71 @@ const $streamCSV = async (filePath: string, callback: (df: DataFrame) => void, o dataStream.pipe(parseStream); parseStream.on("data", (chunk: any) => { - const df = new DataFrame([chunk], { ...frameConfig, index: [count++], }); - callback(df); + if (hasError) return; + try { + const df = new DataFrame([chunk], { ...frameConfig, index: [count++] }); + callback(df); + } catch (error) { + hasError = true; + const errorMessage = error instanceof Error ? 
error.message : 'Unknown error occurred'; + reject(new Error(`Failed to process CSV chunk: ${errorMessage}`)); + } }); - parseStream.on("finish", () => { - resolve(null); + parseStream.on("error", (error: any) => { + hasError = true; + reject(new Error(`Failed to parse CSV: ${error.message}`)); }); + parseStream.on("finish", () => { + if (!hasError) { + resolve(null); + } + }); }); } else { - return new Promise((resolve, reject) => { fs.access(filePath, fs.constants.F_OK, (err) => { if (err) { - reject("ENOENT: no such file or directory"); + reject(new Error("ENOENT: no such file or directory")); + return; } const fileStream = fs.createReadStream(filePath) + let hasError = false; + let count = 0; - let count = 0 Papa.parse(fileStream, { header: true, dynamicTyping: true, ...options, + error: (error) => { + hasError = true; + reject(new Error(`Failed to parse CSV: ${error.message}`)); + }, step: results => { - const df = new DataFrame([results.data], { ...frameConfig, index: [count++] }); - callback(df); + if (hasError) return; + + if (results.errors && results.errors.length > 0) { + hasError = true; + reject(new Error(`CSV parsing errors: ${results.errors.map(e => e.message).join(', ')}`)); + return; + } + + try { + const df = new DataFrame([results.data], { ...frameConfig, index: [count++] }); + callback(df); + } catch (error) { + hasError = true; + const errorMessage = error instanceof Error ? 
error.message : 'Unknown error occurred'; + reject(new Error(`Failed to process CSV chunk: ${errorMessage}`)); + } }, - complete: () => resolve(null) + complete: () => { + if (!hasError) { + resolve(null); + } + } }); }); }); diff --git a/src/danfojs-base/io/node/io.excel.ts b/src/danfojs-base/io/node/io.excel.ts index 0a653c21..f6554086 100644 --- a/src/danfojs-base/io/node/io.excel.ts +++ b/src/danfojs-base/io/node/io.excel.ts @@ -71,7 +71,7 @@ const $readExcel = async (filePath: string, options: ExcelInputOptionsNode = {}) const arrBufInt8 = new Uint8Array(arrBuf); const workbook = read(arrBufInt8, { type: "array", ...parsingOptions }); const worksheet = workbook.Sheets[workbook.SheetNames[sheet]]; - const data = utils.sheet_to_json(worksheet); + const data = utils.sheet_to_json(worksheet, { defval: null }); const df = new DataFrame(data, frameConfig); resolve(df); }); @@ -89,7 +89,7 @@ const $readExcel = async (filePath: string, options: ExcelInputOptionsNode = {}) const workbook = readFile(filePath, parsingOptions); const worksheet = workbook.Sheets[workbook.SheetNames[sheet]]; - const data = utils.sheet_to_json(worksheet); + const data = utils.sheet_to_json(worksheet, { defval: null }); const df = new DataFrame(data, frameConfig); resolve(df); }) diff --git a/src/danfojs-base/package.json b/src/danfojs-base/package.json index bf10fd20..85d2af9c 100644 --- a/src/danfojs-base/package.json +++ b/src/danfojs-base/package.json @@ -1,6 +1,6 @@ { "name": "danfojs-base", - "version": "1.1.2", + "version": "1.2.0", "description": "Base package used in danfojs-node and danfojs-browser", "main": "index.ts", "scripts": { diff --git a/src/danfojs-base/shared/utils.ts b/src/danfojs-base/shared/utils.ts index e0553863..8dc4de8e 100644 --- a/src/danfojs-base/shared/utils.ts +++ b/src/danfojs-base/shared/utils.ts @@ -86,12 +86,25 @@ export default class Utils { } /** - * Checks if a value is empty. Empty means it's either null, undefined or NaN + * Checks if a value is empty. 
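Why `{ defval: null }` matters in the Excel reader above: without it, `sheet_to_json` omits keys for blank cells, so rows come back with ragged shapes. A sketch of the normalization it effectively performs (`normalizeRecords` is illustrative, not the xlsx API):

```typescript
// Fill keys that are missing from some rows with an explicit default,
// which is what `sheet_to_json(worksheet, { defval: null })` achieves
// for blank spreadsheet cells.
function normalizeRecords(
  rows: Array<{ [key: string]: unknown }>,
  defval: unknown = null
): Array<{ [key: string]: unknown }> {
  // Collect the union of all column names, preserving first appearance.
  const columns: string[] = [];
  for (const row of rows) {
    for (const key of Object.keys(row)) {
      if (!columns.includes(key)) columns.push(key);
    }
  }
  return rows.map((row) => {
    const full: { [key: string]: unknown } = {};
    for (const col of columns) {
      full[col] = col in row ? row[col] : defval;
    }
    return full;
  });
}
```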
Empty means it's either null, undefined or NaN. + * Empty strings are NOT considered empty. * @param value The value to check. - * @returns + * @returns boolean indicating if the value is empty */ isEmpty(value: T): boolean { - return value === undefined || value === null || (isNaN(value as any) && typeof value !== "string"); + if (value === undefined || value === null) { + return true; + } + + if (typeof value === 'bigint') { + return false; // BigInt values are never considered empty + } + + if (typeof value === 'number') { + return isNaN(value); + } + + return false; // All other types (strings, objects, arrays, etc) are not considered empty } /** diff --git a/src/danfojs-browser/README.md b/src/danfojs-browser/README.md index 7e1520f0..aefc337f 100644 --- a/src/danfojs-browser/README.md +++ b/src/danfojs-browser/README.md @@ -71,7 +71,7 @@ yarn add danfojs For use directly in HTML files, you can add the latest script tag from [JsDelivr](https://www.jsdelivr.com/package/npm/danfojs) to your HTML file: ```html - + ``` See all available versions [here](https://www.jsdelivr.com/package/npm/danfojs) @@ -86,7 +86,7 @@ See all available versions [here](https://www.jsdelivr.com/package/npm/danfojs) - + Document diff --git a/src/danfojs-browser/package.json b/src/danfojs-browser/package.json index 8bfb26f1..47bf9286 100644 --- a/src/danfojs-browser/package.json +++ b/src/danfojs-browser/package.json @@ -1,10 +1,14 @@ { "name": "danfojs", - "version": "1.1.2", + "version": "1.2.0", "description": "JavaScript library providing high performance, intuitive, and easy to use data structures for manipulating and processing structured data.", - "main": "dist/danfojs-browser/src/index.js", - "types": "dist/danfojs-browser/src/index.d.ts", - "module": "lib/bundle.js", + "exports": { + ".": { + "types": "./dist/danfojs-browser/src/index.d.ts", + "node": "./dist/danfojs-browser/src/index.js", + "default": "./lib/bundle.esm.js" + } + }, "directories": { "test": "tests" }, diff --git 
a/src/danfojs-browser/tests/io/csv.reader.test.js b/src/danfojs-browser/tests/io/csv.reader.test.js index 17892295..9fcc2a2f 100644 --- a/src/danfojs-browser/tests/io/csv.reader.test.js +++ b/src/danfojs-browser/tests/io/csv.reader.test.js @@ -68,6 +68,70 @@ describe("readCSV", function () { assert.deepEqual(df.values, values); }); + it("Should throw error when reading non-existent remote file", async function () { + const remoteFile = "https://raw.githubusercontent.com/javascriptdata/danfojs/dev/nonexistent.csv"; + try { + await dfd.readCSV(remoteFile); + assert.fail("Should have thrown an error"); + } catch (error) { + assert.ok(error instanceof Error); + } + }); + + it("Should throw error when reading malformed CSV", async function () { + const malformedCSV = new File([ "a,b,c\n1,2\n3,4,5,6" ], "malformed.csv", { type: "text/csv" }); + try { + await dfd.readCSV(malformedCSV); + assert.fail("Should have thrown an error"); + } catch (error) { + assert.ok(error instanceof Error); + } + }); + + it("Should throw error when reading invalid file type", async function () { + const invalidFile = new File([ "not a csv" ], "test.txt", { type: "text/plain" }); + try { + await dfd.readCSV(invalidFile); + assert.fail("Should have thrown an error"); + } catch (error) { + assert.ok(error instanceof Error); + } + }); + + it("Preserves leading zeros when dtype is string", async function () { + // Create a CSV file with leading zeros + const csvContent = "codes\n012345\n001234"; + const file = new File([ csvContent ], "leading_zeros.csv", { type: "text/csv" }); + + const df = await dfd.readCSV(file, { + frameConfig: { + dtypes: [ "string" ] + } + }); + + assert.deepEqual(df.values, [ [ "012345" ], [ "001234" ] ]); + assert.deepEqual(df.dtypes, [ "string" ]); + + // Verify the values are actually strings + const jsonData = dfd.toJSON(df); + assert.deepEqual(jsonData, [ { codes: "012345" }, { codes: "001234" } ]); + }); + + it("Converts to numbers when dtype is not string", async 
function () { + // Create a CSV file with leading zeros + const csvContent = "codes\n012345\n001234"; + const file = new File([ csvContent ], "leading_zeros.csv", { type: "text/csv" }); + + const df = await dfd.readCSV(file); // default behavior without string dtype + + // Values should be converted to numbers + assert.deepEqual(df.values, [ [ 12345 ], [ 1234 ] ]); + assert.deepEqual(df.dtypes, [ "int32" ]); + + // Verify JSON output + const jsonData = dfd.toJSON(df); + assert.deepEqual(jsonData, [ { codes: 12345 }, { codes: 1234 } ]); + }); }); // describe("streamCSV", function () { @@ -114,5 +178,4 @@ describe("toCSV", function () { let df = new dfd.Series(data); assert.deepEqual(dfd.toCSV(df, { sep: "+", download: false }), `1+2+3+4+5+6+7+8+9+10+11+12`); }); - }); diff --git a/src/danfojs-browser/yarn.lock b/src/danfojs-browser/yarn.lock index 0af142af..5960078f 100644 --- a/src/danfojs-browser/yarn.lock +++ b/src/danfojs-browser/yarn.lock @@ -2680,9 +2680,9 @@ electron-to-chromium@^1.3.723: integrity sha512-+LPJVRsN7hGZ9EIUUiWCpO7l4E3qBYHNadazlucBfsXBbccDFNKUBAgzE68FnkWGJPwD/AfKhSzL+G+Iqb8A4A== elliptic@^6.5.3: - version "6.5.4" - resolved "https://registry.npmjs.org/elliptic/-/elliptic-6.5.4.tgz" - integrity sha512-iLhC6ULemrljPZb+QutR5TQGB+pdW6KGD5RSegS+8sorOZT+rdQFbsQFJgvN3eRqNALqJer4oQ16YvJHlU8hzQ== + version "6.6.1" + resolved "https://registry.yarnpkg.com/elliptic/-/elliptic-6.6.1.tgz#3b8ffb02670bf69e382c7f65bf524c97c5405c06" + integrity sha512-RaddvvMatK2LJHqFJ+YA4WysVN5Ita9E35botqIYspQ4TkRAlCicdzKOjlyv/1Za5RyTNn7di//eEV0uTAfe3g== dependencies: bn.js "^4.11.9" brorand "^1.1.0" diff --git a/src/danfojs-node/README.md b/src/danfojs-node/README.md index 7e1520f0..aefc337f 100644 --- a/src/danfojs-node/README.md +++ b/src/danfojs-node/README.md @@ -71,7 +71,7 @@ yarn add danfojs For use directly in HTML files, you can add the latest script tag from [JsDelivr](https://www.jsdelivr.com/package/npm/danfojs) to your HTML file: ```html - + ``` See all available 
versions [here](https://www.jsdelivr.com/package/npm/danfojs) @@ -86,7 +86,7 @@ See all available versions [here](https://www.jsdelivr.com/package/npm/danfojs) - + Document diff --git a/src/danfojs-node/package.json b/src/danfojs-node/package.json index d224ab4d..8df607ad 100644 --- a/src/danfojs-node/package.json +++ b/src/danfojs-node/package.json @@ -1,6 +1,6 @@ { "name": "danfojs-node", - "version": "1.1.2", + "version": "1.2.0", "description": "JavaScript library providing high performance, intuitive, and easy to use data structures for manipulating and processing structured data.", "main": "dist/danfojs-node/src/index.js", "types": "dist/danfojs-node/src/index.d.ts", diff --git a/src/danfojs-node/test/io/csv.reader.test.ts b/src/danfojs-node/test/io/csv.reader.test.ts index 2f60b0a0..4cfef8dd 100644 --- a/src/danfojs-node/test/io/csv.reader.test.ts +++ b/src/danfojs-node/test/io/csv.reader.test.ts @@ -2,14 +2,19 @@ import path from "path"; import chai, { assert, expect } from "chai"; import { describe, it } from "mocha"; import chaiAsPromised from "chai-as-promised"; -import { DataFrame, readCSV, Series, streamCSV, toCSV } from "../../dist/danfojs-node/src"; +import { DataFrame, readCSV, Series, streamCSV, toCSV, toJSON } from "../../dist/danfojs-node/src"; +import fs from 'fs'; +import process from 'process'; chai.use(chaiAsPromised); describe("readCSV", function () { this.timeout(10000); + + const testSamplesDir = path.join(process.cwd(), "test", "samples"); + it("Read local csv file works", async function () { - const filePath = path.join(process.cwd(), "test", "samples", "titanic.csv"); + const filePath = path.join(testSamplesDir, "titanic.csv"); let df = await readCSV(filePath, { header: true, preview: 5 }); assert.deepEqual(df.shape, [5, 8]); assert.deepEqual(df.columns, [ @@ -29,8 +34,9 @@ describe("readCSV", function () { 'int32', 'float32' ]); }); + it("Read local CSV file with config works", async function () { - const filePath = 
path.join(process.cwd(), "test", "samples", "titanic.csv"); + const filePath = path.join(testSamplesDir, "titanic.csv"); const frameConfig = { columns: [ 'A', @@ -62,8 +68,9 @@ describe("readCSV", function () { 'int32', 'float32' ]); }); + it("Read local csv with correct types and format works", async function () { - const filePath = path.join(process.cwd(), "test", "samples", "iris.csv"); + const filePath = path.join(testSamplesDir, "iris.csv"); let df = await readCSV(filePath, { header: true, preview: 5 }); const values = [ [5.1, 3.5, 1.4, 0.2, 0.0], @@ -72,47 +79,101 @@ describe("readCSV", function () { [4.6, 3.1, 1.5, 0.2, 0.0], [5.0, 3.6, 1.4, 0.2, 0.0] ]; - console.log(df.values); assert.deepEqual(df.values, values); }); + it("Throws error if file not found", async function () { const filePath = "notfound.csv"; - // assert.isRejected(readCSV(filePath, { header: true, preview: 5 })); - await expect(readCSV(filePath, { header: true, preview: 5 })).to.be.rejectedWith("ENOENT: no such file or directory"); + await expect(readCSV(filePath)).to.be.rejectedWith("ENOENT: no such file or directory"); }); + it("Throws error if file not found over http", async function () { const filePath = "https://getdata.com/notfound.csv"; - // assert.isRejected(readCSV(filePath, { header: true, preview: 5 })); - await expect(readCSV(filePath)).to.be.rejected; - }); - // it("Read remote csv file works", async function () { - // const remoteFile = "https://raw.githubusercontent.com/opensource9ja/danfojs/dev/danfojs-node/tests/samples/titanic.csv"; - // let df = await readCSV(remoteFile, { header: true, preview: 5 }); - // assert.deepEqual(df.shape, [5, 8]); - // assert.deepEqual(df.columns, [ - // 'Survived', - // 'Pclass', - // 'Name', - // 'Sex', - // 'Age', - // 'Siblings/Spouses Aboard', - // 'Parents/Children Aboard', - // 'Fare' - // ]); - // assert.deepEqual(df.dtypes, [ - // 'int32', 'int32', - // 'string', 'string', - // 'int32', 'int32', - // 'int32', 'float32' - // ]); - // 
}); + await expect(readCSV(filePath)).to.be.rejectedWith(/HTTP \d+:/); + }); + + it("Throws error when reading empty CSV file", async function () { + const filePath = path.join(testSamplesDir, "empty.csv"); + // Create empty file + fs.writeFileSync(filePath, ""); + await expect(readCSV(filePath)).to.be.rejectedWith("No data found in CSV file"); + fs.unlinkSync(filePath); // Clean up + }); + + it("Throws error when reading malformed CSV", async function () { + const filePath = path.join(testSamplesDir, "malformed.csv"); + // Create malformed CSV file + fs.writeFileSync(filePath, "a,b,c\n1,2\n3,4,5,6"); + await expect(readCSV(filePath)).to.be.rejectedWith("CSV parsing errors"); + fs.unlinkSync(filePath); // Clean up + }); + + it("Throws error when DataFrame creation fails", async function () { + const filePath = path.join(testSamplesDir, "invalid.csv"); + await expect(readCSV(filePath)).to.be.rejectedWith("ENOENT: no such file or directory"); + }); + + it("Preserves leading zeros when dtype is string", async function () { + const filePath = path.join(testSamplesDir, "leading_zeros.csv"); + // Create test CSV file + fs.writeFileSync(filePath, "codes\n012345\n001234"); + + try { + const df = await readCSV(filePath, { + frameConfig: { + dtypes: ["string"] + } + }); + + assert.deepEqual(df.values, [["012345"], ["001234"]]); + assert.deepEqual(df.dtypes, ["string"]); + + // Verify the values are actually strings + const jsonData = toJSON(df); + assert.deepEqual(jsonData, [{ codes: "012345" }, { codes: "001234" }]); + + // Clean up + fs.unlinkSync(filePath); + } catch (error) { + // Clean up even if test fails + fs.unlinkSync(filePath); + throw error; + } + }); + it("Converts to numbers when dtype is not string", async function () { + const filePath = path.join(testSamplesDir, "leading_zeros.csv"); + // Create test CSV file + fs.writeFileSync(filePath, "codes\n012345\n001234"); + + try { + const df = await readCSV(filePath); // default behavior without string dtype + + // 
Values should be converted to numbers + assert.deepEqual(df.values, [[12345], [1234]]); + assert.deepEqual(df.dtypes, ["int32"]); + + // Verify JSON output + const jsonData = toJSON(df); + assert.deepEqual(jsonData, [{ codes: 12345 }, { codes: 1234 }]); + + // Clean up + fs.unlinkSync(filePath); + } catch (error) { + // Clean up even if test fails + fs.unlinkSync(filePath); + throw error; + } + }); }); describe("streamCSV", function () { this.timeout(100000); + + const testSamplesDir = path.join(process.cwd(), "test", "samples"); + it("Streaming local csv file with callback works", async function () { - const filePath = path.join(process.cwd(), "test", "samples", "titanic.csv"); + const filePath = path.join(testSamplesDir, "titanic.csv"); await streamCSV(filePath, (df) => { if (df) { assert.deepEqual(df.shape, [1, 8]); @@ -130,60 +191,55 @@ describe("streamCSV", function () { assert.deepEqual(df, null); } }, { header: true }); - }); - // it("Streaming remote csv file with callback works", async function () { - // const remoteFile = "https://raw.githubusercontent.com/opensource9ja/danfojs/dev/danfojs-node/tests/samples/titanic.csv"; - // await streamCSV(remoteFile, (df) => { - // if (df) { - // assert.deepEqual(df.shape, [1, 8]); - // assert.deepEqual(df.columns, [ - // 'Survived', - // 'Pclass', - // 'Name', - // 'Sex', - // 'Age', - // 'Siblings/Spouses Aboard', - // 'Parents/Children Aboard', - // 'Fare' - // ]); - // } else { - // assert.deepEqual(df, null); - // } - // }, { header: true }); - - // }); + it("Throws error when streaming non-existent file", async function () { + const filePath = "notfound.csv"; + await expect(streamCSV(filePath, () => {})).to.be.rejectedWith("ENOENT: no such file or directory"); + }); + it("Throws error when streaming malformed CSV", async function () { + const filePath = path.join(testSamplesDir, "malformed_stream.csv"); + // Create malformed CSV file + fs.writeFileSync(filePath, "a,b,c\n1,2\n3,4,5,6"); + await 
expect(streamCSV(filePath, () => {})).to.be.rejectedWith("CSV parsing errors"); + fs.unlinkSync(filePath); // Clean up + }); }); - describe("toCSV", function () { + const testSamplesDir = path.join(process.cwd(), "test", "samples"); + it("toCSV works", async function () { const data = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]; let df = new DataFrame(data, { columns: ["a", "b", "c", "d"] }); assert.deepEqual(toCSV(df, {}), `a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11,12\n`); }); - it("toCSV works for specified seperator", async function () { + + it("toCSV works for specified separator", async function () { const data = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]; let df = new DataFrame(data, { columns: ["a", "b", "c", "d"] }); assert.deepEqual(toCSV(df, { sep: "+" }), `a+b+c+d\n1+2+3+4\n5+6+7+8\n9+10+11+12\n`); }); + it("toCSV write to local file works", async function () { const data = [[1, 2, 3, "4"], [5, 6, 7, "8"], [9, 10, 11, "12"]]; let df = new DataFrame(data, { columns: ["a", "b", "c", "d"] }); - const filePath = path.join(process.cwd(), "test", "samples", "test_write.csv"); + const filePath = path.join(testSamplesDir, "test_write.csv"); toCSV(df, { sep: ",", filePath }); + // Clean up + fs.unlinkSync(filePath); }); + it("toCSV works for series", async function () { const data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]; let df = new Series(data); assert.deepEqual(toCSV(df, { sep: "+" }), `1+2+3+4+5+6+7+8+9+10+11+12`); }); + it("calling df.toCSV works", async function () { const data = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]; let df = new DataFrame(data, { columns: ["a", "b", "c", "d"] }); assert.deepEqual(df.toCSV(), `a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11,12\n`); }); - }); diff --git a/src/danfojs-node/test/io/excel.reader.test.ts b/src/danfojs-node/test/io/excel.reader.test.ts index 5508dba9..8e4df028 100644 --- a/src/danfojs-node/test/io/excel.reader.test.ts +++ b/src/danfojs-node/test/io/excel.reader.test.ts @@ -106,4 +106,23 @@ describe("toExcel", 
function () { assert.equal(fs.existsSync(filePath), true) }); + it("handles null values correctly", async function () { + const data = [ + { 'name': 'Alice', 'age': 25 }, + { 'name': null, 'age': 30 }, + { 'name': 'Charlie', 'age': 35 } + ]; + const df: any = new DataFrame(data); + const filePath = path.join(process.cwd(), "test", "samples", "people.xlsx"); + + toExcel(df, { filePath, sheetName: 'Sheet1' }); + const df2: any = await readExcel(filePath); + + assert.deepEqual(df2.values, [ + ['Alice', 25], + [null, 30], + ['Charlie', 35] + ]); + assert.deepEqual(df2.columns, ['name', 'age']); + }); }) diff --git a/src/danfojs-node/test/samples/sample.xlsx b/src/danfojs-node/test/samples/sample.xlsx index 67a11847..d294b205 100644 Binary files a/src/danfojs-node/test/samples/sample.xlsx and b/src/danfojs-node/test/samples/sample.xlsx differ diff --git a/src/danfojs-node/test/utils.test.ts b/src/danfojs-node/test/utils.test.ts index a8d08fe5..0a8da92d 100644 --- a/src/danfojs-node/test/utils.test.ts +++ b/src/danfojs-node/test/utils.test.ts @@ -40,6 +40,49 @@ describe("Utils", function () { assert.isTrue(utils.isUndefined(arr)); }); + describe("isEmpty", function () { + it("should return true for null values", function () { + assert.isTrue(utils.isEmpty(null)); + }); + + it("should return true for undefined values", function () { + assert.isTrue(utils.isEmpty(undefined)); + }); + + it("should return true for NaN values", function () { + assert.isTrue(utils.isEmpty(NaN)); + }); + + it("should return false for strings (including empty strings)", function () { + assert.isFalse(utils.isEmpty("")); + assert.isFalse(utils.isEmpty(" ")); + assert.isFalse(utils.isEmpty("hello")); + }); + + it("should return false for numbers (except NaN)", function () { + assert.isFalse(utils.isEmpty(0)); + assert.isFalse(utils.isEmpty(-1)); + assert.isFalse(utils.isEmpty(42.5)); + }); + + it("should return false for BigInt values", function () { + 
assert.isFalse(utils.isEmpty(BigInt(9007199254740991))); + assert.isFalse(utils.isEmpty(BigInt(0))); + }); + + it("should return false for objects and arrays", function () { + assert.isFalse(utils.isEmpty({})); + assert.isFalse(utils.isEmpty([])); + assert.isFalse(utils.isEmpty({ key: "value" })); + assert.isFalse(utils.isEmpty([1, 2, 3])); + }); + + it("should return false for boolean values", function () { + assert.isFalse(utils.isEmpty(true)); + assert.isFalse(utils.isEmpty(false)); + }); + }); + it("Checks if value is a valid Date object", function () { let date1 = new Date(); let date2 = "2021-01-01 00:00:00";