{"id":4863,"date":"2020-11-06T13:20:02","date_gmt":"2020-11-06T13:20:02","guid":{"rendered":"https:\/\/machinelearningplus.com\/?p=4863"},"modified":"2022-04-04T07:43:19","modified_gmt":"2022-04-04T07:43:19","slug":"dask-tutorial","status":"publish","type":"post","link":"https:\/\/machinelearningplus.com\/python\/dask-tutorial\/","title":{"rendered":"Dask &#8211; How to handle large dataframes in python using parallel computing"},"content":{"rendered":"<p><em>Dask provides efficient parallelization for data analytics in python. Dask DataFrames allow you to work with large datasets for both data manipulation and building ML models with only minimal code changes. It is open source and works well with python libraries like NumPy, scikit-learn, etc. Let&#8217;s understand how to use Dask with hands-on examples.<\/em><\/p>\n<p><img fetchpriority=\"high\" decoding=\"async\" src=\"https:\/\/www.localhost:8080\/wp-content\/uploads\/2020\/11\/Dask-cover.jpg\" alt=\"\" width=\"1280\" height=\"853\" class=\"alignnone size-full wp-image-4896\" srcset=\"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/Dask-cover.jpg 1280w, https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/Dask-cover-300x200.jpg 300w, https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/Dask-cover-1024x682.jpg 1024w, https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/Dask-cover-768x512.jpg 768w\" sizes=\"(max-width: 1280px) 100vw, 1280px\" \/>Dask &#8211; How to handle large data in python using parallel computing<\/p>\n<h2>Contents<\/h2>\n<ol>\n<li><a href=\"#why-do-you-need-dask\">Why do you need Dask?<\/a><\/li>\n<li><a href=\"#what-is-dask\">What is Dask?<\/a><\/li>\n<li><a href=\"#parallel-processing-with-dask\">Quickly about Parallel Processing<\/a><\/li>\n<li><a href=\"#how-to-implement-parallel-processing-with-dask\">How to implement Parallel Processing with Dask<\/a><\/li>\n<li><a href=\"#what-does-dask-delayed-do\">What does 
dask.delayed do?<\/a><\/li>\n<li><a href=\"#example-parallelizing-a-for-loop-with-dask\">Example: Parallelizing a for loop with Dask<\/a><\/li>\n<li><a href=\"#how-to-use-dask-dataframes\">How to use Dask DataFrames<\/a><\/li>\n<li><a href=\"#how-is-dask-dataframe-different-from-pandas-dataframe\">How is dask.dataframe different from pandas.dataframe?<\/a><\/li>\n<li><a href=\"#introduction-to-dask-bags\">Introduction to Dask Bags<\/a><\/li>\n<li><a href=\"#how-to-use-dask-bag-for-various-operations\">How to use Dask Bag for various operations?<\/a><\/li>\n<li><a href=\"#distributed-computing-with-dask-hands-on-example\">Distributed computing with Dask &#8211; Hands-on Example<\/a><\/li>\n<\/ol>\n<h2 id=\"why-do-you-need-dask\">Why do you need Dask?<\/h2>\n<p>Python packages like numpy, pandas, sklearn and seaborn make data manipulation and ML tasks very convenient. For most data analysis tasks, the python <a href=\"https:\/\/www.localhost:8080\/python\/101-pandas-exercises-python\/\">pandas<\/a> package is good enough. You can do all sorts of data manipulation with it, and it is compatible with building ML models.<\/p>\n<p>But, as your data gets bigger than what you can fit in RAM, pandas won&#8217;t be sufficient.<\/p>\n<p>This is a very common problem.<\/p>\n<p>You may use Spark or Hadoop to solve this. But, these are not python environments. This stops you from using numpy, sklearn, pandas, tensorflow, and all the commonly used Python libraries for ML.<\/p>\n<p>Is there a solution for this?<\/p>\n<p>Yes! This is where Dask comes in.<\/p>\n<h2 id=\"what-is-dask\">What is Dask?<\/h2>\n<p>Dask is an open-source library that provides <strong>advanced parallelization for analytics<\/strong>, especially when you are working with large data.<\/p>\n<p>It is built to help you improve code performance and scale up without having to rewrite your entire code. 
The good thing is, you can use all your favorite python libraries, as Dask is built in coordination with numpy, scikit-learn, scikit-image, pandas, xgboost, RAPIDS and others.<\/p>\n<p>That means you can now use Dask not only to speed up computations on datasets using parallel processing, but also to build ML models using scikit-learn or XGBoost on much larger datasets.<\/p>\n<p>You can use it to scale your python code for data analysis. If this sounds a bit complicated to implement, just read on.<\/p>\n<p><strong>Related Post:<\/strong> <a href=\"https:\/\/www.localhost:8080\/python\/parallel-processing-python\/\">Basics of python parallel processing with multiprocessing, clearly explained<\/a>.<\/p>\n<h2 id=\"parallel-processing-with-dask\">Quickly about Parallel Processing<\/h2>\n<p>So, what is parallel processing?<\/p>\n<p>Parallel processing refers to executing multiple tasks at the same time, using multiple processors in the same machine.<\/p>\n<p>Generally, code is executed in sequence, one task at a time. But suppose you have complex code that takes a long time to run, where most of the steps are independent of each other, with no data or logic dependencies between them. This is the case for most matrix operations.<\/p>\n<p>So, instead of waiting for the previous task to complete, we <strong>compute multiple steps simultaneously<\/strong>. This lets you take advantage of the multiple processors available in most modern computers, thereby reducing the total time taken.<\/p>\n<p>Dask is designed to do this efficiently on datasets, with a minimal learning curve. Let&#8217;s see how.<\/p>\n<h2 id=\"how-to-implement-parallel-processing-with-dask\">How to implement Parallel Processing with Dask<\/h2>\n<p>A very simple way is to use the <code>dask.delayed<\/code> decorator to implement parallel processing. 
Let me explain it through an example.<\/p>\n<p>Consider the below code snippet.<\/p>\n<pre><code class=\"language-python\">from time import sleep\n\ndef apply_discount(x):\n  sleep(1)\n  x=x-0.2*x\n  return x\n\ndef get_total(a,b):\n  sleep(1)\n  return a+b\n\n\ndef get_total_price(x,y):\n  sleep(1)\n  a=apply_discount(x)\n  b=apply_discount(y)\n  return get_total(a,b)\n<\/code><\/pre>\n<p>The above code applies a 20 percent discount to each price and then adds them up. I&#8217;ve inserted a <code>sleep<\/code> call explicitly, so each function takes 1 second to run. This is a small snippet that will run quickly, but I have chosen it to make the demonstration easy to follow for beginners.<\/p>\n<pre><code class=\"language-python\">%%time\n# This takes six seconds to run because we call each\n# function sequentially, one after the other\n\nx = apply_discount(100)\ny = apply_discount(200)\nz = get_total_price(x,y)\n<\/code><\/pre>\n<pre><code>CPU times: user 859 \u00b5s, sys: 202 \u00b5s, total: 1.06 ms\nWall time: 6.01 s\n<\/code><\/pre>\n<p>I have recorded the time taken for this execution using <code>%%time<\/code> as shown. You can observe that the time taken is 6.01 seconds when executed sequentially. Now, let&#8217;s see how to use <code>dask.delayed<\/code> to reduce this time.<\/p>\n<pre><code class=\"language-python\"># Import dask and dask.delayed\nimport dask\nfrom dask import delayed\n<\/code><\/pre>\n<p>Now, you can transform the functions <code>apply_discount()<\/code> and <code>get_total_price()<\/code>. 
You can use the <code>delayed()<\/code> function to wrap the function calls that you want to turn into tasks.<\/p>\n<pre><code class=\"language-python\"># Wrapping the function calls using dask.delayed\nx = delayed(apply_discount)(100)\ny = delayed(apply_discount)(200)\nz = delayed(get_total_price)(x, y)\n<\/code><\/pre>\n<h2 id=\"what-does-dask-delayed-do\">What does dask.delayed do?<\/h2>\n<p>It creates a <code>delayed<\/code> object that keeps track of all the functions to call and the arguments to pass to them. Basically, it builds a task graph that explains the entire computation. It helps to spot opportunities for parallel execution.<\/p>\n<p>So, the <code>z<\/code> object created in the above code is a delayed object, or &#8220;lazy object&#8221;, which has all the information needed to execute the logic. You can see the optimal task graph created by dask by calling the <code>visualize()<\/code> function.<\/p>\n<pre><code class=\"language-python\">z.visualize()\n<\/code><\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/www.localhost:8080\/wp-content\/uploads\/2020\/11\/picture1-min.png\" alt=\"\" width=\"343\" height=\"492\" class=\"alignnone size-full wp-image-4873\" srcset=\"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/picture1-min.png 343w, https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/picture1-min-209x300.png 209w\" sizes=\"(max-width: 343px) 100vw, 343px\" \/><\/p>\n<p>From the above image, you can clearly see that the two instances of the <code>apply_discount()<\/code> function are called in parallel. This is an opportunity to save time and processing power by executing them simultaneously.<\/p>\n<p>Up until now, only the logic to compute the output, that is, the task graph, has been built. 
To actually execute it, let&#8217;s call the <code>compute()<\/code> method of <code>z<\/code>.<\/p>\n<pre><code class=\"language-python\">%%time\nz.compute()\n<\/code><\/pre>\n<pre><code>CPU times: user 6.33 ms, sys: 1.35 ms, total: 7.68 ms\nWall time: 5.01 s\n<\/code><\/pre>\n<p>The total time taken has come down from 6.01 to 5.01 seconds, because the two top-level <code>apply_discount()<\/code> calls ran in parallel. This is the basic concept of parallel computing, and Dask makes it very convenient.<\/p>\n<p>Let&#8217;s now look at more useful examples.<\/p>\n<h2 id=\"example-parallelizing-a-for-loop-with-dask\">Example: Parallelizing a for loop with Dask<\/h2>\n<p>In the previous section, you understood how <code>dask.delayed<\/code> works. Now, let&#8217;s see how to do parallel computing in a <code>for-loop<\/code>.<\/p>\n<p>Consider the below code.<\/p>\n<p>You have a <code>for-loop<\/code>, where for each element a series of functions is called.<\/p>\n<p>In this case, there is a lot of opportunity for parallel computing. Again, we wrap the function calls with <code>delayed()<\/code>, to get the parallel computing task graph.<\/p>\n<pre><code class=\"language-python\"># Functions to perform mathematical operations\ndef square(x):\n    return x*x\n\ndef double(x):\n    return x*2\n\ndef add(x, y):\n    return x + y\n\n# For loop that calls the above functions for each element\noutput = []\nfor i in range(6):\n    a = delayed(square)(i)\n    b = delayed(double)(i)\n    c = delayed(add)(a, b)\n    output.append(c)\n\ntotal = dask.delayed(sum)(output)\n\n# Visualizing the task graph for the problem\ntotal.visualize()\n<\/code><\/pre>\n<p>For this case, the <code>total<\/code> variable is the lazy object. 
Let&#8217;s visualize the task graph using <code>total.visualize()<\/code>.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.localhost:8080\/wp-content\/uploads\/2020\/11\/picture2-min.png\" alt=\"\" width=\"795\" height=\"353\" class=\"alignnone size-full wp-image-4875\" srcset=\"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/picture2-min.png 795w, https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/picture2-min-300x133.png 300w, https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/picture2-min-768x341.png 768w\" sizes=\"(max-width: 795px) 100vw, 795px\" \/><\/p>\n<p>You can see from the above graph that as problems get more complex, parallel computing becomes more useful and necessary.<\/p>\n<p>Now, wrapping every function call inside <code>delayed()<\/code> becomes laborious. Fortunately, the <code>delayed<\/code> function is also a <strong>Decorator<\/strong>. So, you can just add the <code>@delayed<\/code> decorator before the function definitions as shown below. 
This reduces the number of code changes.<\/p>\n<pre><code class=\"language-python\"># Using delayed as a decorator to achieve parallel computing.\n\n@delayed\ndef square(x):\n    return x*x\n\n@delayed\ndef double(x):\n    return x*2\n\n@delayed\ndef add(x, y):\n    return x + y\n\n# No change has to be done in function calls\noutput = []\nfor i in range(6):\n    a = square(i)\n    b = double(i)\n    c = add(a, b)\n    output.append(c)\n\ntotal = dask.delayed(sum)(output)\ntotal.visualize()\n<\/code><\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.localhost:8080\/wp-content\/uploads\/2020\/11\/Example_-Parallelizing_a_for_loop_with_Dask_2-min.png\" alt=\"Example : Parallelizing a for loop with Dask 2\" width=\"901\" height=\"398\" class=\"alignnone size-full wp-image-4871\" srcset=\"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/Example_-Parallelizing_a_for_loop_with_Dask_2-min.png 901w, https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/Example_-Parallelizing_a_for_loop_with_Dask_2-min-300x133.png 300w, https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/Example_-Parallelizing_a_for_loop_with_Dask_2-min-768x339.png 768w\" sizes=\"(max-width: 901px) 100vw, 901px\" \/><\/p>\n<p>As expected, you get the same output.<\/p>\n<p>So you can use <code>delayed<\/code> as a decorator as is, and it will parallelize a for-loop as well. Isn&#8217;t that awesome?<\/p>\n<h2 id=\"how-to-use-dask-dataframes\">Dask DataFrames &#8211; How to use them?<\/h2>\n<p>You saw how Dask helps to overcome the problem of long execution and training times. Another important problem we discussed was the <strong>larger-than-memory datasets<\/strong>.<\/p>\n<p>The commonly used library for working with datasets is Pandas. But, many real-life ML problems have datasets that are larger than your RAM!<\/p>\n<p>In these cases, Dask DataFrames are useful. 
You can simply import the dataset as a <code>dask.dataframe<\/code> instead, which you can later convert to a pandas dataframe after the necessary wrangling and calculations are done.<\/p>\n<h2 id=\"how-is-dask-dataframe-different-from-pandas-dataframe\">How is dask.dataframe different from pandas.dataframe?<\/h2>\n<p>A Dask DataFrame is a large parallel DataFrame composed of many smaller, in-memory Pandas DataFrames, split along the index.<\/p>\n<p>These Pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. One Dask DataFrame operation triggers many operations on the constituent Pandas DataFrames.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.localhost:8080\/wp-content\/uploads\/2020\/11\/dask-df.png\" alt=\"Dask dataframe\" width=\"929\" height=\"688\" class=\"alignnone size-full wp-image-4971\" srcset=\"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/dask-df.png 929w, https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/dask-df-300x222.png 300w, https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/dask-df-768x569.png 768w\" sizes=\"(max-width: 929px) 100vw, 929px\" \/><\/p>\n<p>The Dask Dataframe interface is very similar to Pandas, so as to ensure familiarity for pandas users. There are some differences, which we shall see.<\/p>\n<p>For understanding the interface, let&#8217;s start with a default dataset provided by Dask. I have used the <code>dask.datasets.timeseries()<\/code> function, which creates a time series from random data.<\/p>\n<pre><code class=\"language-python\">import dask\nimport dask.dataframe as dd\ndata_frame = dask.datasets.timeseries()\n<\/code><\/pre>\n<p>The <code>data_frame<\/code> variable is now our dask dataframe. In pandas, if you print the variable, it&#8217;ll show a short preview of the contents. 
Let&#8217;s see what happens in Dask.<\/p>\n<pre><code class=\"language-python\">data_frame\n<\/code><\/pre>\n<p>You can see that only the structure is there, no data has been printed. It&#8217;s because Dask Dataframes are lazy and do not perform operations unless necessary. You can use the <code>head()<\/code> method to visualize the data.<\/p>\n<pre><code class=\"language-python\">data_frame.head()\n<\/code><\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.localhost:8080\/wp-content\/uploads\/2020\/11\/table1-min.png\" alt=\"data set\" width=\"431\" height=\"228\" class=\"alignnone size-full wp-image-4865\" srcset=\"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/table1-min.png 431w, https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/table1-min-300x159.png 300w\" sizes=\"(max-width: 431px) 100vw, 431px\" \/><\/p>\n<p>Now, let&#8217;s perform a few basic operations that you would expect from pandas, on the dask dataframe. One of the most common operations is <code>groupby()<\/code>.<\/p>\n<pre><code class=\"language-python\"># Applying groupby operation\ndf = data_frame.groupby('name').y.std()\ndf\n<\/code><\/pre>\n<pre><code>Dask Series Structure:\nnpartitions=1\n    float64\n        ...\nName: y, dtype: float64\nDask Name: sqrt, 67 tasks\n<\/code><\/pre>\n<p>If you want the results, then you can call the <code>compute()<\/code> function as shown below.<\/p>\n<pre><code class=\"language-python\">df.compute()\n<\/code><\/pre>\n<pre><code>name\nAlice       0.575963\nBob         0.576803\nCharlie     0.577633\nDan         0.578868\nEdith       0.577293\nFrank       0.577018\nGeorge      0.576834\nHannah      0.577177\nIngrid      0.578378\nJerry       0.577362\nKevin       0.577626\nLaura       0.577829\nMichael     0.576828\nNorbert     0.576417\nOliver      0.576665\nPatricia    0.577810\nQuinn       0.578222\nRay         0.577239\nSarah       0.577831\nTim         0.578482\nUrsula      0.576405\nVictor      0.577622\nWendy       0.577442\nXavier      0.578316\nYvonne      0.577285\nZelda       0.576796\nName: y, dtype: float64\n<\/code><\/pre>\n<p>Sometimes the original dataframe may be larger than RAM, so you would have loaded it as a Dask dataframe. After performing some operations, you might get a smaller dataframe, which you would like to have in Pandas. You can easily convert a Dask dataframe into a Pandas dataframe by calling <code>df.compute()<\/code> and storing the result.<\/p>\n<p>The <code>compute()<\/code> function turns a lazy Dask collection into its in-memory equivalent (in this case, a pandas object). You can verify this with the <code>type()<\/code> function as shown below.<\/p>\n<pre><code class=\"language-python\"># Converting dask dataframe into pandas dataframe\nresult_df=df.compute()\ntype(result_df)\n<\/code><\/pre>\n<pre><code>pandas.core.series.Series\n<\/code><\/pre>\n<p>Another useful feature is the <code>persist()<\/code> function of dask dataframe.<\/p>\n<p><strong>So, what does the <code>persist()<\/code> function do?<\/strong><\/p>\n<p>This function turns a lazy Dask collection into a Dask collection with the same metadata. The difference is that earlier the results were not computed; the collection just held the information needed to compute them. After <code>persist()<\/code>, the results are fully computed, or actively computing in the background.<\/p>\n<p>This function is particularly useful when using distributed systems, because the results will be kept in distributed memory, rather than returned to the local process as with compute.<\/p>\n<pre><code class=\"language-python\"># Calling the persist function of dask dataframe\ndf = df.persist()\n<\/code><\/pre>\n<p>The majority of the normal operations have a similar syntax to that of pandas. The difference is that to actually compute results at any point, you will have to call the <code>compute()<\/code> function. 
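<\/p>\n<p>To make the dask-to-pandas round trip concrete, here is a minimal, self-contained sketch. The tiny pandas frame below is a stand-in for data you would normally load lazily (for example with <code>dd.read_csv()<\/code>), and the column names are hypothetical.<\/p>\n<pre><code class=\"language-python\">import pandas as pd
import dask.dataframe as dd

# Toy stand-in for a larger-than-memory dataset
pdf = pd.DataFrame({'name': ['Alice', 'Bob', 'Alice', 'Bob'],
                    'y': [1.0, 2.0, 3.0, 4.0]})

# Split into 2 partitions; operations on ddf build a lazy task graph
ddf = dd.from_pandas(pdf, npartitions=2)

# compute() materializes the result as an in-memory pandas object
means = ddf.groupby('name').y.mean().compute()
print(means['Alice'])   # prints 2.0
<\/code><\/pre>\n<p>Here, <code>compute()<\/code> again returns a regular pandas object, just as with <code>df.compute()<\/code> earlier. 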
Below are a few examples that demonstrate the similarity of Dask with the Pandas API.<\/p>\n<pre><code class=\"language-python\">df.loc['2000-01-05']\n<\/code><\/pre>\n<pre><code>Dask Series Structure:\nnpartitions=1\n    float64\n\nName: y, dtype: float64\nDask Name: try_loc, 2 tasks\n<\/code><\/pre>\n<p>Now, using <code>compute()<\/code> on this materializes it.<\/p>\n<pre><code class=\"language-python\">%%time\ndf.loc['2000-01-05'].compute()\n<\/code><\/pre>\n<pre><code>CPU times: user 3.03 ms, sys: 0 ns, total: 3.03 ms\nWall time: 2.87 ms\n\nSeries([], Name: y, dtype: float64)\n<\/code><\/pre>\n<h2 id=\"introduction-to-dask-bags\">Introduction to Dask Bags<\/h2>\n<p>In many cases, the raw input has a lot of messy data that needs processing. The messy data is often processed and represented as a sequence of arbitrary inputs, usually in the form of lists, dicts, sets, etc. A common problem is that they take up a lot of storage and iterating through them takes time.<\/p>\n<p>Is there a way to optimize data processing at the raw level?<\/p>\n<p>Yes! The answer is Dask Bags.<\/p>\n<p><strong>What are Dask Bags?<\/strong><\/p>\n<p><code>Dask.bag<\/code> is a high-level Dask collection used as an alternative to regular python lists, etc. The main difference is that Dask Bags are lazy and distributed.<\/p>\n<p>Dask Bag implements operations like map, filter, fold, and groupby on collections of generic Python objects. We prefer Dask bags here because they perform these operations lazily and in parallel.<\/p>\n<p><strong>What are the advantages of using Dask bags?<\/strong><\/p>\n<ol>\n<li>It lets you process large volumes of data in a small space, just like <code>toolz<\/code>.<\/li>\n<li>Dask bags follow parallel computing. The data is split up, allowing multiple cores or machines to execute in parallel.<\/li>\n<li>The execution part usually consists of running many iterations. In these iterations, data is processed lazily in the case of Dask bags. It allows for smooth execution. 
<\/li>\n<\/ol>\n<p>Because of the above points, Dask bags are often used on unstructured or semi-structured data like text data, log files, JSON records, etc.<\/p>\n<h3>How to create Dask Bags?<\/h3>\n<p>Dask provides different ways to create a bag from various python objects. Let&#8217;s look at each method with an example.<\/p>\n<p>Method 1. Create a bag from a sequence:<\/p>\n<p>You can create a dask Bag from a Python sequence using the <code>dask.bag.from_sequence()<\/code> function. The parameters are:<\/p>\n<p><code>seq<\/code>: The sequence of elements you wish to input<\/p>\n<p><code>partition_size<\/code>: An integer to denote the size of each partition<\/p>\n<p>The below example shows how to create a bag from a list. After creating it, you can perform a wide variety of functions on the bag. For example, the <code>visualize()<\/code> function returns a dot graph to represent the bag.<\/p>\n<pre><code class=\"language-python\">bag_1 = dask.bag.from_sequence(['Haritha', 'keerthi', 'Newton','Swetha','Sinduja'], partition_size=2)\nbag_1.visualize()\n<\/code><\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.localhost:8080\/wp-content\/uploads\/2020\/11\/output_62_0-min.png\" alt=\"\" width=\"275\" height=\"59\" class=\"alignnone size-full wp-image-4910\" \/><\/p>\n<p>Method 2. Create a bag from dask Delayed objects:<\/p>\n<p>You can create a dask Bag from dask Delayed objects using the <code>dask.bag.from_delayed()<\/code> function. Its parameter is <code>values<\/code>: the list of dask Delayed objects you wish to input.<\/p>\n<pre><code class=\"language-python\"># Creating dask delayed objects\n# (load_sequence_from_file and filenames are assumed to be defined)\nx, y, z = [delayed(load_sequence_from_file)(fn) for fn in filenames]\n\n# Creating a bag using from_delayed()\nb = dask.bag.from_delayed([x, y, z])\n<\/code><\/pre>\n<p>Method 3. Create a bag from text files:<\/p>\n<p>You can create a dask Bag from a text file using the <code>dask.bag.read_text()<\/code> function. The main parameters are:<\/p>\n<p><code>urlpath<\/code>: You can pass the path of the desired text file here.<\/p>\n<p><code>blocksize<\/code>: In case the files are large, you can provide an option to cut them into chunks using this parameter.<\/p>\n<p><code>collection<\/code>: It is a boolean value parameter. The function will return <code>dask.bag<\/code> if True. Otherwise, it will return a list of delayed values.<\/p>\n<p><code>include_path<\/code>: It is again a boolean parameter that decides whether or not to include the path in the bag. If True, elements are tuples of (line, path). By default, it is set to False.<\/p>\n<p>The below example shows how to create a bag from text files.<\/p>\n<pre><code class=\"language-python\">from dask.bag import read_text\n\nb = read_text('myfiles.1.txt')\nb = read_text('myfiles.*.txt')\n\n# Parallelize a large file by providing the number of uncompressed bytes to load into each partition\nb = read_text('largefile.txt', blocksize='10MB')\n\n# Get file paths of the bag by setting include_path=True\nb = read_text('myfiles.*.txt', include_path=True)\n<\/code><\/pre>\n<p>Method 4. Create a Dask bag from a url:<\/p>\n<p>You can create a dask Bag from a URL using the <code>dask.bag.from_url()<\/code> function. You just need to input the url path; no other parameter is needed.<\/p>\n<p>The below examples show how to create a bag from a url.<\/p>\n<pre><code class=\"language-python\">a = dask.bag.from_url('http:\/\/raw.githubusercontent.com\/dask\/dask\/master\/README.rst')\na.npartitions\n<\/code><\/pre>\n<pre><code class=\"language-python\">b = dask.bag.from_url(['http:\/\/github.com', 'http:\/\/google.com'])\nb.npartitions\n<\/code><\/pre>\n<h2 id=\"how-to-use-dask-bag-for-various-operations\">How to use Dask Bag for various operations?<\/h2>\n<p>The previous section told us the different ways of creating dask bags. 
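<\/p>\n<p>As a quick, runnable recap of the <code>from_sequence()<\/code> route (a sketch, assuming dask and its bag dependencies are installed), the snippet below creates a bag, checks its partitioning, and materializes a mapped result:<\/p>\n<pre><code class=\"language-python\">import dask.bag as db

# 10 elements with 2 per partition gives 5 partitions
b = db.from_sequence(range(10), partition_size=2)
print(b.npartitions)   # prints 5

# Bags are lazy: map() only builds a task graph,
# compute() materializes it as a plain python list
squares = b.map(lambda x: x * x).compute()
print(squares[:3])     # prints [0, 1, 4]
<\/code><\/pre>\n<p>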
Now that you are familiar with the idea, let&#8217;s see how to perform various processing operations.<\/p>\n<p>For our purpose, let&#8217;s create a dask bag using the <code>make_people()<\/code> function available in <code>dask.datasets<\/code>. The <code>make_people()<\/code> function makes a Dask Bag with dictionary records of randomly generated people. To do this, it requires the <code>mimesis<\/code> library to generate records. So, you have to install that too.<\/p>\n<pre><code class=\"language-python\">!pip install mimesis\n!pip install dask==1.0.0 distributed'&gt;=1.21.6,&lt;2.0.0'\nimport dask\nimport json\nimport os\n\n# Create the data\/ directory for the output files\nos.makedirs('data', exist_ok=True)\n\nmy_bag = dask.datasets.make_people()\nmy_bag\n<\/code><\/pre>\n<pre><code>dask.bag\n<\/code><\/pre>\n<p>The above code has successfully created a dask bag <code>my_bag<\/code> that stores information. You can also see that the number of partitions is 10. Sometimes, you may need to write the data to disk.<\/p>\n<h3>How to write the data in <code>my_bag<\/code> (of 10 partitions) into 10 JSON files and store them?<\/h3>\n<p>In situations like these, <code>dask.bag.map()<\/code> is pretty useful.<br \/>\nThe syntax is: <code>bag.map(func, *args, **kwargs)<\/code><\/p>\n<p>It is used to apply a function elementwise across one or more bags. In our case, the function to be called is <code>json.dumps<\/code>. This is responsible for writing data into JSON format files. So, provide <code>json.dumps<\/code> as input to the <code>map()<\/code> function as shown below.<\/p>\n<pre><code class=\"language-python\">my_bag.map(json.dumps).to_textfiles('data\/*.json')\n<\/code><\/pre>\n<pre><code>['data\/0.json',\n 'data\/1.json',\n 'data\/2.json',\n 'data\/3.json',\n 'data\/4.json',\n 'data\/5.json',\n 'data\/6.json',\n 'data\/7.json',\n 'data\/8.json',\n 'data\/9.json']\n<\/code><\/pre>\n<p>Yay! That was successful. 
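<\/p>\n<p>The same pattern in a fully self-contained sketch, writing to a temporary directory instead of the <code>data<\/code> folder so the snippet can run anywhere (the toy records here are made up, not produced by <code>make_people()<\/code>):<\/p>\n<pre><code class=\"language-python\">import json
import os
import tempfile
import dask.bag as db

# Two toy records spread over two partitions
records = db.from_sequence([{'age': 30}, {'age': 65}], npartitions=2)

# to_textfiles() writes one output file per partition
# and returns the list of file names
outdir = tempfile.mkdtemp()
paths = records.map(json.dumps).to_textfiles(os.path.join(outdir, '*.json'))
print(len(paths))   # prints 2
<\/code><\/pre>\n<p>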
Now, as you might guess, a dask bag is also a lazy collection. So, if you want to see or compute the actual data, you have to call <code>take()<\/code> or <code>compute()<\/code>.<\/p>\n<p>For using the <code>take()<\/code> function, you need to provide an input <code>k<\/code>, which denotes that the first <code>k<\/code> elements should be taken.<\/p>\n<pre><code class=\"language-python\">my_bag.take(3)\n<\/code><\/pre>\n<pre><code>({'address': {'address': '812 Lakeshore Cove', 'city': 'Downers Grove'},\n  'age': 63,\n  'credit-card': {'expiration-date': '07\/25', 'number': '3749 138185 40967'},\n  'name': ('Jed', 'Munoz'),\n  'occupation': 'Clergyman',\n  'telephone': '+1-(656)-064-7533'},\n {'address': {'address': '1067 Colby Turnpike', 'city': 'Huntington Beach'},\n  'age': 62,\n  'credit-card': {'expiration-date': '01\/17', 'number': '4391 0642 7046 4592'},\n  'name': ('Emilio', 'Vega'),\n  'occupation': 'Sound Engineer',\n  'telephone': '829-959-9408'},\n {'address': {'address': '572 Boardman Route', 'city': 'Lewiston'},\n  'age': 28,\n  'credit-card': {'expiration-date': '07\/17', 'number': '4521 0738 3441 8096'},\n  'name': ('Lakia', 'Elliott'),\n  'occupation': 'Clairvoyant',\n  'telephone': '684-025-2843'})\n<\/code><\/pre>\n<p>You can see the first 3 records printed in the above output.<\/p>\n<p>Now, let&#8217;s move on to some processing code. For any given data, we often perform filter operations based on certain conditions. Dask bags provide a ready-made <code>filter()<\/code> function especially for this.<\/p>\n<p>Let&#8217;s say from the <code>my_bag<\/code> collection, you want to filter the people whose age is greater than 60.<br \/>\nFor this, you need to write a predicate function that checks the age of each record. 
This has to be provided as input to the <code>dask.bag.filter()<\/code> function.<\/p>\n<pre><code class=\"language-python\">my_bag.filter(lambda record: record['age'] &gt; 60).take(4)\n<\/code><\/pre>\n<pre><code>({'address': {'address': '812 Lakeshore Cove', 'city': 'Downers Grove'},\n  'age': 63,\n  'credit-card': {'expiration-date': '07\/25', 'number': '3749 138185 40967'},\n  'name': ('Jed', 'Munoz'),\n  'occupation': 'Clergyman',\n  'telephone': '+1-(656)-064-7533'},\n {'address': {'address': '1067 Colby Turnpike', 'city': 'Huntington Beach'},\n  'age': 62,\n  'credit-card': {'expiration-date': '01\/17', 'number': '4391 0642 7046 4592'},\n  'name': ('Emilio', 'Vega'),\n  'occupation': 'Sound Engineer',\n  'telephone': '829-959-9408'},\n {'address': {'address': '480 Rotteck Cove', 'city': 'Havelock'},\n  'age': 66,\n  'credit-card': {'expiration-date': '11\/20', 'number': '2338 5735 7231 3240'},\n  'name': ('Dewey', 'Ruiz'),\n  'occupation': 'Green Keeper',\n  'telephone': '1-445-365-1344'},\n {'address': {'address': '187 Greenwich Plaza', 'city': 'Denver'},\n  'age': 63,\n  'credit-card': {'expiration-date': '02\/20', 'number': '4879 9327 9343 8130'},\n  'name': ('Charley', 'Woods'),\n  'occupation': 'Quarry Worker',\n  'telephone': '+1-(606)-335-1595'})\n<\/code><\/pre>\n<p>The earlier discussed <code>map()<\/code> function can also be used to extract specific information. Let&#8217;s say we want to know only the occupations people have, for analysis. You can select the occupations alone and save them in a new bag as shown below.<\/p>\n<pre><code class=\"language-python\">bag_occupation=my_bag.map(lambda record: record['occupation'])\nbag_occupation.take(6)\n<\/code><\/pre>\n<pre><code>('Clergyman',\n 'Sound Engineer',\n 'Clairvoyant',\n 'Agent',\n 'Representative',\n 'Ornamental')\n<\/code><\/pre>\n<p>I have printed the first 6 values stored in the processed bag above. 
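<\/p>\n<p>Besides <code>map()<\/code> and <code>filter()<\/code>, bags also offer small aggregations such as <code>distinct()<\/code>. A standalone sketch on toy data (not the <code>my_bag<\/code> collection):<\/p>\n<pre><code class=\"language-python\">import dask.bag as db

occupations = db.from_sequence(
    ['Clergyman', 'Agent', 'Clergyman', 'Clairvoyant'])

# distinct() drops duplicates; count() stays lazy until compute()
n_unique = occupations.distinct().count().compute()
print(n_unique)   # prints 3
<\/code><\/pre>\n<p>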
What if you want to know how many values there are in <code>bag_occupation<\/code>?<\/p>\n<p>Your first thought would be to do <code>bag_occupation.count()<\/code>. But remember, you won&#8217;t get any result, as <code>dask.bag<\/code> is lazy. So, make sure to call <code>compute()<\/code> at the end.<\/p>\n<pre><code class=\"language-python\"># Computing the number of values stored\nbag_occupation.count().compute()\n<\/code><\/pre>\n<pre><code>10000\n<\/code><\/pre>\n<p>Another important function is <strong><code>dask.bag.groupby()<\/code><\/strong>.<br \/>\nThis function groups a collection by a key function. Below is a simple example where we group even and odd numbers.<\/p>\n<pre><code class=\"language-python\">!pip install partd\nb = dask.bag.from_sequence(range(10))\niseven = lambda x: x % 2 == 0\nb.groupby(iseven).compute()\n<\/code><\/pre>\n<pre><code>[(False, [1, 3, 5, 7, 9]), (True, [0, 2, 4, 6, 8])]\n<\/code><\/pre>\n<p>It&#8217;s also possible to perform multiple data processing operations, like filtering and mapping, together in one step. This is called chain computation. You can perform each call after the others and finally call the <code>compute()<\/code> function. This will save memory and time. The below code is an example of chain computation on the <code>my_bag<\/code> collection.<\/p>\n<pre><code class=\"language-python\">result = (my_bag.filter(lambda record: record['age'] &gt; 60)\n           .map(lambda record: record['occupation'])\n           .frequencies(sort=True)\n           .topk(10, key=1))\nresult.compute()\n<\/code><\/pre>\n<pre><code>[('Councillor', 6),\n ('Shop Keeper', 5),\n ('Taxi Controller', 5),\n ('Horse Riding Instructor', 4),\n ('Press Officer', 4),\n ('Nursing Manager', 4),\n ('Systems Engineer', 4),\n ('Medal Dealer', 4),\n ('Storeman', 4),\n ('Architect', 4)]\n<\/code><\/pre>\n<p>Yay! 
We performed all the processing in a single step.<\/p>\n<h3>Converting a Dask Bag to other forms<\/h3>\n<p>Often, after processing is complete, you have to convert Dask bags into other forms: generally Dask dataframes, Dask delayed objects, text files, and so on. This section will brief you on these methods with examples.<\/p>\n<p><strong>1. How to transform a Dask Bag into a Dask Dataframe?<\/strong><\/p>\n<p>To create a Dask Dataframe from a Dask Bag, you can use the <strong><code>dask.bag.to_dataframe()<\/code><\/strong> function. The bag should contain tuples, dict records, or scalars. The resulting index will not be particularly meaningful; use <code>reindex<\/code> afterward if necessary.<\/p>\n<pre><code class=\"language-python\"># Converting dask bag into dask dataframe\ndataframe=my_bag.to_dataframe()\ndataframe.compute()\n<\/code><\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/table2-min.png\" alt=\"data set\" width=\"1062\" height=\"338\" class=\"alignnone size-full wp-image-4867\" srcset=\"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/table2-min.png 1062w, https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/table2-min-300x95.png 300w, https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/table2-min-1024x326.png 1024w, https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/table2-min-768x244.png 768w\" sizes=\"(max-width: 1062px) 100vw, 1062px\" \/><\/p>\n<p><strong>2. How to create <code>Dask.Delayed<\/code> objects from a Dask bag<\/strong><\/p>\n<p>You can convert a <code>dask.bag<\/code> into a list of <code>dask.delayed<\/code> objects, one per partition, using the <code>dask.bag.to_delayed()<\/code> function. The main parameter of this function is <code>optimize_graph<\/code>, a boolean. If it is set to <code>True<\/code>, the task graph will be optimized before it is converted into delayed objects.<\/p>\n<pre><code class=\"language-python\">my_bag.to_delayed(optimize_graph=True)\n<\/code><\/pre>\n<pre><code>[Delayed(('mimesis-04d0f03e80a0b650adc596eba7851142', 0)),\n Delayed(('mimesis-04d0f03e80a0b650adc596eba7851142', 1)),\n Delayed(('mimesis-04d0f03e80a0b650adc596eba7851142', 2)),\n Delayed(('mimesis-04d0f03e80a0b650adc596eba7851142', 3)),\n Delayed(('mimesis-04d0f03e80a0b650adc596eba7851142', 4)),\n Delayed(('mimesis-04d0f03e80a0b650adc596eba7851142', 5)),\n Delayed(('mimesis-04d0f03e80a0b650adc596eba7851142', 6)),\n Delayed(('mimesis-04d0f03e80a0b650adc596eba7851142', 7)),\n Delayed(('mimesis-04d0f03e80a0b650adc596eba7851142', 8)),\n Delayed(('mimesis-04d0f03e80a0b650adc596eba7851142', 9))]\n<\/code><\/pre>\n<p><strong>3. How to convert a Dask bag to text files<\/strong><\/p>\n<p>You can write a Dask bag to disk using the <code>dask.bag.to_textfiles()<\/code> function. As there are 10 partitions, 10 text files will be written. You have to provide the path or directory as input.<\/p>\n<pre><code class=\"language-python\">my_bag.to_textfiles('\/content\/textfile')\n<\/code><\/pre>\n<p>You have now learned how to create, operate on, and transform Dask bags. Next comes the most important concept in Dask.<\/p>\n<h2 id=\"distributed-computing-with-dask-hands-on-example\">Distributed computing with Dask &#8211; Hands on Example<\/h2>\n<p>In this section, we shall load a csv file and perform the same task using pandas and Dask to compare performance. For this, first load <code>Client<\/code> from <code>dask.distributed<\/code>.<\/p>\n<p><code>Dask.distributed<\/code> will store the results of tasks in the distributed memory of the worker nodes. The central scheduler will track all the data on the cluster. 
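<\/p>\n<p>This submit-and-hold behavior can be seen with a tiny, self-contained example. Note that this is an illustrative sketch only (it assumes the <code>distributed<\/code> package is installed); the article&#8217;s own client setup follows below.<\/p>\n

```python
from dask.distributed import Client

# In-process cluster purely for illustration
client = Client(processes=False)

# submit() returns a Future immediately; its result stays in worker memory
future = client.submit(sum, [1, 2, 3])

# The data moves back to the local process only when gathered
result = client.gather(future)
print(result)  # 6

client.close()
```

\n<p>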
Once a result is no longer needed, it is often erased from memory to free up space.<\/p>\n<h3>What is a Dask Client?<\/h3>\n<p>The <code>Client<\/code> is the primary entry point for users of <code>dask.distributed<\/code>.<\/p>\n<p>After we set up a cluster, we initialize a Client by pointing it to the address of a Scheduler. The Client registers itself as the default Dask scheduler, and so runs all dask collections like <code>dask.array<\/code>, <code>dask.bag<\/code>, <code>dask.dataframe<\/code> and <code>dask.delayed<\/code>.<\/p>\n<pre><code class=\"language-python\"># Import dask.distributed.Client and pandas\nfrom dask.distributed import Client\nimport pandas as pd\nimport time\n\n# Initializing a client\nclient = Client(processes=False)\nclient\n<\/code><\/pre>\n<table style=\"border: 2px solid white\">\n<tr>\n<td style=\"vertical-align: top;border: 0px solid white\">\n<h3>Client<\/h3>\n<ul>\n<li><b>Scheduler: <\/b>inproc:\/\/172.28.0.2\/1245\/1<\/li>\n<\/ul>\n<\/td>\n<td style=\"vertical-align: top;border: 0px solid white\">\n<h3>Cluster<\/h3>\n<ul>\n<li><b>Workers: <\/b>1<\/li>\n<li><b>Cores: <\/b>2<\/li>\n<li><b>Memory: <\/b>13.65 GB<\/li>\n<\/ul>\n<\/td>\n<\/tr>\n<\/table>\n<p>Now, let&#8217;s perform an operation using a pandas dataframe, then do the same using <code>dask.distributed<\/code> and compare the time taken.<\/p>\n<p>First, read a csv file <a href=\"https:\/\/drive.google.com\/file\/d\/16yle7FZBWt4PccZRTZ-dd89ZgDUZvIO6\/view?usp=sharing\" rel=\"noopener noreferrer\" target=\"_blank\">(download from here)<\/a> into a normal pandas dataframe. Clean the data and set the index as per requirement. 
The below code prints the processed pandas dataframe.<\/p>\n<pre><code class=\"language-python\"># Read csv  file into a pandas dataframe and process it\ndf = pd.read_csv('forecast_pivoted.csv')\ndf = df.drop('Unnamed: 0', axis=1)\ndf = df.set_index('itm_nb')\ndf.head()\n<\/code><\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/table3-min.png\" alt=\"table\" width=\"1000\" height=\"200\" class=\"alignnone size-full wp-image-4869\" srcset=\"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/table3-min.png 1000w, https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/table3-min-300x60.png 300w, https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/table3-min-768x154.png 768w\" sizes=\"(max-width: 1000px) 100vw, 1000px\" \/><\/p>\n<pre><code class=\"language-python\">dates = df.columns\nfor date in dates:\n  print(date)\n<\/code><\/pre>\n<p>Now, say we need to apply a particular function to the dataset. In the below example, for each index and date column, the function selects the values of that date column. We shall first execute this using pandas and record the time taken using <code>%%time<\/code>.<\/p>\n<pre><code class=\"language-python\"># A function to perform the desired operation\ndef do_operation(df, index, date):\n    # Select the given date column\n    new_df = df[date]\n    return new_df\n<\/code><\/pre>\n<p>Iterate through the indices of the dataframe and call the function. This is the execution in pandas.<\/p>\n<pre><code class=\"language-python\">%%time\n# Loop through the indices and columns and call the function.\nfor index in df.index:\n    for date in dates:\n        do_operation(df, index, date)\n<\/code><\/pre>\n<pre><code>CPU times: user 9.85 s, sys: 456 \u00b5s, total: 9.85 s\nWall time: 9.79 s\n<\/code><\/pre>\n<p>Observe the time taken for the above process. Now let&#8217;s see how to implement this in Dask and record the time. 
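<\/p>\n<p>A side note on timing: <code>%%time<\/code> is a notebook-only magic. In a plain Python script, the same measurement can be taken with the <code>time<\/code> module we imported earlier. A minimal sketch, using a stand-in workload in place of the loop above:<\/p>\n

```python
import time

start = time.perf_counter()

# Stand-in workload; replace with the loop being measured
total = 0
for n in range(1_000_000):
    total += n

elapsed = time.perf_counter() - start
print(f"took {elapsed:.3f} s")
```

\n<p>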
To reduce the time, we will use the Dask client to parallelize the workload.<\/p>\n<p>We have already imported and initialized a Client. Now, distribute the contents of the dataframe you need to process using <code>client.scatter()<\/code>.<\/p>\n<p>Calling the <code>client.scatter()<\/code> function creates a future. What does this function do?<\/p>\n<p>Basically, it moves data from the local client process into the workers of the distributed scheduler.<\/p>\n<p>Next, you can start looping over the indices of the dataframe. Here, instead of simply calling the function, we will use the <code>client.submit()<\/code> function. The <code>client.submit()<\/code> function is responsible for submitting a function application to the scheduler. You can pass it the defined function, the future, and other parameters.<\/p>\n<p>The processing is done. But how do we collect, or gather, the results?<\/p>\n<p>We have the <code>client.gather()<\/code> function for that. This function gathers futures from the distributed memory. It accepts a future or a nested container of futures, and the return type will match the input type. In the below example, we have passed the futures as input to this function.<\/p>\n<pre><code class=\"language-python\">%%time\n# Use Dask client to parallelize the workload.\n\n# Create a futures array to store the futures returned by Dask\nfutures = []\n\n# Scatter the dataframe beforehand\ndf_future = client.scatter(df)\n\nfor index in df.index:\n    for date in dates:\n        # Submit tasks to the dask client in parallel\n        future = client.submit(do_operation, df_future, index, date)\n        # Store the returned future in futures list\n        futures.append(future)\n\n# Gather the results.\n_ = client.gather(futures)\n<\/code><\/pre>\n<p>Observe the time taken. Dask will significantly speed up your program.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Dask provides efficient parallelization for data analytics in python. 
Dask Dataframes allows you to work with large datasets for both data manipulation and building ML models with only minimal code changes. It is open source and works well with python libraries like NumPy, scikit-learn, etc. Let&#8217;s understand how to use Dask with hands-on examples. Dask [&hellip;]<\/p>\n","protected":false},"author":14,"featured_media":4924,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"site-sidebar-layout":"default","site-content-layout":"default","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"default","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[21],"tags":[1912,1913,22],"class_list":["post-4863","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-python","tag-dask","tag-parallel-computing","tag-python"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Dask - How to handle large dataframes in python using parallel computing - machinelearningplus<\/title>\n<meta name=\"description\" content=\"Learn how to use Dask to handle large datasets in Python using parallel computing. 
Covers Dask DataFrames, delayed execution, and integration with NumPy and scikit-learn.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/machinelearningplus.com\/python\/dask-tutorial\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Dask - How to handle large dataframes in python using parallel computing - machinelearningplus\" \/>\n<meta property=\"og:description\" content=\"Learn how to use Dask to handle large datasets in Python using parallel computing. Covers Dask DataFrames, delayed execution, and integration with NumPy and scikit-learn.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/machinelearningplus.com\/python\/dask-tutorial\/\" \/>\n<meta property=\"og:site_name\" content=\"machinelearningplus\" \/>\n<meta property=\"article:published_time\" content=\"2020-11-06T13:20:02+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-04-04T07:43:19+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/dask-min.png\" \/>\n\t<meta property=\"og:image:width\" content=\"560\" \/>\n\t<meta property=\"og:image:height\" content=\"315\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Shrivarsheni\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Shrivarsheni\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"21 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"TechArticle\",\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/python\\\/dask-tutorial\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/python\\\/dask-tutorial\\\/\"},\"author\":{\"name\":\"Shrivarsheni\",\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/#\\\/schema\\\/person\\\/31d782fda106181b2c88151a1d50a90d\"},\"headline\":\"Dask &#8211; How to handle large dataframes in python using parallel computing\",\"datePublished\":\"2020-11-06T13:20:02+00:00\",\"dateModified\":\"2022-04-04T07:43:19+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/python\\\/dask-tutorial\\\/\"},\"wordCount\":2951,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/python\\\/dask-tutorial\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/machinelearningplus.com\\\/wp-content\\\/uploads\\\/2020\\\/11\\\/dask-min.png\",\"keywords\":[\"dask\",\"parallel computing\",\"Python\"],\"articleSection\":[\"Python\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/machinelearningplus.com\\\/python\\\/dask-tutorial\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/python\\\/dask-tutorial\\\/\",\"url\":\"https:\\\/\\\/machinelearningplus.com\\\/python\\\/dask-tutorial\\\/\",\"name\":\"Dask - How to handle large dataframes in python using parallel computing - 
machinelearningplus\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/python\\\/dask-tutorial\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/python\\\/dask-tutorial\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/machinelearningplus.com\\\/wp-content\\\/uploads\\\/2020\\\/11\\\/dask-min.png\",\"datePublished\":\"2020-11-06T13:20:02+00:00\",\"dateModified\":\"2022-04-04T07:43:19+00:00\",\"description\":\"Learn how to use Dask to handle large datasets in Python using parallel computing. Covers Dask DataFrames, delayed execution, and integration with NumPy and scikit-learn.\",\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/machinelearningplus.com\\\/python\\\/dask-tutorial\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/python\\\/dask-tutorial\\\/#primaryimage\",\"url\":\"https:\\\/\\\/machinelearningplus.com\\\/wp-content\\\/uploads\\\/2020\\\/11\\\/dask-min.png\",\"contentUrl\":\"https:\\\/\\\/machinelearningplus.com\\\/wp-content\\\/uploads\\\/2020\\\/11\\\/dask-min.png\",\"width\":560,\"height\":315,\"caption\":\"dask parallel computing in python\"},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/#website\",\"url\":\"https:\\\/\\\/machinelearningplus.com\\\/\",\"name\":\"machinelearningplus\",\"description\":\"Learn Data Science (AI \\\/ ML) 
Online\",\"publisher\":{\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/machinelearningplus.com\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/#organization\",\"name\":\"machinelearningplus\",\"url\":\"https:\\\/\\\/machinelearningplus.com\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/machinelearningplus.com\\\/wp-content\\\/uploads\\\/2022\\\/05\\\/MachineLearningplus-logo.svg\",\"contentUrl\":\"https:\\\/\\\/machinelearningplus.com\\\/wp-content\\\/uploads\\\/2022\\\/05\\\/MachineLearningplus-logo.svg\",\"width\":348,\"height\":36,\"caption\":\"machinelearningplus\"},\"image\":{\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/#\\\/schema\\\/logo\\\/image\\\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/#\\\/schema\\\/person\\\/31d782fda106181b2c88151a1d50a90d\",\"name\":\"Shrivarsheni\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/machinelearningplus.com\\\/wp-content\\\/litespeed\\\/avatar\\\/0d1ccfbd64a15f0b94a69cc383b21358.jpg?ver=1776971399\",\"url\":\"https:\\\/\\\/machinelearningplus.com\\\/wp-content\\\/litespeed\\\/avatar\\\/0d1ccfbd64a15f0b94a69cc383b21358.jpg?ver=1776971399\",\"contentUrl\":\"https:\\\/\\\/machinelearningplus.com\\\/wp-content\\\/litespeed\\\/avatar\\\/0d1ccfbd64a15f0b94a69cc383b21358.jpg?ver=1776971399\",\"caption\":\"Shrivarsheni\"},\"url\":\"https:\\\/\\\/machinelearningplus.com\\\/author\\\/shrivarsheni\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Dask - How to handle large dataframes in python using parallel computing - machinelearningplus","description":"Learn how to use Dask to handle large datasets in Python using parallel computing. Covers Dask DataFrames, delayed execution, and integration with NumPy and scikit-learn.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/machinelearningplus.com\/python\/dask-tutorial\/","og_locale":"en_US","og_type":"article","og_title":"Dask - How to handle large dataframes in python using parallel computing - machinelearningplus","og_description":"Learn how to use Dask to handle large datasets in Python using parallel computing. Covers Dask DataFrames, delayed execution, and integration with NumPy and scikit-learn.","og_url":"https:\/\/machinelearningplus.com\/python\/dask-tutorial\/","og_site_name":"machinelearningplus","article_published_time":"2020-11-06T13:20:02+00:00","article_modified_time":"2022-04-04T07:43:19+00:00","og_image":[{"width":560,"height":315,"url":"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/dask-min.png","type":"image\/png"}],"author":"Shrivarsheni","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Shrivarsheni","Est. 
reading time":"21 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"TechArticle","@id":"https:\/\/machinelearningplus.com\/python\/dask-tutorial\/#article","isPartOf":{"@id":"https:\/\/machinelearningplus.com\/python\/dask-tutorial\/"},"author":{"name":"Shrivarsheni","@id":"https:\/\/machinelearningplus.com\/#\/schema\/person\/31d782fda106181b2c88151a1d50a90d"},"headline":"Dask &#8211; How to handle large dataframes in python using parallel computing","datePublished":"2020-11-06T13:20:02+00:00","dateModified":"2022-04-04T07:43:19+00:00","mainEntityOfPage":{"@id":"https:\/\/machinelearningplus.com\/python\/dask-tutorial\/"},"wordCount":2951,"commentCount":0,"publisher":{"@id":"https:\/\/machinelearningplus.com\/#organization"},"image":{"@id":"https:\/\/machinelearningplus.com\/python\/dask-tutorial\/#primaryimage"},"thumbnailUrl":"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/dask-min.png","keywords":["dask","parallel computing","Python"],"articleSection":["Python"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/machinelearningplus.com\/python\/dask-tutorial\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/machinelearningplus.com\/python\/dask-tutorial\/","url":"https:\/\/machinelearningplus.com\/python\/dask-tutorial\/","name":"Dask - How to handle large dataframes in python using parallel computing - machinelearningplus","isPartOf":{"@id":"https:\/\/machinelearningplus.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/machinelearningplus.com\/python\/dask-tutorial\/#primaryimage"},"image":{"@id":"https:\/\/machinelearningplus.com\/python\/dask-tutorial\/#primaryimage"},"thumbnailUrl":"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/dask-min.png","datePublished":"2020-11-06T13:20:02+00:00","dateModified":"2022-04-04T07:43:19+00:00","description":"Learn how to use Dask to handle large datasets in Python using parallel computing. 
Covers Dask DataFrames, delayed execution, and integration with NumPy and scikit-learn.","inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/machinelearningplus.com\/python\/dask-tutorial\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/machinelearningplus.com\/python\/dask-tutorial\/#primaryimage","url":"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/dask-min.png","contentUrl":"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2020\/11\/dask-min.png","width":560,"height":315,"caption":"dask parallel computing in python"},{"@type":"WebSite","@id":"https:\/\/machinelearningplus.com\/#website","url":"https:\/\/machinelearningplus.com\/","name":"machinelearningplus","description":"Learn Data Science (AI \/ ML) Online","publisher":{"@id":"https:\/\/machinelearningplus.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/machinelearningplus.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/machinelearningplus.com\/#organization","name":"machinelearningplus","url":"https:\/\/machinelearningplus.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/machinelearningplus.com\/#\/schema\/logo\/image\/","url":"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2022\/05\/MachineLearningplus-logo.svg","contentUrl":"https:\/\/machinelearningplus.com\/wp-content\/uploads\/2022\/05\/MachineLearningplus-logo.svg","width":348,"height":36,"caption":"machinelearningplus"},"image":{"@id":"https:\/\/machinelearningplus.com\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/machinelearningplus.com\/#\/schema\/person\/31d782fda106181b2c88151a1d50a90d","name":"Shrivarsheni","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/machinelearningplus.co
m\/wp-content\/litespeed\/avatar\/0d1ccfbd64a15f0b94a69cc383b21358.jpg?ver=1776971399","url":"https:\/\/machinelearningplus.com\/wp-content\/litespeed\/avatar\/0d1ccfbd64a15f0b94a69cc383b21358.jpg?ver=1776971399","contentUrl":"https:\/\/machinelearningplus.com\/wp-content\/litespeed\/avatar\/0d1ccfbd64a15f0b94a69cc383b21358.jpg?ver=1776971399","caption":"Shrivarsheni"},"url":"https:\/\/machinelearningplus.com\/author\/shrivarsheni\/"}]}},"_links":{"self":[{"href":"https:\/\/machinelearningplus.com\/wp-json\/wp\/v2\/posts\/4863","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/machinelearningplus.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/machinelearningplus.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/machinelearningplus.com\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/machinelearningplus.com\/wp-json\/wp\/v2\/comments?post=4863"}],"version-history":[{"count":0,"href":"https:\/\/machinelearningplus.com\/wp-json\/wp\/v2\/posts\/4863\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/machinelearningplus.com\/wp-json\/wp\/v2\/media\/4924"}],"wp:attachment":[{"href":"https:\/\/machinelearningplus.com\/wp-json\/wp\/v2\/media?parent=4863"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/machinelearningplus.com\/wp-json\/wp\/v2\/categories?post=4863"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/machinelearningplus.com\/wp-json\/wp\/v2\/tags?post=4863"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}