| title | Virtualize CSV data from storage pool |
|---|---|
| subtitle | SQL Server Big Data Clusters |
| description | Steps detailing the create external table for virtualization of a CSV file in a Big Data Cluster |
| author | MikeRayMSFT |
| ms.author | mikeray |
| ms.reviewer | mikeray |
| ms.date | 04/24/2020 |
| ms.topic | conceptual |
| ms.prod | sql |
| ms.technology | polybase |
| monikerRange | >= sql-server-ver15 || = sqlallproducts-allversions |
| ms.metadata | seo-lt-2019 |
SQL Server Big Data Clusters can virtualize data from CSV files in HDFS. This process allows the data to stay in its original location, but can be queried from a SQL Server instance like any other table. This feature uses PolyBase connectors, and minimizes the need for ETL processes. For more information on data virtualization, see What is PolyBase?
In Azure Data Studio (ADS) connect to the SQL Server master instance of your Big Data Cluster. Once connected, expand the HDFS elements in the object explorer to locate the CSV file(s) you would like to data virtualize.
For the purposes of this tutorial, create a new directory named Data.
- Right-click on the HDFS root directory context menu.
- Click New directory.
- Name the new directory Data.
Upload sample data. For a simple walk through, you can use a sample csv data file. This article uses airline delay cause data from the US Department of Transportation. Download the raw data, and extract the data to your computer. Name the file airline_delay_causes.csv.
To upload the sample file after you extract it:
- In Azure Data Studio, right-click the new directory you created.
- Click Upload files.
Azure Data Studio uploads the files to HDFS on the Big Data Cluster.
The storage pool external data source is not created in a database by default in your Big Data Cluster. Before you can create the external table, create the default SqlStoragePool External Data Source in your target database with the following Transact-SQL query. Make sure you first change the context of the query to your target database.
-- Create the default storage pool source for SQL Big Data Cluster
IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'SqlStoragePool')
CREATE EXTERNAL DATA SOURCE SqlStoragePool
WITH (LOCATION = 'sqlhdfs://controller-svc/default');From ADS, right-click on the CSV file and select Create External Table From CSV File from the context menu. You can also create external tables from CSV files from a directory in HDFS if the files under the directory follow the same schema. This would allow the virtualization of the data at a directory level without the need to process individual files and get a joined result set over the combined data.Azure data studio guides you through the steps to create the external table.
Specify the database, the data source, a table name, the schema, and the name for the table's external file format.
Click Next.
Azure Data Studio provides a preview of the imported data.
Once done viewing the preview, click Next to continue
In the next window, you may modify the columns of the external table you intend to create. You are able to alter the column name, change the data type and allow for nullable rows.
After you verify the destination columns, click Next.
This step provides a summary of your selections. It provides the SQL Server name, database name, table name, table schema, and external table information. In this step, you have the option to generate a script or create a table. Generate Script creates a script in T-SQL to create the external data source. Create Table creates the external data source.
If you click Create Table, SQL Server creates the external table in the destination database.
If you click, Generate Script, you Azure Data Studio creates the T-SQL query for creating the external table.
Once created the table can now be queried directly using T-SQL from the SQL Server instance.
For more information on SQL Server Big Data Cluster and related scenarios, see What is SQL Server Big Data Cluster?.



