MS SQL 2019 BDC – query data from an S3 bucket

 

As you probably know, Microsoft made a huge step forward with the MS SQL 2019 Big Data Cluster (BDC), which turns the classic OLTP-based SQL Server into a shiny new data hub with a lot of new possibilities. One of those features is the ability to query data directly from an AWS S3 bucket using T-SQL. Sounds good, right? To get this done we need an S3 bucket with files (I'll be using parquet files) and a running BDC. The tools I'll be using are Azure Data Studio and CMD. Let's get started!

First, we need to mount the S3 bucket to BDC. Run the code below in CMD:

set MOUNT_CREDENTIALS=fs.s3a.access.key=<access key>,fs.s3a.secret.key=<secret key>

azdata login -e https://<BDC ip address>:30080/

azdata bdc hdfs mount create --remote-uri s3a://<s3 bucket> --mount-path /<BDC path>

Let's walk through these commands. To set the credentials, replace <access key> with your S3 bucket access key and <secret key> with the corresponding secret key:

Set credentials

Now log in to BDC. Replace the IP address with your BDC IP and submit your login details:

Login to BDC using azdata

And the last step is to mount the S3 bucket to BDC. Replace <s3 bucket> with your S3 bucket/folder and <BDC path> with the path to hook it up to in your HDFS:


Mount S3 to HDFS

To check the status of your mount, you may run the command below:

azdata bdc hdfs mount status

Mount status of the S3 bucket

The state Ready means the bucket is mounted and ready to use.
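If you ever need to detach the bucket later, azdata also provides a matching delete command. A minimal sketch, assuming the same mount path used above:

azdata bdc hdfs mount delete --mount-path /<BDC path>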

Now, let's switch to Azure Data Studio (ADS). To verify that the mounted S3 bucket is available, we can check under the HDFS folder after connecting to BDC:

Verify S3 mounted on ADS

The next step is to create an external file format, an external data source and an external table so that we can finally query the data. To create the file format, execute this script:

CREATE EXTERNAL FILE FORMAT s3_parquet_format
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

I set the name of the format to s3_parquet_format, and since the files in the S3 bucket are parquet with snappy compression, the format type is PARQUET and the data compression is org.apache.hadoop.io.compress.SnappyCodec accordingly. Here is the list of all available formats.
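For comparison, if the bucket held delimited text (CSV) files instead of parquet, the file format would use DELIMITEDTEXT with a few format options. A minimal sketch – the name s3_csv_format, the delimiters and the header row setting are just assumptions for illustration:

CREATE EXTERNAL FILE FORMAT s3_csv_format
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',   -- column separator
        STRING_DELIMITER = '"',   -- text qualifier
        FIRST_ROW = 2             -- skip the header row
    )
);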

To create an external data source:

CREATE EXTERNAL DATA SOURCE SqlStoragePool
    WITH (LOCATION = 'sqlhdfs://controller-svc/default'); 
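As a quick sanity check, both objects we have created so far can be found in the catalog views:

SELECT name, location FROM sys.external_data_sources;
SELECT name, format_type FROM sys.external_file_formats;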

And the very last step is to create an external table. To do this, we need to know the schema of the parquet files:

CREATE EXTERNAL TABLE [S3_demo_data]
(
"column1" bigint,
"column2" int,
"column3" int,
"column4" int,
"column5" int,
"column6" int,
"column7" int,
"column8" int,
"column9" varchar(3),
"column10" varchar(3),
"column11" int,
"column12" varchar(3),
"column13" varchar(3),
"column14" varchar(3),
"column15" varchar(3),
"column16" varchar(10),
"column17" int,
<...>
"column_n" varchar(3)
)
WITH
(
    DATA_SOURCE = SqlStoragePool,
    LOCATION = '/demo/S3',
    FILE_FORMAT = s3_parquet_format
)
GO

Where LOCATION is the S3 bucket mount path on BDC and FILE_FORMAT is the s3_parquet_format we created earlier.
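To confirm the external table was registered with the expected location, you can query sys.external_tables:

SELECT name, location FROM sys.external_tables;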

Finally, let’s try to query the data:

Query data with T-SQL
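The query in the screenshot is plain T-SQL against the external table – anything along these lines works (the column names, of course, depend on your parquet schema):

SELECT TOP (10) *
FROM [S3_demo_data];

SELECT [column9], COUNT(*) AS row_count
FROM [S3_demo_data]
GROUP BY [column9];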

Hope this tip will be helpful. Thanks for reading!