This article is about how to use a Glue Crawler in conjunction with Matillion ETL for Amazon Redshift to access Parquet files. There are a number of ways to create Parquet data, which is a common output from EMR clusters and other components in the Hadoop ecosystem. I'm working with a Civil Aviation dataset and converted our standard gzipped csv files into Parquet format using Python and Apache's PyArrow package (see here for more details on using PyArrow).

Here we rely on Amazon Redshift's Spectrum feature, which allows Matillion ETL to query Parquet files in S3 directly once the crawler has identified and cataloged the files' underlying data structure. Using this approach, the crawler creates the table entry in the external catalog on the user's behalf after it determines the column data types. Once that's done, table data can be accessed similarly to the User-Defined External Table approach above, the only difference being that the data types were defined by the crawler rather than the user.

Now let's look at how to configure the various components required to make this work.
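For reference, the gzipped-csv-to-Parquet conversion described above only takes a few lines of PyArrow. This is a minimal sketch rather than the exact script used for the Civil Aviation data, and the file names are placeholders:

```python
# Minimal sketch: convert a gzipped csv file to Parquet with PyArrow.
# File names are placeholders, not the actual Civil Aviation files.
import pyarrow.csv as pv
import pyarrow.parquet as pq

# read_csv infers the gzip compression from the .gz extension
table = pv.read_csv("flights.csv.gz")

# write_table produces a single Parquet file (Snappy-compressed by default)
pq.write_table(table, "flights.parquet")
```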
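The first of those components is the crawler itself. It can be set up through the AWS Glue console, but as a rough illustration of what the configuration involves, here is a boto3 sketch; the crawler name, IAM role, catalog database, and S3 path are all placeholder assumptions, not values from this walkthrough:

```python
# Hedged sketch: create and run a Glue Crawler over Parquet files in S3.
# Every name below is an illustrative placeholder.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="parquet-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # needs read access to the bucket
    DatabaseName="spectrum_db",  # external catalog database the table lands in
    Targets={"S3Targets": [{"Path": "s3://my-bucket/parquet/"}]},
)

# Running the crawler scans the files, infers the column data types,
# and writes the table definition into the Glue Data Catalog.
glue.start_crawler(Name="parquet-crawler")
```

Once the crawler has run, the table it created in the Glue Data Catalog is what Redshift Spectrum, and therefore Matillion ETL, queries against.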