Thread: Columnar format export in Postgres

Columnar format export in Postgres

From
Sushrut Shivaswamy
Date:
Hey Postgres team,

I have been working on adding support for columnar format export to Postgres to speed up analytics queries.
I've created an extension that achieves this functionality here.

I'm looking to improve the performance of this extension to enable drop-in analytics support for Postgres. Some immediate improvements I have in mind are:
 - Reduce memory consumption when exporting table data to columnar format
 - Create a native planner / execution hook that can read columnar data with vectorised operations.
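For the planner / execution hook idea, the usual extension mechanism would presumably be `planner_hook` (or the executor hooks). A minimal sketch of installing such a hook, assuming a hypothetical `columnar_planner` and the PostgreSQL 13+ hook signature:

```c
#include "postgres.h"
#include "fmgr.h"
#include "optimizer/planner.h"

PG_MODULE_MAGIC;

static planner_hook_type prev_planner_hook = NULL;

/* Hypothetical hook: fall through to the standard planner unless the
 * query is one we know how to serve from columnar data. */
static PlannedStmt *
columnar_planner(Query *parse, const char *query_string,
                 int cursorOptions, ParamListInfo boundParams)
{
    /* TODO: detect eligible scans and substitute a vectorised plan. */
    if (prev_planner_hook)
        return prev_planner_hook(parse, query_string,
                                 cursorOptions, boundParams);
    return standard_planner(parse, query_string,
                            cursorOptions, boundParams);
}

void
_PG_init(void)
{
    prev_planner_hook = planner_hook;
    planner_hook = columnar_planner;
}
```

This is only the wiring; the interesting part (recognising eligible plans and emitting vectorised scan nodes) is the open problem.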

It would be very helpful if you could take a look and suggest improvements to the extension.
Hopefully, this extension can be shipped by default with postgres at some point in the future.

Thanks,
Sushrut

Re: Columnar format export in Postgres

From
Ranier Vilela
Date:
On Wed, Jun 12, 2024 at 13:56, Sushrut Shivaswamy <sushrut.shivaswamy@gmail.com> wrote:
> Hey Postgres team,
>
> I have been working on adding support for columnar format export to Postgres to speed up analytics queries.
> I've created an extension that achieves this functionality here.
>
> I'm looking to improve the performance of this extension to enable drop-in analytics support for Postgres. Some immediate improvements I have in mind are:
>  - Reduce memory consumption when exporting table data to columnar format
>  - Create a native planner / execution hook that can read columnar data with vectorised operations.
>
> It would be very helpful if you could take a look and suggest improvements to the extension.
> Hopefully, this extension can be shipped by default with postgres at some point in the future.
If you want to have any hope, the license must be BSD.
GPL is incompatible.

best regards,
Ranier Vilela

Re: Columnar format export in Postgres

From
Sutou Kouhei
Date:
Hi,

In <CAH5mb9_YW76_0xBU2T4C7HF33b+b2w3QBtV50_ZZMy8SA8ChjA@mail.gmail.com>
  "Columnar format export in Postgres" on Wed, 12 Jun 2024 22:26:30 +0530,
  Sushrut Shivaswamy <sushrut.shivaswamy@gmail.com> wrote:

> I have been working on adding support for columnar format export to
> Postgres to speed up analytics queries.

FYI: I'm proposing making COPY format extendable:

* https://www.postgresql.org/message-id/flat/20231204.153548.2126325458835528809.kou@clear-code.com
* https://commitfest.postgresql.org/48/4681/

If it's accepted, we can implement extensions for COPY
FORMAT arrow and COPY FORMAT parquet. With these extensions,
we can use file_fdw to read Apache Arrow and Apache Parquet
file because file_fdw is based on COPY FROM:
https://www.postgresql.org/docs/current/file-fdw.html
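To make the file_fdw point concrete: file_fdw already delegates parsing to COPY FROM, so today it reads COPY-compatible formats such as CSV. With extensible COPY formats, the same foreign table machinery could presumably name an extension-provided format; a sketch (file paths and table names are illustrative, and the `format 'arrow'` option is speculative):

```sql
CREATE EXTENSION file_fdw;
CREATE SERVER files FOREIGN DATA WRAPPER file_fdw;

-- Works today: file_fdw hands the file to COPY FROM's CSV parser.
CREATE FOREIGN TABLE measurements_csv (ts timestamptz, value float8)
    SERVER files
    OPTIONS (filename '/data/measurements.csv', format 'csv');

-- Speculative: with extensible COPY formats, an extension could register
-- an 'arrow' or 'parquet' format that file_fdw would pick up unchanged.
-- CREATE FOREIGN TABLE measurements_arrow (ts timestamptz, value float8)
--     SERVER files
--     OPTIONS (filename '/data/measurements.arrow', format 'arrow');
```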

If you're interested in this proposal, you can review the
latest proposed patch set to help move it forward.


>  - Reduce memory consumption when exporting table data to columnar format

The above COPY support will help this.
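As I understand it, the reason COPY helps with memory is that `COPY ... TO STDOUT` streams rows as the scan produces them instead of materialising the table first, so a client-side export keeps memory flat. A sketch (database and table names are placeholders):

```shell
# Streamed export: the server emits rows as it scans and psql writes
# them straight to the file, so memory use does not grow with table size.
psql -d mydb -c "COPY big_table TO STDOUT (FORMAT csv)" > big_table.csv
```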


Thanks,
-- 
kou



Re: Columnar format export in Postgres

From
Sushrut Shivaswamy
Date:
Thanks for the response.

I had considered using COPY TO to export columnar data but gave up on it since the formats weren't extensible.
It's great to see that you are making it extensible.

I'm still going through the thread of comments on your patch but I have some early thoughts about using it for columnar data export.

 - To maintain data freshness there would need to be a way to schedule exports using `COPY TO` with the parquet format periodically.
      - pg_analytica has the scheduling logic; once available, `COPY TO` can be used to export the data instead of the chunked table reads used currently.

 - To facilitate efficient querying it would help to export multiple parquet files for the table instead of a single file.
   Having multiple files allows queries to skip chunks if the key range in the chunk does not match query filter criteria.
   Even within a chunk it would help to be able to configure the size of a row group.
      - I'm not sure how these parameters will be exposed within `COPY TO`. 
        Or maybe the extension implementing the `COPY TO` handler will allow this configuration?

 - Regarding using file_fdw to read Apache Arrow and Apache Parquet file because file_fdw is based on COPY FROM:
     - I'm not too clear on this. file_fdw seems to allow creating a table from data on disk exported using COPY TO.
       But is the newly created table still backed by the data on disk (maybe in columnar format or CSV), or does it just read that data once to build a row-based table?
       I'm not aware of any capability in the postgres planner to read columnar files currently without using an extension like parquet_fdw.
        - For your use case, how do you plan to query the arrow / parquet data?
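The chunk-skipping idea above (record the key range of each exported file, then prune files whose range cannot match the filter) is independent of the file format; a minimal sketch, with all file names and ranges illustrative:

```python
# Zone-map style pruning: each exported chunk records the min/max of a
# sort key, and a range filter skips chunks whose key range cannot
# intersect it, before any chunk data is read.

def prune_chunks(chunks, lo, hi):
    """Return only chunks whose [min_key, max_key] overlaps [lo, hi]."""
    return [c for c in chunks
            if not (c["max_key"] < lo or c["min_key"] > hi)]

chunks = [
    {"file": "part-0.parquet", "min_key": 0,   "max_key": 99},
    {"file": "part-1.parquet", "min_key": 100, "max_key": 199},
    {"file": "part-2.parquet", "min_key": 200, "max_key": 299},
]

# Filter: key BETWEEN 150 AND 250 -> only part-1 and part-2 survive.
survivors = prune_chunks(chunks, 150, 250)
```

The same min/max idea applies one level down to row groups inside a single Parquet file, which is why the row-group size knob matters.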

       

Re: Columnar format export in Postgres

From
Sushrut Shivaswamy
Date:
> If you want to have any hope, the license must be BSD.
> GPL is incompatible.

Ack, will update the license to BSD. Thanks.

On Wed, Jun 12, 2024 at 10:49 PM Ranier Vilela <ranier.vf@gmail.com> wrote:
> On Wed, Jun 12, 2024 at 13:56, Sushrut Shivaswamy <sushrut.shivaswamy@gmail.com> wrote:
>> Hey Postgres team,
>>
>> I have been working on adding support for columnar format export to Postgres to speed up analytics queries.
>> I've created an extension that achieves this functionality here.
>>
>> I'm looking to improve the performance of this extension to enable drop-in analytics support for Postgres. Some immediate improvements I have in mind are:
>>  - Reduce memory consumption when exporting table data to columnar format
>>  - Create a native planner / execution hook that can read columnar data with vectorised operations.
>>
>> It would be very helpful if you could take a look and suggest improvements to the extension.
>> Hopefully, this extension can be shipped by default with postgres at some point in the future.
> If you want to have any hope, the license must be BSD.
> GPL is incompatible.
>
> best regards,
> Ranier Vilela

Re: Columnar format export in Postgres

From
Sutou Kouhei
Date:
Hi,

In <CAH5mb98Dq7ssrQq9n5yW3G1YznH=Q7VvOZ20uhG7Vxg33ZBLDg@mail.gmail.com>
  "Re: Columnar format export in Postgres" on Thu, 13 Jun 2024 22:30:24 +0530,
  Sushrut Shivaswamy <sushrut.shivaswamy@gmail.com> wrote:

>  - To facilitate efficient querying it would help to export multiple
> parquet files for the table instead of a single file.
>    Having multiple files allows queries to skip chunks if the key range in
> the chunk does not match query filter criteria.
>    Even within a chunk it would help to be able to configure the size of a
> row group.
>       - I'm not sure how these parameters will be exposed within `COPY TO`.
>         Or maybe the extension implementing the `COPY TO` handler will
> allow this configuration?

Yes. But adding support for custom COPY TO options is
out-of-scope in the first version. We will focus on only the
minimal features in the first version. We can improve it
later based on use-cases.

See also: https://www.postgresql.org/message-id/20240131.141122.279551156957581322.kou%40clear-code.com

>  - Regarding using file_fdw to read Apache Arrow and Apache Parquet file
> because file_fdw is based on COPY FROM:
>      - I'm not too clear on this. file_fdw seems to allow creating a table
> from  data on disk exported using COPY TO.

Correct.

>        But is the newly created table still using the data on disk(maybe in
> columnar format or csv) or is it just reading that data to create a row
> based table.

The former.

>        I'm not aware of any capability in the postgres planner to read
> columnar files currently without using an extension like parquet_fdw.

Correct. Even with the extensible COPY format feature, we
still need another approach such as parquet_fdw to optimize
queries against Apache Parquet data; file_fdw can only read
Apache Parquet data with a plain SELECT. Sorry for confusing
you.
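For reference, the parquet_fdw mentioned above is typically set up along these lines (table definition and path are illustrative; check the extension's README for the exact options it supports):

```sql
CREATE EXTENSION parquet_fdw;
CREATE SERVER parquet_srv FOREIGN DATA WRAPPER parquet_fdw;

-- The foreign table maps onto a Parquet file on disk; the FDW reads
-- row groups directly and can skip them based on filters, which plain
-- file_fdw (a row-at-a-time COPY FROM reader) cannot do.
CREATE FOREIGN TABLE measurements (ts timestamptz, value float8)
    SERVER parquet_srv
    OPTIONS (filename '/data/measurements.parquet');

SELECT count(*) FROM measurements WHERE value > 0.5;
```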


Thanks,
-- 
kou