Hello,
I made a utility program to dump PostgreSQL database in Apache Arrow format.
Apache Arrow is a kind of data format for columnar-based structured
data; actively
developed by Spark and comprehensive communities.
It is suitable data representation for static and read-only but large
number of rows.
Many of data analytics tools support Apache Arrow as a common data
exchange format.
See, https://arrow.apache.org/
* pg2arrow
https://github.com/heterodb/pg2arrow
usage:
$ ./pg2arrow -h localhost postgres -c 'SELECT * FROM hogehoge LIMIT
10000' -o /tmp/hogehoge.arrow
--> fetch results of the query, then write out "/tmp/hogehoge"
$ ./pg2arrow --dump /tmp/hogehoge
--> shows schema definition of the "/tmp/hogehoge"
$ python
>>> import pyarrow as pa
>>> X = pa.RecordBatchFileReader("/tmp/hogehoge").read_all()
>>> X.schema
id: int32
a: int64
b: double
c: struct<x: int32, y: double, z: decimal(30, 11), memo: string>
child 0, x: int32
child 1, y: double
child 2, z: decimal(30, 11)
child 3, memo: string
d: string
e: double
ymd: date32[day]
--> read the Apache Arrow file using PyArrow, then shows its schema definition.
It is also a groundwork for my current development - arrow_fdw; which
allows to scan
on the configured Apache Arrow file(s) as like regular PostgreSQL table.
I expect integration of the arrow_fdw support with SSD2GPU Direct SQL
of PG-Strom
can pull out maximum capability of the latest hardware (NVME and GPU).
Likely, it is an ideal configuration for log-data processing generated
by many sensors.
Please check it.
Comments, ideas, bug-reports, and other feedbacks are welcome.
As an aside, NVIDIA announced their RAPIDS framework; to exchange data frames
on GPU among multiple ML/Analytics solutions. It also uses Apache
Arrow as a common
format for data exchange, and this is also our groundwork for them.
https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/
Thanks,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai@heterodb.com>