Thread: GSoC - Idea Discussion

GSoC - Idea Discussion

From

hitesh ramani

Date:

March 18, 2015, 23:11:25

Hello devs,

As stated earlier, I am thinking of proposing the integration of Postgres and CUDA for faster execution of ORDER BY queries, by optimizing the sorting code and running the sort with CUDA. I looked at PG Strom and tried to run it, but ran into issues. Moreover, PG Strom is implemented in OpenCL, not CUDA.

I have hardware to run CUDA, and I'm currently at a point where I have almost integrated Postgres and CUDA. This opens the gates for a lot of features that can be optimized through CUDA and parallel processing, though here I only want to focus on sorting, which makes it reasonably feasible for the time period.

From the research I did, I found CUDA to be more efficient not just in parallel performance but in data transfer latency too. My idea is to create a branch of Postgres with the CUDA-integrated code.

As for feasibility, I think it's very much feasible, because I've almost integrated CUDA execution and what remains is optimizing the code for CUDA.

Please share your valuable suggestions and views on this.

Thanks and Regards,

Hitesh Ramani

Re: GSoC - Idea Discussion

From

Tomas Vondra

Date:

March 19, 2015, 19:30:24

Hi Hitesh,

On 18.3.2015 21:11, hitesh ramani wrote:
> Hello devs,
> 
> As stated earlier I was thinking to propose the integration of
> Postgres and CUDA for faster execution of order by queries thru
> optimizing the sorting code and sorting it with CUDA. I saw and tried
> to run PG Strom and ran into issues. Moreover, PG Strom is
> implemented in OpenCL, not CUDA.

Could you please elaborate more why to choose CUDA, a nvidia-only
technology, rather than OpenCL, supported by much wider range of
companies and projects? Why do you consider OpenCL unsuitable?

Not that CUDA is bad - it certainly works better in some scenarios, but
this is a cost/benefits question, and it only works with devices
manufactured by a single company. That significantly limits the
usefulness of the work, IMHO.

You mention that you ran into issues with PG Strom.  What issues?

> 
> I have hardware to run CUDA and currently I'm at a point where I
> have almost integrated Postgres and CUDA. This opens up gates for a
> lot of features which can be optimized thru CUDA and parallel
> processing, though here I only want to focus on sorting, hence kind
> of feasible for the time period.

Can we see some examples, what this actually means? What you can and
can't do at this point, etc.? Can you share some numbers how this
improves the performance?

> 
> As I did some research, I found CUDA is more efficient in not just
> the parallel performance but data transfer latency too. My idea is to
> create a branch of Postgres with the CUDA integrated code.
>

More efficient than what?

> 
> For the feasibility, I guess it's very much feasible because I've
> almost integrated CUDA execution and the code needs to be optimized
> as per CUDA.

That's really difficult to judge, because you have not provided any
source code, examples or anything else to support this.

> 
> Please give in your valuable suggestions and views on this.

From where I sit, this looks interesting, but more like a research
project than something that can be integrated into PostgreSQL in
the foreseeable future. Not sure that's what GSoC is intended for.

Also, we badly need more details on this - current status, examples, and
especially project plan explaining the scope. It's impossible to say
whether the sort can be implemented within the GSoC time frame.

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: GSoC - Idea Discussion

From

hitesh ramani

Date:

March 19, 2015, 23:41:54

Hello Tomas,

> Could you please elaborate more why to choose CUDA, a nvidia-only
> technology, rather than OpenCL, supported by much wider range of
> companies and projects? Why do you consider OpenCL unsuitable?
>
> Not that CUDA is bad - it certainly works better in some scenarios, but
> this is a cost/benefits question, and it only works with devices
> manufactured by a single company. That significantly limits the
> usefulness of the work, IMHO.

I would never say OpenCL is unsuitable; I just meant that, in the research I did, CUDA came out with better results. I do agree OpenCL is also a great tool to exploit the power of GPUs. My aim is to enhance the performance using CUDA, though an OpenCL implementation might work great too!

> You mention that you ran into issues with PG Strom. What issues?

While trying to compile, I ran into the error "src/main.c:27:29: fatal error: utils/ruleutils.h: No such file or directory". When I ran make against the branch of Postgres suggested in the description, i.e. the custom_join branch, I still ran into the same issue. Moreover, I couldn't locate the file.

> Can we see some examples, what this actually means? What you can and
> can't do at this point, etc.? Can you share some numbers how this
> improves the performance?

I did some benchmarking of quicksort on 1M random numbers (range 0 to 0xffffff) on GPU and CPU; the results showed an improvement of 700% on the GPU.

What this means and what I can do at this point: my aim was to integrate CUDA with Postgres so that I can make a call to the GPU for the sorting operation. To start, I wrote a simple CUDA hello world program and edited the code to call it from qsort. I ran into name mangling issues, which I sorted out by creating two different .h files, one for the CUDA program and one for the call I made from qsort. Finally, I edited the makefile to compile the CUDA program as part of the Postgres build, so now when I compile my Postgres code, the CUDA file gets compiled too and prints the expected output on the server side.
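The name mangling issue arises because nvcc compiles .cu files as C++, so a function defined there gets a mangled symbol unless it is declared with C linkage. The message above solved this with two separate headers; a common alternative, sketched here as an assumption rather than what the linked code does, is a single header guarded by `__cplusplus`. The name `gpu_sort_int32` is hypothetical, and its body is a CPU stand-in so that the sketch builds with a plain C compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* ---- would live in a shared header included by both the C and .cu files ---- */
#ifdef __cplusplus
extern "C" {            /* seen only by the C++ (nvcc) side: keep the C symbol name */
#endif

/* Hypothetical entry point: sorts n ints in place, returns 0 on success. */
int gpu_sort_int32(int *data, size_t n);

#ifdef __cplusplus
}
#endif

/* ---- would live in the .cu file; here a CPU stand-in replaces the kernel ---- */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    return (x > y) - (x < y);   /* avoids overflow of x - y */
}

int gpu_sort_int32(int *data, size_t n)
{
    assert(data != NULL || n == 0);
    if (n > 1)
        qsort(data, n, sizeof(int), cmp_int);
    return 0;
}
```

With this layout the C caller and the CUDA source include the same declaration, so the two cannot drift apart.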

What I still haven't done: I haven't actually enhanced the sorting yet. I'm still analyzing the code to work out how to tinker with it and what the right approach is.

> That's really difficult to judge, because you have not provided any
> source code, examples or anything else to support this.
>
> >
> > Please give in your valuable suggestions and views on this.
>
> From where I sit, this looks interesting, but more like a research
> project than something that can be integrated into PostgreSQL in
> the foreseeable future. Not sure that's what GSoC is intended for.
>
> Also, we badly need more details on this - current status, examples, and
> especially project plan explaining the scope. It's impossible to say
> whether the sort can be implemented within the GSoC time frame.

What I actually have in mind is a branch of Postgres with CUDA-compatible features. I wanted to start with sorting, which can be improved further later. To be honest, I'm still analyzing the sort code for inputs of more than a million integer elements (in a single row, for now), so that the use of GPUs is actually significant. As far as I saw, Postgres uses an external sort for that.

If you feel this isn't feasible in such a time span, I would love to hear suggestions for any small function that could benefit from parallelism.

Thanks and Regards,

Hitesh Ramani

Re: GSoC - Idea Discussion

From

Tomas Vondra

Date:

March 20, 2015, 00:31:31


On 03/19/15 21:41, hitesh ramani wrote:
> Hello Tomas,
>
>
>  > Could you please elaborate more why to choose CUDA, a nvidia-only
>  > technology, rather than OpenCL, supported by much wider range of
>  > companies and projects? Why do you consider OpenCL unsuitable?
>  >
>  > Not that CUDA is bad - it certainly works better in some scenarios, but
>  > this is a cost/benefits question, and it only works with devices
>  > manufactured by a single company. That significantly limits the
>  > usefulness of the work, IMHO.
>
>
> I will never say OpenCL is unsuitable, I just meant, as per the research
> I did, CUDA came out with better results. I do agree OpenCL is also a
> great tool to exploit the power of GPUs. My aim is to enhance the
> performance using CUDA, though OpenCL implementation might work great too!

My point was that using open standards and frameworks (OpenCL) has much 
higher chance of being welcomed by the community of open source 
projects, compared to proprietary technologies like CUDA.

>
>  > You mention that you ran into issues with PG Strom. What issues?
>
> While I was trying to compile, I ran into the error "src/main.c:27:29:
> fatal error: utils/ruleutils.h: No such file or directory", when I did
> make to the branch of Postgres suggested in the description, i.e the
> custom_join branch, I still ran into the same issue. Moreover, I
> couldn't locate the file.

That's strange, and you should probably ask people on the PG Strom 
projects. Haven't tried PG Strom for a long time, but the compilation 
worked fine some time ago.

>
>  > Can we see some examples, what this actually means? What you can and
>  > can't do at this point, etc.? Can you share some numbers how this
>  > improves the performance?
>
> I did some benchmarking on quicksort for 1M random numbers(range 0 to
> 0xffffff) on GPU and CPU, the results showed enhancement of 700% on the GPU.

So you've created an array of 1M integers, and it's 7x faster on GPU 
compared to pg_qsort(), correct?

Well, it might surprise you, but PostgreSQL almost never sorts numbers 
like this. PostgreSQL sorts tuples, which is way more complicated and, 
considering the variable length of tuples (causing issues with memory 
access), rather unsuitable for GPU devices. I might be missing 
something, of course.

Also, it often needs additional information, like collations when 
sorting by a text field, for example.
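To make the difference from sorting a flat integer array concrete, here is a minimal sketch that sorts pointers to records by a variable-length text key. The `Row` type and function names are hypothetical, and the C library's `strcoll()` only stands in for collation-aware comparison; this is not PostgreSQL's tuplesort code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: the sort key is a variable-length string. */
typedef struct Row
{
    int         id;
    const char *name;
} Row;

/* Compare two Row pointers by name using the current locale's collation. */
static int cmp_row_by_name(const void *a, const void *b)
{
    const Row *ra = *(const Row *const *) a;
    const Row *rb = *(const Row *const *) b;

    return strcoll(ra->name, rb->name);
}

/* Sort an array of pointers; the records themselves are not moved. */
void sort_rows_by_name(Row **rows, size_t n)
{
    assert(rows != NULL || n == 0);
    if (n > 1)
        qsort(rows, n, sizeof(Row *), cmp_row_by_name);
}
```

Every comparison follows a pointer to data of unknown length and depends on locale state, which is the kind of access pattern the message above describes as awkward for a GPU.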

> What this means and what I can do at this point - My aim was to
> integrate CUDA with Postgres so that I can make a call to the GPU for
> sorting operation. To start, I made a simple CUDA hello world program,
> and edited the code to call it from qsort, ran into name mangling
> issues, so sorted that out by creating 2 different .h files one for CUDA
> program and for the call I made from qsort. Finally, edited the make
> file to compile the CUDA program with the Postgres compilation itself
> and now when I compile my Postgres code, the CUDA file gets compiled too
> and prints the needed on the server end.

Why don't you show us the source code? Would be simpler than explaining 
what it does.

>
> What I still haven't done - I still haven't actually enhanced the
> sorting yet, I'm still analyzing the code, how to tinker with it, the
> right approach.
>

I'd recommend discussing the code here. It's certainly quite complex, 
especially if this is your first encounter with it.

>
>  > That's really difficult to judge, because you have not provided any
>  > source code, examples or anything else to support this.
>  >
>  > >
>  > > Please give in your valuable suggestions and views on this.
>  >
>  > From where I sit, this looks interesting, but more like a research
>  > project than something that can be integrated into PostgreSQL in
>  > the foreseeable future. Not sure that's what GSoC is intended for.
>  >
>  > Also, we badly need more details on this - current status, examples, and
>  > especially project plan explaining the scope. It's impossible to say
>  > whether the sort can be implemented within the GSoC time frame.
>
> What I actually see it is as is to be a branch of Postgres which has
> CUDA compatible features. I wanted to start it by sorting which can

I find it very unlikely that this project will choose something that is 
intended as a fork.

> further be improved. To be honest, I'm still analyzing the sort code
> for elements above a million integer elements(in a single row, for
> now) so that the use of GPUs is actually significant. As I saw,
> Postgres uses external sort for that.

PostgreSQL uses adaptive sort - in-memory when it fits into work_mem, 
on-disk when it does not. This is decided at runtime.

You'll have to do the same thing, because the amount of memory available 
on GPUs is limited to a few GBs, and it needs to work for datasets 
exceeding that limit (the amount of data is uncertain at planning time).
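The runtime decision described above can be sketched as simple byte accounting: tuples are accepted until a memory budget would be exceeded, at which point the buffered batch is closed as a sorted run. The `SortState` type and its functions are hypothetical, and the budget merely plays the role of work_mem or of the GPU memory size; no actual sorting or I/O is performed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for a sort whose input size is not known up front. */
typedef struct SortState
{
    size_t budget_bytes;   /* stands in for work_mem or device memory */
    size_t used_bytes;     /* bytes buffered in the current batch */
    size_t runs_spilled;   /* sorted runs closed so far */
} SortState;

void sort_state_init(SortState *s, size_t budget_bytes)
{
    assert(s != NULL);
    s->budget_bytes = budget_bytes;
    s->used_bytes = 0;
    s->runs_spilled = 0;
}

/* Account for one incoming tuple; returns 1 if it forced a spill, else 0. */
int sort_state_add(SortState *s, size_t tuple_bytes)
{
    int spilled = 0;

    assert(tuple_bytes <= s->budget_bytes);
    if (s->used_bytes + tuple_bytes > s->budget_bytes)
    {
        /* A real implementation would sort the batch and write it out here. */
        s->runs_spilled++;
        s->used_bytes = 0;
        spilled = 1;
    }
    s->used_bytes += tuple_bytes;
    return spilled;
}

/* The sort became external as soon as at least one run was spilled. */
int sort_state_is_external(const SortState *s)
{
    return s->runs_spilled > 0;
}
```

The mode is never chosen in advance; it falls out of how much data actually arrives.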

>
> If you feel this isn't feasible in such a time span, I would love to
> hear any suggestion for any small function which can leverage off by
> parallelism.

I honestly don't know.

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: GSoC - Idea Discussion

From

Alvaro Herrera

Date:

March 20, 2015, 01:00:08

hitesh ramani wrote:

> > You mention that you ran into issues with PG Strom.  What issues?
> 
> While I was trying to compile, I ran into the error "src/main.c:27:29: fatal error: utils/ruleutils.h: No such file
> or directory", when I did make to the branch of Postgres suggested in the description, i.e the custom_join branch, I
> still ran into the same issue. Moreover, I couldn't locate the file.
 

You're using an old postgres branch.  That file is pretty recent.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: GSoC - Idea Discussion

From

Kouhei Kaigai

Date:

March 20, 2015, 01:09:50

> > Could you please elaborate more why to choose CUDA, a nvidia-only
> > technology, rather than OpenCL, supported by much wider range of
> > companies and projects? Why do you consider OpenCL unsuitable?
> >
> > Not that CUDA is bad - it certainly works better in some scenarios, but
> > this is a cost/benefits question, and it only works with devices
> > manufactured by a single company. That significantly limits the
> > usefulness of the work, IMHO.
>
>
> I will never say OpenCL is unsuitable, I just meant, as per the research I did,
> CUDA came out with better results. I do agree OpenCL is also a great tool to exploit
> the power of GPUs. My aim is to enhance the performance using CUDA, though OpenCL
> implementation might work great too!
>
Let me say CUDA is better than OpenCL :-)
Because of the software quality of the OpenCL runtime drivers provided by each vendor,
I've often faced mysterious problems. Only NVIDIA's runtime is reliable enough
from my point of view. In addition, a feature implemented with OpenCL still
fully depends on hardware characteristics, so we cannot ignore the physical hardware
underlying the abstraction layer.
So, I'm now reworking the code to move from OpenCL to CUDA.

> > You mention that you ran into issues with PG Strom. What issues?
>
> While I was trying to compile, I ran into the error "src/main.c:27:29: fatal error:
> utils/ruleutils.h: No such file or directory", when I did make to the branch of
> Postgres suggested in the description, i.e the custom_join branch, I still ran
> into the same issue. Moreover, I couldn't locate the file.
>
I think you reference the old branch in my personal repository.
Could you confirm the repository URL? Below is the latest. https://github.com/pg-strom/devel

> > Can we see some examples, what this actually means? What you can and
> > can't do at this point, etc.? Can you share some numbers how this
> > improves the performance?
>
> I did some benchmarking on quicksort for 1M random numbers(range 0 to 0xffffff)
> on GPU and CPU, the results showed enhancement of 700% on the GPU.
>
> What this means and what I can do at this point - My aim was to integrate CUDA
> with Postgres so that I can make a call to the GPU for sorting operation. To start,
> I made a simple CUDA hello world program, and edited the code to call it from
> qsort, ran into name mangling issues, so sorted that out by creating 2 different .h
> files one for CUDA program and for the call I made from qsort. Finally, edited
> the make file to compile the CUDA program with the Postgres compilation itself
> and now when I compile my Postgres code, the CUDA file gets compiled too and prints
> the needed on the server end.
>
> What I still haven't done - I still haven't actually enhanced the sorting yet,
> I'm still analyzing the code, how to tinker with it, the right approach.
>
It seems to me you are a little bit optimistic.
Unlike CPU code, GPU sorting logic has to reference device memory space,
so all the data to be compared needs to be transferred to the GPU device.
A pointer into the host address space is not valid in GPU computation.
The amount of device memory is usually smaller than host memory, so your code
needs the capability to combine multiple partially sorted chunks...
And that is probably not everything.
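Because host pointers mean nothing on the device, pointer-based data is typically flattened into contiguous buffers before transfer. The sketch below packs string keys into one byte buffer plus an offset array; the `PackedKeys` type and function names are hypothetical. In a CUDA build the two buffers would then be copied with `cudaMemcpy(..., cudaMemcpyHostToDevice)`, which is omitted here so the code builds as plain C.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical pointer-free layout: keys back to back, addressed by offset. */
typedef struct PackedKeys
{
    char   *bytes;     /* all keys, each NUL-terminated */
    size_t *offsets;   /* offsets[i] = start of key i within bytes */
    size_t  nkeys;
    size_t  nbytes;
} PackedKeys;

/* Returns 0 on success, -1 on allocation failure. */
int pack_keys(const char *const *keys, size_t n, PackedKeys *out)
{
    size_t total = 0;
    size_t pos = 0;
    size_t i;

    assert(out != NULL);
    for (i = 0; i < n; i++)
        total += strlen(keys[i]) + 1;

    out->bytes = malloc(total ? total : 1);
    out->offsets = malloc((n ? n : 1) * sizeof(size_t));
    if (out->bytes == NULL || out->offsets == NULL)
    {
        free(out->bytes);
        free(out->offsets);
        return -1;
    }

    for (i = 0; i < n; i++)
    {
        size_t len = strlen(keys[i]) + 1;   /* include the terminator */

        memcpy(out->bytes + pos, keys[i], len);
        out->offsets[i] = pos;
        pos += len;
    }
    out->nkeys = n;
    out->nbytes = total;
    return 0;
}

void free_packed_keys(PackedKeys *p)
{
    free(p->bytes);
    free(p->offsets);
    p->bytes = NULL;
    p->offsets = NULL;
}
```

Offsets stay valid wherever the buffer is copied, which is the property raw host pointers lack.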

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

Re: GSoC - Idea Discussion

From

Kouhei Kaigai

Date:

March 20, 2015, 03:53:06

> I think you reference the old branch in my personal repository.
> Could you confirm the repository URL? Below is the latest.
>   https://github.com/pg-strom/devel
>
Sorry, this is not a problem with the pg-strom repository.

Please use the "custom_join" branch of the tree below: https://github.com/kaigai/sepgsql/tree/custom_join

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

Re: GSoC - Idea Discussion

From

hitesh ramani

Date:

March 20, 2015, 14:26:19

Hello devs,

Thank you so much for the feedback. To answer your questions:

Tomas:

> So you've created an array of 1M integers, and it's 7x faster on GPU
> compared to pg_qsort(), correct?

No, I meant a general sort, not pg_qsort().

> Well, it might surprise you, but PostgreSQL almost never sorts numbers
> like this. PostgreSQL sorts tuples, which is way more complicated and,
> considering the variable length of tuples (causing issues with memory
> access), rather unsuitable for GPU devices. I might be missing
> something, of course.
>
> Also, it often needs additional information, like collations when
> sorting by a text field, for example.

I totally agree with you on this point. My current target area is very confined because this is just the beginning; I'm only considering integer values in one row.

> Why don't you show us the source code? Would be simpler than explaining
> what it does.

You can have a look at the code here: https://github.com/hiteshramani/Postgres-CUDA

This is compiled code; you can see the call to the CUDA function in src/port/qsort.c and the .h files qsort_normal.h and qsort_cuda.h. The hello world program is in src/port/qsort_cuda.cu. The build happens in two phases, compile and link: I compiled the CUDA file with nvcc, and for linking I edited the makefile of src/timezone/, because the zic build needed the CUDA object linked in.
Suggestions are welcome.

> I'd recommend discussing the code here. It's certainly quite complex,
> especially if this is your first encounter with it.

Yes, I found it a little complex and couldn't find many help resources online. I'm looking for help.

> PostgreSQL uses adaptive sort - in-memory when it fits into work_mem,
> on-disk when it does not. This is decided at runtime.
>
> You'll have to do the same thing, because the amount of memory available
> on GPUs is limited to a few GBs, and it needs to work for datasets
> exceeding that limit (the amount of data is uncertain at planning time).

Yes, I thought of that too. A call could be made to the GPU with the integer array as input, and the GPU would return the sorted array. I want to proceed step by step, since there are methods for sorting amounts of data that exceed the GPU memory.

Álvaro Herrera:

I downloaded the zip of the latest custom_join repo I saw 2 days ago. I'll check once again. Thank you. :)

KaiGai Kohei:

> Let me say CUDA is better than OpenCL :-)
> Because of the software quality of the OpenCL runtime drivers provided by each vendor,
> I've often faced mysterious problems. Only NVIDIA's runtime is reliable enough
> from my point of view. In addition, a feature implemented with OpenCL still
> fully depends on hardware characteristics, so we cannot ignore the physical hardware
> underlying the abstraction layer.
> So, I'm now reworking the code to move from OpenCL to CUDA.

That's great, I'd love to help you with that and contribute to it.

> It seems to me you are a little bit optimistic.
> Unlike CPU code, GPU sorting logic has to reference device memory space,
> so all the data to be compared needs to be transferred to the GPU device.
> A pointer into the host address space is not valid in GPU computation.
> The amount of device memory is usually smaller than host memory, so your code
> needs the capability to combine multiple partially sorted chunks...
> And that is probably not everything.

Aren't there algorithms that help when device memory is limited and the data is massive? My memory of this is rough, but I took an online course where I believe I saw algorithms for dealing with such problems.

Thanks and Regards,

Hitesh Ramani

Re: GSoC - Idea Discussion

From

Kouhei Kaigai

Date:

March 21, 2015, 04:24:19

> KaiGai Kohei:
> >It seems to me you are a little bit optimistic.
> >Unlike CPU code, GPU sorting logic has to reference device memory space,
> >so all the data to be compared needs to be transferred to the GPU device.
> >A pointer into the host address space is not valid in GPU computation.
> >The amount of device memory is usually smaller than host memory, so your code
> >needs the capability to combine multiple partially sorted chunks...
> >And that is probably not everything.
> 
> Aren't there algorithms that help when device memory is limited and the
> data is massive? My memory of this is rough, but I took an online course where
> I believe I saw algorithms for dealing with such problems.
>
What I took is a hybrid approach to process data sets that exceed the device memory
limit. First, it splits the input data stream into multiple (one or more)
chunks. Second, it launches a bitonic-sorting kernel with a
key-comparison function generated on the fly. Third, it launches a dynamic
background worker to run the merge-sort logic on the CPU.
It does not try to handle all the sorting on the GPU. The point we
should not forget is that CPU/GPU is a means of sorting, not the goal.
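The three steps can be sketched on the CPU alone: each fixed-size chunk is sorted with a bitonic network (the part that would run as a GPU kernel), and the sorted chunks are then merged. This is an illustration under assumptions, not PG-Strom's code: the function names are hypothetical, keys are plain ints, the chunk size must be a power of two that divides the input length, and neither the generated comparison function nor the background worker is modeled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* In-place bitonic sort, ascending; n must be a power of two. */
static void bitonic_sort(int *a, size_t n)
{
    size_t i, j, k;

    for (k = 2; k <= n; k <<= 1)
        for (j = k >> 1; j > 0; j >>= 1)
            for (i = 0; i < n; i++)
            {
                size_t l = i ^ j;          /* partner index for this stage */
                int    up = (i & k) == 0;  /* direction of this subsequence */

                if (l > i && ((up && a[i] > a[l]) || (!up && a[i] < a[l])))
                {
                    int t = a[i];

                    a[i] = a[l];
                    a[l] = t;
                }
            }
}

/* Merge sorted src[lo..mid) and src[mid..hi) into dst[lo..hi). */
static void merge_runs(const int *src, int *dst, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;

    while (i < mid && j < hi)
        dst[k++] = (src[j] < src[i]) ? src[j++] : src[i++];
    while (i < mid)
        dst[k++] = src[i++];
    while (j < hi)
        dst[k++] = src[j++];
}

/* Sort chunks independently, then merge them pairwise. Returns 0 or -1. */
int hybrid_sort(int *data, size_t n, size_t chunk)
{
    int   *tmp, *src, *dst;
    size_t off, width, lo;

    assert(chunk > 0 && (chunk & (chunk - 1)) == 0);
    assert(n % chunk == 0);
    if (n == 0)
        return 0;

    for (off = 0; off < n; off += chunk)    /* step 2: per-chunk sort */
        bitonic_sort(data + off, chunk);

    tmp = malloc(n * sizeof(int));
    if (tmp == NULL)
        return -1;
    src = data;
    dst = tmp;
    for (width = chunk; width < n; width *= 2)   /* step 3: CPU merge */
    {
        for (lo = 0; lo < n; lo += 2 * width)
        {
            size_t mid = (lo + width < n) ? lo + width : n;
            size_t hi = (lo + 2 * width < n) ? lo + 2 * width : n;

            merge_runs(src, dst, lo, mid, hi);
        }
        { int *t = src; src = dst; dst = t; }
    }
    if (src != data)
        memcpy(data, src, n * sizeof(int));
    free(tmp);
    return 0;
}
```

Only the per-chunk sort needs the whole chunk resident in one memory space, so the chunk size can be chosen to fit the device while the merge handles arbitrarily large input.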

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
