Discussion: [HACKERS] [GSOC 17] Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions
Hi, all

My name is Mengxing Liu. I am interested in the project "Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions". After discussing with Kevin off-list, I think it's time to post the discussion here. There are two things I need your help with. Thank you very much.

>>> So the task is clear. We can use a tree-like or hash-like data structure to speed up this function.
>>
>> Right -- especially with a large number of connections holding a large number of conflicts. In one paper with high concurrency, they found over 50% of the CPU time for PostgreSQL was going to these functions (including functions called by them). This seems to me to be due to the O(N^2) (or possibly worse) performance from the number of connections.

Does anyone know the title of this paper? I want to reproduce its workloads.

>> Remember, I think most of the work here is going to be in benchmarking. We not only need to show improvements in simple test cases using readily available tools like pgbench, but think about what types of cases might be worst for the approach taken and show that it still does well -- or at least not horribly. It can be OK to have some slight regression in an unusual case if the common cases improve a lot, but any large regression needs to be addressed before the patch can be accepted. There are some community members who are truly diabolical in their ability to devise "worst case" tests, and we don't want to be blind-sided by a bad result from one of them late in the process.

Are there any documents or links introducing how to test and benchmark PostgreSQL? I may need some time to learn about it. Thanks.

--
Mengxing Liu
On Fri, Mar 10, 2017 at 9:12 PM, Mengxing Liu <liu-mx15@mails.tsinghua.edu.cn> wrote:

> My name is Mengxing Liu. I am interested in the project "Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions". After discussing with Kevin off-list, I think it's time to post the discussion here. There are two things I need your help with. Thank you very much.

Welcome to the -hackers list! This is a key part of how development happens in the community. Don't be shy about posting questions and ideas, but also expect fairly blunt discussion to occur at times; don't let that put you off.

>>> So the task is clear. We can use a tree-like or hash-like data structure to speed up this function.
>>
>> Right -- especially with a large number of connections holding a large number of conflicts. In one paper with high concurrency, they found over 50% of the CPU time for PostgreSQL was going to these functions (including functions called by them). This seems to me to be due to the O(N^2) (or possibly worse) performance from the number of connections.
>
> Does anyone know the title of this paper? I want to reproduce its workloads.

I seem to remember there being a couple of other papers or talks, but this is probably the most informative:

http://sydney.edu.au/engineering/it/research/tr/tr693.pdf

>> Remember, I think most of the work here is going to be in benchmarking. We not only need to show improvements in simple test cases using readily available tools like pgbench, but think about what types of cases might be worst for the approach taken and show that it still does well -- or at least not horribly. It can be OK to have some slight regression in an unusual case if the common cases improve a lot, but any large regression needs to be addressed before the patch can be accepted. There are some community members who are truly diabolical in their ability to devise "worst case" tests, and we don't want to be blind-sided by a bad result from one of them late in the process.
>
> Are there any documents or links introducing how to test and benchmark PostgreSQL? I may need some time to learn about it.

There is pgbench:

https://www.postgresql.org/docs/devel/static/pgbench.html

A read-only load and a read/write mix should both be tested, probably with a few different scales and client counts to force the bottleneck to be in different places. The worst problems have been seen with 32 or more cores on 4 or more sockets with a large number of active connections. I don't know whether you have access to a machine capable of putting this kind of stress on it (perhaps at your university?), but if not, the community has access to various resources we should be able to schedule time on.

It may pay for you to search through the archives of the last year or two to look for other benchmarks and see what people have previously done.

--
Kevin Grittner
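For concreteness, the pgbench matrix described above might be scripted roughly as follows. This is a sketch, not a prescription: the database name "bench", the scale factor, the client counts, and the 300-second run length are all illustrative choices.

```shell
# Initialize a pgbench database (scale factor 100 is roughly 1.5 GB;
# "bench" is just an example database name).
pgbench -i -s 100 bench

# Read/write mix: the default TPC-B-like script, sweeping client counts
# so the bottleneck moves around; -j adds a pgbench thread per 8 clients.
for c in 1 2 16 32 64 128; do
    pgbench -c "$c" -j $(( (c + 7) / 8 )) -T 300 bench
done

# Read-only load: -S runs the built-in SELECT-only script.
for c in 1 2 16 32 64 128; do
    pgbench -S -c "$c" -j $(( (c + 7) / 8 )) -T 300 bench
done
```

Each run prints a tps figure; collecting those across client counts for patched and unpatched builds gives the comparison Kevin describes.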
> -----Original Messages-----
> From: "Kevin Grittner" <kgrittn@gmail.com>
> Sent Time: 2017-03-12 04:24:29 (Sunday)
> To: "Mengxing Liu" <liu-mx15@mails.tsinghua.edu.cn>
> Cc: "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
> Subject: Re: [GSOC 17] Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions
>
> Welcome to the -hackers list! This is a key part of how development happens in the community. Don't be shy about posting questions and ideas, but also expect fairly blunt discussion to occur at times; don't let that put you off.

Thanks, Kevin. I will keep that in mind.

>>>> So the task is clear. We can use a tree-like or hash-like data structure to speed up this function.
>>>
>>> Right -- especially with a large number of connections holding a large number of conflicts. In one paper with high concurrency, they found over 50% of the CPU time for PostgreSQL was going to these functions (including functions called by them). This seems to me to be due to the O(N^2) (or possibly worse) performance from the number of connections.
>>
>> Does anyone know the title of this paper? I want to reproduce its workloads.
>
> I seem to remember there being a couple of other papers or talks, but this is probably the most informative:
>
> http://sydney.edu.au/engineering/it/research/tr/tr693.pdf

Thanks, it's exactly what I need.

>>> Remember, I think most of the work here is going to be in benchmarking. We not only need to show improvements in simple test cases using readily available tools like pgbench, but think about what types of cases might be worst for the approach taken and show that it still does well -- or at least not horribly. It can be OK to have some slight regression in an unusual case if the common cases improve a lot, but any large regression needs to be addressed before the patch can be accepted. There are some community members who are truly diabolical in their ability to devise "worst case" tests, and we don't want to be blind-sided by a bad result from one of them late in the process.
>>
>> Are there any documents or links introducing how to test and benchmark PostgreSQL? I may need some time to learn about it.
>
> There is pgbench:
>
> https://www.postgresql.org/docs/devel/static/pgbench.html
>
> A read-only load and a read/write mix should both be tested, probably with a few different scales and client counts to force the bottleneck to be in different places. The worst problems have been seen with 32 or more cores on 4 or more sockets with a large number of active connections. I don't know whether you have access to a machine capable of putting this kind of stress on it (perhaps at your university?), but if not, the community has access to various resources we should be able to schedule time on.

There is a NUMA machine (120 cores, 8 sockets) in my lab. I think it's enough to put this kind of stress on it. Anyway, I will ask for help if necessary.

> It may pay for you to search through the archives of the last year or two to look for other benchmarks and see what people have previously done.

I will try.

--
Mengxing Liu
On Sat, Mar 11, 2017 at 8:39 PM, Mengxing Liu <liu-mx15@mails.tsinghua.edu.cn> wrote:

>> The worst problems have been seen with 32 or more cores on 4 or more sockets with a large number of active connections. I don't know whether you have access to a machine capable of putting this kind of stress on it (perhaps at your university?), but if not, the community has access to various resources we should be able to schedule time on.
>
> There is a NUMA machine (120 cores, 8 sockets) in my lab.

Fantastic! Can you say a bit more about the architecture and OS?

> I think it's enough to put this kind of stress on it.

The researchers who saw this bottleneck reported that performance started to dip at 16 cores and the problem was very noticeable at 32 cores. A stress test with 120 cores on 8 sockets will be great!

I think perhaps the first milestone of the project should be to develop a set of benchmarks we want to compare to at the end. That would need to include a stress test that clearly shows the problem we're trying to solve, along with some cases involving one or two connections -- just to make sure we don't harm performance in low-contention situations.

--
Kevin Grittner
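One way to build the stress test described above is a custom pgbench script run under SERIALIZABLE. The sketch below is hypothetical: the script name, the use of the standard pgbench_accounts table, and the read-one-row/update-another transaction shape are all illustrative, chosen only because cross-transaction read/write overlap is what generates rw-dependencies for the conflict-tracking code. Note one caveat: pgbench of this era aborts a client on the first SQL error, so serialization failures end that client's run; a retry wrapper or patched pgbench may be needed for long runs.

```shell
# Hypothetical custom pgbench script: each transaction reads one random row
# and updates a different one under SERIALIZABLE, so concurrent transactions
# form the rw-dependencies the conflict-tracking code has to record.
cat > rw_conflict.sql <<'EOF'
\set r1 random(1, 100000 * :scale)
\set r2 random(1, 100000 * :scale)
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT abalance FROM pgbench_accounts WHERE aid = :r1;
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :r2;
COMMIT;
EOF

# High-contention stress run, plus a low-contention sanity check with
# two clients, per the milestone above.
pgbench -f rw_conflict.sql -c 64 -j 8 -T 300 bench
pgbench -f rw_conflict.sql -c 2 -T 300 bench
```

Comparing the 64-client and 2-client results before and after a patch covers both ends of the milestone: the bottleneck case and the no-regression case.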
I am sending this email to Tony, too, because he promised to help me with testing and benchmarking.

>>> The worst problems have been seen with 32 or more cores on 4 or more sockets with a large number of active connections. I don't know whether you have access to a machine capable of putting this kind of stress on it (perhaps at your university?), but if not, the community has access to various resources we should be able to schedule time on.
>>
>> There is a NUMA machine (120 cores, 8 sockets) in my lab.
>
> Fantastic! Can you say a bit more about the architecture and OS?

Intel(R) Xeon(R) CPUs at 2.3 GHz, with 1 TB of physical DRAM and a 1.5 TB SSD, running Ubuntu 14.04, kernel 3.19. I guess NUMA is disabled in the BIOS, but I will check that. However, there is only one NIC, so could the network become the bottleneck if we use too many cores?

>> I think it's enough to put this kind of stress on it.
>
> The researchers who saw this bottleneck reported that performance started to dip at 16 cores and the problem was very noticeable at 32 cores. A stress test with 120 cores on 8 sockets will be great!
>
> I think perhaps the first milestone of the project should be to develop a set of benchmarks we want to compare to at the end. That would need to include a stress test that clearly shows the problem we're trying to solve, along with some cases involving one or two connections -- just to make sure we don't harm performance in low-contention situations.

Thanks for your advice! It's really helpful for my proposal.

There are several alternative benchmarks. Tony suggested that we should use TPC-E and TPC-DS. Personally, I am more familiar with TPC-C, and pgbench seems to have only a TPC-B-like built-in benchmark. Still, I think we can easily find implementations of these benchmarks for PostgreSQL. The paper you recommended to me used a special benchmark the authors defined themselves, but it is quite simple and easy to implement.

For me, the challenge is profiling the execution. Are there any tools for PostgreSQL to analyze where the CPU cycles are consumed? Or do I have to instrument and time it myself?

Regards.

--
Mengxing Liu
Hi Mengxing,

Please read my comments below.

On 3/14/17 17:34, Mengxing Liu wrote:
> I am sending this email to Tony, too, because he promised to help me with testing and benchmarking.
>
>>>>> The worst problems have been seen with 32 or more cores on 4 or more sockets with a large number of active connections. I don't know whether you have access to a machine capable of putting this kind of stress on it (perhaps at your university?), but if not, the community has access to various resources we should be able to schedule time on.
>>>> There is a NUMA machine (120 cores, 8 sockets) in my lab.
>>> Fantastic! Can you say a bit more about the architecture and OS?
>>
> Intel(R) Xeon(R) CPUs at 2.3 GHz, with 1 TB of physical DRAM and a 1.5 TB SSD, running Ubuntu 14.04, kernel 3.19. I guess NUMA is disabled in the BIOS, but I will check that. However, there is only one NIC, so could the network become the bottleneck if we use too many cores?

The configuration is really cool. For the SSD, is it a SATA interface, an NVMe interface, or PCIe flash? Different SSD interfaces have different performance characteristics.

As for NUMA: the affinity issue will impact PostgreSQL performance, so please disable it if possible.

>>> I think it's enough to put this kind of stress on it.
>> The researchers who saw this bottleneck reported that performance started to dip at 16 cores and the problem was very noticeable at 32 cores. A stress test with 120 cores on 8 sockets will be great!
>>
>> I think perhaps the first milestone of the project should be to develop a set of benchmarks we want to compare to at the end. That would need to include a stress test that clearly shows the problem we're trying to solve, along with some cases involving one or two connections -- just to make sure we don't harm performance in low-contention situations.
>>
> Thanks for your advice! It's really helpful for my proposal.
>
> There are several alternative benchmarks. Tony suggested that we should use TPC-E and TPC-DS. Personally, I am more familiar with TPC-C, and pgbench seems to have only a TPC-B-like built-in benchmark. Still, I think we can easily find implementations of these benchmarks for PostgreSQL. The paper you recommended to me used a special benchmark the authors defined themselves, but it is quite simple and easy to implement.

As for the benchmark tool: TPC-E is the latest spec, and you could use it to stress PG in OLTP mode. For PostgreSQL specifically, pgbench is developed by the community and is the usual tool for benchmarking PG performance. However, you could use whichever you like; it depends on your situation. ^_^

> For me, the challenge is profiling the execution. Are there any tools for PostgreSQL to analyze where the CPU cycles are consumed? Or do I have to instrument and time it myself?

On Solaris systems there is a powerful probing framework named DTrace, which you could use for profiling. For how to use it, you could consult illumos (a fork of Solaris/OpenSolaris); the IRC channel is #illumos on freenode.net. (But I think there are DTrace experts on this mailing list, so you could start another thread to ask.)
On Tue, Mar 14, 2017 at 6:00 AM, DEV_OPS <devops@ww-it.cn> wrote:

> On 3/14/17 17:34, Mengxing Liu wrote:
>>>>> The worst problems have been seen with 32 or more cores on 4 or more sockets with a large number of active connections. I don't know whether you have access to a machine capable of putting this kind of stress on it (perhaps at your university?), but if not, the community has access to various resources we should be able to schedule time on.
>>>> There is a NUMA machine (120 cores, 8 sockets) in my lab.
>>> Fantastic! Can you say a bit more about the architecture and OS?
>>
>> Intel(R) Xeon(R) CPUs at 2.3 GHz, with 1 TB of physical DRAM and a 1.5 TB SSD, running Ubuntu 14.04, kernel 3.19. I guess NUMA is disabled in the BIOS, but I will check that.

I'm not sure what it would mean to "disable" NUMA -- if the CPU chips are each functioning as memory controller for a subset of the RAM, you will have non-uniform memory access speeds from any core to different RAM locations. You can switch it to "interleaved" access, so the speed of access from a core to any logical memory address will be rather random, rather than letting the OS try to arrange things so that processes do most of their access to memory that is faster for them. It is generally better to tune NUMA in the OS than to randomize things at the BIOS level.

>> However, there is only one NIC, so could the network become the bottleneck if we use too many cores?

Well, if we run the pgbench client on the database server box, the NIC won't matter at all. If we move the client side to another box, I still think that when we hit this problem, it will dwarf any impact of the NIC throughput.

> The configuration is really cool. For the SSD, is it a SATA interface, an NVMe interface, or PCIe flash? Different SSD interfaces have different performance characteristics.

Yeah, knowing the model of SSD, as well as which particular Xeon we're using, would be good.

> As for NUMA: the affinity issue will impact PostgreSQL performance, so please disable it if possible.

I disagree. (See above.)

>> There are several alternative benchmarks. Tony suggested that we should use TPC-E and TPC-DS.

More benchmarks is better, all other things being equal. Keep in mind that good benchmarking practice with PostgreSQL generally requires a lot of setup time (so that we're starting from the exact same conditions for every run), a lot of run time (so that the effects of vacuuming, bloat, and page splitting all come into play, like they would in the real world), and a lot of repetitions of each run (to account for variation). In particular, on a NUMA machine it is not at all unusual to see bifurcated

>> Personally, I am more familiar with TPC-C.

Unfortunately, the TPC-C benchmark does not create any cycles in the transaction dependencies, meaning that it is not a great tool for benchmarking serializable transactions. I know there are variations on TPC-C floating around that add a transaction type to do so, but on a quick search right now, I was unable to find one.

>> And pgbench seems to have only a TPC-B-like built-in benchmark.

You can feed it your own custom queries, instead.

>> Still, I think we can easily find implementations of these benchmarks for PostgreSQL.

Reportedly, some of the implementations of TPC-C are not very good. There is DBT-2, but I've known a couple of people to look at that and find that it needed work before they could use it. Based on the PostgreSQL versions mentioned on the wiki page, it has been neglected for a while:

https://wiki.postgresql.org/wiki/DBT-2

>> The paper you recommended to me used a special benchmark the authors defined themselves, but it is quite simple and easy to implement.

It also has the distinct advantage that we *know* they created a scenario where the code we want to tune was using most of the CPU on the machine.

>> For me, the challenge is profiling the execution. Are there any tools for PostgreSQL to analyze where the CPU cycles are consumed? Or do I have to instrument and time it myself?

Generally oprofile or perf is used if you want to know where the time is going. This creates a slight dilemma -- if you configure your build with --enable-cassert, you get the best stack traces and you can more easily break down execution profiles. That is especially true if you disable optimization and don't omit frame pointers. But all of those things distort the benchmarks -- adding a lot of time, and not necessarily in proportion to where the time goes in an optimized build.

--
Kevin Grittner
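A middle ground for the dilemma described above is an optimized build that keeps frame pointers, so perf can still walk stacks without the distortion of assertions or -O0. The following is a sketch; the install prefix, sampling duration, and the expectation of which symbols appear are illustrative assumptions, not measured results.

```shell
# Optimized PostgreSQL build that keeps frame pointers, so perf can walk
# stacks without the slowdown of --enable-cassert or disabled optimization.
./configure --prefix="$HOME/pgsql" CFLAGS="-O2 -fno-omit-frame-pointer"
make -j "$(nproc)" && make install

# While the benchmark is running, sample all CPUs for 60 seconds,
# recording call graphs via frame pointers.
perf record -a -g --call-graph fp -- sleep 60

# Summarize where the CPU time went; if the rw-conflict tracking is the
# bottleneck, predicate.c functions such as RWConflictExists should be
# near the top.
perf report --sort symbol
```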
On Tue, Mar 14, 2017 at 3:45 PM, Kevin Grittner <kgrittn@gmail.com> wrote:

>> On 3/14/17 17:34, Mengxing Liu wrote:
>>> There are several alternative benchmarks. Tony suggested that we should use TPC-E and TPC-DS.
>
> More benchmarks is better, all other things being equal. Keep in mind that good benchmarking practice with PostgreSQL generally requires a lot of setup time (so that we're starting from the exact same conditions for every run), a lot of run time (so that the effects of vacuuming, bloat, and page splitting all come into play, like they would in the real world), and a lot of repetitions of each run (to account for variation). In particular, on a NUMA machine it is not at all unusual to see bifurcated

Sorry, I didn't finish that sentence.

On a NUMA machine it is not at all unusual to see bifurcated results -- with each run coming in very close to one number or a second number, often at about a 50/50 rate, with no numbers falling anywhere else. This seems to be based on where the processes and memory allocations happen to land. Different people have dealt with this in different ways. If you can get five or more runs of a given test, perhaps the best is to throw away the high and low values and average the rest. Other approaches I've seen are to average all, take the median, or take the best result. The worst result of a set of runs is rarely interesting, as it may be due to some periodic maintenance kicking in (perhaps at the OS level and not related to database activity at all).

--
Kevin Grittner
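The "throw away the high and low values and average the rest" rule above is easy to script. A sketch, assuming one tps figure per line; the sample numbers here are made up, with 500 standing in for a bifurcated outlier run:

```shell
# Five made-up pgbench tps results, one per line; 500 is an outlier run.
printf '100\n105\n98\n500\n102\n' > runs.txt

# Sort numerically, drop the lowest and highest values, average the rest.
sort -n runs.txt | sed '1d;$d' |
    awk '{ sum += $1; n++ } END { printf "%.2f\n", sum / n }'
```

Here the trimmed mean is 102.33, while a plain average would be dragged up to 181 by the outlier run.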
> -----Original Messages-----
> From: "Kevin Grittner" <kgrittn@gmail.com>
> Sent Time: 2017-03-15 23:20:07 (Wednesday)
> To: DEV_OPS <devops@ww-it.cn>
> Cc: "Mengxing Liu" <liu-mx15@mails.tsinghua.edu.cn>, "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
> Subject: Re: [HACKERS] Re: [GSOC 17] Eliminate O(N^2) scaling from rw-conflict tracking in serializable transactions
>
> Sorry, I didn't finish that sentence.
>
> On a NUMA machine it is not at all unusual to see bifurcated results -- with each run coming in very close to one number or a second number, often at about a 50/50 rate, with no numbers falling anywhere else. This seems to be based on where the processes and memory allocations happen to land.

Do you mean that for a NUMA machine there usually exist two different performance results? Just two -- neither three nor four?

Anyway, I think I should first get familiar with PostgreSQL and the tools you recommended to me. Then I will try to get a better understanding of the problem, to make our discussion more efficient. Recently, I have been busy preparing for a presentation at ASPLOS '17 and other lab work, but I promise I can finish all the preparation work before May.

Here is my working plan:

At first, I will compile and install PostgreSQL by myself and try the profiling tools (perf or oprofile). Then I will run one or two benchmarks using different configs, where I may need your help to ensure that my tests are close to the practical situation.

PS: Disabling NUMA in the BIOS means that each CPU can use its own memory controller when accessing local memory, to reduce hops. On the contrary, "enable" means UMA. Therefore, I think Tony is right: we should disable this setting. I got the information from here:

http://frankdenneman.nl/2010/12/28/node-interleaving-enable-or-disable/

Regards.

--
Mengxing Liu
On Wed, Mar 15, 2017 at 11:35 AM, Mengxing Liu <liu-mx15@mails.tsinghua.edu.cn> wrote:

>> On a NUMA machine it is not at all unusual to see bifurcated results -- with each run coming in very close to one number or a second number, often at about a 50/50 rate, with no numbers falling anywhere else. This seems to be based on where the processes and memory allocations happen to land.
>
> Do you mean that for a NUMA machine there usually exist two different performance results? Just two -- neither three nor four?

In my personal experience, I have often seen two timings that each run randomly matched; I have not seen nor heard of more, but that doesn't mean it can't happen. ;-)

> At first, I will compile and install PostgreSQL by myself and try the profiling tools (perf or oprofile).

perf is newer, and generally better if you can use it. Don't try to use either on HP hardware -- the BIOS uses some of the same hardware registers that other manufacturers leave for the use of profilers; an HP machine is likely to freeze or reboot if you try to run either of those profilers under load.

> Then I will run one or two benchmarks using different configs, where I may need your help to ensure that my tests are close to the practical situation.

Yeah, we should talk about OS and PostgreSQL configuration before you run any benchmarks. Neither tends to come configured as I would run a production system.

> PS: Disabling NUMA in the BIOS means that each CPU can use its own memory controller when accessing local memory, to reduce hops.

NUMA means that each CPU chip directly controls some of the RAM (possibly with other, non-CPU controllers for some RAM). The question is whether the BIOS or the OS controls the memory allocation. The OS looks at what processes are on what cores and tries to use "nearby" memory for allocations. This can be pessimal if the amount of RAM that is under contention is less than the size of one memory segment, since all CPU chips need to ask the one managing that RAM for each access. In such a case, you actually get the best performance using a cpuset which just uses one CPU package and the memory segments directly managed by that CPU package. Without the cpuset, you may actually see better performance for this workload by letting the BIOS interleave allocations, which spreads the RAM allocations around to memory managed by all CPUs, so that no one CPU becomes the bottleneck. The access is still not uniform, but you dodge a bottleneck -- albeit less efficiently than using a custom cpuset.

> On the contrary, "enable" means UMA.

No. Think about this: regardless of this BIOS setting, each RAM package is directly connected to one CPU package, which functions as its memory controller. Most of the RAM used by PostgreSQL is for disk buffers -- shared buffers in shared memory, plus the OS cache. Picture processes running on different CPU packages accessing a single particular shared buffer. Also picture processes running on different CPU packages using the same spinlock at the same time. No BIOS setting can avoid the communication and data transfer among the 8 CPU packages, and the locking of the cache lines.

> Therefore, I think Tony is right: we should disable this setting.
>
> I got the information from here:
> http://frankdenneman.nl/2010/12/28/node-interleaving-enable-or-disable/

Ah, that explains it. There is no such thing as "disabling NUMA" -- you can have the BIOS force interleaving, or you can have the BIOS leave the NUMA memory assignment to the OS. I was assuming that by "disabling NUMA" you meant having the BIOS control it through interleaving. You meant the opposite -- disabling the BIOS override of OS NUMA control. I agree that we should leave NUMA scheduling to the OS. There are, however, some non-standard OS configuration options that allow NUMA to behave better with PostgreSQL than the defaults allow. We will need to tune a little.

The author of that article seems to be assuming that the usage will be with applications like word processing, spreadsheets, or browsers -- where the OS can place all the related processes on a single CPU package and all (or nearly all) memory allocations can be made from associated memory -- yielding fairly uniform and fast access when the BIOS override is disabled. On a database product which wants to use all the cores and almost all of the memory, with heavy contention on shared memory, the situation is very different. Shared resource access time is going to be non-uniform no matter what you do. The difference is that the OS can still make *process local* allocations from nearby memory segments, while the BIOS cannot.

--
Kevin Grittner
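The OS-level controls discussed above can be made concrete with numactl. This is a sketch under stated assumptions: the data directory path is illustrative, and numactl must be installed; which policy wins depends on the workload, exactly as described in the thread.

```shell
# OS-controlled interleaving: spread PostgreSQL's memory across all nodes
# so no single memory controller owns every shared buffer (the same effect
# as BIOS interleaving, but chosen per-process by the OS).
numactl --interleave=all pg_ctl -D /srv/pgdata start

# Alternative for a working set that fits one node: pin the server and its
# memory to a single CPU package so all accesses stay node-local.
numactl --cpunodebind=0 --membind=0 pg_ctl -D /srv/pgdata start

# One commonly recommended OS tweak for databases on NUMA boxes: turn off
# zone reclaim, which otherwise discards page cache to keep allocations local.
echo 0 | sudo tee /proc/sys/vm/zone_reclaim_mode
```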