Discussion: Re: [HACKERS] sort on huge table
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > RedHat Linux 6.0 (kernel 2.2.5-smp)
> > Pentium III 500MHz x 2
> > RAM: 512MB
> > Disk: Ultra Wide SCSI 9GB x 4 + Hardware RAID (RAID 5).
>
> OK, no problem with inadequate hardware anyway ;-).  Bruce's concern
> about the simplistic read-ahead algorithm in Linux may apply though.
>
> > Also, I could provide testing scripts to reproduce my tests.
>
> Please.  That would be very handy so that we can make sure we are all
> comparing the same thing.  I assume the scripts can be tweaked to vary
> the amount of disk space used?  I can't scare up more than a couple
> hundred meg at the moment.  (The natural state of a disk drive is
> "full" ...)
>
> > I think it depends on the disk space available.  Ideally it should be
> > able to choose the sort algorithm.
>
> I was hoping to avoid that, because of the extra difficulty of testing
> and maintenance.  But it may be the only answer.
>
> 			regards, tom lane

I know this is a VERY long shot, but... what were the READ/WRITE ratios
between the old version and the new version?  Perhaps the computation of
the checksum (sic) blocks under RAID 5 caused the unexpected behavior.
With RAID 5 increasing read performance but decreasing write performance,
one might expect a new algorithm which, say, halves reads but slightly
increases writes not to realize the same benefits as it would on a normal
disk system or a RAID 1 (or, better yet, a RAID 0+1) array.

Like I said... a VERY long shot theory.

Mike Mascari (mascarim@yahoo.com)
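To make the RAID 5 concern concrete: a small logical write on RAID 5 is
typically a read-modify-write of both the data block and the parity block,
while a read touches a single disk.  The numbers below are purely
illustrative back-of-the-envelope accounting, not a model of any real
controller or of the workload being discussed.

/*
 * Back-of-the-envelope illustration of the RAID 5 small-write penalty.
 * Workload numbers are made up; no particular controller is modeled.
 */
#include <stdio.h>

int
main(void)
{
	long		logical_reads = 1000000;	/* hypothetical workload */
	long		logical_writes = 1000000;

	/*
	 * RAID 5 read-modify-write: a small write must read the old data
	 * block and the old parity block, then write new data and new parity.
	 */
	long		raid5_ios = logical_reads + 4 * logical_writes;

	/* RAID 1 (or 0+1): every write goes to two disks, reads to one. */
	long		raid1_ios = logical_reads + 2 * logical_writes;

	/* Plain single disk: one physical I/O per logical I/O. */
	long		plain_ios = logical_reads + logical_writes;

	printf("plain disk: %ld physical I/Os\n", plain_ios);
	printf("RAID 1    : %ld physical I/Os\n", raid1_ios);
	printf("RAID 5    : %ld physical I/Os\n", raid5_ios);
	return 0;
}

On this accounting, an algorithm that halves reads but adds even a modest
number of writes gains much less on RAID 5 than it would on a plain disk,
which is the effect being speculated about here.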
At 06:24 PM 11/1/99 -0800, Mike Mascari wrote:

> I know this is a VERY long shot, but... what were the READ/WRITE ratios
> between the old version and the new version?  Perhaps the computation
> of the checksum (sic) blocks under RAID5 caused the unexpected behavior.
> With RAID 5 increasing read performance but decreasing writes, one might
> expect a new algorithm which say, halves reads, but increases writes
> slightly to not realize the same benefits as under a normal disk system
> or a RAID 1 (or, better yet, a RAID 0+1) array.

RAID 5, not the operating system, might be getting in the way... it would
be interesting to test this on a Linux 2.2 kernel without the RAID 5
complication.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service
  and other goodies at http://donb.photo.net.
> At 06:24 PM 11/1/99 -0800, Mike Mascari wrote:
>> I know this is a VERY long shot, but... what were the READ/WRITE ratios
>> between the old version and the new version?  Perhaps the computation
>> of the checksum (sic) blocks under RAID5 caused the unexpected behavior.

Good try but no cigar --- we're dealing with a merge algorithm here,
and it's inherently the same amount of data in and out.  You write
a block once, you read the same block once later on.  But...

Don Baccus <dhogaza@pacifier.com> writes:
> RAID 5, not the operating system, might be getting in the way... it
> would be interesting to test this on a Linux 2.2 kernel without
> the RAID 5 complication.

... I agree this'd be worth trying.  There could be some subtle effect
somewhere in RAID5 that's tripping things up.  It'd also be useful if
someone could try it on similar RAID hardware with a non-Linux kernel.

			regards, tom lane
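The point about balanced I/O can be seen from a toy accounting of a generic
external merge sort: the run-forming pass writes every block to the temp
file once, each intermediate merge pass reads and rewrites every block once,
and the final merge only reads.  The run counts, merge order, and block
counts below are invented for illustration; this sketches the general
technique, not the actual PostgreSQL sort code.

/*
 * Toy accounting of temp-file I/O in a generic external merge sort.
 * All numbers are invented; only the read/write symmetry matters.
 */
#include <stdio.h>

int
main(void)
{
	long		blocks = 100000;	/* blocks of sort data on the temp file */
	long		runs = 1000;		/* initial runs after run formation */
	int			merge_order = 7;	/* runs merged per pass */
	long		reads = 0;
	long		writes = 0;

	writes += blocks;			/* run formation writes every block once */

	while (runs > merge_order)
	{
		/* intermediate pass: read every block, write it back in longer runs */
		reads += blocks;
		writes += blocks;
		runs = (runs + merge_order - 1) / merge_order;
	}

	reads += blocks;			/* final merge reads everything; the output
								 * goes to the caller, not the temp file */

	printf("temp-file reads: %ld, writes: %ld\n", reads, writes);
	return 0;
}

However the passes are arranged, temp-file reads equal temp-file writes, so
a change in the sort code should not by itself shift the read/write ratio.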
>> At 06:24 PM 11/1/99 -0800, Mike Mascari wrote:
>>> I know this is a VERY long shot, but... what were the READ/WRITE ratios
>>> between the old version and the new version?  Perhaps the computation
>>> of the checksum (sic) blocks under RAID5 caused the unexpected behavior.
>
> Good try but no cigar --- we're dealing with a merge algorithm here,
> and it's inherently the same amount of data in and out.  You write
> a block once, you read the same block once later on.  But...
>
> Don Baccus <dhogaza@pacifier.com> writes:
>> RAID 5, not the operating system, might be getting in the way... it
>> would be interesting to test this on a Linux 2.2 kernel without
>> the RAID 5 complication.
>
> ... I agree this'd be worth trying.  There could be some subtle effect
> somewhere in RAID5 that's tripping things up.  It'd also be useful if
> someone could try it on similar RAID hardware with a non-Linux kernel.

I have compared current with 6.5 using a 1,000,000-tuple table (243MB)
(I wanted to try a 2GB+ table, but 6.5 does not work in that case).  The
result was strange in that current is *faster* than 6.5!

RAID5
current 2:29
6.5.2   3:15

non-RAID
current 1:50
6.5.2   2:13

Seems my previous testing was done in the wrong way, or the behavior of
sorting might be different if the table size is changed?

Anyway, here is my test script.  First, edit the Makefile to set DB and
the number of tuples.  Then just type make.  That's all.
--
Tatsuo Ishii

[Attachment: sort.tar.gz, uuencoded in the original message]
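The uuencoded tarball does not decode here, so as a rough idea of what such
a test harness does (not the contents of sort.tar.gz; the column layout,
file name, and invocation below are guesses): generate N rows, COPY them
into a table, and time an ORDER BY over the whole thing.

/*
 * Hypothetical stand-in for the generator in sort.tar.gz; nothing here is
 * taken from the attachment.  Emits tab-separated rows suitable for COPY.
 *
 * Possible usage, also only an illustration:
 *   ./gentuples 1000000 | psql -c "COPY t FROM stdin" test
 *   time psql -c "SELECT * FROM t ORDER BY key" test > /dev/null
 */
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	long		ntuples = (argc > 1) ? atol(argv[1]) : 1000000;
	long		i;

	for (i = 0; i < ntuples; i++)
	{
		/* a pseudo-random sort key plus padding to fatten the rows */
		printf("%d\t%ld\tpadding-padding-padding-padding-padding\n",
			   rand(), i);
	}
	return 0;
}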
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> I have compared current with 6.5 using a 1,000,000-tuple table (243MB)
> (I wanted to try a 2GB+ table but 6.5 does not work in this case).  The
> result was strange in that current is *faster* than 6.5!

> RAID5
> current 2:29
> 6.5.2   3:15

> non-RAID
> current 1:50
> 6.5.2   2:13

> Seems my previous testing was done in the wrong way or the behavior of
> sorting might be different if the table size is changed?

Well, I feel better now, anyway ;-).  I thought that my first cut
ought to have been about the same speed as 6.5, and after I added
the code to slurp up multiple tuples in sequence, it should've been
faster than 6.5.  The above numbers seem to be in line with that
theory.

Next question: is there some additional effect that comes into play
once the table size gets really huge?  I am thinking maybe there's
some glitch affecting performance once the temp file size goes past
one segment (1GB).  Tatsuo, can you try sorts of, say, 0.9 and 1.1GB
to see if something bad happens at 1GB?  I could try rebuilding here
with a small RELSEG_SIZE, but right at the moment I'm not certain
I'd see the same behavior you do...

			regards, tom lane
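The 1GB figure comes from the segmented storage manager: relation and temp
files are split into segments of RELSEG_SIZE blocks, and crossing a segment
boundary means switching to a different physical file.  A minimal sketch of
the mapping follows; the 8K block size and 1GB segment size are the usual
defaults assumed here, not read from any particular build, and the real
logic lives in the md storage manager.

/*
 * Sketch of how a block number maps onto segmented relation files,
 * assuming BLCKSZ = 8192 and 1GB segments.
 */
#include <stdio.h>

#define BLCKSZ		8192
#define RELSEG_SIZE	131072		/* blocks per segment file: 1GB at 8K blocks */

int
main(void)
{
	long		blocknum = 140000;	/* a block just past the 1GB boundary */

	long		segno = blocknum / RELSEG_SIZE;		/* which physical file */
	long		segblock = blocknum % RELSEG_SIZE;	/* block within that file */

	printf("block %ld -> segment %ld, block %ld, byte offset %ld\n",
		   blocknum, segno, segblock, segblock * (long) BLCKSZ);
	return 0;
}

If anything misbehaves at that file switch (an extra open, lost read-ahead,
a seek back to offset zero), it would only show up once a sort spills more
than 1GB of temp data, which is exactly the symptom being chased.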
Tatsuo Ishii wrote:
>
> > ... I agree this'd be worth trying.  There could be some subtle effect
> > somewhere in RAID5 that's tripping things up.  It'd also be useful if
> > someone could try it on similar RAID hardware with a non-Linux kernel.
>
> I have compared current with 6.5 using a 1,000,000-tuple table (243MB)
> (I wanted to try a 2GB+ table but 6.5 does not work in this case).  The
> result was strange in that current is *faster* than 6.5!
>
> RAID5
> current 2:29
> 6.5.2   3:15
>
> non-RAID
> current 1:50
> 6.5.2   2:13
>
> Seems my previous testing was done in the wrong way or the behavior of
> sorting might be different if the table size is changed?

Or the behaviour of RAID5 changes at some size.

I have set up an IBM Netfinity server with specs similar to yours, except
that it has 1GB of memory and 5 x 9GB disks.  The RAID controller is an IBM
ServeRAID.  It seems that when I try to write more than 60MB sequentially,
the write performance drops from over 50MB/s to under 2MB/s.  Maybe such
behaviour would suggest that building an index and traversing that could be
faster than a full sort?

The same tests on my single Celeron 450 produced ~10MB/s writes whatever
the size.

> Anyway, here is my test script.  First, edit the Makefile to set DB and
> the number of tuples.  Then just type make.  That's all.

I'll try to run it tonight (in the GMT+2 tz).  Can't run it earlier as it
is a production site and a highly visible web server.  If I have the time
I'll even try my index theory.

--------
Hannu
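The 50MB/s-then-2MB/s observation is the sort of thing a simple sequential
write test shows once a controller's write cache fills.  A minimal version
is sketched below; the file name, buffer size, and total size are arbitrary,
there is no fsync/O_SYNC handling, and the timing is left to an external
"time" command (e.g. compare "time ./seqwrite 64" with "time ./seqwrite 512").

/*
 * Minimal sequential-write test: write total_mb megabytes to a scratch
 * file.  Data may still sit in the OS or controller cache at exit, so
 * interpret results with that in mind.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(int argc, char **argv)
{
	const size_t bufsize = 1024 * 1024;		/* 1MB per write */
	long		total_mb = (argc > 1) ? atol(argv[1]) : 256;
	char	   *buf = malloc(bufsize);
	FILE	   *fp = fopen("seqwrite.tmp", "wb");
	long		mb;

	if (buf == NULL || fp == NULL)
	{
		perror("setup");
		return 1;
	}
	memset(buf, 'x', bufsize);

	for (mb = 0; mb < total_mb; mb++)
	{
		if (fwrite(buf, 1, bufsize, fp) != bufsize)
		{
			perror("fwrite");
			return 1;
		}
	}
	fclose(fp);
	free(buf);
	return 0;
}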
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> I have compared current with 6.5 using a 1,000,000-tuple table (243MB)
>> (I wanted to try a 2GB+ table but 6.5 does not work in this case).  The
>> result was strange in that current is *faster* than 6.5!
>
>> RAID5
>> current 2:29
>> 6.5.2   3:15
>
>> non-RAID
>> current 1:50
>> 6.5.2   2:13
>
>> Seems my previous testing was done in the wrong way or the behavior of
>> sorting might be different if the table size is changed?
>
> Well, I feel better now, anyway ;-).  I thought that my first cut
> ought to have been about the same speed as 6.5, and after I added
> the code to slurp up multiple tuples in sequence, it should've been
> faster than 6.5.  The above numbers seem to be in line with that
> theory.  Next question: is there some additional effect that comes
> into play once the table size gets really huge?  I am thinking maybe
> there's some glitch affecting performance once the temp file size
> goes past one segment (1GB).  Tatsuo, can you try sorts of, say,
> 0.9 and 1.1GB to see if something bad happens at 1GB?  I could try
> rebuilding here with a small RELSEG_SIZE, but right at the moment
> I'm not certain I'd see the same behavior you do...

OK.  I have run some tests with various amounts of data.

RedHat Linux 6.0, kernel 2.2.5-smp
512MB RAM
Sort mem: 80MB
RAID5

 100 million tuples     1:31
 200                    4:24
 300                    7:27
 400                   11:11  <-- 970MB
 500                   14:01  <-- 1.1GB (segmented files)
 600                   18:31
 700                   22:24
 800                   24:36
 900                   28:12
1000                   32:14

I didn't see anything bad happen at 1.1GB (500 million).
--
Tatsuo Ishii