Обсуждение: atoi-like function: is there a better way to do this?
One of our tables has a few columns that may be interpreted as strings or may be numbers (data type is varchar, numbers are stored as decimal). Generally, operations are performed on the string, but sometimes we need to parse out a number - without it failing on error. I wrote the following function to approximate to the semantics of atoi: create or replace function str2int(val varchar) returns bigint immutable as $$ begin val=substring(val from '[0-9]*'); if length(val) between 1 and 19 then return val::bigint; end if; return 0; end; $$ language plpgsql; It uses a regular expression to chop the string down to just the leading digits (leading only - '123.45' should parse as 123). Is there a more efficient way to achieve this? Thanks! Chris Angelico
(Hoping you meant for that reply to be on-list as I'm here responding on-list.) On Mon, Mar 5, 2012 at 2:16 PM, A.M. <agentm@themactionfaction.com> wrote: > > On Mar 4, 2012, at 9:13 PM, Chris Angelico wrote: > >> One of our tables has a few columns that may be interpreted as strings >> or may be numbers (data type is varchar, numbers are stored as >> decimal). Generally, operations are performed on the string, but >> sometimes we need to parse out a number - without it failing on error. >> I wrote the following function to approximate to the semantics of >> atoi: > > I would recommend against such a schema since different data types should warrant their own columns, but if you are leftwith no choice... The values have to be strings for other reasons (eg '' is valid everywhere, and this is subsequently processed by a script that expects all strings). So yeah, no choice there. But I agree that normally you DO want integers stored in integer columns, and we're paying a performance penalty for this. >> >> create or replace function str2int(val varchar) returns bigint immutable as $$ >> begin >> val=substring(val from '[0-9]*'); >> if length(val) between 1 and 19 then return val::bigint; end if; >> return 0; >> end; >> $$ language plpgsql; > > This can be written as: > select substring('35345345.45645654' from '\d{1,19}')::bigint; > > Be aware that this does not account for: > 3dogs > 3 dogs > 3,dogs > 3.5.6.7 > > all of which will return 3::bigint, but I assume that the column is not completely free-form or maybe this is what youwant. (?) > > Cheers, > M That is in fact the behaviour I want. Trouble is, that simpler version returns NULL if given 'asdf' as its input - I want it to return 0. It's also about the same performance (slightly slower in a quick test) than the original; it's still doing a regular expression parse. I was hoping very much to avoid the regex altogether. ChrisA
On Mar 4, 2012, at 22:31, Chris Angelico <rosuav@gmail.com> wrote: > (Hoping you meant for that reply to be on-list as I'm here responding on-list.) > > On Mon, Mar 5, 2012 at 2:16 PM, A.M. <agentm@themactionfaction.com> wrote: >> >> On Mar 4, 2012, at 9:13 PM, Chris Angelico wrote: >> >>> One of our tables has a few columns that may be interpreted as strings >>> or may be numbers (data type is varchar, numbers are stored as >>> decimal). Generally, operations are performed on the string, but >>> sometimes we need to parse out a number - without it failing on error. >>> I wrote the following function to approximate to the semantics of >>> atoi: >> >> I would recommend against such a schema since different data types should warrant their own columns, but if you are leftwith no choice... > > The values have to be strings for other reasons (eg '' is valid > everywhere, and this is subsequently processed by a script that > expects all strings). So yeah, no choice there. But I agree that > normally you DO want integers stored in integer columns, and we're > paying a performance penalty for this. > >>> >>> create or replace function str2int(val varchar) returns bigint immutable as $$ >>> begin >>> val=substring(val from '[0-9]*'); >>> if length(val) between 1 and 19 then return val::bigint; end if; >>> return 0; >>> end; >>> $$ language plpgsql; >> >> This can be written as: >> select substring('35345345.45645654' from '\d{1,19}')::bigint; >> >> Be aware that this does not account for: >> 3dogs >> 3 dogs >> 3,dogs >> 3.5.6.7 >> >> all of which will return 3::bigint, but I assume that the column is not completely free-form or maybe this is what youwant. (?) >> >> Cheers, >> M > > That is in fact the behaviour I want. Trouble is, that simpler version > returns NULL if given 'asdf' as its input - I want it to return 0. COALESCE is your friend > It's also about the same performance (slightly slower in a quick test) > than the original; it's still doing a regular expression parse. I was > hoping very much to avoid the regex altogether. > > ChrisA > Any efficient, non-RegEx, alternative would require more context to evaluate than you provide. Mainly, would it be fasterto have a separate field to store the parsed (at input) number and then query that field directly (even if it is atext field as well)? Basically cache the parse. David J..
On Mon, Mar 5, 2012 at 2:50 PM, David Johnston <polobo@yahoo.com> wrote: > Any efficient, non-RegEx, alternative would require more context to evaluate than you provide. Mainly, would it be fasterto have a separate field to store the parsed (at input) number and then query that field directly (even if it is atext field as well)? Basically cache the parse. Caching's looking tempting, but I don't know if it'll be worth it (these fields won't be searched-as-int very often compared to search-as-string, and there's potentially a lot of such fields). All I need out of it is the leading digits - I can strip them with trim(), but I can't keep _only_ those digits. The other possibility that may be of value is to write the function in C instead of pl/pgsql, which will then actually call atoi() itself. Is that going to be a better option? ChrisA
On 05/03/12 04:06, Chris Angelico wrote: > On Mon, Mar 5, 2012 at 2:50 PM, David Johnston<polobo@yahoo.com> wrote: >> Any efficient, non-RegEx, alternative would require more context to evaluate than you provide. Mainly, would it be fasterto have a separate field to store the parsed (at input) number and then query that field directly (even if it is atext field as well)? Basically cache the parse. > Caching's looking tempting, but I don't know if it'll be worth it > (these fields won't be searched-as-int very often compared to > search-as-string, and there's potentially a lot of such fields). All I > need out of it is the leading digits - I can strip them with trim(), > but I can't keep _only_ those digits. > > The other possibility that may be of value is to write the function in > C instead of pl/pgsql, which will then actually call atoi() itself. Is > that going to be a better option? Can you use to_number() here? It sounds like something along the lines of cast(to_number('0' || field::varchar, '999999999.') as int) might give the behaviour you're after, and a quick test seems to indicate that it's about 4x faster than the original function: postgres=# explain analyze select cast(to_number('0' || generate_series::varchar, '999999999.') as int) from generate_series(1,1000000); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------- Function Scan on generate_series (cost=0.00..22.50 rows=1000 width=4) (actual time=137.720..1065.752 rows=1000000 loops=1) Total runtime: 1144.993 ms (2 rows) postgres=# explain analyze select str2int(generate_series::varchar) from generate_series(1,1000000); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------ Function Scan on generate_series (cost=0.00..265.00 rows=1000 width=4) (actual time=135.180..4022.408 rows=1000000 loops=1) Total runtime: 4121.233 ms (2 rows) Hopefully there's a cleaner way of writing that without a long list of 9s in the format string, and if the field is nullable I'd guess you probably need a coalesce(..., 0) around that as well. cheers, Tom
On Mon, Mar 5, 2012 at 3:15 PM, Tom Molesworth <tom@audioboundary.com> wrote: > Can you use to_number() here? It sounds like something along the lines of > cast(to_number('0' || field::varchar, '999999999.') as int) might give the > behaviour you're after, and a quick test seems to indicate that it's about > 4x faster than the original function: I looked at to_number but it seems to find digits anywhere inside the field - to_number('12.34','99999') returns 1234, but I want it to stop at the decimal. Nice trick with prepending a zero though - I didn't think of that. That may save some hassle! ChrisA
On Mar 5, 2012, at 0:08, Chris Angelico <rosuav@gmail.com> wrote: > On Mon, Mar 5, 2012 at 3:15 PM, Tom Molesworth <tom@audioboundary.com> wrote: >> Can you use to_number() here? It sounds like something along the lines of >> cast(to_number('0' || field::varchar, '999999999.') as int) might give the >> behaviour you're after, and a quick test seems to indicate that it's about >> 4x faster than the original function: > > I looked at to_number but it seems to find digits anywhere inside the > field - to_number('12.34','99999') returns 1234, but I want it to stop > at the decimal. > > Nice trick with prepending a zero though - I didn't think of that. > That may save some hassle! > > ChrisA > > 0alpha999 -> 0 alpha999 -> 999
On 05/03/12 05:08, Chris Angelico wrote: > On Mon, Mar 5, 2012 at 3:15 PM, Tom Molesworth<tom@audioboundary.com> wrote: >> Can you use to_number() here? It sounds like something along the lines of >> cast(to_number('0' || field::varchar, '999999999.') as int) might give the >> behaviour you're after, and a quick test seems to indicate that it's about >> 4x faster than the original function: > I looked at to_number but it seems to find digits anywhere inside the > field - to_number('12.34','99999') returns 1234, but I want it to stop > at the decimal. Right - that's why I included the . in the format string. Haven't done exhaustive testing but it seemed to do what you were after. cheers, Tom