Re: extracting location info from string

From: Tarlika Elisabeth Schmitz
Subject: Re: extracting location info from string
Date:
Msg-id: 20110523233949.07fb0005@dick.coachhouse
In reply to: Re: extracting location info from string  (Andrej <andrej.groups@gmail.com>)
Responses: Re: extracting location info from string  (Craig Ringer <craig@postnewspapers.com.au>)
List: pgsql-sql
On Mon, 23 May 2011 13:11:24 +1200
Andrej <andrej.groups@gmail.com> wrote:

>On 23 May 2011 10:00, Tarlika Elisabeth Schmitz
><postgresql3@numerixtechnology.de> wrote:
>> On Sun, 22 May 2011 21:05:26 +0100
>> Tarlika Elisabeth Schmitz <postgresql3@numerixtechnology.de> wrote:
>>
>>>A column contains location information, which may contain any of the
>>>following:
>>>
>>>1) null
>>>2) country name (e.g. "France")
>>>3) city name, region name (e.g. "Bonn, Nordrhein-Westfalen")
>>>4) city name, Rg. region name (e.g. "Frankfurt, Rg. Hessen")
>>>5) city name, Rg region name (e.g. "Frankfurt, Rg Hessen")
>>
>>
>> I also need to cope with variations of COUNTRY.NAME and REGION.NAME.
>
>sanitising that data will be tedious - particularly with the
>variations on region.

Indeed. However, the situation is not quite as bleak as it appears:
- I am only dealing with 50 countries (Luxembourg and the Vatican are
not amongst them).
- Only for two countries will city/region be displayed instead of the
country.
- Ultimately, the only important bit of information is the country.
- The only really important persons are those from the two countries.


>Another thing of great import is whether the city can occur in the
>data column all by itself; if yes, it's next to impossible to
>distinguish it from a country.

Unfortunately this is the case.

>Specially if we assume that
>typos/misspelling are feasible on top of punctuation ... 

and former countries like Czechoslovakia ...

>If I had a task like that to perform I'd dump the data out to file
>and have a good go at it w/ sed & awk, or perl, depending on
>how complex & voluminous the data is.

I had a quick look at the data (maybe not the most efficient SQL):

SELECT id, name,
       CASE
         WHEN location IS NULL THEN NULL
         WHEN location !~ '.*,.*' THEN
           (SELECT id FROM country WHERE name = location)
         ELSE
           (SELECT country FROM county WHERE name =
              regexp_replace(location, '.*, (Rg\\.? )?(.*)', '\\2'))
       END AS country,
       location
FROM temp_person
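As a quick sanity check of the region-stripping pattern outside the
database, the same regex can be exercised in Python (re.sub uses the
same syntax, minus the doubled backslashes SQL string literals require).
This is just a sketch of the pattern's behaviour, not part of the query:

```python
import re

def region_from_location(location):
    """Strip the city and the optional 'Rg.'/'Rg' marker, keeping the region.

    Mirrors regexp_replace(location, '.*, (Rg\\.? )?(.*)', '\\2'):
    the greedy '.*' anchors on the last ', ' in the string, and the
    optional group swallows a leading 'Rg ' or 'Rg. ' before the region.
    Strings without a comma are returned unchanged (no match).
    """
    return re.sub(r'.*, (Rg\.? )?(.*)', r'\2', location)
```

So "Frankfurt, Rg. Hessen" and "Frankfurt, Rg Hessen" both reduce to
"Hessen", while a bare country name like "France" passes through untouched.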


Of 17,000 historical records, 4,400 don't match this simple pattern.
Of those 4,400, 1,300 are "USA" or "Usa" instead of "United States",
and 900 are "North America", whatever that is! There are also plenty
of common and valid region abbreviations.

I get about 1000 new records of this type per year.

I presume that more recent data are more accurate. I know I won't be
able to clean them all up. 

However, the import process has to be as automatic as possible, with
inconsistencies flagged for later manual intervention. I say later
because, for instance, a person's data will have to be imported with
or without location info, as other new data will link to it.
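The flag-and-continue import described above could be sketched roughly
as follows. The COUNTRIES set, the ALIASES table, and the function name
are all hypothetical stand-ins for the real country table and a manually
curated alias list; anything unresolved is recorded for review while the
person record is imported anyway:

```python
# Hypothetical lookup data - in practice this comes from the country
# table plus a hand-maintained alias list ("USA" -> "United States").
COUNTRIES = {"France", "Germany", "United States"}
ALIASES = {"USA": "United States", "Usa": "United States"}

def resolve_country(location, unresolved):
    """Return a canonical country name, or None.

    Unrecognised values are appended to `unresolved` so they can be
    fixed manually later; the import itself is never blocked.
    """
    if location is None:
        return None  # missing location: import the person anyway
    if "," not in location:
        name = ALIASES.get(location, location)
        if name in COUNTRIES:
            return name
    # "city, region" strings would be resolved via a region->country
    # lookup here; for the sketch, everything else is just flagged.
    unresolved.append(location)
    return None
```

The point of the design is that resolution failures degrade to a review
queue rather than aborting the import.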

-- 

Best Regards,
Tarlika Elisabeth Schmitz

