postgis/extensions/address_standardizer
Sandro Santilli 3d41b08dc0 Fix distlean
- Run clean on loader's distclean
- Do not generate extensions/upgrade-paths-rules.mk

Move the hint about install-extension-upgrades-from-known-versions
in top-level GNUMakefile

Closes #5346
2023-02-22 01:11:19 +01:00
..
expected Update address_standardizer_data_us tests. References #5299 for PostGIS 3.4 2022-12-08 20:22:36 -05:00
sql_bits #3191: address_standardizer_data_us should respect is_custom flag during restore (also revised make to discard artifacts when make clean is called), got rid of sql folder (gets recreated during make), and add sql_bits to hold config for restore 2015-08-06 05:20:14 +00:00
address_parser.c Support PCRE2 where available and PCRE otherwise, closes #5019 2021-11-25 08:52:53 -08:00
address_standardizer_data_us.control.in Generate control file from Makefile rather than ./configure 2018-01-12 12:39:37 +00:00
address_standardizer_functions.sql.in #3183: Logic to upgrade the address_standardizer extensions 2015-08-20 19:18:21 +00:00
address_standardizer_types.sql.in Formatting: remove trailing whitespace from .sql files. 2018-01-17 20:39:03 +00:00
address_standardizer.c Include for text_to_cstring 2019-02-22 17:26:46 +00:00
address_standardizer.control.in Generate control file from Makefile rather than ./configure 2018-01-12 12:39:37 +00:00
analyze.c Address standardizer warning 2018-08-01 09:49:55 +00:00
COPYING Prep to move address_standardizer into extensions folder 2014-08-06 19:35:52 +00:00
err_param.c Remove trailing blanks from .c files 2017-09-17 20:59:20 +00:00
export.c Style only change: remove trailing whitespaces 2016-04-27 14:45:28 +00:00
gamma.c #3847, coverity fixes in topology and address standardizer 2017-09-14 13:56:57 +00:00
gamma.h Prep to move address_standardizer into extensions folder 2014-08-06 19:35:52 +00:00
hash.c Prep to move address_standardizer into extensions folder 2014-08-06 19:35:52 +00:00
hash.h Prep to move address_standardizer into extensions folder 2014-08-06 19:35:52 +00:00
khash.h Prep to move address_standardizer into extensions folder 2014-08-06 19:35:52 +00:00
lexicon.c Style only change: remove trailing whitespaces 2016-04-27 14:45:28 +00:00
Makefile.in Fix distlean 2023-02-22 01:11:19 +01:00
micro-macro.sql Prep to move address_standardizer into extensions folder 2014-08-06 19:35:52 +00:00
mk-city-regex.pl Standardize shebangs 2020-07-01 11:00:58 +02:00
mk-sql.pl Standardize shebangs 2020-07-01 11:00:58 +02:00
mk-st-regexp.pl Standardize shebangs 2020-07-01 11:00:58 +02:00
pagc_api.h Fix Travis GCC warnings 2018-09-13 13:48:42 +00:00
pagc_common.h Style only change: remove trailing whitespaces 2016-04-27 14:45:28 +00:00
pagc_std_api.h Get rid of compiler warnings 2015-09-18 06:28:49 +00:00
pagc_tools.c Style only change: remove trailing whitespaces 2016-04-27 14:45:28 +00:00
pagc_tools.h Style only change: remove trailing whitespaces 2016-04-27 14:45:28 +00:00
parseaddress-api.c parseaddress-api.c: remove unused variable 'comma' 2022-06-15 14:19:53 +02:00
parseaddress-api.h Support PCRE2 where available and PCRE otherwise, closes #5019 2021-11-25 08:52:53 -08:00
parseaddress-regex.h Prep to move address_standardizer into extensions folder 2014-08-06 19:35:52 +00:00
parseaddress-stcities.h Style only change: remove trailing whitespaces 2016-04-27 14:45:28 +00:00
README.address_standardizer Prep to move address_standardizer into extensions folder 2014-08-06 19:35:52 +00:00
standard.c Style only change: remove trailing whitespaces 2016-04-27 14:45:28 +00:00
std_pg_hash.c One more pg16 compile issue 2022-12-14 14:59:53 -08:00
std_pg_hash.h Prep to move address_standardizer into extensions folder 2014-08-06 19:35:52 +00:00
test_main.c Adding some commandline options to test_main.c for debugging. 2016-08-31 02:23:02 +00:00
test-init-extensions.sql.in Formatting: remove trailing whitespace from .sql files. 2018-01-17 20:39:03 +00:00
test-parseaddress.sql.in revert commit r16327 of this file. Was causing parse-address test to fail. 2018-02-20 22:11:36 +00:00
test-standardize_address_1.sql.in Update address_standardizer_data_us tests. References #5299 for PostGIS 3.4 2022-12-08 20:22:36 -05:00
test-standardize_address_2.sql.in #3184: regression tests for address_standardizer - implement make installcheck 2015-08-22 06:49:09 +00:00
test2.sql Prep to move address_standardizer into extensions folder 2014-08-06 19:35:52 +00:00
tokenize.c Address compiler warning 2018-09-24 11:13:47 +00:00
us_gaz.sql Style only change: remove trailing whitespaces 2016-04-27 14:45:28 +00:00
us_lex.sql 1. Change to VALUES so easier to skip over entries already present 2022-12-08 18:18:22 -05:00
us_rules.sql Formatting: remove trailing whitespace from .sql files. 2018-01-17 20:39:03 +00:00
usps-st-city-adds.txt Prep to move address_standardizer into extensions folder 2014-08-06 19:35:52 +00:00
usps-st-city-orig.txt Prep to move address_standardizer into extensions folder 2014-08-06 19:35:52 +00:00

This is a fork of the PAGC standardizer and a single line address parser.
The code is built into a single postgresql extension library.

Portions of this code belong to their respective contributors.
This code is released under an MIT-X license.

Copyright (c) 2006-2014 Stephen Woodbridge.
Copyright (c) 2008 Walter Bruce Sinclair

woodbri@swoodbridge.com
woodbr@imaptools.com

Also read files COPYING

-------------------------------------------------------------------------------

Makefile                - PGXS makefile
mk-city-regex.pl        - Perl script to create parseaddress-regex.h
mk-st-regexp.pl         - Perl script to create parseaddress-stcities.h
README.address_standardizer - this file
COPYING                 - License file
usps-st-city-adds.txt   - add local additions of ST<tab>CITY NAME
usps-st-city-orig.txt   - Steve's extract of USPS city names

mk-city-regex           - created by make
mk-st-regexp            - created by make
usps-st-city-name.txt   - created by make
                          from usps-st-city-orig.txt and usps-st-city-adds.txt
parseaddress-regex.h    - created by make and mk-st-regexp
parseaddress-stcities.h - created by make and mk-city-regex
                          from usps-st-city-name.txt

-------------------------------------------------------------------------------
PREREQUISITES:

o Postgresql headers and PGXS tools

o Perl 5 and Perl module Regexp::List which can be install with:
    sudo perl -MCPAN -e "install Regexp::Assemble"

o libpcre and headers
    sudo apt-get install libpcre3-dev libpcre3

-------------------------------------------------------------------------------
Build and Install:

make
sudo make install

For postgresql 9.1+ this will install all the files need for CREATE EXTENSION

createdb testdb
psql -c "create extension address_standardizer"


-------------------------------------------------------------------------------

How the parser works

The parser works from right to left looking first at the macro elements 
for postcode, state/province, city, and then looks micro elements to determine
if we are dealing with a house number street or intersection or landmark.
It currently does not look for a country code or name, but that could be
introduced in the future.

Country code
------------

Assumed to be US or CA based on:

    postcode as US or Canada
    state/province as US or Canada
    else US

Postcode/zipcode
----------------

These are recognized using Perl compatible regular expressions.
These regexs are currently in the parseaddress-api.c and are relatively
simple to make changes to if needed.

State/province
--------------

These are recognized using Perl compatible regular expressions.
These regexs are currently in the parseaddress-api.c but could get moved
into includes in the future for easier maintenance.

City name
---------

This part is rather complicated and there are lots of issues around ambiguities
as to where to split a series of tokens when a token might belong to either
the city or the street name. The current strategy follows something like this:

    1. if we have a state, then get the city regex for that state
    2. if we can match that to the end of our remaining address string then
       extract the city name and continue.
    3. if we do not have a state or fail to match it then
       cycle through a series of regex patterns that try to separate the city
       from the street, stop and extract the city if we match

Number street name

    1. check for a leading house number, and extract that
    2. if there is an '@' then split the string on the '@' into street and
       street2 else put the rest into street

------------------------------------------------------------------------------

Managing the regexes

The regexes are used to recognize US states and Canadian provinces
and USPS city names.

City regexes
------------

usps-st-city-orig.txt  - this file contains all the acceptable USPS city
                         names by state. I periodically extract these from the
                         USPS and generate this file. I do NOT recommend
                         editing this file. 
usps-st-city-adds.txt  - this file you can add new definitions to if you need
                         them. The format of both these files is:
                         <StateAbbrev><tab><CityName>

These files are assembled into usps-st-city-name.txt which is compiled by a
perl script mk-city-regex.pl into parseaddress-stcities.h which is used to
lookup the city regex for a specific state or province.

As I mentioned above is these fail to detect the city, then a secondary
strategy is is deployed by cycling through a list of regex patterns. These
patterns and regexes are generated by mk-st-regexp.pl which creates the
parseaddress-regex.h include. This is a perl script so you can view and edit
it if that is needed.

I think that there might be some room for improved in the area if coodinating
this process with PAGC's lexicon.csv and gazeteer.csv in the future.

----------------------------------------------------------------------------

Author: API: Stephen Woodbridge <woodbri@imaptools.com>
        PAGC: Walter Sinclar

This is a first attempt at extracting the PAGC standardizer code into a
separate library. This was done so I could attempt to wrap it into a 
postgresql stored procedure. (see the directory psql for that).

This code is a part of PAGC and is release under an MIT-X license.

Assumptions:

Linux
Sudo is installed and user has access to it.

PAGC does compile under Windows so you might get this to compile into a dll.

Build and installation:

Read the Makefile and change as appropriate.

make clean
make
make install
make test_main  # build the CLI interactive test program


Author: Stephen Woodbridge <woodbri@imaptools.com>

These are postgresql wrappers for PAGC standardizer and address parser.
These are released un an MIT-X style license.

Assumptions:

Linux
sudo is installed and user has sudo access (see Makefile)
postgresql 8.3 (make changes in the Makefile to change)

Build and Installation:

make
make install

# create a new database using a postgis template
createdb -T template_postgis -E LATIN1 mydb

# add the stored procedures
psql mydb -f /path/to/standardize_address.sql

Install PAGC lexicon.csv, gazeteer.csv or rules.txt using a perl script.

./pagc-data-psql lex lexicon.csv  | psql mydb
./pagc-data-psql gaz gazeteer.csv | psql mydb
./pagc-data-psql rules rules.txt  | psql mydb

Now you should be able to test some queries like:

psql mydb

select * from parse_address('2099 university ave w, saint paul, mn, 55104-3431');
select * from parse_address('university ave w @ main st, saint paul, mn, 55104-3431');

select * from parse_address('385 Landgrove Rd  Landgrove VT 05148');
-- "385";"Landgrove Rd";"";"385 Landgrove Rd";"Landgrove";"VT";"05148";"";"US"

select * from standardize_address(
        'select seq, word::text, stdword::text, token from gaz union all select seq, word::text, stdword::text, token from lex ',
        'select seq, word::text, stdword::text, token from gaz order by id',
        'select * from rules order by id',
        'select 0::int4 as id, ''1071 B Ave''::text as micro, ''Loxley, AL 36551''::text as macro');

select * from standardize_address(
        'select seq, word::text, stdword::text, token from lex order by id',
        'select seq, word::text, stdword::text, token from gaz order by id',
        'select * from rules order by id',
        'select 0::int4 as id, ''116 commonwealth ave apt a''::text as micro, ''west concord, ma 01742''::text as macro');

\q