I chose to look at the OpenStreetMap (OSM) data for Düsseldorf, in the far west of Germany. I was raised in the city and moved back nine years ago. Fun fact: Düsseldorf is considered one of the cities with the highest living standards in the world. I downloaded the prepared extract from Mapzen.
Postal codes are fine
Inconsistent postal codes were not a problem here, unlike in the sample project. The German postal code system is very straightforward: always exactly five digits, and letters are not allowed. Even a simple regex search in the Atom editor (\s[0-9]{5} and [0-9]{5}\s), meant to catch postal codes with an unintentional whitespace character before or after them, didn't return any results. This is probably down to German accuracy and the simplicity of the system.
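The editor search can also be scripted. The following is a minimal sketch rather than the check actually used for this project: the function name `find_bad_postcodes` and the sample values are hypothetical, and it assumes the values have already been pulled out of the postal code tags (commonly `addr:postcode` in OSM).

```python
import re

# Exactly five digits, with nothing before or after (no whitespace, no letters).
# \Z is used instead of $ so that a trailing newline is rejected as well.
POSTCODE_RE = re.compile(r'^[0-9]{5}\Z')

def find_bad_postcodes(values):
    """Return the values that are not exactly five digits."""
    return [v for v in values if not POSTCODE_RE.match(v)]

# Hypothetical sample values, not taken from the dataset.
samples = ["40213", " 40213", "40213 ", "4021", "D-40213"]
print(find_bad_postcodes(samples))
```

An empty result for the real data would correspond to the empty search result described above.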
The Spelling of the Word “Straße”
Straße means street in German and is obviously one of the most important words in addresses. Because Straße is written with a distinctly German letter, the grapheme ß, I wondered whether some contributors preferred its substitute, ss. This was not the case: the ss spelling was used only in email addresses and URLs.
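A check of this kind could look roughly like the sketch below. It is written for Python 3, the helper name `uses_ss_spelling` and the sample names are hypothetical, and it assumes the street names have already been extracted (in OSM they usually sit under `addr:street`).

```python
import re

# Matches the substitute spelling "strasse" anywhere in a name, ignoring case.
SS_SPELLING_RE = re.compile(r'strasse', re.IGNORECASE)

def uses_ss_spelling(street_name):
    """Return True if a street name spells Straße with ss instead of ß."""
    return SS_SPELLING_RE.search(street_name) is not None

# Hypothetical street names, not taken from the dataset.
samples = ["Berliner Straße", "Berliner Strasse", "Hauptstrasse", "Königsallee"]
print([name for name in samples if uses_ss_spelling(name)])
```

Python's `re.IGNORECASE` does not fold ß to ss, so names spelled with ß are not flagged.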
Problems were encountered with phone and house numbers. This function was used to print them to gain an overview:
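import xml.etree.ElementTree as ET
from collections import defaultdict

def audit(osmfile):
    # Maps each phone number as entered to its cleaned version(s).
    phone_numbers = defaultdict(set)
    with open(osmfile, "rb") as osm_file:
        # "end" events are used because child tags are only guaranteed
        # to be fully parsed once the parent element has ended.
        for event, elem in ET.iterparse(osm_file, events=("end",)):
            if elem.tag == "node" or elem.tag == "way":
                for tag in elem.iter("tag"):
                    if tag.attrib['k'] == "phone":
                        # print phone numbers as entered, then the cleaned version
                        print(tag.attrib['v'])
                        print(clean_phone(tag.attrib['v']))
                        phone_numbers[tag.attrib['v']].add(clean_phone(tag.attrib['v']))
                    # uncomment if you want to see the house numbers
                    # if tag.attrib['k'] == "addr:housenumber":
                    #     print(tag.attrib['v'])
    return phone_numbers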
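As there's no single standard here, phone numbers were entered in multiple ways: with a leading 00 instead of +, with hyphens or slashes as separators, with a (0) after the country code, or with the local 0211 area code instead of +49 211.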
This function was written to mitigate the issue:
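import re

def clean_phone(phone_number):
    # Replace the international 00 prefix with +
    clean_number = re.sub(r'^00', '+', phone_number)
    # Use spaces instead of hyphens as separators
    clean_number = re.sub(r'-', ' ', clean_number)
    # Drop "/ " separators
    clean_number = re.sub(r'/\s', '', clean_number)
    # Remove the optional trunk zero, written as (0)
    clean_number = re.sub(r'\(0\)', '', clean_number)
    # Expand the local 0211 area code to the international form
    clean_number = re.sub(r'^0211', '+49 211', clean_number)
    return clean_number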
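To see what these substitutions do in practice, here are the same rules written as a table and applied in the same order to a few inputs. This is only an illustrative sketch for Python 3: the name `apply_rules` and the sample numbers are hypothetical and do not come from the dataset.

```python
import re

# The same substitutions as above, as (pattern, replacement) pairs applied in order.
RULES = [
    (r'^00', '+'),          # 0049... -> +49...
    (r'-', ' '),            # hyphens -> spaces
    (r'/\s', ''),           # drop "/ " separators
    (r'\(0\)', ''),         # drop the optional trunk zero
    (r'^0211', '+49 211'),  # local area code -> international form
]

def apply_rules(number):
    """Apply every rule in order and return the cleaned number."""
    for pattern, replacement in RULES:
        number = re.sub(pattern, replacement, number)
    return number

# Hypothetical inputs, not taken from the dataset.
for raw in ["0049 211 1234567", "+49-211-1234567",
            "0211 / 1234567", "+49 (0) 211 1234567"]:
    print(repr(raw), "->", repr(apply_rules(raw)))
```

The first three inputs end up as "+49 211 1234567". The last one keeps a double space where the (0) was removed, which is one example of a case these rules do not yet cover.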
House numbers stretching more than a single number (e.g., 22-24) have been entered differently, such as:
However, most numbers are in the first format, which should therefore be treated as the correct form.
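House number ranges were not cleaned in this project. If they were, a normalizer might look like the sketch below. It assumes that the target form is the plain hyphenated one from the example (22-24); the variants it accepts (spaces around the hyphen, an en dash, letter suffixes) are guesses, and `normalize_housenumber` is a hypothetical name.

```python
import re

# A range: two house numbers (digits plus an optional letter) joined by a
# hyphen or an en dash, with optional spaces around the separator.
RANGE_RE = re.compile(r'^\s*(\d+[a-zA-Z]?)\s*[-\u2013]\s*(\d+[a-zA-Z]?)\s*$')

def normalize_housenumber(value):
    """Rewrite a recognised range as 'start-end'; leave anything else untouched."""
    match = RANGE_RE.match(value)
    if match:
        return match.group(1) + "-" + match.group(2)
    return value

# Hypothetical inputs, not taken from the dataset.
for raw in ["22-24", "22 - 24", "22\u201324", "22a - 24b", "22", "22, 24"]:
    print(repr(raw), "->", repr(normalize_housenumber(raw)))
```

Values that are not clearly a range, such as comma-separated lists, are deliberately left alone because they may describe separate houses rather than one span.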
All queries can be found in the query.py file. This is the output of the file:
duesseldorf.osm.........................: 662M
dussosm.db..............................: 390M
nodes.csv...............................: 212M
nodes_tags.csv..........................: 23M
sample.osm..............................: 6M
ways.csv................................: 29M
ways_nodes.csv..........................: 91M
ways_tags.csv...........................: 64M
SELECT COUNT(*) FROM nodes;
2760081
SELECT COUNT(*) FROM ways;
number of ways 523867
SELECT COUNT(DISTINCT(uid)) FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways);
number of contributors (2292,)
SELECT user, COUNT(*) as num
FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways)
GROUP BY user
ORDER BY num DESC LIMIT 10;
(u'black_bike', 585316)
(u'Antikalk', 442759)
(u'Zyras', 230609)
(u'mighty_eighty', 180603)
(u'WoGraSo', 173613)
(u'EinKonstanzer', 173138)
(u'rurseekatze', 158304)
(u'rabenkind', 135121)
(u'Athemis', 133331)
(u'Kettwicht', 99781)
SELECT sub.value, COUNT(*) as num
FROM (SELECT key, value FROM nodes_tags UNION ALL SELECT key, value FROM ways_tags) sub
WHERE sub.key = 'amenity'
GROUP BY sub.value
ORDER BY num DESC LIMIT 10;
(u'parking', 5573)
(u'bench', 2934)
(u'recycling', 1358)
(u'restaurant', 1187)
(u'waste_basket', 1107)
(u'post_box', 803)
(u'bicycle_parking', 718)
(u'shelter', 686)
(u'vending_machine', 685)
(u'fast_food', 576)
SELECT sub.value, COUNT(*) as num
FROM (SELECT key, value FROM nodes_tags UNION ALL SELECT key, value FROM ways_tags) sub
WHERE sub.key = 'suburb'
GROUP BY sub.value
ORDER BY num DESC LIMIT 10;
(u'Strümp', 1703)
(u'Wersten', 1012)
(u'Bilk', 993)
(u'Gerresheim', 989)
(u'Unterrath', 793)
(u'Benrath', 770)
(u'Pempelfort', 732)
(u'Eller', 725)
(u'Düsseltal', 700)
(u'Oberkassel', 680)
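The query.py file itself is not reproduced here. As a rough sketch of how one of the queries above could be run from Python with the standard sqlite3 module, the example below uses a tiny in-memory database in place of dussosm.db. The helper name `top_values`, the reduced table layout, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# Same shape as the amenity and suburb queries above, with the key and the
# limit passed as parameters instead of being written into the SQL.
TOP_VALUES_SQL = """
SELECT sub.value, COUNT(*) AS num
FROM (SELECT key, value FROM nodes_tags
      UNION ALL
      SELECT key, value FROM ways_tags) sub
WHERE sub.key = ?
GROUP BY sub.value
ORDER BY num DESC
LIMIT ?;
"""

def top_values(conn, key, limit=10):
    """Return (value, count) rows for one tag key across nodes and ways."""
    return conn.execute(TOP_VALUES_SQL, (key, limit)).fetchall()

# In-memory stand-in for dussosm.db with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT)")
conn.execute("CREATE TABLE ways_tags (id INTEGER, key TEXT, value TEXT)")
conn.executemany("INSERT INTO nodes_tags VALUES (?, ?, ?)",
                 [(1, "amenity", "bench"), (2, "amenity", "bench"),
                  (3, "amenity", "parking")])
conn.executemany("INSERT INTO ways_tags VALUES (?, ?, ?)",
                 [(10, "amenity", "parking"), (11, "amenity", "parking"),
                  (12, "highway", "residential")])

for row in top_values(conn, "amenity"):
    print(row)
```

Passing the key as a bound parameter avoids quoting issues and lets the same statement serve both the amenity and the suburb query.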
For this project, I only corrected some of the wrongly formatted phone numbers. However, contributors have entered phone numbers in many more ways (see above), so the function needs to be extended to cover all cases. OSM could perhaps provide a strict input mask that defines how phone numbers are entered. The same applies to house numbers, which should also be saved consistently.
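A strict mask could be as simple as one regular expression that every entered number must match. The sketch below assumes a canonical form of "+", a country code, and space-separated digit groups (the form the cleaning function aims for, e.g. +49 211 1234567). This pattern is an assumption for illustration, not an OSM rule, and `is_canonical_phone` is a hypothetical name.

```python
import re

# Assumed canonical form: "+", a 1-3 digit country code, then one or more
# digit groups, each preceded by exactly one space.
CANONICAL_PHONE_RE = re.compile(r'^\+[1-9][0-9]{0,2}( [0-9]+)+\Z')

def is_canonical_phone(value):
    """Return True if the value already matches the assumed canonical form."""
    return CANONICAL_PHONE_RE.match(value) is not None

# Hypothetical inputs, not taken from the dataset.
for raw in ["+49 211 1234567", "0211 1234567", "+49  211 1234567",
            "+49 211-1234567"]:
    print(repr(raw), is_canonical_phone(raw))
```

Numbers that fail the mask could be rejected at entry time or flagged for manual review after cleaning.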
Benefits are:
Potential issues are:
I have a background in the insurance industry, where more and more companies contact (potential) customers when they reach a specific spot or location (e.g. a customer at 2000 m above sea level is offered accident insurance). They call these situational offers.
Google Maps is going a similar way: it asks users questions about a location when they enter restaurants or cafés. Like Google, OSM could combine the situational approach with some gamification.
One idea: whenever a contributor enters an amenity known to OSM, they are asked for any missing information. Contributors receive points and badges and can compare themselves on a regional leaderboard.
Benefits are:
Potential issues are:
The data for Düsseldorf is generally in good shape. However, more consistency and more contributions are always beneficial.