January 6, 2015 by
Standardized Data Trees (UN M.49, ISO 3166-1, ISO 3166-2, country population/area, NAICS)
This was originally posted on Blogger.
To help in aggregating, comparing, and validating data, I've created a pair of graphs that represent tree hierarchies of standard-formatted data.
The World Graph contains:
- UN M.49 country codes
- ISO 3166-1 alpha-2 and alpha-3 codes from pycountry
- ISO 3166-2 country subdivision codes from pycountry
- World Bank country population data
- World Bank country geographical area data
- Aggregation of population and geographical area data at higher levels of the graph
- (Some population and geographical area data filled in from Wikipedia where missing from the World Bank)
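The aggregation step in the list above can be sketched as a simple bottom-up sum. This is a minimal illustration in plain Python, not the code used to build the World Graph: the `children` mapping, the `aggregate` helper, and the leaf values are all assumptions made for the example. The M.49 codes are real, but the population numbers are placeholders, not World Bank figures.

```python
# Toy parent -> children tree keyed by UN M.49 codes (hypothetical subset).
children = {
    "001": ["019", "142"],  # World -> Americas, Asia
    "019": ["021"],         # Americas -> Northern America
    "021": ["840", "124"],  # Northern America -> United States, Canada
    "142": ["035"],         # Asia -> South-Eastern Asia
    "035": ["764", "704"],  # South-Eastern Asia -> Thailand, Viet Nam
}

# Placeholder leaf values, NOT real population figures.
leaf_population = {"840": 300, "124": 40, "764": 70, "704": 90}


def aggregate(node, children, leaf_values, totals):
    """Store in totals[node] the sum of all leaf values under node, and return it."""
    kids = children.get(node, [])
    if not kids:
        # Leaf: use its own value, or 0 if the source data is missing.
        totals[node] = leaf_values.get(node, 0)
    else:
        totals[node] = sum(
            aggregate(kid, children, leaf_values, totals) for kid in kids
        )
    return totals[node]


totals = {}
aggregate("001", children, leaf_population, totals)
```

After this runs, every node in `totals` carries the sum of the leaves below it, so a region can be treated like a country when a record is coded at a higher level.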
World Graph Visualized
The NAICS Graph contains:
- 2012 NAICS codes in hierarchical form
- Percentage of the graph under each node
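As a rough sketch of how a "percentage of the graph under each node" could be computed, the snippet below counts, for every node, how many codes sit in its subtree (including the node itself) and divides by the total number of codes. That definition is my assumption for illustration; the NAICS Graph may count only leaf codes or weight things differently. The `parent` mapping, the `"root"` label, and the `subtree_share` helper are hypothetical names, and the tree is a tiny subset of NAICS codes.

```python
from collections import Counter

# Toy child -> parent mapping for a handful of NAICS codes (hypothetical subset).
parent = {
    "11": "root",    # sector
    "111": "11",     # subsector
    "112": "11",
    "1111": "111",   # industry group
    "1112": "111",
    "21": "root",
    "211": "21",
}


def subtree_share(parent, root="root"):
    """Return {node: percentage of all codes that lie in that node's subtree}."""
    counts = Counter()
    for code in parent:
        # Credit this code to itself and to every ancestor up to the root.
        node = code
        while node != root:
            counts[node] += 1
            node = parent[node]
        counts[root] += 1
    total = len(parent)
    return {node: 100.0 * n / total for node, n in counts.items()}


share = subtree_share(parent)
```

In this toy tree, sector `11` covers five of the seven codes and sector `21` covers two, so the shares of the top-level sectors add up to the root's 100 percent.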
The percentage of the graph under each node in the NAICS Graph, and the aggregated population and geographic area in the World Graph, allow two things:
- Provide an amount as a dimension for data coded in these systems (whether that amount is a percentage of NAICS codes, population, or geographic area).
- Provide a means of comparing the similarity of two records by finding the Lowest Common Ancestor (LCA) and retrieving the score (percentage, population, or area) from that node. The greater the score, the greater the distance between the nodes.
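The LCA comparison can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation behind the graphs: the `parent` mapping, the `score` values (placeholders, not real population data), and the helper names are all hypothetical.

```python
# Toy child -> parent mapping keyed by UN M.49 codes (hypothetical subset).
parent = {
    "019": "001", "142": "001",  # Americas, Asia -> World
    "021": "019", "035": "142",  # Northern America, South-Eastern Asia
    "840": "021", "124": "021",  # United States, Canada
    "764": "035", "704": "035",  # Thailand, Viet Nam
}

# Placeholder aggregate scores per node, NOT real figures.
score = {
    "001": 500, "019": 340, "142": 160, "021": 340, "035": 160,
    "840": 300, "124": 40, "764": 70, "704": 90,
}


def ancestors(node, parent):
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(a, b, parent):
    """Return the deepest node that lies on both root paths, or None."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    return None


def distance(a, b, parent, score):
    """Score stored at the LCA of a and b; larger means further apart."""
    lca = lowest_common_ancestor(a, b, parent)
    if lca is None:
        raise ValueError(f"{a} and {b} share no ancestor")
    return score[lca]


near = distance("840", "124", parent, score)  # LCA is Northern America
far = distance("840", "764", parent, score)   # LCA is World
```

Because each root path includes the node itself, the two inputs do not need to be on the same level: comparing `764` with `035` returns the score of `035`. One consequence of this simple version is that comparing a node with itself returns that node's own score rather than zero.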
The hierarchies can be used for validating data as well as for comparing things (as above) that are not on the same level of the graph. For example, a US state could be compared to a country or to South-Eastern Asia.
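For the validation use, one simple check is whether a record's subdivision code actually sits under its country code in the tree. The sketch below assumes a child -> parent mapping that mixes ISO 3166-2, ISO 3166-1 alpha-2, and M.49 codes; the mapping and the `is_under` and `validate_record` helpers are hypothetical names for illustration only.

```python
# Toy child -> parent mapping (hypothetical subset of the World Graph).
parent = {
    "US-CA": "US", "US-TX": "US",  # ISO 3166-2 subdivisions of the US
    "CA-ON": "CA",                 # Ontario, Canada
    "US": "021", "CA": "021",      # countries -> Northern America (M.49)
    "021": "019", "019": "001",    # -> Americas -> World
}


def is_under(code, ancestor, parent):
    """True if ancestor appears somewhere on the path from code to the root."""
    node = code
    while node in parent:
        node = parent[node]
        if node == ancestor:
            return True
    return False


def validate_record(country, subdivision, parent):
    """Return a description of the problem, or None if the pair is consistent."""
    known = set(parent) | set(parent.values())
    if country not in known:
        return f"unknown country code: {country}"
    if subdivision not in known:
        return f"unknown subdivision code: {subdivision}"
    if not is_under(subdivision, country, parent):
        return f"{subdivision} is not part of {country}"
    return None


ok = validate_record("US", "US-CA", parent)
mismatch = validate_record("CA", "US-TX", parent)
unknown = validate_record("US", "US-ZZ", parent)
```

Here `ok` is `None`, while the other two calls return messages describing the mismatch and the unknown code.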
None of this is groundbreaking, but hopefully some find the graphs of use.