Compute a matching table between two lists of municipality IDs
Source:R/swc_get_mapping.R
swc_get_mapping.Rd
For two lists of Swiss municipality IDs at any two points in time, this function creates a data frame with two columns where each row represents a match between municipality IDs. This can be used as an intermediate table for merging two data sets with municipality identifiers taken at different, possibly unknown, points in time.
Arguments
- ids_from
A list of "source" municipality IDs, preferably a factor
- ids_to
A list of "target" municipality IDs, preferably a factor
Value
A data frame with columns prefixed by from.
and to
that
represents the computed match. The municipality IDs are stored in the
columns from.mId
and to.mId
. The columns
from.MergeType
and to.MergeType
contain valid
if
the municipality is contained in both the input and the mapping table,
missing
if the municipality is missing from the input, and
extra
if the municipality is in the input but not in the mapping
table; most columns are NA
for such rows. In addition, the column
MergeType
offers a summary of the "from" and "to" status: Rows with
values other than "valid"
or "missing"
should be examined.
Details
It is advisable to use factors as list of municipality IDs. By that, comparisons and merges for municipality IDs are automatically checked for compatibility.
Note that the "from" list must be from an earlier time than the "to" list. Trying to compute the mapping the other way round results in an error. This is intentional: As municipalities are usually merged, it makes sense to use the most recent data set as target for the mapping. This can also be a file with suitable geometries to allow for visualization.
For two lists of municipalities, we construct a mapping from the first list to the second. First, the most probable mutation number in the "municipality mutations" data set is computed.
Examples
library(dplyr)
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
data(SwissPop)
data(SwissBirths)
# Show mismatch of municipality IDs:
ids_from <- with(SwissPop, MunicipalityID)
ids_to <- with(SwissBirths, MunicipalityID)
setdiff(ids_from, ids_to)
#> [1] "327" "328" "330" "343" "384" "531" "601" "618" "621" "752"
#> [11] "753" "851" "854" "862" "864" "882" "887" "926" "933" "974"
#> [21] "984" "986" "994" "1027" "1028" "1029" "1034" "1035" "1036" "1038"
#> [31] "1042" "1060" "1087" "1090" "1101" "1105" "1106" "1124" "1133" "1134"
#> [41] "1141" "1144" "1148" "1149" "1601" "1602" "1603" "1604" "1605" "1606"
#> [51] "1607" "1608" "1609" "1610" "1611" "1612" "1613" "1614" "1615" "1616"
#> [61] "1617" "1618" "1619" "1620" "1621" "1622" "1623" "1624" "1625" "1626"
#> [71] "1627" "1628" "1629" "2001" "2002" "2003" "2007" "2012" "2017" "2018"
#> [81] "2020" "2023" "2026" "2031" "2032" "2036" "2037" "2042" "2046" "2062"
#> [91] "2064" "2069" "2070" "2071" "2074" "2075" "2077" "2081" "2082" "2083"
#> [101] "2085" "2088" "2091" "2092" "2093" "2094" "2095" "2100" "2101" "2103"
#> [111] "2105" "2107" "2108" "2109" "2110" "2112" "2132" "2133" "2136" "2139"
#> [121] "2141" "2142" "2144" "2146" "2150" "2151" "2154" "2156" "2158" "2159"
#> [131] "2161" "2176" "2180" "2181" "2182" "2188" "2190" "2191" "2202" "2203"
#> [141] "2204" "2210" "2214" "2215" "2219" "2227" "2229" "2232" "2241" "2247"
#> [151] "2253" "2263" "2269" "2282" "2322" "2324" "2326" "2327" "2329" "2330"
#> [161] "2331" "2332" "2443" "2447" "2460" "2494" "2496" "2531" "2552" "2905"
#> [171] "2911" "2912" "2913" "2916" "2918" "2934" "3335" "3336" "3351" "3354"
#> [181] "3355" "3357" "3371" "3376" "3406" "3523" "3573" "3591" "3600" "3601"
#> [191] "3602" "3632" "3639" "3641" "3642" "3664" "3665" "3666" "3667" "3692"
#> [201] "3702" "3704" "3706" "3709" "3710" "3771" "3773" "3774" "3775" "3776"
#> [211] "3841" "3842" "3843" "3844" "3845" "3846" "3892" "3912" "3913" "3914"
#> [221] "3915" "3928" "3930" "3942" "3943" "3944" "3971" "3973" "3984" "4011"
#> [231] "4036" "4043" "4070" "4101" "4116" "4118" "4162" "4168" "4171" "4174"
#> [241] "4178" "4180" "4225" "4278" "5011" "5016" "5031" "5032" "5033" "5034"
#> [251] "5035" "5036" "5037" "5038" "5039" "5040" "5041" "5042" "5043" "5044"
#> [261] "5045" "5046" "5047" "5062" "5065" "5066" "5067" "5068" "5069" "5070"
#> [271] "5074" "5075" "5080" "5092" "5093" "5094" "5098" "5101" "5104" "5106"
#> [281] "5107" "5110" "5111" "5114" "5116" "5122" "5123" "5127" "5128" "5134"
#> [291] "5142" "5145" "5147" "5150" "5153" "5156" "5158" "5159" "5164" "5165"
#> [301] "5168" "5169" "5175" "5177" "5179" "5182" "5183" "5184" "5185" "5188"
#> [311] "5190" "5191" "5201" "5204" "5209" "5211" "5215" "5217" "5218" "5220"
#> [321] "5223" "5228" "5232" "5234" "5235" "5241" "5244" "5245" "5246" "5247"
#> [331] "5248" "5252" "5256" "5258" "5259" "5262" "5264" "5265" "5267" "5301"
#> [341] "5302" "5303" "5305" "5306" "5308" "5311" "5312" "5313" "5314" "5316"
#> [351] "5318" "5319" "5320" "5321" "5322" "5433" "5452" "5453" "5454" "5455"
#> [361] "5457" "5459" "5460" "5461" "5462" "5463" "5517" "5519" "5524" "5525"
#> [371] "5526" "5528" "5532" "5536" "5538" "5558" "5567" "5569" "5570" "5602"
#> [381] "5603" "5605" "5608" "5612" "5630" "5641" "5647" "5677" "5681" "5687"
#> [391] "5691" "5781" "5783" "5784" "5786" "5787" "5793" "5794" "5795" "5796"
#> [401] "5797" "5800" "5801" "5802" "5811" "5814" "5815" "5818" "5820" "5823"
#> [411] "5824" "5825" "5826" "5829" "5901" "5916" "5917" "5918" "5920" "5927"
#> [421] "5936" "6051" "6059" "6060" "6062" "6063" "6065" "6066" "6067" "6070"
#> [431] "6071" "6085" "6086" "6088" "6103" "6107" "6174" "6175" "6176" "6179"
#> [441] "6180" "6196" "6200" "6231" "6233" "6237" "6245" "6247" "6251" "6457"
#> [451] "6460" "6501" "6502" "6503" "6505" "6506" "6507" "6508" "6509" "6510"
#> [461] "6746" "6747" "6749" "6752" "6755" "6756" "6772" "6777" "6779" "6780"
#> [471] "6786" "6788" "6791" "6794" "6795" "6796" "6797" "6798" "6799" "6801"
#> [481] "6802" "6804" "6805"
setdiff(ids_to, ids_from)
#> [1] "632" "756" "855" "888" "948" "1151" "1630" "1631" "1632" "2050"
#> [11] "2051" "2052" "2114" "2115" "2116" "2162" "2233" "2234" "2235" "2337"
#> [21] "2338" "2503" "3340" "3358" "3359" "3378" "3617" "3671" "3713" "3792"
#> [31] "3847" "3931" "3932" "3955" "4049" "4184" "5048" "5049" "5050" "5137"
#> [41] "5138" "5236" "5237" "5238" "5269" "5323" "5324" "5397" "5398" "5464"
#> [51] "5540" "5541" "5571" "5613" "5804" "5805" "5831" "6074" "6075" "6076"
#> [61] "6090" "6118" "6181" "6203" "6204" "6252" "6461" "6512" "6807" "6808"
#> [71] "6809" "6810"
# Compute mapping and count non-matching municipality IDs:
mapping <- swc_get_mapping(ids_from = ids_from, ids_to = ids_to)
#> 2023-11-05 05:21:05.776557 INFO::meltMutations
#> 2023-11-05 05:21:05.938916 INFO::meltMutations
#> 2023-11-05 05:21:05.953015 INFO::computeMunList
#> 2023-11-05 05:21:05.967004 INFO::meltMutations
#> 2023-11-05 05:21:05.99671 INFO::meltMutations
#> 2023-11-05 05:21:06.012439 INFO::computeMunList
with(mapping, sum(mapping$mIdAsNumber.from != mapping$mIdAsNumber.to))
#> [1] 487
# Communes that are "missing" are mostly lakes and other special communes:
subset(mapping, MatchType == "missing")[, c("mIdAsNumber.from", "mShortName.from")]
#> mIdAsNumber.from mShortName.from
#> 253 5238 C'za Corticiasca/Valcol.
#> 2222 4510 Sulgen
# These should be looked at in some detail, and fixed manually:
subset(mapping, !(MatchType %in% c("valid", "missing")))
#> mHistId.from cAbbreviation.from mId.from
#> 589 11250 TI <NA>
#> 590 11251 TI <NA>
#> mLongName.from mShortName.from
#> 589 C'za Bidogno/Sala Capriasca/Corticiasca C'za Bid./Sala C./Corti.
#> 590 C'za Sala Capriasca/Vaglio/Lugaggia C'za Sala C./Vag./Lugag.
#> MatchType.from mHistId.to cAbbreviation.to mId.to mLongName.to
#> 589 missing 14938 TI 5226 Capriasca
#> 590 missing 14938 TI 5226 Capriasca
#> mShortName.to MatchType.to mIdAsNumber.from mIdAsNumber.to MatchType
#> 589 Capriasca valid 5236 5226 missing.from
#> 590 Capriasca valid 5237 5226 missing.from
# Test for injectivity. The result shows that the mapping is almost injective,
# only one "from" commune is mapped to more than one other "to" commune.
# This situation requires further examination.
mapping.dupes <- subset(mapping, duplicated(mIdAsNumber.from))
(noninjective.mapping <- subset(
mapping, mIdAsNumber.from %in% mapping.dupes$mIdAsNumber.from
))
#> mHistId.from cAbbreviation.from mId.from mLongName.from mShortName.from
#> 610 11403 TI 5246 Caneggio Caneggio
#> 611 11403 TI 5246 Caneggio Caneggio
#> MatchType.from mHistId.to cAbbreviation.to mId.to mLongName.to
#> 610 valid 14499 TI 5249 Castel San Pietro
#> 611 valid 14974 TI 5269 Breggia
#> mShortName.to MatchType.to mIdAsNumber.from mIdAsNumber.to MatchType
#> 610 Castel San Pietro valid 5246 5249 valid
#> 611 Breggia valid 5246 5269 valid
# Simple treatment (just for this example): Remove duplicates, and use only
# valid matches:
cleaned.mapping <- subset(
mapping,
!duplicated(mIdAsNumber.from) & MatchType == "valid"
)
# Now merge the two datasets based on the mapping table:
SwissPop.1970 <- subset(SwissPop, Year == "1970")
SwissPopMapping.1970 <- merge(SwissPop.1970,
cleaned.mapping[, c("mId.from", "mId.to")],
by.x = "MunicipalityID", by.y = "mId.from"
)
# Datasets from the "from" table must be suitably aggregated. For the given
# case of population totals we use the sum.
SwissPopMapping.1970.agg <- group_by(
SwissPopMapping.1970,
mId.to,
HouseholdSize
) %>%
summarize(Households = sum(Households))
#> `summarise()` has grouped output by 'mId.to'. You can override using the
#> `.groups` argument.
with(SwissPopMapping.1970.agg, stopifnot(
length(unique(mId.to)) * length(levels(HouseholdSize)) ==
length(mId.to)
))
# The aggregated "from" dataset now can be merged with the "to" dataset:
SwissBirths.1970 <- subset(SwissBirths, Year == "1970")
SwissPopBirths.1970 <- merge(SwissPopMapping.1970.agg, SwissBirths.1970,
by.x = "mId.to", by.y = "MunicipalityID"
)
# Some more communes are still missing from the 1970 statistics, although
# the matches are valid:
subset(mapping, mIdAsNumber.to %in% setdiff(
SwissPopMapping.1970.agg$mId.to, SwissBirths.1970$MunicipalityID
))[
,
c("mId.from", "mShortName.from", "MatchType")
]
#> mId.from mShortName.from MatchType
#> 16 3533 Marmorera valid
#> 30 3503 Mutten valid
#> 36 3593 Duvin valid
#> 42 3583 Schnaus valid
#> 48 3577 Luven valid
#> 122 3803 Braggio valid
#> 135 3806 Cauco valid
#> 146 3743 Lavin valid
#> 147 3742 Guarda valid
#> 176 3811 Selma valid
#> 197 3631 Almens valid
#> 198 3616 Waltensburg/Vuorz valid
#> 217 3711 Rongellen valid
#> 220 3708 Mathon valid
#> 221 3707 Lohn (GR) valid
#> 222 3706 Innerferrera valid
#> 226 3702 Ausserferrera valid
#> 234 3670 Urmein valid
#> 235 3669 Tschappina valid
#> 346 5129 Sonogno valid
#> 390 5688 Syens valid
#> 439 5732 Vich valid
#> 441 5233 Vico Morcote valid
#> 452 5690 Villars-le-Comte valid
#> 455 5651 Villars-Sainte-Croix valid
#> 458 5935 Villars-Epeney valid
#> 479 3923 Castiel valid
#> 549 5650 Vaux-sur-Morges valid
#> 576 5932 Ursins valid
#> 591 5173 Certara valid
#> 609 5307 Campo (Vallemaggia) valid
#> 627 2008 Châtillon (FR) valid
#> 641 5475 Chavannes-le-Veyron valid
#> 650 5906 Chanéaz valid
#> 720 5668 Cremin valid
#> 728 5479 Cuarnens valid
#> 816 5102 Corippo valid
#> 933 5621 Aclens valid
#> 976 6172 Bister valid
#> 980 5155 Bogno valid
#> 983 4302 Böbikon valid
#> 985 5747 Bofflens valid
#> 992 5622 Bremblens valid
#> 999 5096 Brione sopra Minusio valid
#> 1049 5513 Bioley-Orjulaz valid
#> 1224 5109 Gresso valid
#> 1247 2079 Grangettes valid
#> 1315 5671 Dompierre (VD) valid
#> 1427 4106 Mönthal valid
#> 1500 5488 Mauraz valid
#> 1502 4314 Mellikon valid
#> 1503 389 Meienried valid
#> 1512 5012 Moleno valid
#> 1540 6216 Mex (VS) valid
#> 1730 6109 Inden valid
#> 1800 6178 Martisberg valid
#> 1803 5676 Martherenges valid
#> 1872 5315 Linescio valid
#> 1911 5564 Novalles valid
#> 1964 5798 Ropraz valid
#> 2016 5081 Sobrio valid
#> 2031 5762 Sergey valid
#> 2042 5686 Sarzens valid
#> 2167 2216 Pierrafortscha valid
#> 2372 532 Bangerten valid
#> 2375 535 Deisswil b. Münchenbuch. valid
#> 2455 629 Oberhünigen valid
#> 2464 4084 Islisberg valid
#> 2467 362 Ittigen valid
#> 2468 363 Ostermundigen valid
#> 2474 6716 Mettembert valid
#> 2476 3613 Pigniu valid
#> 2521 2522 Hersiwil valid
#> 2540 4826 Mammern valid
#> 2541 631 Trimstein valid
#> 2542 630 Allmendingen valid
#> 2550 4643 Bottighofen valid
#> 2576 2851 Kilchberg (BL) valid
#> 2805 972 Berken valid
#> 2835 4621 Warth-Weiningen valid
#> 2839 4716 Bettwiesen valid
#> 2860 6728 Vellerat valid
#> 2865 4724 Eschlikon valid
#> 2881 4786 Wilen (TG) valid
#> 2887 4546 Schlatt (TG) valid
#> 2891 4723 Braunau valid
# The "from" list must be from an earlier time than the "to" list.
try(swc_get_mapping(ids_from = ids_to, ids_to = ids_from))
#> 2023-11-05 05:21:10.145095 INFO::meltMutations
#> 2023-11-05 05:21:10.187964 INFO::meltMutations
#> 2023-11-05 05:21:10.199548 INFO::computeMunList
#> 2023-11-05 05:21:10.211782 INFO::meltMutations
#> 2023-11-05 05:21:10.233549 INFO::meltMutations
#> 2023-11-05 05:21:10.241401 INFO::computeMunList
#> Error in getMunicipalityMappingWorker(mutations, hist.list.from, mid.from, :
#> mid.from <= mid.to is not TRUE