How do you implement the advanced applications of spatial joins described in Chapter 16?

Summary: Chapter 16: Spatial Join. Spatial join is one of the most commonly used operations in geospatial analysis.

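Chapter 16: Spatial Join

Spatial join is one of the most commonly used operations in geospatial analysis. Based on the spatial relationship between two sets of geographic features (such as intersection, containment, or proximity), it attaches the attributes of one GeoDataFrame to another. This chapter covers how to use sjoin() and sjoin_nearest() in GeoPandas, how to configure their parameters, and how to optimize performance.

16.1 Overview of Spatial Joins

16.1.1 What Is a Spatial Join

A spatial join attaches the attributes of one dataset to another according to the spatial relationship between their features. It is analogous to pandas merge(), which matches on column values, except that the match is decided by the spatial relationship between geometries. For example:

- Link each POI (point of interest) to the administrative district that contains it
- Link buildings to the flood-risk zone they fall in
- Link traffic accidents to the nearest road segment

16.1.2 Spatial Join vs. Attribute Join

| Feature | Attribute join (merge) | Spatial join (sjoin) |
|---|---|---|
| Join basis | Column value match | Spatial relationship match |
| Input data | DataFrame / GeoDataFrame | Two GeoDataFrames |
| Matching method | Exact value or key match | Geometric relationship test |
| Typical scenario | Merging statistical tables | Point-in-polygon analysis |
| Performance factors | Data volume | Geometric complexity and spatial index |

16.1.3 Spatial Join Workflow

A spatial join usually goes through the following steps:

1. Spatial index filtering: use an R-tree to quickly select candidate feature pairs that might match
2. Exact predicate test: evaluate the exact spatial relationship (such as intersects or within) on each candidate pair
3. Attribute merge: merge the attributes of matched features into the result in the requested way

The sample data below is reused by later examples in this chapter:

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Point data - POIs
pois = gpd.GeoDataFrame({
    'poi_name': ['Mall A', 'School B', 'Hospital C', 'Park D', 'Restaurant E'],
    'category': ['Commercial', 'Education', 'Medical', 'Leisure', 'Dining'],
    'geometry': [
        Point(116.40, 39.92),
        Point(116.42, 39.94),
        Point(116.38, 39.90),
        Point(116.44, 39.96),
        Point(116.41, 39.93)
    ]
}, crs="EPSG:4326")

# Polygon data - administrative districts (simplified rectangles that partly overlap)
districts = gpd.GeoDataFrame({
    'district': ['Chaoyang District', 'Haidian District'],
    'geometry': [
        Polygon([(116.39, 39.91), (116.45, 39.91), (116.45, 39.97), (116.39, 39.97)]),
        Polygon([(116.36, 39.88), (116.41, 39.88), (116.41, 39.93), (116.36, 39.93)])
    ]
}, crs="EPSG:4326")

print("POI data:")
print(pois)
print("\nDistrict data:")
print(districts)
```

16.2 sjoin() Basics

16.2.1 Basic Syntax

sjoin() is the spatial join function provided by GeoPandas. Its basic syntax is:

```python
geopandas.sjoin(left_df, right_df, how='inner', predicate='intersects')
```

It is also available as a GeoDataFrame method:

```python
left_df.sjoin(right_df, how='inner', predicate='intersects')
```

16.2.2 The Simplest Spatial Join

```python
# Spatially join POIs with districts
result = gpd.sjoin(pois, districts, how='inner', predicate='intersects')
print(result)
```

The output contains each POI's original attributes plus the attributes of the district it falls in.

16.2.3 The how Parameter - Join Type

The how parameter controls the join type, similar to JOIN types in SQL:

| how value | Description | Rows kept |
|---|---|---|
| 'inner' | Inner join (default) | Only rows that have a match on both sides |
| 'left' | Left join | All rows of the left table; NaN where the right table has no match |
| 'right' | Right join | All rows of the right table; NaN where the left table has no match |

```python
# Inner join - keep only POIs that have a match
inner_result = gpd.sjoin(pois, districts, how='inner')
print(f"Inner join rows: {len(inner_result)}")

# Left join - keep all POIs; district attributes are NaN for POIs outside every district
left_result = gpd.sjoin(pois, districts, how='left')
print(f"Left join rows: {len(left_result)}")

# Right join - keep all districts
right_result = gpd.sjoin(pois, districts, how='right')
print(f"Right join rows: {len(right_result)}")
```

16.2.4 The predicate Parameter - Spatial Predicate

The predicate parameter specifies how the spatial relationship is tested:

```python
# intersects (default)
result_intersects = gpd.sjoin(pois, districts, predicate='intersects')

# within - left geometry lies inside right geometry
result_within = gpd.sjoin(pois, districts, predicate='within')

# contains - left geometry contains right geometry
result_contains = gpd.sjoin(districts, pois, predicate='contains')
```

16.3 sjoin() Parameters in Detail

16.3.1 lsuffix and rsuffix

When the two GeoDataFrames have columns with the same name, lsuffix and rsuffix distinguish where each column came from:

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Both datasets have a 'name' column
gdf1 = gpd.GeoDataFrame({
    'name': ['Point A', 'Point B'],
    'value': [10, 20],
    'geometry': [Point(0, 0), Point(1, 1)]
})
gdf2 = gpd.GeoDataFrame({
    'name': ['Area 1'],
    'area_km2': [100],
    'geometry': [Polygon([(-0.5, -0.5), (1.5, -0.5), (1.5, 1.5), (-0.5, 1.5)])]
})

# The default suffixes are 'left' and 'right'
result = gpd.sjoin(gdf1, gdf2, how='inner')
print(result.columns.tolist())
# Output: ['name_left', 'value', 'geometry', 'index_right', 'name_right', 'area_km2']

# Custom suffixes
result = gpd.sjoin(gdf1, gdf2, how='inner', lsuffix='poi', rsuffix='zone')
print(result.columns.tolist())
```

16.3.2 Index of the Result

With how='inner' or how='left', the result keeps the left table's index and adds an index_right column that records the original index of the matched right-table row. With how='right', the result keeps the right table's index and adds an index_left column instead:

```python
result = gpd.sjoin(pois, districts)
print(result.index)           # the index of pois is preserved
print(result['index_right'])  # index of the matched row in districts
```

16.3.3 One-to-Many Matches

When one feature matches several features, the result contains repeated index values:

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# One point that falls inside two overlapping polygons
points = gpd.GeoDataFrame({
    'name': ['Center point'],
    'geometry': [Point(0.5, 0.5)]
})
polygons = gpd.GeoDataFrame({
    'zone': ['Zone A', 'Zone B'],
    'geometry': [
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(0.3, 0.3), (1.3, 0.3), (1.3, 1.3), (0.3, 1.3)])
    ]
})

result = gpd.sjoin(points, polygons)
print(f"Input points: {len(points)}, output rows: {len(result)}")
print(result)
# The center point appears twice, matched to Zone A and Zone B
```

16.4 Common Spatial Predicates

16.4.1 Predicate Overview

| Predicate | Description | Typical scenario |
|---|---|---|
| intersects | The geometries share any part of space | General-purpose join (default) |
| within | Left geometry lies completely inside right geometry | Point in polygon |
| contains | Left geometry completely contains right geometry | Polygon contains point |
| crosses | The geometries cross: they share some interior points, but neither contains the other | Road passing through a zone |
| touches | The geometries meet only at their boundaries | Adjacent-zone detection |
| overlaps | Geometries of the same dimension partly overlap | Zone overlap analysis |
| covers | Left geometry covers right geometry | Coverage analysis |
| covered_by | Left geometry is covered by right geometry | Covered-by analysis |

16.4.2 intersects

intersects is the broadest predicate: two geometries match as long as they share any part of space, including boundary contact:

```python
import geopandas as gpd
from shapely.geometry import Point, LineString, Polygon

# Polygon
poly = gpd.GeoDataFrame({
    'name': ['Zone'],
    'geometry': [Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])]
})

# Test how different geometries relate to the polygon under intersects
tests = gpd.GeoDataFrame({
    'desc': ['interior point', 'boundary point', 'exterior point',
             'crossing line', 'exterior line'],
    'geometry': [
        Point(1, 1),                    # inside
        Point(2, 2),                    # on the boundary
        Point(3, 3),                    # outside
        LineString([(1, -1), (1, 3)]),  # crosses the polygon
        LineString([(3, 0), (4, 0)])    # outside
    ]
})

result = gpd.sjoin(tests, poly, predicate='intersects')
print("Features matched by intersects:")
print(result['desc'].tolist())
# Output: ['interior point', 'boundary point', 'crossing line']
```

16.4.3 within and contains

within and contains are inverse relationships:

```python
# within: left geometry is inside right geometry
result_within = gpd.sjoin(pois, districts, predicate='within')
print("within rows:", len(result_within))

# contains: left geometry contains right geometry (note the swapped table order)
result_contains = gpd.sjoin(districts, pois, predicate='contains')
print("contains rows:", len(result_contains))

# Both find the same matched pairs; only the point of view differs
```

16.4.4 crosses and touches

```python
import geopandas as gpd
from shapely.geometry import LineString, Polygon

# Road data
roads = gpd.GeoDataFrame({
    'road': ['Arterial road', 'Ring road', 'Branch road'],
    'geometry': [
        LineString([(0, 1), (3, 1)]),  # passes through the zone
        LineString([(0, 0), (2, 0)]),  # runs along the zone boundary
        LineString([(3, 0), (4, 1)])   # completely outside
    ]
})

# Zone data
zones = gpd.GeoDataFrame({
    'zone': ['Core zone'],
    'geometry': [Polygon([(0.5, 0), (2.5, 0), (2.5, 2), (0.5, 2)])]
})

# crosses - the geometries intersect but neither fully contains the other
result_crosses = gpd.sjoin(roads, zones, predicate='crosses')
print("Roads crossing the core zone:", result_crosses['road'].tolist())

# touches - boundary contact only
result_touches = gpd.sjoin(roads, zones, predicate='touches')
print("Roads touching the core zone boundary:", result_touches['road'].tolist())
```

16.4.5 Choosing the Right Predicate

| Analysis need | Recommended predicate | Notes |
|---|---|---|
| Which zone does a point fall in | within | Strictest; excludes points on the boundary |
| Which features relate to a zone | intersects | Broadest; includes every related feature |
| Which features does a zone contain | contains | Mind the left/right table order |
| Which zones does a road pass through | crosses | Matches crossings only |
| Finding adjacent zones | touches | Boundary adjacency only |

16.5 sjoin_nearest() - Nearest-Neighbor Join

16.5.1 Basic Concept

sjoin_nearest() joins each feature of the left table to the closest feature of the right table, even when the two do not intersect. This is very useful when the datasets do not overlap exactly.

```python
geopandas.sjoin_nearest(left_df, right_df, how='inner', max_distance=None, distance_col=None)
```

16.5.2 Basic Usage

```python
import geopandas as gpd
from shapely.geometry import Point

# Accident locations
accidents = gpd.GeoDataFrame({
    'accident_id': [1, 2, 3],
    'severity': ['Severe', 'Minor', 'Moderate'],
    'geometry': [Point(116.40, 39.91), Point(116.43, 39.95), Point(116.37, 39.89)]
}, crs="EPSG:4326")

# Hospital locations
hospitals = gpd.GeoDataFrame({
    'hospital': ['Xiehe Hospital', 'Renmin Hospital', 'China-Japan Friendship Hospital'],
    'beds': [2000, 1500, 1800],
    'geometry': [Point(116.41, 39.92), Point(116.38, 39.90), Point(116.44, 39.96)]
}, crs="EPSG:4326")

# Find the nearest hospital for each accident
result = gpd.sjoin_nearest(accidents, hospitals)
print(result[['accident_id', 'severity', 'hospital', 'beds']])
```

16.5.3 The max_distance Parameter

max_distance limits how far the nearest-neighbor search may reach. Matches farther away than this distance are excluded:

```python
# Project to a planar CRS so that distances are in meters
accidents_proj = accidents.to_crs(epsg=32650)
hospitals_proj = hospitals.to_crs(epsg=32650)

# Only look for the nearest hospital within 3000 meters
result = gpd.sjoin_nearest(
    accidents_proj,
    hospitals_proj,
    max_distance=3000  # unit: meters (depends on the CRS)
)
print(f"Accidents with a match within 3000 meters: {len(result)}")
```

Note: the unit of max_distance depends on the GeoDataFrame's CRS. With a geographic CRS (such as EPSG:4326) the unit is degrees; with a projected CRS it is the CRS's linear unit, typically meters.

16.5.4 The distance_col Parameter

distance_col names a column in the result that stores the actual distance:

```python
# Compute and store the distance
result = gpd.sjoin_nearest(
    accidents_proj,
    hospitals_proj,
    distance_col='dist_to_hospital'
)
print(result[['accident_id', 'hospital', 'dist_to_hospital']])
```

Example output format (the distance values are illustrative, not computed from the data above):

| accident_id | hospital | dist_to_hospital |
|---|---|---|
| 1 | Xiehe Hospital | 1432.5 |
| 2 | China-Japan Friendship Hospital | 1587.3 |
| 3 | Renmin Hospital | 1245.8 |

16.5.5 Handling Ties

When one feature of the left table is equally distant from several features of the right table, all of the equidistant features are returned:

```python
import geopandas as gpd
from shapely.geometry import Point

# One point that is equidistant from two others
center = gpd.GeoDataFrame({
    'name': ['Center'],
    'geometry': [Point(0, 0)]
})
targets = gpd.GeoDataFrame({
    'target': ['East', 'West'],
    'geometry': [Point(1, 0), Point(-1, 0)]
})

result = gpd.sjoin_nearest(center, targets, distance_col='dist')
print(result)
# May return two rows, because East and West are the same distance away
```

16.6 Performance Optimization

16.6.1 Automatic Use of the Spatial Index

sjoin() automatically uses the right table's spatial index to speed up execution:

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon
import numpy as np
import time

np.random.seed(42)

# Create a large dataset
n_points = 100000
points = gpd.GeoDataFrame({
    'id': range(n_points),
    'geometry': [Point(x, y) for x, y in zip(
        np.random.uniform(0, 100, n_points),
        np.random.uniform(0, 100, n_points)
    )]
})

# Create a 10 x 10 grid of polygons
polygons = []
names = []
for i in range(10):
    for j in range(10):
        polygons.append(Polygon([
            (i*10, j*10), ((i+1)*10, j*10),
            ((i+1)*10, (j+1)*10), (i*10, (j+1)*10)
        ]))
        names.append(f"grid_{i}_{j}")
grid = gpd.GeoDataFrame({'name': names, 'geometry': polygons})

# The spatial join uses the spatial index automatically
start = time.time()
result = gpd.sjoin(points, grid, predicate='within')
elapsed = time.time() - start
print(f"Spatial join of {n_points} points with {len(grid)} grid cells: {elapsed:.3f} s")
print(f"Matched rows: {len(result)}")
```

16.6.2 Consistent Coordinate Reference Systems

A spatial join requires both GeoDataFrames to use the same CRS. If they differ, GeoPandas warns about the mismatch and the matches may be wrong:

```python
# Make sure the CRS is consistent
if pois.crs != districts.crs:
    districts = districts.to_crs(pois.crs)

result = gpd.sjoin(pois, districts)
```

16.6.3 Preprocessing the Data

```python
# 1. Coarse-filter by bounding box to reduce the data that enters the join
bbox = districts.total_bounds  # [minx, miny, maxx, maxy]
pois_filtered = pois.cx[bbox[0]:bbox[2], bbox[1]:bbox[3]]

# 2. Project to a suitable planar CRS
pois_proj = pois_filtered.to_crs(epsg=32650)
districts_proj = districts.to_crs(epsg=32650)

# 3. Run the spatial join
result = gpd.sjoin(pois_proj, districts_proj, predicate='within')
```

16.6.4 Processing Large Datasets in Chunks

```python
import pandas as pd

def sjoin_chunked(left, right, chunk_size=10000, **kwargs):
    """Run the spatial join in chunks to lower peak memory use."""
    results = []
    n_chunks = (len(left) + chunk_size - 1) // chunk_size

    for i in range(n_chunks):
        start_idx = i * chunk_size
        end_idx = min((i + 1) * chunk_size, len(left))
        chunk = left.iloc[start_idx:end_idx]

        chunk_result = gpd.sjoin(chunk, right, **kwargs)
        results.append(chunk_result)

    # Keep the original left index so rows can be traced back to the input
    return pd.concat(results, ignore_index=False)

# Use chunked processing
result = sjoin_chunked(points, grid, chunk_size=20000, predicate='within')
print(f"Chunked join rows: {len(result)}")
```

16.6.5 Comparison of Optimization Strategies

| Strategy | Applicable scenario | Effect |
|---|---|---|
| Automatic spatial index | All scenarios | Enabled by default; significant speedup |
| CRS consistency | Multi-source data | Avoids incorrect results |
| Bounding-box prefilter | Unevenly distributed data | Reduces the amount of computation |
| Chunked processing | Insufficient memory | Lowers peak memory use |
| Projected CRS | Distance calculations needed | Improves accuracy |

16.7 Practical Examples

16.7.1 Example 1: Point-in-Polygon Analysis - Counting POIs per District

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon
import numpy as np

np.random.seed(42)

# Simulated POI data
n_pois = 5000
poi_data = gpd.GeoDataFrame({
    'poi_type': np.random.choice(['Dining', 'Commercial', 'Education', 'Medical'], n_pois),
    'geometry': [Point(x, y) for x, y in zip(
        np.random.uniform(116.2, 116.6, n_pois),
        np.random.uniform(39.8, 40.1, n_pois)
    )]
}, crs="EPSG:4326")

# Simulated district data
districts_data = gpd.GeoDataFrame({
    'district': ['District A', 'District B', 'District C', 'District D'],
    'geometry': [
        Polygon([(116.2, 39.8), (116.4, 39.8), (116.4, 39.95), (116.2, 39.95)]),
        Polygon([(116.4, 39.8), (116.6, 39.8), (116.6, 39.95), (116.4, 39.95)]),
        Polygon([(116.2, 39.95), (116.4, 39.95), (116.4, 40.1), (116.2, 40.1)]),
        Polygon([(116.4, 39.95), (116.6, 39.95), (116.6, 40.1), (116.4, 40.1)])
    ]
}, crs="EPSG:4326")

# Spatial join
joined = gpd.sjoin(poi_data, districts_data, predicate='within')

# Count POIs by district and POI type
poi_stats = joined.groupby(['district', 'poi_type']).size().unstack(fill_value=0)
print("POI counts per district:")
print(poi_stats)

# Total POIs per district
poi_total = joined.groupby('district').size()
print("\nTotal POIs per district:")
print(poi_total)
```

16.7.2 Example 2: Nearest-Facility Analysis

```python
import geopandas as gpd
from shapely.geometry import Point
import numpy as np

np.random.seed(42)

# Residential points
n_residents = 1000
residents = gpd.GeoDataFrame({
    'resident_id': range(n_residents),
    'geometry': [Point(x, y) for x, y in zip(
        np.random.uniform(0, 10000, n_residents),
        np.random.uniform(0, 10000, n_residents)
    )]
}, crs="EPSG:32650")

# Fire stations
stations = gpd.GeoDataFrame({
    'station': ['Station A', 'Station B', 'Station C', 'Station D', 'Station E'],
    'geometry': [
        Point(2000, 2000), Point(8000, 2000),
        Point(5000, 5000),
        Point(2000, 8000), Point(8000, 8000)
    ]
}, crs="EPSG:32650")

# Nearest-neighbor join: nearest fire station and distance for each residential point
result = gpd.sjoin_nearest(
    residents, stations,
    distance_col='distance_m'
)

# Summarize the service load of each fire station
service_stats = result.groupby('station').agg(
    residents_served=('resident_id', 'count'),
    mean_distance=('distance_m', 'mean'),
    max_distance=('distance_m', 'max')
).round(1)
print("Fire station service statistics:")
print(service_stats)

# Flag residential points farther than 3000 meters as "underserved"
result['risk'] = result['distance_m'] > 3000
print(f"\nResidential points more than 3000 m from the nearest station: {result['risk'].sum()}")
print(f"Share: {result['risk'].mean():.1%}")
```

16.7.3 Example 3: Transferring Attributes from Polygons to Lines

```python
import geopandas as gpd
from shapely.geometry import LineString, Polygon

# Road data
roads = gpd.GeoDataFrame({
    'road_name': ["Chang'an Avenue", 'Third Ring Road', 'Fourth Ring Road section'],
    'length_km': [12.5, 48.3, 65.2],
    'geometry': [
        LineString([(116.30, 39.91), (116.50, 39.91)]),
        LineString([(116.32, 39.88), (116.48, 39.88), (116.48, 39.98)]),
        LineString([(116.28, 39.85), (116.52, 39.85)])
    ]
}, crs="EPSG:4326")

# Administrative zones
admin_zones = gpd.GeoDataFrame({
    'zone': ['Urban area', 'Suburban area'],
    'zone_type': ['Core zone', 'Development zone'],
    'geometry': [
        Polygon([(116.30, 39.87), (116.50, 39.87), (116.50, 39.95), (116.30, 39.95)]),
        Polygon([(116.25, 39.83), (116.55, 39.83), (116.55, 39.87), (116.25, 39.87)])
    ]
}, crs="EPSG:4326")

# Spatial join - find the zones each road passes through or touches
road_zones = gpd.sjoin(roads, admin_zones, predicate='intersects')
print("Road-to-zone associations:")
print(road_zones[['road_name', 'zone', 'zone_type']])
```

16.8 Chapter Summary

This chapter covered spatial join operations in GeoPandas. The main points are reviewed below.

Core functions

| Function | Description |
|---|---|
| gpd.sjoin() | Spatial join based on a spatial predicate |
| gpd.sjoin_nearest() | Spatial join based on nearest distance |

Key parameters

| Parameter | Description | Default |
|---|---|---|
| how | Join type (inner/left/right) | 'inner' |
| predicate | Spatial predicate (intersects/within/contains, etc.) | 'intersects' |
| lsuffix/rsuffix | Suffixes for same-named columns | 'left'/'right' |
| max_distance | Maximum nearest-neighbor search distance | None |
| distance_col | Name of the column that stores the distance | None |

Usage recommendations

- Choose the right predicate: pick the strictest spatial predicate that fits the analysis to avoid unnecessary matches
- Keep the CRS consistent: always check and unify the coordinate reference systems before joining
- Use a projected CRS: when distances are involved, project to a suitable planar CRS first
- Watch for one-to-many matches: one feature may match several features, so the result can have more rows than the input
- Rely on the spatial index: sjoin() uses the spatial index automatically, so there is no need to build one by hand

The next chapter covers geometric overlay analysis (Overlay). It extends the spatial join: besides linking attributes, it also computes intersections, unions, and other operations on the geometries themselves.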
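As a closing sketch for the CRS and one-to-many recommendations in the chapter summary, the snippet below aligns the CRS, runs a left join, and keeps at most one match per left row. The helper name `join_one_match_per_row` is hypothetical and not part of GeoPandas. The sketch assumes both inputs have a CRS set and that the right table has an unnamed index, so the match column is called `index_right`. Keeping the first match is only one possible tie-breaking rule.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon


def join_one_match_per_row(left, right, predicate="intersects"):
    """Left spatial join that keeps at most one match per left row.

    Hypothetical helper for illustration; not part of GeoPandas.
    """
    # Align the CRS first so both inputs share one coordinate system.
    if left.crs != right.crs:
        right = right.to_crs(left.crs)
    joined = gpd.sjoin(left, right, how="left", predicate=predicate)
    # One-to-many matches repeat the left index; keep only the first match.
    return joined[~joined.index.duplicated(keep="first")]


points = gpd.GeoDataFrame(
    {"name": ["center", "outside"],
     "geometry": [Point(0.5, 0.5), Point(5, 5)]},
    crs="EPSG:4326",
)
zones = gpd.GeoDataFrame(
    {"zone": ["Zone A", "Zone B"],
     "geometry": [
         Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
         Polygon([(0.3, 0.3), (1.3, 0.3), (1.3, 1.3), (0.3, 1.3)]),
     ]},
    crs="EPSG:4326",
)

result = join_one_match_per_row(points, zones)
# Rows without any match have NaN in index_right after a left join.
n_unmatched = int(result["index_right"].isna().sum())
print(result[["name", "zone"]])
print("unmatched rows:", n_unmatched)
```

If the analysis needs a specific match rather than an arbitrary first one (for example, the zone with the largest overlap), sort the joined rows by that criterion before dropping duplicates.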

