Name	Name	Last commit message	Last commit date
parent directory ..
pic	pic
README.md	README.md
arctern_shtruck_bootcamp.ipynb	arctern_shtruck_bootcamp.ipynb

Analyzing Shanghai Truck Dataset with Arctern

This tutorial uses the Shanghai Truck dataset as an example to describe how to use Arctern to process large geospatial data and use kepler.gl to visualize data.

Prerequisite

Install Arctern

Install Jupyter Notebook

In the previous step, you have set up the artern_env environment. Enter this environment and run the following command to install Jupyter Notebook:

$ conda install -c conda-forge notebook

Install required libraries

In the arctern_env environment, run the following command to install required libraries:

$ pip install keplergl matplotlib

Data Preparation

Download the Shanghai Truck dataset, which includes more than 2 million Shanghai Truck data records and the Shanghai topographic map. Save the data in /tmp by default:

$ cd /tmp
# Download and unzip the Shanghai Truck data
$ wget https://github.com/arctern-io/arctern-bootcamp/raw/master/shtruck/file/20181016.zip
# Download Shanghai roda data
$ wget https://github.com/arctern-io/arctern-bootcamp/raw/master/shtruck/file/sh_roads.csv

Initialize Jupyter Notebook

Download arctern_shtruck_bootcamp.ipynb, and then start Jupyter Notebook in the arctern_env environment：

$ wget https://raw.githubusercontent.com/arctern-io/arctern-bootcamp/master/shtruck_en/arctern_shtruck_bootcamp.ipynb
# Start Jupyter Notebook
$ jupyter notebook

Open arctern_nytaxi_bootcamp.ipynb in Jupyter Notebook. Let's start to have fun with the example codes.

Introduction of example codes

1. Data loading

Load Shanghai truck trajectory data. The original data has a total of 8 columns, but here we only need 4 columns for analysis: plate number, time, latitude and longitude.

import pandas as pd
import arctern
from arctern import GeoSeries

sh_columns=[
    ("plate_number","string"),
    ("pos_time","string"),
    ("pos_longitude","double"),
    ("pos_latitude","double"),
    ("pos_direction0","double"),
    ("pos_direction1","double"),
    ("pos_direction2","double"),
    ("pos_direction3","double")
]

sh_select_columns={
                   "plate_number",
                   "pos_time",
                   "pos_longitude",
                   "pos_latitude"
                  }

sh_schema={}
sh_use_cols=[]
sh_names=[]
for idx in range(len(sh_columns)):
    if sh_columns[idx][0] in sh_select_columns:
        sh_schema[sh_columns[idx][0]] = sh_columns[idx][1]
        sh_use_cols.append(idx)
        sh_names.append(sh_columns[idx][0])
            
sh_df = pd.read_csv("/tmp/20181016.txt",
                    usecols=sh_use_cols,
                    names=sh_names,
                    dtype=sh_schema,
                    header=None,
                    delimiter="\t",
                    date_parser=pd.to_datetime,
                    parse_dates=["pos_time"])

According to the latitudes and longitudes, construct position information:

sh_df["pos_point"]=GeoSeries.point(sh_df.pos_longitude,sh_df.pos_latitude)
sh_df

	plate_number	pos_time	pos_longitude	pos_latitude	pos_point
0	沪DK7362	2018-10-16 00:00:00	121.273108	30.989863	POINT (121.273108 30.989863)
1	沪DT0830	2018-10-16 00:00:00	121.471555	31.121763	POINT (121.471555 31.121763)
2	沪EP2723	2018-10-16 00:00:00	121.717205	31.380190	POINT (121.717205 31.38019)
3	沪DH9100	2018-10-16 00:00:00	121.476368	31.197768	POINT (121.476368 31.197768)
4	沪DP8608	2018-10-16 00:00:00	121.826568	31.096545	POINT (121.826568 31.096545)
...	...	...	...	...	...
2076589	沪EG9666	2018-10-16 23:59:31	121.753138	31.356040	POINT (121.753138 31.35604)
2076590	沪DP8746	2018-10-16 23:59:35	121.447145	31.125255	POINT (121.447145 31.125255)
2076591	沪DP8746	2018-10-16 23:59:41	121.448203	31.125408	POINT (121.448203 31.125408)
2076592	沪DP8746	2018-10-16 23:59:48	121.449426	31.125510	POINT (121.449426 31.12551)
2076593	沪DE6779	2018-10-16 23:59:54	121.880973	31.082136	POINT (121.880973 31.082136)

2076594 rows × 5 columns

2. Analyze the running track

We can restore the running track of a truck. First, we select a plate numbers of truck and filter all the data:

one_trunck_plate_number=sh_df.plate_number[0]
print(one_trunck_plate_number)
one_truck_df = sh_df[sh_df.plate_number==one_trunck_plate_number]

沪DK7362

Draw all the track points of this car on the map:

from keplergl import KeplerGl
KeplerGl(data={"car_pos": pd.DataFrame(data={'car_pos':one_truck_df.pos_point.to_wkt()})})

3. Display road network and track

Next, we will look at the running track of the above truck according to the road network information of Shanghai. First, load the road network information:

sh_roads=pd.read_csv("/tmp/sh_roads.csv", 
                     dtype={"roads":"string"},
                     usecols=[0],
                     names=["roads"],
                     header=None,
                     delimiter='|')
sh_roads=GeoSeries(sh_roads.roads)
sh_roads

0        LINESTRING (121.6358731 31.221484,121.6359771 ...
1        LINESTRING (121.6362516 31.2205594,121.6360422...
2        LINESTRING (121.6372043 31.220911,121.6369344 ...
3        LINESTRING (121.4637777 31.2314411,121.4637564...
4        LINESTRING (121.4628334 31.2311683,121.4627892...
                               ...                        
74688    LINESTRING (121.2544395 31.0235354,121.2550238...
74689    LINESTRING (121.6372338 31.2208457,121.6362516...
74690    LINESTRING (121.6372338 31.2208457,121.6373315...
74691    LINESTRING (121.3657763 31.085248,121.3656812 ...
74692    LINESTRING (121.6372043 31.220911,121.6372338 ...
Name: roads, Length: 74693, dtype: GeoDtype

Draw the above track location and road network information on the map:

one_truck_roads=KeplerGl(data={"car_pos": pd.DataFrame(data={'car_pos':one_truck_df.pos_point.to_wkt()})})
one_truck_roads.add_data(data=pd.DataFrame(data={'sh_roads':sh_roads.to_wkt()}),name="sh_roads")
one_truck_roads

The returned map results support interaction. By zooming it in, you can find that some trajectory points of the vehicle are not on the road, and these noise data need to be cleaned.

Data conversion

We regard the track locations that are not on the road as noisy data, and these track points must be bound to the closest roads:

is_near_road=arctern.near_road(sh_roads,sh_df.pos_point)
sh_near_road_df=sh_df[is_near_road]
on_road=arctern.nearest_location_on_road(sh_roads, sh_near_road_df.pos_point)
on_road=GeoSeries(on_road)
on_road

0          POINT (121.273065837839 30.9898629672054)
1          POINT (121.471521117758 31.1218966267949)
2          POINT (121.717183265368 31.3801593122801)
3           POINT (121.47636780833 31.1977688430427)
4          POINT (121.826533061028 31.0965194009541)
                             ...                    
2076589    POINT (121.753124012736 31.3560208068604)
2076590     POINT (121.44712530551 31.1255173541719)
2076591    POINT (121.448188797914 31.1255971887735)
2076592    POINT (121.449412558681 31.1256890544539)
2076593    POINT (121.880966206794 31.0821528456654)
Length: 1807018, dtype: GeoDtype

Binding algorithm

Bind the truck locations to the roads, and reconstruct the DataFrame：

sh_on_road_df=pd.DataFrame(data={"plate_number":sh_near_road_df.plate_number,
                                 "pos_time":sh_near_road_df.pos_time,
                                 "on_road":on_road
                                })
sh_on_road_df

	plate_number	pos_time	on_road
0	沪DK7362	2018-10-16 00:00:00	POINT (121.273065837839 30.9898629672054)
1	沪DT0830	2018-10-16 00:00:00	POINT (121.471521117758 31.1218966267949)
2	沪EP2723	2018-10-16 00:00:00	POINT (121.717183265368 31.3801593122801)
3	沪DH9100	2018-10-16 00:00:00	POINT (121.47636780833 31.1977688430427)
4	沪DP8608	2018-10-16 00:00:00	POINT (121.826533061028 31.0965194009541)
...	...	...	...
2076589	沪EG9666	2018-10-16 23:59:31	POINT (121.753124012736 31.3560208068604)
2076590	沪DP8746	2018-10-16 23:59:35	POINT (121.44712530551 31.1255173541719)
2076591	沪DP8746	2018-10-16 23:59:41	POINT (121.448188797914 31.1255971887735)
2076592	沪DP8746	2018-10-16 23:59:48	POINT (121.449412558681 31.1256890544539)
2076593	沪DE6779	2018-10-16 23:59:54	POINT (121.880966206794 31.0821528456654)

1807018 rows × 3 columns

Draw the track of the truck and the road network of Shanghai again:

one_on_road_df=sh_on_road_df[sh_on_road_df.plate_number==one_trunck_plate_number]
one_on_roads=KeplerGl(data={"car_pos": pd.DataFrame(data={'car_pos':one_on_road_df.on_road.to_wkt()})})
one_on_roads.add_data(data=pd.DataFrame(data={'sh_roads':sh_roads.to_wkt()}),name="sh_roads")
one_on_roads

By zooming out, we can see that all the locations are on the road:

4. Road analysis

There are 74,693 road network records in Shanghai, but trucks will not cross all roads. So, we will analyze the roads that the truck passes by to see which roads have the highest passing-by frequency.

Draw the road where trucks travel

First, filter out all the roads that the truck passes:

all_roads=arctern.nearest_road(sh_roads,sh_on_road_df.on_road)
all_roads=GeoSeries(all_roads)
road_codes, road_uniques = pd.factorize(all_roads)

Show the road data of all passing trucks and the corresponding percentage:

print(len(road_uniques))
print(len(road_uniques)*100.0/len(sh_roads))

16450
22.0234827895519

Draw all the roads that the truck passes:

KeplerGl(data={"all_roads": pd.DataFrame(data={'all_roads':GeoSeries(road_uniques).to_wkt()})})

For some main roads, there are many trucks pass by, which results in plenty of GPS signal sampling points on these roads. At the same time, due to the high traffic volume of these roads, many trucks drive very slowly, which further strengthenes the GPS sampling points on the roads.

Next, we will find the busy roads according to the number of truck locations on the roads.

Road weight analysis of truck

Count the number of truck locations on each road, and reconstruct DataFrame. We take the number of location points on the road as road weights:

roads_codes_series = pd.Series(road_codes)
roads_codes_series = roads_codes_series.value_counts()
roads_codes_series = roads_codes_series.sort_index()

sh_road_weight = pd.DataFrame(data={"on_road":GeoSeries(road_uniques),
                                    "weight_value":roads_codes_series
                                   })

sh_road_weight

	on_road	weight_value
0	LINESTRING (121.2730666 30.9888831,121.2730596...	1646
1	LINESTRING (121.4677565 31.1198416,121.4678423...	1579
2	LINESTRING (121.7202751 31.3780202,121.7197339...	141
3	LINESTRING (121.477849 31.1981056,121.4742212 ...	83
4	LINESTRING (121.8374393 31.0816313,121.8345587...	1268
...	...	...
16445	LINESTRING (121.4278848 31.2389835,121.4280869...	3
16446	LINESTRING (121.431042 31.2403309,121.4307167 ...	1
16447	LINESTRING (121.6378175 31.2374256,121.6373658...	1
16448	LINESTRING (121.432118 31.2416392,121.4314564 ...	1
16449	LINESTRING (121.4444113 30.9236353,121.443815 ...	2

16450 rows × 2 columns

Show the road weights:

sh_road_weight.weight_value.describe()

count    16450.000000
mean       109.849119
std        802.067993
min          1.000000
25%          2.000000
50%          5.000000
75%         22.000000
max      28144.000000
Name: weight_value, dtype: float64

It can be found that most roads are not busy, yet some roads are very busy.

Plot weight_value into a histogram:

import matplotlib.pyplot as plt
plt.bar(sh_road_weight.index,sh_road_weight.weight_value)
plt.show()

The histogram can further confirm the previous conclusion.

Sort all roads according to the road weights:

sh_sorted_road=sh_road_weight.sort_values(by=['weight_value'],ascending=False)
sh_sorted_road

	on_road	weight_value
102	LINESTRING (121.4248121 31.4032657,121.4265065...	28144
23	LINESTRING (121.473513 31.3702961,121.4736103 ...	24364
9	LINESTRING (121.6349225 31.14309,121.6348039 3...	21448
43	LINESTRING (121.2749664 31.0244814,121.2722674...	20599
89	LINESTRING (121.3814009 31.391344,121.3820681 ...	20463
...	...	...
13661	LINESTRING (121.3757469 31.0789625,121.3756868...	1
7980	LINESTRING (121.3675949 31.2531523,121.3677313...	1
11360	LINESTRING (121.6867871 31.104265,121.6815152 ...	1
13664	LINESTRING (121.5174295 31.2760579,121.5171332...	1
14269	LINESTRING (121.3291131 31.7212865,121.3269348...	1

16450 rows × 2 columns

Select the top 100 busiest roads:

sh_sorted_road.iloc[0:100]

	on_road	weight_value
102	LINESTRING (121.4248121 31.4032657,121.4265065...	28144
23	LINESTRING (121.473513 31.3702961,121.4736103 ...	24364
9	LINESTRING (121.6349225 31.14309,121.6348039 3...	21448
43	LINESTRING (121.2749664 31.0244814,121.2722674...	20599
89	LINESTRING (121.3814009 31.391344,121.3820681 ...	20463
...	...	...
13	LINESTRING (121.4986731 31.2685444,121.4963124...	3668
566	LINESTRING (121.3656233 31.257495,121.3689789 ...	3655
291	LINESTRING (121.5675539 31.3573854,121.5650207...	3625
759	LINESTRING (121.4644001 31.3612873,121.463751 ...	3622
345	LINESTRING (121.297295 31.1168969,121.2973077 ...	3616

100 rows × 2 columns

Draw the top 100 busiest roads on the map:

KeplerGl(data={"on_roads": pd.DataFrame(data={'on_roads':sh_sorted_road.on_road.iloc[0:100].to_wkt()})})

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

shtruck_en

shtruck_en

README.md

Analyzing Shanghai Truck Dataset with Arctern

Prerequisite

Install Arctern

Install Jupyter Notebook

Install required libraries

Data Preparation

Initialize Jupyter Notebook

Introduction of example codes

1. Data loading

2. Analyze the running track

3. Display road network and track

Data conversion

Binding algorithm

4. Road analysis

Draw the road where trucks travel

Road weight analysis of truck

Files

shtruck_en

Directory actions

More options

Directory actions

More options

Latest commit

History

shtruck_en

Folders and files

parent directory

README.md

Analyzing Shanghai Truck Dataset with Arctern

Prerequisite

Install Arctern

Install Jupyter Notebook

Install required libraries

Data Preparation

Initialize Jupyter Notebook

Introduction of example codes

1. Data loading

2. Analyze the running track

3. Display road network and track

Data conversion

Binding algorithm

4. Road analysis

Draw the road where trucks travel

Road weight analysis of truck