Analyzing Massive Amounts of Location Data with Honeycomb
Honeycomb is a client-side geospatial visualization tool. It runs directly in the web browser, delivering highly responsive interactive maps and charts. Because of this design, however, there is a 4GB limit on the amount of data that can be loaded directly into Honeycomb. You can find more background on the design of Honeycomb here.
Data grain: points or H3 indexes
Honeycomb natively supports two different types of data 'grain' (also known as granularity): point-level data or H3-level data. In the first, each row in the dataset is a point on the map, represented by latitude and longitude coordinates. In the second, each row is an H3 index. If you are not familiar with H3 indexes, you can learn more in this post on the Honeycomb Blog.
Note: Honeycomb also supports individual-level rows that carry an H3 index instead of coordinates. It will automatically group them by H3 index before displaying them on the map.
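To make the two grains concrete, here is a minimal sketch of what each table shape might look like. The deliveries and deliveries_h3 tables (and their columns) are hypothetical names for illustration, not requirements of Honeycomb:

-- Point grain: one row per real-world event (hypothetical schema)
CREATE TABLE deliveries (
    id        INTEGER,
    latitude  DOUBLE,
    longitude DOUBLE,
    status    VARCHAR   -- e.g. 'DELIVERED' or 'FAILED'
);

-- H3 grain: one row per H3 cell, with metrics pre-aggregated
CREATE TABLE deliveries_h3 (
    h3_index  VARCHAR,  -- every row at the same H3 resolution
    row_count INTEGER
);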
Point-level data: raw information
Using a dataset where each row is a point provides a high level of flexibility and detail. Because the 'raw' data is available, individual points can be shown on the map. In addition, on-the-fly 'slice and dice' aggregations involving multiple fields are supported, such as computing the total value of sales for failed deliveries.
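For intuition, that kind of on-the-fly aggregation is roughly what the following SQL expresses, except that Honeycomb computes it in the browser rather than in a database. The sale_value and status columns are assumed for illustration:

-- Total sales value for failed deliveries (hypothetical columns)
SELECT sum(sale_value) AS failed_sales_value
FROM deliveries
WHERE status = 'FAILED';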
If point-level data is provided, Honeycomb will compute H3 indexes for each point automatically. This means that you do not need to calculate H3 indexes in another tool before bringing your data into Honeycomb.
The main drawback of point-level data is volume: beyond 5-10M points, a dataset becomes too large for Honeycomb to handle. That said, this is often more than enough when analyzing real-world events; not many companies make 5 million parcel deliveries in a single day.
If you are analyzing more than 5-10M points, there are a few approaches you can take. The first is to filter the data before it reaches Honeycomb (a SQL sketch of both filters follows the list):
- Create a data table for a sub-region of your analysis. For example, a scooter rental company may create a table for each market it operates in. This is logical because operations managers in one market do not need to see the data for other markets. Creating these tables could be automated with tools like dbt.
- Create a data table with a time filter. Perhaps only events that have occurred in the past 90 days are relevant to show on the map. This can be an effective way to reduce the data size while preserving value.
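As a concrete sketch, both filters can be expressed as a single statement in the warehouse. The rides table and the market and event_time columns are hypothetical names used for illustration:

-- Hypothetical: one table per market, limited to the past 90 days of events
CREATE OR REPLACE TABLE rides_berlin AS
SELECT *
FROM rides
WHERE market = 'berlin'
  AND event_time >= dateadd(day, -90, current_date());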
If filtering data before it gets to Honeycomb isn't desirable, you can pre-aggregate it to H3 cells instead.
H3-level data: pre-aggregated metrics
Honeycomb also supports datasets where each row is an H3 index. All rows must use the same H3 resolution, which Honeycomb detects automatically.
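To check that a prepared table meets this requirement, one option is a quick query in Snowflake using its native H3_GET_RESOLUTION() function (the deliveries_h3 table name is hypothetical):

-- Should return exactly one row if all H3 indexes share a resolution
SELECT DISTINCT h3_get_resolution(h3_index) AS resolution
FROM deliveries_h3;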
Pre-aggregating data to H3 allows for the analysis of effectively infinite amounts of point data. However, the drawback is that the type of analysis or aggregation that will be shown on the map needs to be known upfront. In addition, individual points will not be visible on the map.
Aggregating to H3 is an ideal task for a scalable data warehouse like Snowflake. Snowflake's native H3 functions, including H3_LATLNG_TO_CELL(), make it easy to generate H3 indexes from points and then group by them to generate metrics.
Below is an example SQL query that takes point-level data and aggregates it to H3 resolution 12.
SELECT
count(id) as row_count,
h3_latlng_to_cell(latitude, longitude, 12) as h3_index
FROM
source_data
GROUP BY
h3_index
The complete SQL (including a full dbt project template) can be found on GitHub, and a full write-up of how to build a data pipeline with H3 in Snowflake can be found on the Honeycomb Blog here.
Snowflake and Honeycomb: Better Together
The Honeycomb Data Explorer Native Snowflake app lets you use Snowflake's highly scalable data warehouse to store and prepare data upfront, and then use Honeycomb's client-side interactivity to quickly view and analyze the data.
Large amounts of data only need to be aggregated to H3 once by Snowflake. After that, Honeycomb does all computation on the client side, meaning that adding more Honeycomb users does not add significant load to Snowflake. This hybrid approach allows companies to utilize massive amounts of location data in a highly efficient way.
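As a sketch of that "aggregate once" step, the earlier example query could be materialized as a table that Honeycomb then reads directly (the source_data_h3_12 name is hypothetical):

-- Run once (or on a schedule) in Snowflake; Honeycomb reads the result
CREATE OR REPLACE TABLE source_data_h3_12 AS
SELECT
    count(id) AS row_count,
    h3_latlng_to_cell(latitude, longitude, 12) AS h3_index
FROM source_data
GROUP BY h3_index;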