Unleashing Real Estate with Snowflake and Large, Complex Geospatial Data Sources

June 10, 2022 · 10 min read

Overview of Parcl Labs

Parcl Labs' mission is to set the global standard for residential real estate data and, in doing so, index every property on the planet. We are obsessed with ingesting the world's most challenging geospatial data sources to enrich our understanding of the geophysical attributes of the areas in which real estate exists.

Real estate is not a spreadsheet problem. It is a heterogeneous collection of data points shaped by a complex system of supply and demand, legal and regulatory frameworks, and geophysical characteristics that together drive consumer preferences toward specific locations in time and space.

To approach a global problem, we need a fundamentally different approach to how the real estate industry uses data.

How we use & why we need OSM data

To have a complete understanding of the current real estate market, we consume a wide variety of global and country-specific datasets in a standardized way. One such data source is OpenStreetMap (OSM), which companies like Uber and Meta use to enhance their mapping and routing capabilities. Meta has contributed back to the community through a variant of OpenStreetMap called Daylight, which is what we ultimately used due to its ingest-ready data format and built-in quality checks.

Daylight opens up a world of opportunity to better understand real estate. It allows us to identify exactly how far a home is from the nearest beachfront, lake, or park. It allows us to understand how accessible neighborhoods are to economic hubs and to identify areas of improvement for public transit. These are a small subset of the problems we are actively working on to elevate the world's understanding of real estate behavior.

Snowflake as an accelerator

OpenStreetMap on its own is a cumbersome dataset to work with; it often comes in cryptic formats and requires large format conversions to get it into a modern, consumption-ready raw data format like Parquet. Even then, the structure of the data leaves a lot to be desired.

Daylight addresses some of these pitfalls by shipping in a consumption-ready format. On the other hand, Daylight is plagued by some of the same issues when it comes to being analysis ready. We were inspired by an analysis-ready version of Daylight that uses some unique Athena functionality, primarily the map type, and we wanted to see if we could replicate it in Snowflake for our own use cases.

To illustrate why these transformations are required, here is what the native format looks like after ingesting Daylight:
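
An illustrative record of a single node (the ID, coordinates, and names below are made up for the example):

    {
      "ID": 123456789,
      "TYPE": "node",
      "LAT": 40.7128,
      "LON": -74.0060,
      "TAGS": [
        { "key": "highway", "value": "bus_stop" },
        { "key": "name", "value": "Main St & 1st Ave" }
      ]
    }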

One of the primary issues with this format is that the attributes of this particular node, which is a bus stop, are nested within the tags array as key/value pairs. Extracting the fact that this is a bus stop would require a rather lengthy query that iterates over every key/value pair in every record. Note that the actual names of the keys here are key and value. This didn't work for our needs, as we needed a more flexible and compact notation for querying any of the hundreds of amenity categories and subcategories.

Thanks to Snowflake's semi-structured querying capabilities, the following format (using the example above) is immensely more helpful for analysis:
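
The same illustrative record, with the tags restructured into a single object:

    {
      "ID": 123456789,
      "TYPE": "node",
      "LAT": 40.7128,
      "LON": -74.0060,
      "TAGS": {
        "highway": "bus_stop",
        "name": "Main St & 1st Ave"
      }
    }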

In this case, one could simply look for TAGS:highway = 'bus_stop' to find the bus stops.
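
For instance, assuming the flattened records live in a table called osm_nodes (a hypothetical name), the full query is a one-liner:

    SELECT id, tags:name::STRING AS stop_name, lat, lon
    FROM osm_nodes
    WHERE tags:highway = 'bus_stop';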

With some help from the Snowflake team, we were able to leverage some of their tools to get our data into an ideal state:

  1. Created a JavaScript UDF to restructure the nested JSON in the source data.
  2. Integrated the JavaScript UDF as a macro in dbt, which made for a streamlined load process.
  3. Leveraged Snowflake's native geospatial functionality (TO_GEOGRAPHY) to create a more efficient version of the Athena work that is ready for geospatial analysis against our real estate data.

Here are code snippets illustrating the JavaScript UDF approach we use:
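
A minimal sketch of the idea (the function, table, and column names are illustrative rather than our production code):

    -- 1) JavaScript UDF: fold the [{key, value}, ...] array into one object
    CREATE OR REPLACE FUNCTION flatten_tags(TAGS VARIANT)
    RETURNS VARIANT
    LANGUAGE JAVASCRIPT
    AS
    $$
      // e.g. [{key: "highway", value: "bus_stop"}] -> {highway: "bus_stop"}
      if (TAGS == null) return null;
      var flat = {};
      for (var i = 0; i < TAGS.length; i++) {
        flat[TAGS[i].key] = TAGS[i].value;
      }
      return flat;
    $$;

    -- 2) Apply the UDF during the load, building a GEOGRAPHY point per node
    CREATE OR REPLACE TABLE osm_nodes AS
    SELECT
        id,
        flatten_tags(tags) AS tags,
        TO_GEOGRAPHY('POINT(' || lon || ' ' || lat || ')') AS geog
    FROM raw_daylight_nodes;

Wrapping the CREATE FUNCTION statement in a dbt macro and invoking it from a run hook is one way to keep the UDF versioned alongside the rest of the load.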

Use Cases

By ingesting this data, we can build a lot of neat visuals.

Global Power Towers:

Global Ice Cream Shops:
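
Once the tags are flattened, each of these maps boils down to a one-line filter. A sketch against the illustrative osm_nodes table from above (power=tower and amenity=ice_cream are the standard OSM tags):

    SELECT geog FROM osm_nodes WHERE tags:power = 'tower';        -- power towers
    SELECT geog FROM osm_nodes WHERE tags:amenity = 'ice_cream';  -- ice cream shops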

While building ad-hoc visuals is certainly fun, the power of data is at the intersection of seemingly disparate sources of information. This leads us to the remote work index. 

Remote Work Index

With Daylight OSM in a flexible format, we can quickly and consistently combine it with our real estate data and the other complex geospatial datasets we store in our warehouse. For example, for our remote work index we decided to look at the following characteristics for thousands of neighborhoods in the United States:

  • Density of Bars
  • Density of Restaurants
  • Density of Coffee Shops
  • Avg Download Speeds (Mbps) - Can you even field a Zoom call?
  • Average Summertime Temps
  • Real Estate Affordability

To begin, we were most interested in local bars, restaurants, and coffee shops, so we filtered out chains - any amenity with 10 or more locations. For our MVP, we prioritized walking access to these locations and standardized across locations by looking at the density of amenities per square mile as opposed to raw counts.
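
As a sketch, the bar-density feature might be computed like this (osm_nodes and neighborhoods are hypothetical table names; ST_AREA returns square meters, so we convert to square miles):

    -- local (non-chain) bars per square mile, per neighborhood
    WITH local_bars AS (
        SELECT id, geog, tags:name::STRING AS name
        FROM osm_nodes
        WHERE tags:amenity = 'bar'
        -- drop chains: any name appearing at 10 or more locations
        QUALIFY COUNT(*) OVER (PARTITION BY tags:name::STRING) < 10
    )
    SELECT
        n.neighborhood_id,
        -- 2,589,988.11 square meters per square mile
        COUNT(b.id) / ANY_VALUE(ST_AREA(n.geog)) * 2589988.11 AS bars_per_sq_mile
    FROM neighborhoods n
    LEFT JOIN local_bars b
        ON ST_CONTAINS(n.geog, b.geog)
    GROUP BY n.neighborhood_id;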

We decided to start with the 10 Case-Shiller MSAs, building off of work our team did earlier. For reference, this includes:

  • Miami
  • Washington, DC Metro Area
  • Boston
  • San Diego
  • Los Angeles
  • San Francisco
  • Denver
  • New York
  • Las Vegas
  • Chicago

Using the features outlined above, we created a composite score that normalizes each input into an index ranging from 0 to 1, letting us sort neighborhoods around the country. However, it was readily apparent that our own preferences didn't match those of our colleagues, so we turned it into a customizable calculator that lets folks weight what matters to them for remote work.
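
The scoring itself is simple; here is a sketch with just two of the six features (neighborhood_features and the weight variables are hypothetical):

    -- user preferences become weights
    SET w_bars = 0.7;
    SET w_afford = 0.3;

    SELECT
        neighborhood_id,
        ( $w_bars * (bar_density - MIN(bar_density) OVER ())
              / NULLIF(MAX(bar_density) OVER () - MIN(bar_density) OVER (), 0)
          -- affordability: a lower median price should score higher, so invert
          + $w_afford * (1 - (median_price - MIN(median_price) OVER ())
              / NULLIF(MAX(median_price) OVER () - MIN(median_price) OVER (), 0))
        ) / ($w_bars + $w_afford) AS remote_work_score
    FROM neighborhood_features
    ORDER BY remote_work_score DESC;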

For example, if someone had an unlimited budget, so real estate affordability was not a concern, and was most interested in access to a wide variety of restaurants, bars, and coffee shops in fairly moderate summer temperatures, they could prioritize those preferences using the following settings:

Prototype version of data and calculator

However, a more balanced user, who wants a great remote-working experience without paying a ridiculous premium for it, might have a profile like the following:

Prototype version of data and calculator

This surfaces a wider variety of neighborhoods to choose from, ones that keep costs in mind while still retaining walking access to a wide variety of local bars, restaurants, and coffee shops.

Next Steps

We are obsessed with finding new ways to ingest data in a scalable manner, such that it is in an analysis-ready format that takes advantage of Snowflake's core capabilities. We have done similar work with Census data, weather data, financial series, hundreds of thousands of boundary files, and dozens of other sources. We have an ambitious roadmap of data ingests to enrich our understanding of the areas in which real estate exists. Real estate is not a tabular problem filled with records of square footage, number of bedrooms, number of bathrooms, and so on. It is a highly complex geospatial problem that requires a deeper understanding of the areas in which homes exist, one that extends far beyond traditional real estate data. One of our missions as a data team is to index every property on the planet, and that requires a fundamentally different approach to data ingestion and standardization so that we have the best opportunity for machine learning at a global scale.

As for the remote work index, we are eager to integrate access to hiking trails, parks, mountains, beaches, and a variety of other features that allow users to better understand the characteristics of the locations they might want to call home, whether for a short-term stay or a longer-term situation.

If this work intrigues you, we would love your feedback on what you would like to see in a remote work index! Bonus: we are hiring and are looking for folks who start their day working on the data left behind, folks who love working with datasets that are too big and complex for most of the world to even consider looking at. Check out our careers page here.

Are you interested in Parcl Labs data? Please reach out to Jason Lewris.