Parcl Labs Price Feed Whitepaper
Executive Summary
- The Parcl Labs Price Feed (PLPF) is an indicator that tracks daily price changes in residential real estate across multiple markets and property types using a simple metric: price per square foot.
- Existing data sources such as the Case Shiller Index provide a distorted view: they cover only specific types of properties (single family houses), consider only repeated sales, are released with a lag, and often lack granularity below the level of Metropolitan Areas.
- Parcl Labs developed an enterprise-level ETL process that ingests, cleans, and transforms hundreds of millions of individual data points and leverages spatial data science to generate price estimates for multiple levels of geography.
- Our methodology looks at representative segments of the market, filters outliers and other irregularities, creates smooth time series, and is tested on a daily basis to guarantee the quality of our indicator.
- The PLPF empowers users to make better-informed decisions and can be accessed through our user-friendly API.
Introduction
Residential real estate is the cornerstone of wealth accumulation for a typical individual and it is the largest asset class in the world, with a staggering value of 258 trillion dollars. Despite its dominance within the global economy, the industry is supported by an information ecosystem that is decades past its prime and features a myriad of incomplete and potentially biased data.
Residential real estate data is heavily fragmented, lives in silos, and is concentrated among local actors. Even where the information exists and can be accessed, it tends to be lagged and/or heavily transformed. As a result, buyers, sellers, and investors get only a fragmented and distorted view of their local real estate markets and can only make educated guesses about the asset they plan to acquire.
Existing data sources that aim to bring clarity to a market have serious shortcomings in how they present information. The most well-known and widely referenced residential real estate metric in the United States is the Standard & Poor's (S&P) CoreLogic Case-Shiller HPI, which we refer to below as the Case Shiller or the Case Shiller Index. The Case Shiller produces indexes for 20 metropolitan areas as well as for the country in aggregate. However, we contend that the methodology and output of this index are far from ideal. The index looks only at specific types of properties (single family houses), relies on a data cleaning process that other researchers have had difficulty replicating even when using an identical dataset, and is published with a lag of more than two months.
Further, the index exclusively considers repeated sales, which excludes a substantial and increasing number of multi-family home transactions. Analysis of Parcl Labs data for 2022 reveals that the Case Shiller Index's approach to calculating price changes would discard 50 percent of all transactions for the New York Metropolitan region. For metros like Boston and Miami, the Case Shiller methodology would exclude 61 and 64 percent, respectively, of the total transactions in those markets. The figure below shows these coverage gaps for the metropolitan areas that are part of the Case Shiller 10.
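To make the coverage gap concrete, the following is a minimal sketch of how such an exclusion share could be measured. It assumes a simplified rule (a sale is kept only if its property has at least two recorded sales); the function name and sample records are hypothetical and are not the analysis behind the figures quoted above.

```python
from collections import Counter


def repeat_sales_exclusion_share(transactions):
    """Share of transactions dropped by a simplified repeat-sales filter.

    `transactions` is a list of (property_id, sale_date) tuples. A sale is
    kept only when its property appears at least twice in the list.
    """
    if not transactions:
        raise ValueError("no transactions supplied")
    sales_per_property = Counter(pid for pid, _ in transactions)
    excluded = sum(1 for pid, _ in transactions if sales_per_property[pid] < 2)
    return excluded / len(transactions)


# Hypothetical records: property A sold twice, B and C sold once.
sample = [
    ("A", "2015-03-01"),
    ("A", "2022-06-10"),
    ("B", "2022-01-15"),
    ("C", "2022-09-30"),
]
share = repeat_sales_exclusion_share(sample)
print(f"excluded share: {share:.0%}")
```

In this toy sample the single-sale properties B and C are dropped, so half of the transactions never reach a repeat-sales calculation.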
The Parcl Labs Price Feed (PLPF) is the only currently available daily estimate of residential real estate price per square foot across multiple markets with varying geographic scopes (States, Metros, Cities, etc.) that looks at all available transactions in a systematic and standardized way. The PLPF allows users to monitor real-time price fluctuations in their respective markets with minimal lag, using an indicator (median price per square foot) that is easy to understand and that supports more representative comparisons with other markets.
This approach represents a significant departure from existing solutions such as the previously mentioned Case Shiller Index, which provides users with an index whose base level is set to 100 in January 2000, comparing the value of the average home in a given month with the value of the average home in January of 2000. While the Case Shiller Index does allow for comparison of properties across geographies, in our view, using such an index requires more steps and data work to understand how much a typical home costs in a given geography. The PLPF simplifies this by offering a normalized metric (median price per square foot) that can be compared across markets and used to gauge historical price fluctuations; a square foot of property is the same dimension across all imperial system markets. Future markets accustomed to the metric system will use a square meter as the unit of analysis.
At Parcl Labs we have developed a price feed (PLPF) that tracks price changes on a daily basis across the widest variety of markets for all property types, such that users can understand the real-time dynamics of pricing in their markets and in other markets across the country, and eventually around the world. The Parcl Labs Price Feed addresses the following:
- Provides an intuitive and easy-to-understand metric for evaluating residential real estate price movements
- Closes the gap between county and other administrative real estate records authorities and current market conditions, resulting in the first-ever real-time tracking indicator of residential real estate
- Generates estimates based on all available information within a given market, rather than relying solely on a subset of transactions
- Standardizes and integrates data from different recording systems to create a single source of truth
- Incorporates more timely information such as listings and other data feeds into the depiction of real estate market conditions
- Reduces asymmetries in access to residential real estate data, thereby empowering users
- Offers an index that dynamically adjusts to the velocity and volume of transactions occurring within any given market to provide an unbiased assessment of real estate prices
Having a price feed that updates daily, with clean and standardized data from multiple sources allows us to provide a more complete picture than relying on estimates for large geographies. Our research has shown that markets do not behave as a monolith and significant divergences can occur. As illustrated in the following figure, there are marked differences in price dynamics between the New York Metropolitan Statistical Area (MSA), Manhattan, and Brooklyn.
In the third quarter of 2023, Brooklyn and Manhattan both exhibited small increases followed by a downward trend in prices. However, during the same period, the New York metropolitan area, which encompasses cities from states such as Pennsylvania, Connecticut, and New Jersey, demonstrated a distinct trend, indicating a split in the New York metropolitan market. Indexes like the Case Shiller, or the House Price Index from the U.S. Federal Housing Finance Agency, fail to break down these types of regional divergences in a timely manner or with adequate geographical decomposition. Further, the price per square foot in Manhattan is nearly one thousand dollars higher than in the New York Metro Area.
We are able to provide a timely, accurate and detailed price feed thanks to our use of Data Science, Data Engineering and expertise in the real estate market. As with any model, the first step in creating the best price indicator for real estate is high quality, clean, and standardized data.
Data
Our approach solves the problems that have been present in real estate information since the genesis of recording real estate transactions, namely an incomplete and inaccurate data universe that lives in silos. Unlike indexes that rely only on single family homes that have sold at least once before, our data universe contains information from new construction, repeated sales, and properties with no previous sale information. Further, we include multiple types of housing such as single family homes, townhomes, condos, etc., and as such are able to present a more accurate depiction of real estate markets.
The data enrichment process also tackles one of the most important challenges in getting accurate and timely data: the use of historical records from county registrars that have heterogeneous timelines for publishing information. When a property is sold, it can take anywhere from 2 weeks to 6 months to make it into the corresponding county register. To address this issue, we employ a range of sources and compare them against historical records to improve the accuracy and timeliness of our data. This is especially relevant in markets where the volume of transactions is volatile and additional data points are required to produce a more accurate estimate of real estate prices. Our research has shown that listing prices and real estate prices are strongly correlated, with a correlation coefficient of 0.89, as depicted in figure 3.
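For readers who want to run a similar check on their own data, here is a minimal sketch of a Pearson correlation between listing and sale prices. The text does not state which correlation coefficient was used, so Pearson is an assumption, and the two short series are made-up values, not Parcl Labs data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical price-per-square-foot series for the same market and dates.
listing_ppsf = [500.0, 510.0, 525.0, 520.0, 540.0]
sale_ppsf = [480.0, 495.0, 505.0, 510.0, 520.0]
r = pearson(listing_ppsf, sale_ppsf)
print(f"correlation: {r:.2f}")
```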
Collecting and integrating the data is only the beginning. To extract valuable insights, we have developed an enterprise-level ETL (extract, transform, load) process that enables us to handle hundreds of millions of data points. However, since each data source has its own idiosyncrasies and time lags, we need to undertake a rigorous data cleaning and harmonization process that includes the following steps:
- Cleaning, de-duplicating, and standardizing property addresses to obtain the highest quality record of each property. For example, the address "2323 West Av" can also appear as "2323 w Av" or "2323 w Avenue". We run a process that compares these variants and resolves them to a single source of truth.
- To further increase the confidence in our data we validate it using a third party to guarantee its integrity: This is an important step as it ensures that our assumptions are validated by external sources.
- After performing an initial data pass, we conduct a reconciliation process to ensure that the most current information associated with each record is consistent. For instance, a property may be recorded as having 1700 square feet in one source while another source has it at 1600. To resolve these discrepancies, we undergo an iterative process that examines various sources and time periods. This helps us arrive at a reliable resolution.
- After the data has been cleaned, standardized, and processed, we utilize the latest geographic information systems technology to assign properties to multiple types of markets. Using spatial data science, we are able to assign a property to its corresponding neighborhood, city, county, or other desired geographical construct, offering the flexibility to build PLPF at the desired geography level.
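The address standardization and reconciliation steps in the list above can be sketched as follows. This is an illustration under stated assumptions, not the production pipeline: the abbreviation tables are tiny and hypothetical, and taking the median of reported square footage stands in for the iterative, multi-source reconciliation the text describes.

```python
import re
from statistics import median

# Illustrative lookup tables; a real pipeline would need far larger ones.
SUFFIXES = {"av": "avenue", "ave": "avenue", "avenue": "avenue",
            "st": "street", "street": "street"}
DIRECTIONS = {"w": "west", "e": "east", "n": "north", "s": "south"}


def normalize_address(raw):
    """Lowercase, strip punctuation, and expand common abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    normalized = []
    for i, token in enumerate(tokens):
        if token in DIRECTIONS:
            normalized.append(DIRECTIONS[token])
        elif i == len(tokens) - 1 and token in SUFFIXES:
            # Only expand a street suffix when it is the final token.
            normalized.append(SUFFIXES[token])
        else:
            normalized.append(token)
    return " ".join(normalized)


def reconcile_square_feet(reports):
    """Pick one value from conflicting reports (assumption: the median)."""
    return median(reports)


variants = ["2323 West Av", "2323 w Av", "2323 w Avenue"]
keys = {normalize_address(v) for v in variants}
print(keys)
print(reconcile_square_feet([1700, 1600, 1700]))
```

All three spellings collapse to the same key, which is what allows records from different sources to be de-duplicated before any price is computed.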
Once the data standardization process is completed, we have a unique, state-of-the-art database from which to build a scalable and timely PLPF for any desired level of geography. This is a necessary step before we can build the most reliable and timely price feed for residential real estate.
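Assigning a property to a market is, at its core, a point-in-polygon test. The sketch below uses a plain ray-casting check on hypothetical square boundaries; the market names and coordinates are invented for illustration, and a production system would more likely rely on a GIS library and real boundary files.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test. `polygon` is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_market(lon, lat, markets):
    """Return the first market whose boundary contains the point, else None."""
    for name, polygon in markets.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Hypothetical boundaries: two adjacent unit squares.
markets = {
    "market_a": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "market_b": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
print(assign_market(0.5, 0.5, markets))
print(assign_market(1.5, 0.5, markets))
```

Running the same test against neighborhood, city, and county boundaries is what lets one cleaned record feed price series at several geographic levels.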
Parcl Labs Price Feed (PLPF) Methodology
The Parcl Labs Price Feed (PLPF) is based on a multi-stage approach that ensures the reliability and consistency of our price estimates. We take the data created in the previous step and transform it into a final time series for each market. This process consists of three stages that correct for volatility and market idiosyncrasies, combine historical and more timely series in a logical and consistent manner, and test the estimates produced to ensure the reliability of the data.
Time Series Correction and Smoothing
We use a correction method that is robust to outliers and representative of the markets we cover to adjust our daily estimates. Real estate data is often skewed and can be distorted by the impact of large outliers. In the next figure we see that the vast majority of sales are concentrated in the range of $11 to $1,900 per square foot, even though transactions for luxury properties appear in the right tail of the distribution. Using a simple average generates a price per square foot of $653, while the more representative median price is $563, a whopping $90 difference.
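The pull that a few luxury sales exert on the average is easy to reproduce. The values below are hypothetical prices per square foot, chosen only to show the effect; they are not the data behind the figures above.

```python
from statistics import mean, median

# Hypothetical prices per square foot with one luxury outlier at the end.
ppsf = [420, 480, 530, 560, 575, 610, 650, 700, 2400]

avg = mean(ppsf)
med = median(ppsf)
print(f"mean:   {avg:.0f}")
print(f"median: {med:.0f}")
```

A single outlier lifts the mean well above every typical sale in the sample, while the median stays at the middle observation.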
Given how skewed the data is, we only look at sales that fall between the 35th and 65th percentiles of the price distribution to limit the impact of outliers on our analysis. With this more representative sample, we then use a moving median to build daily sales price estimates. This sample space captures the movement of the most representative parts of the markets, dynamically adjusts to shifts in the underlying distribution of transactions, and offers a more stable price estimate.
The PLPF adds another step in refining how we impute information for each market by using a dynamic backpropagation (look-back) window based on the volatility of transactions. This simply looks at how many transactions are available in each market over a given period before deciding how far back in time to look when creating a sample space. While the availability of data in a metropolitan area may merit a short window of time, a sparse geography may require a longer period to build stable samples. The following figure shows the difference in the volume of transactions available for Manhattan and for Tribeca, a popular and sought-after neighborhood in New York City.
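A minimal sketch of the percentile-band filter follows. The 35th and 65th percentile bounds come from the text; the linear-interpolation percentile, the function names, and the fallback for very small samples are assumptions made for this example.

```python
from statistics import median


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    if not sorted_values:
        raise ValueError("no values supplied")
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def band_median(values, low_q=35, high_q=65):
    """Median of the observations between the low and high percentiles."""
    ordered = sorted(values)
    low = percentile(ordered, low_q)
    high = percentile(ordered, high_q)
    band = [v for v in ordered if low <= v <= high]
    # Assumption: with too few points the band can be empty, so fall back.
    return median(band) if band else median(ordered)


# Hypothetical prices per square foot, including one extreme sale.
ppsf = [420, 480, 530, 560, 575, 610, 650, 700, 2400]
print(band_median(ppsf))
```

Only the three central observations survive the filter here, so the extreme sale has no influence on the estimate.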
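One way to picture the dynamic window is as a search for the shortest look-back period that still contains enough sales. The sketch below is an assumption about how such a rule could work; the minimum sample size and the candidate window lengths are hypothetical parameters, not values from the PLPF.

```python
from datetime import date, timedelta


def dynamic_window_days(sale_dates, as_of, min_sales=30,
                        candidate_windows=(7, 14, 30, 60, 90, 180)):
    """Shortest candidate window (in days) holding at least `min_sales` sales.

    Falls back to the longest candidate when no window has enough sales.
    """
    for days in candidate_windows:
        start = as_of - timedelta(days=days)
        count = sum(1 for d in sale_dates if start < d <= as_of)
        if count >= min_sales:
            return days
    return candidate_windows[-1]


as_of = date(2023, 9, 30)
# Hypothetical dense market: 70 sales spread over the last 7 days.
dense = [as_of - timedelta(days=i % 7) for i in range(70)]
# Hypothetical sparse market: one sale every 3 days.
sparse = [as_of - timedelta(days=3 * i) for i in range(40)]

print(dynamic_window_days(dense, as_of))
print(dynamic_window_days(sparse, as_of))
```

The dense market settles on the shortest window, while the sparse one has to reach back roughly three months to gather the same number of sales.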
To ensure that the sample is representative we select only observations within the 35th to 65th percentile of the distribution. Next, we calculate the daily median price per square foot for each market, using a window range that is appropriate for the characteristics of that particular market. This process can be represented by the following formula:
where t_i is the dynamic window for market i.
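In notation assumed here for illustration (only the window symbol comes from the text), the windowed, percentile-filtered median described above could be written as:

```latex
\[
\hat{P}_{i,d} \;=\; \mathrm{median}\Bigl\{\, p_j \;:\; j \in S_i,\;\;
d - t_i < d_j \le d,\;\;
Q^{0.35}_{i,d} \le p_j \le Q^{0.65}_{i,d} \Bigr\}
\]
```

Here $S_i$ is the set of transactions in market $i$, $p_j$ and $d_j$ are the price per square foot and date of transaction $j$, $d$ is the day being estimated, and $Q^{0.35}_{i,d}$ and $Q^{0.65}_{i,d}$ are the 35th and 65th percentiles of the prices that fall inside the window.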
We apply this smoothing and filtering process to both historical sales and more timely data across all markets. By using a dynamic moving median, we are able to smooth out price fluctuations, capture short- and long-term trends, and regularize market idiosyncrasies. The figure below illustrates the effects of applying our filtering process to sales data in the Pittsburgh MSA versus using raw, unfiltered data.
Calculation of Price levels using exponential decay weights
After smoothing the sales data and incorporating more up-to-date data sources, we generate a new estimate that combines historical sales data with real-time information. This approach enables us to identify rapidly changing market conditions that may not be reflected in traditional indicators such as sales data alone. For instance, if a market is experiencing a downturn, metrics such as listings will capture the emerging trend first. To blend these two time series, we use a weighted average with weights that decay exponentially. This process assigns greater weight to recent observations while still preserving the influence of more distant data points, which is particularly relevant for more timely data sources. The index can be represented by the following formulas:
Here, p_si is the moving median sales price of market i, which is multiplied by the sales weight w_si, and p_ti is the median price from timely sources, which is multiplied by the timely-source weight w_ti. The weight w_ti is defined by the decay factor λ. By combining traditional sales with timely sources, this formula generates a daily estimate. Furthermore, our model can still provide estimates for markets with limited timely sources by relying solely on sales transactions when timely data is not available. Finally, we apply a 7-day smoothing process to further minimize the impact of market fluctuations.
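The blending step can be sketched as below. The exact weight definitions are not given in the text, so this example assumes that the timely weight is `max_timely_weight * exp(-decay * age_in_days)` and that the sales weight is its complement; those formulas, the parameter values, and the use of a trailing mean for the 7-day smoothing are all assumptions made for illustration.

```python
import math


def blend_price(sales_median, timely_median, timely_age_days,
                decay=0.1, max_timely_weight=0.5):
    """Weighted average of sales and timely medians.

    Assumed weights: w_timely = max_timely_weight * exp(-decay * age),
    w_sales = 1 - w_timely. With no timely data, use sales alone.
    """
    if timely_median is None:
        return sales_median
    w_timely = max_timely_weight * math.exp(-decay * timely_age_days)
    w_sales = 1.0 - w_timely
    return w_sales * sales_median + w_timely * timely_median


def rolling_mean(series, window=7):
    """Trailing mean over up to `window` of the most recent values."""
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical inputs: fresh listings pull the estimate toward 520.
fresh = blend_price(500.0, 520.0, timely_age_days=0)
stale = blend_price(500.0, 520.0, timely_age_days=100)
print(fresh, round(stale, 3))
```

Under these assumptions a fresh timely observation moves the estimate halfway toward the listing signal, while a stale one leaves it essentially at the sales median.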
While seasonal adjustment is the norm for monthly and quarterly series, there is no consensus on the best methodology for adjusting daily time series. This is because intra-week and market-specific irregular factors can vary significantly between markets, making it challenging to develop a one-size-fits-all approach. Additionally, even for seasonally adjusted monthly estimates, the seasonal components exhibit irregularities that accentuate lagging data points, which has led to calls to use unadjusted estimates in periods of market volatility such as the one we are living through at the end of the Covid-19 housing boom. Our method of adjusting the data for irregularities also allows us to scale to thousands of different markets across the USA, with global markets coming soon.
Dynamic testing of the data for price irregularities daily
To guarantee the consistency and reliability of our data, we run tests tailored to each of the markets available in our API before publishing a data update. This testing takes into consideration abnormal behavior in the different data sources that compose our database, the local market idiosyncrasies that explain volatility in volume and prices, and a geographic factor that further adjusts for the volatility of our series. The result is a time series that has been rigorously tested for sudden movements in the price per square foot.
Finally, as part of our effort to ensure data consistency and transparency, we performed a correlation analysis between our PLPF and the metros listed in the Case Shiller 20 index. We compared the monthly median prices from our PLPF with the non-seasonally adjusted Case Shiller index for each metro where data was available. The table below demonstrates the close alignment of our numbers, with an average positive correlation of 0.98. And while the Case Shiller has a lag of three and a half months, our PLPF is updated daily.
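A simplified version of such a pre-publication check might flag any day whose price moves more than a market-specific threshold. The function and the 2 percent threshold below are hypothetical; the text describes thresholds that also account for data-source behavior and geography, which this sketch does not model.

```python
def flag_sudden_moves(series, max_daily_change=0.02):
    """Indices where the day-over-day change exceeds `max_daily_change`.

    `series` is a list of daily prices per square foot; the threshold is a
    fraction (0.02 means 2 percent) and would be tuned per market.
    """
    flagged = []
    for i in range(1, len(series)):
        previous, current = series[i - 1], series[i]
        if previous > 0 and abs(current - previous) / previous > max_daily_change:
            flagged.append(i)
    return flagged


# Hypothetical daily series with one abrupt jump on the fourth day.
daily_ppsf = [560.0, 561.0, 559.5, 590.0, 591.0]
suspicious = flag_sudden_moves(daily_ppsf)
print(suspicious)
```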
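Comparing a daily feed with a monthly index first requires collapsing the daily values to one number per month. The sketch below shows that aggregation step using the median, as the text describes; the function name and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def monthly_medians(daily):
    """Collapse a {date: price} mapping to {(year, month): median price}."""
    by_month = defaultdict(list)
    for day, value in daily.items():
        by_month[(day.year, day.month)].append(value)
    return {month: median(values) for month, values in sorted(by_month.items())}


# Hypothetical daily prices per square foot.
daily = {
    date(2023, 1, 1): 500.0,
    date(2023, 1, 2): 510.0,
    date(2023, 1, 3): 520.0,
    date(2023, 2, 1): 530.0,
}
monthly = monthly_medians(daily)
print(monthly)
```

The resulting monthly series can then be aligned by month with the published index values before a correlation is computed.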
Conclusion
Residential real estate tends to be siloed, with incomplete snapshots of the market available for users either by limiting the estimates to a specific type of property and transactions (e.g. repeated sales of single family homes) or due to a long lag period. This has resulted in a lack of information that is complete and reliable for everyday users.
The Parcl Labs Price Feed (PLPF) provides a daily estimate of price per square foot of residential real estate across multiple markets and property types. We do this by cleaning and standardizing millions of traditional and real-time data points, ingesting them into our data warehouse, and applying time series correction and smoothing to hundreds of different time series.
We break down asymmetries in access to residential real estate data to empower users to make better and more informed decisions around real estate. The PLPF provides a comprehensive solution to the problems associated with the real estate information ecosystem, offering a reliable and timely price feed that allows buyers, sellers, and investors to make informed decisions based on accurate and real-time market data.
Sign up for the API here.