Real Estate Data Platforms on AWS: Case Study

Author

BatchService

If I had to sum it up in one line: this case study shows how a real estate data platform used AWS to cut pipeline time, lower latency, and trim infrastructure cost while handling messy property data at scale.

Here’s the short version:

  • I’m dealing with fragmented U.S. real estate data spread across 500+ MLS systems plus tax, mortgage, ownership, and geospatial sources.
  • I need to normalize and deduplicate those feeds into one property record without long overnight jobs.
  • I’m using AWS Lambda, S3, Glue, Athena, DynamoDB, OpenSearch, Redis, API Gateway, and Cognito to split ingestion, storage, search, and access control into clear layers.
  • I’m cutting search time from about 200 ms to 20 ms with caching, and in some cases dropping read latency from 100 ms to 5 ms.
  • I’m shrinking a 14 million-item ingestion flow to under 30 minutes with parallel loading.
  • I’m lowering storage spend by moving images to WebP and tiering old listings in S3.
  • I’m also keeping tenant access, lineage, retention, and multi-region uptime in view, not treating them as afterthoughts.

What stands out is simple: the stack works best when each AWS service has one clear job. S3 stores, Glue and Athena process, DynamoDB serves record reads, OpenSearch handles search, and Redis caches hot queries.

For me, the main lesson is this: real estate platforms run better when ingestion, search, storage, and governance are planned together, not as separate projects.

Business Context and Platform Requirements

Users, Workflows, and Data Types

Those gains matter only if the platform works for several teams at once, without blowing past latency, governance, or budget limits.

Real estate data platforms serve investors, analysts, agents, and operations teams that use the same data in very different ways. Acquisition and investment teams use it to rank properties and markets. Portfolio analysts need dashboards that track model output over time. Leasing professionals and agents run multi-filter searches across property details, amenities, and pricing. Data operations teams spend a lot of time cleaning and joining messy inputs from many sources.

The same property record can support very different goals: acquisition, leasing, reporting, or data enrichment. And the data itself comes in many forms, including property records, ownership history, tax assessments, mortgage data, geospatial attributes, and verified contact data for skip tracing and outreach campaigns. AvalonBay Communities, a residential REIT managing 88,405 apartment homes across 12 states, handles 150,000+ multi-parameter searches per day and 3,500 lease applications monthly. Each of those actions depends on real-time property profile generation.

That workload mix has a direct effect on the stack. One team needs fast search. Another needs clean reporting tables. Another needs property enrichment pipelines that can keep up without falling apart.

Scale, Latency, and Governance Requirements

At this scale, targets aren’t vague. They’re hard numbers.

Fundrise, with 2 million users, built a quantitative reporting system that manages hundreds of billions of data points for investor shareholdings and dividends across time. That dataset keeps growing as investors, shareholdings, and time periods stack on top of each other. Their read target for investment performance data is under 20 ms.

Governance adds another layer. Teams need lineage, separation between raw and curated data, retention controls, and strict access control. They also need durable storage with automated lifecycle rules for contract document retention, plus multi-region active-active architectures to keep high-availability workflows running, including rent payments and lease execution.

So this isn’t just a storage problem. It’s a question of how the AWS stack splits raw data, curated data, and the paths users take to access each one.

Budget and Operating Constraints

Cost predictability is a hard limit, especially for small teams. Always-on compute can drain a budget fast, and per-query pricing can get ugly when usage climbs. Fundrise cut its total cost of ownership by 25% by moving from a relational database to object storage built for analytics.

There’s also the day-to-day operating reality. Many teams work with nightly ingestion windows, which means data has to be re-ingested and indexed before the business day starts. Serverless compute helps here because spending tracks actual use instead of sitting idle and billing anyway.

AWS Architecture for a Real Estate Data Platform

Ingestion, Storage, and Processing Layers

AWS Lambda normalizes incoming feeds into a single format. A composite key built from APN, address hash, and MLS ID deduplicates records into one canonical property. That normalization layer then feeds both the data lake and downstream search paths.

Amazon S3 acts as the base of the data lake, while Parquet keeps storage lean with columnar reads and built-in compression. AWS Glue runs batch ETL between S3 and downstream systems. Athena supports ad hoc analysis. Redshift handles reporting and modeling.

Realtor.com used this same setup at scale: Amazon Athena partitioned large S3 files into 25 concurrent streams, then AWS Glue Spark workers and the DynamoDB BatchWrite API loaded the data into DynamoDB. That setup cut 14 million-item ingestion from hours to under 30 minutes.

API, Search, and Multi-Tenant Delivery

Amazon OpenSearch powers geospatial queries and faceted filtering. DynamoDB manages listing state and single-digit-millisecond detail reads. ElastiCache (Redis) stores popular searches in cache, cutting response time from 200 ms to 20 ms. API Gateway authorizers and Amazon Cognito enforce tenant isolation without custom access code in each service. When traffic jumps 5–10x within hours, Lambda burst scaling and SQS buffering take the hit without year-round overprovisioning.

Put simply, each service has a clear job:

  • OpenSearch handles search
  • DynamoDB serves detail reads and state
  • Cognito and API Gateway control tenant access
  • Redis speeds up repeated queries

Where BatchData – Ivo Draginov Fits in the Stack

BatchData

In the enrichment layer, BatchData adds on-demand skip tracing, phone and address verification, geocoding, and Parquet delivery straight to S3. Those choices set up the cost and performance results below.

Redfin Manages Data on Hundreds of Millions of Properties Using AWS

Cost Optimization and Performance Results

AWS Real Estate Data Platform: Before vs. After Performance & Cost Metrics

AWS Real Estate Data Platform: Before vs. After Performance & Cost Metrics

Main Cost Drivers and Baseline Metrics

After launch, most spending came from three places: S3 storage, compute, and search infrastructure. The biggest pressure points were image delivery, nightly MLS refreshes, and seasonal search traffic. Property images by themselves made up 60% to 80% of total S3 spend. At the same time, nightly ETL jobs ran on oversized clusters that sat idle for much of the day, and OpenSearch domains were provisioned for peak traffic even during slower off-season periods.

The starting numbers made the problem pretty clear. Average search latency was about 200 ms. Full pipeline refreshes took multiple hours. And large analytical reads came in at about 100 ms.

Changes That Reduced Cost and Improved Speed

Each change focused on one pain point: images, nightly ingestion, or peak search demand.

Storage came first. Converting property image uploads to WebP and moving sold or expired listings to S3 Intelligent-Tiering cut image-related storage costs by 40% to 60%. For low-latency reads, S3 Express One Zone can bring GET latency down from 100 ms to 5 ms – a 95% drop – while also reducing request costs by up to 50% compared with S3 Standard.

On the compute side, Lambda took over from always-on normalization servers for MLS feeds, which meant the system could scale down to zero during off-hours. ETL also moved from full nightly refreshes to incremental updates. DynamoDB ingestion was then parallelized with the BatchWriteItem API across 25 concurrent streams, using the same parallel-load pattern that ingested 14 million items in under 30 minutes.

Search capacity got a similar cleanup. Reserved Instances handled baseline OpenSearch traffic, while On-Demand capacity absorbed the spring spikes that run 2 to 3 times higher than winter traffic.

Before-and-After Metrics

The payoff showed up in storage costs, pipeline runtime, and latency across the stack.

Metric Baseline Optimized Improvement
Monthly Infrastructure Cost ~$4,000 ~$40 99% reduction
Pipeline Duration 84 hours 2 hours 97.6% reduction
Read/Query Latency 100 ms 5 ms 95% reduction
Search Latency (with Cache) 200 ms 20 ms 90% reduction
Total Cost of Ownership 100% 75% 25% reduction

Outcomes, Lessons Learned, and Conclusion

Business Impact and Product Gains

These architecture decisions did more than improve infra dashboards. They changed how fast teams could work with property data and ship product updates.

Fundrise cut GET latency from 100 ms to 5 ms, reduced total processing time by 33%, and lowered TCO by 25% after moving to S3 Express One Zone. CoStar Group reduced Homes.com search latency to 30 ms and cut listing update time from minutes to 4 seconds. Compass reduced saved-search alert lag from 1 hour to 1 minute across 250,000 saved searches.

That kind of lift doesn’t happen from one tool alone. It comes from making sure ingestion, search, and storage don’t trip over each other at scale.

Implementation Lessons for Future Builds

Three patterns showed up again and again across these builds.

  • Isolate your ingestion connectors. One Lambda per MLS feed using a property and contact data API, with its own retry logic, helps stop one bad source from dragging down the full ingestion pipeline.
  • Choose tools for the workload you have, not the one you may have later. For small teams, DuckDB with Parquet on S3 can handle millions of records without the overhead of Spark or Databricks. Partitioning S3 files by unique keys can also remove the need for separate indexing and cut nightly processing by up to 33%.
  • Observability matters more than most teams expect. Dagster improves observability by tracking lineage instead of task runs.

Final Takeaway

AWS delivers real value when teams design storage, ingestion, search, and governance as one system.

FAQs

Why use separate AWS services for search, storage, and reads?

Using separate AWS services lets each part focus on what it does best. That usually means better performance and lower costs, instead of forcing one system to handle everything.

  • Amazon OpenSearch Service handles complex, real-time search
  • Amazon DynamoDB provides low-latency reads and state management
  • Amazon S3 offers low-cost, durable storage for large datasets, images, and documents

How does the platform unify duplicate property records from many sources?

The platform starts with a normalization engine that cleans up messy data from different sources. It standardizes inconsistent address formats, ownership spellings, and field structures into one clean, consistent schema.

Then it gives each property a unique identifier that acts as the canonical reference point. That ID links MLS listings, tax assessments, and other inputs, so duplicate records and split-up entries get merged before the data ever reaches the user.

Which AWS changes had the biggest impact on cost and latency?

The biggest gains came from moving to serverless systems, specialized storage, and event-driven setups.

For example, Amazon S3 Express One Zone can deliver sub-20 millisecond read latency and lower total cost of ownership by 25% compared with old-school relational database storage.

There was also a cost shift on the analytics side. Moving away from legacy-managed clusters and using Parquet on Amazon S3 with serverless analytics cuts out idle compute costs and avoids per-query pricing.

That matters more than it might seem at first glance. With cluster-based systems, you’re often paying for machines that sit around waiting. With serverless tools, you pay for use instead of keeping the engine running all day.

Related Blog Posts

Highlights

Share it

Author

BatchService

Share This content

suggested content

Skip Tracing

Skip Tracing Software

The Pre-Market Advantage: Reaching Sellers Before Competition

Your complete guide to skiptracing in California

Your Complete Guide To Skip Tracing In California