Listing URL Scraper

A TypeScript scraper that extracts property listing URLs and addresses from web pages, tuned for Zillow-style listing markup.

Files

scraper.ts - Main scraper functions and utilities
scraper-api.ts - HTTP API endpoint for the scraper
test-scraper.ts - Test file with sample usage
README.md - This documentation

Features

Extracts listing URLs and addresses from HTML content
Handles both absolute and relative URLs
Handles network failures, invalid HTML, missing elements, and malformed URLs
Provides both a programmatic API and an HTTP endpoint
Sends a browser-like User-Agent header to reduce the chance of being blocked as a bot

Usage

Programmatic Usage

import { scrapeListingUrls, getListingUrls } from "./scraper.ts";

// Get full listing data (URL + address)
const listings = await scrapeListingUrls("https://www.zillow.com/san-francisco-ca/");
console.log(listings);
// Output: [{ url: "https://...", address: "123 Main St..." }, ...]

// Get just the URLs
const urls = await getListingUrls("https://www.zillow.com/san-francisco-ca/");
console.log(urls);
// Output: ["https://www.zillow.com/homedetails/...", ...]

HTTP API Usage

The scraper is also available as an HTTP endpoint:

GET Request (Test with sample data):

curl https://your-val-town-url.web.val.run

POST Request (Scrape a real URL):

curl -X POST https://your-val-town-url.web.val.run \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.zillow.com/san-francisco-ca/"}'

HTML Pattern Matching

The scraper looks for HTML elements matching this pattern:

<a href="[URL]" data-test="property-card-link" [other-attributes]>
  <address>[ADDRESS]</address>
</a>

It uses flexible regex patterns to handle variations in:

Attribute order
CSS classes
Whitespace
Relative vs absolute URLs
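
The matching described above can be sketched roughly as follows. This is an illustration only, not the code in scraper.ts: the names `ListingSketch`, `BASE_URL`, and `parseCardsSketch` are hypothetical, the base URL is an assumption, and the sketch only handles double-quoted attributes.

```typescript
// Hypothetical shape; the real ListingData type in scraper.ts may differ.
interface ListingSketch {
  url: string;
  address: string;
}

// Assumed base used to resolve relative hrefs.
const BASE_URL = "https://www.zillow.com";

function parseCardsSketch(html: string): ListingSketch[] {
  const results: ListingSketch[] = [];
  // Match every <a ...>...</a> whose opening tag contains
  // data-test="property-card-link", wherever that attribute sits.
  const anchorRe =
    /<a\b([^>]*\bdata-test\s*=\s*"property-card-link"[^>]*)>([\s\S]*?)<\/a>/gi;
  const hrefRe = /\bhref\s*=\s*"([^"]*)"/i;
  const addressRe = /<address[^>]*>([\s\S]*?)<\/address>/i;

  let match: RegExpExecArray | null;
  while ((match = anchorRe.exec(html)) !== null) {
    const href = hrefRe.exec(match[1]);
    const address = addressRe.exec(match[2]);
    if (!href || !address) continue; // skip cards missing either piece
    try {
      results.push({
        // new URL(relative, base) resolves relative hrefs and leaves
        // absolute ones unchanged.
        url: new URL(href[1], BASE_URL).toString(),
        address: address[1].trim(),
      });
    } catch {
      // Unparseable href: skip this card.
    }
  }
  return results;
}
```

Capturing the whole attribute string first and then searching it for `href` is what makes the attribute order irrelevant.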

Functions

`scrapeListingUrls(url: string): Promise<ListingData[]>`

Main function that fetches a webpage and extracts all listing URLs and addresses.

`parseListingUrls(html: string): ListingData[]`

Parses HTML content to extract listing data without making HTTP requests.

`getListingUrls(url: string): Promise<string[]>`

Convenience function that returns just the URLs as an array of strings.

`validateUrls(urls: string[]): string[]`

Utility function to filter out invalid URLs.
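
As a rough idea of what such a filter might do, the sketch below keeps only strings that parse as absolute http(s) URLs. The name `validateUrlsSketch` and the http/https restriction are assumptions; the actual `validateUrls` may apply different rules.

```typescript
// Hypothetical stand-in for validateUrls; the real rules may differ.
function validateUrlsSketch(urls: string[]): string[] {
  return urls.filter((candidate) => {
    try {
      // The URL constructor throws on strings that are not absolute URLs.
      const parsed = new URL(candidate);
      return parsed.protocol === "http:" || parsed.protocol === "https:";
    } catch {
      return false;
    }
  });
}
```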

Error Handling

The scraper handles the following failure cases:

Network failures
Invalid HTML
Missing elements
Malformed URLs
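
One way to cover some of these cases is to return a result object instead of throwing. The sketch below is an assumption about structure, not the scraper's actual code: `FetchResultSketch`, `FetchLike`, and `fetchHtmlSketch` are hypothetical names, and the fetch function is passed in so the logic can be exercised without network access. It covers malformed URLs, network failures, non-2xx responses, and empty bodies only.

```typescript
// Hypothetical result type; the real scraper may report errors differently.
type FetchResultSketch =
  | { success: true; html: string }
  | { success: false; error: string };

// Minimal fetch-like signature; the global fetch satisfies it.
type FetchLike = (
  url: string,
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

async function fetchHtmlSketch(
  url: string,
  fetchImpl: FetchLike,
): Promise<FetchResultSketch> {
  // Malformed URL: reject before making any request.
  try {
    new URL(url);
  } catch {
    return { success: false, error: `Malformed URL: ${url}` };
  }
  try {
    const response = await fetchImpl(url);
    if (!response.ok) {
      return { success: false, error: `Request failed with HTTP ${response.status}` };
    }
    const html = await response.text();
    if (html.trim() === "") {
      return { success: false, error: "Empty response body" };
    }
    return { success: true, html };
  } catch (err) {
    // Network failure or any other rejection from the fetch implementation.
    return { success: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

In real use the second argument would typically be the global `fetch`, for example `fetchHtmlSketch(url, fetch)`.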

Rate Limiting & Ethics

When scraping websites:

Be respectful of rate limits
Check robots.txt
Consider the website's terms of service
Add delays between requests for large-scale scraping
Use appropriate User-Agent headers
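
Adding delays between requests can be as simple as processing URLs one at a time with a pause in between. In the sketch below, `sleep`, `scrapeSequentiallySketch`, and the 2000 ms default are illustrative assumptions rather than part of scraper.ts; choose a delay that suits the target site.

```typescript
// Resolve after the given number of milliseconds.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: runs scrapeOne for each URL in order,
// waiting delayMs between consecutive requests.
async function scrapeSequentiallySketch<T>(
  urls: string[],
  scrapeOne: (url: string) => Promise<T>,
  delayMs = 2000, // assumed default, not taken from the scraper
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // no wait before the first request
    results.push(await scrapeOne(urls[i]));
  }
  return results;
}
```

For example, `scrapeSequentiallySketch(pageUrls, getListingUrls)` would call `getListingUrls` on each page in turn, where `pageUrls` is whatever list of search pages you want to scrape.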

Example Output

{
  "success": true,
  "count": 2,
  "listings": [
    {
      "url": "https://www.zillow.com/homedetails/1020-Pierce-St-A-San-Francisco-CA-94115/2113064552_zpid/",
      "address": "1020 Pierce St #A, San Francisco, CA 94115"
    },
    {
      "url": "https://www.zillow.com/homedetails/456-Oak-St-San-Francisco-CA-94102/123456789_zpid/",
      "address": "456 Oak St, San Francisco, CA 94102"
    }
  ]
}
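
A client consuming this response might describe it with a type and check it at runtime before use. The sketch below is derived only from the example above; `ListingEntrySketch`, `ScrapeResponseSketch`, and `isScrapeResponseSketch` are hypothetical names, and the actual API may include fields not shown here (for instance on failure).

```typescript
// Shapes inferred from the example output; not taken from scraper-api.ts.
interface ListingEntrySketch {
  url: string;
  address: string;
}

interface ScrapeResponseSketch {
  success: boolean;
  count: number;
  listings: ListingEntrySketch[];
}

// Runtime check that a parsed JSON value has the shape shown above.
function isScrapeResponseSketch(value: unknown): value is ScrapeResponseSketch {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const listings = record["listings"];
  if (typeof record["success"] !== "boolean") return false;
  if (typeof record["count"] !== "number") return false;
  if (!Array.isArray(listings)) return false;
  return listings.every((item) => {
    if (typeof item !== "object" || item === null) return false;
    const entry = item as Record<string, unknown>;
    return typeof entry["url"] === "string" && typeof entry["address"] === "string";
  });
}
```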