scrape-hws
A Val Town application that scrapes posts from Reddit's /r/hardwareswap subreddit and stores them in a Supabase database.
- Automated scraping of /r/hardwareswap posts using Reddit's official OAuth API
- Stores posts in a Supabase PostgreSQL database
- Duplicate detection to avoid storing the same post twice
- Detailed logging and statistics
- Runs on a cron schedule (configurable in the Val Town UI)
- Secure OAuth authentication with automatic token refresh
- Go to Reddit App Preferences
- Click "Create App" or "Create Another App"
- Fill out the form:
  - Name: Your app name (e.g., "Val Town Scraper")
  - App type: Select "script"
  - Description: Optional description
  - About URL: Leave blank or add your website
  - Redirect URI: Use `http://localhost:8080` (required but not used)
- Click "Create app"
- Note down your Client ID (under the app name) and Client Secret
- Create a new project in Supabase
- Go to the SQL Editor in your Supabase dashboard
- Copy and paste the contents of `database-schema.sql` and run it
- Go to Settings > API to get your project URL and anon key
Set these environment variables in your Val Town settings:
Supabase:
- `SUPABASE_URL`: Your Supabase project URL (e.g., https://your-project.supabase.co)
- `SUPABASE_ANON_KEY`: Your Supabase anon/public key

Reddit API:
- `REDDIT_CLIENT_ID`: Your Reddit app's client ID
- `REDDIT_CLIENT_SECRET`: Your Reddit app's client secret
- `REDDIT_USER_AGENT`: Optional custom user agent (defaults to "Val Town Reddit Scraper 1.0")
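As a rough illustration of how these values are consumed, a Val Town val (which runs on Deno) could read them and build the Supabase client like this; the import URL and exact wiring are assumptions, since the real setup lives in `reddit-scraper.ts`:

```ts
// Sketch: reading the required environment variables in a Val Town (Deno) runtime.
// The esm.sh import and variable names are illustrative, not copied from reddit-scraper.ts.
import { createClient } from "https://esm.sh/@supabase/supabase-js@2";

const SUPABASE_URL = Deno.env.get("SUPABASE_URL");
const SUPABASE_ANON_KEY = Deno.env.get("SUPABASE_ANON_KEY");
const REDDIT_CLIENT_ID = Deno.env.get("REDDIT_CLIENT_ID");
const REDDIT_CLIENT_SECRET = Deno.env.get("REDDIT_CLIENT_SECRET");
const REDDIT_USER_AGENT = Deno.env.get("REDDIT_USER_AGENT") ?? "Val Town Reddit Scraper 1.0";

if (!SUPABASE_URL || !SUPABASE_ANON_KEY) throw new Error("Missing Supabase credentials");
if (!REDDIT_CLIENT_ID || !REDDIT_CLIENT_SECRET) throw new Error("Missing Reddit credentials");

// Client used for all reads and writes against the posts table.
const supabase = createClient(SUPABASE_URL, SUPABASE_ANON_KEY);
```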
- Set `reddit-scraper.ts` as a cron trigger in Val Town
- Configure the schedule in the Val Town web UI (recommended: every 30 minutes)
- Example cron expressions:
  - Every 30 minutes: `*/30 * * * *`
  - Every hour: `0 * * * *`
  - Every 15 minutes: `*/15 * * * *`
The `posts` table contains:
- `id`: Primary key (auto-increment)
- `reddit_id`: Unique Reddit post ID
- `reddit_original`: Full Reddit post data as JSON
- `title`: Post title
- `created_at`: When the post was created on Reddit
- `updated_at`: When the record was last updated in our database
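For reference, a row from this table can be modeled in TypeScript roughly as follows; the field types are inferred from the descriptions above, not taken from `database-schema.sql`:

```ts
// Sketch of a `posts` row as the scraper sees it; types are inferred, not authoritative.
interface PostRow {
  id: number;                               // primary key (auto-increment)
  reddit_id: string;                        // unique Reddit post ID
  reddit_original: Record<string, unknown>; // full Reddit post object stored as JSON
  title: string;                            // post title
  created_at: string;                       // when the post was created on Reddit
  updated_at: string;                       // when the record was last updated in our database
}
```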
You can manually trigger the scraper by running the `reddit-scraper.ts` val.
Once configured as a cron job, it will automatically:
- Authenticate with Reddit using OAuth client credentials
- Fetch the latest 25 posts from /r/hardwareswap
- Check for duplicates in the database
- Save new posts to Supabase
- Log statistics about the scraping session
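Put together, one scraping run follows roughly the shape below. The helper names (`getRedditToken`, `fetchNewPosts`) and the exact Supabase calls are illustrative, not the actual implementation in `reddit-scraper.ts`:

```ts
// Illustrative outline of one cron run; helper names are hypothetical.
export default async function scrape() {
  const token = await getRedditToken();          // 1. authenticate via OAuth client credentials
  const posts = await fetchNewPosts(token, 25);  // 2. fetch the latest 25 posts from /r/hardwareswap
  let saved = 0, skipped = 0;

  for (const post of posts) {
    const { data: existing } = await supabase    // 3. check for duplicates by reddit_id
      .from("posts")
      .select("id")
      .eq("reddit_id", post.id)
      .maybeSingle();
    if (existing) { skipped++; continue; }

    await supabase.from("posts").insert({        // 4. save new posts to Supabase
      reddit_id: post.id,
      title: post.title,
      reddit_original: post,
      created_at: new Date(post.created_utc * 1000).toISOString(),
    });
    saved++;
  }

  console.log(`Scrape complete: ${saved} new, ${skipped} duplicates`); // 5. log statistics
}
```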
The scraper uses Reddit's Client Credentials OAuth flow:
- Authenticates using your app's client ID and secret
- Receives an access token from Reddit
- Uses the token to make authenticated API requests
- Automatically refreshes the token if it expires
This approach is more reliable than using Reddit's public JSON endpoints and respects Reddit's rate limits.
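In sketch form, the client-credentials exchange against Reddit's token endpoint looks something like this; the endpoint and grant type are Reddit's documented OAuth flow, while the surrounding code is an assumption about how the scraper wires it up:

```ts
// Sketch: obtain an application-only access token from Reddit (client credentials grant).
async function getRedditToken(): Promise<string> {
  const credentials = btoa(`${REDDIT_CLIENT_ID}:${REDDIT_CLIENT_SECRET}`);
  const res = await fetch("https://www.reddit.com/api/v1/access_token", {
    method: "POST",
    headers: {
      "Authorization": `Basic ${credentials}`,
      "Content-Type": "application/x-www-form-urlencoded",
      "User-Agent": REDDIT_USER_AGENT,
    },
    body: "grant_type=client_credentials",
  });
  if (!res.ok) throw new Error(`OAuth error: ${res.status}`);
  const { access_token } = await res.json();
  return access_token; // sent as "Authorization: Bearer <token>" to https://oauth.reddit.com/...
}
```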
- Reddit allows 60 requests per minute for OAuth applications
- The scraper fetches 25 posts per run, well within limits
- Recommended cron schedule: every 30 minutes or longer
Check the Val Town logs to monitor:
- Number of new posts scraped
- Number of duplicates skipped
- Any errors during scraping
- Performance metrics
- Missing environment variables: Ensure all required Reddit and Supabase credentials are set
- Database connection errors: Verify your Supabase credentials and that the table exists
- Reddit OAuth errors: Check your Reddit app credentials and ensure the app type is "script"
- Rate limiting: Reddit may temporarily block requests if rate limits are exceeded
- Duplicate key errors: The scraper checks for duplicates, but race conditions might occur (see the sketch below)
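One way to make saves robust against that race is to rely on the unique constraint on `reddit_id` and upsert instead of insert. This is a suggestion under the assumption that the constraint exists, not necessarily what `reddit-scraper.ts` does today:

```ts
// Sketch: let Postgres enforce uniqueness on reddit_id instead of racing a select-then-insert.
// `supabase` is the client from the setup sketch above; `post` is one Reddit post object.
async function savePostIdempotently(post: { id: string; title: string }) {
  const { error } = await supabase
    .from("posts")
    .upsert(
      { reddit_id: post.id, title: post.title, reddit_original: post },
      { onConflict: "reddit_id", ignoreDuplicates: true }, // silently skip rows that already exist
    );
  if (error) console.error("Error saving post", error);
}
```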
Error messages you may see in the logs:
- `Missing Supabase credentials`: Set SUPABASE_URL and SUPABASE_ANON_KEY
- `Missing Reddit credentials`: Set REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET
- `OAuth error`: Check your Reddit app credentials and app type
- `Reddit API error`: May indicate rate limiting or API issues
- `Error saving post`: Check Supabase connection and table schema
You can query your scraped data directly in Supabase:
```sql
-- Get recent posts
SELECT title, created_at, reddit_original->>'score' AS score
FROM posts
ORDER BY created_at DESC
LIMIT 10;

-- Search posts by title
SELECT title, created_at
FROM posts
WHERE title ILIKE '%gpu%'
ORDER BY created_at DESC;

-- Get posts by author
SELECT title, created_at
FROM posts
WHERE reddit_original->>'author' = 'username'
ORDER BY created_at DESC;
```
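The same data can also be read from another val with the Supabase client; a small sketch, assuming the `supabase` client from the setup section above:

```ts
// Sketch: read the 10 most recent posts from another val, reusing the Supabase client.
const { data: recent, error } = await supabase
  .from("posts")
  .select("title, created_at, reddit_original->>score")
  .order("created_at", { ascending: false })
  .limit(10);
if (error) throw error;
console.table(recent);
```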
Feel free to modify the scraper to:
- Add more subreddits
- Include additional post metadata
- Add data processing or analysis features
- Integrate with other services