# scrape-hws
A Val Town application that scrapes posts from Reddit's /r/hardwareswap subreddit and stores them in a Supabase database.
## Features

- Automated scraping of /r/hardwareswap posts
- Stores posts in a Supabase PostgreSQL database
- Duplicate detection to avoid storing the same post twice
- Detailed logging and statistics
- Runs on a cron schedule (configurable in the Val Town UI)
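The duplicate detection described above boils down to filtering fetched posts against the set of Reddit IDs already stored. A minimal sketch (the function and type names are illustrative, not the actual scraper code):

```typescript
// Minimal sketch of duplicate detection; names and shapes are illustrative.
interface RedditPost {
  id: string; // Reddit's post ID, stored as reddit_id in the database
  title: string;
}

// Keep only posts whose ID is not already present in the database.
function filterNewPosts(fetched: RedditPost[], existingIds: Set<string>): RedditPost[] {
  return fetched.filter((post) => !existingIds.has(post.id));
}

const existing = new Set(["abc12", "def34"]);
const fetched: RedditPost[] = [
  { id: "abc12", title: "[USA-CA] [H] GPU [W] PayPal" },
  { id: "xyz99", title: "[USA-NY] [H] CPU [W] Local cash" },
];
console.log(filterNewPosts(fetched, existing).map((p) => p.id)); // only "xyz99" survives
```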
## Setup

### 1. Supabase

1. Create a new project in Supabase
2. Go to the SQL Editor in your Supabase dashboard
3. Copy and paste the contents of `database-schema.sql` and run it
4. Go to Settings > API to get your project URL and anon key
### 2. Environment variables

Set these environment variables in your Val Town settings:

- `SUPABASE_URL`: Your Supabase project URL (e.g., `https://your-project.supabase.co`)
- `SUPABASE_ANON_KEY`: Your Supabase anon/public key
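A sketch of how a val might validate these settings before connecting. The helper takes an env-like record so it works the same under any runtime (`getRequiredEnv` and `fakeEnv` are illustrative names, not part of the actual scraper):

```typescript
// Illustrative helper: read a required setting from an env-like record and
// fail fast with a clear message when it is missing.
function getRequiredEnv(env: Record<string, string | undefined>, key: string): string {
  const value = env[key];
  if (!value) throw new Error(`Missing required environment variable: ${key}`);
  return value;
}

// In a Val Town val you would pass the real environment, e.g. Deno.env.toObject().
const fakeEnv = {
  SUPABASE_URL: "https://your-project.supabase.co",
  SUPABASE_ANON_KEY: "anon-key",
};
const supabaseUrl = getRequiredEnv(fakeEnv, "SUPABASE_URL");
console.log(supabaseUrl); // https://your-project.supabase.co
```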
### 3. Cron schedule

1. Set `reddit-scraper.ts` as a cron trigger in Val Town
2. Configure the schedule in the Val Town web UI (recommended: every 30 minutes)

Example cron expressions:

- Every 30 minutes: `*/30 * * * *`
- Every hour: `0 * * * *`
- Every 15 minutes: `*/15 * * * *`
## Database Schema

The `posts` table contains:

- `id`: Primary key (auto-increment)
- `reddit_id`: Unique Reddit post ID
- `reddit_original`: Full Reddit post data as JSON
- `title`: Post title
- `created_at`: When the post was created on Reddit
- `updated_at`: When the record was last updated in our database
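The authoritative DDL lives in `database-schema.sql` (not reproduced here); a hypothetical schema matching the columns above might look like:

```sql
-- Hypothetical sketch matching the columns described above;
-- the authoritative definition is in database-schema.sql.
CREATE TABLE IF NOT EXISTS posts (
  id BIGSERIAL PRIMARY KEY,            -- auto-incrementing primary key
  reddit_id TEXT UNIQUE NOT NULL,      -- unique Reddit post ID
  reddit_original JSONB NOT NULL,      -- full Reddit post data as JSON
  title TEXT NOT NULL,                 -- post title
  created_at TIMESTAMPTZ,              -- when the post was created on Reddit
  updated_at TIMESTAMPTZ DEFAULT now() -- last update in our database
);
```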
## Usage

You can manually trigger the scraper by running the `reddit-scraper.ts` val.
Once configured as a cron job, it will automatically:
- Fetch the latest 25 posts from /r/hardwareswap
- Check for duplicates in the database
- Save new posts to Supabase
- Log statistics about the scraping session
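Reddit's public JSON endpoint returns a listing object wrapping the posts. A sketch of mapping that listing into rows for the `posts` table, assuming the usual listing shape (the type and function names here are illustrative, not the scraper's actual code):

```typescript
// Subset of Reddit's listing response used here (illustrative shape).
interface RedditListing {
  data: { children: Array<{ data: { id: string; title: string; created_utc: number } }> };
}

// Map a listing into rows ready for insertion into the posts table.
function listingToRows(listing: RedditListing) {
  return listing.data.children.map(({ data }) => ({
    reddit_id: data.id,
    title: data.title,
    reddit_original: data,
    created_at: new Date(data.created_utc * 1000).toISOString(), // Reddit uses epoch seconds
  }));
}

// Sample listing (abbreviated) standing in for a real fetch:
const sample: RedditListing = {
  data: {
    children: [
      { data: { id: "xyz99", title: "[USA-NY] [H] CPU [W] PayPal", created_utc: 1700000000 } },
    ],
  },
};
console.log(listingToRows(sample)[0].reddit_id); // xyz99
```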
## Using the Official Reddit API (Optional)

If you prefer to use Reddit's official API instead of the JSON endpoint:
- Create a Reddit app at https://www.reddit.com/prefs/apps
- Add these environment variables:
  - `REDDIT_CLIENT_ID`
  - `REDDIT_CLIENT_SECRET`
  - `REDDIT_USER_AGENT`
- Modify the scraper to use the official Reddit API
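Reddit's official API authenticates token requests with HTTP Basic auth built from the app's client ID and secret. A sketch of just the header construction (the surrounding token request is described in comments, not executed; the function name is illustrative):

```typescript
// Illustrative: build the Basic auth header the official Reddit API expects
// on token requests, from REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET.
function basicAuthHeader(clientId: string, clientSecret: string): string {
  // btoa is available in Deno and modern Node; older Node would use Buffer.
  return "Basic " + btoa(`${clientId}:${clientSecret}`);
}

// The token request itself (not executed here) would POST to
// https://www.reddit.com/api/v1/access_token with this header plus a
// User-Agent taken from REDDIT_USER_AGENT.
console.log(basicAuthHeader("my-client-id", "my-secret"));
```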
## Monitoring

Check the Val Town logs to monitor:
- Number of new posts scraped
- Number of duplicates skipped
- Any errors during scraping
- Performance metrics
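The statistics above can be collected into a single summary log line per run. A sketch of one possible shape (the interface and format are illustrative, not the scraper's actual log format):

```typescript
// Illustrative per-run statistics matching the items logged above.
interface ScrapeStats {
  fetched: number;    // posts returned by Reddit
  saved: number;      // new posts written to Supabase
  duplicates: number; // posts skipped as already stored
  errors: number;     // failures during this run
  durationMs: number; // wall-clock time for the run
}

// Format one summary line for the Val Town logs.
function formatSummary(s: ScrapeStats): string {
  return `Scrape complete: ${s.saved} new, ${s.duplicates} duplicates skipped, ` +
         `${s.errors} errors (${s.fetched} fetched in ${s.durationMs}ms)`;
}

console.log(formatSummary({ fetched: 25, saved: 3, duplicates: 22, errors: 0, durationMs: 850 }));
// Scrape complete: 3 new, 22 duplicates skipped, 0 errors (25 fetched in 850ms)
```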
## Troubleshooting

### Common Issues

- Missing environment variables: Ensure `SUPABASE_URL` and `SUPABASE_ANON_KEY` are set
- Database connection errors: Verify your Supabase credentials and that the table exists
- Reddit rate limiting: The scraper uses a 25-post limit and a respectful user agent
- Duplicate key errors: The scraper checks for duplicates, but race conditions might occur

### Error Messages

- `Missing Supabase credentials`: Set the required environment variables
- `Reddit API error`: Check if Reddit is accessible and not rate limiting
- `Error saving post`: Check the Supabase connection and table schema
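The duplicate-key race condition mentioned above can be made harmless at the database level by letting Postgres ignore rows that slip past the duplicate check. A hypothetical example relying on the unique `reddit_id` constraint (the actual insert path lives in `reddit-scraper.ts`):

```sql
-- Hypothetical guard: ignore a row that slipped past the duplicate check,
-- relying on the unique constraint on reddit_id.
INSERT INTO posts (reddit_id, reddit_original, title, created_at)
VALUES ('xyz99', '{"id": "xyz99"}', 'Example title', now())
ON CONFLICT (reddit_id) DO NOTHING;
```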
## Querying Data

You can query your scraped data directly in Supabase:

```sql
-- Get recent posts
SELECT title, created_at, reddit_original->>'score' AS score
FROM posts
ORDER BY created_at DESC
LIMIT 10;

-- Search posts by title
SELECT title, created_at
FROM posts
WHERE title ILIKE '%gpu%'
ORDER BY created_at DESC;

-- Get posts by author
SELECT title, created_at
FROM posts
WHERE reddit_original->>'author' = 'username'
ORDER BY created_at DESC;
```
## Customization

Feel free to modify the scraper to:
- Add more subreddits
- Include additional post metadata
- Add data processing or analysis features
- Integrate with other services