Complete Instructions for Creating the /tasks/blob Endpoint

Overview

Create a webhook endpoint that extracts text from the PDF attached to a Notion page and saves it to blob storage, following the project's established separation-of-concerns architecture.

Prerequisites

Before starting, verify that these files exist and review their current state:

  • /backend/routes/tasks/_tasks.routes.ts (route definitions)
  • /backend/controllers/tasks.controller.ts (business logic)
  • /backend/services/notion.service.ts (Notion API calls)
  • /backend/utils/pdfExtractor.ts (PDF processing utilities)
  • /backend/types/ (type definitions directory)

Step 1: Create Type Definitions

Create or update /backend/types/tasks.types.ts with the following content:

/**
 * Task processing type definitions
 */

export interface ProcessBlobRequest {
  pageId: string;
}

export interface ProcessBlobResponse {
  success: boolean;
  pageId: string;
  blobKey?: string;
  textLength?: number;
  statusUpdated?: boolean;
  timestamp: string;
  error?: string;
  details?: string;
}

export interface WebhookPayload {
  data: {
    id: string;
    [key: string]: any;
  };
  [key: string]: any;
}

export interface NotionPageProperties {
  id: string;
  properties: {
    PDF?: {
      type: 'files';
      files: Array<{
        type: 'external' | 'file';
        external?: { url: string };
        file?: { url: string };
      }>;
    };
    [key: string]: any;
  };
}
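
Interfaces are erased at compile time, so a malformed webhook body is not rejected by the WebhookPayload type alone. As a minimal sketch, a runtime guard could check the shape before the controller runs. The names WebhookPayloadLike and isWebhookPayload are hypothetical and not part of the existing codebase:

```typescript
// Hypothetical runtime guard mirroring the WebhookPayload interface.
// Local copy of the shape so this sketch is self-contained.
interface WebhookPayloadLike {
  data: { id: string; [key: string]: unknown };
  [key: string]: unknown;
}

function isWebhookPayload(value: unknown): value is WebhookPayloadLike {
  // Reject non-objects and null up front
  if (typeof value !== "object" || value === null) return false;

  // The payload must carry a nested `data` object
  const data = (value as { data?: unknown }).data;
  if (typeof data !== "object" || data === null) return false;

  // `data.id` must be a non-empty string
  const id = (data as { id?: unknown }).id;
  return typeof id === "string" && id.length > 0;
}
```

A guard like this complements extractPageIdFromWebhook, which only checks that data.id is truthy.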

Step 2: Add Required Imports to Route File

In /backend/routes/tasks/_tasks.routes.ts, ensure these imports are present at the top:

import { Hono } from "npm:hono@3.12.12";
import { processBlobExtraction, extractPageIdFromWebhook } from "../../controllers/tasks.controller.ts";

Step 3: Add Route Handler

In /backend/routes/tasks/_tasks.routes.ts, add this endpoint before the catch-all handler (the app.post("*", ...) route):

// PDF text extraction and blob storage endpoint
app.post("/blob", async (c) => {
  try {
    // At this point, webhook auth has already passed
    const body = await c.req.json();
    console.log("📥 Blob processing webhook received:", body);

    // Extract page ID from webhook payload
    const pageId = extractPageIdFromWebhook(body);
    if (!pageId) {
      return c.json({ error: "Page ID is required" }, 400);
    }

    // Process the blob extraction using controller
    const result = await processBlobExtraction({ pageId });

    // Return appropriate HTTP status based on result
    if (result.success) {
      return c.json(result);
    } else {
      const statusCode = result.error?.includes("No PDF") || result.error?.includes("No text") ? 400 : 500;
      return c.json(result, statusCode);
    }

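  } catch (error) {
    console.error("❌ Error in blob endpoint:", error);
    // Catch variables are `unknown` under strict TypeScript, so narrow before reading .message
    const details = error instanceof Error ? error.message : String(error);
    return c.json({
      error: "Internal server error",
      details,
      timestamp: new Date().toISOString()
    }, 500);
  }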
});
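
The handler above picks the HTTP status by matching substrings of the controller's error message. As a sketch, the same mapping could be pulled into a small pure function that is easy to test in isolation. The names BlobResultLike and statusForResult are hypothetical:

```typescript
// Hypothetical helper: maps a controller result to an HTTP status code.
// Mirrors the substring checks used in the route handler.
interface BlobResultLike {
  success: boolean;
  error?: string;
}

// Error message fragments that indicate a problem with the page content
// rather than a server failure.
const CLIENT_ERROR_MARKERS = ["No PDF", "No text"];

function statusForResult(result: BlobResultLike): 200 | 400 | 500 {
  if (result.success) return 200;
  const message = result.error ?? "";
  const isClientError = CLIENT_ERROR_MARKERS.some((marker) => message.includes(marker));
  return isClientError ? 400 : 500;
}
```

Matching on message text is brittle: rewording an error string in the controller silently changes the status code. A dedicated error code field on the response would be sturdier, but that would be a change to the response type defined in Step 1.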

Step 4: Update Available Endpoints List

In the same file, find the catch-all handler and update the availableEndpoints array to include the new endpoint:

app.post("*", (c) => {
  const path = c.req.path;
  const method = c.req.method;
  const returnObj = {
    error: "Endpoint not found",
    path: path,
    method: method,
    availableEndpoints: [
      "GET /tasks/debug-webhook",
      "POST /tasks/test",
      "POST /tasks/notion-webhook",
      "POST /tasks/blob",  // <- Add this line
    ],
  };
  console.log(returnObj);

  return c.json(returnObj, 404);
});

Step 5: Add Required Imports to Controller

In /backend/controllers/tasks.controller.ts, ensure these imports are present at the top:

import { blob } from "https://esm.town/v/std/blob";
import { getPageProperties, updatePageStatus } from "../services/notion.service.ts";
import { extractTextFromPDFUrl } from "../utils/pdfExtractor.ts";
import type { 
  ProcessBlobRequest, 
  ProcessBlobResponse, 
  WebhookPayload, 
  NotionPageProperties 
} from "../types/tasks.types.ts";

Step 6: Add Main Controller Function

In /backend/controllers/tasks.controller.ts, add this function:

/**
 * Process PDF extraction and blob storage for a Notion page
 */
export async function processBlobExtraction(request: ProcessBlobRequest): Promise<ProcessBlobResponse> {
  const { pageId } = request;
  const timestamp = new Date().toISOString();

  try {
    console.log(`🔍 Processing page: ${pageId}`);

    // Get page properties from Notion
    const pageResult = await getPageProperties(pageId);
    if (!pageResult.success) {
      console.error(`❌ Failed to get page properties: ${pageResult.error}`);
      return {
        success: false,
        pageId,
        timestamp,
        error: `Failed to get page properties: ${pageResult.error}`
      };
    }

    const page = pageResult.data;
    console.log(`📄 Retrieved page properties for: ${page.id}`);

    // Extract PDF URL from page properties
    const pdfUrl = extractPdfUrl(page);
    if (!pdfUrl) {
      console.error("❌ No valid PDF found on page");
      return {
        success: false,
        pageId,
        timestamp,
        error: "No PDF files found in PDF property"
      };
    }

    console.log(`📎 Found PDF URL: ${pdfUrl}`);

    // Extract text from PDF
    console.log("🔄 Starting PDF text extraction...");
    const extractedText = await extractTextFromPDFUrl(pdfUrl);
    
    if (!extractedText || extractedText.trim().length === 0) {
      console.error("❌ No text extracted from PDF");
      return {
        success: false,
        pageId,
        timestamp,
        error: "No text could be extracted from PDF"
      };
    }

    // Save extracted text to blob storage
    const blobKey = `findings--transcripts--${pageId}`;
    console.log(`💾 Saving extracted text to blob with key: ${blobKey}`);
    
    await blob.setJSON(blobKey, {
      pageId: pageId,
      extractedText: extractedText,
      extractedAt: timestamp,
      textLength: extractedText.length,
      pdfUrl: pdfUrl
    });

    console.log(`✅ Text saved to blob storage successfully`);

    // Update page status to "Done" and save blob key
    console.log("🔄 Updating page status to 'Done' and saving blob key...");
    const statusResult = await updatePageStatus(pageId, "Done", blobKey);
    if (!statusResult.success) {
      console.error(`⚠️ Failed to update page status: ${statusResult.error}`);
      // Don't fail the entire operation if status update fails
    } else {
      console.log("✅ Page status updated to 'Done' and blob key saved");
    }

    return {
      success: true,
      pageId,
      blobKey,
      textLength: extractedText.length,
      statusUpdated: statusResult.success,
      timestamp
    };

  } catch (error) {
    console.error("❌ Error processing blob extraction:", error);
    return {
      success: false,
      pageId,
      timestamp,
      error: "Internal server error",
      // Narrow the unknown catch value before reading .message
      details: error instanceof Error ? error.message : String(error)
    };
  }
}
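
The controller builds the blob key and the stored record inline. As a sketch, both could be expressed as small helpers so that any code reading the blob later uses the same key format and record shape. TranscriptRecord, blobKeyForPage, and buildTranscriptRecord are hypothetical names:

```typescript
// Hypothetical helpers describing what processBlobExtraction stores.
const BLOB_KEY_PREFIX = "findings--transcripts--";

// Shape of the JSON value written under each blob key
interface TranscriptRecord {
  pageId: string;
  extractedText: string;
  extractedAt: string;
  textLength: number;
  pdfUrl: string;
}

// Builds the storage key for a given Notion page ID
function blobKeyForPage(pageId: string): string {
  return `${BLOB_KEY_PREFIX}${pageId}`;
}

// Builds the record; textLength is derived so it cannot drift from the text
function buildTranscriptRecord(
  pageId: string,
  extractedText: string,
  pdfUrl: string,
  extractedAt: string = new Date().toISOString(),
): TranscriptRecord {
  return {
    pageId,
    extractedText,
    extractedAt,
    textLength: extractedText.length,
    pdfUrl,
  };
}
```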

Step 7: Add Helper Functions to Controller

In /backend/controllers/tasks.controller.ts, add these helper functions:

/**
 * Extract PDF URL from Notion page properties
 */
function extractPdfUrl(page: NotionPageProperties): string | null {
  // Extract PDF property
  const pdfProperty = page.properties?.PDF;
  if (!pdfProperty || pdfProperty.type !== 'files') {
    console.error("❌ No PDF property found or property is not of type 'files'");
    return null;
  }

  const files = pdfProperty.files;
  if (!files || files.length === 0) {
    console.error("❌ No files found in PDF property");
    return null;
  }

  // Use the first file (the property is expected to hold a single PDF)
  const pdfFile = files[0];
  if (!pdfFile) {
    console.error("❌ PDF file is null or undefined");
    return null;
  }

  // Get the file URL (handle both external and Notion-hosted files)
  if (pdfFile.type === 'external') {
    return pdfFile.external?.url || null;
  } else if (pdfFile.type === 'file') {
    return pdfFile.file?.url || null;
  } else {
    console.error("❌ Unknown PDF file type:", pdfFile.type);
    return null;
  }
}

/**
 * Extract page ID from Notion webhook payload
 */
export function extractPageIdFromWebhook(webhookBody: WebhookPayload): string | null {
  const pageId = webhookBody.data?.id;
  if (!pageId) {
    console.error("❌ No page ID found in webhook payload", JSON.stringify(webhookBody, null, 2));
    return null;
  }
  return pageId;
}

Step 8: Verify Required Service Functions

Check that /backend/services/notion.service.ts contains these functions. If they don't exist, add them:

import { Client } from "npm:@notionhq/client";

const notion = new Client({ auth: Deno.env.get("NOTION_API_KEY") });

export async function getPageProperties(pageId: string) {
  try {
    const response = await notion.pages.retrieve({
      page_id: pageId,
    });
    return {
      success: true,
      data: response,
      timestamp: new Date().toISOString(),
    };
  } catch (error) {
    return {
      success: false,
      // Narrow the unknown catch value before reading .message
      error: error instanceof Error ? error.message : String(error),
      timestamp: new Date().toISOString(),
    };
  }
}

export async function updatePageStatus(pageId: string, status: string, blobKey?: string) {
  try {
    const properties: any = {
      Status: {
        select: {
          name: status
        }
      }
    };

    // Add blob key if provided
    if (blobKey) {
      properties["Blob Key"] = {
        rich_text: [
          {
            text: {
              content: blobKey
            }
          }
        ]
      };
    }

    const response = await notion.pages.update({
      page_id: pageId,
      properties: properties
    });

    return {
      success: true,
      data: response,
      timestamp: new Date().toISOString(),
    };
  } catch (error) {
    return {
      success: false,
      // Narrow the unknown catch value before reading .message
      error: error instanceof Error ? error.message : String(error),
      timestamp: new Date().toISOString(),
    };
  }
}
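
updatePageStatus assembles its properties payload with an any-typed object. As a sketch, the payload could be built by a typed pure function, which keeps the Notion call itself thin and lets the payload be checked without network access. StatusProperties and buildStatusProperties are hypothetical names, and the sketch assumes, as the code above does, that "Status" is a select property and "Blob Key" is a rich_text property in the target database:

```typescript
// Hypothetical typed payload for the page update.
// Assumes "Status" is a select property and "Blob Key" is rich_text.
interface StatusProperties {
  Status: { select: { name: string } };
  "Blob Key"?: { rich_text: Array<{ text: { content: string } }> };
}

function buildStatusProperties(status: string, blobKey?: string): StatusProperties {
  const properties: StatusProperties = {
    Status: { select: { name: status } },
  };

  // Only include the blob key when one was provided
  if (blobKey) {
    properties["Blob Key"] = { rich_text: [{ text: { content: blobKey } }] };
  }

  return properties;
}
```

If the database defines Status with Notion's dedicated status property type rather than select, the payload key would need to change accordingly. Check the database schema before relying on either form.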

Step 9: Verify Required Utility Function

Check that /backend/utils/pdfExtractor.ts contains the extractTextFromPDFUrl function. If it doesn't exist, add it:

/**
 * Extract text from a PDF file accessible via URL
 */
export async function extractTextFromPDFUrl(pdfUrl: string): Promise<string> {
  try {
    console.log(`📄 Downloading PDF from URL: ${pdfUrl}`);
    
    // Download the PDF file
    const response = await fetch(pdfUrl);
    if (!response.ok) {
      throw new Error(`Failed to download PDF: ${response.status} ${response.statusText}`);
    }
    
    const arrayBuffer = await response.arrayBuffer();
    console.log(`📄 Downloaded PDF, size: ${arrayBuffer.byteLength} bytes`);
    
    // Use pdfjs-dist to extract text
    const pdfjsLib = await import("https://esm.sh/pdfjs-dist@4.0.379/legacy/build/pdf.mjs");
    
    // Load the PDF document
    const loadingTask = pdfjsLib.getDocument({ data: arrayBuffer });
    const pdfDocument = await loadingTask.promise;
    
    console.log(`📄 PDF loaded with ${pdfDocument.numPages} pages`);
    
    // Extract text from all pages
    let fullText = '';
    for (let pageNum = 1; pageNum <= pdfDocument.numPages; pageNum++) {
      const page = await pdfDocument.getPage(pageNum);
      const textContent = await page.getTextContent();
      
      // Combine text items into a single string
      const pageText = textContent.items
        .map((item: any) => item.str)
        .join(' ');
      
      fullText += pageText + '\n';
      console.log(`📄 Extracted text from page ${pageNum}: ${pageText.length} characters`);
    }
    
    console.log(`✅ Total extracted text: ${fullText.length} characters`);
    return fullText.trim();
    
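  } catch (error) {
    console.error("❌ Error extracting text from PDF URL:", error);
    // Narrow the unknown catch value before reading .message
    const message = error instanceof Error ? error.message : String(error);
    throw new Error(`PDF text extraction failed: ${message}`);
  }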
}
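
The per-page join above maps every item to item.str. If an item has no str field, join renders it as an empty string, which leaves doubled spaces in the output. As a sketch, the join could be isolated into a helper that skips such items. TextItemLike and joinPageText are hypothetical names, and the sample input in the usage note is made up for illustration:

```typescript
// Hypothetical helper for combining the text items of one page.
// Only the `str` field is assumed; other item fields are ignored.
interface TextItemLike {
  str?: string;
}

function joinPageText(items: TextItemLike[]): string {
  return items
    .map((item) => item.str ?? "") // items without text become empty strings
    .filter((text) => text.length > 0) // drop empties so no doubled spaces appear
    .join(" ");
}
```

For example, joinPageText([{ str: "Hello" }, {}, { str: "world" }]) returns "Hello world".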

Step 10: Test the Implementation

Test 1: Basic Functionality Test

// Use the fetch tool to test the endpoint
fetch("/tasks/blob", {
  method: "POST",
  body: JSON.stringify({
    data: {
      id: "test-page-id-123"
    }
  }),
  headers: {
    "Content-Type": "application/json"
  }
})

Test 2: Error Handling Test

// Test with invalid payload
fetch("/tasks/blob", {
  method: "POST", 
  body: JSON.stringify({
    data: {} // Missing id
  }),
  headers: {
    "Content-Type": "application/json"
  }
})

Test 3: Check Logs

// Use the requests tool to examine execution logs
requests("main.tsx")

Step 11: Verification Checklist

After implementation, verify:

  • [ ] Types file created: /backend/types/tasks.types.ts exists with all required interfaces
  • [ ] Route handler added: Endpoint is in /backend/routes/tasks/_tasks.routes.ts before catch-all handler
  • [ ] Imports correct: All files have proper import statements with correct paths
  • [ ] Controller function: processBlobExtraction exists in tasks controller with proper typing
  • [ ] Helper functions: extractPageIdFromWebhook and extractPdfUrl exist with proper typing
  • [ ] Service functions: getPageProperties and updatePageStatus exist in notion service
  • [ ] Utility function: extractTextFromPDFUrl exists in PDF extractor utility
  • [ ] Available endpoints updated: New endpoint listed in catch-all handler
  • [ ] Blob storage format: Uses key format findings--transcripts--{pageId}
  • [ ] Error handling: Proper HTTP status codes (400 for client errors, 500 for server errors)
  • [ ] Logging consistency: All console.log statements use emoji prefixes
  • [ ] Type safety: No any types used except where necessary for external APIs

Expected Behavior

Successful Request:

  • Status: 200

  • Response:

    { "success": true, "pageId": "abc123", "blobKey": "findings--transcripts--abc123", "textLength": 1234, "statusUpdated": true, "timestamp": "2024-01-01T12:00:00.000Z" }

Missing Page ID:

  • Status: 400

  • Response:

    { "error": "Page ID is required" }

No PDF Found:

  • Status: 400

  • Response:

    { "success": false, "pageId": "abc123", "error": "No PDF files found in PDF property", "timestamp": "2024-01-01T12:00:00.000Z" }

Internal Error:

  • Status: 500

  • Response:

    { "error": "Internal server error", "details": "Specific error message", "timestamp": "2024-01-01T12:00:00.000Z" }

Architecture Compliance

This implementation follows the established patterns:

  • Routes: Handle HTTP request/response only, delegate to controllers
  • Controllers: Orchestrate business logic, call services and utilities
  • Services: Make pure API calls to external systems (Notion)
  • Utils: Provide reusable utility functions (PDF processing)
  • Types: Centralized type definitions shared across modules
  • Error Handling: Consistent success/error response patterns
  • Logging: Emoji-prefixed console logs for easy debugging

The endpoint runs behind the existing webhook authentication middleware and follows the same patterns as the other endpoints in the project.
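
The consistent success/error pattern listed above can be sketched as a discriminated union with two small constructors. Result, ok, and fail are hypothetical names. The service functions in this guide return plain object literals of the same shape rather than using helpers like these:

```typescript
// Hypothetical Result type capturing the success/error shape used by the services.
type Result<T> =
  | { success: true; data: T; timestamp: string }
  | { success: false; error: string; timestamp: string };

// Wraps a value in a success result
function ok<T>(data: T): Result<T> {
  return { success: true, data, timestamp: new Date().toISOString() };
}

// Wraps any thrown value in a failure result, narrowing before reading .message
function fail<T = never>(error: unknown): Result<T> {
  const message = error instanceof Error ? error.message : String(error);
  return { success: false, error: message, timestamp: new Date().toISOString() };
}
```

With an explicit union as the return type, checking result.success lets TypeScript narrow to the branch that has data or error, so callers do not need optional chaining or casts.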