For Data Sources

Submit a job data source via jobpool.live.

New job boards, ATS feeds, careers pages, and bulk dataset operators can plug into the Job Pool ecosystem through three documented paths. Every contribution is reviewed, attributed, and routed through the canonical schema. The origin stays the system of record — the pool just gives your data a structured cache and a documented contract.

Why join

What you get for being a documented source

The pool is a cooperative cache. Your records show up in the canonical API, the dataset, and downstream Job Pool surfaces alongside attribution back to your origin.

Attribution preserved

Every record carries url, apply_link, and source_business_url back to your origin. Consumers are pointed at your apply flow, not a clone.

Documented in the catalog

Your source is added to /v1/sources with repository, DVC pointer, schema, and freshness signals. Downstream consumers know what to trust.

Traffic offload

Responsible aggregators are pointed at the cache first, reducing duplicate crawls against your origin pages and apply flows.

Non-exclusive

Joining the pool grants no exclusive rights to your data, your audience, or your crawler traffic. You can withdraw the source at any time and the cache stops refreshing.

Reviewed before publish

Submissions land in a staged review queue on jobpool.live/docs/submissions before they show up in the public dataset and API.

Integration Paths

Three ways to become a Job Pool data source

Pick the path that matches how your data is shaped today. You can mix paths over time — for example, expose a feed for incremental updates and ship an initial CSV for backfill.

1. Feed crawl by JobPoolBot

Best when your job board already exposes a stable, machine-readable feed (RSS, Atom, JSON Feed, or a documented JSON listing endpoint).

  1. Publish or expose a feed at a stable URL (for example https://example.com/jobs.json) that returns the listings you want represented in the pool.
  2. Allow the JobPoolBot robots product token in your robots.txt and reference the feed in a comment or sitemap.
  3. Add a standard Link header on listing pages: Link: <https://example.com/jobs.json>; rel="alternate"; type="application/json".
  4. Open a submission on jobpool.live/docs/submissions with the feed URL and refresh interval.
  5. Review goes through the staged queue; once approved, JobPoolBot starts paced refreshes and your records appear in /v1/jobs.
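The feed from step 1 and the Link header from step 3 can be sketched in a few lines. A minimal sketch in Python; the {"jobs": [...]} wrapper shape is an assumption for illustration, and the record fields follow the canonical schema shown later on this page.

```python
import json

# Hypothetical feed builder: the {"jobs": [...]} wrapper is an assumption;
# publish whatever listing shape you document in your submission.
def build_feed(listings):
    return json.dumps({"jobs": listings}, indent=2)

# The Link header value from step 3, built as a plain string.
def link_header(feed_url):
    return f'<{feed_url}>; rel="alternate"; type="application/json"'

listings = [{
    "job_title": "Backend Engineer",
    "url": "https://example.com/jobs/backend-engineer",
    "apply_link": "https://example.com/apply/backend-engineer",
    "source_business_url": "https://example.com",
    "country_code": "US",
    "industries": ["Software"],
}]

feed_body = build_feed(listings)
header = link_header("https://example.com/jobs.json")
```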

Good fit for: established job boards, ATS APIs, careers-page aggregators with structured templates.

2. Reviewed scraper on jobpool.live

Best when there is no first-party feed and the data lives in HTML pages that need a small adapter to normalize into the canonical schema.

  1. Read jobpool.live/docs/scrapers for the scraper contract: input, output, naming, paging, retries.
  2. Write a scraper that emits records matching job-listing.schema.json, including stable url, apply_link, and source_business_url attribution.
  3. Submit the scraper for review on jobpool.live/docs/submissions. Review covers correctness, politeness, schema fit, and origin attribution.
  4. On merge, the scraper is run on a paced schedule from JobPoolBot infrastructure. You stay listed as the source operator on /v1/sources.
  5. You can pause or retire the scraper at any time; the catalog entry is updated and the cache stops refreshing.
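The adapter in step 2 can be sketched as a small normalize function. The raw field names (title, detail_url, apply_url, country, tags) are hypothetical scraper output; the real contract lives at jobpool.live/docs/scrapers.

```python
# Fields the canonical schema marks as required (see the table below).
REQUIRED = ("job_title", "url", "apply_link", "source_business_url",
            "country_code", "industries")

def normalize(raw, origin):
    """Map hypothetical scraped fields onto the canonical record shape."""
    record = {
        "job_title": raw["title"].strip(),
        "url": raw["detail_url"],
        # Fall back to the detail URL when no separate apply URL was scraped.
        "apply_link": raw.get("apply_url") or raw["detail_url"],
        "source_business_url": origin,
        "country_code": raw.get("country", "").upper()[:2],
        # Free-form strings, deduped, per the schema notes.
        "industries": sorted(set(raw.get("tags", []))),
    }
    missing = [f for f in REQUIRED if not record.get(f)]
    if missing:
        raise ValueError(f"record missing {missing}")
    return record
```

Raising on missing fields keeps bad records out of your emitted output; in the real pipeline they would be quarantined in the review queue instead.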

Good fit for: niche boards, regional sites, vertical-specific careers pages, ATS templates without an open feed.

3. Direct dataset upload

Best when you have a periodic bulk export (CSV, Parquet) you can hand over for review and inclusion in the canonical dataset.

  1. Export your listings as a CSV that conforms to job-listing.schema.json, with one job per row and consistent column order.
  2. Smoke-check locally against /openapi.json if you also want to mirror the API shape.
  3. Open a submission on jobpool.live/docs/submissions describing the dataset cadence (one-time, daily, weekly), origin URL, and license posture.
  4. The dataset is reviewed, normalized, then merged into the next reviewed snapshot served at /datasets/latest.csv.
  5. Status, version, and freshness are surfaced on jobpool.live/docs/status and /v1/sources.
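Step 1's export might be sketched like this. The semicolon join that flattens the industries array into one CSV cell is an assumption; confirm the expected encoding during review.

```python
import csv
import io

# One job per row, consistent column order (step 1).
COLUMNS = ["job_title", "url", "apply_link", "source_business_url",
           "country_code", "industries", "ingestion_date"]

def export_csv(records):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    for record in records:
        row = dict(record)
        # industries is an array in the schema; a ";" join is an assumed
        # flattening for the CSV form.
        row["industries"] = ";".join(record.get("industries", []))
        writer.writerow(row)
    return buf.getvalue()
```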

Good fit for: enterprise boards with bulk exports, partners doing one-time backfills, research datasets.

What records look like

Conform to the canonical job listing schema

The pool serves a single normalized record shape. Every integration path eventually produces records that match the same schema, regardless of whether they came from a feed, a scraper, or a CSV upload.

Required fields

Field                Type    Notes
job_title            string  Public-facing role title as it appears on the origin.
url                  string  Canonical detail URL on your origin. Must resolve.
apply_link           string  Application URL, preserved verbatim. Carries forward attribution.
source_business_url  string  Origin business or board URL the listing belongs to.
country_code         string  2-letter ISO-style code, for example US.
industries           array   Best-effort classification. Free-form strings, deduped.

Full schema: /schemas/job-listing.schema.json. Mirror the API shape using /openapi.json.

Sample record

{
  "job_title": "Backend Engineer",
  "url": "https://example.com/jobs/backend-engineer",
  "apply_link": "https://example.com/apply/backend-engineer",
  "source_business_url": "https://example.com",
  "country_code": "US",
  "industries": ["Software", "Engineering"],
  "ingestion_date": "2026-05-09"
}

Records that fail validation are quarantined in the review queue, not silently dropped. The submission UI surfaces what to fix.
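A minimal pre-submission check against the required fields above might look like the sketch below. It mirrors the table, not the full job-listing.schema.json, and the URL checks are naive prefix tests rather than actual resolution.

```python
def validate(record):
    """Return a list of problems; an empty list means the record passes
    these required-field checks."""
    problems = []
    for field in ("job_title", "url", "apply_link",
                  "source_business_url", "country_code"):
        if not isinstance(record.get(field), str) or not record[field]:
            problems.append(f"{field}: missing or not a string")
    if not isinstance(record.get("industries"), list):
        problems.append("industries: must be an array")
    cc = record.get("country_code", "")
    if isinstance(cc, str) and len(cc) != 2:
        problems.append("country_code: expected a 2-letter code")
    for field in ("url", "apply_link", "source_business_url"):
        value = record.get(field, "")
        if isinstance(value, str) and not value.startswith(("http://", "https://")):
            problems.append(f"{field}: expected an absolute URL")
    return problems
```

Running a check like this locally before opening a submission shortens the review loop, since validation failures otherwise surface in the quarantine queue.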

Reference Agent

How JobPoolBot identifies and behaves

JobPoolBot is the canonical data-source agent for the pool. It is the same code path used by reviewed scrapers and the same crawler that picks up first-party feeds. Treat it as the reference implementation for what an automated data source should look like.

Identification

JobPoolBot identifies itself with the JobPoolBot product token in its User-Agent string. If you see that token on your origin, you are seeing the bot that ingests your listings. Rate-limit issues are tracked through jobpool.live.

Crawl etiquette

  • Honors robots.txt, including Crawl-delay hints when supplied.
  • Reads HTTP Link headers and HTML metadata before deep-crawling listing pages.
  • Paces requests per host; no parallel hammering of a single origin.
  • Caches responses and revalidates with conditional If-Modified-Since / ETag requests where available.
  • Stops on persistent 4xx / 5xx; surfaces incidents on jobpool.live/docs/status.
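The pacing and revalidation bullets can be sketched as two small helpers. The default delay and the cache-entry dict shape are assumptions; the header names are standard HTTP.

```python
DEFAULT_DELAY = 5.0  # seconds; assumed fallback, pacing is tuned per source in review

def next_fetch_time(last_fetch, crawl_delay=None):
    # Honor a robots.txt Crawl-delay hint when supplied, else the default pace.
    return last_fetch + max(crawl_delay or 0, DEFAULT_DELAY)

def conditional_headers(cache_entry):
    # Revalidate with ETag / If-Modified-Since when the last response supplied one.
    headers = {"User-Agent": "JobPoolBot"}  # exact UA string is an assumption
    if cache_entry.get("etag"):
        headers["If-None-Match"] = cache_entry["etag"]
    if cache_entry.get("last_modified"):
        headers["If-Modified-Since"] = cache_entry["last_modified"]
    return headers
```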

Robots example for a participating source

# /robots.txt
User-agent: JobPoolBot
Allow: /jobs
Allow: /careers
Crawl-delay: 5

# Preferred structured feed:
# https://example.com/jobs.json
Sitemap: https://example.com/sitemap.xml

This is the same pattern used in the publisher integration docs at /rfc/#publishers. Data-source integration is publisher integration with a follow-on submission step.
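You can sanity-check a robots.txt like the one above with the Python standard-library parser before submitting (crawl_delay() requires Python 3.6+):

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above, parsed offline.
robots_txt = """\
User-agent: JobPoolBot
Allow: /jobs
Allow: /careers
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

allowed = rp.can_fetch("JobPoolBot", "https://example.com/jobs/backend-engineer")
delay = rp.crawl_delay("JobPoolBot")
```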

Submission Lifecycle

From submitted to served

Every integration path follows the same staged review flow. There is no private, undocumented bypass.

01

Submit

Open a submission on jobpool.live/docs/submissions with your feed URL, scraper, or dataset details.

02

Review

Submissions are reviewed for correctness, politeness, schema fit, and attribution. Feedback is posted on the submission record.

03

Stage

Approved submissions land in a staging dataset where records are validated end-to-end and dedup keys are checked.

04

Promote

Promoted sources show up in /v1/sources and their records flow into /v1/jobs and the next reviewed dataset snapshot.

Status and freshness

  • Source freshness, last successful crawl, and incident notes are public on jobpool.live/docs/status.
  • The reviewed CSV snapshot is published at /datasets/latest.csv; consumers can read freshness from /v1/sources at any time.
  • Withdrawals, schema breaks, or origin migrations are also handled through the same submission UI.
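A consumer-side staleness check over that freshness data might look like this. The /v1/sources response shape (an id plus an ISO-8601 last_crawled timestamp) is an assumption; read the real shape from /openapi.json.

```python
from datetime import datetime, timedelta, timezone

def stale_sources(sources, max_age_hours, now=None):
    """Return ids of sources whose last crawl is older than the cutoff.
    The {"id", "last_crawled"} fields are assumed, not confirmed."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=max_age_hours)
    return [s["id"] for s in sources
            if datetime.fromisoformat(s["last_crawled"]) < cutoff]
```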

FAQ

Common questions before you submit

Do I have to expose a feed first?

No. If your data only lives in HTML, the reviewed-scraper path on jobpool.live/docs/scrapers is the supported alternative. Feeds are encouraged but not required.

Does the pool re-host my apply flow?

No. The pool stores normalized records; apply_link is preserved verbatim. Consumers are pointed at your apply URL, not a copy.

How do I withdraw a source?

Open a withdrawal on the same submissions UI. The catalog entry is updated, the scraper or feed pull is paused, and the cache stops refreshing on the next cycle.

Is JobPoolBot the only crawler?

JobPoolBot is the canonical, documented agent. Other consumers may still crawl your origin under their own UAs — the pool just gives them a cache to prefer first via /v1/jobs.

What about rate limits?

Crawl-delay hints in robots.txt are respected. Origin-specific pacing is set per source during review and tuned if you flag issues on jobpool.live/docs/status.

Next Steps

Pick a path and open a submission

The submission UI on jobpool.live captures everything reviewers need: feed URL, scraper, dataset, refresh cadence, license posture, and contact info.