Open Job Data Pool
A foundational proposal for treating job listings as shared, structured, continuously refreshed data infrastructure in the AI era.
This RFC defines the concept of an Open Job Data Pool: a shared, continuously refreshed, multi-source data layer for job listings. It explains why the phrase “data pool” is intentionally used, identifies structural issues in the current job data climate, and argues that AI has changed the economics of job data access. As LLMs make scraping, extraction, and mass application workflows easier for ordinary people, the web is moving toward defensive lockdown. The Open Job Data Pool is proposed as a coordinated, transparent alternative to an escalating cycle of private scraping and public restriction.
Overview
Job listings are not just marketing copy. For a job seeker, a listing can represent rent, health insurance, immigration stability, career mobility, family security, or the next step out of stagnation. For an employer, a listing represents an open operational need. For researchers and builders, listings are signals about the shape of the labor market. Yet the data layer beneath job search remains fragmented, inconsistent, and unusually opaque relative to its social importance.
The Open Job Data Pool proposes that job listings should be treated as shared infrastructure. Not every job listing is equally trustworthy. Not every source should be treated the same way. Not every record can be redistributed without care. But the central idea is simple: the labor market is too important for its public signals to exist only as scattered web pages, closed search experiences, fragile scrapers, and isolated databases.
The Job Pool ecosystem is built around the distinction between data and surfaces. A consumer-facing product can help people find work. A live transparency layer can show what the system is seeing. A canonical data domain can expose schemas, datasets, RFCs, and API contracts. Contributor tools can help expand and maintain ingestion. These surfaces should not be collapsed into one product. They should be connected by a shared pool.
Core Thesis
The job market is not lacking data. It is lacking usable structure.
Job listings already exist at massive scale, but they are fragmented, inconsistent, and increasingly difficult to access in a reliable way. At the same time, demand for structured job data has sharply increased.
This shift is driven by AI.
Modern tools do not just display job listings. They interpret them. They compare roles, extract requirements, generate applications, summarize companies, cluster similar openings, and assist in decision-making. These workflows depend on structured inputs. When structure is missing, the burden shifts back to the user as guesswork, repetition, and error.
At the same time, the cost of building scraping and aggregation systems has collapsed. What previously required engineering teams can now be produced by individuals using LLMs. The number of independent collectors has increased dramatically.
This creates a feedback loop:
- More demand for structured job data.
- More uncoordinated scraping.
- More operational strain on source systems.
- More defensive restrictions.
- Less accessible job data.
The result is not more opportunity. It is less visible opportunity.
The Open Job Data Pool is proposed as a coordination layer that breaks this cycle.
Terminology
The key words MUST, SHOULD, and MAY indicate requirement strength within this document.
| Term | Meaning |
|---|---|
| job data | Structured information describing a job listing, including title, company, location, compensation, source, posting date, apply URL, remote policy, employment type, lifecycle status, and related metadata. |
| data pool | A shared, continuously updated aggregation layer fed by multiple sources and consumed by multiple downstream products, tools, or workflows. |
| source | An origin from which job listing data is obtained, such as an employer career page, job board, ATS endpoint, public feed, partner dataset, or contributor-submitted scraper. |
| surface | A user-facing or developer-facing way to consume the pool, such as search, API access, bulk downloads, transparency pages, scraper documentation, or RFC pages. |
| provenance | Metadata describing where a record came from, when it was observed, how it was transformed, and whether it is still believed to be active. |
| mass aggregator | An actor, tool, script, or system that collects listings across many sources at scale. In the AI era, this category increasingly includes ordinary individuals using generated scripts, not only companies. |
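To make these terms concrete, the sketch below shows one possible shape for a pooled job record, covering the fields named above. It is illustrative only: the field names are assumptions made for this example, not the schema, which this RFC deliberately declines to finalize (see Non-Goals).

```typescript
// Illustrative only: these field names are assumptions for this RFC,
// not the canonical schema (which this document does not define).
interface JobRecord {
  id: string;                         // pool-assigned identifier
  title: string;
  company: string;
  location?: string;                  // free text until normalized
  remotePolicy?: "onsite" | "hybrid" | "remote";
  employmentType?: "full_time" | "part_time" | "contract" | "internship";
  compensation?: { min?: number; max?: number; currency: string; period: "hour" | "year" };
  applyUrl: string;
  source: string;                     // e.g. career page, job board, ATS endpoint
  postedAt?: string;                  // ISO 8601, as reported by the source
  firstSeenAt: string;                // ISO 8601, when the pool first observed it
  lastSeenAt: string;                 // ISO 8601, when the pool last confirmed it
  lifecycleStatus: "active" | "stale" | "expired" | "unknown";
}
```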
Problem Statement
The current job data climate has a structural mismatch: job listings are treated as isolated product content, but the market increasingly depends on them as infrastructure.
Fragmentation
Job listings are distributed across company career pages, applicant tracking systems, job boards, staffing agencies, public feeds, social platforms, and reposting networks. Each source has its own format, update cadence, URL structure, filtering model, anti-bot posture, and data quality profile.
This fragmentation turns every serious attempt to understand the market into a repeated engineering project: discovery, collection, deduplication, normalization, freshness checking, trust scoring, and source monitoring. The waste is not only technical. It is emotional. Every duplicate listing, dead apply link, or hidden salary range makes the market feel less knowable than it should be.
Inconsistent Structure
Even when listings describe similar roles, their fields differ. Salary may be absent, embedded in prose, expressed hourly or annually, split across a range, or hidden in the body text. Location may refer to a city, region, country, remote policy, hybrid expectation, legal work authorization zone, or recruiting territory. Seniority, employment type, department, skills, and application process are rarely represented consistently.
This inconsistency matters more in the AI era because structured data is the difference between a useful assistant and a hallucination-prone summarizer. LLMs can reason over job listings more effectively when the inputs are normalized, labeled, and traceable. The better the structure, the less a user needs to trust a model to infer what should have been explicit.
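As a small, concrete example of what normalization involves, the sketch below converts differently expressed salaries into one comparable annual range. The input shape and the 2,080-hour work year (40 hours over 52 weeks) are assumptions made for illustration, not pool policy.

```typescript
// Illustrative salary normalization; the input shape and the
// 2,080-hour work year are assumptions, not pool policy.
type RawSalary = { min?: number; max?: number; currency: string; period: "hour" | "year" };
type AnnualRange = { min: number; max: number; currency: string };

function toAnnualRange(raw: RawSalary): AnnualRange | null {
  const factor = raw.period === "hour" ? 2080 : 1;
  const min = raw.min ?? raw.max;   // collapse a single value into a range
  const max = raw.max ?? raw.min;
  if (min === undefined || max === undefined) return null; // nothing stated; do not guess
  return { min: min * factor, max: max * factor, currency: raw.currency };
}

// "$25/hr" and "$50,000-$54,000/yr" become directly comparable:
toAnnualRange({ min: 25, currency: "USD", period: "hour" });                 // { min: 52000, max: 52000, currency: "USD" }
toAnnualRange({ min: 50000, max: 54000, currency: "USD", period: "year" }); // unchanged: already annual
```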
Staleness and Ambiguity
A job listing can remain visible after it is no longer actionable. It may be filled, paused, reposted, expired, duplicated, or used as a talent-pipeline signal rather than as an actively hiring role. Job seekers experience this as wasted effort. Data consumers experience it as noise. Employers may not even realize how much stale public surface area they are leaving behind.
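One hedged sketch of how this ambiguity could be made machine-readable: derive a lifecycle status from when the pool last confirmed the listing at its source. The 7-day and 30-day thresholds below are invented for this example, not proposed policy.

```typescript
// Illustrative decay rule; the thresholds are assumptions for this
// example, not thresholds this RFC proposes.
type LifecycleStatus = "active" | "stale" | "expired";

function classifyLifecycle(lastSeenAt: Date, now: Date = new Date()): LifecycleStatus {
  const daysSinceSeen = (now.getTime() - lastSeenAt.getTime()) / 86_400_000; // ms per day
  if (daysSinceSeen <= 7) return "active";   // recently confirmed at the source
  if (daysSinceSeen <= 30) return "stale";   // still visible, but unverified
  return "expired";                          // treat as no longer actionable
}
```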
Limited Provenance
Many job interfaces show a listing but not the data lifecycle behind it. Users often cannot tell when the job was first observed, when it was last checked, whether it came from the employer directly, whether it was reposted, whether the apply URL changed, or how confidence in the listing was determined.
Provenance is not an optional detail for job data. It is central to trust. In a market filled with reposts, scams, AI-generated job descriptions, staffing intermediaries, and stale listings, the question is not just “what does this listing say?” The question is “why should this listing be believed?”
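One way to answer that question is an explicit provenance trail on every record. The shape below is a sketch under assumed names, not a normative format.

```typescript
// Sketch of a provenance trail; all names here are illustrative assumptions.
interface ProvenanceEvent {
  observedAt: string;                 // ISO 8601 timestamp of the observation
  source: string;                     // where the record was seen
  action: "discovered" | "revalidated" | "transformed" | "apply_url_changed";
  note?: string;                      // e.g. which transformation was applied
}

interface Provenance {
  origin: string;                     // first known source of the listing
  directFromEmployer: boolean | null; // null when unknown, not silently assumed
  trail: ProvenanceEvent[];           // append-only observation history
  confidence: number;                 // 0..1, how belief in the record was scored
}
```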
Closed Access
Most job platforms optimize around their own interface. Even when they expose useful search experiences, the underlying data is typically not made available as a stable public dataset, open schema, or developer-ready API. This limits experimentation and makes it harder to build new products around job discovery, labor-market analysis, alerting, accessibility, matching, verification, and user-owned workflows.
Trust and Fraud Pressure
The broader job-search environment is increasingly affected by scams, impersonation, low-quality postings, misleading opportunities, and automation. The FTC reported that job-scam losses increased more than threefold from 2020 to 2023 and exceeded $220 million in the first half of 2024. This does not mean every bad listing is a scam. It does mean job data now exists in a trust-sensitive environment where source quality, provenance, and freshness matter.
AI-Era Shift
AI has changed job search from a browsing problem into a data problem.
The modern expectation is no longer: “show me jobs.” It is: “help me reason about opportunities.”
That requires structure.
A model can compare roles, extract requirements, infer seniority, tailor a resume, and summarize tradeoffs. But it can only do this reliably when the input is coherent. Without structured fields, freshness metadata, source context, and clear apply URLs, even powerful tools are forced to reason from incomplete evidence.
At the same time, AI has made it trivial to generate scraping logic, automation scripts, and aggregation pipelines. This has expanded the population of “aggregators” from a small number of companies to a large number of individuals.
This shift matters because web systems are not designed to absorb unlimited uncoordinated extraction.
When enough independent actors attempt to extract the same data, the system responds defensively:
- rate limiting
- bot detection
- CAPTCHAs
- blocked crawlers
- gated content
These defenses do not distinguish intent. They treat extraction as risk.
The result is predictable: job data becomes harder to access precisely when demand for it is highest.
The Pool as a Pressure Valve
A shared job data pool gives the ecosystem a less destructive path. Instead of thousands of independent scripts repeatedly hitting the same sources, a pool can centralize collection, document source policies, expose downstream access, preserve provenance, and provide rate-limited structured distribution.
The goal is not to make crawling limitless. The goal is to make data access legible, coordinated, and useful enough that fewer actors need to behave like anonymous scrapers in the first place.
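To illustrate what "rate-limited structured distribution" can mean in practice, the sketch below is a minimal token-bucket limiter a pool could place in front of downstream access. The capacity and refill rate are arbitrary example values, not proposed limits.

```typescript
// Minimal token-bucket sketch for coordinated downstream access.
// The capacity and refill rate are example values, not proposed limits.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity;
  }

  tryConsume(): boolean {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;  // request allowed
    }
    return false;   // caller should back off and retry later
  }
}

// Example: allow bursts of 10 requests, refilling at 1 request per second.
const limiter = new TokenBucket(10, 1);
if (!limiter.tryConsume()) {
  // respond with 429 and a Retry-After hint instead of silently dropping
}
```

A limiter like this makes access legible in both directions: consumers get documented, predictable limits instead of opaque blocks, and the pool can absorb demand without pushing it back onto source systems.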
Coordination Failure
The current job data ecosystem is a coordination failure.
Each participant acts rationally in isolation:
- Individuals collect data to improve visibility.
- Builders create aggregators to improve access.
- Operators restrict access to protect systems.
But the aggregate outcome is worse than any individual intention:
- redundant scraping
- increased system load
- reduced accessibility
- declining trust in listings
This is not primarily a moral failure. It is a structural one.
Without a shared layer, the system cannot stabilize. Private collection keeps expanding because public access remains weak. Public access keeps narrowing because private collection becomes harder to distinguish from abuse. The loop feeds itself.
The Open Job Data Pool is an attempt to introduce coordination where none currently exists.
Declining Opportunity as a Shared Emotion
The perception of declining opportunity is not solely a function of fewer jobs.
It is a function of reduced clarity.
When listings are duplicated, stale, hidden, gated, or inconsistent, the visible market becomes unreliable. A role may exist, but the path to it is obscured. A listing may appear active, but lead nowhere. A position may be open, but not discoverable without navigating fragmented systems.
This creates a form of informational scarcity.
The experience becomes repetition instead of progress, noise instead of signal, and effort without feedback.
AI intensifies this effect. It promises leverage: faster search, better applications, deeper understanding. But that leverage collapses when the underlying data is incomplete or unstructured.
A model can help a person reason. It cannot make a dead listing alive. It cannot turn a hidden salary into a transparent one. It cannot reliably infer freshness if the source does not expose it.
The result is a mismatch: high-capability tools operating on low-quality inputs.
That mismatch produces frustration, which is often interpreted as a lack of opportunity. In reality, the problem is not only economic. It is infrastructural.
Why “Data Pool”
“Data pool” is not a branding choice. It is a structural description.
A pool has three properties:
- It is fed by multiple sources.
- It is continuously changing.
- It supports multiple consumers.
This is a better model for job data than a “job board.”
A job board is a destination. A data pool is infrastructure.
The distinction matters.
When job data is treated as infrastructure:
- multiple interfaces can exist without duplicating the data layer
- access patterns can be coordinated
- provenance can be preserved
- structure can be standardized
- AI systems can reason over cleaner inputs
The goal is not to replace job boards. The goal is to decouple the data from any single interface.
Pool Implies Stewardship
Job Pool does not need to claim exclusive ownership over the labor market’s job listings. The stronger position is stewardship: collect responsibly, normalize transparently, preserve provenance, expose useful access patterns, and improve the quality of downstream experiences.
Pool Is the Right Metaphor for AI Access
AI systems do not merely browse; they transform. They summarize, classify, compare, cluster, rank, rewrite, and extract. A pool is a better substrate for these workflows than scattered pages because it gives AI systems cleaner inputs and gives humans more inspectable outputs.
Goals
- Define a shared conceptual foundation for open, structured job data.
- Separate job data infrastructure from any single consumer product interface.
- Support multiple downstream surfaces, including search, APIs, bulk datasets, transparency pages, and contributor tools.
- Make provenance, freshness, and source quality first-class metadata.
- Reduce uncoordinated scraping pressure by offering structured downstream access.
- Enable AI-assisted job search workflows with cleaner, more reliable inputs.
- Encourage schema stability while allowing iterative improvement.
- Create a foundation for trust signals around job listings.
Non-Goals
- This RFC does not define the final database schema.
- This RFC does not define the full API contract.
- This RFC does not claim that all job data can or should be freely redistributed without source-specific review.
- This RFC does not replace employer career pages, applicant tracking systems, or job boards.
- This RFC does not define a ranking algorithm for job search.
- This RFC does not propose that every listing in the pool is equally trustworthy.
- This RFC does not encourage bypassing access controls, violating source terms, or scraping without operational restraint.
Proposal
Job Pool should be developed around the Open Job Data Pool model. The data pool is the shared substrate. Product surfaces are consumers of that substrate.
- The system MUST treat job listings as records with provenance, lifecycle, freshness, and source metadata.
- Consumer-facing job search MUST NOT be the only way to access or understand the data.
- The system MUST avoid encouraging uncontrolled scraping by providing safer, documented downstream access where possible.
- The canonical data layer SHOULD expose schemas, datasets, and RFCs as public documentation.
- The system SHOULD preserve a clear separation between data infrastructure, transparency tooling, and consumer product UX.
- The system MAY support contributor-submitted sources, scraper modules, enrichment pipelines, source confidence, and trust scores as the pool matures.
Under this proposal, mewannajob.com is best understood as a consumer product built on top of the pool. jobpool.live is best understood as the transparency and power-user surface. jobdatapool.com is best understood as the canonical data, API, dataset, schema, and RFC authority. datapool.work is best understood as a contributor and ingestion operations layer.
Design Principles
Data Before Interface
The pool must be valuable even if the consumer-facing UI changes. A durable data layer creates leverage: multiple interfaces can be built without re-solving ingestion and normalization.
Provenance Before Polish
A beautiful listing is not useful if it cannot be trusted. Source, observation time, last validation, apply URL, transformation history, and confidence should be treated as core data, not internal implementation details.
Freshness Is a First-Class Signal
Jobs decay quickly. A listing that was valid last week may not be valid today. Freshness should be visible and machine-readable.
Structured Data Is an AI Safety Feature
Better structure reduces the amount of inference required from AI systems. When fields are explicit, models can help users reason over facts instead of guessing at missing context. Structure is not just convenience; it is a guardrail against confusion.
Coordinate Access Instead of Multiplying Scrapers
The system should reduce the incentive for every user, startup, researcher, and job seeker to build a private scraper. A shared pool is more responsible than a thousand isolated collectors.
Open Does Not Mean Uncontrolled
Open access does not require careless redistribution. Source terms, privacy, abuse prevention, rate limits, and responsible use policies still matter.
Transparency Compounds Trust
Leaderboards, scraper docs, source status, dataset releases, RFCs, and public changelogs turn the system from a black box into inspectable infrastructure.
Normative Requirements
| ID | Requirement | Rationale |
|---|---|---|
| REQ-001 | The system MUST distinguish between source records and normalized records. | Consumers need normalized data, but maintainers need to audit how that normalized data was derived. |
| REQ-002 | The system MUST retain source and observation metadata for each listing where available. | Trust depends on knowing where a record came from and when it was last seen. |
| REQ-003 | The system SHOULD expose data access through more than one surface. | Search users, developers, researchers, and contributors have different access needs. |
| REQ-004 | The system SHOULD document schemas, RFCs, and dataset releases on the canonical data domain. | Stable documentation helps external users understand and trust the system. |
| REQ-005 | The system SHOULD provide freshness signals for listings and sources. | Freshness is central to job usefulness and ranking quality. |
| REQ-006 | The system SHOULD provide access patterns that reduce the need for uncoordinated scraping. | Coordinated access can reduce strain on source systems and make usage more transparent. |
| REQ-007 | The system MAY support source confidence, duplicate clustering, and listing-lifecycle classification. | These features improve trust and reduce noise, but can be introduced iteratively. |
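To illustrate REQ-001 and REQ-002, the sketch below keeps the raw source record alongside the normalized record derived from it, so derivation stays auditable. All names are assumptions for this example, not schema decisions.

```typescript
// Illustrative REQ-001/REQ-002 sketch; names are assumptions, not schema.
interface SourceRecord {
  sourceId: string;          // which source produced this payload (REQ-002)
  fetchedAt: string;         // ISO 8601 observation time (REQ-002)
  rawPayload: string;        // unmodified body as collected
}

interface NormalizedRecord {
  id: string;
  derivedFrom: string[];     // SourceRecord ids, keeping derivation auditable (REQ-001)
  title: string;
  company: string;
  lastSeenAt: string;        // freshness signal (REQ-005)
}
```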
Implementation Implications
Canonical Data Domain
The project should maintain a canonical data domain for API documentation, schemas, RFCs, dataset releases, and source-of-truth explanations. This role belongs naturally to jobdatapool.com.
Consumer Product Separation
The consumer job-search product should be free to optimize for speed, clarity, conversion, alerts, and user experience without carrying the full complexity of data infrastructure. This role belongs naturally to mewannajob.com.
Transparency Surface
A live transparency surface should expose bulk downloads, scraper outputs, scraper documentation, limited CRUD, source status, and contributor leaderboards. This role belongs naturally to jobpool.live.
Contributor Operations
Contributor and ingestion workflows should be separated from public consumer UX when possible. This keeps the product clear while preserving room for operational tooling. This role belongs naturally to datapool.work.
Public RFC Series
The RFC series should become the project’s public reasoning layer. It should explain why architectural choices exist, how terms are used, and how the system should evolve.
AI-Compatible Data Contracts
The pool should assume that AI assistants will become common consumers of job data. This implies stable field names, predictable formats, explicit uncertainty, clear source attribution, and machine-readable freshness. The goal is not to optimize only for models. The goal is to make the data clear enough that both humans and models can reason over it responsibly.
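A hedged sketch of what "explicit uncertainty" and "machine-readable freshness" could look like in a response payload. The field names are assumptions for this example; the actual contract is deferred to later documents.

```typescript
// Example payload shape only; field names are illustrative assumptions.
const exampleRecord = {
  schemaVersion: "0.1",                 // stable, versioned field names
  title: "Backend Engineer",
  company: "Example Co",
  salary: {
    value: { min: 120000, max: 150000, currency: "USD", period: "year" },
    inferred: true,                     // explicit uncertainty: parsed from prose,
    confidence: 0.7,                    // not stated as a structured field
  },
  source: { id: "example-ats", attribution: "https://example.com/jobs/123" },
  freshness: { lastSeenAt: "2025-01-15T08:00:00Z", lifecycleStatus: "active" },
};
```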
Risks and Mitigations
| Risk | Description | Mitigation |
|---|---|---|
| Low-quality ingestion | A large pool can become noisy if every source is accepted without validation. | Track source quality, last validation, duplicate clusters, and confidence signals. |
| Stale listings | Listings may remain visible after they are no longer actionable. | Expose last-seen timestamps, active checks, decay rules, and lifecycle status. |
| Legal or terms conflicts | Some sources may restrict collection, display, redistribution, or API access. | Maintain source-specific policies and avoid assuming all discovered data can be redistributed the same way. |
| Operational strain | Centralized collection can still create pressure if poorly designed. | Respect rate limits, cache aggressively, document source behavior, and prefer permitted feeds where available. |
| Product confusion | Users may confuse the consumer product, transparency layer, and data authority if the domains overlap too much. | Maintain clear surface roles and cross-link intentionally. |
| False trust | Open data can appear more authoritative than it is if confidence and provenance are hidden. | Show uncertainty, source metadata, and freshness instead of pretending the pool is perfect. |
Relationship to JPE-RFC-0002
This RFC defines the concept: the Open Job Data Pool. It establishes why the system exists, why “data pool” is the right framing, and why AI-era job search makes structured job data more valuable and more urgent.
JPE-RFC-0002, "Job Pool Web Topology," defines how this concept maps onto public domains and web surfaces. This RFC should be read first; JPE-RFC-0002 should be treated as the implementation topology for the conceptual model defined here.
References
- Federal Trade Commission, “Paying to get paid: gamified job scams drive record losses,” Dec. 12, 2024. The FTC reported that job-scam losses increased more than threefold from 2020 to 2023 and topped $220 million in the first half of 2024.
- Cloudflare, “Cloudflare Just Changed How AI Crawlers Scrape the Internet-at-Large,” July 1, 2025. Cloudflare described a shift toward permission-based AI crawling and noted that more than one million customers had used its one-click AI crawler blocking option after its September 2024 launch.
- JPE-RFC-0002, “Job Pool Web Topology.” Defines the public domain architecture for the Job Pool ecosystem.