RFC: Plugin-based Provider Architecture for scan.mjs
Tracks: #230 (Scanner & Job Board Providers umbrella). Follows: Discussion #284 — RFC Process. Status: Draft — soliciting feedback. Scope is the architectural extraction only; concrete new providers (scraper, Apify, browser transport) are deferred to follow-up RFCs/PRs that build on this foundation.
1. Problem
Adding sources requires core edits. Today, supporting a new ATS or a one-off careers page means modifying scan.mjs directly. That increases the blast radius for what should be additive changes, and it concentrates scan.mjs PRs into a single high-traffic file. The umbrella issue #230 already anticipates this and notes "new providers should be added to scan.mjs or as modules in a providers/ directory."
This RFC proposes extracting the existing Greenhouse / Ashby / Lever logic into a plugin system without changing observable behavior. Once the plugin contract is in place, follow-up RFCs can add new provider types — an HTML scraper for portals without a clean API (#230 sub-task), an Apify-based runner for 403/429 portals (#325), additional ATS adapters — as additive providers/*.mjs files with no further changes to scan.mjs.
Out of scope for this RFC: new provider implementations (scraper, Apify), browser/Playwright transport, and the 403/429 fallback story. Those are intentionally separated so this RFC can be reviewed as a pure refactor.
2. Proposed solution
A plugin-based provider system where each job source lives in its own file and scan.mjs becomes a coordinator.
2.1 Provider interface
Every providers/*.mjs file exports a default object that conforms to the Provider contract below. Types are documented as JSDoc @typedef annotations — the project is plain ESM JavaScript with no build step. Authors can drop the same blocks at the top of their provider file to get IDE type hints under // @ts-check without introducing a TypeScript dependency.
2.1.1 Types
```js
/**
 * Normalized job posting — the unit of currency throughout the scanner.
 *
 * @typedef {object} Job
 * @property {string} title    Required, non-empty after trim.
 * @property {string} url      Required, absolute URL — used as the dedup key.
 * @property {string} company  May be empty when the source can't expose it
 *                             at the list-page level; populated downstream.
 * @property {string} location May be empty.
 */

/**
 * A single `tracked_companies` entry from `portals.yml`.
 *
 * Provider-specific fields are opaque to `scan.mjs` and validated by the provider
 * itself. Examples in the current code: `api`, `list_item_pattern`, `url_must_include`,
 * `page_param`, `actor`, `input`, `field_map`, `defaults`, `timeout_ms`. Providers
 * read these directly off the entry object — no schema enforcement at the framework
 * level (see Open Question 3).
 *
 * @typedef {object} PortalEntry
 * @property {string} name User-facing label; appears in logs and placeholders.
 * @property {boolean} [enabled] Default: true.
 * @property {string} [careers_url] Public listing URL; consumed by detect().
 * @property {string} [provider] Explicit provider id — bypasses detect().
 * @property {('http'|'browser')} [transport] Default: 'http'.
 */

/**
 * Returned by `detect()` when a provider claims an entry. `url` is informational
 * (used in logs); routing only checks for a non-null return.
 *
 * @typedef {object} DetectHit
 * @property {string} url
 */

/**
 * Options forwarded to the underlying `fetch` / Playwright call.
 *
 * @typedef {object} FetchOptions
 * @property {number} [timeoutMs]
 * @property {Object<string,string>} [headers]
 * @property {string} [method]
 * @property {(string|null)} [body]
 * @property {('load'|'domcontentloaded'|'networkidle'|'commit')} [waitUntil] Browser transport only.
 */

/**
 * What `scan.mjs` hands to `provider.fetch()`. Shape varies by transport:
 * `withPage` is only present when `transport === 'browser'`.
 *
 * @typedef {object} Context
 * @property {('http'|'browser')} transport
 * @property {(url: string, opts?: FetchOptions) => Promise<string>} fetchText
 * @property {(url: string, opts?: FetchOptions) => Promise<unknown>} fetchJson
 * @property {(<T>(fn: (page: import('playwright').Page) => Promise<T>) => Promise<T>)} [withPage]
 */

/**
 * The provider contract — the default export of every `providers/*.mjs` file
 * (excluding `_`-prefixed shared helpers).
 *
 * @typedef {object} Provider
 * @property {string} id Unique across all loaded providers.
 * @property {((entry: PortalEntry) => (DetectHit | null))} [detect] Optional auto-detection.
 * @property {(entry: PortalEntry, ctx: Context) => Promise<Job[]>} fetch Required.
 */
```
2.1.2 Invariants
| Member | Must | Must not |
| --- | --- | --- |
| id | be unique across all loaded providers; equal to the provider: value in portals.yml for explicit routing | collide with another provider's id (loader logs the duplicate and keeps the first) |
| detect() | be synchronous, side-effect-free, deterministic for a given entry | perform I/O, mutate entry, or throw |
| fetch() | return an array of Job (possibly empty) when the source is reachable; jobs missing title or url may be filtered by the provider before returning | return non-array shapes; entries with missing required config should throw rather than return [] silently |
| fetch() | throw on unrecoverable failure (4xx / 5xx, schema mismatch, missing required entry config) so the run summary captures it | swallow errors silently — scan.mjs collects thrown errors per-entry and continues with the next |
2.1.3 Error contract
An error thrown from fetch() is caught by scan.mjs, attributed to entry.name, and added to the run-level error list. The scan does not abort.
HTTP non-2xx responses surface as Error objects with err.status (the HTTP code) and err.body (truncated response body) attached, so providers may branch on status if useful — default behavior is to let the error propagate to scan.mjs.
Provider load failures (the import() itself throwing) are caught by the loader, logged with the offending filename, and that file is skipped — other providers still load.
An entry with provider: <id> referencing a missing/failed provider is recorded as a resolve error and skipped, not retried via detect().
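2.1.4 Quick example — greenhouse.mjs
To make the contract concrete, a minimal sketch of the ported provider. This is illustrative only — the real port lands in Phase A. The Greenhouse board-API response shape ({ jobs: [{ title, absolute_url, location }] }) and the board-slug regex are assumptions of this sketch, not part of the contract.

```js
/** @typedef {import('./_types.js').Provider} Provider */

// Derive the board slug from a public greenhouse.io careers URL (assumed pattern).
const BOARD_RE = /greenhouse\.io\/([^/?#]+)/;

/** @type {Provider} */
const greenhouse = {
  id: 'greenhouse',

  // Synchronous and side-effect-free: claim entries whose careers_url points at greenhouse.io.
  detect(entry) {
    const m = entry.careers_url && BOARD_RE.exec(entry.careers_url);
    return m ? { url: `https://boards-api.greenhouse.io/v1/boards/${m[1]}/jobs` } : null;
  },

  // Throws (via ctx.fetchJson on non-2xx) so scan.mjs records the failure and moves on.
  async fetch(entry, ctx) {
    const hit = this.detect(entry);
    if (!hit) throw new Error(`greenhouse: cannot derive board from careers_url for ${entry.name}`);
    const data = await ctx.fetchJson(hit.url);
    return (data.jobs ?? [])
      .filter((j) => j.title && j.absolute_url) // drop jobs missing required fields
      .map((j) => ({
        title: j.title.trim(),
        url: j.absolute_url,
        company: entry.name ?? '',
        location: j.location?.name ?? '',
      }));
  },
};

export default greenhouse;
```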
The typedef imports at the top of a provider file are optional — they only matter if the author wants // @ts-check IDE hints. The runtime contract is enforced by scan.mjs (it checks id and typeof fetch === 'function' at load time and Array.isArray(result) at runtime), not by the type annotations.
The Job shape is the contract that lets the rest of scan.mjs (dedup, title filter, pipeline output, scan-history.tsv writer) stay unchanged across providers.
2.2 Discovery & priority
providers/*.mjs are auto-loaded at startup.
Files starting with _ are shared helpers (e.g. _types.js, _http.mjs), not loaded as providers. Future provider RFCs may add more (_browser.mjs, _apify.mjs, etc.) under the same convention.
Load order is alphabetical so detect() priority is deterministic across machines/filesystems. Custom plugins can use prefixes (00-foo.mjs, 99-bar.mjs) to bias priority intentionally.
Failed imports are logged and skipped — one broken provider doesn't take down the scan.
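The discovery rules above can be sketched as follows — loadProviders and the log wording are illustrative, not the final scan.mjs API:

```js
import { readdirSync } from 'node:fs';
import { pathToFileURL } from 'node:url';
import path from 'node:path';

async function loadProviders(dir) {
  const providers = [];
  const seen = new Set();
  // Alphabetical sort keeps detect() priority deterministic across filesystems;
  // _-prefixed files are shared helpers, not providers.
  const files = readdirSync(dir)
    .filter((f) => f.endsWith('.mjs') && !f.startsWith('_'))
    .sort();
  for (const file of files) {
    try {
      const mod = await import(pathToFileURL(path.join(dir, file)).href);
      const p = mod.default;
      // Runtime contract check: id and a fetch function are required.
      if (!p?.id || typeof p.fetch !== 'function') {
        console.warn(`skipping ${file}: default export is not a valid provider`);
        continue;
      }
      if (seen.has(p.id)) {
        console.warn(`duplicate provider id "${p.id}" in ${file}; keeping the first`);
        continue;
      }
      seen.add(p.id);
      providers.push(p);
    } catch (err) {
      // One broken provider must not take down the scan.
      console.warn(`failed to load ${file}: ${err.message}`);
    }
  }
  return providers;
}
```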
2.3 Resolution per portals.yml entry
entry.provider: (explicit) → forced; skips detect().
Otherwise detect() runs in alphabetical order; first hit wins.
2.4 Transport contexts
The framework defines a single transport at the ctx level for this RFC: http (default), whose ctx exposes fetchText and fetchJson (timeout-bounded fetch wrappers).
The Context typedef and PortalEntry.transport field reserve space for additional transports (e.g. browser) so future provider RFCs can add them without breaking the contract. No additional transports are implemented here.
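A sketch of the http-transport helpers planned for _http.mjs. The err.status / err.body attachment follows the error contract in 2.1.3; the default timeout value and truncation length here are illustrative assumptions:

```js
// Timeout-bounded fetch wrapper with body-snippet error reporting (sketch).
async function fetchText(url, { timeoutMs = 15000, headers, method = 'GET', body = null } = {}) {
  const res = await fetch(url, {
    method,
    headers,
    body,
    signal: AbortSignal.timeout(timeoutMs), // abort slow portals instead of hanging the scan
  });
  const text = await res.text();
  if (!res.ok) {
    const err = new Error(`HTTP ${res.status} for ${url}`);
    err.status = res.status;       // providers may branch on this
    err.body = text.slice(0, 500); // truncated body snippet for the run summary
    throw err;
  }
  return text;
}

async function fetchJson(url, opts) {
  return JSON.parse(await fetchText(url, opts));
}
```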
2.5 Backwards compatibility with existing scan.mjs flow
The plugin layer is additive. Specifically:
Dedup model unchanged: URL + ${company.toLowerCase()}::${title.toLowerCase()}.
scan-history.tsv source label keeps the ${provider.id}-api suffix so existing rows continue to match for dedup. (Renaming the suffix would invalidate every existing Greenhouse/Ashby/Lever row in the file.)
Title filter (positive/negative), pipeline output format, --dry-run, --company flags — all unchanged.
Existing portals.yml entries work as-is: Greenhouse/Ashby/Lever auto-detect from careers_url. New fields (provider:, transport:, timeout_ms, etc.) are all optional.
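For illustration, a portals.yml entry before and after the optional fields — the company name and values here are hypothetical; provider, transport, and timeout_ms are the new opt-in fields:

```yaml
tracked_companies:
  # Existing entry — unchanged; greenhouse is auto-detected from careers_url.
  - name: Acme
    careers_url: https://boards.greenhouse.io/acme

  # Same entry with the new optional fields spelled out explicitly.
  - name: Acme
    careers_url: https://boards.greenhouse.io/acme
    provider: greenhouse   # bypasses detect()
    transport: http        # default; reserved for a future 'browser' transport
    timeout_ms: 10000
```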
3. Files affected
System layer (auto-updatable)
| File | Change |
| --- | --- |
| scan.mjs | Refactored to plugin-loader / coordinator. Hardcoded provider switches removed. |
| providers/_types.js | New. JSDoc @typedef catalog for Job, PortalEntry, DetectHit, FetchOptions, Context, Provider. Pure documentation — providers import types via /** @typedef {import('./_types.js').Provider} Provider */. |
| providers/_http.mjs | New. Timeout-bounded fetch wrapper with body-snippet error reporting. |
| providers/greenhouse.mjs | New (port). Extracted from current scan.mjs. |
| providers/ashby.mjs | New (port). Extracted from current scan.mjs. |
| providers/lever.mjs | New (port). Extracted from current scan.mjs. |
| templates/portals.example.yml | Optional provider: field documented on the existing entries; no new entry types. |
| CLAUDE.md | Documents the provider-plugin model, helper conventions, and the alphabetical-load-order rule (Phase B). |
User layer (untouched)
portals.yml, cv.md, config/profile.yml, modes/_profile.md, data/*, reports/*, output/*, interview-prep/* — none of these are written to or modified by this RFC. Existing portals.yml entries continue to work; new fields are opt-in.
4. Data Contract impact
No new user-layer files. No changes to existing user-layer files.
All new providers/*.mjs files are System layer (scripts), auto-updatable like the rest of *.mjs.
Underscore-prefix convention (_types.js, _http.mjs) for shared helpers is enforced by scan.mjs's loader (files starting with _ are skipped). Convention is documented in CLAUDE.md.
templates/portals.example.yml (System layer) gets a one-line provider: documentation update; the user's own portals.yml (User layer) is not auto-modified, and existing entries continue to work without any provider: field (auto-detection handles them).
CLAUDE.md updates are System-layer and need core-architecture review per the existing process — that's why they're isolated to Phase B.
5. Phases
Two PRs, in order, after this RFC is approved.
Phase A — Plugin infra + port existing providers (advances #230)
scan.mjs refactor (load providers from providers/, resolve per-entry, dispatch).
Transport helper: _http.mjs.
Type-doc catalog: _types.js.
Extract greenhouse.mjs, ashby.mjs, lever.mjs from scan.mjs into the new interface.
Remove the hardcoded provider paths from scan.mjs once the ports are in place.
Pure refactor. Source labels in scan-history.tsv stay (-api suffix preserved). No behavior change for existing portals.yml entries; existing entries auto-detect via careers_url exactly as today.
Estimated diff: ~6 files, mostly moves with a small net-new core.
Phase B — CLAUDE.md documentation
Document the provider-plugin model, the underscore-prefix convention, the alphabetical-load-order priority rule, and the optional provider: override on portals.yml entries.
Update the "Stack and Conventions" section.
Add a short provider-authoring guide pointing at _types.js and the existing greenhouse.mjs as the canonical example.
Estimated diff: ~1 file (CLAUDE.md), no code.
Out of scope (follow-up RFCs/PRs)
The following land in separate, additive PRs once the plugin contract is in place:
Browser/Playwright transport (_browser.mjs) and the transport: browser PortalEntry field — needed by future scraper / 403-prone providers.
Generic HTML scraper provider (scraper.mjs) — schema.org / pattern-based extraction for portals without a clean API.
Apify-based provider (apify.mjs, _apify.mjs, APIFY_TOKEN in .env.example) — closes #325 ("Feature: Apify-based scraper as complement to scan.mjs for Indeed Canada") against this architecture.
Any opt-in fallback chain (provider: greenhouse, fallback: scraper) and per-entry credit caps.
These are deliberately deferred so this RFC reviews as a pure refactor with no behavior change.
6. Open questions
Plugin priority control. Alphabetical-by-filename plus user-prefix convention (00-, 99-) — is that sufficient, or do we want an explicit priority: field on the exported provider object? Filename ordering is simpler; explicit priority is more flexible.
Config validation. Should the framework provide a hook for providers to declare a JSON schema for their portals.yml block, so misconfigurations fail at load time rather than at fetch time? Today each provider hand-rolls validation inside fetch().
Reserved-but-unused transport field. The PortalEntry.transport field is declared in the type catalog so future provider RFCs can add browser without breaking the contract, but no provider in this RFC reads it. Is that acceptable forward-compatibility, or should the field land with the first provider that actually uses it?
Appendix — relationship to PR #454
PR #454 ("feat(scan): plugin-based provider architecture with Apify support (#325)") implemented the full architecture (plugin infra + browser transport + scraper + Apify) in a single PR before opening an RFC, which was the wrong order and the reason reviewers asked for a split. The branch is preserved (now in draft) and will be carved up as follows:
_http.mjs, _types.js, and the greenhouse.mjs / ashby.mjs / lever.mjs ports come out of #454 — no browser transport, no scraper, no Apify.
_browser.mjs and scraper.mjs follow from #454, with the round-2 fixes already on the branch (scraper match-cap, fileURLToPath, etc.).
_apify.mjs, apify.mjs, the .env.example APIFY_TOKEN entry, and the round-2 fixes (Authorization header, proxyConfiguration rename, browserPromise launch-failure clear where applicable) land last.
Round-2 review fixes already pushed to the #454 branch will be cherry-picked into whichever follow-up PR contains the file they touch, rather than landed cumulatively against this RFC's Phase A.