Resolution Pipeline - Jobo Enterprise

Pipeline Overview

The Geocode endpoint processes location strings through a 5-tier pipeline. Tiers are tried in order, and the pipeline stops at the first tier that produces a confident result.

Tier Details

⚡ 1. Cache

Previously resolved locations stored in MongoDB, keyed by a normalized form of the input string. Provides sub-millisecond responses for repeated queries.

Method returned: cache
When used: The same (or equivalently normalized) input was recently resolved
Performance: < 1ms
Key generation: Input is lowercased, diacritics stripped, whitespace collapsed, and punctuation removed to produce a stable cache key — so "San Francisco, CA", "san francisco ca", and "SAN FRANCISCO, CA" all hit the same cache entry.
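The key generation described above could look roughly like the following. This is a minimal sketch, not the service's implementation: the function name `cache_key` is hypothetical, and replacing punctuation with a space (rather than deleting it outright) is an assumption made so that "Austin,TX" and "Austin, TX" produce the same key.

```python
import re
import unicodedata


def cache_key(raw: str) -> str:
    """Hypothetical helper: derive a stable cache key from a raw location string."""
    # Decompose accented characters, then drop the combining marks (ü -> u).
    decomposed = unicodedata.normalize("NFKD", raw)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = no_marks.lower()
    # Assumption: punctuation becomes a space so adjacent tokens stay separated.
    no_punct = re.sub(r"[^\w\s]|_", " ", lowered)
    # Collapse any run of whitespace into a single space and trim the ends.
    return " ".join(no_punct.split())


keys = {cache_key(s) for s in ("San Francisco, CA", "san francisco ca", "SAN FRANCISCO, CA")}
print(keys)  # a single key shared by all three inputs
```

Because every variant collapses to one key, a single stored document can serve all of them.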

🌐 2. Remote Keyword

Detects remote/virtual/distributed keywords before any geocoding is attempted. Returns a successful result with country: "Remote" and no coordinates, since the job is location-independent.

Method returned: remote_keyword
Keywords detected: Remote, Work from Home, Virtual, Distributed, Anywhere, WFH, Telecommute, Telework, Work-from-home
Returns: succeeded: true with country: "Remote", no coordinates
Performance: < 1ms

Placeholder strings like "2 Locations" or "Multiple Locations" are also detected early and returned as failures with a descriptive error, rather than being sent through the geocoding pipeline.
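The early checks in this tier can be sketched as below. The sketch makes several assumptions: it treats an input as remote only when the whole trimmed string equals one of the listed keywords, the placeholder pattern covers only the two examples given, and the `error` field name and message are hypothetical. The `succeeded`, `method`, and `country` values come from the description above.

```python
import re
from typing import Optional

# Keywords listed for the Remote Keyword tier, lowercased for comparison.
REMOTE_KEYWORDS = frozenset({
    "remote", "work from home", "work-from-home", "virtual", "distributed",
    "anywhere", "wfh", "telecommute", "telework",
})

# Assumption: placeholders look like "2 Locations" or "Multiple Locations".
_PLACEHOLDER_RE = re.compile(r"^\s*(?:\d+|multiple)\s+locations?\s*$", re.IGNORECASE)


def pre_check(raw: str) -> Optional[dict]:
    """Return an early result, or None when the input should continue down the pipeline."""
    if _PLACEHOLDER_RE.match(raw):
        # Hypothetical error payload; the real error text is not documented here.
        return {"succeeded": False, "error": "placeholder string, not a location"}
    if raw.strip().lower() in REMOTE_KEYWORDS:
        # No coordinates are attached because the job is location-independent.
        return {"succeeded": True, "method": "remote_keyword", "country": "Remote"}
    return None
```

Running both checks before any geocoding keeps these inputs on the sub-millisecond path.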

🔍 3. Pattern Parse

Deterministic rule-based parser using pattern matching against built-in reference data. Handles standardized formats without any external HTTP calls. Results that pass a minimum confidence threshold (50%) are enriched with coordinates via the Photon geocoder.

Method returned: pattern_parse
Performance: < 5ms (pattern match) + 50–200ms (coordinate enrichment via Photon)
Confidence scoring: Each parsed result is assigned a confidence score. Only results ≥ 50% confidence are accepted.
Parsing order: Country-first → US format → Canadian → Australian → UK constituent → Irish county → State-first → International → Country alias → Multi-part fallback → Single part

Formats handled:

Format	Examples
US: City, State	`San Francisco, CA` · `Austin, Texas`
US: State abbreviation alone	`CA` · `NY` · `TX`
Canada: City, Province	`Toronto, ON` · `Vancouver, BC`
Australia: City, State	`Sydney, NSW` · `Melbourne, VIC`
UK: City, Constituent Country	`London, England` · `Edinburgh, Scotland`
Ireland: County format	`County Cork` · `Dublin, Ireland`
Country-first	`Germany, Berlin` · `ITA, Rome`
ISO alpha-2 / alpha-3 codes	`US` · `GB` · `DEU` · `ITA` · `SWE`
Well-known cities (standalone)	`London` · `Tokyo` · `Dubai` · `Mumbai`
City aliases & transliterations	`München` → Munich · `NYC` → New York City
Region aliases	`EMEA` · `APAC` · `LATAM` · `DACH`
Compound expressions	`US or Canada` · `New York and San Francisco`

🗺️ 4. Photon Geocoder

OpenStreetMap-backed geocoding service for locations that don’t match any deterministic pattern. Provides worldwide coverage with street-level accuracy for most cities.

Method returned: geocoder
Coverage: Worldwide (OpenStreetMap data)
Performance: 50–200ms
Accuracy: Street-level for most cities, region-level for less common locations
When used: Pattern Parse either failed entirely or produced a result below the confidence threshold
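Photon responds with GeoJSON, in which a point's coordinates are ordered longitude first, then latitude. The sketch below shows how coordinates might be pulled from such a response. The helper name `first_coordinates` and the sample payload are illustrative assumptions, and no network call is made.

```python
from typing import Optional


def first_coordinates(feature_collection: dict) -> Optional[tuple[float, float]]:
    """Hypothetical helper: return (lat, lng) of the top hit, or None when there are no hits."""
    features = feature_collection.get("features") or []
    if not features:
        return None
    # GeoJSON stores positions as [longitude, latitude].
    lon, lat = features[0]["geometry"]["coordinates"][:2]
    return (lat, lon)


# Illustrative payload in the GeoJSON shape Photon uses; the values are examples only.
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [13.38886, 52.51704]},
            "properties": {"name": "Berlin", "country": "Germany"},
        }
    ],
}
print(first_coordinates(sample))
```

Swapping the order when reading coordinates is a common source of bugs, so the conversion is kept in one place.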

🤖 5. LLM Fallback

AI-powered parsing via the OpenRouter API with structured JSON output, used for highly ambiguous or complex location strings that all previous tiers failed to resolve. Failed calls are retried with an increasing backoff delay for reliability.

Method returned: llm
Use cases: Unusual abbreviations, creative misspellings, multi-language input, exotic formats
Performance: 500–2000ms
Retries: Up to 3 attempts with 300ms × attempt backoff
Output: Structured city/region/country extraction with display name generation

The LLM tier is only available when an OpenRouter API key is configured. If unavailable, the pipeline returns a no_match error for inputs that reach this tier.
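The retry policy listed above (up to 3 attempts, 300ms × attempt) can be sketched as follows. The function name, its parameters, and the choice to retry on any exception are assumptions for illustration; the actual error handling of the LLM tier is not documented here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay_ms: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Hypothetical helper: retry `call`, waiting base_delay_ms * attempt between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            # Assumption: every exception is retryable; re-raise after the last attempt.
            if attempt == max_attempts:
                raise
            sleep(base_delay_ms * attempt / 1000)
    raise AssertionError("unreachable")
```

With the defaults, the waits are 300ms after the first failure and 600ms after the second. The `sleep` parameter exists so tests can record delays instead of waiting.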

Internal Processing Steps

Under the hood, the 5 public tiers are implemented as a 9-step internal pipeline. This section documents the full processing flow for each location string:

Cache Lookup

The input is normalized into a stable cache key (lowercased, diacritics stripped, punctuation removed, whitespace collapsed). MongoDB is queried for a previously stored result. If found → return immediately.

Remote Keyword Detect

The raw input is checked against a set of known remote/virtual keywords (IsFullyRemote). If matched → return remote_keyword result.

Clean & Normalize

The input is cleaned by stripping noise words (hybrid, onsite, office, headquarters, greater, metro, etc.), removing street addresses, suite numbers, postal codes, and other non-geographic tokens. Abbreviations are expanded and diacritics are handled.

Compound Split

The cleaned string is checked for compound separators: " or ", " and ", " & ", "/". If one is found, the string is split into independent parts, and each part is resolved separately through steps 5–8 (Pattern Normalize through LLM Fallback). The results are merged into a single response with multiple locations.
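A minimal sketch of the split on the four separators named above. The function name `split_compound` is hypothetical, and requiring whitespace around "or", "and", and "&" is an assumption that keeps names such as "Andorra" or "Oregon" intact.

```python
import re

# " or ", " and ", " & " need surrounding whitespace; "/" splits anywhere.
_SEPARATORS = re.compile(r"\s+(?:or|and|&)\s+|/", re.IGNORECASE)


def split_compound(cleaned: str) -> list[str]:
    """Hypothetical helper: split a cleaned location string into independent parts."""
    parts = (part.strip() for part in _SEPARATORS.split(cleaned))
    return [part for part in parts if part]


print(split_compound("US or Canada"))
print(split_compound("New York and San Francisco"))
```

A string without any separator comes back as a one-element list, so callers can treat simple and compound inputs the same way.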

Pattern Normalize

Abbreviations are expanded (e.g., St. → Saint, Ft. → Fort), diacritics are processed, and the input is split into semantic parts (city, region, country) for rule-based matching.
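The expansion and part-splitting can be sketched as below. Only the two abbreviations given as examples are included, the function name `normalize_parts` is hypothetical, and splitting on commas is an assumption about how parts are delimited.

```python
import re

# Only the two examples from the text; the real table is presumably larger.
ABBREVIATIONS = {"st": "Saint", "ft": "Fort"}

# Match "St"/"Ft" as a whole word, with an optional period, followed by whitespace.
_ABBR_RE = re.compile(r"\b(st|ft)\.?(?=\s)", re.IGNORECASE)


def normalize_parts(text: str) -> list[str]:
    """Hypothetical helper: expand abbreviations, then split into comma-separated parts."""
    expanded = _ABBR_RE.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)
    return [part.strip() for part in expanded.split(",") if part.strip()]


print(normalize_parts("St. Louis, MO"))
print(normalize_parts("Ft. Worth, TX"))
```

The word-boundary check means the "st" inside "Austin" or "West" is left alone.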

Pattern Parse

The normalized parts are run through a chain of 11 pattern matchers in priority order: Country-first → US → Canadian → Australian → UK constituent → Irish county → State-first → International → Country alias → Multi-part fallback → Single part. Each matcher checks against the built-in reference data. A confidence score is calculated — results ≥ 50% are accepted.
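The matcher chain could be structured roughly as follows. This sketch includes only two of the eleven matchers with a tiny sample of reference data. The confidence values (0.9 and 0.7) are invented for illustration, and falling through to the next matcher when a result scores below the threshold is an assumption; only the 50% threshold itself comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.5  # the 50% threshold described above


@dataclass
class Parsed:
    city: Optional[str]
    region: Optional[str]
    country: str
    confidence: float


# Tiny samples standing in for the built-in reference data.
US_STATES = {"ca": "California", "tx": "Texas", "texas": "Texas"}
WELL_KNOWN_CITIES = {"london": ("London", "United Kingdom"), "tokyo": ("Tokyo", "Japan")}


def match_us(parts: list[str]) -> Optional[Parsed]:
    # "City, State" where the second part is a known state name or abbreviation.
    if len(parts) == 2 and parts[1].lower() in US_STATES:
        return Parsed(parts[0], US_STATES[parts[1].lower()], "United States", 0.9)
    return None


def match_single(parts: list[str]) -> Optional[Parsed]:
    # A standalone, globally unambiguous city name.
    if len(parts) == 1 and parts[0].lower() in WELL_KNOWN_CITIES:
        city, country = WELL_KNOWN_CITIES[parts[0].lower()]
        return Parsed(city, None, country, 0.7)
    return None


MATCHERS: list[Callable[[list[str]], Optional[Parsed]]] = [match_us, match_single]


def pattern_parse(text: str) -> Optional[Parsed]:
    """Run matchers in priority order and accept the first sufficiently confident result."""
    parts = [p.strip() for p in text.split(",") if p.strip()]
    for matcher in MATCHERS:
        result = matcher(parts)
        if result is not None and result.confidence >= CONFIDENCE_THRESHOLD:
            return result
    return None
```

Keeping the matchers in an ordered list makes the priority explicit and lets a new format be added without touching the loop.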

Photon Geocoder

If pattern parsing produced a confident result that lacks coordinates, Photon enriches it with lat/lng. If pattern parsing failed entirely or fell below the confidence threshold, Photon is used as the primary resolver with the full cleaned string.

LLM Fallback

For inputs that all deterministic methods failed to resolve, an LLM call extracts structured city/region/country data from the raw string. The LLM response is validated and the country name is resolved against reference data.

Cache Write

Successful results are written back to MongoDB so that subsequent requests for the same (or equivalently normalized) input resolve instantly from cache.
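Taken together, the flow reduces to a cache lookup, an ordered walk over resolvers, and a cache write on success. The sketch below is a simplification under stated assumptions: an in-memory dict stands in for MongoDB, the cache key omits diacritic and punctuation handling, the cleaning and compound-split steps are left out, and the `error` field name is hypothetical.

```python
from typing import Callable, Optional

Resolver = Callable[[str], Optional[dict]]


def resolve(raw: str, cache: dict, tiers: list[tuple[str, Resolver]]) -> dict:
    """Hypothetical orchestrator: the first tier that returns a result wins."""
    # Simplified key; the documented key also strips diacritics and punctuation.
    key = " ".join(raw.lower().split())
    if key in cache:
        return {**cache[key], "method": "cache"}
    for method, resolver in tiers:
        result = resolver(raw)
        if result is not None:
            response = {"succeeded": True, "method": method, **result}
            cache[key] = response  # write back so the next request hits the cache
            return response
    return {"succeeded": False, "error": "no_match"}


# Stub resolvers: pattern parsing finds nothing, the geocoder succeeds.
cache: dict = {}
tiers = [
    ("pattern_parse", lambda s: None),
    ("geocoder", lambda s: {"country": "Germany"}),
]
first = resolve("Berlin", cache, tiers)
second = resolve("  berlin ", cache, tiers)
print(first["method"], second["method"])
```

The second call reports `cache` because the first call stored its result under the same normalized key.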

Built-in Reference Data

The Pattern Parse tier operates against a comprehensive inventory of built-in reference data, compiled from real-world job posting data. All lookups use case-insensitive, zero-allocation frozen collections for maximum throughput.

🇺🇸 US States & Territories

51 entries — all 50 states + District of Columbia. Both 2-letter abbreviations (CA, NY, TX) and full names (California, New York, Texas).

🇨🇦 Canadian Provinces

13 entries — all provinces and territories. Both abbreviations (ON, BC, QC) and full names (Ontario, British Columbia, Quebec).

🇦🇺 Australian States

8 entries — all states and territories. Both abbreviations (NSW, VIC, QLD) and full names (New South Wales, Victoria, Queensland).

🌍 ISO Alpha-2 Country Codes

~120 codes — comprehensive coverage of countries commonly seen in job postings, from US and GB to KE and UZ. Includes common variants such as UK, U.S., and U.S.A.

🌏 ISO Alpha-3 Country Codes

~80 codes — three-letter codes like DEU, GBR, ITA, SWE, AUS for formats like "ITA, Rome".

🏙️ City Aliases & Transliterations

50+ aliases — maps common abbreviations and non-English names to canonical forms: NYC → New York City, München → Munich, SF → San Francisco, CDMX → Mexico City, Bombay → Mumbai.

🗺️ Well-known Cities

200+ cities — globally unambiguous major cities with pre-mapped country and region. Covers US, Canada, Europe, Asia, Middle East, Africa, and Latin America. Used for single-city-name resolution (e.g., "London" → United Kingdom).

📍 Region Aliases

15 aliases — corporate region codes mapped to geographic areas: EMEA → Europe, APAC → Asia Pacific, LATAM → Latin America, DACH → Europe, BENELUX → Europe, GCC → Middle East, ANZ → Asia Pacific, NORDICS → Europe.

Additional reference data:

German states (Bundesländer): 24 entries (both German and English names) for "City, Bavaria, Germany" patterns
Indian states: 30 entries for "City, Maharashtra, India" patterns
UK constituent countries: England, Scotland, Wales, Northern Ireland
Country name aliases: 70+ variant spellings, old names, non-English country names (Deutschland → Germany, Czechia → Czech Republic, 日本 → Japan, भारत → India)
Ambiguous code registry: 9 codes that collide between US state or Canadian province abbreviations and country codes (CA, IN, GA, DE, CO, AL, ME, NL, SK) with context-aware disambiguation
Noise keywords: 35+ words stripped during cleaning (hybrid, onsite, office, headquarters, metro, greater, etc.)

Response `method` Field

The method field in every response tells you which tier resolved the location:

Method	Tier	Typical Latency	Description
`cache`	1	< 1ms	Previously resolved, served from MongoDB
`remote_keyword`	2	< 1ms	Detected as remote/virtual keyword
`pattern_parse`	3	< 5ms + enrichment	Rule-based match against reference data
`geocoder`	4	50–200ms	Resolved by Photon (OpenStreetMap)
`llm`	5	500–2000ms	AI-parsed by LLM via OpenRouter
`no_match`	—	—	All tiers failed (returned as error)

Performance note: In production, the vast majority of locations resolve via cache or pattern parse, where the match itself takes from under 1ms to about 5ms. The Photon geocoder adds roughly 50–200ms, whether it is enriching a parsed result with coordinates or resolving the location outright. The LLM fallback is the slowest tier (roughly 500–2000ms) but handles the most ambiguous inputs. Because all successful results are cached, repeat requests for the same (or equivalently normalized) input are served by the cache tier in under 1ms.
