Pipeline Overview

The Geocode endpoint processes location strings through a 5-tier pipeline. Tiers are tried in order, and the pipeline stops at the first tier that produces a confident result.
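
The first-match-wins flow can be sketched as a loop over tier functions. This is a minimal illustration, not the service's implementation: the `resolve` function, the tier callables, and the dict-shaped result are all hypothetical names chosen for the example.

```python
from typing import Callable, Optional, Sequence

# Hypothetical result shape: a plain dict carrying at least "method".
Result = dict
Tier = Callable[[str], Optional[Result]]

def resolve(location: str, tiers: Sequence[Tier]) -> Result:
    """Try each tier in order and return the first result a tier produces."""
    for tier in tiers:
        result = tier(location)
        if result is not None:
            return result
    # Every tier declined: report the documented no_match outcome.
    return {"succeeded": False, "method": "no_match"}

def cache_tier(location: str) -> Optional[Result]:
    return None  # stand-in for a cache miss

def remote_tier(location: str) -> Optional[Result]:
    # Stand-in for tier 2; only the literal word "remote" is recognized here.
    if location.strip().lower() == "remote":
        return {"succeeded": True, "method": "remote_keyword", "country": "Remote"}
    return None
```

Calling `resolve("Remote", [cache_tier, remote_tier])` returns the `remote_keyword` result without consulting any later tier.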

Tier Details

⚑ 1. Cache

Previously resolved locations stored in MongoDB, keyed by a normalized form of the input string. Provides sub-millisecond responses for repeated queries.
  • Method returned: cache
  • When used: The same (or equivalently normalized) input was recently resolved
  • Performance: < 1ms
  • Key generation: Input is lowercased, diacritics stripped, whitespace collapsed, and punctuation removed to produce a stable cache key, so "San Francisco, CA", "san francisco ca", and "SAN FRANCISCO, CA" all hit the same cache entry.
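
The key generation described above can be sketched with the Python standard library. The function name `cache_key` is hypothetical, and replacing punctuation with a space (rather than deleting it outright) is an assumption made so that inputs like "Francisco,CA" do not fuse into a single token.

```python
import re
import unicodedata

def cache_key(raw: str) -> str:
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = no_marks.lower()
    # Replace punctuation with spaces (assumption, see the note above).
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(no_punct.split())
```

With this sketch, the three spellings from the example all map to the key `san francisco ca`.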

🌐 2. Remote Keyword

Detects remote/virtual/distributed keywords before any geocoding is attempted. Returns a successful result with country: "Remote" and no coordinates, since the job is location-independent.
  • Method returned: remote_keyword
  • Keywords detected: Remote, Work from Home, Virtual, Distributed, Anywhere, WFH, Telecommute, Telework, Work-from-home
  • Returns: succeeded: true with country: "Remote", no coordinates
  • Performance: < 1ms
Placeholder strings like "2 Locations" or "Multiple Locations" are also detected early and returned as failures with a descriptive error, rather than being sent through the geocoding pipeline.
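
A rough sketch of both early checks follows. The keyword list is taken from the text above; the regular expressions, the `classify` function, the error message, and the order of the two checks are assumptions. In particular, this sketch matches a keyword anywhere in the input, whereas the real detector may be stricter about mixed strings.

```python
import re
from typing import Optional

REMOTE_KEYWORDS = (
    "remote", "work from home", "work-from-home", "virtual", "distributed",
    "anywhere", "wfh", "telecommute", "telework",
)

# Word-boundary match so a keyword does not fire inside a longer word.
_REMOTE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in REMOTE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

# Assumed placeholder shapes: "<number> Locations" or "Multiple Locations".
_PLACEHOLDER_RE = re.compile(r"^\s*(?:\d+|multiple)\s+locations?\s*$", re.IGNORECASE)

def classify(raw: str) -> Optional[dict]:
    if _PLACEHOLDER_RE.match(raw):
        # Returned as a failure instead of being geocoded.
        return {"succeeded": False, "error": "placeholder, not a location"}
    if _REMOTE_RE.search(raw):
        return {"succeeded": True, "method": "remote_keyword", "country": "Remote"}
    return None  # fall through to the next tier
```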

🔍 3. Pattern Parse

Deterministic rule-based parser that matches the input against built-in reference data. The matching itself handles standardized formats without any external HTTP calls; results that pass the minimum confidence threshold (50%) are then enriched with coordinates via the Photon geocoder.
  • Method returned: pattern_parse
  • Performance: < 5ms (pattern match) + 50–200ms (coordinate enrichment via Photon)
  • Confidence scoring: Each parsed result is assigned a confidence score. Only results with ≥ 50% confidence are accepted.
  • Parsing order: Country-first → US format → Canadian → Australian → UK constituent → Irish county → State-first → International → Country alias → Multi-part fallback → Single part
Formats handled:

| Format | Examples |
| --- | --- |
| US: City, State | San Francisco, CA · Austin, Texas |
| US: State abbreviation alone | CA · NY · TX |
| Canada: City, Province | Toronto, ON · Vancouver, BC |
| Australia: City, State | Sydney, NSW · Melbourne, VIC |
| UK: City, Constituent Country | London, England · Edinburgh, Scotland |
| Ireland: County format | County Cork · Dublin, Ireland |
| Country-first | Germany, Berlin · ITA, Rome |
| ISO alpha-2 / alpha-3 codes | US · GB · DEU · ITA · SWE |
| Well-known cities (standalone) | London · Tokyo · Dubai · Mumbai |
| City aliases & transliterations | München → Munich · NYC → New York City |
| Region aliases | EMEA · APAC · LATAM · DACH |
| Compound expressions | US or Canada · New York and San Francisco |
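
To illustrate the idea of an ordered matcher chain with a confidence cutoff, here is a small sketch covering three of the eleven matchers. The reference tables are tiny stand-ins, the function names are hypothetical, and the confidence values (0.9 and 0.3) are invented for the example; only the 50% threshold comes from the text.

```python
# Tiny illustrative subsets; the real reference data is far larger.
US_STATES = {"ca": "California", "tx": "Texas", "texas": "Texas"}
COUNTRIES = {"germany": "Germany", "ita": "Italy"}

MIN_CONFIDENCE = 0.5  # the documented 50% threshold

def country_first(parts):
    # "Germany, Berlin": the first part is a country, the second a city.
    if len(parts) == 2 and parts[0].lower() in COUNTRIES:
        return {"city": parts[1], "country": COUNTRIES[parts[0].lower()], "confidence": 0.9}
    return None

def us_city_state(parts):
    # "Austin, Texas": the second part is a US state name or abbreviation.
    if len(parts) == 2 and parts[1].lower() in US_STATES:
        return {"city": parts[0], "region": US_STATES[parts[1].lower()],
                "country": "United States", "confidence": 0.9}
    return None

def single_part(parts):
    # Last-resort guess; the low score (assumed) keeps it under the threshold.
    if len(parts) == 1:
        return {"city": parts[0], "confidence": 0.3}
    return None

MATCHERS = [country_first, us_city_state, single_part]  # priority order

def pattern_parse(text):
    parts = [p.strip() for p in text.split(",") if p.strip()]
    for matcher in MATCHERS:
        result = matcher(parts)
        if result and result["confidence"] >= MIN_CONFIDENCE:
            return result
    return None  # nothing confident enough; a later tier takes over
```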

🗺️ 4. Photon Geocoder

OpenStreetMap-backed geocoding service for locations that don't match any deterministic pattern. Provides worldwide coverage with street-level accuracy for most cities.
  • Method returned: geocoder
  • Coverage: Worldwide (OpenStreetMap data)
  • Performance: 50–200ms
  • Accuracy: Street-level for most cities, region-level for less common locations
  • When used: Pattern Parse either failed entirely or produced a result below the confidence threshold
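
Photon answers search queries with a GeoJSON FeatureCollection, so a client for this tier mostly builds a query URL and reads the first feature. The sketch below assumes the public Photon instance as the base URL (the service may run its own), and the helper names and returned dict shape are hypothetical. It performs no network call itself.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: the public Photon instance; a self-hosted URL would go here.
PHOTON_BASE = "https://photon.komoot.io/api"

def photon_url(query: str, limit: int = 1) -> str:
    return f"{PHOTON_BASE}?{urlencode({'q': query, 'limit': limit})}"

def first_hit(geojson: dict) -> Optional[dict]:
    features = geojson.get("features") or []
    if not features:
        return None
    feature = features[0]
    # GeoJSON orders coordinates as [longitude, latitude].
    lon, lat = feature["geometry"]["coordinates"][:2]
    props = feature.get("properties", {})
    return {
        "method": "geocoder",
        "lat": lat,
        "lng": lon,
        "city": props.get("name"),
        "country": props.get("country"),
    }
```

Note the coordinate order: reading the pair as latitude-first is a common source of swapped coordinates.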

🤖 5. LLM Fallback

AI-powered parsing via the OpenRouter API with structured JSON output, for highly ambiguous or complex location strings that all previous tiers failed to resolve. Failed calls are retried with an increasing backoff delay for reliability.
  • Method returned: llm
  • Use cases: Unusual abbreviations, creative misspellings, multi-language input, exotic formats
  • Performance: 500–2000ms
  • Retries: Up to 3 attempts with 300ms × attempt backoff
  • Output: Structured city/region/country extraction with display name generation
The LLM tier is only available when an OpenRouter API key is configured. If unavailable, the pipeline returns a no_match error for inputs that reach this tier.
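
The retry policy listed above (up to 3 attempts, delay of 300ms × attempt) might look like the following. The function name, the injectable `sleep` parameter, and the choice to return `None` after the last failure are assumptions for illustration; the actual LLM call is represented by an arbitrary callable.

```python
import time
from typing import Callable, Optional

MAX_ATTEMPTS = 3
BASE_DELAY_MS = 300

def call_with_retries(
    call: Callable[[], dict],
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return call()
        except Exception:
            if attempt == MAX_ATTEMPTS:
                return None  # the caller would report no_match
            # Wait 300 ms after the first failure, 600 ms after the second.
            sleep(BASE_DELAY_MS * attempt / 1000)
    return None
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.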

Internal Processing Steps

Under the hood, the 5 public tiers are implemented as a 9-step internal pipeline. This section documents the full processing flow for each location string:

1. Cache Lookup

The input is normalized into a stable cache key (lowercased, diacritics stripped, whitespace collapsed, punctuation removed). MongoDB is queried for a previously stored result. If one is found, it is returned immediately.

2. Remote Keyword Detect

The raw input is checked against a set of known remote/virtual keywords (IsFullyRemote). If a keyword matches, the remote_keyword result is returned.

3. Clean & Normalize

The input is cleaned by stripping noise words (hybrid, onsite, office, headquarters, greater, metro, etc.), removing street addresses, suite numbers, postal codes, and other non-geographic tokens. Abbreviations are expanded and diacritics are handled.
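
A simplified sketch of the cleaning step is shown below. The noise words come from the text; the suite and postal-code patterns are assumptions that only cover common US shapes, and street-address removal is left out entirely. The `clean` function name is hypothetical.

```python
import re

NOISE_WORDS = {"hybrid", "onsite", "office", "headquarters", "greater", "metro"}

# Assumed patterns; real suite and postal-code formats vary by country.
_SUITE_RE = re.compile(r"\b(?:suite|ste|unit)\b\.?\s*#?\s*\d+\w*", re.IGNORECASE)
_US_ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")

def clean(raw: str) -> str:
    text = _SUITE_RE.sub(" ", raw)
    text = _US_ZIP_RE.sub(" ", text)
    kept = []
    for chunk in text.split(","):
        # Drop noise words, ignoring surrounding parentheses or hyphens.
        words = [w for w in chunk.split() if w.lower().strip("()-") not in NOISE_WORDS]
        if words:
            kept.append(" ".join(words))
    return ", ".join(kept)
```

For example, "Greater Boston Metro, MA 02108 (Hybrid)" reduces to "Boston, MA" under these rules.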

4. Compound Split

The cleaned string is checked for compound separators: " or ", " and ", " & ", "/". If found, the string is split into independent parts, and each part is resolved separately through steps 5–8. Results are merged into a single response with multiple locations.
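
The split-then-merge behaviour could be sketched as follows. The separators are the ones named in the text; the function names, the `locations` key in the merged response, and the choice to skip unresolved parts are assumptions. `resolve_one` stands in for steps 5 through 8.

```python
import re
from typing import Callable, List, Optional

# Word separators require surrounding whitespace, so "Orlando" is not split on "or".
_SEPARATORS = re.compile(r"\s+(?:or|and|&)\s+|\s*/\s*", re.IGNORECASE)

def split_compound(cleaned: str) -> List[str]:
    parts = [p.strip() for p in _SEPARATORS.split(cleaned)]
    return [p for p in parts if p]

def resolve_compound(cleaned: str, resolve_one: Callable[[str], Optional[dict]]) -> dict:
    # Resolve each part independently and keep only the parts that resolved.
    resolved = [resolve_one(part) for part in split_compound(cleaned)]
    locations = [r for r in resolved if r is not None]
    return {"succeeded": bool(locations), "locations": locations}
```

A splitter this simple would also break up names such as "Trinidad and Tobago", so a real implementation presumably guards known multi-word names before splitting.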

5. Pattern Normalize

Abbreviations are expanded (e.g., St. → Saint, Ft. → Fort), diacritics are processed, and the input is split into semantic parts (city, region, country) for rule-based matching.
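
As a small illustration of this step, the sketch below expands the two abbreviations mentioned in the text and splits on commas. The table holds only those two entries, the function names are hypothetical, and treating commas as the only part delimiter is an assumption.

```python
from typing import List

# The two examples from the text; a real table would hold many more entries.
ABBREVIATIONS = {"st.": "Saint", "ft.": "Fort"}

def expand_abbreviations(text: str) -> str:
    # Replace whole whitespace-delimited tokens, matched case-insensitively.
    return " ".join(ABBREVIATIONS.get(word.lower(), word) for word in text.split())

def split_parts(text: str) -> List[str]:
    expanded = expand_abbreviations(text)
    return [part.strip() for part in expanded.split(",") if part.strip()]
```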

6. Pattern Parse

The normalized parts are run through a chain of 11 pattern matchers in priority order: Country-first → US → Canadian → Australian → UK constituent → Irish county → State-first → International → Country alias → Multi-part fallback → Single part. Each matcher checks against the built-in reference data. A confidence score is calculated, and results with ≥ 50% confidence are accepted.

7. Photon Geocoder

If pattern parsing produced a confident result but lacks coordinates, Photon enriches it with lat/lng. If pattern parsing failed entirely, Photon is used as the primary resolver with the full cleaned string.

8. LLM Fallback

For inputs that all deterministic methods failed to resolve, an LLM call extracts structured city/region/country data from the raw string. The LLM response is validated and the country name is resolved against reference data.

9. Cache Write

Successful results are written back to MongoDB so that subsequent requests for the same (or equivalently normalized) input resolve instantly from cache.

Built-in Reference Data

The Pattern Parse tier operates against a comprehensive inventory of built-in reference data, compiled from real-world job posting data. All lookups use case-insensitive, zero-allocation frozen collections for maximum throughput.
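
The phrase "frozen collections" suggests read-only lookup tables built once at startup. A loose Python analogue, offered only as an illustration of case-insensitive read-only lookups and not as a description of the actual implementation, is a `MappingProxyType` over case-folded keys. The helper names are hypothetical and the alias table is a three-entry sample.

```python
from types import MappingProxyType

def frozen_lookup(entries: dict) -> MappingProxyType:
    # Case-fold keys once at build time so every lookup is case-insensitive.
    return MappingProxyType({key.casefold(): value for key, value in entries.items()})

CITY_ALIASES = frozen_lookup({
    "NYC": "New York City",
    "München": "Munich",
    "SF": "San Francisco",
})

def canonical_city(name: str) -> str:
    # Unknown names pass through unchanged.
    return CITY_ALIASES.get(name.strip().casefold(), name)
```

The proxy rejects writes, so the table cannot be modified by accident after startup.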

🇺🇸 US States & Territories

51 entries: all 50 states + District of Columbia. Both 2-letter abbreviations (CA, NY, TX) and full names (California, New York, Texas).

🇨🇦 Canadian Provinces

13 entries: all provinces and territories. Both abbreviations (ON, BC, QC) and full names (Ontario, British Columbia, Quebec).

🇦🇺 Australian States

8 entries: all states and territories. Both abbreviations (NSW, VIC, QLD) and full names (New South Wales, Victoria, Queensland).

🌍 ISO Alpha-2 Country Codes

~120 codes covering the countries commonly seen in job postings, from US and GB to KE and UZ. Includes common variants like UK, U.S., and U.S.A.

🌏 ISO Alpha-3 Country Codes

~80 three-letter codes such as DEU, GBR, ITA, SWE, and AUS, for formats like "ITA, Rome".

🏙️ City Aliases & Transliterations

50+ aliases that map common abbreviations and non-English names to canonical forms: NYC → New York City, München → Munich, SF → San Francisco, CDMX → Mexico City, Bombay → Mumbai.

🗺️ Well-known Cities

200+ cities: globally unambiguous major cities with pre-mapped country and region. Covers US, Canada, Europe, Asia, Middle East, Africa, and Latin America. Used for single-city-name resolution (e.g., "London" → United Kingdom).

📍 Region Aliases

15 aliases that map corporate region codes to geographic areas: EMEA → Europe, APAC → Asia Pacific, LATAM → Latin America, DACH → Europe, BENELUX → Europe, GCC → Middle East, ANZ → Asia Pacific, NORDICS → Europe.
Additional reference data:
  • German states (Bundesländer): 24 entries (both German and English names) for "City, Bavaria, Germany" patterns
  • Indian states: 30 entries for "City, Maharashtra, India" patterns
  • UK constituent countries: England, Scotland, Wales, Northern Ireland
  • Country name aliases: 70+ variant spellings, old names, and non-English country names (Deutschland → Germany, Czechia → Czech Republic, 日本 → Japan, भारत → India)
  • Ambiguous code registry: 9 codes that collide between US state or Canadian province abbreviations and country codes (CA, IN, GA, DE, CO, AL, ME, NL, SK) with context-aware disambiguation
  • Noise keywords: 35+ words stripped during cleaning (hybrid, onsite, office, headquarters, metro, greater, etc.)
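
The text does not say how the context-aware disambiguation works. One plausible heuristic, shown here purely as an assumption, is to prefer the country reading when the accompanying city is a well-known city of that country, and the state reading otherwise. All names and table contents below are illustrative.

```python
from typing import Optional

# Illustrative subsets: code -> (US state reading, country reading).
AMBIGUOUS = {
    "CA": ("California", "Canada"),
    "DE": ("Delaware", "Germany"),
    "IN": ("Indiana", "India"),
}
WELL_KNOWN_CITY_COUNTRY = {"toronto": "Canada", "berlin": "Germany", "mumbai": "India"}

def disambiguate(city: Optional[str], code: str) -> dict:
    """Resolve an ambiguous code; only codes listed in AMBIGUOUS are supported."""
    state, country = AMBIGUOUS[code.upper()]
    known_country = WELL_KNOWN_CITY_COUNTRY.get((city or "").lower())
    if known_country == country:
        # The city is known to lie in the country the code could denote.
        return {"city": city, "country": country}
    # Assumed default: read the code as a US state.
    return {"city": city, "region": state, "country": "United States"}
```

Under this heuristic "Toronto, CA" resolves to Canada while "San Francisco, CA" resolves to California.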

Response method Field

The method field in every response tells you which tier resolved the location:

| Method | Tier | Typical Latency | Description |
| --- | --- | --- | --- |
| cache | 1 | < 1ms | Previously resolved, served from MongoDB |
| remote_keyword | 2 | < 1ms | Detected as remote/virtual keyword |
| pattern_parse | 3 | < 5ms + enrichment | Rule-based match against reference data |
| geocoder | 4 | 50–200ms | Resolved by Photon (OpenStreetMap) |
| llm | 5 | 500–2000ms | AI-parsed by LLM via OpenRouter |
| no_match | n/a | n/a | All tiers failed (returned as error) |

Performance note: In production, the vast majority of locations resolve via cache (< 1ms) or pattern parse (< 5ms for the match itself, plus Photon enrichment when coordinates are needed). The Photon geocoder adds ~50–200ms for coordinate enrichment or full resolution. The LLM fallback is the slowest tier (~500–2000ms) but handles the most ambiguous inputs. Because successful results are cached, repeat requests for the same (or equivalently normalized) input are served from the cache tier.