Pipeline Overview
The Geocode endpoint processes location strings through a 5-tier pipeline. Tiers are tried in order, and processing stops at the first tier that produces a confident result.
Tier Details
1. Cache
Previously resolved locations are stored in MongoDB, keyed by a normalized form of the input string. This provides sub-millisecond responses for repeated queries.
- Method returned: `cache`
- When used: The same (or equivalently normalized) input was recently resolved
- Performance: < 1ms
- Key generation: Input is lowercased, diacritics stripped, whitespace collapsed, and punctuation removed to produce a stable cache key, so `"San Francisco, CA"`, `"san francisco ca"`, and `"SAN FRANCISCO, CA"` all hit the same cache entry.
2. Remote Keyword
Detects remote/virtual/distributed keywords before any geocoding is attempted. Returns a successful result with `country: "Remote"` and no coordinates, since the job is location-independent.
- Method returned: `remote_keyword`
- Keywords detected: `Remote`, `Work from Home`, `Virtual`, `Distributed`, `Anywhere`, `WFH`, `Telecommute`, `Telework`, `Work-from-home`
- Returns: `succeeded: true` with `country: "Remote"`, no coordinates
- Performance: < 1ms
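The keyword check described above could look roughly like the following. This is an illustrative Python sketch, not the service's implementation: the function name `detect_remote`, the whole-word case-insensitive matching rule, and the shape of the returned dict are assumptions. Only the keyword list and the `succeeded`, `country`, and `remote_keyword` values come from the text above.

```python
import re
from typing import Optional

# Keyword list taken from the documentation above.
REMOTE_KEYWORDS = [
    "Remote", "Work from Home", "Virtual", "Distributed", "Anywhere",
    "WFH", "Telecommute", "Telework", "Work-from-home",
]

# Assumption: a keyword counts when it appears as a whole word, in any case.
_REMOTE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in REMOTE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def detect_remote(raw: str) -> Optional[dict]:
    """Return a remote_keyword result, or None to fall through to the next tier."""
    if _REMOTE_RE.search(raw):
        # No coordinate fields are included: the job is location-independent.
        return {"succeeded": True, "method": "remote_keyword", "country": "Remote"}
    return None
```

The real check may be stricter than this, for example by requiring that the whole string is remote-only rather than accepting any keyword occurrence.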
Placeholder strings like `"2 Locations"` or `"Multiple Locations"` are also
detected early and returned as failures with a descriptive error, rather than
being sent through the geocoding pipeline.
3. Pattern Parse
Deterministic rule-based parser using pattern matching against built-in reference data. Handles standardized formats without any external HTTP calls. Results that pass a minimum confidence threshold (50%) are enriched with coordinates via the Photon geocoder.
- Method returned: `pattern_parse`
- Performance: < 5ms (pattern match) + 50–200ms (coordinate enrichment via Photon)
- Confidence scoring: Each parsed result is assigned a confidence score. Only results ≥ 50% confidence are accepted.
- Parsing order: Country-first → US format → Canadian → Australian → UK constituent → Irish county → State-first → International → Country alias → Multi-part fallback → Single part
| Format | Examples |
|---|---|
| US: City, State | San Francisco, CA · Austin, Texas |
| US: State abbreviation alone | CA · NY · TX |
| Canada: City, Province | Toronto, ON · Vancouver, BC |
| Australia: City, State | Sydney, NSW · Melbourne, VIC |
| UK: City, Constituent Country | London, England · Edinburgh, Scotland |
| Ireland: County format | County Cork · Dublin, Ireland |
| Country-first | Germany, Berlin · ITA, Rome |
| ISO alpha-2 / alpha-3 codes | US · GB · DEU · ITA · SWE |
| Well-known cities (standalone) | London · Tokyo · Dubai · Mumbai |
| City aliases & transliterations | München → Munich · NYC → New York City |
| Region aliases | EMEA · APAC · LATAM · DACH |
| Compound expressions | US or Canada · New York and San Francisco |
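To make the table concrete, here is a minimal Python sketch of what one matcher, the "US: City, State" format, might look like. Everything here is hypothetical: the function name, the three-state excerpt standing in for the built-in state table, and the result shape are assumptions for illustration only.

```python
from typing import Optional

# Tiny excerpt standing in for the built-in US state table (assumption).
US_STATE_ABBREVIATIONS = {"ca": "California", "ny": "New York", "tx": "Texas"}
US_STATE_NAMES = {name.lower(): name for name in US_STATE_ABBREVIATIONS.values()}


def parse_us_city_state(text: str) -> Optional[dict]:
    """Match 'City, ST' or 'City, State Name'; return None if the format does not fit."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 2 or not all(parts):
        return None
    city, state = parts
    key = state.lower()
    # Accept either the 2-letter abbreviation or the full state name.
    region = US_STATE_ABBREVIATIONS.get(key) or US_STATE_NAMES.get(key)
    if region is None:
        return None
    return {"city": city, "region": region, "country": "United States"}
```

Under these assumptions, `"San Francisco, CA"` and `"Austin, Texas"` both match, while `"Toronto, ON"` returns `None` and would be left to a later matcher.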
4. Photon Geocoder
OpenStreetMap-backed geocoding service for locations that don't match any deterministic pattern. Provides worldwide coverage with street-level accuracy for most cities.
- Method returned: `geocoder`
- Coverage: Worldwide (OpenStreetMap data)
- Performance: 50–200ms
- Accuracy: Street-level for most cities, region-level for less common locations
- When used: Pattern Parse either failed entirely or produced a result below the confidence threshold
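For orientation, a Photon lookup can be sketched in Python as below. The text does not say which Photon instance the service uses, so the public endpoint `https://photon.komoot.io/api/` is an assumption, as are the helper names and the `latitude`/`longitude` field names. Photon responds with a GeoJSON `FeatureCollection`, whose coordinates are ordered longitude first.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Photon instance; the service may well self-host.
PHOTON_URL = "https://photon.komoot.io/api/"


def photon_url(query: str, limit: int = 1) -> str:
    """Build the search URL for a free-text location query."""
    return PHOTON_URL + "?" + urlencode({"q": query, "limit": limit})


def first_hit(geojson: dict) -> Optional[dict]:
    """Extract the top feature from a Photon GeoJSON response, if any."""
    features = geojson.get("features") or []
    if not features:
        return None
    lon, lat = features[0]["geometry"]["coordinates"][:2]  # GeoJSON order is [lon, lat]
    props = features[0].get("properties", {})
    return {
        "method": "geocoder",
        "latitude": lat,
        "longitude": lon,
        "name": props.get("name"),
        "country": props.get("country"),
    }


def geocode(query: str, timeout: float = 2.0) -> Optional[dict]:
    """Perform the HTTP request; returns None when Photon finds nothing."""
    with urlopen(photon_url(query), timeout=timeout) as resp:
        return first_hit(json.load(resp))
```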
5. LLM Fallback
AI-powered parsing via the OpenRouter API with structured JSON output, for highly ambiguous or complex location strings that all previous tiers failed to resolve. Uses retries with increasing backoff for reliability.
- Method returned: `llm`
- Use cases: Unusual abbreviations, creative misspellings, multi-language input, exotic formats
- Performance: 500–2000ms
- Retries: Up to 3 attempts, with a backoff of 300ms × attempt number
- Output: Structured city/region/country extraction with display name generation
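The retry policy in the list above can be sketched in Python as follows. The helper name `call_with_retries` and its parameters are hypothetical, and the actual OpenRouter request is left to the caller as `call`. The delay follows the stated "300ms × attempt" rule, which grows linearly with the attempt number (300ms, then 600ms).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on any exception with a delay of base_delay_s * attempt."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:  # a real client would catch narrower error types
            last_error = exc
            if attempt < max_attempts:
                sleep(base_delay_s * attempt)  # 0.3s after attempt 1, 0.6s after attempt 2
    raise RuntimeError(f"call failed after {max_attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.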
Internal Processing Steps
Under the hood, the 5 public tiers are implemented as a 9-step internal pipeline. This section documents the full processing flow for each location string:
Cache Lookup
The input is normalized into a stable cache key (lowercased, diacritics
stripped, whitespace collapsed, punctuation removed). MongoDB is queried for a
previously stored result. If found, return immediately.
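A minimal Python sketch of such a key normalizer is shown below. The function name `cache_key` is hypothetical, and the exact rules are an assumption based on the description of key generation in the Cache tier: lowercasing, stripping diacritics, removing punctuation, and collapsing whitespace.

```python
import re
import unicodedata


def cache_key(raw: str) -> str:
    """Build a stable cache key from a raw location string (illustrative)."""
    # Lowercase, then decompose accented characters so the accents can be dropped.
    text = unicodedata.normalize("NFKD", raw.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove punctuation: anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())
```

With this sketch, `"San Francisco, CA"`, `"san francisco ca"`, and `"SAN FRANCISCO, CA"` all normalize to `san francisco ca`, so they would share one cache entry.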
Remote Keyword Detect
The raw input is checked against a set of known remote/virtual keywords
(`IsFullyRemote`). If matched, return a `remote_keyword` result.
Clean & Normalize
The input is cleaned by stripping noise words (`hybrid`, `onsite`, `office`,
`headquarters`, `greater`, `metro`, etc.) and removing street addresses, suite
numbers, postal codes, and other non-geographic tokens. Abbreviations are
expanded and diacritics are handled.
Compound Split
The cleaned string is checked for compound separators: `" or "`, `" and "`,
`" & "`, `"/"`. If found, the string is split into independent parts, and
each part is resolved separately through steps 5–8. Results are merged into
a single response with multiple locations.
Pattern Normalize
Abbreviations are expanded (e.g., `St.` → `Saint`, `Ft.` → `Fort`),
diacritics are processed, and the input is split into semantic parts (city,
region, country) for rule-based matching.
Pattern Parse
The normalized parts are run through a chain of 11 pattern matchers in
priority order: Country-first → US → Canadian → Australian → UK constituent
→ Irish county → State-first → International → Country alias → Multi-part
fallback → Single part. Each matcher checks against the built-in reference
data. A confidence score is calculated, and results ≥ 50% are accepted.
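The matcher chain can be pictured with the Python sketch below. It is a sketch under stated assumptions: the text does not say whether a low-confidence match stops the chain or falls through, so this version assumes it falls through to the next matcher. The two toy matchers, their reference data, and their confidence values are invented for illustration.

```python
from typing import Callable, List, Optional, Tuple

MIN_CONFIDENCE = 0.50  # the 50% threshold from the text

Matcher = Callable[[List[str]], Optional[dict]]


def country_first(parts: List[str]) -> Optional[dict]:
    """Toy matcher for 'Country, City' (hypothetical reference data and score)."""
    countries = {"germany": "Germany", "ita": "Italy"}
    if len(parts) == 2 and parts[0].lower() in countries:
        return {"city": parts[1], "country": countries[parts[0].lower()], "confidence": 0.9}
    return None


def single_part(parts: List[str]) -> Optional[dict]:
    """Toy last-resort matcher that only produces a low-confidence guess."""
    if len(parts) == 1:
        return {"city": parts[0], "country": None, "confidence": 0.3}
    return None


def run_matchers(parts: List[str], chain: List[Tuple[str, Matcher]]) -> Optional[dict]:
    """Try matchers in priority order; accept the first result at or above the threshold."""
    for name, matcher in chain:
        result = matcher(parts)
        if result is not None and result["confidence"] >= MIN_CONFIDENCE:
            return {**result, "matcher": name}
    return None  # nothing confident enough: the pipeline moves on to Photon


CHAIN = [("country_first", country_first), ("single_part", single_part)]
```

Here `run_matchers(["Germany", "Berlin"], CHAIN)` is accepted by the first matcher, while `run_matchers(["Springfield"], CHAIN)` returns `None` because the only match scores below the threshold.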
Photon Geocoder
If pattern parsing produced a confident result but lacks coordinates, Photon
enriches it with lat/lng. If pattern parsing failed entirely, Photon is used
as the primary resolver with the full cleaned string.
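The two roles Photon plays in this step can be sketched in Python as follows. The function name `photon_step`, the `latitude`/`longitude` field names, and the injected `geocode` callable are assumptions. The `method` values follow the documented behavior: an enriched pattern result stays `pattern_parse`, while a primary Photon resolution reports `geocoder`.

```python
from typing import Callable, Optional

Geocode = Callable[[str], Optional[dict]]


def photon_step(parsed: Optional[dict], cleaned: str, geocode: Geocode) -> Optional[dict]:
    """Enrich a confident pattern result, or resolve the cleaned string directly."""
    if parsed is not None:
        if parsed.get("latitude") is None:
            # Enrichment: query Photon with the parts the parser already identified.
            fields = (parsed.get("city"), parsed.get("region"), parsed.get("country"))
            hit = geocode(", ".join(f for f in fields if f))
            if hit is not None:
                parsed = {**parsed, "latitude": hit["latitude"], "longitude": hit["longitude"]}
        return {**parsed, "method": "pattern_parse"}
    # Primary resolver: pattern parsing failed, so send the full cleaned string.
    hit = geocode(cleaned)
    return {**hit, "method": "geocoder"} if hit is not None else None
```

Returning `None` from the last branch signals that the input should continue to the LLM fallback.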
LLM Fallback
For inputs that all deterministic methods failed to resolve, an LLM call
extracts structured city/region/country data from the raw string. The LLM
response is validated and the country name is resolved against reference
data.
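Looking back at the two preprocessing steps in this section, Clean & Normalize and Compound Split, a simplified Python sketch might look like this. It is deliberately partial: only noise-word stripping is shown (not address, suite, or postal code removal), the noise list is the six-word excerpt quoted above, and the function names are hypothetical. The separators are matched case-sensitively, an assumption made so that the state code `OR` is not mistaken for the word "or".

```python
import re
from typing import List

# Excerpt of the noise-word list quoted in the text (the full list has 35+ entries).
NOISE_WORDS = {"hybrid", "onsite", "office", "headquarters", "greater", "metro"}

# Separators from the text: " or ", " and ", " & ", "/".
_SEPARATORS = re.compile(r"\s+(?:or|and|&)\s+|/")


def clean(raw: str) -> str:
    """Drop noise words, ignoring surrounding punctuation when comparing."""
    kept = [w for w in raw.split() if w.strip("(),.-").lower() not in NOISE_WORDS]
    # Trim separators left dangling at either end after words were removed.
    return " ".join(kept).strip(" ,-")


def split_compound(cleaned: str) -> List[str]:
    """Split on compound separators; a string without any yields a single part."""
    return [p.strip() for p in _SEPARATORS.split(cleaned) if p.strip()]
```

For example, `clean("Austin, TX (Hybrid)")` gives `Austin, TX`, and `split_compound("Portland, OR or Seattle, WA")` gives two parts that would each be resolved separately.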
Built-in Reference Data
The Pattern Parse tier operates against a comprehensive inventory of built-in reference data, compiled from real-world job posting data. All lookups use case-insensitive, zero-allocation frozen collections for maximum throughput.
US States & Territories
51 entries: all 50 states plus the District of Columbia. Both 2-letter
abbreviations (`CA`, `NY`, `TX`) and full names (`California`, `New York`,
`Texas`).
Canadian Provinces
13 entries: all provinces and territories. Both abbreviations (`ON`, `BC`,
`QC`) and full names (`Ontario`, `British Columbia`, `Quebec`).
Australian States
8 entries: all states and territories. Both abbreviations (`NSW`, `VIC`,
`QLD`) and full names (`New South Wales`, `Victoria`, `Queensland`).
ISO Alpha-2 Country Codes
~120 codes: comprehensive coverage of countries commonly seen in job
postings, from `US` and `GB` to `KE` and `UZ`. Includes common variants like
`UK`, `U.S.`, `U.S.A.`.
ISO Alpha-3 Country Codes
~80 codes: three-letter codes like `DEU`, `GBR`, `ITA`, `SWE`, `AUS`
for formats like `"ITA, Rome"`.
City Aliases & Transliterations
50+ aliases: maps common abbreviations and non-English names to
canonical forms: `NYC` → New York City, `München` → Munich, `SF` → San
Francisco, `CDMX` → Mexico City, `Bombay` → Mumbai.
Well-known Cities
200+ cities: globally unambiguous major cities with pre-mapped country
and region. Covers US, Canada, Europe, Asia, Middle East, Africa, and Latin
America. Used for single-city-name resolution (e.g., `"London"` → United
Kingdom).
Region Aliases
15 aliases mapping corporate region codes to geographic areas: `EMEA` →
Europe, `APAC` → Asia Pacific, `LATAM` → Latin America, `DACH` → Europe,
`BENELUX` → Europe, `GCC` → Middle East, `ANZ` → Asia Pacific, `NORDICS` →
Europe.
- German states (Bundesländer): 24 entries (both German and English names) for `"City, Bavaria, Germany"` patterns
- Indian states: 30 entries for `"City, Maharashtra, India"` patterns
- UK constituent countries: England, Scotland, Wales, Northern Ireland
- Country name aliases: 70+ variant spellings, old names, and non-English country names (`Deutschland` → Germany, `Czechia` → Czech Republic, `日本` → Japan, `भारत` → India)
- Ambiguous code registry: 9 codes that collide between US state or Canadian province abbreviations and country codes (`CA`, `IN`, `GA`, `DE`, `CO`, `AL`, `ME`, `NL`, `SK`) with context-aware disambiguation
- Noise keywords: 35+ words stripped during cleaning (`hybrid`, `onsite`, `office`, `headquarters`, `metro`, `greater`, etc.)
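As a rough illustration of case-insensitive frozen lookups and the ambiguous code registry, consider the Python sketch below. The disambiguation rule is purely hypothetical, since the text only says the registry is "context-aware": here a colliding code is read as a country when it is the first part of the string (as in the country-first format) and as a US state otherwise. The tables are small excerpts and all names are invented for this example.

```python
from types import MappingProxyType
from typing import Optional, Tuple


def frozen_ci(entries: dict) -> MappingProxyType:
    """Read-only mapping with case-folded keys, a stand-in for a frozen collection."""
    return MappingProxyType({k.casefold(): v for k, v in entries.items()})


# Small excerpts of the reference tables (assumption).
US_STATES = frozen_ci({"CA": "California", "IN": "Indiana", "DE": "Delaware", "TX": "Texas"})
COUNTRY_CODES = frozen_ci({"CA": "Canada", "IN": "India", "DE": "Germany", "SE": "Sweden"})

# Codes present in both tables need context to resolve.
AMBIGUOUS = frozenset(US_STATES) & frozenset(COUNTRY_CODES)


def resolve_code(code: str, is_first_part: bool) -> Optional[Tuple[str, str]]:
    """Return ('country' | 'region', name), or None for an unknown code."""
    key = code.casefold()
    if key in AMBIGUOUS:
        # Hypothetical rule: leading position means country, otherwise US state.
        return ("country", COUNTRY_CODES[key]) if is_first_part else ("region", US_STATES[key])
    if key in COUNTRY_CODES:
        return ("country", COUNTRY_CODES[key])
    if key in US_STATES:
        return ("region", US_STATES[key])
    return None
```

Under this rule, the `CA` in `"San Francisco, CA"` resolves to California, while the `CA` in `"CA, Toronto"` resolves to Canada.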
Response method Field
The method field in every response tells you which tier resolved the location:
| Method | Tier | Typical Latency | Description |
|---|---|---|---|
| `cache` | 1 | < 1ms | Previously resolved, served from MongoDB |
| `remote_keyword` | 2 | < 1ms | Detected as remote/virtual keyword |
| `pattern_parse` | 3 | < 5ms + enrichment | Rule-based match against reference data |
| `geocoder` | 4 | 50–200ms | Resolved by Photon (OpenStreetMap) |
| `llm` | 5 | 500–2000ms | AI-parsed by LLM via OpenRouter |
| `no_match` | n/a | n/a | All tiers failed (returned as error) |
Performance note: In production, the vast majority of locations resolve
via cache or pattern parse (sub-millisecond to ~5ms, before any coordinate
enrichment). The Photon geocoder adds ~50–200ms for coordinate enrichment or
full resolution. The LLM fallback is the slowest tier (~500–2000ms) but
handles the most ambiguous inputs. Because all successful results are cached,
subsequent requests for the same input are served from the cache in under 1ms.
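How the `method` value falls out of the tier ordering can be sketched in Python as follows. This is an assumption-laden outline rather than the service's code: the `resolve` function, the in-memory dict standing in for MongoDB, and the simplified key normalization are all hypothetical. The tier order and the `method` strings come from the table above.

```python
from typing import Callable, List, Optional, Tuple

Tier = Callable[[str], Optional[dict]]


def resolve(raw: str, tiers: List[Tuple[str, Tier]], cache: dict) -> dict:
    """Try the cache, then each tier in order; the first non-None result wins."""
    key = " ".join(raw.lower().split())  # simplified stand-in for the real cache key
    if key in cache:
        return {**cache[key], "method": "cache"}
    for method, tier in tiers:
        result = tier(raw)
        if result is not None:
            cache[key] = result  # successful results are stored for later requests
            return {**result, "method": method}
    return {"succeeded": False, "method": "no_match"}
```

A caller would pass the tiers as an ordered list such as `[("remote_keyword", ...), ("pattern_parse", ...), ("geocoder", ...), ("llm", ...)]`, so a repeated input reports `cache` on the second request.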

