Discovery Engine v6.1 Docs

The Business Discovery Engine is an open-source Node.js tool that discovers businesses from 4 public sources, enriches them with emails, social profiles, and company intelligence, and pushes everything to Google Sheets in real time. v6.1 adds round-robin discovery, inline enrichment, and crash recovery.

v6.1 key change: Data flows into your Sheet continuously as each chunk completes. No more waiting hours for all discovery to finish first. Discover → Enrich → Push → Rotate.

Quick Start

1. Install

git clone https://github.com/itallstartedwithaidea/business-discovery-engine.git
cd business-discovery-engine
npm install

2. Configure

cp .env.example .env
# Edit .env — add GOOGLE_SPREADSHEET_ID and GOOGLE_CREDENTIALS_PATH

3. Run

# All 5 states, round-robin, 1000/category
node engine.js start --state ALL --fresh

# Background run (safe to close terminal)
nohup node engine.js start --state ALL --fresh > full-run.log 2>&1 &
tail -f full-run.log

# Single state
node engine.js start --state AZ --fresh

# Custom limits
node engine.js start --state ALL --max 500 --chunk 100 --fresh

# Specific categories
node engine.js start --state OH --categories "plumber,dentist"

Google Sheets Setup

The engine requires a Google Cloud service account with Sheets API access.

  1. Create a Google Cloud service account
  2. Enable the Google Sheets API in your project
  3. Download the JSON key file → save as google-credentials.json in the project root
  4. Create a new Google Sheet and copy the spreadsheet ID from the URL
  5. Share the Sheet with the service account email (Editor access)
  6. Add the spreadsheet ID and credentials path to your .env file
Never commit credentials: the .gitignore excludes .env and google-credentials.json by default, so your service-account key stays out of version control.
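The service-account flow above can be sketched with the googleapis npm package. This is a minimal sketch, not the engine's own code: the helper names are illustrative, though the 10-rows-per-call batching matches the docs.

```javascript
// Split rows into batches of 10, matching the "10 rows per API call" push.
function batchRows(rows, size = 10) {
  const batches = [];
  for (let i = 0; i < rows.length; i += size) {
    batches.push(rows.slice(i, i + size));
  }
  return batches;
}

// Append enriched rows to a state's tab using the service-account key.
async function pushToTab(spreadsheetId, tab, rows) {
  const { google } = require('googleapis'); // npm install googleapis
  const auth = new google.auth.GoogleAuth({
    keyFile: process.env.GOOGLE_CREDENTIALS_PATH, // e.g. google-credentials.json
    scopes: ['https://www.googleapis.com/auth/spreadsheets'],
  });
  const sheets = google.sheets({ version: 'v4', auth });
  for (const batch of batchRows(rows)) {
    await sheets.spreadsheets.values.append({
      spreadsheetId,
      range: `${tab}!A1`, // each state gets its own tab
      valueInputOption: 'RAW',
      requestBody: { values: batch },
    });
  }
}
```

The append call targets `Tab!A1`; the Sheets API finds the first empty row in that tab and writes below it.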

Architecture: Round-Robin + Inline Enrichment

v6.1 The engine processes categories one at a time, discovering a chunk of businesses, enriching them immediately, and pushing to Sheets before rotating to the next category.

For each category (65 total):

  1. DISCOVER 250 businesses (configurable with --chunk)
     — Shuffle: states × sources × cities
     — YP in Phoenix → Yelp in Columbus → BBB in Boise → GMaps in Vegas
     — Global dedup on every insert

  2. FIND WEBSITES (DuckDuckGo + Google fallback)

  3. ENRICH each business
     — Visit website → crawl contact/about/team pages
     — 9 email extraction methods
     — Social media link extraction
     — WHOIS → domain age → business age scoring
     — Email pattern inference → MX verification

  4. PUSH TO SHEETS (immediately)
     — Batch of 10 rows per API call
     — Each state gets its own tab
     — Dashboard updates every 30 seconds

  → Save checkpoint → rotate to next category → repeat
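The rotation above can be sketched as a small scheduler. Dependencies are injected as `ops` for clarity; function names like `discoverChunk` are assumptions for illustration, not the engine's actual API.

```javascript
// One pass of the round-robin scheduler: discover a chunk, enrich it, push
// it, checkpoint, then rotate. A category leaves the rotation once it hits
// `max` or its sources run dry.
async function runRoundRobin(categories, ops, { chunk = 250, max = 1000 } = {}) {
  const discovered = Object.fromEntries(categories.map((c) => [c, 0]));
  let active = categories.slice();
  while (active.length > 0) {
    for (const category of active.slice()) {
      const take = Math.min(chunk, max - discovered[category]);
      const businesses = await ops.discoverChunk(category, take); // 1. DISCOVER
      const enriched = await ops.enrichAll(businesses);           // 2-3. ENRICH
      await ops.pushToSheets(enriched);                           // 4. PUSH
      discovered[category] += businesses.length;
      await ops.saveCheckpoint({ category, discovered });         // crash recovery
      if (discovered[category] >= max || businesses.length === 0) {
        active = active.filter((c) => c !== category);            // category done
      }
    }
  }
  return discovered;
}
```

Because the checkpoint is written after every chunk, a crash loses at most one chunk of work per category.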

Key differences from v5

4 Discovery Sources

All sources use Puppeteer with stealth plugin for JavaScript rendering, anti-bot evasion, and human-like behavior.

Source        Method                      Coverage                     Notes
Yellow Pages  Puppeteer + Stealth         3 pages per category/city    Extracts name, phone, address, website
Yelp          Puppeteer + anti-detection  Auto-skips after 5 blocks    Longer delays, stealth scrolling, retry logic
BBB           Puppeteer (React SPA)       Top 15 categories, 3 cities  Headless Chrome renders React app
Google Maps   Puppeteer                   Top 20 categories, 5 cities  Aria-label extraction, /maps/search/ URLs

Anti-detection features

Deduplication

Entity resolution runs on every insert using multiple matching strategies; records from multiple sources are merged into a single business entry with combined source attribution.
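A simplified version of the insert-time matching might look like this. The key fields and the legal-suffix list are illustrative assumptions, not the engine's exact strategy set.

```javascript
// Build a normalized key from name + phone + city.
function dedupKey(biz) {
  const name = (biz.name || '')
    .toLowerCase()
    .replace(/\b(llc|inc|co|corp|ltd)\b\.?/g, '') // drop legal suffixes
    .replace(/[^a-z0-9]/g, '');                   // strip punctuation/spaces
  const phone = (biz.phone || '').replace(/\D/g, '').slice(-10); // last 10 digits
  return `${name}|${phone}|${(biz.city || '').toLowerCase()}`;
}

// Insert a record, or merge it into an existing entry with the same key.
function mergeOrInsert(index, biz) {
  const key = dedupKey(biz);
  const existing = index.get(key);
  if (existing) {
    existing.sources = [...new Set([...existing.sources, ...biz.sources])];
    existing.website = existing.website || biz.website; // keep first non-empty
  } else {
    index.set(key, { ...biz });
  }
  return index;
}
```

With this key, "Joe's Plumbing LLC" from Yellow Pages and "Joes Plumbing" from Yelp collapse into one entry with both sources attributed.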

Enrichment Pipeline

v6.1 Enrichment runs inline after each discovery chunk, not as a separate phase.

For each discovered business with a website:

  1. Visit the website — crawl homepage + up to 15 subpages (contact, about, team pages prioritized)
  2. Extract emails — 9 methods applied in sequence
  3. Extract social media — Facebook, Instagram, LinkedIn, Twitter/X links
  4. WHOIS lookup — domain registration date, registrant info, business age scoring
  5. Email pattern inference — detect patterns from known contacts, generate for staff without emails
  6. MX verification — DNS lookup to validate mail servers exist for every email domain

For businesses without websites, the engine searches DuckDuckGo and Google as fallback.

9 Email Extraction Methods

Applied in sequence to every crawled page:

#  Method                    Description
1  JSON-LD / Schema.org      Parses structured data for contact information
2  Mailto links              Extracts from href="mailto:" patterns in HTML
3  Regex on page text        Comprehensive email pattern matching across full page
4  Meta tags                 Checks meta author, contact, and other tags
5  VCard / hCard             Parses microformat contact cards
6  Staff directory parsing   Finds name + title + email from team/about pages
7  WHOIS/RDAP                Registrant contact data from domain records
8  Email pattern inference   Detects patterns from known contacts, generates for others
9  MX verification           DNS lookup validates mail servers exist for every domain
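Methods 2 and 3 can be sketched as below. This is a simplified illustration; the engine's actual regexes and the other seven methods are more involved.

```javascript
// Method 3's pattern: a conservative, case-insensitive email matcher.
const EMAIL_RE = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi;

function extractEmails(html) {
  const found = new Set();
  // Method 2: pull addresses out of href="mailto:..." attributes.
  for (const m of html.matchAll(/href=["']mailto:([^"'?]+)/gi)) {
    found.add(m[1].trim().toLowerCase());
  }
  // Method 3: regex over the full page text.
  for (const m of html.matchAll(EMAIL_RE)) {
    found.add(m[0].toLowerCase());
  }
  // Drop common false positives such as image filenames (logo@2x.png).
  return [...found].filter((e) => !/\.(png|jpe?g|gif|svg|webp)$/.test(e));
}
```

Note the mailto capture stops at `?` so query strings like `?subject=...` are not treated as part of the address.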

Email Pattern Inference

When the engine finds at least one verified email for a domain, it detects the naming pattern and generates likely emails for any staff members found without them.

Detected patterns

Inferred emails are marked with inferred confidence and still undergo MX verification before being included.
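One way to sketch the detect-and-generate step. The four pattern formats here are assumptions for illustration; the engine may recognize more.

```javascript
// Candidate local-part formats, keyed by a pattern name.
const PATTERNS = {
  'first.last': (f, l) => `${f}.${l}`,
  'firstlast':  (f, l) => `${f}${l}`,
  'flast':      (f, l) => `${f[0]}${l}`,
  'first':      (f) => f,
};

// Given one verified email plus that contact's name, detect the pattern.
function detectPattern(email, firstName, lastName) {
  const local = email.split('@')[0].toLowerCase();
  const f = firstName.toLowerCase();
  const l = lastName.toLowerCase();
  for (const [name, fmt] of Object.entries(PATTERNS)) {
    if (fmt(f, l) === local) return name;
  }
  return null;
}

// Generate an inferred address for a staff member found without one.
function inferEmail(pattern, firstName, lastName, domain) {
  const fmt = PATTERNS[pattern];
  if (!fmt) return null;
  return `${fmt(firstName.toLowerCase(), lastName.toLowerCase())}@${domain}`;
}
```

So jane.doe@acme.com detects as first.last, and a staff member "John Smith" on the same site would be inferred as john.smith@acme.com, pending MX verification.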

MX Verification

Every extracted or inferred email is validated via DNS MX record lookup. The engine uses Node.js dns.resolveMx() to check that the email domain has active mail servers.

Confidence  Meaning
verified    Email extracted directly from website + MX records valid
inferred    Email generated via pattern inference + MX records valid
whois       Email from WHOIS/RDAP registrant data
no_mx       Email found but domain has no MX records — may be invalid

Social Media Extraction

Extracts social profiles from every crawled business website:

Platform   Detection Method                   Filters
Facebook   href scanning + HTML regex         Filters share/login/plugin URLs
Instagram  URL pattern matching               Filters explore/accounts/posts
LinkedIn   Company and personal profile URLs  Filters share/login pages
Twitter/X  twitter.com and x.com patterns     Filters share/intent links
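The match-then-filter step can be sketched as below. These per-platform rules are illustrative; the engine's real regexes may differ.

```javascript
// Accept profile-looking URLs, reject share/login/plugin noise.
const SOCIAL_RULES = {
  facebook:  { match: /\bfacebook\.com\/[^\/?#]+/i,              skip: /\/(sharer|share|login|plugins)\b/i },
  instagram: { match: /\binstagram\.com\/[^\/?#]+/i,             skip: /\/(explore|accounts|p)\//i },
  linkedin:  { match: /\blinkedin\.com\/(company|in)\/[^\/?#]+/i, skip: /\/(share|login)\b/i },
  twitter:   { match: /\b(?:twitter|x)\.com\/[^\/?#]+/i,         skip: /\/(share|intent)\b/i },
};

// Classify an href as a real profile link, or null for filtered noise.
function classifySocial(href) {
  for (const [platform, { match, skip }] of Object.entries(SOCIAL_RULES)) {
    if (match.test(href) && !skip.test(href)) return platform;
  }
  return null;
}
```

Both twitter.com and x.com map to the same platform, matching the table above.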

WHOIS Business Age Scoring

WHOIS/RDAP lookup reveals domain registration date. The engine calculates business age and applies labels:

Label        Age         Use Case
NEW          < 1 year    Recently started businesses — may need marketing services
NEW          1–2 years   Early stage businesses building their presence
Growing      2–5 years   Established enough to invest in growth
Established  5–10 years  Stable businesses with existing operations
Mature       10+ years   Long-standing businesses
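The label assignment reduces to a few threshold checks; a sketch mirroring the table above:

```javascript
// Convert a WHOIS registration date into the engine's age label.
function businessAgeLabel(registeredAt, now = new Date()) {
  const MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000;
  const years = (now - new Date(registeredAt)) / MS_PER_YEAR;
  if (years < 2) return 'NEW';          // covers both the < 1 year and 1–2 year rows
  if (years < 5) return 'Growing';
  if (years < 10) return 'Established';
  return 'Mature';
}
```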

5 States (59 Cities)

State Cities
AZ (15) Phoenix, Scottsdale, Tempe, Mesa, Chandler, Gilbert, Glendale, Peoria, Surprise, Tucson, Flagstaff, Yuma, Goodyear, Buckeye, Avondale
NV (10) Las Vegas, Henderson, Reno, North Las Vegas, Sparks, Carson City, Mesquite, Boulder City, Elko, Fernley
OH (12) Columbus, Cleveland, Cincinnati, Toledo, Akron, Dayton, Canton, Youngstown, Dublin, Westerville, Mason, Parma
ID (10) Boise, Meridian, Nampa, Caldwell, Idaho Falls, Pocatello, Twin Falls, Coeur d'Alene, Lewiston, Eagle
WA (12) Seattle, Spokane, Tacoma, Vancouver, Bellevue, Kent, Everett, Renton, Kirkland, Redmond, Olympia, Bellingham

Use --state ALL for round-robin across all states, or --state AZ for a single state. Use node engine.js states to list all states and cities.

65 Categories

Local Services (30)

plumber, electrician, dentist, restaurant, auto repair, salon, law firm, accountant, real estate agent, roofing, hvac, cleaning service, landscaping, insurance agent, veterinarian, fitness, photography, marketing agency, construction, mechanic, chiropractor, bakery, florist, pet grooming, daycare, tutoring, printing, tailor, locksmith, moving company

Retail & Ecommerce (10)

boutique, jewelry store, furniture store, sporting goods, pet store, gift shop, wine shop, supplement store, thrift store, consignment shop

Food & Beverage (5)

coffee shop, brewery, catering, food truck, juice bar

Health & Wellness (6)

med spa, dermatologist, physical therapy, optometrist, mental health counselor, massage therapist

Professional Services (6)

financial advisor, mortgage broker, staffing agency, IT services, web design, commercial cleaning

Home Services (8)

garage door, pest control, fence company, pool service, solar installer, window cleaning, tree service, pressure washing

Use node engine.js cats to list all 65 categories. Filter with --categories "plumber,dentist,hvac".

18-Column Output

Each state gets its own tab in Google Sheets with these columns:

Column        Description                        Source
First Name    Contact first name                 Website crawl
Last Name     Contact last name                  Website crawl
Email         Email address                      9 extraction methods + MX
Title         Job title (Owner, Manager, etc.)   Staff card parsing
Company Name  Business name                      Discovery sources
Location      City, State                        Discovery sources
Website       Business website URL               Discovery + DuckDuckGo fallback
Phone         Phone number                       Discovery sources
Facebook      Facebook page URL                  Website crawl
Instagram     Instagram profile URL              Website crawl
LinkedIn      LinkedIn profile URL               Website crawl
Twitter/X     X profile URL                      Website crawl
Source        Discovery source(s)                YP / Yelp / BBB / GMaps
Confidence    verified, inferred, whois, no_mx   Computed
Biz Age       NEW, Growing, Established, Mature  WHOIS/RDAP lookup
Year Founded  Domain registration year           WHOIS/RDAP lookup
Industry      Business category                  Input parameter
Date          Discovery timestamp                Auto-generated

Live Dashboard

The engine auto-creates a "Dashboard" tab in your Google Sheet and updates it every 30 seconds.

Confidence Scoring

Each email is assigned a confidence level based on how it was obtained:

Level     How Assigned
verified  Directly extracted from business website + MX records confirm working mail server
inferred  Generated via email pattern inference from known contacts + MX verified
whois     Extracted from WHOIS/RDAP domain registration data
no_mx     Email found but domain's DNS has no MX records — delivery uncertain
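As a sketch, the assignment collapses to a small decision function. Field names like `source` and `mxValid` are assumptions, not the engine's actual record shape.

```javascript
// Map how an email was obtained to a confidence level from the table above.
function assignConfidence({ source, mxValid }) {
  if (source === 'whois') return 'whois';  // registrant data, no MX gate applied
  if (!mxValid) return 'no_mx';            // domain publishes no mail servers
  return source === 'inferred' ? 'inferred' : 'verified';
}
```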

CLI Reference

# ── RUN ──
node engine.js start --state ALL                    # All states, round-robin
node engine.js start --state AZ                     # Single state
node engine.js start --state ALL --max 500          # 500/category cap
node engine.js start --state ALL --chunk 100        # Rotate every 100
node engine.js start --state OH --categories "plumber,dentist"
node engine.js start --state ALL --fresh            # Ignore saved progress

# ── JOB CONTROL ──
node engine.js pause                                # Pause after current business
node engine.js resume                               # Resume from pause
node engine.js stop                                 # Graceful stop + checkpoint
node engine.js status                               # Show current state
node engine.js reset                                # Clear all saved progress

# ── INFO ──
node engine.js states                               # List all states + cities
node engine.js cats                                 # List all 65 categories
node engine.js                                      # Help

Flags

Flag          Default   Description
--state       required  State abbreviation (AZ, NV, OH, ID, WA) or ALL
--max         1000      Maximum businesses per category across all states
--chunk       250       Rotate to next category after N new discoveries
--categories  all 65    Comma-separated list of categories to run
--fresh       false     Ignore saved progress and start from scratch

Environment Variables

# Google Sheets — REQUIRED
GOOGLE_SPREADSHEET_ID=your_spreadsheet_id_here
GOOGLE_CREDENTIALS_PATH=google-credentials.json

# Timing
DELAY_MS=3000                    # Base delay between requests (ms)
MAX_PAGES_PER_SITE=15            # Max subpages to crawl per business website

# Discovery limits
MAX_PER_CATEGORY=1000            # Max businesses per category (65 categories)
CHUNK_SIZE=250                   # Rotate to next category after N discoveries

# Proxy (optional — recommended for large runs)
# PROXY_URL=http://user:pass@host:port

Crash Recovery

v6.1 The engine survives laptop sleep, terminal disconnects, and crashes.

State is saved to .discovery-state.json. To start fresh, use --fresh or run node engine.js reset.

Proxy Setup

For large discovery runs, a proxy is recommended to avoid IP-based rate limiting.

# In .env
PROXY_URL=http://username:password@proxy-host:port

The proxy is passed to Puppeteer via the --proxy-server flag, so all HTTP requests through the browser use the proxy. Axios requests (WHOIS, DuckDuckGo) also route through the proxy when configured.
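The --proxy-server wiring can be sketched as follows (hypothetical helper names, assuming the puppeteer npm package). Chrome does not accept inline user:pass in the flag, so credentials go through page.authenticate() instead.

```javascript
// Split PROXY_URL into the Chrome flag and the separate credentials.
function proxyArgs(proxyUrl) {
  const { protocol, host, username, password } = new URL(proxyUrl);
  return {
    flag: `--proxy-server=${protocol}//${host}`,    // no inline credentials
    auth: username ? { username, password } : null, // supplied per page instead
  };
}

// Launch a browser page that routes traffic through the proxy.
async function newProxiedPage(proxyUrl = process.env.PROXY_URL) {
  const puppeteer = require('puppeteer'); // npm install puppeteer
  const { flag, auth } = proxyArgs(proxyUrl);
  const browser = await puppeteer.launch({ headless: true, args: [flag] });
  const page = await browser.newPage();
  if (auth) await page.authenticate(auth); // proxy auth happens per page
  return { browser, page };
}
```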


Built by John Williams — Senior Paid Media Specialist at Seer Interactive.
Part of the Google Ads Agent ecosystem. Open source on GitHub.