DIY Media Database
Built from scratch without paying for a media list
Most PR agencies pay $500 to $1,000 a month for media databases like Prowly, Muck Rack, or Cision. We did too, briefly. Then I realized something: those databases are incomplete, outdated, and everyone else has the same list.
“A journalist database isn’t a spreadsheet of email addresses. It’s a living system of relationships, recency signals, and coverage patterns.”
— Salva Jovells, Presslei
In This Article
- Why Commercial Media Databases Fall Short
- Source 1: Mining Your Own Placement History
- Source 2: Competitor Backlink Mining
- Source 3: Publication Sitemap Scraping
- Source 4: LinkedIn Connections
- Source 5: Email Pattern Engineering
- The Merge: Deduplication and Scoring
- Keeping It Fresh
- Do You Need 27,000+ Contacts?
Key Takeaway
We built a database of 27,000+ journalists without buying a single media list. Using free tools, public data from sitemaps and bylines, LinkedIn exports, and smart automation, any PR team can build a better contact list than what paid services offer.
So I built our own. From scratch. It now has 27,000+ journalists with beat information, contact details, and engagement history. And I am going to show you exactly how we did it.
Why Commercial Media Databases Fall Short
I am not saying these tools are useless. They are convenient. But they have three fundamental problems:
- Everyone has the same contacts. If you and every other agency are pitching the same list from Muck Rack, those journalists are drowning in pitches. Your “exclusive” data story lands in an inbox alongside pitches from 50 other agencies using the same database.
- The data decays fast. Journalists change beats, switch publications, go freelance. Commercial databases update quarterly at best. By the time you pitch, the journalist may have moved on months ago.
- They miss the long tail. Freelancers, regional journalists, niche trade press, new hires who have not been indexed yet. Some of our best placements came from journalists who are not in any commercial database.
Building your own database is more work upfront. But the contacts are fresher, more targeted, and exclusively yours.
Source 1: Mining Your Own Placement History
If you have done any PR before, start here. Your past placements are a goldmine of journalist contacts.
We went through over 5,200 historical placements and extracted every journalist byline, email pattern, and publication. This gave us roughly 900 journalists who we knew had covered stories similar to ours, because they already had.
The process:
- Export all your placement URLs into a spreadsheet
- Visit each article and extract the journalist’s name and any contact info
- Note what topics they covered and which pitches they responded to
- Cross-reference with LinkedIn to find current roles
Yes, this is tedious. We ended up automating most of it with scripts that scrape bylines and author pages. But even doing it manually for your top 100 placements gives you a strong starting list.
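As a rough illustration of the byline-scraping step, here is a minimal Python sketch. It is not the script described above. It assumes the article HTML has already been downloaded, and that the page exposes the author either in a `<meta name="author">` tag or in a link marked `rel="author"`. Many sites do, but plenty do not, so treat the two selectors as starting points. The names `BylineParser` and `extract_bylines` are made up for this example.

```python
from html.parser import HTMLParser


class BylineParser(HTMLParser):
    """Collect author names from <meta name="author"> tags and rel="author" links."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._in_author_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "author":
            content = (attrs.get("content") or "").strip()
            if content:
                self.authors.append(content)
        elif tag == "a" and "author" in (attrs.get("rel") or "").lower().split():
            # The visible link text is the byline; capture it in handle_data
            self._in_author_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_author_link = False

    def handle_data(self, data):
        if self._in_author_link and data.strip():
            self.authors.append(data.strip())


def extract_bylines(html):
    """Return the unique author names found in one article's HTML, in page order."""
    parser = BylineParser()
    parser.feed(html)
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(parser.authors))
```

Run it over each placement URL in your spreadsheet and write the results to a new column. Pages where it returns an empty list are the ones you still need to check by hand.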
The best predictor of a future placement is a past placement. Journalists who covered your kind of story before are significantly more likely to cover it again.
Pro Tip
Personalize every pitch. Reference the journalist’s most recent article and explain why your story matters to their specific audience.
DO
- Start with Google News searches for recent relevant coverage
- Record the last 3 relevant articles for each journalist
- Verify email addresses before adding to your outreach list
- Include freelancers who write for multiple publications
- Update your database after every campaign with response data
DON’T
- Buy pre-built journalist databases without verification
- Add journalists based on publication prestige alone
- Include journalists who haven’t covered your topic in 90+ days
- Store journalist data without a legitimate business purpose
- Skip the LinkedIn employment verification step
Source 2: Competitor Backlink Mining
This is one of our most effective methods and I have never seen another agency talk about it publicly.
The logic: if a journalist wrote about your competitor’s data study and linked to them, they will probably be interested in your data study on a similar topic.
Here is the method:
- Identify 10 to 15 competitors or similar brands that have earned media coverage
- Pull their backlink profiles using Ahrefs, Semrush, or Moz
- Filter for editorial links (exclude directories, forums, guest posts)
- Extract the journalist bylines from each linking article
- Cross-reference against your existing database to find new contacts
When we ran this process across 35 PR agency domains, we found and scored 655 new contacts that were not in any commercial media database. These are journalists who are already proven to cover data-driven PR stories.
We take it a step further by scoring each contact based on the domain rating of publications they write for, whether we have an email, whether we have a LinkedIn profile, and their geographic region. The highest-scored contacts go into our priority outreach queue.
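To make the scoring idea concrete, here is a small Python sketch. The four inputs (domain rating, email, LinkedIn profile, region) come from the description above, but the weights, the `PRIORITY_REGIONS` set, and the `Contact` fields are invented for illustration. Tune them to your own campaigns.

```python
from dataclasses import dataclass

# Hypothetical: the regions your campaigns target most often
PRIORITY_REGIONS = {"UK", "US"}


@dataclass
class Contact:
    name: str
    domain_rating: int   # 0-100, as reported by your backlink tool
    has_email: bool
    has_linkedin: bool
    region: str


def score_contact(contact):
    """Return a 0-100 priority score. The weights are illustrative assumptions."""
    score = 0.5 * contact.domain_rating   # publication authority: up to 50 points
    if contact.has_email:
        score += 25                       # reachable by email
    if contact.has_linkedin:
        score += 15                       # reachable by LinkedIn DM
    if contact.region in PRIORITY_REGIONS:
        score += 10                       # in a target region
    return score


def priority_queue(contacts):
    """Sort contacts so the highest-scored ones come first."""
    return sorted(contacts, key=score_contact, reverse=True)
```

A simple additive score like this is easy to explain and easy to adjust. If email reachability turns out to matter more than publication authority for your outreach, change one number and re-sort.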
Source 3: Publication Sitemap Scraping
Every news website has a sitemap. Most sitemaps contain author URLs. Those author URLs contain journalist names, beats, and sometimes contact information.
We wrote a script that crawls publication sitemaps, extracts author pages, and pulls byline data. Running this across 60 major publications gave us over 700 journalist records, of which about 580 were genuinely new contacts not in our database.
The approach:
- Find the publication’s sitemap (usually at /sitemap.xml or /sitemap_index.xml)
- Look for author-specific sitemaps or URLs containing /author/
- Extract names and any associated metadata
- Cross-reference with the publication’s recent articles to identify active beats
This works best with mid-size publications. The massive outlets like BBC have complex sitemap structures, but regional news groups and trade publications are straightforward.
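The steps above could be sketched in Python roughly like this. It is a simplified illustration, not the script mentioned earlier. It assumes you have already downloaded the sitemap XML (for example with `urllib.request`), that author pages live under `/author/`, and that the last URL segment is a hyphenated name slug. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def author_urls_from_sitemap(xml_text, marker="/author/"):
    """Return every <loc> URL in a sitemap that contains the author marker."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Tags include the sitemap namespace, e.g. "{http://...}loc",
        # so compare only the part after the closing brace.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            url = element.text.strip()
            if marker in url:
                urls.append(url)
    return urls


def name_from_author_url(url):
    """Guess a display name from a slug like /author/jane-doe/ (assumed format)."""
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    return " ".join(part.capitalize() for part in slug.split("-"))
```

The same function works on a sitemap index, because index files also list their child sitemaps in `<loc>` elements. Pass a different `marker` to find author-specific sitemaps. Names guessed from slugs are rough (hyphenated surnames and numeric suffixes will come out wrong), so confirm them against the author page itself.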
Source 4: LinkedIn Connections
If you have been networking in your industry, your LinkedIn connections are an untapped source of journalist contacts.
We exported all 8,100+ LinkedIn connections, filtered for media professionals (editors, journalists, reporters, correspondents), and found 580 media contacts we had never messaged.
The advantage: these are people who already accepted a connection request. There is a baseline relationship. A LinkedIn DM from a connection gets read far more reliably than a cold email.
We now maintain a dashboard that tracks which media connections have been contacted, which responded, and which should be avoided (because they explicitly asked not to be pitched). That kind of tracking prevents embarrassing double-pitches.
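The filtering step can be done in a spreadsheet, but a few lines of Python make it repeatable. This sketch assumes the export is a CSV with a `Position` column holding the job title. Check the header row of your own file, and delete any notes LinkedIn places above it, before running this. The keyword list mirrors the roles mentioned above.

```python
import csv
import io

# Job-title keywords that suggest a media professional
MEDIA_KEYWORDS = ("editor", "journalist", "reporter", "correspondent")


def media_contacts(csv_text, title_column="Position"):
    """Return the rows whose job title contains one of the media keywords."""
    reader = csv.DictReader(io.StringIO(csv_text))
    matches = []
    for row in reader:
        title = (row.get(title_column) or "").lower()
        if any(keyword in title for keyword in MEDIA_KEYWORDS):
            matches.append(row)
    return matches
```

Substring matching is deliberately loose, so it will also catch titles like “Video Editor” at a non-media company. Skim the output before adding anyone to an outreach list.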
Key Takeaway
The best pitches answer one question: why should this journalist’s readers care about this right now?
Source 5: Email Pattern Engineering
Here is where it gets technical. Once you have journalist names and publications, you still need email addresses. And most journalists do not list their email publicly.
But email addresses follow patterns. Most publications use one of a few formats:
| Pattern | Example |
|---|---|
| firstname.lastname@ | john.smith@publication.com |
| firstname@ | john@publication.com |
| firstinitial.lastname@ | j.smith@publication.com |
| firstnamelastname@ | johnsmith@publication.com |
We built a pattern map covering 265 publication domains with their specific email formats. When we add a new journalist from The Telegraph or Express or Metro, we can generate a likely email address instantly.
How we built the pattern map:
- Start with journalists whose emails we already know (from previous correspondence, public bios, etc.)
- Extract the pattern for that publication domain
- Apply it to other journalists at the same publication
- Verify using email validation tools
This alone took our email coverage from roughly 25% to over 42% of our database. For a free, DIY approach, it is remarkably effective.
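Here is a minimal Python sketch of the pattern-map idea, covering only the four formats in the table above. The function names are hypothetical, and the sketch ignores edge cases such as accented characters, double-barrelled names, and publications that mix formats.

```python
# Each pattern builds the local part (before the @) from a lowercase first and last name
PATTERNS = {
    "firstname.lastname": lambda first, last: f"{first}.{last}",
    "firstname": lambda first, last: first,
    "firstinitial.lastname": lambda first, last: f"{first[0]}.{last}",
    "firstnamelastname": lambda first, last: f"{first}{last}",
}


def detect_pattern(first, last, known_email):
    """Return the name of the pattern that reproduces a known email, or None."""
    local_part = known_email.split("@", 1)[0].lower()
    for name, build in PATTERNS.items():
        if build(first.lower(), last.lower()) == local_part:
            return name
    return None


def generate_email(first, last, domain, pattern):
    """Build a candidate address. It is a guess until a validation tool confirms it."""
    return f"{PATTERNS[pattern](first.lower(), last.lower())}@{domain}"
```

For example, one known address such as `john.smith@publication.com` yields the pattern `firstname.lastname`, which you can then apply to every other journalist at that domain. Store the detected pattern per domain, and keep generated addresses flagged as unverified until they pass validation.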
The Merge: Deduplication and Scoring
The hardest part of building a multi-source database is merging everything without creating duplicates. A journalist might appear in your placement history, your competitor backlinks, AND your LinkedIn connections under slightly different name spellings.
Our deduplication process:
- Normalize names (lowercase, remove middle initials, handle hyphenated surnames)
- Match on name + publication domain first
- Then match on email address for cross-publication matches
- Manual review for fuzzy matches (similar names at the same publication)
After merging 17 different sources, our unified database settled at 5,909 unique journalist records. About 42% have verified email addresses. About 26% have LinkedIn profiles. And we tag every contact with their source, beat, engagement history, and a priority tier.
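The first three deduplication steps could look something like this in Python. It is a simplified sketch under stated assumptions: each record is a dict with `name`, `domain`, `email` (possibly `None`), and `source` keys, which is an invented schema for this example. The fuzzy-match review step is left out because it needs a human.

```python
def normalize_name(name):
    """Lowercase, split hyphenated surnames, and drop middle initials."""
    tokens = name.lower().replace("-", " ").split()
    last_index = len(tokens) - 1
    kept = [
        token for index, token in enumerate(tokens)
        # Always keep the first and last token; drop single letters in between
        if index in (0, last_index) or len(token.rstrip(".")) > 1
    ]
    return " ".join(kept)


def deduplicate(records):
    """Merge records that share name + domain, or that share an email address."""
    merged = []
    by_key = {}     # (normalized name, domain) -> merged record
    by_email = {}   # email -> merged record
    for record in records:
        key = (normalize_name(record["name"]), record["domain"].lower())
        email = (record.get("email") or "").lower()
        existing = by_key.get(key) or (by_email.get(email) if email else None)
        if existing is None:
            existing = {"name": record["name"], "domain": key[1],
                        "email": email or None, "sources": []}
            merged.append(existing)
        elif not existing["email"] and email:
            existing["email"] = email   # fill a gap from the newer source
        existing["sources"].append(record["source"])
        by_key[key] = existing
        if email:
            by_email[email] = existing
    return merged
```

Keeping a `sources` list on each merged record preserves where a contact came from, which is useful later when you tag and score contacts. Matching on email second catches the same journalist listed under two publications.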
Keeping It Fresh
A database is only as good as its last update. We have a few systems for keeping ours current:
- Bounce tracking: Every email that bounces gets flagged. If a journalist’s email bounces, we check if they changed publications.
- Response tagging: Every response (positive, negative, or redirect) gets logged. This builds a picture of each journalist’s preferences over time.
- Quarterly re-scraping: We re-run our sitemap and backlink scripts every quarter to catch new journalists and publication changes.
- LinkedIn monitoring: Job change notifications for key contacts alert us when someone switches publications.
Do You Need 27,000+ Contacts?
No. For most campaigns, you are pitching 50 to 80 journalists. But having a large database means you can be selective. Instead of pitching everyone who might be relevant, you pitch the 50 people who are most likely to respond, based on their beat, their publication’s authority, and their past engagement with similar stories.
Start small. Even 200 well-researched, correctly targeted contacts will outperform a 10,000-name list from a commercial database. The value is not in the volume. It is in the accuracy and the freshness.
Want access to our journalist network for your campaign? When you work with Presslei, you get the benefit of 27,000+ contacts built over months of research and real outreach. Get in touch.
Frequently Asked Questions
How do you build a journalist database without paying for one?
Start with three free sources: scrape article bylines from publication sitemaps to find active journalists, export your LinkedIn connections to identify media professionals, and mine competitor backlinks to find journalists who cover your industry. Then enrich with email patterns based on each publication’s format.
Are paid media databases worth it?
For most small agencies and in-house teams, no. Paid databases like Cision or Muck Rack charge $5,000 to $15,000 per year and often contain outdated contacts. A self-built database using public data is more current, more targeted, and free.
How do you find journalist email addresses?
Most publications follow predictable email patterns like firstname.lastname@publication.com. Once you identify the pattern for a publication, you can generate emails for any journalist there. Verify them using free tools before sending.
About the Author
Salva Jovells
Founder of Presslei. 12+ years in ecommerce SEO across international markets. After a decade of link buying for Hockerty and Sumissura, I reverse-engineered 5,272 earned media placements and founded a reactive PR agency that builds authority through data-driven stories journalists actually want to publish. Based in Zurich.
Related Reading
- How I Reverse-Engineered 5,272 Media Placements
- PR Tools Compared: Muck Rack vs Cision vs Prezly vs Prowly
- HARO vs Qwoted vs Featured vs SourceBottle
Ready to earn links instead of buying them?
Get 8–14 editorial placements in top-tier publications. No contracts. No risk. Just results.
$3,000 per campaign · 8–14 links guaranteed · Zero penalty risk


