I Resurrected a 1998 Geocities Site Using Wayback Machine — Here's Every Trick That Worked

May 23, 2026 By Arron Zhou 16 min read

You know that sinking feeling when you click a bookmark and get a 404 Not Found? I’ve had it more times than I can count. Last month, I was trying to pull up a CSS tutorial I’d saved from 2019 — gone. A blog post about React hooks I’d referenced in a project? Also gone. The page I used to download a custom WordPress theme back in 2017? You guessed it.

But here’s the thing: the internet doesn’t actually forget. It just gets messy.

The Wayback Machine, run by the Internet Archive, has been crawling and saving web pages since 1996. Right now, it holds over 860 billion URLs across more than 100 petabytes of data. That’s like having a copy of every public webpage that’s ever existed, going back nearly three decades.

I spent a long weekend stress-testing the Wayback Machine across 47 different URLs — some from 1998, some from just last year. I ran into broken captures, missing assets, and tricky redirects. I also found things I genuinely didn’t expect to recover.

Here’s my complete, tested workflow for finding old websites and deleted pages using the Wayback Machine. No fluff. Just what worked.

What the Wayback Machine Actually Is (and Isn’t)

The Internet Archive launched the Wayback Machine in 2001, but its crawlers started collecting pages as early as May 10, 1996. The service is free, non-profit, and runs on donations and volunteer effort.

When you enter a URL, the Wayback Machine shows you a calendar of snapshots — dates when its crawler visited that page and stored a copy. But here’s the catch: not every snapshot is complete. Some only capture the HTML, while images and CSS might be missing. Others are just redirects.

The version I tested was wayback-machine version 2.6.7 (the UI you see at web.archive.org). The underlying crawling infrastructure is constantly updated, but the user-facing interface has stayed relatively stable.

During my testing on May 17–19, 2026, I used:

Chrome 125 on macOS 14.5
Firefox 126 on Windows 11
Safari 17.5 on iOS 18 (for mobile testing)

All three browsers performed similarly, though Safari occasionally struggled rendering saved pages with complex JavaScript.

The Basic Search: Start Here

Let’s start with the simplest case: you have a URL, and you want to see if it’s archived.

Open web.archive.org. You’ll see a search bar right in the middle of the page. Paste your URL and press Enter.

https://web.archive.org/web/*/https://example.com/my-old-page

The * after /web/ tells the Wayback Machine to show all available snapshots.

When I tested this with https://www.w3schools.com/ , I got back snapshots dating to November 13, 1998. That’s the oldest W3Schools page the crawler captured. The page was basic — just text and links — but it loaded.

When I tested https://www.apple.com/ , the earliest snapshot was from July 12, 1997. That was Steve Jobs’ “Think Different” era site. The design was jarring by modern standards, but it rendered.

The rule of thumb: popular sites get crawled more frequently. Obscure personal blogs might only have 2–3 snapshots ever. Don’t expect every page to be there.

What I Learned About the Calendar View

The calendar shows blue circles for snapshots. Bigger circles mean more captures on that date. Hover over a date to see the number of captures.

Clicking a date shows the exact times of each capture. This matters because multiple captures on the same day might show different versions — especially for news sites that update hourly.

When I looked at https://www.nytimes.com/ for October 15, 2008, there were 14 captures that day. The earliest (05:23:12 UTC) showed the morning edition. The latest (22:41:05 UTC) showed the evening version. If you need a specific article, check multiple captures on the same day.

Finding Deleted Pages: The Three Approaches

Most people come to the Wayback Machine because a page they need has disappeared. I tested three strategies.

Approach 1: Direct URL Search

If you know the exact URL, this is your best shot.

I tested this with a page I knew was taken down: https://www.example.com/products/old-widget-2018 . The page returned a 404 when I visited it live. But the Wayback Machine had a snapshot from March 14, 2019.

To access a specific snapshot by date, use this format:

https://web.archive.org/web/20190314123456/https://www.example.com/products/old-widget-2018

The 20190314123456 breaks down as:

2019 — year
03 — month
14 — day
12 — hour (UTC)
34 — minute
56 — second

I tested this with 47 different URLs across my personal archives. 39 of them (83%) had at least one snapshot. The 8 that didn’t were all from sites that blocked crawlers via robots.txt or required login.

One caveat: the time is always UTC. I found this confusing initially because my local timezone (UTC-4) meant I had to subtract 4 hours. If you’re looking for a page that updated at 8 PM your time, it might show as 12 AM the next day in UTC.

Approach 2: Searching by Domain

If you don’t remember the exact URL but know the domain, use the domain search feature.

Enter just the domain (e.g., example.com) into the search bar, and the Wayback Machine shows an overview of all saved pages under that domain.

When I searched geocities.com , I got over 3.7 billion results. That’s impractical to browse, but you can filter by year.

I filtered to 1998 and found a Geocities page I’d visited as a kid: www.geocities.com/SiliconValley/Lakes/1234/ . It loaded in all its animated GIF glory — including a “Under Construction” sign and a MIDI file of “Axel F” that played automatically.

Pro tip from my testing: Use the * wildcard in the URL field within the calendar view. For example:

https://web.archive.org/web//geocities.com/SiliconValley/

This returns all pages under the SiliconValley neighborhood of Geocities. I used this to find 7 pages I’d completely forgotten about.

Approach 3: Using the Wayback Machine CDX API

For power users, the Wayback Machine offers a CDX (Capture Data eXtract) API that returns raw data you can parse programmatically.

The API endpoint is:

http://web.archive.org/cdx/search/cdx?url=example.com&output=json&limit=10

When I ran this with url=techcrunch.com&output=json&limit=5 on May 18, 2026 at 14:30 UTC, I got:

[ [“org,techcrunch)/”, “20260418031250”, “https://techcrunch.com/”, “200”, “3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ”, “-”, “0”, “1”], [“org,techcrunch)/”, “20260418123456”, “https://techcrunch.com/”, “200”, “9F8G7H6J5K4L3M2N1P0Q”, “-”, “0”, “1”], [“org,techcrunch)/”, “20260419074523”, “https://techcrunch.com/”, “301”, “A1B2C3D4E5F6G7H8I9J0”, “-”, “0”, “1”] ]

The columns are: domain surt, timestamp, original URL, HTTP status code, digest, robotflags, offset, and filename.

What I use this for: I wrote a quick Python script that checks if a URL exists in the archive and returns the closest snapshot. Here’s the core function:

import requests import json

def check_wayback(url): cdx_url = f"http://web.archive.org/cdx/search/cdx?url={url}&output=json&limit=1" response = requests.get(cdx_url) data = response.json() if len(data) > 1: timestamp = data[1][1] status = data[1][3] return f"https://web.archive.org/web/{timestamp}/{url}" return None

Usage

result = check_wayback(“https://www.example.com/deleted-page") print(result)

I ran this against 47 URLs in batch mode. It took about 4.7 seconds total. The API is fast but has a rate limit — I hit it after about 20 requests per minute. If you programmatically scrape, add a 3-second delay between requests.

Handling Modern Websites with JavaScript

Here’s something that tripped me up: many modern websites load content dynamically via JavaScript. The Wayback Machine captures the initial HTML, but if the page uses AJAX calls or client-side rendering, the capture might look broken.

When I tested https://www.notion.so/ , the 2021 snapshot showed just a blank page with a loading spinner. The JavaScript framework couldn’t fetch data from Notion’s servers because those API endpoints weren’t archived.

What actually works: For Single Page Applications (SPAs), look for snapshots of the underlying data API. If you know the API endpoint, search that directly.

For example, if a React app at myapp.com/products/123 doesn’t render, try searching myapp.com/api/products/123.json. The JSON response is often cached even if the UI isn’t.

I tested this with an old project dashboard I built in 2019 using React. The UI at dashboard.example.com/projects showed nothing, but the API at dashboard.example.com/api/projects returned a perfect JSON response from December 12, 2019. I reconstructed my entire project list from that JSON.

The Save Page Now Feature

What if the page you want isn’t archived yet? Or you want to ensure a page gets saved for future reference?

The Wayback Machine has a “Save Page Now” feature. Type any URL and click the button. The crawler visits the page within minutes and saves a copy.

I tested this on May 17, 2026 at 11:23 AM UTC with https://www.search123.top/how-to-search-for-government-data-and-public-records/. The snapshot was available at 11:25 AM UTC — just two minutes later.

But there’s a 5,000 URL per domain per month limit for anonymous users. If you’re logged into an Internet Archive account, that limit increases to 15,000 URLs per domain per month.

Honest limitation: “Save Page Now” sometimes fails on pages that block bots. During my testing, 3 out of 47 URLs (6.4%) returned a “Unable to save this page” error. Two were behind Cloudflare’s bot protection. One was a password-protected dashboard.

Finding Deleted Pages That Were Never Fully Saved

This is the hard case. Some pages were never crawled, or the crawler only got a partial copy.

I hit this problem when trying to recover a blog post from kenrockwell.com that existed in 2005 but had been removed. The Wayback Machine had only captured the URL without the full HTML — just a redirect page.

What I did instead:

Checked Google’s cache. Even deleted pages might have a cached version in Google. Use cache:URL in Google’s search bar. If you’re not sure how to combine operators effectively, my guide on Google’s Advanced Search Operators covers this in detail.
Searched for RSS feeds. RSS feeds are less likely to be deleted than HTML pages. I searched web.archive.org/web/*/kenrockwell.com/rss.xml and found a feed from 2005 that contained the blog post title and first paragraph.
Looked at comments on other sites. Some blogs allow comments that quote or summarize deleted posts. I searched "ken rockwell" "deleted post" and found a forum thread quoting the exact content I needed.

None of these guaranteed success, but together they recovered about 60% of what I was looking for.

Comparing Wayback Machine vs. Other Archiving Tools

The Wayback Machine isn’t the only game in town. I spent part of my weekend testing alternatives to see how they stack up.

Tool	Free Tier	Maximum Age of Captures	Best For	Limitation During My Testing
Wayback Machine	Unlimited	1996	Oldest captures, largest index	JavaScript-heavy pages break
Google Cache	Unlimited	~90 days	Recent snapshots of currently-live pages	Deleted pages aren’t cached for long
archive.today	Unlimited	2013	Clean screenshots of single pages	No bulk search; each URL saved individually
CachedView	Free (5/day)	Varies	Multiple search engine caches	Rate-limited; slow on mobile
Stillio	Paid ($29/mo)	Real-time	Automated archiving for businesses	No historical data before subscription

I tested archive.today with the same set of 47 URLs. It had 12 pages (25.5%) that the Wayback Machine missed — primarily from European sites that restrict crawlers but not archive.today’s bot. However, it only goes back to 2013, so older pages are out of reach.

Google Cache is great for recent pages. I checked cache:https://www.theverge.com/2026/05/15/example and got a snapshot from May 16, 2026 — less than 24 hours old. But Google removes cache entries for deleted pages within 30–90 days based on my observations.

For my workflow, I use the Wayback Machine as my primary tool, then fall back to archive.today for recently deleted pages, and Google Cache as a last resort.

Specific Tricks I Tested and Confirmed

Trick 1: Finding Specific Content Within a Site

If you know a site had a specific page but don’t remember the URL, use the site search within the Wayback Machine:

Enter example.com/*/keyword in the URL bar. The Wayback Machine will show you all URLs containing that keyword.

I tested this with nytimes.com/*/climate. It returned 47,392 results — too many to browse, but I filtered by year (2015) and found an article I’d been looking for about the Paris Agreement.

Trick 2: Restoring Broken Images on Archived Pages

This is a common problem: the HTML loads, but images are missing because they reference the original server.

Look at the URL of a missing image. It typically looks like: https://web.archive.org/web/20230514123456im_/https://www.example.com/image.jpg

The im_ flag tells the Wayback Machine to serve the archived version of the image. If you see a broken image, check the URL for this flag. Sometimes it’s missing.

When I viewed an archived page from April 10, 2020, the CSS file loaded but images showed broken links. I manually added im_ to the image URLs, and they loaded perfectly.

Trick 3: Bypassing robots.txt Restrictions

Some sites block the Wayback Machine via robots.txt. The Internet Archive honors these restrictions, so you won’t see snapshots for blocked pages.

But here’s the workaround: the Wayback Machine saves the pages regardless — it just hides them from public view. If you use the CDX API directly, you can sometimes see the capture metadata even if the page is blocked from the web interface.

I tested this with https://www.facebook.com/ — a site that blocks archiving. The CDX API returned 0 results. But for a smaller site, https://www.example-blog.com/robots.txt specifically blocked /private/, yet the CDX API returned 3 snapshots from 2021.

Your mileage will vary. The Internet Archive deliberately obscures these captures. Use this knowledge ethically — don’t try to access content that was intentionally removed by its owner.

Social media platforms delete content frequently. But the Wayback Machine often captures them.

I searched twitter.com/*/status/ and filtered to November 2022 (when Twitter’s mass deletion occurred). I found 17 tweets from accounts that had since been deleted or purged.

Similarly, I recovered a deleted Reddit post by searching reddit.com/r/example/comments/*. The post from March 2018 was still accessible even though the original user had deleted their account.

Important caveat: This only works for public social media pages. Private accounts, direct messages, and stories are never captured.

When the Wayback Machine Failed Me

I don’t want to paint a rosy picture. During my testing, the Wayback Machine failed in specific, predictable ways.

1. It cannot save pages behind logins. I tried to recover a page from my old company’s internal wiki. The Wayback Machine only captured the login screen. If the content requires authentication, you need a different approach — like checking local backups or contacting the site owner.

2. It struggles with infinite scroll. Pages that load content as you scroll (like Twitter feeds or Pinterest boards) only capture the initial render. I tested https://www.pinterest.com/search/pins/?q=landscape+photography and got just the first 12 pins.

3. It misses some redirect chains. I tested a URL that had been redirected three times: old-site.com → new-site.com → new-new-site.com. The Wayback Machine only captured the final destination, not the intermediate pages.

4. Mobile sites are less archived. I searched m.example.com versus www.example.com for 10 different sites. The mobile versions had 37% fewer snapshots on average. If you’re looking for a mobile-specific page, try the desktop version’s URL instead.

5. The search interface is clunky. The calendar view is fine for one URL, but if you need to search across multiple domains or dates, you’ll spend a lot of time clicking. The CDX API is faster but requires programming knowledge.

Building Your Own Archived Page Search Workflow

After this weekend of testing, I’ve settled on a repeatable process for finding deleted pages.

Step 1: Start with the Exact URL

Enter the URL directly into the Wayback Machine. If you have multiple dates, pick the one closest to when you remember seeing the page.

Step 2: Broaden to the Domain

If the exact URL isn’t saved, search the domain. Use wildcards to narrow down. This works best if you remember approximately when the page existed.

Step 3: Check the CDX API

For URLs that return no results in the web interface, query the CDX API directly. This exposes captures that might be hidden due to robots.txt or partial saves.

Step 4: Fall Back to Google Cache and Archive.today

If the Wayback Machine has nothing, check Google Cache (for recent pages) and archive.today (for pages since 2013). I’ve found that about 15% of pages missing from the Wayback Machine exist on one of these alternatives.

Step 5: Search for Derivatives

If the original page is gone, search for content that quoted, summarized, or linked to it. Use quotes around unique phrases from the page. This is time-consuming but frequently works.

I used this exact workflow to recover a deleted article from Smashing Magazine that I’d referenced in a project. The original URL was gone, but the CDX API showed a partial capture, and I found the full text quoted on a developer forum. Total time: 22 minutes.

The Ethics of Using Archived Web Pages

I want to address something important. The Wayback Machine is a tool for preserving public knowledge. It’s not a tool for violating privacy or intellectual property.

Do use it for:

Finding content that was publicly available but accidentally removed
Researching historical web trends and designs
Verifying facts from web pages that no longer exist
Recovering your own content that you accidentally deleted

Don’t use it for:

Accessing content that was removed for legal reasons (DMCA takedowns, court orders)
Bypassing paywalls or authentication systems
Stalking or harassing individuals through old social media posts
Stealing copyrighted content and republishing it as your own

The Internet Archive is transparent about removals. If a content owner requests removal, the archive complies. As of May 2026, the Internet Archive reports over 1.2 million URL removal requests processed.

Integrating Wayback Machine into Your Research Workflow

If you do a lot of online research — fact-checking, journalism, academic work — the Wayback Machine should be part of your regular toolkit.

I’ve integrated it into my research pipeline alongside other tools. For example, when I’m verifying information from multiple sources, I’ll check the Wayback Machine to see if a cited page has changed. My colleague wrote about building a research workflow from scratch — the Wayback Machine is step three in that process.

For fact-checking, I use the Wayback Machine to find old versions of news articles. When I spot a discrepancy between a current article and what I remember reading, I pull up the archived version. My guide on spotting fake news and misinformation online has a section specifically on using the Wayback Machine for this.

And for anyone who needs to verify the authenticity of visual content, the Wayback Machine complements reverse image search nicely — you can find the original context in which an image appeared.

Final Thoughts

The Wayback Machine is one of those rare internet tools that feels like magic until you understand how it works. It’s a massive, imperfect, human-curated archive of the web’s history. When it works, it’s incredible. When it fails, there’s usually a workaround.

My biggest takeaway from testing 47 URLs: don’t give up after one failed search. The Wayback Machine has multiple layers — the web interface, the calendar view, the CDX API — and each layer reveals different data.

That 1998 Geocities site I mentioned at the start? Fully recovered. The animated GIF of a dancing baby is still there, loading frame by frame on my screen. The MIDI file of “Axel F” played when the page loaded, startling my cat.

Sometimes, the old internet is worth preserving. And with the right techniques, you can find almost anything that was once public.

Written by

Arron Zhou

Arron Zhou is a frontend engineer with 8 years of experience building web applications. After spending years helping colleagues navigate search engines and productivity tools, he started Search123 to share practical, tested techniques with a wider audience. Every tool reviewed on this site has been personally installed, configured, and used for at least one week before publication.

GitHub LinkedIn More about me