Backing up a page with Wayback Machine

It is annoying to search for an obscure term, get a decade-old result, and then find out that the page points to a link that no longer exists. Perhaps the page was removed, or its domain changed owners. An artifact of our civilization has vanished.

The Wayback Machine is an initiative designed to prevent that from happening. It indexes webpages and archives them for future generations. I'll show you two ways to use it in your projects:

With ArchiveLabs JSON API

ArchiveLabs offers an indirect way to request a snapshot. From the documentation, it seems to actually be a JSON API for citing a Wayback snapshot with annotations, but it works for this purpose too.

Try it in your browser's console:

fetch('https://pragma.archivelab.org/', {
    method: 'post',
    headers: {
        'Content-Type': 'application/json'
    },
    body: JSON.stringify({
        url: 'https://abdus.co/'
    })
})
.then(response => response.json())
.then(console.log)

or with PHP:

$endpoint = 'https://pragma.archivelab.org';
$payload = ['url' => 'https://abdus.co/'];

$options = [
    'http' => [
        'method'  => 'POST',
        'header'  => 'Content-Type: application/json',
        'content' => json_encode($payload),
    ],
];

$context = stream_context_create($options);
$result = file_get_contents($endpoint, false, $context);
$response = json_decode($result);
var_dump($response);

You'll get a response like this:

{
    "annotation_id": null,
    "domain": "abdus.co",
    "id": 27101,
    "path": "/",
    "protocol": "https",
    "wayback_id": "/web/20170816045837/https://abdus.co/"
}

To get the full URL for the backup, prepend http://web.archive.org to wayback_id. When you follow the link, you'll see the backed-up version of the page.
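As a minimal sketch of that step, the helper below joins the archive host and the wayback_id value. The name toSnapshotUrl is made up for illustration and is not part of the ArchiveLabs API, and the check that the value starts with /web/ is an assumption based on the single response shown above.

```javascript
// Hypothetical helper (not part of any API): builds the full snapshot URL
// from the `wayback_id` field of the JSON response.
const WAYBACK_HOST = 'http://web.archive.org';

function toSnapshotUrl(waybackId) {
    // Assumption: wayback_id always looks like "/web/<timestamp>/<url>",
    // as in the example response.
    if (typeof waybackId !== 'string' || !waybackId.startsWith('/web/')) {
        throw new Error('Unexpected wayback_id: ' + waybackId);
    }
    return WAYBACK_HOST + waybackId;
}

console.log(toSnapshotUrl('/web/20170816045837/https://abdus.co/'));
```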

One problem I've found is that backups made using this method sometimes turn out with missing images, CSS, or JS. Another disadvantage is that it doesn't return a meaningful error message:

{
    "error": "400: Bad Request"
}

I've gotten better results with the next method.
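Even though the error message is terse, a failed request can still be told apart from a successful one. The sketch below assumes failures always carry an error key, as in the one example above; other failure shapes may exist. The name readPragmaResponse is hypothetical.

```javascript
// Hypothetical helper: turns the parsed JSON body into a snapshot URL,
// or throws if the body looks like an error.
function readPragmaResponse(body) {
    // Assumption: failures look like { "error": "400: Bad Request" }.
    if (body && typeof body.error === 'string') {
        throw new Error('ArchiveLabs request failed: ' + body.error);
    }
    if (!body || typeof body.wayback_id !== 'string') {
        throw new Error('Response has no wayback_id');
    }
    return 'http://web.archive.org' + body.wayback_id;
}

// Possible use with the fetch() call from earlier:
//   .then(response => response.json())
//   .then(readPragmaResponse)
//   .then(console.log)
//   .catch(console.error)
```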

With an HTTP request to Wayback Machine

If you go to the homepage of the Wayback Machine, you'll see a form that you can use to submit links. When you submit a link, it redirects to https://web.archive.org/save/$url. Opening the Developer Console and inspecting the network requests, we can see that the Wayback Machine responds with a number of custom headers, the most important of which is Content-Location, which holds the path of the backup.

Content-Location: /web/20170816041527/http://example.com

What happens when the submitted link is not available? When a non-existent URL is submitted, a different header is returned: X-Archive-Wayback-Runtime-Error, with an explanation of why the process failed, such as:

X-Archive-Wayback-Runtime-Error: LiveDocumentNotAvailableException: https://abdus.abdus: live document unavailable: java.net.UnknownHostException: abdus.abdus: Name or service not known

By checking these headers, we can not only submit a link, but also get an explanation if the submission fails.

Using the get_headers($url, $format) function, we can get the response headers as an associative array.

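$url = 'https://abdus.co/';
// Passing 1 as the second argument returns the headers as an associative array.
$headers = get_headers("http://web.archive.org/save/$url", 1);

if (!empty($headers['Content-Location'])) {
    // Content-Location is only a path, so prepend the Wayback Machine host.
    $snapshotUrl = 'http://web.archive.org' . $headers['Content-Location'];
} elseif (!empty($headers['X-Archive-Wayback-Runtime-Error'])) {
    $error = $headers['X-Archive-Wayback-Runtime-Error'];
}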

Please note that if there are redirects, the header values from all responses in the chain are combined and returned as arrays. In that case, you may need to check whether the value for a header is an array and get the last element with array_pop().
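The redirect case can be sketched as a pure function over the header data. The JavaScript below takes a plain object shaped like the associative array from get_headers(), where a value may be either a string or an array of strings. Both function names are hypothetical, and the header names are the ones observed above, not a documented contract.

```javascript
// Returns the last value when a header was collected into an array
// (one entry per response in a redirect chain), otherwise the value itself.
function lastHeaderValue(value) {
    return Array.isArray(value) ? value[value.length - 1] : value;
}

// Hypothetical helper: `headers` is a plain object keyed by header name.
function readSaveHeaders(headers) {
    const location = lastHeaderValue(headers['Content-Location']);
    if (location) {
        // Content-Location is a path, so prepend the archive host.
        return { ok: true, snapshotUrl: 'http://web.archive.org' + location };
    }
    const error = lastHeaderValue(headers['X-Archive-Wayback-Runtime-Error']);
    return { ok: false, error: error || 'No snapshot or error header found' };
}

console.log(readSaveHeaders({
    'Content-Location': ['/web/1/http://a.example', '/web/20170816041527/http://example.com'],
}));
```

Taking the last element matches the array_pop() suggestion: the final response in the chain is the one that describes the snapshot.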