Separating Search Engine Crawler Traffic with Lambda@Edge

Search engines use crawlers to scan websites for content. What these crawlers 'see' determines how well your website can be found. In some cases you might want to serve separate, crawler-optimized content to the search engines. In this blog post, we will demonstrate how to achieve that traffic separation with Lambda@Edge.

The process of optimizing your website for crawlers is called Search Engine Optimization (SEO). Let's assume you have a static website, which is not optimized for SEO. There are two kinds of traffic: human visitors and automated bots. Both connect to a single domain, which is served by a CloudFront distribution.

Simple CloudFront Setup

Because your website has bad SEO, you decide to make a crawler-optimized version of your website and store it in a separate S3 bucket. You need to make sure that the crawlers access exactly the same URLs as humans do; only the response needs to be different.

CloudFront with two Origins

The question is how to make sure that the bots use one bucket, and human visitors use the other. This is where Lambda@Edge comes in.

Lambda@Edge

AWS Lambda is Amazon's Function as a Service platform. It allows you to focus on writing code, selecting a runtime (e.g. Python 3.6 or Node v10), and executing that code based on specific triggers. Lambda@Edge follows the same principles, but with a few major differences.

  1. Lambda@Edge does not get executed in a specific AWS Region, but at the CloudFront Edge locations (hence the name).
  2. Lambda@Edge only supports the Node runtime (v8 or v10).
  3. The only triggers for Lambda@Edge are CloudFront requests or responses.

The fact that Lambda@Edge functions run whenever CloudFront receives a request or returns a response is what we will be focusing on in this blog post. In the Lambda@Edge documentation you can find this diagram:

Lambda@Edge triggers

It shows the four situations in which a Lambda@Edge function can be triggered:

  1. When the visitor requests data from CloudFront
  2. When CloudFront requests data from the origin
  3. When the origin returns a response to CloudFront
  4. When CloudFront returns a response to the visitor

To intercept and redirect crawler traffic, we need to focus on the first two triggers: the viewer-request and the origin-request.

CloudFront Distribution with Lambda@Edge functions
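Regardless of which request trigger you use, the handler has the same basic shape: it receives the CloudFront event, may inspect or modify the request it contains, and hands the request back to CloudFront. As a minimal sketch:

'use strict';

// Minimal shape of a Lambda@Edge request handler (viewer-request or
// origin-request): read the request from the event, optionally modify it,
// and return it to CloudFront through the callback.
exports.handler = (event, context, callback) => {
  const request = event.Records[0].cf.request;
  // ... inspect or modify the request here ...
  callback(null, request);
};

Both functions in this post follow this pattern; they only differ in what they do with the request.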

Viewer request

Whenever a visitor makes a request to the CloudFront distribution, the headers, the request details, and optionally the body are passed to the Lambda function. An example of this event data:

{
    "Records": [
        {
            "cf": {
                "config": {
                    "distributionDomainName": "d2426qy9g49s0r.cloudfront.net",
                    "distributionId": "E1FLN0YVFSZXLA",
                    "eventType": "viewer-request",
                    "requestId": "b5sKNn2Tn9YLf83wrqmASNKiZ8QE1_N6YjNNmTmoB9BwNoW-4rKXuA=="
                },
                "request": {
                    "clientIp": "94.130.141.164",
                    "headers": {
                        "host": [
                            {
                                "key": "Host",
                                "value": "cloudbanshee.com"
                            }
                        ],
                        "user-agent": [
                            {
                                "key": "User-Agent",
                                "value": "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
                            }
                        ],
                        "accept": [
                            {
                                "key": "Accept",
                                "value": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
                            }
                        ],
                        "accept-encoding": [
                            {
                                "key": "Accept-Encoding",
                                "value": "gzip,deflate"
                            }
                        ]
                    },
                    "method": "GET",
                    "querystring": "",
                    "uri": "/blog/ideal-ssh-bastion"
                }
            }
        }
    ]
}

Origin request

When the CloudFront cache does not have a cached object to return, it forwards the request to the origin. This is called the Origin Request. Its payload is slightly different:

{
    "Records": [
        {
            "cf": {
                "config": {
                    "distributionDomainName": "d2426qy9g49s0r.cloudfront.net",
                    "distributionId": "E1FLN0YVFSZXLA",
                    "eventType": "origin-request"
                },
                "request": {
                    "clientIp": "94.130.141.164",
                    "headers": {
                        "user-agent": [
                            {
                                "key": "User-Agent",
                                "value": "Amazon CloudFront"
                            }
                        ],
                        "via": [
                            {
                                "key": "Via",
                                "value": "1.1 redacted.cloudfront.net (CloudFront)"
                            }
                        ],
                        "accept-encoding": [
                            {
                                "key": "Accept-Encoding",
                                "value": "gzip"
                            }
                        ],
                        "x-forwarded-for": [
                            {
                                "key": "X-Forwarded-For",
                                "value": "94.130.141.164"
                            }
                        ],
                        "host": [
                            {
                                "key": "Host",
                                "value": "redacted.s3-eu-west-1.amazonaws.com"
                            }
                        ]
                    },
                    "method": "GET",
                    "origin": {
                        "s3": {
                            "authMethod": "origin-access-identity",
                            "customHeaders": {},
                            "domainName": "redacted.s3-eu-west-1.amazonaws.com",
                            "path": "",
                            "region": "eu-west-1"
                        }
                    },
                    "querystring": "",
                    "uri": "/blog/connecting-to-private-api-over-vpn-or-vpc-peering"
                }
            }
        }
    ]
}

Separating bot traffic from real visitors

From the payload examples above, you can tell that only the viewer-request has actionable data: it shows a user-agent of Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/). This is obviously a crawler we want to send to the crawler-optimized origin. The origin-request only shows a user-agent header of Amazon CloudFront. This makes sense, because it's actually CloudFront making the request to our origin server.

So we need to do some magic in the viewer-request to determine whether a request comes from a crawler or not. We already know that the user-agent header contains the data to make that determination. Let's deploy this function:

'use strict';

const regex = /aolbuild|baidu|bingbot|bingpreview|msnbot|duckduckgo|adsbot-google|googlebot|mediapartners-google|teoma|slurp|yandex|bot|crawl|spider/g;

exports.handler = (event, context, callback) => {
  const request = event.Records[0].cf.request;
  // Only inspect the user-agent when the header is actually present,
  // otherwise reading [0].value would throw.
  if ('user-agent' in request.headers) {
    const user_agent = request.headers['user-agent'][0].value.toLowerCase();
    const found = user_agent.match(regex);

    // Record the verdict in a new header for the origin-request function.
    request.headers['is-crawler'] = [
      {
        key: 'is-crawler',
        value: `${found !== null}`
      }
    ];
  }

  callback(null, request);
};

This function uses a regular expression to check whether the inbound request belongs to a crawler, and stores the result in a new header named is-crawler. Then it returns the request for further processing by CloudFront.
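If you want to sanity-check this logic locally before deploying it, you can invoke the handler with a trimmed-down event. This is just a quick sketch; the module name viewer-request.js is an assumption, not part of the deployed setup:

// Quick local test of the viewer-request handler above.
// './viewer-request' is a hypothetical file name.
const { handler } = require('./viewer-request');

const event = {
  Records: [{
    cf: {
      config: { eventType: 'viewer-request' },
      request: {
        headers: {
          'user-agent': [{ key: 'User-Agent', value: 'Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)' }]
        },
        method: 'GET',
        querystring: '',
        uri: '/blog/ideal-ssh-bastion'
      }
    }
  }]
};

handler(event, {}, (err, request) => {
  // Should print "true", because the user-agent matches the crawler regex.
  console.log(request.headers['is-crawler'][0].value);
});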

Sending crawler traffic to a different origin

The viewer-request is not allowed to decide which origin will be used. The origin-request, however, is allowed to do this. Take a look at the source code for the origin-request function:

'use strict';

exports.handler = (event, context, callback) => {
  const request = event.Records[0].cf.request;

  let is_crawler = undefined;
  if ('is-crawler' in request['headers']) {
    is_crawler = request['headers']['is-crawler'][0].value.toLowerCase();
  }
  if (is_crawler === 'true') {
    request.origin = {
      s3: {
        authMethod: 'origin-access-identity',
        path: '',
        domainName: 'redacted.s3-eu-west-1.amazonaws.com',
        region: 'eu-west-1',
        customHeaders: {}
      }
    };
  }
  callback(null, request);
};

In this function we read the is-crawler header that was set by the viewer-request function. If that header's value equals 'true', we override the original origin with our alternative, crawler-optimized origin.
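The same kind of local check works here. In this sketch, the module name origin-request.js and the bucket name original-bucket are placeholders:

// Quick local test of the origin-request handler above.
// './origin-request' is a hypothetical file name.
const { handler } = require('./origin-request');

const event = {
  Records: [{
    cf: {
      config: { eventType: 'origin-request' },
      request: {
        headers: {
          'is-crawler': [{ key: 'is-crawler', value: 'true' }]
        },
        method: 'GET',
        querystring: '',
        uri: '/blog/ideal-ssh-bastion',
        origin: {
          s3: {
            authMethod: 'origin-access-identity',
            customHeaders: {},
            // Placeholder for the regular (human-visitor) bucket.
            domainName: 'original-bucket.s3-eu-west-1.amazonaws.com',
            path: '',
            region: 'eu-west-1'
          }
        }
      }
    }
  }]
};

handler(event, {}, (err, request) => {
  // Should print the crawler bucket's domain name (the one set in the
  // function above) instead of the original bucket.
  console.log(request.origin.s3.domainName);
});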

Whitelisting the is-crawler header

The solution outlined above is completely functional and will send any crawler traffic to the correct backend. But this will only happen for requests that are not cached. If the data is cached, it's possible that crawler-specific pages are returned to human visitors and vice versa. The solution is to whitelist the is-crawler header.

From the Amazon docs:

You can configure CloudFront to forward headers to the origin, which causes CloudFront to cache multiple versions of an object based on the values in one or more request headers. To configure CloudFront to cache objects based on the values of specific headers, you specify cache behavior settings for your distribution. For more information, see Cache Based on Selected Request Headers.

'Multiple versions of an object based on the values in one or more request headers' is exactly what we need. So go ahead and configure your CloudFront distribution to whitelist is-crawler.

Whitelist CloudFront Distribution
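If you prefer to make this change programmatically instead of through the console, a rough sketch with the AWS SDK for JavaScript (v2) could look like this. The distribution ID is a placeholder, and only the default cache behavior is touched:

// Sketch: whitelist the is-crawler header on a distribution's default
// cache behavior. The distribution ID below is a placeholder.
const AWS = require('aws-sdk');
const cloudfront = new AWS.CloudFront();

async function whitelistIsCrawler(distributionId) {
  // Fetch the current config together with its ETag (needed for the update).
  const { ETag, DistributionConfig } = await cloudfront
    .getDistributionConfig({ Id: distributionId })
    .promise();

  // Forward (and cache on) the is-crawler header.
  DistributionConfig.DefaultCacheBehavior.ForwardedValues.Headers = {
    Quantity: 1,
    Items: ['is-crawler']
  };

  await cloudfront
    .updateDistribution({
      Id: distributionId,
      IfMatch: ETag,
      DistributionConfig
    })
    .promise();
}

whitelistIsCrawler('EXAMPLEDISTID').catch(console.error);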

Confirming everything works

Let's do a request as a normal user (i.e. not a crawler):

→ curl --silent https://cloudbanshee.com/robots.txt -X GET -I -H 'user-agent: normal-viewer'
HTTP/2 200 
content-type: text/plain
content-length: 24
x-amz-id-2: xlWUXLsnVaXUWkeOj6kBuCAZ8aD97noCtZzZnEbkh4Te1FOz3ii9TgGjcSXOfyh1EdZ1tXXduHY=
x-amz-request-id: CE08866EC17BAEF3
date: Sun, 16 Jun 2019 15:33:14 GMT
last-modified: Sun, 16 Jun 2019 08:35:02 GMT
etag: "b6216d61c03e6ce0c9aea6ca7808f7ca"
x-amz-server-side-encryption: AES256
cache-control: max-age=31557600,public
accept-ranges: bytes
server: AmazonS3
x-cache: Miss from cloudfront
via: 1.1 045e5b56f3f7e0d8f206766f7855c6f3.cloudfront.net (CloudFront)
x-amz-cf-pop: AMS1
x-amz-cf-id: 7XridXUAdvfAa7evMtDR2jiMGCqxeuesG3Iv_h9UIF3SLXBmlkUF0Q==

You can tell that the server is S3 and the request was not available in the cache. Let's do the same request again:

→ curl --silent https://cloudbanshee.com/robots.txt -X GET -I -H 'user-agent: normal-viewer'
HTTP/2 200 
content-type: text/plain
content-length: 24
x-amz-id-2: xlWUXLsnVaXUWkeOj6kBuCAZ8aD97noCtZzZnEbkh4Te1FOz3ii9TgGjcSXOfyh1EdZ1tXXduHY=
x-amz-request-id: CE08866EC17BAEF3
date: Sun, 16 Jun 2019 15:33:14 GMT
last-modified: Sun, 16 Jun 2019 08:35:02 GMT
etag: "b6216d61c03e6ce0c9aea6ca7808f7ca"
x-amz-server-side-encryption: AES256
cache-control: max-age=31557600,public
accept-ranges: bytes
server: AmazonS3
age: 16
x-cache: Hit from cloudfront
via: 1.1 045e5b56f3f7e0d8f206766f7855c6f3.cloudfront.net (CloudFront)
x-amz-cf-pop: AMS1
x-amz-cf-id: K0RaR1q5tNlH2TYeqp3MFRP1FnblkE-r3ssuna8EpayZpLzbg8eeTA==

The response still comes from S3, but this time it was served from the CloudFront cache. Now let's try as a bot:

→ curl --silent https://cloudbanshee.com/robots.txt -X GET -I -H 'user-agent: googlebot'
HTTP/2 200 
content-type: text/plain; charset=UTF-8
content-length: 25
server: nginx/1.14.1
date: Sun, 16 Jun 2019 15:38:34 GMT
x-powered-by: Express
accept-ranges: bytes
cache-control: public, max-age=0
last-modified: Sun, 16 Jun 2019 15:38:31 GMT
etag: W/"19-16b60f0a1c7"
x-cache: Miss from cloudfront
via: 1.1 c0b5bcbd47f419797c2848b6172cc349.cloudfront.net (CloudFront)
x-amz-cf-pop: AMS1
x-amz-cf-id: JIYRGVU7m0ZYqY2fCmhiucUiszliLkIWe41_IdP6t54l69TFOn8O5w==

With the googlebot user-agent header, the request is served by Nginx! And when we request it again:

→ curl --silent https://cloudbanshee.com/robots.txt -X GET -I -H 'user-agent: googlebot'
HTTP/2 200 
content-type: text/plain; charset=UTF-8
content-length: 25
server: nginx/1.14.1
x-powered-by: Express
accept-ranges: bytes
last-modified: Sun, 16 Jun 2019 15:38:31 GMT
date: Sun, 16 Jun 2019 15:39:10 GMT
cache-control: public, max-age=0
etag: W/"19-16b60f0a1c7"
x-cache: RefreshHit from cloudfront
via: 1.1 80d6ceec7d3cd9fa88dfa92002c593ab.cloudfront.net (CloudFront)
x-amz-cf-pop: AMS1
x-amz-cf-id: QgWwTgQu4uVFDc0SzQKldZ6nsPqQZKaPqKu97CIdC93u_yOxTxS-CQ==

The response is cached, but it still comes from the Nginx origin. This shows that requesting the same URL and file yields different results, depending on your user-agent. It also shows that caching still functions as it should, and there is no origin contamination between the different types of requesters.

If you have any questions or remarks, feel free to reach out on Twitter.

