Building a Serverless Reader View with Lambda and Chrome


Do you remember the Firefox Reader View? It's a feature that removes all unnecessary components like buttons, menus, images, and so on, from a website, focusing on the readable content of the page. The library powering this feature is called Readability.js, which is open source.

Motivation

For one of my personal projects, I needed an API that returns the readable content from a given URL. Initially, that seemed like a straightforward task: just fetch the HTML and feed it into the library. However, it turned out to be a bit more complicated due to the complexity of modern web pages filled with lots of JavaScript.

To retrieve the real content of such a page, a browser is needed to execute all scripts and render the page. And since we're talking serverless, it has to run on Lambda, of course. Sounds fun?

Stack

I'm usually a Serverless Framework guy, but for this project, I wanted to try something new. So I decided to give the CDK a try and I really liked the experience – more on that at the end. Let's walk through the interesting bits and pieces.

Lambda Layer

The most crucial question was, of course, how to run Chrome on Lambda. Fortunately, much of the groundwork for running Chrome on Lambda had been laid by others. I used the @sparticuz/chromium package to run Chromium in headless mode. However, Chromium is a rather big dependency, so to speed up deployments, I created a Lambda Layer.

```typescript
import { Architecture, Code, LayerVersion, Runtime } from "aws-cdk-lib/aws-lambda";

const chromeLayer = new LayerVersion(this, "chrome-layer", {
  description: "Chromium v111.0.0",
  compatibleRuntimes: [Runtime.NODEJS_18_X],
  compatibleArchitectures: [Architecture.X86_64],
  code: Code.fromAsset("layers/chromium/chromium-v111.0.0-layer.zip"),
});
```

The corresponding .zip file was downloaded as an artifact from one of the package's releases.

Lambda Function

The function runs on Node.js v18 and is compiled via ESBuild from TypeScript. There are a few things to note here. I increased the memory to 1600 MB as recommended, and the timeout to 30 seconds to give Chromium enough space and time to start. I added a reserved concurrency of 1 to prevent this function from scaling out of control due to too many requests.

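```typescript
import * as cdk from "aws-cdk-lib";
import { LambdaIntegration } from "aws-cdk-lib/aws-apigateway";
import { Runtime } from "aws-cdk-lib/aws-lambda";
import { NodejsFunction } from "aws-cdk-lib/aws-lambda-nodejs";

const handler = new NodejsFunction(this, "handler", {
  functionName: "lambda-readability",
  entry: "src/handler.ts",
  handler: "handler",
  runtime: Runtime.NODEJS_18_X,
  timeout: cdk.Duration.seconds(30), // time for Chromium to start
  memorySize: 1600,
  reservedConcurrentExecutions: 1, // at most one concurrent execution
  environment: {
    NODE_OPTIONS: "--enable-source-maps --stack-trace-limit=1000",
  },
  bundling: {
    externalModules: ["@sparticuz/chromium"], // provided by the layer
    nodeModules: ["jsdom"], // installed as a module instead of bundled
  },
  layers: [chromeLayer],
});

const lambdaIntegration = new LambdaIntegration(handler);
```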

When bundling this function, the @sparticuz/chromium package has to be excluded because we provide it as a Lambda Layer. On the other hand, the jsdom package can't be bundled, so it has to be installed as a normal node module.

REST API

The function is invoked by a GET request from a REST API and receives the URL as a query string parameter. The url request parameter is marked as mandatory. Moreover, I made use of the new defaultCorsPreflightOptions to simplify the CORS setup.

```typescript
import { ApiKeySourceType, Cors, RestApi } from "aws-cdk-lib/aws-apigateway";

const api = new RestApi(this, "lambda-readability-api", {
  apiKeySourceType: ApiKeySourceType.HEADER,
  defaultCorsPreflightOptions: {
    allowOrigins: Cors.ALL_ORIGINS,
    allowMethods: Cors.ALL_METHODS,
    allowHeaders: Cors.DEFAULT_HEADERS,
  },
});

api.root.addMethod("GET", lambdaIntegration, {
  // true marks the url query string parameter as required
  requestParameters: { "method.request.querystring.url": true },
  apiKeyRequired: true,
});
```

Furthermore, I created an API key and assigned it to a usage plan to limit the maximum number of calls per day.

```typescript
import { Period } from "aws-cdk-lib/aws-apigateway";

const key = api.addApiKey("lambda-readability-apikey");

const plan = api.addUsagePlan("lambda-readability-plan", {
  quota: {
    limit: 1_000, // maximum number of calls per day
    period: Period.DAY,
  },
  throttle: {
    rateLimit: 10,
    burstLimit: 2,
  },
});

plan.addApiKey(key);
plan.addApiStage({ api, stage: api.deploymentStage });
```
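From a client's perspective, this setup means every request has to carry the API key in a header and the target page in the url query parameter. Here is a minimal sketch of how such a request could be assembled. `buildReaderRequest` is a hypothetical helper that is not part of the project, and the endpoint and key are placeholders. With a header-sourced key, API Gateway reads it from the `x-api-key` header.

```typescript
// Hypothetical client-side helper: builds the GET request for the reader API.
// The endpoint and key passed in below are placeholders, not project values.
type ReaderRequest = {
  url: string;
  init: { method: "GET"; headers: Record<string, string> };
};

function buildReaderRequest(
  apiBase: string,
  apiKey: string,
  articleUrl: string,
): ReaderRequest {
  const endpoint = new URL(apiBase);
  // searchParams.set percent-encodes the article URL for us.
  endpoint.searchParams.set("url", articleUrl);
  return {
    url: endpoint.toString(),
    init: {
      method: "GET",
      // With ApiKeySourceType.HEADER, the key travels in the x-api-key header.
      headers: { "x-api-key": apiKey },
    },
  };
}

const request = buildReaderRequest(
  "https://example.execute-api.eu-west-1.amazonaws.com/prod/",
  "my-api-key",
  "https://example.com/post?id=1",
);
// The request could then be sent with:
//   const response = await fetch(request.url, request.init);
//   const html = await response.text();
```

Keeping the request construction separate from the actual fetch call makes it easy to check the encoding of the url parameter without hitting the network.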

Implementation

Now let's first see the full implementation and then talk about the interesting pieces step by step:

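```typescript
import { Readability } from "@mozilla/readability";
import chromium from "@sparticuz/chromium";
import type { APIGatewayProxyHandlerV2 } from "aws-lambda";
import { JSDOM } from "jsdom";
// Assumption: puppeteer-core, the variant usually paired with @sparticuz/chromium.
import puppeteer, { type Browser, type Page } from "puppeteer-core";

// parseRequest and formatResponse are project helpers that are not shown here.

// Declared at module scope so warm invocations can reuse the instance.
let browser: Browser | undefined;

export const handler: APIGatewayProxyHandlerV2 = async (event) => {
  let page: Page | undefined;

  try {
    const { url } = parseRequest(event);

    if (!browser) {
      browser = await puppeteer.launch({
        args: chromium.args,
        defaultViewport: chromium.defaultViewport,
        executablePath: await chromium.executablePath(),
        headless: chromium.headless,
        ignoreHTTPSErrors: true,
      });
    }

    page = await browser.newPage();
    await page.goto(url);
    const content = await page.content();

    // page.url() is the final URL after any redirects.
    const dom = new JSDOM(content, { url: page.url() });
    const reader = new Readability(dom.window.document);
    const result = reader.parse();

    return formatResponse({ result });
  } catch (cause) {
    const error =
      cause instanceof Error ? cause : new Error("Unknown error", { cause });
    console.error(error);
    return formatResponse({ error });
  } finally {
    // Close only the page; the browser stays open for the next invocation.
    await page?.close();
  }
};
```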

First, we declare the browser outside of the handler function so that the same browser instance can be reused on subsequent invocations. Launching a new instance on a cold start accounts for most of the execution time.
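The effect of the module-scope variable can be illustrated without Chromium at all. The following sketch swaps the real browser for a stand-in: `FakeBrowser`, `launchFakeBrowser`, and `handleInvocation` are made-up names for illustration, and only the caching logic mirrors the handler.

```typescript
// Sketch of the reuse pattern with a stand-in for the browser.
// FakeBrowser and launchFakeBrowser are placeholders for Puppeteer's
// Browser and puppeteer.launch(); only the caching logic matters here.
type FakeBrowser = { id: number };

let launches = 0;

async function launchFakeBrowser(): Promise<FakeBrowser> {
  launches += 1; // counts how often the expensive launch actually runs
  return { id: launches };
}

// Module scope: the value survives between invocations that run in the
// same execution environment.
let cached: FakeBrowser | undefined;

async function handleInvocation(): Promise<number> {
  if (!cached) {
    cached = await launchFakeBrowser(); // only happens on a cold start
  }
  return cached.id;
}

// Simulates a cold invocation followed by a warm one and reports
// both browser ids plus the total number of launches.
async function simulate(): Promise<number[]> {
  const cold = await handleInvocation();
  const warm = await handleInvocation();
  return [cold, warm, launches];
}
```

Both invocations end up with the same instance, and the launch function runs only once. A plain variable is sufficient here because a Lambda execution environment handles one invocation at a time.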

We parse the url query string parameter from the API Gateway event and validate that it is a real URL. Then, we use Puppeteer to launch a browser instance if none is running yet and open a new page. The page is closed at the end of each invocation, while the browser instance stays open until the Lambda execution environment is terminated.
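The `parseRequest` helper itself is not shown in this article. As a rough idea of what the parsing and validation step could look like, here is a hypothetical sketch built on the standard `URL` class. The `QueryEvent` type is a trimmed-down stand-in for the API Gateway event, and the restriction to http and https is an assumption of this sketch, not something the project is known to do.

```typescript
// Hypothetical sketch of a parseRequest helper; the project's real
// implementation is not shown in this article and may differ.
type QueryEvent = {
  queryStringParameters?: Record<string, string | undefined> | null;
};

function parseRequest(event: QueryEvent): { url: string } {
  const raw = event.queryStringParameters?.url;
  if (!raw) {
    throw new Error("Missing query string parameter: url");
  }

  let parsed: URL;
  try {
    parsed = new URL(raw); // throws a TypeError for malformed input
  } catch {
    throw new Error(`Invalid URL: ${raw}`);
  }

  // Assumption: only web pages make sense here, so file:, data:, and
  // similar schemes are rejected.
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }

  return { url: parsed.toString() };
}
```

Throwing on invalid input fits the handler's structure, since its catch block already turns any error into an error response.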

Readability.js requires a DOM object to parse the readable content from a website. That's why we create a DOM object with JSDOM and provide the HTML from the page and its current URL. By the way, the browser may have had to follow HTTP redirects, so the current URL doesn't necessarily have to be the one we provided initially. The parse function of the library returns the following result:

```typescript
type Result = {
  title: string;
  content: string;
  textContent: string;
  length: number;
  excerpt: string;
  byline: string;
  dir: string;
  siteName: string;
  lang: string;
};
```

There's also some meta information available in the result object, but since we're returning raw HTML content we're only interested in the content property. However, we have to add the Content-Type header with text/html; charset=utf-8 to the response object to ensure the browser renders it correctly.
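The `formatResponse` helper is not shown either. The sketch below is a hypothetical version of it that returns the content property as the body and sets the Content-Type header described above. The status codes and fallback messages are assumptions of this sketch. It also covers the case where no article was found, since Readability's parse function can return null.

```typescript
// Hypothetical sketch of a formatResponse helper; the project's real
// implementation is not shown in this article and may differ.
type ReaderResult = { title: string; content: string } | null;

type HtmlResponse = {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
};

function formatResponse(input: {
  result?: ReaderResult;
  error?: Error;
}): HtmlResponse {
  // Tells the browser to render the body as UTF-8 encoded HTML.
  const headers = { "Content-Type": "text/html; charset=utf-8" };

  if (input.error) {
    // Assumption: failures become a generic 500 without internal details.
    return { statusCode: 500, headers, body: "<p>Could not load the article.</p>" };
  }
  if (!input.result) {
    // Assumption: a missing article is reported as 422.
    return { statusCode: 422, headers, body: "<p>No readable content found.</p>" };
  }
  // Only the HTML content is returned; the meta information is dropped.
  return { statusCode: 200, headers, body: input.result.content };
}
```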

Application

Now comes the fun part. I have created a simple web app with React, Tailwind, and Vite to demonstrate this project. Strictly speaking, you could call the REST API directly from a browser as the Lambda function returns real HTML that renders just fine. However, I thought it would be nicer to use it as a real application.

The following articles are curated examples showcasing the Readability version on the left and the Original article on the right. Of course, you can also try your own article and start here: zirkelc.github.io/lambda-readability

So without further ado, let's read some articles:

Maker's Schedule, Manager's Schedule by Paul Graham.

Readability vs Original


Understanding AWS Lambda’s invoke throttling limits by Archana Srikanta on the AWS Compute Blog.

Readability vs Original


Advice for Junior Developers by Jeroen De Dauw on DEV.to

Readability vs Original


Cloud Development Kit

I've got to say, my initial dive into AWS CDK has been quite a pleasant surprise. What impresses me most is the ability to code up my infrastructure using good old JavaScript or TypeScript, the very languages I already use to develop my application. No more fumbling with meta languages or constantly referring to documentation just to figure out how to do this or that – CDK simplifies everything.

The beauty of it all is that I can utilize the fundamental building blocks: if-conditions and for-loops, objects and arrays, classes and functions. I can put my coding skills to work in the same way I always do, without the need for any special plugins or hooks. That’s what Infrastructure as Code should really feel like – a truly great developer experience.

Conclusion

It's pretty amazing how far the Serverless world has come, enabling us to effortlessly run a Chrome browser inside a Lambda function. If you are interested in the mechanics of this project, you can view the full source code on GitHub. I'd really appreciate your feedback, and if you like it, give it a star on GitHub!